As is the norm in each yearly RecSys conference, there were several strong papers. But I found that while many of the papers were impressive from an engineering and practical use standpoint, they did not fare as well from a scientific research standpoint. More specifically, I’m singling out articles that simply take an area of research, add in some additional data sources, and then integrate this additional data into existing formulas to elicit some improved accuracy.
We’ve come quite far since the days of RMSE
When the Netflix Prize competition was active, from 2006 to 2009, there was just one massive dataset (of 100 million ratings) and one target: the root-mean-square error (RMSE). During that time, research was focused and papers were easy to compare with each other. We’ve come a long way since then. It has since been shown that an algorithm’s effectiveness varies depending on which dataset it is used against, and it turns out that RMSE isn’t a good choice when your purpose is to generate relevant recommendations.
At Gravity, we found back in 2009 that RMSE is not an effective target, when we were building a public demo based on the Netflix Prize dataset. Further explanation of how we reached this conclusion can be found in the first section of our RecSys presentation: Neighbor methods vs matrix factorization – case studies of real-life recommendations.
These days, there are plenty of datasets, and many different evaluation metrics are also available. Adding to this complexity, researchers often bring in an additional data source to create even more complex algorithms. Over time, research topics have become more diverse, and research papers are no longer comparable.
For Gravity’s customers, item-to-item recommendations (people who viewed this item also viewed) are in higher demand than personalized recommendations. However, it’s really hard to find papers on the topic of item-to-item recommendation.
That said, one paper from this year’s conference did catch my interest, and I explain it below.
Top-N Recommendation for Shared Accounts
The paper that stood out to me this year, and my favorite, is by Koen Verstrepen and Bart Goethals: Top-N Recommendation for Shared Accounts. As I understood it, their approach is the following:
- Consider a user who viewed N items.
- Use a typical item-neighbor method to assign a score to each recommendable item based on what the user has viewed previously.
- Create 2^N - 1 temporary users, each with a different non-empty subset of the original user’s viewing history.
- Generate prediction scores for each of those temporary users, using the item-neighbor method.
- For each temporary user, divide the scores by the temporary user’s history length, or a power of that number (e.g. take the square root of temporary user’s history length).
- When calculating the prediction for item i for the original user, take the maximum score for item i over all temporary users. That will be the final score for item i for the original user.
- Order items by the computed prediction scores.
They show that this can be done in O(N log N) time instead of O(2^N). This approach (taking the maximum score over all temporary users) has another nice property: it can provide explanations, i.e. the reason why item i was recommended to the original user. Consider, for example, that for item i the maximum score was generated by a temporary user who viewed items i1, i2 and i3. Then for item i, the recommender algorithm can say that it was recommended because the original user viewed items i1, i2 and i3.
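To make the steps above concrete, here is a minimal Python sketch. It is my own illustration, not the authors’ code. It assumes that the item-neighbor score of a temporary user is simply the sum of item-to-item similarities, and that `sim`, `brute_force_scores`, `fast_scores` and the toy data are hypothetical names and values. Under that assumption, the best subset of size k is always the k history items most similar to the candidate, so sorting once and scanning the prefixes gives the same maximum as enumerating every subset. This is one way to see why an O(N log N) computation per candidate is plausible; the paper’s exact construction may differ.

```python
from itertools import combinations


def brute_force_scores(history, sim, candidates, p=0.5):
    """Enumerate all 2^N - 1 non-empty subsets (temporary users) and keep,
    per candidate, the best length-normalized score and its subset."""
    best = {}
    for k in range(1, len(history) + 1):
        for subset in combinations(history, k):
            for c in candidates:
                # Assumed neighbor score: sum of similarities, divided by |subset|^p.
                score = sum(sim.get((h, c), 0.0) for h in subset) / k ** p
                if c not in best or score > best[c][0]:
                    best[c] = (score, subset)
    return best


def fast_scores(history, sim, candidates, p=0.5):
    """Same maximum, found by sorting the history by similarity to the
    candidate and checking only the N prefixes of that ordering."""
    best = {}
    for c in candidates:
        ranked = sorted(history, key=lambda h: sim.get((h, c), 0.0), reverse=True)
        total = 0.0
        for k, h in enumerate(ranked, start=1):
            total += sim.get((h, c), 0.0)
            score = total / k ** p
            if c not in best or score > best[c][0]:
                best[c] = (score, tuple(ranked[:k]))
    return best


# Toy data (made up): the user viewed a, b, c; x and y are recommendable items.
history = ["a", "b", "c"]
candidates = ["x", "y"]
sim = {("a", "x"): 0.9, ("b", "x"): 0.8, ("c", "x"): 0.0,
       ("a", "y"): 0.1, ("b", "y"): 0.0, ("c", "y"): 0.7}

slow = brute_force_scores(history, sim, candidates)
fast = fast_scores(history, sim, candidates)

# Final ranking, with the explaining subset of viewed items for each candidate.
ranking = sorted(fast, key=lambda c: fast[c][0], reverse=True)
for c in ranking:
    print(c, round(fast[c][0], 3), "because the user viewed", fast[c][1])
```

In this toy example, item x is explained by the temporary user who viewed a and b, while item y is explained by the one who viewed only c, which is the kind of split you would hope for when two people share an account.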
This paper was really interesting because it focuses on algorithmic methods and features a simple yet fast solution. The authors show how this method helps when multiple users share the same account (e.g. a household watching TV), without knowing the number of people in the household or which person viewed which item. They also propose an elegant way to generate diverse recommendations:
- First, take the highest scored item. It will also have some explanatory items (see above)
- Second, take the highest scored item from the rest, but consider only those items that have at least one explanatory item that is not amongst the highest scored item’s explanatory items
- Third, take the highest scored item from the rest, but consider only those items that have at least one explanatory item that is not amongst the above items’ explanatory items
- And so on
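The selection procedure above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper’s implementation: `diversify` and the `scored` dictionary (item mapped to its score and explanatory items) are hypothetical, and stopping when no remaining item brings a new explanatory item is my own assumption, since the list above does not say what happens in that case.

```python
def diversify(scored, n):
    """Greedily pick up to n items by score, requiring each new pick to have
    at least one explanatory item not used by the earlier picks."""
    remaining = sorted(scored, key=lambda i: scored[i][0], reverse=True)
    picked, covered = [], set()
    while remaining and len(picked) < n:
        # Highest-scored remaining item that adds a new explanatory item.
        nxt = next((i for i in remaining
                    if not picked or set(scored[i][1]) - covered), None)
        if nxt is None:
            break  # assumption: stop when nothing new can be explained
        picked.append(nxt)
        covered |= set(scored[nxt][1])
        remaining.remove(nxt)
    return picked


# Toy input (made up): item -> (score, explanatory items from the history).
scored = {
    "x": (1.2, ("a", "b")),
    "z": (1.1, ("a",)),
    "y": (0.7, ("c",)),
    "w": (0.5, ("b", "c")),
}
picks = diversify(scored, 3)
print(picks)
```

Here z is skipped even though it has the second-highest score, because its only explanatory item is already covered by x; y is picked instead because it is explained by a different part of the viewing history.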
They also show that their method’s accuracy is comparable to that of the original neighbor method it builds on, and that it is also capable of giving good recommendations when multiple people are sharing the same account. In my opinion, this method is a nice way to give users recommendations that are diverse, accurate and easily explainable, all at the same time. I really enjoyed this paper, as it brings new enhancements to a family of methods (item-based neighbor methods) that has been studied for so many years.
This year’s RecSys was a well-organized conference, and there were some really strong papers, as usual. But overall, I felt that it lacked the spirit of the old years, when every conference would bring the announcement of several new breakthroughs in research. There used to be plenty of algorithmic papers every year, and everybody was always curious how research would develop in the future. Now that we already have all those breakthroughs, this area is maturing, and it has become difficult to make big discoveries. The many engineering-focused papers this year also indicate the less research-oriented direction the conference is now taking.
In the future, I’d like to see more emphasis and research placed on the following topics:
- Correlating offline and online measures (e.g. Recall vs. CTR). There was a paper this year on this topic, and hopefully there will be many more in the upcoming years.
- Correlating short-term and long-term online measures (e.g. predicting long-term site-income increase from short-term CTR increase). Simple example: if you make a customer buy twice as much water as usual, then this customer may skip buying water next time.
- Item-to-item recommendations: this is a frequent topic in need of more research.
- Matrix factorization methods that deal with really large and sparse matrices (e.g. 50M items x 100M users, 3 events per user). The problem here is that you have to increase the number of latent factors, otherwise totally unrelated items might become similar.
- Content-based filtering methods that are able to find the most relevant items in real-time, even when there are 200M items. Currently, there are approximate solutions (e.g. Locality-Sensitive Hashing), which provide a trade-off between accuracy and running time, but if you need good accuracy, you would be better off running the naive approach.
- AutoML is an interesting new direction: instead of manually choosing the best algorithms and manually tuning the hyperparameters, the aim is to have this process done automatically. Perhaps the RecSys community should take a step in this direction, e.g. a RecSys challenge with 20 different RecSys problems running simultaneously would be something new and challenging.
Here’s to wishing for more breakthroughs and excitement in the future of the RecSys community!
István Pilászy is the Head of Core Development as well as one of the founders at Gravity R&D.