by Balázs Hidasi
Overall, I enjoyed this year’s RecSys conference. It was well organized, and it was nice to see what other researchers and people from the industry were up to. However, I was somewhat disappointed to see that the pace and quality of research in this field have dropped significantly. In my opinion there were only a handful of long papers where the core idea was novel and exciting. (See my top picks below.) I don’t know the exact reason behind this shift and I doubt anyone does. However, I find it interesting to speculate on this. Looking at the big picture, this slowdown might not be that surprising. Here are my thoughts on this year’s conference.
Evaluation and goals of a recommender system in research and in practice
In the last five or so years, recommender systems research has moved closer and closer to practical systems. Fortunately, the days of rating prediction are pretty much over and the majority of work focuses on the realistic scenario of top-N recommendations. You can also see other signs of this shift, e.g. more papers working with implicit feedback and/or using online evaluations. This is generally a good thing, because it makes the transition of novel methods from research to industry faster. However, there is a huge problem with recommender systems in practice: evaluation. In his keynote speech, Igor Perisic talked about “making delight” being the goal of these systems and products. I fully agree with the notion that the final goal of a recommender system is to make its users’ lives easier, help them with their problems (related to finding what they need), and generally make using the system a good experience for them. But from a research point of view you can’t evaluate methods with respect to “delight”. You can try to approximate it through several steps by using different online metrics. However, metrics that are good for A/B testing, such as CTR, do not approximate the final goal well. And offline tests are approximations of the online performance, so they add yet another approximation step. Still, their value can’t be dismissed, as they are useful for prefiltering methods. And from the research point of view offline tests are exact and repeatable: it is clear which algorithm performs better by a concrete metric, and running the same test three months later will give exactly the same results. Long story short, as recommender systems research transitions towards industry, researchers find that they can’t evaluate their methods in a way that is very meaningful in practice. Therefore the majority of practitioners take the reported performance of novel algorithms with a pinch (or lots) of salt and still use very basic methods.
This is disheartening, and slows down the progress of research.
Exhausted research topics
Currently popular topics are generally well researched. The same topics have been popular for the last decade. For example, factorization methods, context-awareness, cold-start avoidance, hybrid algorithms, etc. have been around for a while now. Even though the appearance of implicit feedback in research has spiced things up, that itself has become somewhat exhausted. This has naturally caused a slowdown, because additional research can only add a small epsilon to already existing solutions. I think that the community is waiting for the next big thing, something that is fundamentally different and shakes things up. This new area, however, must deal with a problem that is important in practice and also be algorithmically challenging and interesting to researchers. I think that some researchers already have a candidate that could qualify for this. You could hear whispers among the crowd here and there, and several researchers I talked with mentioned a certain topic that they will start working on shortly. 🙂 If we think about it optimistically, perhaps this year was just the calm before the storm and the next few years might be the most exciting period of recommender systems research yet.
The lack of an industry track
There may also have been a conference-specific reason for the low output of exciting research papers at this year’s RecSys conference. RecSys is traditionally a conference for academia and industry; for research and application. However, papers can only be submitted to a research track. Purely application-related presentations are generally in the (invitation-based) industry sessions. To my surprise, there were several papers in this year’s research track that I would describe as high-quality engineering work. This type of work combines ideas from previous years’ research as components of a system to provide recommendations in a specific scenario of a specific domain. The technical quality of these papers is generally high, but the novelty for research is negligible. I think these papers have a place in a conference like RecSys, but I don’t agree with including them in the research track. Did these papers take places from actual research papers? I’m not sure. Maybe they did, or maybe it is the other way around: maybe there weren’t enough high-quality research papers, so the remaining slots were filled with high-quality engineering work. Whatever the case may be, I think that the conference would benefit from having a separate industry track for papers of the engineering kind.
I don’t think that there is a single reason behind the slowdown of research. I think all three of the aforementioned theories are correct to some extent. They – as well as other factors I haven’t considered – cumulatively caused this phenomenon.
Best paper picks of RecSys 2015
Despite my complaints, there were several papers at RecSys 2015 that I enjoyed. The following list contains my top picks from the main conference (long and short papers) with some justification for why I liked them. There are no workshop papers on the list, because I haven’t fully processed those. I restricted myself to selecting only three papers, the ones that I found the most exciting. There were other papers and ideas I liked, but these were the most interesting ones for me. The items of the list are presented in no particular order.
Gaussian Ranking by Matrix Factorization (by Harald Steck)
The paper proposes a framework for directly optimizing for ranking metrics, such as AUC and NDCG. While methods optimizing for certain ranking metrics were proposed before, this framework is much more general and promises to be able to handle every metric as long as it is differentiable with respect to the ranks of the items. This includes most of the popular metrics, such as NDCG, AUC, MRR, etc. (but unfortunately not recall@N). The key to the framework is the link between the scores and the ranks, which makes the ranking loss differentiable with respect to the model parameters. The whole idea is very elegant and has additional potential beyond the scope of the paper. (Also, bonus points for using the NSVD1 model, even if it is referred to as AMF. NSVD1 is a classic method that is unjustly forgotten, yet I review at least one paper a year that tries to reinvent it.)
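To make the score-to-rank link a bit more concrete, here is a minimal sketch. It assumes that the scores of the items are roughly normally distributed, so that an item's rank can be approximated by a smooth function of its score through the normal CDF. That assumption, the function names, and the toy scores are mine for illustration; this is not the paper's exact formulation.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smooth_ranks(scores):
    """Approximate each item's rank (1 = best) as a smooth function of its score.

    Assumption: scores are roughly Gaussian, so the fraction of items scoring
    higher than s is about 1 - CDF((s - mean) / std).
    """
    n = len(scores)
    mean = sum(scores) / n
    # Fall back to 1.0 if all scores are identical (std would be 0).
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n) or 1.0
    return [1.0 + (n - 1) * (1.0 - normal_cdf((s - mean) / std)) for s in scores]


def smooth_dcg(scores, relevant):
    """DCG-style gain over the relevant item indices, computed from smooth ranks.

    Because the ranks are smooth in the scores, so is this value, which is what
    a gradient-based optimizer needs.
    """
    ranks = smooth_ranks(scores)
    return sum(1.0 / math.log2(1.0 + ranks[i]) for i in relevant)


# Toy example: four items, item 0 is the relevant one.
toy_scores = [2.0, 0.5, -1.0, 0.0]
print(smooth_ranks(toy_scores))
print(smooth_dcg(toy_scores, relevant=[0]))
```

With true ranks, a small change in a score either changes nothing or makes the rank jump; with the smooth approximation, raising the score of a relevant item always nudges the metric upwards, so the metric can serve as a training objective.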
Dynamic Poisson Factorization (by Laurent Charlin et al.)
I would probably have missed the original paper on Poisson factorization had it not been for this presentation at RecSys 2015. That would have been a shame, because the base algorithm is very interesting. It seems to be a better fit for implicit feedback than Gaussian factorization methods (or their frequentist counterparts that optimize for the sum of squared errors). It also has a few additional nice properties. The algorithm presented at RecSys builds on this novel factorization and adds user and item feature vectors that evolve over time. This is an answer to a practical problem: the taste of users and the audience of items change, and we should model this somehow. The only thing I miss from the paper is a comparison with a factorization method that uses some kind of event decay function. (Bonus points for the clear and comprehensible style of the paper. Given the complexity of the algorithm, conveying it this simply was by no means easy.)
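For readers who want to see the difference between the two observation models, the sketch below compares the per-observation loss of a Poisson likelihood with squared error, using the dot product of user and item factors as the predicted rate. The function names and toy vectors are hypothetical, and this covers only the observation model, not the paper's inference procedure or its dynamic extension.

```python
import math


def predicted_rate(user_factors, item_factors):
    """Dot product of non-negative factor vectors, used as the Poisson rate."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def poisson_nll(count, rate):
    """Negative log-likelihood of an observed event count under Poisson(rate)."""
    return rate - count * math.log(rate) + math.lgamma(count + 1)


def squared_error(count, rate):
    """Squared-error loss, the frequentist counterpart of a Gaussian likelihood."""
    return (count - rate) ** 2


# Toy example: one user, one item, two latent factors (made-up values).
rate = predicted_rate([0.5, 1.0], [0.2, 0.4])
for observed in (0, 1, 3):
    print(observed, poisson_nll(observed, rate), squared_error(observed, rate))
```

One thing that follows directly from the formula: for an unobserved pair (count 0) the Poisson term reduces to the rate itself, so the total contribution of all the zeros is just a sum of rates. Counts are also modeled as non-negative integers, which matches event data such as clicks or plays more naturally than a real-valued Gaussian target.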
Predicting Online Performance of News Recommender Systems Through Richer Evaluation Metrics (by Andrii Maksai et al.)
How offline metrics relate to online KPIs has long been an important and unanswered question. While several papers suggested that optimizing for (ranking-based) accuracy might not be the best course of action, there was no clear alternative. What trade-off is good between accuracy and diversity? Will a 5% increase in recall@N translate to a noticeable increase in CTR? These were questions to which nobody had an exact answer, although if you’ve worked enough with recommenders you know a few rules of thumb. This paper might solve this issue, as it proposes building a predictive model from 17 offline metrics to estimate CTR. Using this model the authors present interesting findings, e.g. in news recommendation it seems that diversity and serendipity are a little more important than accuracy. I think that the validity of the proposed approach depends entirely on how accurate the CTR prediction is over time. At first glance the results are convincing, but it is alarming that the constant CTR prediction also has a low error. Nonetheless, I think this is an interesting direction that is worth exploring in additional domains. (Bonus points for pointing out that different metrics of the same type (e.g. accuracy metrics) correlate highly with each other. Maybe reviewers in the future won’t mind if you don’t use their favorite accuracy metric.)
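The concern about the constant predictor can be illustrated with a small sketch. It fits a one-variable least-squares model from a single offline metric to CTR and compares its mean absolute error with that of always predicting the mean CTR. All numbers are made-up toy values, not results from the paper, and the paper's model uses 17 metrics rather than one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data: one offline metric and the observed CTR per period.
offline_metric = [0.10, 0.12, 0.11, 0.13, 0.12]
ctr = [0.020, 0.022, 0.021, 0.023, 0.021]

slope, intercept = fit_line(offline_metric, ctr)
model_pred = [slope * x + intercept for x in offline_metric]
mean_ctr = sum(ctr) / len(ctr)
constant_pred = [mean_ctr] * len(ctr)

model_mae = mean_absolute_error(model_pred, ctr)
constant_mae = mean_absolute_error(constant_pred, ctr)
print("model MAE:", model_mae)
print("constant MAE:", constant_mae)
print("constant MAE relative to mean CTR:", constant_mae / mean_ctr)
```

When CTR barely varies over time, even the constant predictor is only a few percent off, so a low absolute error says little by itself. The more telling number is how much of the constant predictor's error the model removes.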
This year’s RecSys conference left me with feelings of ambivalence. Although there were a handful of papers containing substantial research and contributions to the field of RecSys, the overall program felt lacking. The conference was well organized, but not as strong as it has been in the last few years. In the future, I hope to see the rise of one or more completely novel topics to shake up the field and make things within the RecSys research community exciting again. Lastly, the presentation of research and engineering papers side by side this year felt unnatural, and I believe that dividing these topics into separate tracks will benefit the conference in a big way.
Balázs Hidasi is the leader of the data science team and is responsible for research and data mining activities at Gravity. He coordinates the team and also conducts his own research in the field of machine learning and data mining. His research revolves around (1) developing advanced recommender algorithms to make Gravity’s recommender engine even better; and (2) exploring new fields and application areas for recommender systems. Balázs also coordinates and consults for data mining projects (e.g. data analysis, POCs) within the company. He also has a blog.