by Balázs Hidasi
The 10th conference in the ACM RecSys conference series was held in Boston between September 15 and 19. Upon returning from the conference (and a few days of vacation dedicated to discovering Boston), I decided to post about my experiences.
This is the first in a series of three blog posts on RecSys 2016. In this post, I write about the conference from a research perspective and discuss the popular research directions of the conference and the field in general. The next post will discuss Gravity’s contribution, which includes the organization of the Deep Learning for Recommender Systems workshop, presenting a long paper, and more. The final post will conclude this series with my best paper picks from RecSys 2016. Check back next week for the second blog post!
Last year I was kind of disappointed with the technical quality of RecSys and hoped that it was because everyone was working on new, exciting research directions and wanted to roll the old stuff out before moving on. It felt like the calm before the storm: you could feel that the community had already been working on novel research projects, but only a few were ready for the public. The only question was whether these exciting topics would be discussed this year.
Fortunately, the answer is definitely yes. There were plenty of interesting and well-executed research papers. Submission figures also suggest that last year the community experienced a temporary setback in the intensity of (publishable) research. But this year, the number of submissions set a new record for the conference. Since the number of accepted papers is roughly the same every year – as this is more determined by the length of the conference and the number of parallel sessions than anything else – this means that acceptance was more competitive, which resulted in the overall better technical quality of the program.
One of the things that made this year’s ACM RecSys conference particularly interesting is that deep learning has finally reached the RecSys community. After having no accepted papers using deep learning last year, this year the conference had 15+ submissions in this topic, with several papers accepted and an entire workshop dedicated to deep learning for recommender systems. The latter was incidentally co-organized by us. Places for the workshop filled up really quickly.
The organization of the conference was good. I liked that presentation slots were given to short papers as well this year. Last year they only had poster presentations, resulting in them blending in and not receiving the attention they deserved. Fewer posters and a large enough venue also made the poster reception less crowded. (However, the hall for coffee breaks was small, so it was hard to move around. You win some, you lose some.) I’ve heard a few complaints that it was hard to pay attention to short paper presentations because they had to fit into 7 minutes and thus some of them were too fast. I didn’t have this problem in general, but I agree that due to context-switching it was tiring when 7+ presentations were in one session. Fortunately, this was the exception rather than the rule, as most sessions had fewer presentations.
Since this was the 10th conference in this series, there were celebratory events and a special “Past, Present, Future” (PPF) track. PPF papers look at the big picture and contemplate where the field is heading or should be heading. I listened to some of these. Some were interesting, some were rather meaningless. I guess we will see who was right in predicting the future at RecSys 2026.
The conference this year had two fully parallel tracks. Even though this meant that you miss at least half of the presentations, I think this is a good direction. It is still better to miss a few interesting presentations than not having those presentations in the first place. In my opinion, the viability of the parallel tracks depends on finding the tracks whose audience hardly overlaps. For me, this year’s parallelization was almost perfect. I only missed a few presentations I wanted to listen to and there was only one block where I wasn’t interested in either of the two tracks. The whole conference was recorded, so I should be able to watch the video of the missed presentations soon.
All three keynotes this year were interesting and well-presented. Even though two of the three weren’t directly about recommender systems, I think that this is more of a feature than a bug. This community already knows about recommenders. Therefore, the different perspectives of researchers from related fields and thought-provoking observations are much more useful than hearing about how someone used matrix factorization in their recommender system for the umpteenth time.
Deep learning on the rise
I already mentioned that there was a huge interest in deep learning this year and it seems that this topic will spread in the algorithmic side of the community pretty fast. In fact, deep learning and neural networks combined formed the second largest keyword group in accepted papers, just after collaborative filtering.
However, it was surprising to see that even though deep learning was not rejected from the conference, there was some kind of (hidden) hostility towards it from some parts of the community. I assume that this is due to the overhyping of deep learning in both research and the mainstream media. I agree that due to the incredible results achieved in the last 5 years, lots of people started to overhype it. Unfortunately, this happens to everything that is slightly promising in this day and age. According to the hype, deep learning is the solution to every problem and some even think that it is strong AI. It is natural that this begets the disapproval of knowledgeable people, but I think that it shouldn’t be directed at deep learning itself. Also, this doesn’t invalidate the breakthrough results achieved with deep learning. Deep learning is a useful tool for tackling a wide variety of problems including some that were very hard to deal with before, such as image recognition, NLP, audio- and text-mining. Whether and how deep learning will be useful in recommender systems is still an open question. But we would be foolish not to try it in this domain. Even if recommenders gain nothing from deep learning in the end – which I highly doubt – the community should at least explore this area in depth before making judgments on its usefulness.
A few of the deep learning papers seemed to apply deep learning only because of its trendiness. It is very weird to see papers using cutting-edge machine learning to solve a standard task that is already solved well by hundreds of other algorithms, and evaluating on 10+ year old datasets w.r.t. rating prediction, using RMSE. I think it is a serious problem that papers with this kind of evaluation are still accepted. Fortunately, this was specific to only a few deep learning papers; the majority were well done and had good original ideas behind them.
Looking at the deep learning papers this year, one can see that this community is at the very beginning of exploring deep learning. The most popular topics were word2vec and autoencoders: these networks are fairly old and very well-known in the deep learning community. Even so, I’m not complaining: the community has to start somewhere and these networks are good starting points, and it is interesting to see how they can be applied in this domain. Outside of our paper, only a few others used more advanced networks or came up with their own architecture.
Life outside of DL
Most of the talks this year, of course, had nothing to do with deep learning. The topics covered a wide range of research areas, but three directions were more prominent than the others.
Context was one of these. The contextual challenges session had the most papers in it. Context gained new momentum with the ongoing shift towards context-driven recommenders. Context has started to ascend from auxiliary information besides the user-item interactions to information on the situation, which determines what to recommend. The “context-driven” term is not yet widely used, but several papers showed traits of this idea, even if it wasn’t fully embraced most of the time. I’ll write more on this topic in the next post.
The other widespread topic seemed to be centered on the interaction with the human users and considering the user in the system. I can’t really say much about this because I attended the other parallel sessions, but it was apparent that lots of people consider this to be an important topic. In principle, I agree that tailoring recommenders to better fit the users, e.g. by providing explanations to gain trust, is worthwhile to discuss. The problem is that these concepts are very hard to properly evaluate. User studies are basically useless because they are usually small and reflect either the opinion of a vocal minority or that of the students of the professor who conducts the research. A/B tests on these topics can easily be misleading. For example, the change in the KPI might not be because you added personalized explanations, but because you generally show more information and the user can easily determine that the product is not relevant to her. It’s not a big surprise that you can easily get contradicting results when dealing with this topic.
The third popular topic is the hardest to grasp and describe but is very interesting for practitioners. Papers of this kind discussed algorithms that focus on something else besides pure accuracy. Diversity, coverage, response times, technological and resource constraints have been represented amongst papers of previous years as well, but I felt that this year this topic was more significant than before. Whether this is due to it being increasingly hard to improve on accuracy or to the recent shift towards practice, it is good to see that some of these papers turned out to be pretty good.
Since this blog post is already very long, I won’t list the other, less visible, but nonetheless interesting topics one by one. Instead, I will just highlight my favorite presentation from the Large Scale Recommender Systems (LSRS) Workshop. In his talk, Yves Raimond from Netflix discussed whether distributed training of recommender algorithms is worth the effort or not. The conclusion was that unless your data is really huge, it is just a waste of resources and it can even be slower to run it distributed than on a single machine. It is worth optimizing your code and seeing how far you can go before mindlessly switching to a large cluster for distributed computations.
This observation is very much in line with my own opinion. I’m generally annoyed with the big data hype – which is even worse than the deep learning hype because it has been around for longer. While I don’t deny that big data technology is useful for certain tasks, it annoys me when data “scientists” use huge Spark/Hadoop/etc. clusters and waste resources training on a few (tens of) gigs of data. Meanwhile, if your code is well optimized, you can run it with this amount of data using a mid to high-end laptop or desktop PC, not to mention servers. For example, if you can’t run an item-kNN on 1 billion events with hundreds of thousands of items in reasonable time on a single machine, you are doing something wrong. Yves demonstrated this point really well and without condemning big data or distributed algorithms. This sober take on this otherwise controversial topic was very appealing to me.
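To make the item-kNN remark a bit more concrete, here is a minimal single-machine sketch of the idea, assuming implicit feedback events given as (user, item) pairs and plain cosine similarity on co-occurrence counts. The function name `item_knn` and the toy events are made up for illustration; this is not the code Yves or we use, and a serious implementation would rely on sparse matrices and pruning rather than Python dictionaries.

```python
from collections import defaultdict
from math import sqrt


def item_knn(events, k=2):
    """Return {item: [(neighbor, cosine_similarity), ...]} with at most k neighbors.

    `events` is an iterable of (user, item) pairs (implicit feedback).
    Hypothetical helper for illustration only.
    """
    # Collect the set of items each user interacted with.
    user_items = defaultdict(set)
    for user, item in events:
        user_items[user].add(item)

    # Count item supports and pairwise co-occurrences.
    item_count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for i in items:
            item_count[i] += 1
            for j in items:
                if i != j:
                    co_count[i][j] += 1

    # Cosine similarity on binary vectors: co(i, j) / sqrt(supp(i) * supp(j)).
    neighbors = {}
    for i, row in co_count.items():
        scored = [(j, c / sqrt(item_count[i] * item_count[j])) for j, c in row.items()]
        # Highest similarity first; ties broken by item id for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        neighbors[i] = scored[:k]
    return neighbors


# Toy data, invented for this example.
events = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
    ("u4", "a"),
]
neighbors = item_knn(events, k=2)
print(neighbors["a"])
```

The whole computation is a pass over the events followed by a pass over the co-occurrence counts, which is why it tends to fit a single machine well once the data structures are chosen with care.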
Ghosts of the past (10 years)
Before concluding this post, I’d like to highlight one of the long-standing problems of RecSys and recommender systems research in general. Some parts of this community are stuck in the past. In 2016, we still had papers that worked on explicit feedback data, did the rating prediction task and thus evaluated w.r.t. RMSE or MAE. This is the classic task that was popularized by the Netflix Prize 10 years ago.
The goal of a recommender system is to rank the items for the user (or in a situation, or to an item, etc.) and show the most relevant ones. This task is usually referred to as the top-N recommendation task. Several research papers showed that good rating prediction doesn’t necessarily mean good top-N recommendation and vice versa. In fact, the order of algorithms on these two tasks can be quite the opposite. Rating prediction is pretty much useless in 99% of the cases because a good recommender has to solve the top-N task. Of course, solving the top-N task is just a part of the whole recommender system; there are also other things to consider. It is also true that the results of offline evaluation, in general, should be taken with a grain of salt; but as I wrote in my post on RecSys 2015: to do research, you need some kind of well-defined evaluation, even if it is just an approximation of the final goal. The thing is: rating prediction is not an approximation of the final goal and is therefore now obsolete. Any paper that focuses on this task in 2016 shows that its authors have no clue about real recommender systems. That’s why this practice is constantly called out by researchers in the industry.
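A tiny constructed example can show how the two tasks may disagree. The ratings and the two sets of predictions below are made up for illustration and do not come from any paper; `rmse` and `recall_at_n` are hypothetical helper names. Here `model_a` stays close to the true ratings on average but orders the items wrongly, while `model_b` is far off in absolute terms but ranks the relevant items first.

```python
from math import sqrt


def rmse(pred, truth):
    """Root mean squared error over the items in `truth`."""
    return sqrt(sum((pred[i] - truth[i]) ** 2 for i in truth) / len(truth))


def recall_at_n(pred, relevant, n):
    """Fraction of relevant items that appear among the n highest-scored items."""
    top = sorted(pred, key=lambda i: (-pred[i], i))[:n]
    return len(set(top) & relevant) / len(relevant)


# Invented ground truth for one user; items rated 4 or above count as relevant.
truth = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
relevant = {"a", "b"}

# model_a: small errors, but the order of the items is reversed.
model_a = {"a": 3.4, "b": 3.3, "c": 3.5, "d": 3.6}
# model_b: large errors, but the order of the items is correct.
model_b = {"a": 2.0, "b": 1.5, "c": 1.0, "d": 0.5}

for name, model in (("model_a", model_a), ("model_b", model_b)):
    print(name, round(rmse(model, truth), 3), recall_at_n(model, relevant, 2))
```

In this toy setup `model_a` has the lower RMSE yet a recall@2 of 0, while `model_b` has the higher RMSE and a recall@2 of 1, so the ranking of the two models flips depending on which task you evaluate.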
Note that this doesn’t mean that explicit feedback is necessarily bad. You can do top-N recommendations based on explicit feedback as well. It will be less interesting for practitioners, because explicit feedback is usually hard to gather in the wild, and even if you have it in large quantities, you will only have it for a small portion of your user base. There are now several public implicit feedback datasets, so everyone may choose to switch to those. But doing top-N recommendations on explicit data is fine.
The worst thing about this is not that rating prediction papers are written, per se. The authors might be outside of the community or just getting into the field. They might think – based on the vast literature on rating prediction – that this is the problem they should try to solve. The problem is that these papers receive good reviews and get accepted to conferences like RecSys or published in journals. This depends on the reviewers, who should know better. I hope that we won’t see any rating prediction papers next year. I’ll do my part by calling out this practice because I think the whole community would benefit from banishing rating prediction papers.
RecSys 2016 was a good conference, both in terms of organization and general technical quality of the papers. While there were a few lower points in both of these, I can say with conviction that we should hope never to have a worse RecSys than this one. As for my obscure predictions for RecSys 2017: I think that deep learning will continue to grow within the community and most of the papers on this topic will be interesting and worth reading. However, I have no doubt that we will see some papers that only use deep learning for the sake of using deep learning. I don’t think that the deep learning hype will peak next year in this community; rather, it will grow in the next few years until we come to a conclusion on how to best use this tool in recommender systems. I hope for a good conference, and if the organizers and the community follow the example set this year, we have every chance to get it.
Balázs Hidasi is the Head of Data Mining and Research in Gravity R&D. He is responsible for coordinating his team’s and conducting his own research on advanced recommender algorithms. His areas of expertise include deep learning, context-aware recommender systems, tensor- and matrix factorization. Balázs also coordinates and consults for data mining projects within the company. He has a PhD in computer science from the Budapest University of Technology.