Clive Thompson has an excellent piece today in the New York Times Magazine about recommendation: "If You Liked This, You're Sure to Love That". The article gives a good overview of the Netflix Prize and some of the problems competitors face in trying to predict whether you would give "Michael Clayton" 2.2 stars or 2.3 stars. As Steve Krause points out, Clive Thompson even tries to explain how singular value decomposition works - not something you see every day in a newspaper article.
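For readers who want a concrete picture of the technique the article describes, here is a minimal sketch of SVD-style rating prediction on a toy matrix. The data and the rank-2 cutoff are made up for illustration; this is not Cinematch or any prize entrant's actual system.

```python
import numpy as np

# Toy user-by-movie rating matrix (0 = unrated); entries are illustrative only.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Fill missing entries with each movie's mean observed rating before factoring.
col_means = R.sum(axis=0) / (R != 0).sum(axis=0)
R_filled = np.where(R == 0, col_means, R)

# Truncated SVD: keep only the k strongest latent "taste" dimensions.
k = 2
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted rating for user 0 on a movie they haven't rated (column 2).
print(round(R_hat[0, 2], 2))
```

The reconstructed matrix `R_hat` fills in every cell, which is exactly why the latent dimensions can look "ineffable": they are whatever directions best compress the ratings, not human-nameable genres.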

Clive does seem to fall into the common trap of assuming that computers must be seeing nuances and connections that humans can't:

    Possibly the algorithms are finding connections so deep and subconscious that customers themselves wouldn’t even recognize them. At one point, Chabbert showed me a list of movies that his algorithm had discovered share some ineffable similarity; it includes a historical movie, “Joan of Arc,” a wrestling video, “W.W.E.: SummerSlam 2004,” the comedy “It Had to Be You” and a version of Charles Dickens’s “Bleak House.” For the life of me, I can’t figure out what possible connection they have, but Chabbert assures me that this singular value decomposition scored 4 percent higher than Cinematch — so it must be doing something right. As Volinsky surmised, “They’re able to tease out all of these things that we would never, ever think of ourselves.” The machine may be understanding something about us that we do not understand ourselves.
Or they may just be overfitting the data.

I was hoping to see Clive talk about the problems with the Netflix Prize - how it overemphasizes the importance of relevance in recommendation at the expense of novelty and transparency. The teams involved in the prize spend all of their time trying to predict how many stars each of the many thousands of Netflix customers would assign to movies. This skews the results away from novel and diverse recommendations.
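The prize's single scoring criterion is root-mean-squared error over held-out ratings, which is why everything reduces to "2.2 stars or 2.3 stars". A minimal sketch of the metric, with two hypothetical models:

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error: the Netflix Prize's sole scoring criterion."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Two hypothetical recommenders scored on the same held-out ratings.
actual  = [4, 1, 3, 5, 2]
model_a = [3.8, 1.5, 3.1, 4.6, 2.2]
model_b = [3.5, 2.0, 3.0, 4.0, 2.5]

print(rmse(model_a, actual) < rmse(model_b, actual))  # -> True: model A "wins"
```

Note what the metric cannot see: model A wins even if every one of its recommendations is an obvious item the user already knows about.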

Similarly, the Netflix Prize pays no attention to helping people understand why something is being recommended. There are some good papers showing that recommenders that can explain why something is being recommended improve a user's trust in the recommender and its recommendations.

The short and accessible paper "Being Accurate is Not Enough: How Accuracy Metrics Have Hurt Recommender Systems" provides an excellent counterpoint to the approach taken by the Netflix Prize. Some highlights from this paper:

  • Item-Item similarity can bury the user in a 'similarity hole' of like items.
  • Recommendations with higher diversity are preferred by users even when the lists perform worse on Netflix-prize style accuracy measures.
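The diversity the paper argues for can be measured directly. A common sketch is intra-list diversity: the average pairwise dissimilarity across a recommendation list. The genre tags below are made up for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    return len(a & b) / len(a | b)

def intra_list_diversity(items, features):
    """Average pairwise dissimilarity (1 - Jaccard) across a recommendation list."""
    pairs = list(combinations(items, 2))
    return sum(1 - jaccard(features[x], features[y]) for x, y in pairs) / len(pairs)

# Hypothetical genre tags, purely for illustration.
features = {
    "Movie A": {"action", "thriller"},
    "Movie B": {"action", "thriller"},
    "Movie C": {"comedy", "romance"},
    "Movie D": {"documentary"},
}

narrow_list = ["Movie A", "Movie B"]            # the 'similarity hole'
broad_list = ["Movie A", "Movie C", "Movie D"]

print(intra_list_diversity(narrow_list, features))  # -> 0.0
print(intra_list_diversity(broad_list, features))   # -> 1.0
```

A score of 0 is the 'similarity hole' in miniature: every item in the list is interchangeable with every other.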

The New York Times article describes the 'Napoleon Dynamite' problem: a film that people either love (five stars) or hate (one star), which makes it really hard to predict. One researcher says that this single movie, of the 100,000 movies in the Netflix collection, accounts for 15% of the error in their recommender. I suggest that a better way to deal with the Napoleon Dynamite problem is to incorporate this uncertainty into the recommendation directly. A recommendation such as "Napoleon Dynamite is a quirky film that appeals to a certain sense of humor - you may love this movie, or you may hate this movie - but whichever, it will certainly be something you will remember" will be much more informative than a recommendation of "3 stars".
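Detecting a Napoleon-Dynamite-style film is straightforward once you look past the mean rating to its spread. A minimal sketch with invented rating samples (the titles, samples, and the 1.5-star threshold are all assumptions for illustration):

```python
from statistics import mean, stdev

# Hypothetical rating samples: a polarizing film vs. a broadly liked one.
polarizing = [5, 1, 5, 1, 5, 1, 5, 1]
consensus = [4, 5, 4, 5, 4, 4, 5, 4]

for title, ratings in [("Polarizing film", polarizing), ("Consensus film", consensus)]:
    m, s = mean(ratings), stdev(ratings)
    # Rather than reporting only the mean (3 stars for the polarizing film),
    # surface the spread so the recommendation can say "love it or hate it".
    label = "love-it-or-hate-it" if s > 1.5 else "consensus pick"
    print(f"{title}: mean={m:.1f}, stdev={s:.1f} -> {label}")
```

Both films could get similar mean predictions, but only the variance tells the recommender which one deserves the "you may love it or hate it" framing.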

When people learn that I work with recommender systems, they often ask me if I am working on the Netflix Prize. I tell them no, for two reasons. First, there are people way smarter than me already working on this problem, and they will certainly get better results than I would ever be able to. Second, and perhaps more importantly, I don't think it is a very relevant problem to solve. There are other aspects of recommendation - novelty, diversity, transparency, steerability, coverage, and trust - that are just as important, and a good recommender can't just optimize one aspect; it has to look at all of them.


I agree with your comments about accuracy not being enough. I have not worked directly on the "Prize" problem, but I have looked some at the data. Statistically significant relationships may be an entry point to generate hypotheses, but they are essentially meaningless without an associated explanation. I am not sure why Netflix focused the competition on prediction error alone. Unless I am missing something, I find the question of predicting "whether you would give Michael Clayton 2.2 stars or 2.3 stars" not meaningful, and even the best possible algorithm will probably have an average error much worse than that. So has anything been learned from the prize competition?

Posted by Fred Annexstein on November 24, 2008 at 12:08 PM EST #

Nice blog post! :-)
I also enjoyed that NYT article very much (and I totally agree with everything you wrote).

Despite the obvious overfitting problems and all the flaws of the evaluation metric: It was a genius idea to set up the Netflix prize and I congratulate Netflix for setting the bar to 10%. So close and yet still so far away after such a long time and so many attempts.

Btw, it's great to see a team of Austrians on rank 2. And it's even nicer to see that they seem linked to one of my all time favorite professors!* :-)


Posted by Elias on November 24, 2008 at 05:16 PM EST #


This blog copyright 2010 by plamere