Similarity based on rating data
Yahoo Research's Malcolm Slaney presented a paper at ISMIR on
Wednesday about how they are using user rating data to create song
similarity data. Yahoo is in the enviable position of having billions of
user-taste data points about music. This data, naturally, can be
used to generate item-to-item similarities that would be extremely
useful as ground truth for any number of MIR tasks. Malcolm's
motivation for the talk was to propose an alternative to the rather
time-consuming and painful process of human evaluations that are
used in the music similarity task in MIREX. Malcolm presented a
rather traditional item-to-item collaborative filtering system - nothing
new in the approach. I was hoping that at the end Malcolm would say
that they were giving a big wad of the taste data, or the item-item
similarity data, to the MIR community, but alas, Malcolm says that it is
just too hard to give away such data - especially after the AOL shared-data fiasco of last year.
That's all fair. And now I understand your question about the not-so-similar results. Nobody, including me, has shown that item-to-item similarity forms a metric space. Any ideas?
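One concrete reason for doubt: the most common way to turn similarity into distance, 1 minus cosine similarity, already fails the triangle inequality. A toy sketch (made-up two-user rating vectors, not Yahoo data) makes the violation explicit:

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cosine similarity, a common 'distance' derived from similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1 - dot / norm

# Three toy item profiles over two users: b overlaps both a and c,
# but a and c share no raters at all.
a, b, c = (1, 0), (1, 1), (0, 1)

d_ab = cosine_distance(a, b)  # 1 - 1/sqrt(2), about 0.293
d_bc = cosine_distance(b, c)  # same, about 0.293
d_ac = cosine_distance(a, c)  # 1.0

# A metric would require d_ac <= d_ab + d_bc, but 1.0 > 0.586.
print(d_ac > d_ab + d_bc)  # True -> not a metric
```

So at minimum the raw similarity scores would need a different transformation (or an embedding step) before anyone could claim a metric space.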
As far as data goes... we have released a large dataset of music rating data. The dataset contains over 717 million ratings of 136 thousand songs given by 1.8 million users of Yahoo! Music services. The data was collected between 2002 and 2006. Each song in the dataset is accompanied by artist, album, and genre attributes. The users, songs, artists, and albums are represented by randomly assigned numeric IDs so that no identifying information is revealed. Alas, since the media IDs are randomized, there is no way to connect this to content.
Send email to me at [email protected] for more information.
- Malcolm
Posted by Malcolm Slaney on September 29, 2007 at 03:29 AM EDT #