I have had a number of conversations with researchers about playlist generation and how best to evaluate how good a playlist is. It is not an easy problem. Some are skeptical about whether it can be done at all without thousands and thousands of real music listeners as evaluators - but lots of people have ideas. A number of researchers seem keen on figuring out how best to evaluate playlists in a more formal setting like MIREX.

Ben Fields has started the ball rolling with a post on his blog: introductory thoughts on a playlist generation task in MIREX 2009. If you are interested in playlist generation, then join the conversation.

Some ideas that have been floated for a playlist evaluation:

  • The traditional IR approach - use a large database of human-generated playlists (from webjay, musicmobs, etc.), randomly remove tracks from each playlist, and calculate precision and recall for systems that try to predict which tracks were removed.
  • Human evaluation - have experts (DJs, music critics) and non-experts evaluate the playlists.
  • A reverse Turing test - present each system with a set of playlists, some human-created and some created at random, and have systems try to predict which playlists are human-generated.
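
To make the first idea concrete, here is a rough sketch of how the hide-and-predict scoring might look. Nothing here comes from an actual MIREX task definition: the function names `hide_tracks` and `precision_recall`, the track IDs, and the choice to score each playlist as a set of tracks are all my own assumptions for illustration.

```python
import random


def hide_tracks(playlist, n_hidden, rng):
    """Split a playlist into (visible, hidden) by removing n_hidden tracks at random.

    Hypothetical helper: assumes the playlist holds unique track IDs.
    """
    hidden_idx = set(rng.sample(range(len(playlist)), n_hidden))
    visible = [t for i, t in enumerate(playlist) if i not in hidden_idx]
    hidden = {t for i, t in enumerate(playlist) if i in hidden_idx}
    return visible, hidden


def precision_recall(predicted, hidden):
    """Score a system's guesses against the set of tracks that were removed."""
    predicted = set(predicted)
    hits = len(predicted & hidden)
    # Precision: what fraction of the guesses were actually removed tracks.
    precision = hits / len(predicted) if predicted else 0.0
    # Recall: what fraction of the removed tracks the system found.
    recall = hits / len(hidden) if hidden else 0.0
    return precision, recall


if __name__ == "__main__":
    playlist = ["t1", "t2", "t3", "t4", "t5"]  # made-up track IDs
    visible, hidden = hide_tracks(playlist, 2, random.Random(0))
    # A real system would see only `visible`; here we fake its guesses.
    guesses = list(hidden)[:1] + ["t99"]
    print(precision_recall(guesses, hidden))
```

A real evaluation would average these scores over many playlists, and would have to decide how many guesses each system is allowed, since precision and recall trade off against each other as that number grows.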
Any more ideas?


This blog copyright 2010 by plamere