How do you evaluate a playlist?
I've had a number of conversations with researchers about playlist
generation and how best to evaluate the quality of a playlist. It is not an
easy problem; some are skeptical about whether it can be done at all
without thousands and thousands of real music listeners as evaluators,
but lots of people have ideas. A number of researchers seem keen on
figuring out how best to evaluate playlists in a more formal setting
like MIREX.
Ben Fields has started the ball rolling with a post on his blog: introductory thoughts on a playlist generation task in MIREX 2009. If you are interested in playlist generation, then join the conversation.
Some ideas that have been floated for playlist evaluation:
- The traditional IR approach - use a large database of human-generated playlists (from Webjay, MusicMobs, etc.), randomly remove tracks from each playlist, and calculate precision and recall for systems that try to predict which tracks were removed (see the sketch after this list).
- Human evaluation - have experts (DJs, music critics) and non-experts evaluate the playlists.
- Create a reverse Turing test - present each system with a set of playlists, some human-created, some created at random; systems try to predict which playlists are human-generated (a scoring sketch follows below).
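To make the first idea concrete, here is a minimal sketch of the held-out-track evaluation. Everything here is illustrative rather than a settled MIREX protocol: `predict_tracks` is a hypothetical stand-in for whatever playlist generator is under test, and the toy random baseline exists only to show the call pattern.

```python
import random

def evaluate_holdout(playlists, predict_tracks, holdout_fraction=0.3, seed=42):
    """Hide a fraction of the tracks in each playlist, ask the system to
    predict the hidden tracks, and report average precision and recall."""
    rng = random.Random(seed)
    precisions, recalls = [], []
    for playlist in playlists:
        if len(playlist) < 4:
            continue  # too short to split meaningfully
        tracks = list(playlist)
        rng.shuffle(tracks)
        n_hidden = max(1, int(len(tracks) * holdout_fraction))
        hidden, visible = set(tracks[:n_hidden]), tracks[n_hidden:]
        # The system sees the remaining tracks and predicts the missing ones.
        predicted = set(predict_tracks(visible, k=n_hidden))
        hits = len(predicted & hidden)
        precisions.append(hits / len(predicted) if predicted else 0.0)
        recalls.append(hits / len(hidden))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Toy baseline: predict random tracks from a global pool (illustrative only).
pool = ["track-%d" % i for i in range(100)]
playlists = [random.sample(pool, 10) for _ in range(50)]
baseline = lambda visible, k: random.sample(pool, k)
print(evaluate_holdout(playlists, baseline))
```

A random baseline like this is worth running first: it sets the floor that any real playlist generator has to clear before its precision and recall numbers mean anything.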
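And a similarly hedged sketch of the reverse Turing test. The pairwise scoring scheme here is my own illustration: the system under test is any function that assigns a playlist a "humanness" score, and it is graded on how often it ranks a human playlist above a randomly generated one.

```python
def reverse_turing_score(human_playlists, random_playlists, humanness):
    """Pair each human playlist with a randomly generated one and count
    how often the system scores the human playlist higher.
    0.5 is chance performance; 1.0 is perfect discrimination."""
    results = [humanness(h) > humanness(r)
               for h, r in zip(human_playlists, random_playlists)]
    return sum(results) / len(results)
```

One nice property of this framing is that it needs no human judges at evaluation time: the human-generated playlists themselves serve as ground truth.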