How do you evaluate a playlist?
I've had a number of conversations with researchers about playlist
generation and how best to evaluate the quality of a playlist. It is not an
easy problem; some are skeptical about whether it can be done at all
without thousands and thousands of real music listeners as evaluators,
but lots of people have ideas. A number of researchers seem keen on
figuring out how best to evaluate playlists in a more formal setting
like MIREX.
Ben Fields has started the ball rolling with a post on his blog: introductory thoughts on a playlist generation task in MIREX 2009. If you are interested in playlist generation, then join the conversation.
Some ideas that have been floated for playlist evaluation:
- The traditional IR approach - use a large database of human-generated playlists (from Webjay, MusicMobs, etc.), randomly remove tracks from each playlist, and calculate precision and recall for systems that try to predict which tracks were removed (see the sketch after this list).
- Human evaluation - have experts (DJs, music critics) and non-experts evaluate the playlists.
- Create a reverse Turing test - present each system with a set of playlists, some human-created, some created at random; systems try to predict which playlists are human-generated (a scoring sketch follows below).
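To make the first idea concrete, here is a minimal sketch of the held-out-track evaluation. Everything here is illustrative rather than a settled MIREX protocol: `predict_tracks` is a hypothetical stand-in for whatever playlist generator is under test, and the toy random baseline exists only to show the call pattern.

```python
import random

def evaluate_holdout(playlists, predict_tracks, holdout_fraction=0.3, seed=42):
    """Hide a fraction of the tracks in each playlist, ask the system to
    predict the hidden tracks, and report average precision and recall."""
    rng = random.Random(seed)
    precisions, recalls = [], []
    for playlist in playlists:
        if len(playlist) < 4:
            continue  # too short to split meaningfully
        tracks = list(playlist)
        rng.shuffle(tracks)
        n_hidden = max(1, int(len(tracks) * holdout_fraction))
        hidden, visible = set(tracks[:n_hidden]), tracks[n_hidden:]
        # The system sees the remaining tracks and predicts the missing ones.
        predicted = set(predict_tracks(visible, k=n_hidden))
        hits = len(predicted & hidden)
        precisions.append(hits / len(predicted) if predicted else 0.0)
        recalls.append(hits / len(hidden))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Toy baseline: predict random tracks from a global pool (illustrative only).
pool = ["track-%d" % i for i in range(100)]
playlists = [random.sample(pool, 10) for _ in range(50)]
baseline = lambda visible, k: random.sample(pool, k)
print(evaluate_holdout(playlists, baseline))
```

A random baseline like this is worth running first: it sets the floor that any real playlist generator has to clear before its precision and recall numbers mean anything.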
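And a similarly hedged sketch of the reverse Turing test. The pairwise scoring scheme here is my own illustration: the system under test is any function that assigns a playlist a "humanness" score, and it is graded on how often it ranks a human playlist above a randomly generated one.

```python
def reverse_turing_score(human_playlists, random_playlists, humanness):
    """Pair each human playlist with a randomly generated one and count
    how often the system scores the human playlist higher.
    0.5 is chance performance; 1.0 is perfect discrimination."""
    results = [humanness(h) > humanness(r)
               for h, r in zip(human_playlists, random_playlists)]
    return sum(results) / len(results)
```

One nice property of this framing is that it needs no human judges at evaluation time: the human-generated playlists themselves serve as ground truth.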