A first peek at the data

The LastFM-ArtistTags2007 dataset has roughly 100,000 unique tags applied to about 20,000 artists. Let's take a closer look at some of the data.

First, here's a plot of tag frequency. The first thing to notice is the power-law distribution. Not surprisingly, the tags appear to follow Zipf's law (where the frequency of a tag is inversely proportional to its rank). Interestingly, 45,000 of the 100,000 or so tags have been applied only once.
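As a rough illustration of how a rank/frequency table like the one behind this plot can be built, here is a minimal Python sketch. The records are made-up toy data, and the (tag, count) record shape and the function names are assumptions for illustration, not the dataset's actual file format.

```python
from collections import Counter

# Hypothetical toy records of (tag, times applied to one artist).
# The real dataset is far larger; these values are invented.
records = [
    ("rock", 8), ("indie", 4), ("rock", 4), ("emo", 3),
    ("shoegaze", 1), ("seen live", 6), ("zeuhl", 1),
]

def rank_frequency(records):
    """Sum counts per tag and return (rank, tag, frequency) rows, most frequent first."""
    totals = Counter()
    for tag, count in records:
        totals[tag] += count
    ranked = totals.most_common()  # sorted by total count, descending
    return [(rank, tag, freq) for rank, (tag, freq) in enumerate(ranked, start=1)]

def count_singletons(table):
    """Count tags that have been applied exactly once overall."""
    return sum(1 for _, _, freq in table if freq == 1)

table = rank_frequency(records)
singletons = count_singletons(table)  # 2 of the 6 toy tags
```

Plotting the frequency column against the rank column gives the kind of curve described above.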

Looking at a log-log plot of frequency vs. rank, we see that the data is roughly linear starting at a rank of around 25. For tags at ranks below 25, the curve tails off from Zipf's law. I think this tailing off is to be expected: the most-applied tag, 'rock', is not really very descriptive, and I suspect that many taggers self-edit and skip obvious tags.
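One way to put a number on "linear in log-log space" is to fit a straight line to log(frequency) against log(rank) for ranks at or beyond the cutoff. The sketch below does this with ordinary least squares in plain Python. The function name and the min_rank parameter are hypothetical, the cutoff of 25 is simply the value read off the plot, and a least-squares fit like this is a rough check rather than a rigorous power-law estimate.

```python
import math

def fit_loglog_slope(freqs, min_rank=1):
    """Least-squares fit of log(freq) = intercept + slope * log(rank).

    freqs must be sorted in descending order, so freqs[0] is rank 1.
    Only ranks >= min_rank with a positive frequency are used.
    """
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(freqs, start=1)
        if rank >= min_rank and freq > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic, perfectly Zipfian frequencies (not the real data):
# frequency is exactly inversely proportional to rank.
ideal = [1000.0 / rank for rank in range(1, 201)]
slope, intercept = fit_loglog_slope(ideal, min_rank=25)
```

For data that follows Zipf's law exactly, the fitted slope is very close to -1. Comparing the slope fitted with and without the top-ranked tags is one way to see how much the head of the distribution departs from the line.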

This next plot gives a closer look at the 5,000 most frequently applied tags. In this dataset, about 2,500 tags have been applied 100 times or more.

The most frequently applied tags, with the number of times each has been applied, are:

440854 rock
343901 seen live
277747 indie
245259 alternative
184491 metal
158252 electronic
136691 punk
124599 pop
119930 indie rock
102937 classic rock
97264 alternative rock
89277 female vocalists
79497 emo
77455 death metal
76898 Hip-Hop
76668 hardcore
74650 electronica
73034 singer-songwriter
69169 black metal
62284 jazz
60559 hard rock
59763 folk
59729 punk rock
58135 Progressive rock
57860 heavy metal
54398 industrial

In the top 25, almost all of the tags are genre-related (the exceptions being 'seen live' and 'female vocalists'). The data appears to skew away from what one would expect from a general listening population. There's no 'Country' in the top 25, but there are four kinds of metal.

In future posts, I'll take a closer look at the various types of tags.

Comments:

So do you think we can treat tags the same way we treat terms in unstructured text? I.e., do you think we can throw away the top n ranked tags, and call them "stoptags"?

Or do the tags "rock" and "indie", etc. have more statistical discriminatory ability than do the terms "the" and "and"?

Yes, there is more semantic richness to the word "rock" than to the word "the". But from a statistical, usefulness, system-building, recommender perspective, can and/or should we throw away "rock", the same way we throw away "the"?

P.S. minor typo: "In this dataset, about 2,5000 tags have been applied 100 times or more."

Posted by jeremy on June 10, 2008 at 01:34 PM EDT #

Jeremy - typo fixed, thanks. As for treating the top N terms as stop words, it will be an interesting experiment to try. I've been having good success with just using traditional tf-idf term weighting to de-emphasize tags like 'Rock' and 'Indie' without throwing them out altogether. tf-idf weighting eliminates the problem of deciding how many to throw out - so we don't have to decide whether or not to toss the top 5, 10 or 100 tags.

Posted by Paul on June 10, 2008 at 01:43 PM EDT #
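To make the tf-idf idea from the comment above concrete, here is a minimal sketch that treats each artist as a "document" and each tag as a "term". It uses one common textbook variant (term frequency normalized by the artist's total tag count, idf = log(N / df)), which is an assumption and not necessarily the exact weighting Paul used. The artist names and counts are invented.

```python
import math

# Hypothetical toy data: artist -> {tag: times applied}. Values are invented.
artist_tags = {
    "Artist A": {"rock": 10, "shoegaze": 5},
    "Artist B": {"rock": 8, "indie": 4},
    "Artist C": {"rock": 6, "jazz": 9},
}

def tfidf(artist_tags):
    """Return artist -> {tag: weight} using tf * log(N / df)."""
    n_artists = len(artist_tags)
    # Document frequency: number of artists each tag has been applied to.
    df = {}
    for tags in artist_tags.values():
        for tag in tags:
            df[tag] = df.get(tag, 0) + 1
    weights = {}
    for artist, tags in artist_tags.items():
        total = sum(tags.values())
        weights[artist] = {
            tag: (count / total) * math.log(n_artists / df[tag])
            for tag, count in tags.items()
        }
    return weights

weights = tfidf(artist_tags)
```

In this toy data 'rock' is applied to every artist, so its idf is log(1) = 0 and its weight drops to zero, while rarer tags such as 'shoegaze' keep a positive weight. In a real dataset a popular tag is unlikely to appear on every artist, so it would be down-weighted rather than removed, with no cutoff to choose.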

Using something similar to tf-idf (which doesn't give very obscure tags too much of a boost) and not considering rock etc as stop words sounds like a good idea :-)

Posted by elias on June 10, 2008 at 06:35 PM EDT #

Yup, the idf approach does make sense.

Just wondering if there would be a statistically significant difference, in the end task/end goal, between a low-idf rock tag, and no rock tag at all. Guess we'd have to run that experiment :-)

Posted by jeremy on June 10, 2008 at 08:18 PM EDT #

Duke Listens!: Visit my main blog at MusicMachinery.com
