A first peek at the data
First, here's a plot of tag frequency. The first thing to notice is the power-law distribution. Not surprisingly, it looks like tags follow Zipf's law (where the frequency of a tag is inversely proportional to its rank). Interestingly, 45,000 of the 100,000 or so tags have been applied only once.
Looking at a log-log plot of the frequency vs. rank data, we see that the data is roughly linear from a rank of around 25 onward. For the tags ranked above that point (the 25 or so most applied tags), the curve tails off from Zipf's law. I think this tailing off is to be expected. The most applied tag, 'rock', is not really very descriptive. I suspect that many taggers will self-edit and not apply obvious tags.
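One way to put a number on "linear in a log-log plot" is to fit a straight line to log(frequency) against log(rank); for an ideal Zipf distribution the slope is -1. The following is a minimal sketch of that check, not the code used to make the plots here. The function name loglog_slope, the min_rank parameter, and the synthetic counts are all made up for illustration.

```python
import math


def loglog_slope(counts, min_rank=1):
    """Least-squares slope of log(frequency) against log(rank).

    counts   -- tag application counts, in any order
    min_rank -- ignore ranks below this value (hypothetical knob, e.g. 25
                to skip the head of the distribution)
    """
    ranked = sorted(counts, reverse=True)
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(ranked, start=1)
        if rank >= min_rank and freq > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points to fit a line")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic counts that follow Zipf's law exactly: freq = 1000 / rank.
# These are NOT the tag counts from this post.
ideal = [1000.0 / r for r in range(1, 201)]
slope_all = loglog_slope(ideal)
slope_tail = loglog_slope(ideal, min_rank=25)
print(slope_all, slope_tail)  # both close to -1 for ideal Zipf data
```

On real tag counts, comparing the slope fitted over all ranks with the slope fitted from rank 25 onward would show how much the head of the distribution departs from the rest.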
This next plot gives a closer look at the 5,000 most frequently applied tags. In this dataset, about 2,500 tags have been applied 100 times or more.
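Counts like "applied only once" and "applied 100 times or more" fall straight out of a frequency table. Here is a rough sketch, assuming the raw data is available as a flat list with one entry per tag application; that input format, the function name frequency_summary, and the toy data are assumptions, not a description of the actual dataset.

```python
from collections import Counter


def frequency_summary(tag_applications, threshold=100):
    """Summarize how often each distinct tag was applied.

    tag_applications -- iterable with one tag string per application
                        (assumed input format)
    threshold        -- cutoff for "frequently applied" tags
    """
    counts = Counter(tag_applications)
    return {
        "distinct_tags": len(counts),
        "applied_once": sum(1 for c in counts.values() if c == 1),
        "applied_at_least_threshold": sum(
            1 for c in counts.values() if c >= threshold
        ),
        "most_applied": counts.most_common(1)[0] if counts else None,
    }


# Toy data for illustration only.
sample = ["rock"] * 5 + ["indie"] * 3 + ["jazz"] * 3 + ["shoegaze"]
summary = frequency_summary(sample, threshold=3)
print(summary)
```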
The top 25 tags applied are:
440854 rock
343901 seen live
277747 indie
245259 alternative
184491 metal
158252 electronic
136691 punk
124599 pop
119930 indie rock
102937 classic rock
97264 alternative rock
89277 female vocalists
79497 emo
77455 death metal
76898 Hip-Hop
76668 hardcore
74650 electronica
73034 singer-songwriter
69169 black metal
62284 jazz
60559 hard rock
59763 folk
59729 punk rock
58135 Progressive rock
57860 heavy metal
54398 industrial
In the top 25, almost all of the tags are genre related (with the exception of 'seen live' and 'female vocalists'). The data appears to skew away from what one would expect from a general listening population. There's no 'Country' in the top 25, but there are 4 kinds of metal.
In future posts, I'll take a closer look at the various types of tags.
So do you think we can treat tags the same way we treat terms in unstructured text? I.e., do you think we can throw away the top n ranked tags, and call them "stoptags"?
Or do the tags "rock" and "indie", etc. have more statistical discriminatory ability than do the terms "the" and "and"?
Yes, there is more semantic richness to the word "rock" than to the word "the". But from a statistical, usefulness, system-building, recommender perspective, can and/or should we throw away "rock", the same way we throw away "the"?
P.S. minor typo: "In this dataset, about 2,5000 tags have been applied 100 times or more."
Posted by jeremy on June 10, 2008 at 01:34 PM EDT
Jeremy - typo fixed, thanks. As for treating the top N terms as stop words, it will be an interesting experiment to try. I've been having good success with just using traditional tf-idf term weighting to de-emphasize tags like 'Rock' and 'Indie' without throwing them out altogether. tf-idf weighting eliminates the problem of deciding how many to throw out - so we don't have to decide whether or not to toss the top 5, 10 or 100 tags.
Posted by Paul on June 10, 2008 at 01:43 PM EDT
Using something similar to tf-idf (which doesn't give very obscure tags too much of a boost) and not considering rock etc as stop words sounds like a good idea :-)
Posted by elias on June 10, 2008 at 06:35 PM EDT
Yup, the idf approach does make sense.
Just wondering if there would be a statistically significant difference, in the end task/end goal, between a low-idf rock tag, and no rock tag at all. Guess we'd have to run that experiment :-)
Posted by jeremy on June 10, 2008 at 08:18 PM EDT
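For readers following the tf-idf discussion in the comments, here is a minimal sketch of what tf-idf weighting over tags could look like if each artist is treated as a "document" and each tag as a "term". The thread does not say which tf-idf variant is being used, so the unsmoothed formula, the function name tfidf_weights, and the toy artists and counts are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_weights(artist_tags):
    """Weight each artist's tags by tf-idf.

    artist_tags -- {artist: {tag: times_applied}} (assumed data layout)
    tf  = tag count / total tag applications for that artist
    idf = log(number of artists / number of artists carrying the tag)
    """
    n_artists = len(artist_tags)

    # Document frequency: how many artists carry each tag at least once.
    df = Counter()
    for tags in artist_tags.values():
        df.update(tag for tag, count in tags.items() if count > 0)

    weights = {}
    for artist, tags in artist_tags.items():
        total = sum(tags.values())
        if total == 0:
            weights[artist] = {}
            continue
        weights[artist] = {
            tag: (count / total) * math.log(n_artists / df[tag])
            for tag, count in tags.items()
            if count > 0
        }
    return weights


# Toy data: 'rock' is on every artist, the other tags are on one each.
toy = {
    "artist_a": {"rock": 50, "shoegaze": 5},
    "artist_b": {"rock": 40, "death metal": 20},
    "artist_c": {"rock": 30, "jazz": 10},
}
w = tfidf_weights(toy)
print(w["artist_a"])
```

With this unsmoothed formula, a tag that appears on every artist gets an idf of log(1) = 0, so it behaves exactly like a stop tag; a tag that appears on most but not all artists keeps a small positive weight. Whether that small weight changes results in the end task is the open question raised in the last comment.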