Tuesday Jun 24, 2008

One issue that is encountered when working with social tags is synonymy, that is, taggers have lots of ways to say the same thing. For instance, looking at the tags that have been applied to an artist like Jay-Z we see tags such as "Hip-hop", "hiphop", "hip hop" and "rap". Now it is pretty clear that "Hip-hop" and "hiphop" probably mean the same thing. This diversity can be good sometimes - for instance, if we know that "hip-hop" and "hip hop" usually mean the same thing, then when someone searches for "hip-hop" we can include results that match either "hip-hop" or "hip hop". Of course, we could figure out that "hip-hop" and "hip hop" are closely related just by using text methods - but things are not usually so easy. What we'd like to do is develop a method that can look at the set of tags and determine which tags are synonyms, and which tags are closely related but are probably not synonyms (such as "hip hop" and "rap").
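As a rough illustration of the "text methods" mentioned above, here is a minimal sketch that groups tags whose spellings differ only in case, spacing or punctuation. The function names are hypothetical, and the normalization rule (keep only ASCII letters and digits) is an assumption rather than the method used in these experiments.

```python
import re
from collections import defaultdict

def normalize_tag(tag):
    """Lowercase a tag and drop everything except ASCII letters and digits.

    Note: this simple rule would erase tags written in non-Latin scripts.
    """
    return re.sub(r"[^a-z0-9]", "", tag.lower())

def group_spelling_variants(tags):
    """Group tags whose normalized forms are identical."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalize_tag(tag)].append(tag)
    # Keep only groups that contain more than one surface form.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

variants = group_spelling_variants(["Hip-hop", "hiphop", "hip hop", "rap"])
print(variants)  # {'hiphop': ['Hip-hop', 'hiphop', 'hip hop']}
```

Note that "rap" is left on its own: string normalization finds spelling variants, but it cannot tell us whether two differently spelled tags are synonyms or merely related.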

For a first experiment, I'd like to see if I can automatically find synonyms for the tag "female vocalists". To do this, I need to establish some ground truth. By hand, I've gone through the 5,000 most frequently applied artist tags, looking for tags that may be related to "female vocalists". I found 59 of them, shown here along with the tag rank and tag frequency.

12 89277 female vocalists
80 15874 female
113 8281 Female fronted metal
128 7150 female vocalist
227 2841 riot grrrl
236 2716 female vocals
278 2180 Female Voices
365 1424 female artists
480 955 female vocal
569 698 female singers
571 691 female fronted
619 619 Girl Groups
633 600 Girl Rock
722 501 girls
739 475 diva
786 433 riot grrl
880 374 female singer-songwriter
885 370 women
1023 301 girl group
1064 289 chick rock
1067 287 girl power
1119 269 female singer-songwriters
1202 245 female singer
1224 239 female voice
1544 179 female-vocalists
1587 173 female rock
1625 167 female dance vocals
1650 163 girl music
1727 154 french female
1757 151 chick music
2113 120 Girl
2130 118 Female-fronted Metal
2246 110 chicks
2252 109 woman singer
2681 89 female fronted rock
2757 86 Female Artist
2803 84 girl bands
2893 81 girlie rock
2895 81 female fronted band
3077 75 japanese female vocalists
3082 75 divas
3135 73 girly
3136 73 girl band
3194 71 girl pop
3201 71 eleCtro grrls
3263 69 solo female
3405 65 with female singers
3426 65 front girl band
3609 62 60s girls
3676 60 grrl
3788 58 females
3884 56 woman
3957 55 female vocal trance
4044 54 Female country
4188 51 grrls
4370 49 Female solo artists
4778 44 Favourite Females
4828 43 girl punk
4923 42 girls aloud
Now I want to place the tags into three separate buckets. In bucket 1, I'll put tags that I think are synonyms for "female vocalists". In bucket 2, I'll put tags that are related but not synonyms, and in bucket 3, I'll place tags that are not related to "female vocalists" at all.

Bucket #1 - Synonyms for "female vocalists"

These are female-oriented tags (singular or plural) that don't imply any type of genre.
12 89277 female vocalists
80 15874 female
128 7150 female vocalist
236 2716 female vocals
278 2180 Female Voices
365 1424 female artists
480 955 female vocal
569 698 female singers
571 691 female fronted
619 619 Girl Groups
722 501 girls
739 475 diva
885 370 women
1023 301 girl group
1202 245 female singer
1224 239 female voice
1544 179 female-vocalists
1650 163 girl music
1757 151 chick music
2113 120 Girl
2246 110 chicks
2252 109 woman singer
2757 86 Female Artist
2803 84 girl bands
2895 81 female fronted band
3082 75 divas
3135 73 girly
3136 73 girl band
3263 69 solo female
3405 65 with female singers
3426 65 front girl band
3788 58 females
3884 56 woman
4370 49 Female solo artists

Bucket #2 - Related but not synonyms to "Female Vocalists"

These are female-oriented tags that imply a genre or include another type of qualifier, such as 'favorite'.
113 8281 Female fronted metal
227 2841 riot grrrl
633 600 Girl Rock
786 433 riot grrl
880 374 female singer-songwriter
1064 289 chick rock
1067 287 girl power
1119 269 female singer-songwriters
1587 173 female rock
1625 167 female dance vocals
1727 154 french female
2130 118 Female-fronted Metal
2681 89 female fronted rock
2893 81 girlie rock
3077 75 japanese female vocalists
3194 71 girl pop
3201 71 eleCtro grrls
3609 62 60s girls
3676 60 grrl
3957 55 female vocal trance
4044 54 Female country
4188 51 grrls
4778 44 Favourite Females
4828 43 girl punk
4923 42 girls aloud

Bucket #3 - Not related to "Female Vocalists"

(all of the rest of the 5,000 tags).
Dividing the female-oriented tags like this is not so clear cut, but we have to start somewhere ... I'm open to any other suggestions as to how to divide this space up. Once we have some ground truth (imperfect as it may be), we can develop evaluation criteria that will let us determine how well our synonym detector works.
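One possible way to score a synonym detector against these buckets is with precision and recall. This is only a sketch under that assumption; the post does not say which criteria will actually be used, and the function name and the tiny excerpt of bucket 1 below are purely illustrative.

```python
def precision_recall(predicted, synonyms):
    """Score predicted synonyms against a hand-built ground-truth set.

    precision: fraction of predicted tags that are true synonyms
    recall:    fraction of true synonyms that were predicted
    """
    predicted = set(predicted)
    synonyms = set(synonyms)
    hits = predicted & synonyms
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(synonyms) if synonyms else 0.0
    return precision, recall

# A small excerpt of bucket 1, for illustration only.
bucket1_excerpt = {"female vocalist", "female vocals", "female singers", "girls"}
# A hypothetical detector output; "riot grrrl" belongs to bucket 2.
predicted = {"female vocalist", "female vocals", "riot grrrl"}

precision, recall = precision_recall(predicted, bucket1_excerpt)
print(precision, recall)
```

Here two of the three predictions are in bucket 1, and two of the four ground-truth tags were found. A fuller scheme might also give partial credit for bucket 2 tags, since they are related even if they are not synonyms.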

The next step is to figure out how we can evaluate our synonym predictor. That will be the next post.

Sunday Jun 22, 2008

I've noticed a few articles and resources around open research in the last few weeks. Last week, when I was visiting my parents in Florida, I noticed my Dad's copy of Scientific American had an excellent article called Science 2.0 -- Is Open Access Science the Future? that talks about the advantages and disadvantages of carrying out science in the open, on the web. The article suggests advantages such as a better dialog with people, and more opportunities for collaboration. The article points to the OpenWetWare project at MIT, which is a wiki for biology researchers.

Another excellent resource is the podcast interview by Jon Udell of Jean-Claude Bradley. Jean-Claude is a professor of chemistry at Drexel University who started to make the scientific process as transparent as possible by publishing all research work in real time to a collection of public blogs, wikis and other web pages. He coined the term Open Notebook Science which he describes as: "... there is a URL to a laboratory notebook (like this) that is freely available and indexed on common search engines. It does not necessarily have to look like a paper notebook but it is essential that all of the information available to the researchers to make their conclusions is equally available to the rest of the world. Basically, no insider information."

I really like the idea of 'no insider information' - to put all your successes and failures, every experiment, every bad result out there for everyone to see. Your lab notebook is open for the whole world to see.

Jean-Claude has a presentation that outlines how they use Open Notebook Science at Drexel, which is pretty interesting (albeit quite focused on chemistry). They use a wiki to serve as the lab notebook. They rely on the automatic versioning of the wiki software to keep track of edits, so they always know 'who did what' and 'when'. This addresses some of the concerns Elias raised about making the research record permanent. They then use their blog to highlight interesting results or questions.

To me, this is all pretty interesting, especially for the Music Information Retrieval community. The MIR community has lots of disciplines: musicology, signal processing, machine learning, library science, text IR, coding, user interface, and so on. No one can know all there is to know in this field, so anything that can help increase the opportunities for sharing ideas can really push forward the whole research community.

There are already a number of MIR researchers who are putting their research online (Mark Godfrey and Yves Raimond, to name just a couple). I suspect that more researchers would work this way if the tools were available and the advantages were laid out. Perhaps this would be a good topic for a panel at ISMIR this year. Since Jean-Claude Bradley is at Drexel there could even be an opportunity to have Jean-Claude sit on the panel to serve as the 'expert'. This panel could be about how to do 'open notebook science' with some feedback from folks who are already doing that in the MIR community. I, myself, would find this panel to be very interesting.

Of course, we are way past the time to submit panels to ISMIR, so there may not be any chance to have such a panel, but it doesn't hurt to try .. so if enough MIR folks express some interest to me (just add a comment to this post or send me an email), I'll talk with the ISMIR organizers to see if this would be possible.

Saturday Jun 21, 2008

The organizers for ISMIR 2008 have just posted the schedule for tutorials for this year's conference. Among them is a tutorial that will be presented by Jean-Julien Aucouturier, Elias Pampalk and myself entitled Social Tags and Music Information Retrieval. Here are the details:

Social Tags are free text labels that are applied to items such as artists, playlists and songs. These tags have the potential to have a positive impact on music information retrieval research. In this tutorial we describe the state of the art in commercial and research social tagging systems for music. We explore some of the motivations for tagging. We describe the factors that affect the quantity and quality of collected tags. We present a toolkit that MIR researchers can use to harvest and process tags. We look at how tags are collected and used in current commercial and research systems. We explore some of the issues and problems that are encountered when using tags. We present current MIR-related research centered on social tags and suggest possible areas of exploration for future research.

I am really excited about working on this tutorial with Elias and JJ. One of the highlights of 2007 for me was presenting a tutorial on music recommendation with Oscar Celma. I learned so much about the subject matter while preparing the tutorial, and it was great fun to work with someone as smart as Oscar. I am really looking forward to repeating the experience.

Friday Jun 20, 2008

Netflix suggests that if you liked "Freaks & Geeks" clearly you will like the "Lost Boys of Sudan". And why not, they both have that same fish-out-of-water sense to them.

90A84C5B-A5FA-496D-B320-AE924E3A65DA.jpg

From Comedy Central insider (Thanks, Zac!)

Thursday Jun 19, 2008

In this podcast, Jon Udell interviews WebJay founder Lucas Gonze about the issues surrounding music discovery. It's an interesting, rambling discussion that touches on curated, expert-driven discovery as opposed to machine-driven discovery, problems with music metadata, as well as Lucas' hobby of exploring 19th century parlour music. - Thanks for the tip, Jeremy.

Monday Jun 16, 2008

Since you like National Lampoon's Animal House, you may like this movie about how the economic conditions in 1930s Germany led to the rise of the Nazis.

(Via Andrew Huff.)

Friday Jun 13, 2008

Just for fun, I wrote a little program that takes the set of genres and mixes them up to give us a whole new set of imaginary genres. Some of my favorite imaginary genres are:
  • Acid Rockabilly
  • British Hip Pop
  • Blackened Broken Metal
  • British Stoner Rap
  • Coast Rock
  • Depressive political rock
  • Dirty Neo-Prog
  • Emo Metal
  • German uplifting death-grind
  • Gothic country grrrl
  • Hardcore chick trance
  • Indie anarcho pop
  • Industrial old northern house
  • Liquid Psychedelia
  • New symphonic hip-hop
  • Organic Metal
  • Progressive teen Hardcore
  • Suicidal Swedish viking rock
  • Traditional ambient deutschpunk
  • Twee Afrobeat
I think it would be really cool to listen to some of these genres (such as 'Twee Afrobeat' or 'Liquid Psychedelia') and I think some (such as 'new symphonic hip-hop') would be just horrid. The weird thing is that I can imagine what many of these genres would sound like, even though they don't exist. Just try to imagine what 'Gothic Country Grrrl' would sound like.
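The original program is not shown, so the following is only a guess at how such a genre mixer might work: split real genre names into modifier words and a final "head" word, then recombine them at random and skip anything that already exists. The function name, the splitting rule and the sample genres are all assumptions.

```python
import random

def imaginary_genres(genres, count, seed=None):
    """Recombine words from real genre names into new, made-up names."""
    rng = random.Random(seed)
    existing = {genre.lower() for genre in genres}
    # Modifiers are every word except the last; heads are the last words.
    modifiers = sorted({w for g in genres for w in g.split()[:-1]})
    heads = sorted({g.split()[-1] for g in genres if g.split()})
    if not modifiers or not heads:
        return []
    results = set()
    for _ in range(1000):  # bounded number of attempts
        if len(results) >= count:
            break
        how_many = rng.randint(1, min(2, len(modifiers)))
        picked = rng.sample(modifiers, how_many)
        head = rng.choice(heads)
        candidate = " ".join(picked + [head])
        # Skip real genres and names like "rock rock".
        if head not in picked and candidate.lower() not in existing:
            results.add(candidate)
    return sorted(results)

sample_genres = ["acid rock", "stoner metal", "twee pop",
                 "liquid funk", "psychedelic trance", "rockabilly"]
made_up = imaginary_genres(sample_genres, 5, seed=42)
print(made_up)
```

Passing a seed makes a run repeatable; leaving it out gives a fresh set of imaginary genres each time.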
Here's another Netflix Freakomendation. If you like the HBO Series Big Love (which is an adult show about how a modern-day Utah polygamist who lives in suburban Salt Lake City balances his three wives, seven children, and a mounting avalanche of debt and demands) Netflix recommends the Backyardigans; The Legend of the Volcano Sisters. Strange.

(From zacechola on Flickr)

Thursday Jun 12, 2008

I made this Tag Cloud by pasting the Beatles Wikipedia entry into wordle.

Wordle is pretty cool (via Information Aesthetics)

I took a tour by hand through the top 1,500 last.fm artist tags to identify which ones were related to genre. I found about 700 of them, ranging from the familiar 'rock', 'indie' and 'jazz' to 'aggrotech', 'true Norwegian black metal', and 'terror ebm'. It is actually quite a bit of work filtering the tags like this. I had to look up many tags like 'angura kei' and 'juggalo' to see if they were genre related. And I am sure I made many mistakes. Some of my favorites are:

My hand-edited list of genres can be found here: genre.txt

Update: Zac suggested making the genres links to the last.fm tag page .. so you can see the full list with links after the jump:

Update #2:  Oscar went one better, he made the genres link page with links sized by frequency of occurrence:  Oscar's last.fm genre tag cloud


Tuesday Jun 10, 2008

Steve asked me how many words are in a typical Last.fm artist tag. So let's answer that question. Of the 100,000 or so unique tags, about 33% of them are single word tags (like 'rock', 'emo' and 'metal'), 36% are two word tags (like 'heavy metal', 'punk rock' and 'seen live'), 16% are three word tags ('brutal death metal'), 6% are 4 word tags ('asian drum and bass'), 3% are 5 word tags ('sounds awesome in my car'). The longest tag is 49 words long. It is:
i am a child in a field and i grow things in my dreams to wake and water them to go to sleep and raise them from the earth into existence from an idea to physical existence what power what madness a gentle madness an eleven year olds madness
The longest tag that has been applied multiple times is:
songs that i sing along to but i always forget the words so i say duh duh while trying to sound like i do know the words and no one is falling for it but they keep quiet because they are embarassed for me
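Word-count statistics like the ones above can be computed with a few lines of code. This is a minimal sketch, assuming that a "word" is simply a whitespace-separated token; the function names and the six sample tags are illustrative and are not the real dataset.

```python
from collections import Counter

def word_count_distribution(tags):
    """Return {number of words: fraction of tags with that many words}."""
    counts = Counter(len(tag.split()) for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: counts[n] / total for n in sorted(counts)}

def longest_tag(tags):
    """Return the tag with the most whitespace-separated words."""
    return max(tags, key=lambda tag: len(tag.split()))

sample_tags = ["rock", "emo", "heavy metal", "seen live",
               "brutal death metal", "asian drum and bass"]
distribution = word_count_distribution(sample_tags)
for words, fraction in distribution.items():
    print(f"{words} word(s): {fraction:.0%}")
print(longest_tag(sample_tags))
```

Counting by whitespace treats a hyphen-joined tag as a single word, which matches the way the long single word tags are treated in this post.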

Some long tags that seem to be about music :

bands or people from india or who sound like they might be from india or who sound like they might be from around india or who sound like they might enjoy bands or people who sound like they are from around india
A number of the longer tags seem to be the result of someone being confused about how to enter separate tags. For example, this single tag was probably meant to be 20 individual tags:
rock - metal - industrial - electronic - punk - emo - ebm - alternative - punk - dark electro - japanese - metalcore - female fronted metal - techno - psytrance - love metal - new wave - synth pop - indie

Some long single word tags:

63 Eksprimentell-teknisk-avantgarde-saer-dop-steikbra-metal-musikk
55 SkankledoodleskattysuperSKAalifragilisticexpialidocious
55 No-break-twitch-screaming-grindcore-ninja-commando-team
47 indiepostteletronicrockistsbaroquechamberaltpop
47 hatebreedlookalikessoundsalikeswannabesgoodshit
46 rrrrrrrruuuuuuuuuuuuuuuuuuuuuuggggghhhhhhhhhhh
43 up-down-up-down-left-right-A-B-select-start
40 put-your-brain-in-a-blender-and-drink-it
40 baseballbatmusicforcoolindiekidswithbats
39 fffffffffffffffffffffffffffffffffffffff
38 where-do-thoughts-like-those-come-from
37 good-old-animation-guitar-syntheziser
36 gothic-techno-industrial-dance-disco
36 MAurrrrriiiiittttttiiiiiuuuuuussssss
36 21AAD6B6-A1E6-4e07-B2E9-9F512F446E4B
35 onemightbecomebarrenlisteningtothis
35 not-to-be-confused-with-THE-Citadel
35 Supercallifragilisticexpialidocious
35 Lie-on-the-floor-and-smoke-to-music
35 BatmanUcanBeMySupermanSaveMeHereIam
34 Progressive-Industrial-Atmospheric
34 Industrial-Electroica-Jungle-Death
Once again, many of these seem to be the results of the confused tagger not understanding how to apply multiple tags at once.

317 Unique tags that start with "I ..." such as:

11 I would like to own or listen to more music by these bands and artists
11 I got a patch
11 I dig it
11 I dig
10 I must try
10 I can be cool sometimes
10 I LOVE THIS MUSIC
10 I Have Seen Them
9 I think its a doom metal band
9 I like this stuff
9 I cant account for these
9 I am so sad
8 I love this bands
8 I love them all
8 I like this music
8 I like this
8 I like these guys a lot
My favorite is: I was never a Swedish teenager with swedish teen angst so now I will attempt at having it. Speaking of 'favorite', Last.fm taggers have hundreds of ways of tagging their favorite artists. Among them are:
FavoriteArtists
favoritebands
FAVORTIEEESSS
Favoritmusik
Favoritizims
favoritttes
favoiteness
faveartists
favorites1
favorieten
faveorites
Favourites
Favoritter
favroites
favoutite
favourits
favourite
favour-01
favorites
favoriten
favoritas
faveorite
favaurite
fav0urit3
Favouites
Favoritos
Favoriter
favorito
favoriet
favoirte
favbands
Favoritt
Favorits
Favorite
favorit
Favoris
FavBand
favour
favori
favor8
favies
favela
Favvis
Favess
FavArt
favvy
favou
favos
Favvo
Favor
Faves
favs
favo
fava
Fave
fav
Tags can be crazy - there are just so many different reasons why people tag, there's tons of noise, there are tag abusers - but because there are so many tags, they can also be extremely useful. Next up, we'll try to sift through the tags and categorize them into big buckets like genre, mood, opinion and so on.
The LastFM-ArtistTags2007 dataset has 100,000 unique artist tags for about 20,000 artists. Let's take a closer look at some of the data.

First, here's a plot of tag frequency. The first thing to notice is the power-law distribution. Not surprisingly, it looks like tags follow Zipf's law (where the frequency of a tag is inversely proportional to its rank). Interestingly, 45,000 of the 100,000 or so tags have been applied only once.

tag.freq.full.png

Looking at a log-log plot of the frequency vs. rank data, we see that the data is linear starting at a rank around 25. For tags at rank less than 25 we see a tailing off from Zipf's law. I think this tailing off is to be expected. The most applied tag 'rock' is not really very descriptive. I suspect that many taggers will self-edit and not apply obvious tags.

tag.freq.log.log.png
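One way to check the "linear on a log-log plot" observation numerically is to fit a straight line to log(frequency) against log(rank), starting at some minimum rank. The sketch below uses a plain least-squares fit from the standard library; the function name is hypothetical, and the test data is an idealized Zipf curve rather than the real tag counts. Under Zipf's law the slope should be close to -1.

```python
import math

def loglog_slope(frequencies, min_rank=1):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1 for the most frequent item. Needs at least two
    points with distinct ranks at or beyond min_rank.
    """
    ordered = sorted(frequencies, reverse=True)
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(ordered, start=1)
        if rank >= min_rank and freq > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Idealized Zipf data: frequency is exactly proportional to 1 / rank.
ideal = [1000.0 / rank for rank in range(1, 201)]
slope = loglog_slope(ideal, min_rank=25)
print(round(slope, 6))  # -1.0
```

Running the same fit on the real counts with and without the first 25 ranks would show how much the head of the distribution departs from the straight line.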

This next plot gives a closer look at the 5,000 most frequently applied tags. In this dataset, about 2,500 tags have been applied 100 times or more.

tag.freq.detail.png

The 26 most frequently applied tags are:

440854 rock
343901 seen live
277747 indie
245259 alternative
184491 metal
158252 electronic
136691 punk
124599 pop
119930 indie rock
102937 classic rock
97264 alternative rock
89277 female vocalists
79497 emo
77455 death metal
76898 Hip-Hop
76668 hardcore
74650 electronica
73034 singer-songwriter
69169 black metal
62284 jazz
60559 hard rock
59763 folk
59729 punk rock
58135 Progressive rock
57860 heavy metal
54398 industrial

Among these top tags, almost all are genre related (with the exception of 'seen live' and 'female vocalists'). The data appears to skew away from what one would expect from a general listening population. There's no 'Country' among them, but there are 4 kinds of metal.

In future posts, I'll take a closer look at the various types of tags.

For this Open Research experiment, I shall be working with a set of social tags collected from Last.fm via Audioscrobbler web services. Since it might be handy for anyone following along to have access to the same data set, I am making this data set available directly for any researcher who wants to use it.

The dataset is available for download here: Lastfm-ArtistTags2007

Here are the details as told in the README file:

The LastFM-ArtistTags2007 Data set
Version 1.0
June 2008

What is this?

    This is a set of artist tag data collected from Last.fm using
    the Audioscrobbler webservice during the spring of 2007.

    The data consists of the raw tag counts for the 100 most
    frequently occuring tags that Last.fm listeners have applied
    to over 20,000 artists.

    An undocumented (and deprecated) option of the audioscrobbler
    web service was used to bypass the Last.fm normalization of tag
    counts.  This data set provides raw tag counts.

Data Format:

  The data is formatted one entry per line as follows:

  musicbrainz-artist-id<sep>artist-name<sep>tag-name<sep>raw-tag-count

Example:

    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>american<sep>14
    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>animals<sep>5
    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art punk<sep>21
    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art rock<sep>18
    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>atmospheric<sep>4
    11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>avantgarde<sep>3


Data Statistics:

    Total Lines:      952810
    Unique Artists:    20907
    Unique Tags:      100784
    Total Tags:      7178442

Filtering:

    Some minor filtering has been applied to the tag data.  Last.fm will
    report tag with counts of zero or less on occasion. These tags have
    been removed.

    Artists with no tags have not been included in this data set.
    Of the nearly quarter million artists that were inspected, 20,907
    artists had 1 or more tags.

Files:

    ArtistTags.dat  - the tag data
    README.txt      - this file
    artists.txt     - artists ordered by tag count
    tags.txt        - tags ordered by tag count

License:

    The data in LastFM-ArtistTags2007 is distributed with permission of
    Last.fm.  The data is made available for non-commercial use only under
    the Creative Commons Attribution-NonCommercial-ShareAlike UK License.
    Those interested in using the data or web services in a commercial
    context should contact partners at last dot fm. For more information
    see http://www.audioscrobbler.net/data/

Acknowledgements:

    Thanks to Last.fm for providing the access to this tag data via their
    web services

Contact:

    This data was collected, filtered and by Paul Lamere of Sun Labs. Send
    questions or comments to [email protected]
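For anyone following along, here is a minimal sketch of reading the format described in the README. It assumes the separator is the literal string `<sep>` exactly as printed in the example lines, and that the file is UTF-8 encoded; neither is confirmed by the README, and the function names are hypothetical.

```python
from collections import Counter

SEP = "<sep>"  # assumption: the literal separator shown in the README example

def parse_line(line):
    """Split one entry into (musicbrainz id, artist, tag, raw count)."""
    mbid, artist, tag, count = line.rstrip("\n").split(SEP)
    return mbid, artist, tag, int(count)

def tag_totals(lines):
    """Sum the raw counts for each tag across all artists."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, tag, count = parse_line(line)
        totals[tag] += count
    return totals

# Entries copied from the README example.
sample = [
    "11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>american<sep>14",
    "11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art punk<sep>21",
    "11eabe0c-2638-4808-92f9-1dbd9c453429<sep>Deerhoof<sep>art rock<sep>18",
]
totals = tag_totals(sample)
print(totals.most_common(2))  # [('art punk', 21), ('art rock', 18)]

# With the real file (encoding is an assumption):
# with open("ArtistTags.dat", encoding="utf-8") as data:
#     totals = tag_totals(data)
```

Sorting the resulting totals by count should give a ranking comparable to tags.txt.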

What's all this then? This is an experiment in 'open research' - I'm going to blog my research on a particular topic. Suggestions are welcome.


Wednesday Jun 04, 2008

Last.fm is extending its reach with a new program called Last.fm in a Box. Last.fm in a box will allow Last.fm partners to embed the Last.fm ad-supported player in their site. From the Press Release: The service is dubbed "Last.fm in a Box" because it’s a complete "soundtrack for the Web" experience for users—featuring millions of tracks from Last.fm’s unparalleled music catalogue. Based on the open platform of Last.fm, partner sites can adopt "Last.fm in a Box" simply, easily and quickly. The service will be ad-supported allowing brands and sponsors the opportunity to reach millions of highly engaged music fans across the Web beyond the Last.fm site.

I found it interesting to see the teen/pre-teen site Stardoll.com as one of the "last.fm in a box" partners. This could be an interesting clash of cultures as the 13 year old girls start pushing their scrobbles into last.fm's music brain, while the old-school last.fm users push back with their nefarious tagging of pop music.

My buddy Sten is a long time Java developer as well as the only developer on the planet to contribute to three of the coolest Java projects ever: Project Darkstar, Search Inside the Music and the RCI - a cockpit interface for a tactical reconnaissance camera (which when it went airborne in 2000 may have been one of the first instances of Java flying in a cockpit).

Sten is now blogging about Java and development. Sten has a dry sense of humor that makes these informative posts fun to read. For example, here is Sten on IntelliJ IDEA: I approached this IDE with suspicion, mostly because I didn’t trust its CamelCase spelling. However, I figured that they were trying to spell “Intelligent”, which I could certainly relate to, but made a typo with the “J”, got flustered, stopped, and tried to distract from this blunder by shouting the second word.

This blog copyright 2010 by plamere