For one of the Search Inside the Music demos that I'll be showing at the Sun Labs open house next week, I've been building up a database of music artists and related info. I've been gathering this data from many different places on the web using a crawler. It's not a fast process: it takes about a month to build up enough data to make it useful. My crawler collects the data and writes it out to a set of text files so that later it can be indexed with our nifty search engine.

I tested the whole process using a small crawl of the web on my Linux laptop. When I was happy that everything was working fine, I started the crawl running on one of our large Solaris servers. Unfortunately, there was a lurking bug ... a Java programming 101 kind of bug that would corrupt the data collected during the month-long crawl.

Music artist names are very international. There's Björk, there's José Feliciano, there's Mötley Crüe and Motörhead (there's even a whole genre of music called umlaut metal). My mistake was forgetting that when writing a text file in Java (using a PrintWriter, for instance) without naming an encoding, Java uses the platform's default encoding, which it picks up from the operating system's locale settings. Now for my Linux laptop, the default encoding is UTF-8, which can handle all of the umlauts and accents. But for our Solaris server, the default encoding is plain, old ASCII. With its 7 bits, ASCII can't represent any of the rich characters that are needed to represent all of the artist names, so every character it can't encode gets written out as a question mark. When I indexed my 30 days of data and started looking at the results, I was very sad to see "bj?rk" and "m?tley cr?e".
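The failure is easy to reproduce without a Solaris box. Here's a minimal sketch, not my actual crawler code: the class and method names are made up for illustration, and an in-memory buffer stands in for the text file. It writes a name through a PrintWriter in a given encoding and reads the bytes back.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingDemo {
    // Write text through a PrintWriter using the given charset, then decode
    // the resulting bytes with the same charset (hypothetical helper).
    static String writeAndReadBack(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(buffer, charset))) {
            out.print(text);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Unicode escape for "Björk", so the source file's own encoding can't bite us.
        String name = "Bj\u00f6rk";

        // What a writer created without an explicit charset would use on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        // ASCII has no 'ö', so the writer silently substitutes '?'.
        System.out.println(writeAndReadBack(name, StandardCharsets.US_ASCII)); // prints Bj?rk

        // UTF-8 round-trips the name unchanged.
        System.out.println(writeAndReadBack(name, StandardCharsets.UTF_8).equals(name)); // prints true
    }
}
```

Note that nothing throws and nothing warns. The substitution is silent, which is why it can sit unnoticed for a month.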

With our open house demo just 5 days from now, there's no way for me to recrawl the data and save it to disk using the proper encoding.  Luckily, when I did the initial crawl, I resolved all of the artists to a MusicBrainz ID.  I'm able to turn this ID back into the canonical name for the artist, so I am able to patch the names without having to do a recrawl.  Whew!
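The patch itself amounts to a lookup keyed on the ID. A rough sketch of the idea, with loud caveats: the map below is a stand-in for however the ID gets resolved to a canonical name, the "mbid-1" key is a placeholder rather than a real MusicBrainz ID, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class NamePatcher {
    // Prefer the canonical name for this ID when one is known; otherwise
    // keep whatever the crawl recorded (hypothetical helper).
    static String patchName(String mbid, String crawledName, Map<String, String> canonicalNames) {
        return canonicalNames.getOrDefault(mbid, crawledName);
    }

    public static void main(String[] args) {
        // Placeholder ID-to-name table; a real one would be filled from MusicBrainz.
        Map<String, String> canonicalNames = new HashMap<>();
        canonicalNames.put("mbid-1", "Bj\u00f6rk");

        // The mangled name is replaced because its ID is known.
        System.out.println(patchName("mbid-1", "Bj?rk", canonicalNames).equals("Bj\u00f6rk")); // prints true

        // An unknown ID leaves the crawled name alone.
        System.out.println(patchName("mbid-2", "Some Artist", canonicalNames)); // prints Some Artist
    }
}
```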

So my lesson for the day is ... don't rely on the default encoding when reading and writing text. Now, back to getting the rest of the demo to work.
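In practice that means naming the encoding every time a file is opened, for writing and for reading. Here's one way to do it with the standard java.nio.file API, as a sketch rather than what my crawler actually does; the class name, method names, and one-name-per-line file layout are assumptions for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ExplicitEncoding {
    // Write one name per line, always as UTF-8, whatever the platform default is.
    static void writeNames(Path file, List<String> names) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String name : names) {
                out.write(name);
                out.newLine();
            }
        }
    }

    // Read the names back, decoding with the same explicit charset.
    static List<String> readNames(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // Write to a temporary file and read it back (helper for the demo only).
    static List<String> roundTrip(List<String> names) {
        try {
            Path file = Files.createTempFile("artists", ".txt");
            try {
                writeNames(file, names);
                return readNames(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for "Björk" and "Motörhead".
        List<String> names = Arrays.asList("Bj\u00f6rk", "Mot\u00f6rhead");
        System.out.println(roundTrip(names).equals(names)); // prints true
    }
}
```

The same file now comes out identical on the Linux laptop and the Solaris server, because neither the writer nor the reader ever consults the platform default.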
 

Comments:

It'd be handy if IDEs could detect reliance on the default encoding and flag it up as a potential problem.

Posted by Ricky Clarkson on April 20, 2007 at 09:30 AM EDT #

Hi Paul! Just found out you left a link for this site. I would love to tell you some more about my project. I'm pretty busy for the next week or so, but after that I can sum up and write a bit more if you'd like? Anyway, your blog seems pretty interesting too! -K

Posted by Kasper Nørlund on April 20, 2007 at 05:58 PM EDT #


This blog copyright 2010 by plamere