I make a Java newbie mistake - and almost lose a month of data
For one of the Search Inside the Music demos that I'll be showing at the Sun Labs open house next week, I've been building up a database of music artists and related info. I've been gathering this data from many different places on the web using a crawler. It's not a fast process - it takes about a month to build up enough data to make it useful. My crawler collects the data and writes it out to a set of text files so that later it can be indexed with our nifty search engine.
I tested the whole process using a small crawl of the web on my Linux laptop. When I was happy that everything was working fine, I started the crawl running on one of our large Solaris servers. Unfortunately, there was a lurking bug ... a Java programming 101 kind of bug that would corrupt the data collected from the month-long crawl.
Music artist names are very international. There's Björk, there's José Feliciano, there's Mötley Crüe and Motörhead (there's even a whole genre of music called umlaut metal). My mistake was forgetting that when writing a text file in Java (using a PrintWriter, for instance) without naming an encoding, Java uses the platform default, which it takes from the operating system. Now for my Linux laptop, the default encoding is UTF-8, which can handle all of the umlauts and accents. But for our Solaris server, the default encoding is plain old ASCII. With its 7 bits, ASCII can't represent any of the accented characters that appear in the artist names, so each one was written out as a question mark. When I indexed my 30 days of data and started looking at the results I was very sad to see "bj?rk" and "m?tley cr?e".
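Here's a small sketch of the failure mode, not the crawler's actual code. It assumes Java 7 or later (for StandardCharsets), and the class and method names are made up for illustration. It encodes a name to bytes with a given charset and decodes it back, which is roughly what happens when text goes out to a file and gets read in again. The name is spelled with a Unicode escape so that the example doesn't depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the crawler described in this post.
class DefaultEncodingDemo {

    // Encode the text to bytes with the given charset, then decode it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for any character it cannot represent; for US-ASCII that is '?'.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String artist = "Bj\u00f6rk"; // "Björk", written with an escape

        // What the JVM picked up from the operating system on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // ASCII cannot represent the o-umlaut, so it is lost.
        System.out.println("ASCII survives:  "
                + artist.equals(roundTrip(artist, StandardCharsets.US_ASCII)));

        // UTF-8 can represent it, so the name comes back intact.
        System.out.println("UTF-8 survives:  "
                + artist.equals(roundTrip(artist, StandardCharsets.UTF_8)));
    }
}
```

The first line of output is the interesting one to compare across machines: it shows which encoding a writer will silently use when none is named.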
With our open house demo just 5 days from now, there's no way for me to recrawl the data and save it to disk using the proper encoding. Luckily, when I did the initial crawl, I resolved all of the artists to a MusicBrainz ID. I'm able to turn this ID back into the canonical name for the artist, so I am able to patch the names without having to do a recrawl. Whew!
So my lesson for the day is ... don't rely
on the default encoding when reading and writing text. Now, back to
getting the rest of the demo to work.
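For the record, here is a minimal sketch of what naming the encoding explicitly could look like on both the writing and the reading side. It assumes Java 7 or later (for StandardCharsets and try-with-resources); the class, the method names and the temp file are hypothetical, chosen only for this example.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
class Utf8TextFiles {

    // Write one line per entry, always as UTF-8, whatever the platform default is.
    static void writeLines(File file, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
            // PrintWriter swallows IOExceptions, so check for errors explicitly.
            if (out.checkError()) {
                throw new IOException("Failed writing " + file);
            }
        }
    }

    // Read the file back, decoding it as UTF-8 rather than the platform default.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Convenience for the demo: write to a temp file and read it straight back.
    static List<String> roundTripThroughTempFile(List<String> lines) {
        try {
            File file = File.createTempFile("artists", ".txt");
            file.deleteOnExit();
            writeLines(file, lines);
            return readLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Artist names spelled with Unicode escapes: Björk, Mötley Crüe, Motörhead.
        List<String> artists = Arrays.asList(
                "Bj\u00f6rk", "M\u00f6tley Cr\u00fce", "Mot\u00f6rhead");
        // Prints whether the names came back unchanged.
        System.out.println(artists.equals(roundTripThroughTempFile(artists)));
    }
}
```

The point of the design is simply that the charset appears in the code on both sides, so the same program behaves the same way on the laptop and on the server.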