In JNN (The Juicy News Network), FreeTTS announces new articles on various news feeds. As with most things on the web, news feeds are not always named with words found in the dictionary. My current set of feeds includes names like slashdot, javapedia, MacRumors, eJournal, and salon.com. None of these words are in the FreeTTS lexicon. Furthermore, FreeTTS looks up words after tokenization, so text like 'salon.com' gets turned into 'salon com'; the 'dot' in '.com' is lost.

Certainly, it would be helpful if an application could easily extend the FreeTTS lexicon. FreeTTS does have a lexicon for this purpose, called the addenda, but unfortunately it is somewhat buried and hard to use that way. So to fix up the JNN pronunciations, I had to write a little function that filters the text and rewrites the troublesome words before they are spoken.

To do this I used the java.util.regex package to search and fix up the text. It is incredibly easy to use and quite powerful. For instance, here are some code snippets:

Here's a snippet that turns 'salon.com' to 'salon dot com'.

private static Pattern dotComPattern = Pattern.compile("(\\S)\\.(\\S)");
description = dotComPattern.matcher(description).replaceAll("$1 dot $2");

Here's a snippet that replaces wiki words with space-separated words. TheOneRing would become The One Ring. Another two-liner:

private static Pattern wikiPattern = Pattern.compile("([a-z])([A-Z])");
description = wikiPattern.matcher(description).replaceAll("$1 $2");

Finally, in a bit of irony, FreeTTS can't pronounce its own name, so I fix that up by turning it into 'free TTS'.
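
Putting the pieces together, the filter function might look roughly like the sketch below. This is not the actual JNN code: the class and method names (PronunciationFilter, fixPronunciation) are made up for illustration, and the FreeTTS fix is assumed to be a plain literal substitution, since the original snippet for it isn't shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not taken from JNN.
final class PronunciationFilter {
    // Literal match for the one name FreeTTS can't say (assumed approach).
    private static final Pattern FREETTS_PATTERN = Pattern.compile("FreeTTS");
    // A '.' with a non-space character on each side, as in 'salon.com'.
    private static final Pattern DOT_COM_PATTERN = Pattern.compile("(\\S)\\.(\\S)");
    // A lowercase letter followed directly by an uppercase one, as in 'MacRumors'.
    private static final Pattern WIKI_PATTERN = Pattern.compile("([a-z])([A-Z])");

    private PronunciationFilter() {
    }

    static String fixPronunciation(String description) {
        // Run the FreeTTS rule first; otherwise the wiki-word rule would
        // split it into 'Free TTS' before this rule could see it.
        description = FREETTS_PATTERN.matcher(description).replaceAll("free TTS");
        description = DOT_COM_PATTERN.matcher(description).replaceAll("$1 dot $2");
        description = WIKI_PATTERN.matcher(description).replaceAll("$1 $2");
        return description;
    }
}
```

With this sketch, fixPronunciation("TheOneRing at salon.com") gives "The One Ring at salon dot com". Note that the dot rule is fairly blunt: it would also rewrite things like '3.14' or 'e.g.', which may or may not be what you want spoken.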

With this set of changes, JNN reads and announces articles with much more understandable pronunciations. Its tongue is less tied. Still, it would be better to have a more systematic method of addressing such unusual pronunciations. Two possibilities: a user-editable lexicon, or annotating RSS feeds with pronunciation information. You can check out this code in RSSAnnouncer.java, part of the JNN code repository.
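
To make the user-editable lexicon idea a little more concrete, here is a minimal sketch of what one could look like on the application side. It is only an illustration under my own assumptions: UserLexicon and its methods are hypothetical names, it does not touch the FreeTTS addenda at all, and it simply rewrites whole words in the text before it is handed to the synthesizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: a table of word -> spoken form that a user could edit.
final class UserLexicon {
    // One rewrite rule: a whole-word pattern and its spoken replacement.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Register a word and the text that should be spoken in its place.
    void add(String word, String spokenForm) {
        // Pattern.quote keeps characters like '.' literal; \b limits the
        // match to whole words so 'slashdot' does not match 'slashdotted'.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        // quoteReplacement keeps '$' and '\' in the spoken form literal.
        rules.add(new Rule(p, Matcher.quoteReplacement(spokenForm)));
    }

    // Apply every rule, in the order the entries were added.
    String apply(String text) {
        for (Rule rule : rules) {
            text = rule.pattern.matcher(text).replaceAll(rule.replacement);
        }
        return text;
    }
}
```

The entries could come from anywhere the user can edit, such as a properties file; that part is left out here. After add("slashdot", "slash dot") and add("FreeTTS", "free TTS"), apply("FreeTTS reads slashdot") returns "free TTS reads slash dot".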

This blog copyright 2010 by plamere