Friday Aug 20, 2004

The key to sounding like an American when speaking English is not how well you pronounce the words, but how well you mispronounce them. Instead of saying better, say bedder; instead of talking, say talkin'; instead of got you, say gotchew. The site SpokenAmericanEnglish.com has a number of cases and examples of how best to mispronounce our fair language. So today, when it is time for lunch, you can say to your office mate: "Jeet Jet?"

Saturday Aug 14, 2004

The folks in the Cognitive Machines Group at The MIT Media Lab have built a robot that allows them to "investigate connections between natural language semantics, perception, and action". The robot, called Ripley, is a seven-degree-of-freedom robot with many sensors, including touch, vision, gravity, position and sound. You can talk to Ripley about its world of objects, and Ripley will understand you. There's an interesting paper on how they use a visually grounded language model to allow descriptions of scenes in a natural way. Some sample phrases are:

the green one on the left that's hidden by a purple one

on the left the purple one that's all the way in the corner and it's separate

in the middle towards the right there's a line of purple ones and then
there's a kink in the line and the one that's right where the lines turns

the purple one all the way on the right in the front

Using phrases such as these, a person can command Ripley to select and manipulate a single object amongst a group of objects. Ripley can distinguish objects based on size, color, weight (it will weigh objects if it needs to), position, and relation to other objects.

Ripley uses Sphinx-4 for speech recognition. The folks in the Cognitive Machines Group have been great folks to work with and have contributed many good features and enhancements to Sphinx-4.

Wednesday Aug 11, 2004

I see over at the JavaSound Interest archive that Florian Bomers (aka Mr. JavaSound) has left Sun. Florian says:

Hi Javasounders!

I have left my position at Sun. I've had sufficient time (and I worked until the last minute) to ensure that Java Sound in 1.5 is in good shape, and that my successor will find enough and complete information for a smooth start. I also offered Sun to assist where needed. I wouldn't want Java Sound to receive less attention than before!

The decision to leave is part of my original plan to stay at Sun for the duration of the development cycles of 1.4.1, 1.4.2 and 1.5.

Now I will take up my own business, in the field of music software (but a specific plan is not yet set in stone). I will continue my engagement with this mailing list, and I plan to increase my activity for the tritonus and jsresources.org projects.

Therefore "goodbye" as official Sun representative, and "hello" from here!

Florian

Florian was a great resource and worked very hard to make the JavaSound API work well. He will be greatly missed.

Now, I wonder what is going to happen with JavaSound. Recall Josh Simon's post of last year calling for a major overhaul of JavaSound. Maybe now is the time to take a long hard look at JavaSound and see where it should go next.

I'd like to see, at the very least, an effort to ensure equivalent behavior across all supported platforms. Our speech engines make extensive use of JavaSound and the biggest issue we have in supporting Solaris, Linux, OS X and Win32 is differences in behavior of JavaSound across the different platforms.
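A quick way to see some of these platform differences for yourself is to list what JavaSound reports on each machine. The sketch below is only an illustration (the class name MixerReport is made up for this post); it uses the standard javax.sound.sampled calls to print each mixer along with how many kinds of source and target lines it claims to support. Run it on two platforms and compare the output.

```java
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Mixer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing JavaSound setups across platforms.
public class MixerReport {

    /** Returns one line per mixer that JavaSound reports on this machine. */
    public static List<String> describeMixers() {
        List<String> lines = new ArrayList<>();
        for (Mixer.Info info : AudioSystem.getMixerInfo()) {
            Mixer mixer = AudioSystem.getMixer(info);
            // Source lines are for playback, target lines are for capture.
            lines.add(info.getName() + " | " + info.getVendor()
                    + " | source line types: " + mixer.getSourceLineInfo().length
                    + " | target line types: " + mixer.getTargetLineInfo().length);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = describeMixers();
        if (lines.isEmpty()) {
            System.out.println("No mixers reported on this machine.");
        }
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

A machine without audio hardware may report no mixers at all, which is itself one of the differences worth knowing about.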

Monday Aug 09, 2004

GigNews has a good article called Speech Interfaces for Games that gives a good basic overview of how speech recognition works and hints at some of the difficulties in writing good speech apps. A good read. Here's a snippet:

Scenario #1
Computer: "Captain, do you want to open hailing frequencies, 
open fire or activate the Jump Drive?"

User: "Fire all weapons on the nearest enemy frigate."

Computer: "Fire lasers, missiles or all weapons?"

User: "All weapons."

Computer: "Fire on a frigate, a battleship or a freighter?"

User: "I said the nearest enemy frigate!"

Computer: "There are three enemy frigates within range: the Saratoga, 
the Seville and the San Diego. Which one do you want to target?"

User: "The nearest, damn you!"

Computer: "I'm sorry, I did not understand. Please try again."

User: "Malefice! You foul silicon-based denizen of the tar
 pits of Hell! A plague on you and your unnatural family!"

Ship is broad-sided by an enemy battle cruiser and destroyed.

Scenario #1 is an example of a fixed-initiative system, and a rather bad one. (Effective fixed-initiative systems are in common use in 411 directory assistance; they are still somewhat brittle and work without human intervention in a mere fraction of calls, but the phone company still saves fortunes thanks to them.)

Scenario #1's computer asks all the questions, handles very specific answers, and doesn't even listen to anything else. If the user volunteers information to try to speed up the interaction, the computer will ignore it until the "proper" time for this bit of data is reached. Only in the simplest of cases (and with the most docile of users) will a fixed-initiative dialogue achieve good results quickly and naturally.
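To make the fixed-initiative idea concrete, here is a minimal sketch of my own (it is not code from the GigNews article, and the class and slot names are invented). The system walks through its questions in a fixed order and accepts only an exact answer to the question it is currently asking. Anything else, including information the user volunteers early, is thrown away.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of a fixed-initiative dialogue.
public class FixedInitiativeDialogue {

    // Slots in the fixed order in which the system asks about them.
    private final Map<String, List<String>> slots = new LinkedHashMap<>();
    // Answers accepted so far, keyed by slot name.
    private final Map<String, String> filled = new LinkedHashMap<>();

    public FixedInitiativeDialogue() {
        slots.put("action", Arrays.asList(
                "open hailing frequencies", "open fire", "activate the jump drive"));
        slots.put("weapon", Arrays.asList("lasers", "missiles", "all weapons"));
        slots.put("target", Arrays.asList("frigate", "battleship", "freighter"));
    }

    /** Name of the slot currently being asked about, or null when all are filled. */
    public String currentSlot() {
        for (String name : slots.keySet()) {
            if (!filled.containsKey(name)) {
                return name;
            }
        }
        return null;
    }

    /**
     * Accepts the utterance only if it exactly matches an allowed answer
     * for the current slot. Everything else is ignored.
     */
    public boolean hear(String utterance) {
        String slot = currentSlot();
        if (slot == null) {
            return false;
        }
        String normalized = utterance.trim().toLowerCase(Locale.ROOT);
        if (slots.get(slot).contains(normalized)) {
            filled.put(slot, normalized);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FixedInitiativeDialogue dialogue = new FixedInitiativeDialogue();
        // The user volunteers everything at once; the system ignores all of it.
        System.out.println(dialogue.hear("Fire all weapons on the nearest enemy frigate"));
        System.out.println(dialogue.currentSlot());
        // Only the exact expected answer moves the dialogue along.
        System.out.println(dialogue.hear("open fire"));
        System.out.println(dialogue.currentSlot());
    }
}
```

A mixed-initiative system would instead try to fill any slot it can from whatever the user says, which is much harder to build but far less frustrating to use.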

Every morning I check the Sphinx-4 support forums for questions that people have about using Sphinx-4. There are usually a handful of questions that have magically surfaced overnight. I dutifully spend anywhere from 5 minutes to an hour or more providing answers to the questions posed by Sphinx-4 users. Some of the questions are very basic. Those are easy to answer: I just point people to the FAQ or some other online document that provides the answers. Sometimes, though, the questions are more difficult. These are questions that push the envelope. These questions reveal:

  • Parts of Sphinx-4 with poor documentation
  • Areas in Sphinx-4 with a cumbersome API
  • Common use cases that are not easily implemented with the current API
  • Poor error messages

My daily dose of support gives me a strong hint about where we need to improve Sphinx-4. This close interaction with users helps me understand how people are using Sphinx-4, what areas are confusing, and what features need to be supported. I get a good idea of how to make the system better. I think this is one of the best attributes of the open source model. The tight feedback loop between users and developers leads to a much better system.

Thursday Aug 05, 2004

Willie pointed out this interesting Newsday article Spanish callers get lost in translation about the added difficulties in designing an automated voice recognition system that is to be used by multi-lingual speakers. Here's a snippet:

Part of the reason is that people don't always use pure Spanish, speech experts said, but a mix of Spanish and English. Some Hispanics might say, "uno dos tres Seventh Ave." Others might say, "one twenty-three Seventh Avenue." Still others might say "uno dos tres Septima Avenida," or "uno dos tres Avenida Siete," and other possibilities.

Building a voice recognition system that works well with native English speakers is challenging enough, but that is not going to be good enough for a large (and growing) segment of the population.

Tuesday Aug 03, 2004

The Graph of the Month for August over at AiSee is one of our Sphinx-4 JSGF graphs. AiSee is a great tool for visualizing large data sets. We've instrumented Sphinx-4 to dump out all sorts of graphs (in AiSee's GDL format). You can read more about that here.
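For a rough idea of what dumping a graph in GDL involves, here is a small sketch. It is not the Sphinx-4 code; the class name GdlDump and the node names are invented, and the output uses only the basic GDL constructs as I understand them (graph, node with a title, edge with sourcename and targetname). Node names containing double quotes are not escaped, so check the AiSee documentation before relying on it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that renders a list of edges as GDL text.
public class GdlDump {

    /** Each edge is a two-element array: {source node, target node}. */
    public static String toGdl(String title, List<String[]> edges) {
        // Collect every node mentioned by an edge, preserving first-seen order.
        Set<String> nodes = new LinkedHashSet<>();
        for (String[] edge : edges) {
            nodes.add(edge[0]);
            nodes.add(edge[1]);
        }

        StringBuilder sb = new StringBuilder();
        sb.append("graph: {\n");
        sb.append("  title: \"").append(title).append("\"\n");
        for (String node : nodes) {
            sb.append("  node: { title: \"").append(node).append("\" }\n");
        }
        for (String[] edge : edges) {
            sb.append("  edge: { sourcename: \"").append(edge[0])
              .append("\" targetname: \"").append(edge[1]).append("\" }\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"start", "one"});
        edges.add(new String[] {"start", "two"});
        edges.add(new String[] {"one", "end"});
        edges.add(new String[] {"two", "end"});
        System.out.print(toGdl("tiny grammar", edges));
    }
}
```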

Evandro has started a new forum on the Sphinx Website at SourceForge called Sphinx4 Sightings where people can post a description of their project or paper that uses Sphinx-4. If you are a user of Sphinx-4 and want to share information about your project, please post it at the Sphinx-4 Sightings Forum.

Monday Aug 02, 2004

I'm just back from a trip to the west coast. Willie and I spent three days in Mountain View giving demos of FreeTTS and Sphinx-4 at the Sun Labs open house. For a quick lunch one day, we stopped into In-N-Out Burger, a fast food burger joint. They don't have these on the east coast. The In-N-Out menu is very simple: three kinds of hamburgers, french fries, and drinks. That's it. The place was clean, the burger was good. What more could you ask for?

This serves as a good model for API design. Unlike McDs, In-N-Out has tried to keep its interface (the menu) nice and simple. And since their API is so simple (just burgers and fries, no fajitas, no apple pies, no chicken, no fish, no salads), they can do a good job. It's The Unix Philosophy applied to hamburgers: Make each program do one thing well. The advantages of a small menu are obvious: shorter lines, since customers don't spend too much time deciding what to have, and better food, since they are just cooking burgers and can make them to order.

I have seen APIs that use the McDs approach ... give the customer all the choices. Look at the log4j API. It is a logging and tracing package. Now, in my mind, logging packages should be pretty straightforward. They need to take a message, classify it and send it somewhere to be output. Looking at the log4j API I see over 120 classes in the API. This API is so big there's a whole book describing it. In my mind, that is way too complicated, with too many choices. All I really want to do is this:
    Logger.log(WARN, "Coffee is too hot");
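Just to show how little it takes to support that one call, here is a burger-and-fries logger. It is a sketch, not a replacement for log4j, and everything in it (the TinyLogger name, the four levels, the message format) is invented for illustration. It takes a message, classifies it by level, and sends it to an output stream.

```java
import java.io.PrintStream;

// Hypothetical single-class logger: classify a message and send it somewhere.
public class TinyLogger {

    public enum Level { DEBUG, INFO, WARN, ERROR }

    // Messages below this level are dropped.
    private static Level threshold = Level.INFO;
    // Where accepted messages are written.
    private static PrintStream out = System.err;

    public static void setThreshold(Level level) {
        threshold = level;
    }

    public static void setOutput(PrintStream stream) {
        out = stream;
    }

    public static void log(Level level, String message) {
        // Enum constants compare in declaration order: DEBUG < INFO < WARN < ERROR.
        if (level.compareTo(threshold) >= 0) {
            out.println("[" + level + "] " + message);
        }
    }

    public static void main(String[] args) {
        log(Level.WARN, "Coffee is too hot");
        log(Level.DEBUG, "Steam detected"); // below the threshold, so dropped
    }
}
```

With a static import of TinyLogger.Level.WARN, the call site becomes TinyLogger.log(WARN, "Coffee is too hot"), which is about as close to the one-liner above as Java gets.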

API designers need to walk the fine line between functionality and usability. Too much functionality can limit the appeal of an API. Just give me a burger with fries please.

Thursday Jul 29, 2004

Tony Printezis, a researcher here at Sun Labs, has developed a tool for visualizing the Java heap. GCSpy is an adaptable heap visualization tool that lets you look at what is in your heap, how the different heap regions are being used, and how the sizes of those regions change over time. It is a good tool to help you tune the GC in your app.
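If you just want a rough text-only peek at the heap regions, the java.lang.management API that arrives with Java 1.5 can report them. The sketch below has nothing to do with how GCSpy works internally, and the class name HeapPools is made up; it simply lists each heap memory pool with its used and committed sizes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports the JVM's heap memory pools as text.
public class HeapPools {

    /** Returns one line per heap pool with its current used and committed sizes. */
    public static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(String.format("%-30s used=%d KB committed=%d KB",
                    pool.getName(),
                    usage.getUsed() / 1024,
                    usage.getCommitted() / 1024));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```

The pool names depend on which collector the JVM is running, so the output will differ from one configuration to the next.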

Wednesday Jul 28, 2004

RJ mentions that the next working draft for VoiceXML 2.1 is available at the W3C.

Sunday Jul 25, 2004

A while back, Bryan posted a program that renders a spoken version of the classic "99 bottles of beer on the wall" using FreeTTS 1.1. Bryan was using the lower-level (non-JSAPI) FreeTTS API, and that API has changed a bit with the release of FreeTTS 1.2, so I thought I'd update the source so it will work with 1.2. This little code snippet, when coupled with FreeTTS, will indeed sing the whole song without ever getting tired.

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class BottlesOfBeer  {

    public static void main(String[] args) {

        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice voice = voiceManager.getVoice("kevin16");

        if (voice == null) {
            System.err.println( "Cannot find kevin16");
            System.exit(1);
        }

        voice.allocate();
        for (int i = 99; i > 0; i--) {
            String verse = i + getBottle(i) + " of beer on the wall, " +
                    i + getBottle(i) + " of beer. " +
                    "Take one down,  Pass it around, " +
                    (i - 1) + getBottle(i - 1) + " of beer on the wall.";
            voice.speak(verse);
        }
        voice.speak("Thank you everyone.  You've been a great audience.  " +
                "I'll be here all week!");

        /* Clean up and leave.
         */
        voice.deallocate();
        System.exit(0);
    }

    public static String getBottle(int numberOfBottles) {
        return numberOfBottles != 1 ? " bottles" : " bottle";
    }
}

I figured with all those kids heading out to camp this summer with their laptops, they can just fire this up on those bus rides instead of having to sing it themselves.

Oh, and be sure to check out www.99-bottles-of-beer.net for 621 variations of this program in just about every programming language ever written. A most valuable resource.

Thursday Jul 22, 2004

Steve Pietrowicz just left a comment saying "nice demo" for ZipCity. Steve has to be the King of the Cool demos. Check out this Smart Environments project that he did at the National Center for Supercomputing Applications. Steve used Java and Jini to create an environment of hardware and software services that was able to interact with people in a variety of ways. This room included a video wall with an 8192x3840 display. Check out the picture.

At JavaOne 2002 Steve won the Helix software competition with a program that used Java3D and FreeTTS to simulate a scanner that discovers DNA sequence errors. The program shows the nucleotides being scanned, and FreeTTS announces when the program encounters errors.

So Steve, King of the Demos ... thanks for the kudos.

Wednesday Jul 21, 2004

Willie has started a new forum on the FreeTTS Website at SourceForge called FreeTTS Sightings where people can post a description of their project or paper that uses FreeTTS. There are already a dozen or so projects listed.

If you are a user of FreeTTS and want to share information about your project, please post it at the FreeTTS Sightings Forum.

We'll be setting up a similar area for Sphinx-4 soon.

Tuesday Jul 20, 2004

In preparation for next week's dog and pony show, I've revamped our ZipCity demo of Sphinx4. Now when you speak a zip code, ZipCity will show the location of the city on the map of the continental USA. Here's a screenshot:

You can try it out here: ZipCity Version 2.

This is still, as far as I know, the only downloaded app in the world that includes a speech recognizer.

The new look of ZipCity was heavily inspired by Ben Fry's zipdecode. Mr. Fry specializes in developing systems for visualizing large and changing sets of data. His MIT website has numerous novel examples. Well worth checking out.
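For the curious, placing a city on a map like this mostly comes down to turning a latitude and longitude into a pixel position. The sketch below is not the ZipCity or zipdecode code. It assumes you already have a table that maps each zip code to a latitude and longitude, it uses a simple linear mapping rather than a proper map projection, and the bounding box for the continental USA is my own rough approximation.

```java
// Hypothetical helper that maps a latitude/longitude to a pixel on a map panel.
public class MapPlotter {

    // Approximate bounding box of the continental USA, in degrees (an assumption).
    public static final double NORTH = 49.5;
    public static final double SOUTH = 24.5;
    public static final double WEST = -125.0;
    public static final double EAST = -66.5;

    /**
     * Linearly maps a position inside the bounding box to {x, y} pixel
     * coordinates, with (0, 0) at the top-left corner of the panel.
     */
    public static int[] toPixel(double lat, double lon, int width, int height) {
        double xFraction = (lon - WEST) / (EAST - WEST);
        double yFraction = (NORTH - lat) / (NORTH - SOUTH); // y grows downward
        int x = (int) Math.round(xFraction * (width - 1));
        int y = (int) Math.round(yFraction * (height - 1));
        return new int[] {x, y};
    }

    public static void main(String[] args) {
        // The north-west corner of the box lands on the top-left pixel.
        int[] corner = toPixel(NORTH, WEST, 800, 400);
        System.out.println(corner[0] + "," + corner[1]);
    }
}
```

A real map would want a projection that accounts for the curvature of the earth, but for a dot-per-zip-code display a linear mapping gets you something recognizable.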

This blog copyright 2010 by plamere