Friday Sep 17, 2004

OnJava.com is featuring this article called IRC Text to Speech with Java that shows how to make an IRC client speak by using FreeTTS and PircBot. It's less than a page of code. Pretty cool.

Sprague points out that MIT has placed the course materials for its Automatic Speech Recognition course online at its OpenCourseWare site. This course covers quite a bit of Speech Recognition theory. Rita Singh, one of the guest lecturers for this course, was one of the principal speech architects for Sphinx-4. In assignment 8 of this course, Rita uses Sphinx-3 and Sphinx-4 as part of a lab that shows how an HMM decoder works.

Thursday Sep 16, 2004

This article at the Inquirer is quoting the NY Times as saying that IBM is going to open up the source code for their speech engines (my emphasis added). But in fact, the NY Times article says no such thing. As far as I can tell, IBM is only opening up some of their speech tools and widgets; unfortunately, their speech recognition source will remain closed.

The only open source speech engines, as far as I know, are the Sphinx family of recognizers from CMU (including Sphinx-4, which is written entirely in Java) and the ISIP system developed at Mississippi State.

Wednesday Sep 15, 2004

I've been spending a bit of time lately sorting legos. First, I am getting the middle school collection sorted out for the kick-off of First Lego League. For FLL the school has a collection of two full mindstorm kits plus three years of competition add-ons. This collection barely fits into 4 large tackle boxes (plano). At the same time, our group at work has inherited an extremely large lego collection, lots of mindstorms, scouts, really cool kits (that car is amazing!). We've been sorting this since the beginning of August. Well, I've been concentrating on the kits for school, while Tony and Willie have been doing the bulk of the work.

All this sorting has got me wondering whether there's a better way to do this. Well, it turns out that someone has documented the 27 stages of lego sorting. I think I'm on stage 14 or 15. Sigh ...

Anyway, here's a shot of the sorting progress at work (Tony set up the baggie system).

Tuesday Sep 14, 2004

There's a good interview with Simon Ritter on java.sun.com about Simon's appearance at JavaOne, where he demonstrated a 'wearable computer' that was using Sphinx-4, our speech recognizer written in Java, and FreeTTS, our speech synthesiser written in Java. (By way of Mary Mary).

Willie, our fearless leader in the Speech Group here in Sun Labs, just added support for ECMAScript action tags to Sphinx-4. With Action Tags the parsing of recognition results can be greatly simplified. This is similar to Semantic Interpretation work done by the W3C as part of the VoiceXML effort.

Here's a JSGF grammar that shows how you could use action tags to support a fast food restaurant.

<pizzaTopping> = cheese
               | pepperoni
               | mushrooms
               | mushroom
               | onions
               | onion
               | sausage;

<pizza> = <NULL> { this.$value = new Packages.demo.jsapi.tags.Pizza(); }
          ([and] <pizzaTopping> { this.$value.addTopping($); })*
          (pizza | pie)
          [with] ([and] <pizzaTopping> { this.$value.addTopping($); })*;

// Burger toppings and command.
//
<burgerTopping> = onions
                | pickles
                | tomatoes
                | lettuce
                | cheese;

<burgerCondiment> = mayo
                  | relish
                  | ketchup
                  | mustard
                  | special sauce;

<burger> = ((burger | hamburger) { this.$value
                                    = new Packages.demo.jsapi.tags.Burger(); }
            | cheeseburger { this.$value
                              = new Packages.demo.jsapi.tags.Burger();
                             this.$value.addTopping("cheese"); })
          [with]
          ( [and] ( <burgerTopping> { this.$value.addTopping($); }
                  | <burgerCondiment> { this.$value.addCondiment($); }
                  )
          )*;

public <order> = [I (want | would like) a]
                 (<pizza> | <burger>) { appObj.submitOrder($.$value); };

With this grammar you can say things like:
"I want a pizza with pepperoni and onions and sausage and mushrooms"

"I would like a cheeseburger with relish, onions and special sauce"

and the proper Java objects (Burger or Pizza) will be created with all of the proper toppings and condiments. This eliminates a whole level of parsing that would otherwise have to be done in the Java code. You can read more about Action Tags here: ECMAScript Action Tags for JSGF.
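To make the Java side of this concrete, here is a minimal sketch of the kind of objects the action tags could be building. The grammar only tells us that Pizza has addTopping and that Burger has addTopping and addCondiment; the fields, getters, toString methods and the OrderDemo class are my own assumptions for illustration, not the actual demo.jsapi.tags classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for demo.jsapi.tags.Pizza. Only addTopping()
// is implied by the grammar; everything else is assumed.
class Pizza {
    private final List<String> toppings = new ArrayList<>();

    public void addTopping(String topping) {
        toppings.add(topping);
    }

    public List<String> getToppings() {
        return toppings;
    }

    @Override
    public String toString() {
        return "Pizza" + toppings;
    }
}

// Hypothetical stand-in for demo.jsapi.tags.Burger. Only addTopping()
// and addCondiment() are implied by the grammar.
class Burger {
    private final List<String> toppings = new ArrayList<>();
    private final List<String> condiments = new ArrayList<>();

    public void addTopping(String topping) {
        toppings.add(topping);
    }

    public void addCondiment(String condiment) {
        condiments.add(condiment);
    }

    public List<String> getToppings() {
        return toppings;
    }

    public List<String> getCondiments() {
        return condiments;
    }

    @Override
    public String toString() {
        return "Burger" + toppings + " with " + condiments;
    }
}

// Replays by hand the calls the tags would make for
// "I would like a cheeseburger with relish, onions and special sauce".
class OrderDemo {
    public static void main(String[] args) {
        Burger burger = new Burger();
        burger.addTopping("cheese");          // from the cheeseburger tag
        burger.addCondiment("relish");
        burger.addTopping("onions");
        burger.addCondiment("special sauce");
        System.out.println(burger);
    }
}
```

The point is that the application code only ever sees a finished Burger or Pizza; the tags do the work of deciding which method to call for each word.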

Monday Sep 13, 2004

The FIRST Lego League season kicks off on Wednesday. That's when the FLL announces the season's challenge. Starting Wednesday, I'll be spending a couple of days a week at the local middle school coaching a team that is trying to build a Lego Mindstorms robot that will solve the challenge. This year's theme is No Limits, in which teams will utilize robotics technologies to assist people with various levels of abilities.

Uber-coach Skye Sweeney's team, the Teckno Devils, already have a competition table built and ready to roll. Check out the 2004 Table Setup Pictures.

A story at CNET indicates that IBM will be opening up some of its speech software. Here's the interesting bit:

IBM is donating code that it estimates cost the company $10 million to develop. One collection of speech software for handling basic words for dates, time and locations, like cities and states, will go to the Apache Software Foundation. The company is also contributing speech-editing tools to a second open-source group, the Eclipse Foundation.

It is unclear whether IBM is opening up any of its ViaVoice product. Time will tell.

A number of researchers have been looking at 'wearable computing' to understand the implications and role of a ubiquitous computer. One of the issues that occurs frequently for folks with a 'wearable computer' is dealing with the context shift as the user shifts from dealing with a person, then the computer and then back to the person again.

At the Georgia Institute of Technology they are researching ways to eliminate this context shift by employing 'dual-purpose speech', that is, speech that is part of the conversation with another person but is also being used to control the wearable computer. For instance, while the user negotiates the day and time of a meeting with a co-worker, the wearable computer listens for key phrases and manipulates the calendar as necessary. So if Alice says to Bob "Can I meet with you sometime next week?", and Bob, who is using a wearable computer, replies "When would you like to meet next week?", the wearable computer will spot the phrase "next week" and automatically display Bob's calendar for next week. As the meeting negotiations continue, Bob is able to simultaneously talk with Alice while manipulating his calendar.

The Gatech researchers have written an interesting paper called Augmenting Conversations Using Dual-Purpose Speech that describes this work. They used Sphinx-4 for this research.
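The calendar example boils down to spotting key phrases in whatever the recognizer hears. Here's a toy sketch of that idea in Java. To be clear, this is not how the Gatech system works (I haven't described their implementation here); it's just a case-insensitive substring match over already-recognized text, and the PhraseSpotter class and the SHOW_NEXT_WEEK action name are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Toy phrase spotter: maps key phrases to action names and reports the
// first registered phrase found in an utterance. Hypothetical class,
// not part of Sphinx-4 or the Gatech work.
class PhraseSpotter {
    // LinkedHashMap keeps registration order, so earlier phrases win.
    private final Map<String, String> actions = new LinkedHashMap<>();

    void register(String phrase, String action) {
        actions.put(phrase.toLowerCase(Locale.ROOT), action);
    }

    // Returns the action for the first registered phrase contained in
    // the utterance, or an empty Optional if nothing matches.
    Optional<String> spot(String utterance) {
        String text = utterance.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : actions.entrySet()) {
            if (text.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        PhraseSpotter spotter = new PhraseSpotter();
        spotter.register("next week", "SHOW_NEXT_WEEK");

        // Bob's reply from the example above.
        String reply = "When would you like to meet next week?";
        spotter.spot(reply).ifPresent(action ->
                System.out.println("Calendar action: " + action));
    }
}
```

A real system would work on the recognizer's output as it arrives and would need to cope with recognition errors, which a plain substring match does not.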

Saturday Sep 11, 2004

I was a bit curious as to which Java classes I used most often, so I figured I'd count them. Instead of counting each individual use, I decided that I'd just count the number of times a class was imported. Here's the command line:

$ find . -name "*.java" | xargs cat | grep "^import java" |\
 sort | uniq -c | sort -nr | head -20
I ran this on the 100,000 lines or so of code that make up Sphinx-4 with the following results:
    122 import java.io.IOException;
    108 import java.util.List;
    101 import java.util.Iterator;
     62 import java.util.Map;
     50 import java.util.HashMap;
     50 import java.net.URL;
     48 import java.util.ArrayList;
     42 import java.util.logging.Logger;
     41 import java.io.File;
     37 import java.util.LinkedList;
     33 import java.util.Set;
     28 import java.util.HashSet;
     26 import java.io.InputStream;
     23 import java.util.Collections;
     22 import java.util.logging.Level;
     22 import java.io.BufferedReader;
     21 import java.util.Properties;
     21 import java.io.Serializable;
     20 import java.util.StringTokenizer;
     20 import java.util.Collection;

I was surprised to see IOException on top of the list, but the rest confirmed what I expected: the Collections API dominates. The Collections API is my friend.
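Since the Collections API is my friend, here's a rough Java equivalent of that shell pipeline, just as a sketch. The ImportCounter class and its method names are made up for this post, and it uses the streams API from a current JDK. It applies the same filter as the grep (lines starting with "import java"), but ties may come out in a different order than with sort -nr.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that mimics: grep "^import java" | sort | uniq -c
class ImportCounter {
    // Counts how often each distinct "import java..." line occurs.
    static Map<String, Long> count(Stream<String> lines) {
        return lines
                .filter(line -> line.startsWith("import java"))
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Returns the n most frequent entries, highest count first.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    // Reads all lines of a file, wrapping the checked exception so the
    // method can be used inside a stream pipeline.
    private static Stream<String> linesOf(Path file) {
        try {
            return Files.readAllLines(file).stream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            Stream<String> lines = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .flatMap(ImportCounter::linesOf);
            for (Map.Entry<String, Long> e : top(count(lines), 20)) {
                System.out.printf("%7d %s%n", e.getValue(), e.getKey());
            }
        }
    }
}
```

It's a lot more typing than the one-liner, which is probably the real lesson here.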

Wednesday Sep 08, 2004

World Wide Web Consortium Issues SSML 1.0 as a W3C Recommendation. The Speech Synthesis Markup Language is an XML-based markup language for speech synthesis based upon the Java Speech Markup Language (JSML). Update: There's a good article about SSML at InternetNews.com.

Students Thomas Horf, Roland Roller and Sabrina Wilske have built a robotic barkeep that will make a drink at your spoken request. As the robot makes your drink it will tell you a joke or two.

This robotic barkeep is the product of the Talking Robots with LEGO MindStorms course offered by the Computational Linguistics and Phonetics department at the Universität des Saarlandes. The Talking Robots course is designed to help students explore robotics and computational linguistics by creating robots that can communicate. They use Lego Mindstorms, the leJOS Java virtual machine and the Java Speech API to implement these robots. Other interesting robots are the blackjack dealer and the Logistics Robot.

They also have a Resource Page that serves as a great starting point for anyone thinking of building communicating robots.

Thursday Sep 02, 2004

In some areas of the world, especially areas with low populations and difficult terrain, a whistled form of language has been developed to allow communication when using ordinary language would be difficult. The whistled speech is not a new language, but is instead a literal translation to the new form. In a typical translation:

  • Each phoneme has a whistled equivalent.
  • Vowel aperture is replaced by a set of more or less stable pitch ranges.
  • Consonants are produced by pitch transitions between vowels.
  • Stress is expressed by higher pitch or increased length.
  • Intonation exists, but conflicts with segmental pitch changes.
Here's an interesting anecdote:
"My brother was once hiking around Gomera with a friend. They ran out of drinking water and asked a local person for some. This person said she didn't have any (it was a very dry area!) but her neighbor up the mountain could help. 'I'll let her know you're coming,' she said, and whistled up the mountain. They walked up the mountain. My brother walked ahead and arrived first. When he got to the house, a stranger sitting there said: 'Ah, there you are. The water's right around the corner there; but where is your friend?'"

Read more in this posting on the Linguist List and in this set of papers.

Friday Aug 27, 2004

KTTS is a developing standard for the KDE desktop that will allow KDE apps to provide speech output. There is also a KDE Text-to-Speech API.

KTTS will be shipping soon as part of the KDE-Accessibility package.

KTTS can use FreeTTS as well as other synthesis engines including Festival, Hadifix, Flite and Epos.

Monday Aug 23, 2004

I recently stumbled across this excellent list of VIM Tips. Lots of stuff in there that I didn't know about. For instance, I didn't know that vim has an 'explorer' mode that lets you navigate through your file system from within the editor. So much more to learn ...

This blog copyright 2010 by plamere