
The audio equivalent of Riya?

Thinking about the Riya photo face recognizer got me thinking about the speech recognition problem again. I am willing (although I am also a little bit nuts) to train the Riya web site to recognize the faces in my pictures, in the hope that it will automate filing each picture by who is in it, which would be a big job if I actually did it by hand. So with speech recognition I would also be willing to train a web site to understand what I say for similar benefits.

The two uses in particular that I was thinking of are podcasting and translation while talking to someone in a foreign country through Skype or a similar VoIP product. If a site could be trained on a voice and then take inbound audio streams or files, two potential markets would be podcast-to-text (so that podcasts could be searched better than they can be today) and translation of the resulting text. The web site would be able to consolidate voice prints and then be useful in other contexts were they to arise later: dictation, research notes, future commerce tools tying my phone number to the translation service so that those bots at American Airlines could better understand me, etc.

Unfortunately it will take a lot of energy to build such a system: a good API (probably a better one than the one that Microsoft offers), a smart team of engineers, and a good team to market and sell the solution even when the technology doesn't work. Maybe if I wait three years all the pieces will come together and it will be worth doing then.

TrackBack

Listed below are links to weblogs that reference The audio equivalent of Riya?:

» Audio Version of Riya from Recognizing Deven
Dan Housman put up an interesting conjecture today about creating an audio version of Riya here. Can you do voice recognition inside of streams? I spent quite a bit of time looking at this problem. It seemed to me [Read More]

Comments

Sun just announced their "sungrid" computing utility, and the first service they offered as a demo was a free text-to-speech conversion utility. The example they gave was creating podcasts.

random stream of consciousness to follow...

I was looking into this area (mostly for searching speech) over the last 3 months or so.

With speech I believe you will see more hybrid systems appearing which combine 1-best transcription (podzinger?) with phonetic indexes (podscope, nexidia, aurix).

The 1-best transcription system takes care of stock words, and there are many large-vocabulary systems out there, but it falls over when you have new terminology, names, etc. (which podcasts are full of). This is why we've seen a rise in phonetic-based index systems such as nexidia and aurix: they are so much better at mining audio data. The problem with phonetic-based indexes is that their performance lags text-based indexes by a large margin.
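To make the hybrid idea concrete, here is a minimal Python sketch, not how podzinger, nexidia or aurix actually work. Everything in it is an assumption made for illustration: the `HybridIndex` class, the `phonetic_key` function (a crude spelling-based stand-in for a real phoneme recognizer), and the use of plain word lists in place of audio. Words in the recognizer's vocabulary go into an exact word index; every word also goes into a phonetic index, which is only consulted when the word index has no hit.

```python
from collections import defaultdict


def phonetic_key(word):
    """Crude sound-alike key (an assumption, not a real phoneme model)."""
    text = "".join(ch for ch in word.lower() if ch.isalpha())
    # Fold a few spellings that usually sound the same.
    for old, new in (("ph", "f"), ("ck", "k"), ("x", "ks"),
                     ("c", "k"), ("z", "s"), ("q", "k")):
        text = text.replace(old, new)
    if not text:
        return ""
    key = text[0]
    for ch in text[1:]:
        if ch in "aeiouy":
            continue  # ignore vowels after the first letter
        if ch != key[-1]:
            key += ch  # collapse repeated consonants
    return key


class HybridIndex:
    """Toy hybrid of a word (1-best) index and a phonetic index."""

    def __init__(self, vocabulary):
        # Words the hypothetical transcriber is able to recognize.
        self.vocabulary = {w.lower() for w in vocabulary}
        self.words = defaultdict(list)
        self.sounds = defaultdict(list)

    def add(self, doc_id, spoken_words):
        for position, word in enumerate(spoken_words):
            w = word.lower()
            if w in self.vocabulary:
                # In-vocabulary words get an exact transcript entry.
                self.words[w].append((doc_id, position))
            # Every word, known or not, gets a phonetic entry.
            self.sounds[phonetic_key(w)].append((doc_id, position))

    def search(self, query):
        q = query.lower()
        if q in self.words:
            return "word", list(self.words[q])
        # Fall back to sound-alike matching for names and new terms.
        return "phonetic", list(self.sounds.get(phonetic_key(q), []))
```

For example, after `index = HybridIndex(["audio", "search"])` and `index.add("ep1", ["Nexidia", "audio", "search"])`, a query for "audio" is answered from the word index, while a query for the out-of-vocabulary name "nexidia" (or a misspelling such as "neksidia") is answered from the phonetic index.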

As I see it the next phase of audio search will involve this:

1. Language detection (excellent results)

2. Accent detection (not sure how well this works).

1 & 2 are done to select the models you use. Accent detection is useful for English, where the variances between US and UK English confuse many systems because most of them are trained on US English.


3. Speaker detection
More metadata for search. I want to know when Britney Spears talked about quantum mechanics. Right now, with most systems it'll just tell you that quantum mechanics was said but not by whom.

4. Topic Extraction
auto-tagging along with 1/2.
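The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `MODELS` registry, its model names, the `Segment` record and the `search` function are all hypothetical, and the detection steps themselves are not implemented; their outputs are simply passed in as values.

```python
from dataclasses import dataclass, field

# Hypothetical registry of recognition models keyed by (language, accent).
MODELS = {
    ("en", "us"): "en-us-acoustic-model",
    ("en", "uk"): "en-uk-acoustic-model",
}


def select_model(language, accent):
    """Steps 1 and 2: use the detected language and accent to pick a model."""
    if (language, accent) in MODELS:
        return MODELS[(language, accent)]
    # No accent-specific model: fall back to any model for the language.
    for (lang, _accent), name in MODELS.items():
        if lang == language:
            return name
    raise LookupError(f"no model for language {language!r}")


@dataclass
class Segment:
    episode: str
    start_seconds: float
    speaker: str                              # step 3: speaker detection
    text: str                                 # transcript of the segment
    topics: set = field(default_factory=set)  # step 4: topic extraction


def search(segments, phrase, speaker=None):
    """Find segments containing the phrase, optionally by one speaker."""
    phrase = phrase.lower()
    return [
        s for s in segments
        if phrase in s.text.lower()
        and (speaker is None or s.speaker == speaker)
    ]
```

With speaker labels stored on each segment, the Britney Spears query becomes `search(segments, "quantum mechanics", speaker="Britney Spears")` instead of a plain phrase match.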

The knowledge is out there, but what you need is the willpower of guys like Munjal, Burak and Azhar to pull a team together to make it happen. Most of the players are chasing military and intelligence dollars (which is true of face recognition also), and that still leaves the consumer space wide open.

All the current players (podzinger, podscope, blinkx?) are not innovating with their search. They seem to be showcasing their existing technology. The directory guys have lots of content but you can only search metadata.

Who will truly liberate the content in audio? For a while I thought it would be me, but alas that’s a story best told over beers :)

http://www.nist.gov/speech/index.htm - gives you a good sense of what’s going on in the world of speech.

Prasanna,

I agree with you that the current players are just showcasing existing technology.

In addition to combining 1-best transcription and phonetic based indexes, it should also be possible to combine face/text recognition with speaker recognition.

I am willing to buy you all the beer to hear the story. Drop me a line at sudbhoja at yahoo dot com if you are interested.


Hi,
I just thought I'll add to the conversation with a mention of the projects that I'm working on, which have got a pretty good response from the developer community (got listed on ajaxian recently!). It's at http://bosky101.blogspot.com/2006/04/my-final-year-project-speech.html

Keep Clicking,
Bhasker

