- From: Bjorn Bringert <bringert@google.com>
- Date: Tue, 15 Dec 2009 20:25:54 +0000
It seems like there is enough interest in speech to start developing
experimental implementations. There appear to be two general directions
that we could take:

- A general microphone API + streaming API + audio tag
  - Pro: Useful for applications other than speech recognition /
    synthesis, e.g. audio chat and sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech
    services, e.g. an AJAX API for Google's speech services. Web app
    developers who don't have their own speech servers could use that.
  - Pro: Consistent recognition / synthesis user experience across user
    agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only network
    services.
  - Con: Varying recognition / synthesis user experience across
    different web apps in a single user agent.
  - Con: Possibly higher overhead, because the audio data needs to pass
    through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer
    sizes, etc. in the microphone API.

- A speech-specific, back-end-neutral API (sketched below)
  - Pro: Simple API, basically just two methods: listen() and speak().
  - Pro: Can use local recognition / synthesis.
  - Pro: Consistent recognition / synthesis user experience across
    different web apps in a single user agent.
  - Con: Varying recognition / synthesis user experience across user
    agents in the same web app.
  - Con: Only works for speech, not general audio.
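To make the second direction concrete, here is a minimal sketch of what
such an API could look like from a web app. Everything in it is
hypothetical: the navigator.speech entry point, the option fields, and
the result object are invented for illustration only, not proposed
names.

    // Hypothetical two-method API. The user agent decides how audio is
    // captured and whether recognition / synthesis runs on the device
    // or on a server.
    navigator.speech.listen({
        grammar: "http://example.com/pizza.grxml",  // app-specific grammar
        language: "en-US"
    }, function (result) {
        // result.utterance: top recognition hypothesis (a string)
        // result.confidence: score in [0, 1]
        document.getElementById("query").value = result.utterance;
    });

    navigator.speech.speak("Welcome to the pizza ordering system.",
                           { language: "en-US" });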
/Bjorn

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw <imcgraw at mit.edu> wrote:
> I'm new to this list, but as a speech scientist and web developer, I
> wanted to add my 2 cents. Personally, I believe the future of speech
> recognition is in the cloud.
>
> Here are two services that provide JavaScript APIs for speech
> recognition (and TTS) today:
>
> http://wami.csail.mit.edu/
> http://www.research.att.com/projects/SpeechMashup/index.html
>
> Both of these are research systems, and as such they are really just
> proofs of concept. That said, Wami's JSONP-like implementation allows
> Quizlet.com to use speech recognition today on a relatively large
> scale, with just a few lines of JavaScript code:
>
> http://quizlet.com/voicetest/415/?scatter
>
> Since there are a lot of Google folks on this list, I recommend you
> talk to Alex Gruenstein (in your speech group), who was one of the
> lead developers of WAMI while at MIT.
>
> The major limitation we found when building the system was that we
> had to develop a new audio controller for every client (Java for the
> desktop, custom browsers for iPhone and Android). It would have been
> much simpler if browsers came with standard microphone capture and
> audio streaming capabilities.
>
> -Ian
>
> On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter <westonruter at gmail.com>
> wrote:
>>
>> I blogged yesterday about this topic (including a text-to-speech demo
>> using HTML5 Audio and Google Translate's TTS service); the more
>> relevant part for this thread:
>>
>>> I am really excited at the prospect of text-to-speech being made
>>> available on the Web! It's just too bad that fetching MP3s from a
>>> remote web service is currently the only standard way of doing so;
>>> modern operating systems all have TTS capabilities, so it's a shame
>>> that web apps can't utilize them via client-side scripting. I posted
>>> to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API
>>> for JavaScript, and I was directed to a recent thread about a Web
>>> API for speech recognition and synthesis.
>>>
>>> Perhaps there is some momentum building here? Having TTS available
>>> in the browser would boost accessibility for the seeing-impaired and
>>> improve usability for people on the go. TTS is just another
>>> technology that has traditionally been relegated to desktop
>>> applications, but as the open Web advances as the preferred platform
>>> for application development, it is an essential service to make
>>> available (as with the Geolocation API, Device API, etc.). And
>>> besides, I want to build TTS applications, and my motto is: "If it
>>> can't be done on the open web, it's not worth doing at all"!
>>
>> http://weston.ruter.net/projects/google-tts/
>>
>> Weston
>>
>> On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com>
>> wrote:
>>>
>>> I was just alerted about this thread from my post "Text-To-Speech
>>> (TTS) Web API for JavaScript" at
>>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
>>> Amazing how shared ideas like these seem to arise independently at
>>> the same time.
>>>
>>> I have a use case and an additional requirement: that the time
>>> indices be made available for when each word is spoken in the
>>> TTS-generated audio:
>>>
>>>> I've been working on a web app which reads text in a web page,
>>>> highlighting each word as it is read. For this to be possible, a
>>>> Text-To-Speech API is needed which is able to:
>>>> (1) generate the speech audio from some text, and
>>>> (2) include the time indices for when each of the words in the
>>>> text is spoken.
>>>
>>> I foresee that a TTS API should integrate closely with the HTML5
>>> Audio API. For example, invoking a call to the API could return a
>>> "TTS" object which has an instance of Audio, whose interface could
>>> be used to navigate through the TTS output. For example:
>>>
>>> var tts = new TextToSpeech("Hello, World!");
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
>>>     //                 {startTime:500, endTime:1000, text:"World"}]
>>> }, false);
>>> tts.read(); // invokes tts.audio.play()
>>>
>>> What would be even cooler is if the parameter passed to the
>>> TextToSpeech constructor could be an Element or TextNode, and the
>>> indices would then include a DOM Range in addition to the "text"
>>> property. A flag could also be set which would cause each of these
>>> DOM Ranges to be selected as it is read. For example:
>>>
>>> var tts = new TextToSpeech(document.querySelector("article"));
>>> tts.selectRangesOnRead = true;
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     /*
>>>     tts.indices == [
>>>         {startTime:0, endTime:500, text:"Hello", range:Range},
>>>         {startTime:500, endTime:1000, text:"World", range:Range}
>>>     ]
>>>     */
>>> }, false);
>>> tts.read();
>>>
>>> In addition to the events fired by the Audio API, more events could
>>> be fired when reading TTS, such as a "readrange" event whose event
>>> object would include the index (startTime, endTime, text, range) for
>>> the range currently being spoken. Such functionality would make the
>>> ability to "read along" with the text trivial (see the sketch after
>>> this message).
>>>
>>> What do you think?
>>>
>>> Weston
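A sketch of how the proposed "readrange" event could drive the
read-along behavior described above. The TextToSpeech interface and the
readrange event are the hypothetical ones from the message above; only
window.getSelection() and the Range-based selection calls are real APIs
here.

    var tts = new TextToSpeech(document.querySelector("article"));
    tts.addEventListener("readrange", function (e) {
        // e.index is the {startTime, endTime, text, range} entry for
        // the words currently being spoken.
        var selection = window.getSelection();
        selection.removeAllRanges();
        selection.addRange(e.index.range);  // "highlight" by selecting
    }, false);
    tts.read();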
>>> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com>
>>> wrote:
>>>>
>>>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc>
>>>> wrote:
>>>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert
>>>> > <bringert at google.com> wrote:
>>>> >> I agree that being able to capture and upload audio to a server
>>>> >> would be useful for a lot of applications, and it could be used
>>>> >> to do speech recognition. However, for a web app developer who
>>>> >> just wants to develop an application that uses speech input
>>>> >> and/or output, it doesn't seem very convenient, since it requires
>>>> >> server-side infrastructure that is very costly to develop and
>>>> >> run. A speech-specific API in the browser gives browser
>>>> >> implementors the option to use on-device speech services provided
>>>> >> by the OS, or server-side speech synthesis/recognition.
>>>> >
>>>> > Again, it would help a lot if you could provide use cases and
>>>> > requirements. This helps both with designing an API and with
>>>> > evaluating whether the use cases are common enough that a
>>>> > dedicated API is the best solution.
>>>> >
>>>> > / Jonas
>>>>
>>>> I'm mostly thinking about speech web apps for mobile devices. I
>>>> think that's where speech makes the most sense as an input and
>>>> output method, because of the poor keyboards, small screens, and
>>>> frequent hands/eyes-busy situations (e.g. while driving).
>>>> Accessibility is the other big reason for using speech.
>>>>
>>>> Some ideas for use cases:
>>>>
>>>> - Search by speaking a query
>>>> - Speech-to-speech translation
>>>> - Voice dialing (could open a tel: URI to actually make the call;
>>>>   a sketch follows below)
>>>> - Dialog systems (e.g. the canonical pizza-ordering system)
>>>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>>>   Chrome extensions) for using speech with any web site, e.g. for
>>>>   accessibility.
>>>>
>>>> Requirements:
>>>>
>>>> - Web app developer side:
>>>>   - Allows both speech recognition and synthesis.
>>>>   - Easy-to-use API. Makes simple things easy and advanced things
>>>>     possible.
>>>>   - Doesn't require the web app developer to develop / run his own
>>>>     speech recognition / synthesis servers.
>>>>   - (Natural) language-neutral API.
>>>>   - Allows developer-defined, application-specific grammars /
>>>>     language models.
>>>>   - Allows multilingual applications.
>>>>   - Allows easy localization of speech apps.
>>>>
>>>> - Implementor side:
>>>>   - Easy enough to implement that it can get wide adoption in
>>>>     browsers.
>>>>   - Allows the implementor to use either client-side or server-side
>>>>     recognition and synthesis.
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
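To make the voice-dialing use case above concrete, here is a sketch
using the same hypothetical listen() API as the first sketch in this
message. The grammar URI and the lookupContact helper are invented
stand-ins for an app-defined contact-name grammar and a name-to-number
lookup.

    navigator.speech.listen({
        // Hypothetical app-defined grammar listing contact names.
        grammar: "http://example.com/contacts.grxml",
        language: "en-US"
    }, function (result) {
        var number = lookupContact(result.utterance);  // hypothetical helper
        if (number) {
            // Opening a tel: URI hands the number to the device's dialer.
            window.location.href = "tel:" + number;
        }
    });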
--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Tuesday, 15 December 2009 12:25:54 UTC