[whatwg] Web API for speech recognition and synthesis

It seems like there is enough interest in speech to start developing
experimental implementations. There appear to be two general
directions that we could take (both sketched below):

- A general microphone API + streaming API + audio tag
  - Pro: Useful for non-speech recognition / synthesis applications,
         e.g. audio chat, sound recording.
  - Pro: Allows JavaScript libraries for third-party network speech
         services, e.g. an AJAX API for Google's speech services. Web
         app developers that don't have their own speech servers could
         use that.
  - Pro: Consistent recognition / synthesis user experience across
         user agents in the same web app.
  - Con: No support for on-device recognition / synthesis, only
         network services.
  - Con: Varying recognition / synthesis user experience across
         different web apps in a single user agent.
  - Con: Possibly higher overhead because the audio data needs to
         pass through JavaScript.
  - Con: Requires dealing with audio encodings, endpointing, buffer
         sizes, etc. in the microphone API.

- A speech-specific, back-end-neutral API
  - Pro: Simple API, basically just two methods: listen() and speak().
  - Pro: Can use local recognition / synthesis.
  - Pro: Consistent recognition / synthesis user experience across
         different web apps in a single user agent.
  - Con: Varying recognition / synthesis user experience across user
         agents in the same web app.
  - Con: Only works for speech, not general audio.
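
For illustration, here is roughly what the two directions might look
like from a web app. Every name below (navigator.microphone,
navigator.speech, the callback shapes) is invented for the sake of the
sketch, not a concrete proposal:

  // Direction 1: generic capture; the web app streams audio to its
  // own (or a third-party) network speech service and handles the
  // reply itself.
  var mic = navigator.microphone.record({encoding: "audio/x-wav"});
  mic.ondata = function(chunk) {
    // e.g. POST each chunk to a speech server via XMLHttpRequest
  };

  // Direction 2: speech-specific; the user agent picks the engine
  // (on-device or a network service) and the app only deals in text.
  navigator.speech.speak("You have three new messages.");
  navigator.speech.listen(function(result) {
    console.log(result.transcript); // recognized text
  });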

/Bjorn

On Sun, Dec 13, 2009 at 6:46 PM, Ian McGraw <imcgraw at mit.edu> wrote:
> I'm new to this list, but as a speech scientist and web developer, I wanted
> to add my 2 cents. Personally, I believe the future of speech recognition
> is in the cloud.
> Here are two services which provide JavaScript APIs for speech recognition
> (and TTS) today:
> http://wami.csail.mit.edu/
> http://www.research.att.com/projects/SpeechMashup/index.html
> Both of these are research systems, and as such they are really just
> proofs of concept.
> That said, Wami's JSONP-like implementation allows Quizlet.com to use speech
> recognition today on a relatively large scale, with just a few lines of
> JavaScript code:
> http://quizlet.com/voicetest/415/?scatter
> Since there are a lot of Google folks on this list, I recommend you talk to
> Alex Gruenstein (in your speech group) who was one of the lead developers of
> WAMI while at MIT.
> The major limitation we found when building the system was that we had to
> develop a new audio controller for every client (Java for the desktop,
> custom browsers for iPhone and Android). It would have been much simpler if
> browsers came with standard microphone capture and audio streaming
> capabilities.
> -Ian
>
>
> On Sun, Dec 13, 2009 at 12:07 PM, Weston Ruter <westonruter at gmail.com>
> wrote:
>>
>> I blogged yesterday about this topic (including a text-to-speech demo
>> using HTML5 Audio and Google Translate's TTS service); the more relevant
>> part for this thread:
>>
>>> I am really excited at the prospect of text-to-speech being made
>>> available on the Web! It's just too bad that fetching MP3s from a
>>> remote web service is currently the only standard way of doing so;
>>> modern operating systems all have TTS capabilities, so it's a shame
>>> that web apps can't utilize them via client-side scripting. I posted
>>> to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API
>>> for JavaScript, and I was directed to a recent thread about a Web API
>>> for speech recognition and synthesis.
>>>
>>> Perhaps there is some momentum building here? Having TTS available in
>>> the browser would boost accessibility for the seeing-impaired and
>>> improve usability for people on-the-go. TTS is just another technology
>>> that has traditionally been relegated to desktop applications, but as
>>> the open Web advances as the preferred platform for application
>>> development, it is an essential service to make available (as with the
>>> Geolocation API, Device API, etc.). And besides, I want to build TTS
>>> applications, and my motto is: "If it can't be done on the open web,
>>> it's not worth doing at all"!
>>
>> http://weston.ruter.net/projects/google-tts/
>>
>> Weston
>>
>> On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com>
>> wrote:
>>>
>>> I was just alerted to this thread because of my post "Text-To-Speech
>>> (TTS) Web API for JavaScript" at
>>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
>>> Amazing how shared ideas like these seem to arise independently at the same
>>> time.
>>>
>>> I have a use case and an additional requirement: that the time indices
>>> be made available for when each word is spoken in the TTS-generated
>>> audio:
>>>
>>>> I've been working on a web app which reads text in a web page,
>>>> highlighting each word as it is read. For this to be possible, a
>>>> Text-To-Speech API is needed which is able to:
>>>> (1) generate the speech audio from some text, and
>>>> (2) include the time indices for when each of the words in the text is
>>>> spoken.
>>>
>>> I foresee that a TTS API should integrate closely with the HTML5 Audio
>>> API. For example, invoking the API could return a "TTS" object which
>>> has an instance of Audio, whose interface could be used to navigate
>>> through the TTS output:
>>>
>>> var tts = new TextToSpeech("Hello, World!");
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     // tts.indices == [{startTime:0, endTime:500, text:"Hello"},
>>>     //                 {startTime:500, endTime:1000, text:"World"}]
>>> }, false);
>>> tts.read(); // invokes tts.audio.play
>>>
>>> What would be even cooler is if the parameter passed to the TextToSpeech
>>> constructor could be an Element or TextNode, and the indices would then
>>> include a DOM Range in addition to the "text" property. A flag could
>>> also be set so that each of these DOM ranges gets selected as it is
>>> read. For example:
>>>
>>> var tts = new TextToSpeech(document.querySelector("article"));
>>> tts.selectRangesOnRead = true;
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     /*
>>>     tts.indices == [
>>>         {startTime:0, endTime:500, text:"Hello", range:Range},
>>>         {startTime:500, endTime:1000, text:"World", range:Range}
>>>     ]
>>>     */
>>> }, false);
>>> tts.read();
>>>
>>> In addition to the events fired by the Audio API, more events could be
>>> fired when reading TTS, such as a "readrange" event whose event object would
>>> include the index (startTime, endTime, text, range) for the range currently
>>> being spoken. Such functionality would make the ability to "read along" with
>>> the text trivial.
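>>>
>>> To make that concrete, handling such a "readrange" event might look
>>> something like the following; the event name and the shape of e.index
>>> are of course just a sketch on top of the hypothetical TextToSpeech
>>> object above:
>>>
>>> var tts = new TextToSpeech(document.querySelector("article"));
>>> tts.addEventListener("readrange", function(e){
>>>     // e.index would carry {startTime, endTime, text, range} for the
>>>     // range currently being spoken
>>>     var sel = window.getSelection();
>>>     sel.removeAllRanges();
>>>     sel.addRange(e.index.range); // highlight to "read along"
>>> }, false);
>>> tts.read();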
>>>
>>> What do you think?
>>> Weston
>>>
>>> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com>
>>> wrote:
>>>>
>>>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com>
>>>> > wrote:
>>>> >> I agree that being able to capture and upload audio to a server
>>>> >> would be useful for a lot of applications, and it could be used
>>>> >> to do speech recognition. However, for a web app developer who
>>>> >> just wants to develop an application that uses speech input
>>>> >> and/or output, it doesn't seem very convenient, since it requires
>>>> >> server-side infrastructure that is very costly to develop and
>>>> >> run. A speech-specific API in the browser gives browser
>>>> >> implementors the option to use on-device speech services provided
>>>> >> by the OS, or server-side speech synthesis/recognition.
>>>> >
>>>> > Again, it would help a lot if you could provide use cases and
>>>> > requirements. This helps both with designing an API, as well as
>>>> > evaluating if the use cases are common enough that a dedicated API is
>>>> > the best solution.
>>>> >
>>>> > / Jonas
>>>>
>>>> I'm mostly thinking about speech web apps for mobile devices. I think
>>>> that's where speech makes most sense as an input and output method,
>>>> because of the poor keyboards, small screens, and frequent hands/eyes
>>>> busy situations (e.g. while driving). Accessibility is the other big
>>>> reason for using speech.
>>>>
>>>> Some ideas for use cases:
>>>>
>>>> - Search by speaking a query
>>>> - Speech-to-speech translation
>>>> - Voice Dialing (could open a tel: URI to actually make the call;
>>>> see the sketch after this list)
>>>> - Dialog systems (e.g. the canonical pizza ordering system)
>>>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>>> Chrome extensions) for using speech with any web site, e.g., for
>>>> accessibility.
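>>>>
>>>> For instance, voice dialing could be a couple of lines, assuming
>>>> some hypothetical listen() method that hands back a transcript (the
>>>> names here are invented for illustration):
>>>>
>>>> navigator.speech.listen(function(result) {
>>>>   // open a tel: URI with the recognized number to place the call
>>>>   window.location.href = "tel:" + result.transcript;
>>>> });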
>>>>
>>>> Requirements:
>>>>
>>>> - Web app developer side:
>>>>   - Allows both speech recognition and synthesis.
>>>>   - Easy to use API. Makes simple things easy and advanced things
>>>> possible.
>>>>   - Doesn't require the web app developer to develop / run his own
>>>> speech recognition / synthesis servers.
>>>>   - (Natural) language-neutral API.
>>>>   - Allows developer-defined, application-specific grammars /
>>>> language models (see the sketch after this list).
>>>>   - Allows multilingual applications.
>>>>   - Allows easy localization of speech apps.
>>>>
>>>> - Implementor side:
>>>>   - Easy enough to implement that it can get wide adoption in
>>>> browsers.
>>>>   - Allows the implementor to use either client-side or server-side
>>>> recognition and synthesis.
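>>>>
>>>> To make the grammar requirement concrete, listen() might accept an
>>>> application-defined grammar and a language tag along these lines;
>>>> the object, method, and parameter names are invented for
>>>> illustration, and JSGF is just one possible grammar format:
>>>>
>>>> navigator.speech.listen({
>>>>   lang: "en-US", // per-request language for multilingual apps
>>>>   grammar: "#JSGF V1.0; grammar sizes; " +
>>>>            "public <size> = small | medium | large;",
>>>>   onresult: function(result) {
>>>>     console.log(result.transcript); // best hypothesis
>>>>   }
>>>> });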
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
>>>
>>
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
