
Web API for speech recognition and synthesis

From: Deborah Dahl <dahl@conversational-technologies.com>
Date: Wed, 16 Dec 2009 10:17:01 -0500
To: <Olli@pettay.fi>, "'Bjorn Bringert'" <bringert@google.com>
Cc: "'Dave Burke'" <daveburke@google.com>, 'João Eiras' <joaoe@opera.com>, "'whatwg'" <whatwg@whatwg.org>, "'David Singleton'" <davidsingleton@google.com>, "'Gudmundur Hafsteinsson'" <gummi@google.com>, <westonruter@gmail.com>, <www-multimodal@w3.org>
Message-ID: <00e401ca7e62$d2376160$6801a8c0@chimaera>
(resending to include the whatwg list, sorry for multiple postings)

Hi Olli,

Thank you for bringing this interesting thread to the Multimodal
Interaction Working Group's attention.
The working group is in fact very active. Although it is chartered as 
W3C Member-only, we do have a public mailing list, www-multimodal@w3.org, 
available for public discussions. 

In general, we would be very interested in hearing about the kinds of use 
cases for speech recognition and TTS in a browser context that you would 
like to handle. The Multimodal Architecture is our primary draft spec 
that addresses using speech in web pages (although it also addresses 
other modes of input, such as handwriting). A new Working Draft has just 
been published and we would be very interested 
in getting feedback on it. In my opinion, it's probably focused more on 
distributed architectures than on the use cases you might be interested 
in, but we would like our specs to be comprehensive enough to address 
both server-based and client-based speech processing. 

We would also be interested in general discussions of questions about
multimodality. 

Here are some pointers that may be useful.
MMI page: http://www.w3.org/2002/mmi/
MMI Architecture spec: http://www.w3.org/TR/2009/WD-mmi-arch-20091201/

best regards,

Debbie Dahl, MMI Working Group Chair


> -----Original Message-----
> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi] 
> Sent: Friday, December 11, 2009 4:14 PM
> To: Bjorn Bringert
> Cc: Olli@pettay.fi; Dave Burke; João Eiras; whatwg; David 
> Singleton; Gudmundur Hafsteinsson; westonruter@gmail.com; 
> www-multimodal@w3.org; Deborah Dahl
> Subject: Re: [whatwg] Web API for speech recognition and synthesis
> 
> On 12/11/09 6:05 AM, Bjorn Bringert wrote:
> > Thanks for the discussion - cool to see more interest today also
> > (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)
> >
> > I've hacked up a proof-of-concept JavaScript API for speech
> > recognition and synthesis. It adds a navigator.speech object with
> > these functions:
> >
> > void listen(ListenCallback callback, ListenOptions options);
> > void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
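[Archive editor's note: as a rough usage sketch of the proposed API - not
code from the original thread - the two calls above might be used like
this. The stub `navigator.speech` object and its option fields are
illustrative stand-ins, since the real object comes from the NPAPI
plugin described later in the message.]

```javascript
// Stub standing in for the plugin-provided navigator.speech object so
// this sketch runs anywhere; a real deployment would get it from the
// browser plugin. The fixed utterance and option fields are illustrative.
var navigator = {
  speech: {
    listen: function (callback, options) {
      // Pretend the recognizer heard a fixed utterance.
      callback({ utterance: "order pizza", confidence: 0.9 });
    },
    speak: function (text, callback, options) {
      callback({ spoken: text });
    }
  }
};

var lastSpoken;

// Ask for one recognition result, then read it back with TTS.
navigator.speech.listen(function (result) {
  navigator.speech.speak("You said: " + result.utterance, function (done) {
    lastSpoken = done.spoken;
  }, { lang: "en-US" });
}, { lang: "en-US" });
```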
> 
> 
> So if I read the examples correctly, you're not using grammars
> anywhere. I wonder how well that works in real-world cases. Of course,
> if the speech recognizer can handle everything well without grammars,
> the result validation could be done in JS after the result is obtained
> from the recognizer. But I think having support for grammars simplifies
> coding and can make speech dialogs somewhat more manageable.
> 
> W3C has already standardized things like
> http://www.w3.org/TR/speech-grammar/ and
> http://www.w3.org/TR/semantic-interpretation/
> and the latter one works quite nicely with JS.
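[Archive editor's note: a minimal sketch - not from the original thread -
of the "result validation in JS" fallback Olli mentions. The `COMMANDS`
list and `validateResult` helper are hypothetical; with real SRGS grammar
support, this filtering (and the semantic interpretation) would happen
inside the recognizer instead.]

```javascript
// A tiny stand-in for what a grammar would do inside the recognizer:
// accept a recognition result only if it matches one of the phrases
// the page expects, otherwise reject it.
var COMMANDS = ["call home", "search", "order pizza"];

function validateResult(result) {
  var text = result.utterance.toLowerCase().trim();
  return COMMANDS.indexOf(text) !== -1 ? text : null;
}
```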
> 
> Again, I think this kind of discussion should happen in the W3C
> Multimodal WG. Though I'm not sure how actively or how openly that
> working group works at the moment.
> 
> -Olli
> 
> 
> >
> > The implementation uses an NPAPI plugin for the Android browser that
> > wraps the existing Android speech APIs. The code is available at
> > http://code.google.com/p/speech-api-browser-plugin/
> >
> > There are some simple demo apps in
> > http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
> > including:
> >
> > - English to Spanish speech-to-speech translation
> > - Google search by speaking a query
> > - The obligatory pizza ordering system
> > - A phone number dialer
> >
> > Comments appreciated!
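[Archive editor's note: as a hypothetical illustration - not from the
original thread - the first demo above could be wired together roughly as
a listen → translate → speak chain. The `speech` object, the `EN_TO_ES`
dictionary, and the `translate` helper are mocks; the real demo calls
the Android plugin and a translation service.]

```javascript
// Mock speech object standing in for the plugin-backed API.
var speech = {
  listen: function (cb) { cb({ utterance: "good morning" }); },
  speak: function (text, cb) { cb({ spoken: text }); }
};

// Toy phrase dictionary standing in for a translation service.
var EN_TO_ES = { "good morning": "buenos días", "thank you": "gracias" };

function translate(text) {
  return EN_TO_ES[text] || text; // fall back to the input if unknown
}

// Chain the three stages: recognize, translate, synthesize.
var lastTranslated;
speech.listen(function (result) {
  speech.speak(translate(result.utterance), function (done) {
    lastTranslated = done.spoken;
  });
});
```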
> >
> > /Bjorn
> >
> > On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay<Olli.Pettay@helsinki.fi>  wrote:
> >> Indeed, the API should be something significantly simpler than X+V.
> >> Microsoft has (had?) support for SALT. That API is pretty simple and
> >> provides speech recognition and TTS. The API could probably be even
> >> simpler than SALT. IIRC, there was an extension for Firefox to
> >> support SALT (and there was also an extension to support X+V).
> >>
> >> If the platform/OS provides ASR and TTS, adding a JS API for them
> >> should be pretty simple. X+V tries to handle some logic using the
> >> VoiceXML FIA, but I think it would be more web-like to provide a
> >> pure JS API (similar to SALT). Integrating visual and voice input
> >> could be done in scripts. I'd assume there would be some script
> >> libraries to handle multimodal input integration - especially if
> >> there will be touch and gesture events too, etc. (Classic multimodal
> >> map applications will become possible on the web.)
> >>
> >> But this is all something that should possibly be designed in or
> >> with the W3C Multimodal Working Group. I know their current
> >> architecture is way more complex, but X+V, SALT and even multimodal
> >> CSS have been discussed in that working group.
> >>
> >>
> >> -Olli
> >>
> >>
> >>
> >> On 12/3/09 2:50 AM, Dave Burke wrote:
> >>>
> >>> We're envisaging a simpler programmatic API that looks familiar to
> >>> the modern Web developer but one which avoids the legacy of
> >>> dialog-system languages.
> >>>
> >>> Dave
> >>>
> >>> On Wed, Dec 2, 2009 at 7:25 PM, João Eiras<joaoe@opera.com>  wrote:
> >>>
> >>>     On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert
> >>>     <bringert@google.com>  wrote:
> >>>
> >>>         We've been watching our colleagues build native apps that
> >>>         use speech recognition and speech synthesis, and would like
> >>>         to have JavaScript APIs that let us do the same in web apps.
> >>>         We are thinking about creating a lightweight and
> >>>         implementation-independent API that lets web apps use
> >>>         speech services. Is anyone else interested in that?
> >>>
> >>>         Bjorn Bringert, David Singleton, Gummi Hafsteinsson
> >>>
> >>>
> >>>     This exists already, but only Opera supports it, although
> >>>     there are problems with the library we use for speech
> >>>     recognition.
> >>>
> >>>     http://www.w3.org/TR/xhtml+voice/
> >>>
> >>>     http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/
> >>>
> >>>     Would be nice to revive that specification and get vendor
> >>>     buy-in.
> >>>
> >>>
> >>>
> >>>     --
> >>>
> >>>     João Eiras
> >>>     Core Developer, Opera Software ASA, http://www.opera.com/
> >>>
> >>>
> >>
> >>
> >
> >
> >
> 
> 
Received on Wednesday, 16 December 2009 15:15:54 UTC
