RE: An early draft of a speech API

Dan B has often said that VoiceXML was designed to make the easy tasks easy and the difficult ones possible.  I agree with Marc that we should adopt this model for our group.

Building upon Marc's email below, perhaps we could publish a set of "macros" that would operate over the protocol Microsoft has suggested.  These macros would be written in ECMAScript and would collapse the verbose process of establishing microphone access, contacting the server, managing events, and so on into something that looks more like the Mozilla API.  The casual web developer could use these macros to quickly speech-enable a page, yet retain a straightforward migration path for handling additional complexity as it arises.  And as we all know, complexity always follows.
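
To make this concrete, here is a rough ECMAScript sketch of what one such macro might look like.  Everything in it is hypothetical: the function names (startRecognition, openMicrophone, connectToSpeechService) and the options are invented for illustration and are not part of any proposal on the table.

  // Hypothetical convenience macro (all names invented for illustration).
  // It hides microphone capture, the connection to the speech service,
  // and the event plumbing behind a single call.
  function startRecognition(options) {
    var service = options.speechService || "default";  // remote URI or built-in
    openMicrophone(function (audioStream) {            // assumed capture helper
      var session = connectToSpeechService(service);   // assumed protocol layer
      session.send(audioStream);
      session.onresult = function (result) {
        options.onresult(result.text);                 // hand back just the text
      };
    });
  }

  // A casual page author would then need only:
  startRecognition({ onresult: function (text) { alert("You said: " + text); } });

The point is simply that the verbose plumbing lives inside the macro, while the page-facing surface stays small.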

A totally different idea is to build upon Olli's suggestion from a few emails back: essentially, a web-service API that speech services would publish, allowing browsers to expose standardized speech functionality to the web layer.  As long as all speech services agreed on the same API, we'd have an open standard.  My suggestion would be to adapt the existing MRCP standard to an HTTP transport.
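
Again purely as an illustration, and assuming nothing about the eventual mapping, a recognition request over such an HTTP transport might be issued from script along these lines.  The endpoint URL, the X-MRCP-Grammar header, and the handleResult and capturedAudio names are all placeholders I have invented; the real header set would fall out of adapting the MRCP spec itself.

  // Hypothetical MRCP-style RECOGNIZE request carried over HTTP.
  // The endpoint and header names are invented for illustration only.
  var xhr = new XMLHttpRequest();
  xhr.open("POST", "http://speech.example.com/recognizer/RECOGNIZE", true);
  xhr.setRequestHeader("Content-Type", "audio/x-wav");
  xhr.setRequestHeader("X-MRCP-Grammar", "http://example.com/grammars/date.grxml");
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
      // The body would carry the recognition result (e.g. EMMA or NLSML).
      handleResult(xhr.responseText);                  // assumed page callback
    }
  };
  xhr.send(capturedAudio);                             // assumed to come from the capture API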

I've done a lot of hand-waving over both of these suggestions.  I will happily flesh them out, but before I do, I think it's best to pause and get a sense of the group's reaction.

Comments please!



-----Original Message-----
From: Marc Schroeder [mailto:marc.schroeder@dfki.de] 
Sent: Thursday, March 17, 2011 2:24 AM
To: Bjorn Bringert
Cc: Young, Milan; Robert Brown; Satish Sampath; Olli@pettay.fi; Olli Pettay; public-xg-htmlspeech@w3.org
Subject: Re: An early draft of a speech API

In fact, in my mind the challenge we are facing is to make using 
remote speech services as easy for the non-expert as using a default 
speech service built into the browser.

I am not willing to follow the argument that the millions of web 
developers are not likely to want to use, or be able to afford, remote 
speech services (given that we, for example, provide an open source TTS 
for free, which could be running on a virtual server that people can 
rent for 8 EUR/month).

Just to make it explicit, instead of writing something like

  <tts text="Hello world"/>

the page would have to include something like

  <tts text="Hello world" tts-server="http://..."/>


That would not be too difficult to use. And regarding remote speech 
recognition services, I don't think we should prejudge possible business 
strategies by commercial speech providers such as Nuance or Loquendo.


A possible consequence of following this kind of logic would be to split 
the spec in two. One part is for the interaction between the web app and 
the browser; the other is for the interaction between the browser and a 
remote speech service. The web-app-to-browser API must rely on the 
browser-to-remote-speech-service API doing its job in order to fulfil 
its contract regarding remote speech services.

As far as I understand the Microsoft proposal, it contains a mechanism 
for interacting with a remote speech service; take that mechanism and 
hide it from the web app, and we're basically there... no?

Does this make any sense?

Best,
Marc



On 17.03.11 09:38, Bjorn Bringert wrote:
>> In my mind, the methodology used for a default or local speech service
>> should be similar to the network case.  We've heard from at least a couple
>> others on this list that feel the same.
>
> I initially argued for a single API, but I don't think that's a good
> idea anymore. They are two different use cases, for two different sets
> of web developers.
>
> - The API for default speech services needs to be a lowest common
> denominator. An easy-to-use API is important, since this is the API
> for the millions of web developers who want to build simple speech
> apps and can't run their own speech services. This API should be
> speech-specific to make it easy to design, implement and use.
>
> - The API for app-selected speech services needs to be flexible enough
> to allow for all kinds of service-specific extra features. This API is
> for sophisticated developers who have the expertise and resources to
> tune language models and run or contract their own speech services.
> For maximum flexibility, this API should consist of general components
> that can be composed in different ways, e.g. a microphone capture API
> and an audio streaming API, and ideally not be speech-specific at all.
>
> A single API would complicate the simple API, while restricting the
> flexibility of the API for app-selected speech services.
>
> Two different APIs means two simpler APIs that can be developed in parallel.
>
> The only argument that I can see for having a single API is that it
> makes it easier to switch between default and app-selected speech
> services. But the point of app-specified speech services is that they
> are for apps that need tight control over the implementation, which
> makes them unlikely to work with a default speech service. Assembly
> language and JavaScript are different, and merging them into a single
> language because you want to write both high-level and low-level code
> seems like a bad idea.
>
> /Bjorn
>
> On Wed, Mar 16, 2011 at 10:35 PM, Young, Milan<Milan.Young@nuance.com>  wrote:
>> In my mind, the methodology used for a default or local speech service should be similar to the network case.  We've heard from at least a couple others on this list that feel the same.
>>
>> Separating these efforts into v1/v2 or across WGs will almost surely result in divergent/competing APIs.  I'd rather invest in the ounce of prevention.
>>
>>
>>
>> -----Original Message-----
>> From: Robert Brown [mailto:Robert.Brown@microsoft.com]
>> Sent: Wednesday, March 16, 2011 3:16 PM
>> To: Satish Sampath; Olli@pettay.fi
>> Cc: Olli Pettay; Young, Milan; public-xg-htmlspeech@w3.org; Bjorn Bringert
>> Subject: RE: An early draft of a speech API
>>
>>>> Since all 3 proposals address the default recognizer case without any external dependencies, I think it would be ideal to finalise a concrete recommendation for that without getting blocked on remote recognizers.
>>
>> Sorry Satish, I disagree.  Unfortunately, we all agree on the least valuable part.  There's no point recommending that.  We should either commit to solving the valuable problems, or defer the work for another year while the microphone and network specs iron out.
>>
>> -----Original Message-----
>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Satish Sampath
>> Sent: Wednesday, March 16, 2011 3:06 PM
>> To: Olli@pettay.fi
>> Cc: Olli Pettay; Young, Milan; Robert Brown; public-xg-htmlspeech@w3.org; Bjorn Bringert
>> Subject: Re: An early draft of a speech API
>>
>> There is good momentum behind the recent WHATWG proposal update for real-time communication at http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication.
>> The previous <device> tag version of this proposal was already being prototyped and implemented by various vendors in the browser space.
>> Notably, Opera released a prototype recently at http://my.opera.com/core/blog/2011/03/14/web-meet-device and Ericsson Labs showed a prototype in webkit at https://labs.ericsson.com/developer-community/blog/beyond-html5-implementing-device-and-stream-management-webkit.
>>
>> The fact that browser vendors are getting involved in this spec proposal should encourage our XG to build upon this spec for the remote recognizer use cases. I think this would be better than the DAP device API, which browser vendors have not picked up. However, this proposal is still a moving target and will likely evolve quickly.
>>
>> Since all 3 proposals address the default recognizer case without any external dependencies, I think it would be ideal to finalise a concrete recommendation for that without getting blocked on remote recognizers. That will allow browser vendors to implement the default recognizers without having to wait for implementations to pick up the DAP or WHATWG proposal for the audio capture part. We should of course work on the remote recognizer proposal in parallel, but I don't see why it should be a reason to gate a proposal for the simpler use case with the default recognizer.
>>
>> --
>> Cheers
>> Satish
>>
>>
>>
>
>
>

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313

Received on Thursday, 17 March 2011 23:20:24 UTC