RE: An early draft of a speech API from Robert Brown on 2011-03-17 (public-xg-htmlspeech@w3.org from March 2011)

From: Robert Brown <Robert.Brown@microsoft.com>
Date: Thu, 17 Mar 2011 17:49:37 +0000
To: Bjorn Bringert <bringert@google.com>, "Young, Milan" <Milan.Young@nuance.com>
CC: Satish Sampath <satish@google.com>, "Olli@pettay.fi" <Olli@pettay.fi>, Olli Pettay <Olli.Pettay@helsinki.fi>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <113BCF28740AF44989BE7D3F84AE18DD1987846E@TK5EX14MBXC118.redmond.corp.microsoft.>
>> A single API would complicate the simple API, while restricting the flexibility of the API for app-selected speech services.

I'm afraid I disagree (I know, I'm doing that a lot this week).

A "simple" API will turn out to be "simplistic" and generally useless, with all the functional apps having to rely on some other mechanism. 

A more sound approach is to design the API that solves the valuable problems, then make sure there's a fast-path through that API for novices.

>> this is the API for the millions of web developers who want to build simple speech apps and can't run their own speech services

They don't have to run their own speech services.  Plenty of speech vendors will provide services.

>> The API for app-selected speech services needs to be flexible enough to allow for all kinds of service-specific extra features

Let's not overblow this.  IIRC the requirement was to be able to get extra events, and pass in extra parameters.  Not exactly "high level" versus "low level" code.

>> Assembly language and Javascript are different

So are bananas and asteroids. :) I don't think that's the right comparison.  The <style> tag and the "style" attribute are different, but both part of CSS.  The attribute is more accessible to novices, but isn't an island.

>> general components that can be composed in different ways, e.g. a microphone capture API and an audio streaming API, and ideally not be speech-specific at all.

This is a good approach, and very soluble for the "simple" case.  If you want a simple set of defaults, you can put them into the API, but don't remove the option.  Perhaps if we all agreed on a streaming mechanism we could remove the streaming API (MRCP over WebSockets? HTTP via XHR?).

-----Original Message-----
From: Bjorn Bringert [mailto:bringert@google.com] 
Sent: Thursday, March 17, 2011 1:39 AM
To: Young, Milan
Cc: Robert Brown; Satish Sampath; Olli@pettay.fi; Olli Pettay; public-xg-htmlspeech@w3.org
Subject: Re: An early draft of a speech API

> In my mind, the methodology used for a default or local speech service 
> should be similar to the network case.  We've heard from at least a 
> couple others on this list that feel the same.

I initially argued for a single API, but I don't think that's a good idea anymore. They are two different use cases, for two different sets of web developers.

- The API for default speech services needs to be a lowest common denominator. An easy-to-use API is important, since this is the API for the millions of web developers who want to build simple speech apps and can't run their own speech services. This API should be speech-specific to make it easy to design, implement and use.

- The API for app-selected speech services needs to be flexible enough to allow for all kinds of service-specific extra features. This API is for sophisticated developers who have the expertise and resources to tune language models and run or contract their own speech services.
For maximum flexibility, this API should consist of general components that can be composed in different ways, e.g. a microphone capture API and an audio streaming API, and ideally not be speech-specific at all.

A single API would complicate the simple API, while restricting the flexibility of the API for app-selected speech services.

Two different APIs means two simpler APIs that can be developed in parallel.

The only argument that I can see for having a single API is that it makes it easier to switch between default and app-selected speech services. But the point of app-specified speech services is that they are for apps that need tight control over the implementation, which makes them unlikely to work with a default speech service. Assembly language and Javascript are different, and merging them into a single language because you want to write both high-level and low-level code seems like a bad idea.

/Bjorn

On Wed, Mar 16, 2011 at 10:35 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> In my mind, the methodology used for a default or local speech service should be similar to the network case.  We've heard from at least a couple others on this list that feel the same.
>
> Separating these efforts into v1/v2 or across WGs will almost surely result in divergent/competing APIs.  I'd rather invest in the ounce of prevention.
>
>
>
> -----Original Message-----
> From: Robert Brown [mailto:Robert.Brown@microsoft.com]
> Sent: Wednesday, March 16, 2011 3:16 PM
> To: Satish Sampath; Olli@pettay.fi
> Cc: Olli Pettay; Young, Milan; public-xg-htmlspeech@w3.org; Bjorn 
> Bringert
> Subject: RE: An early draft of a speech API
>
>>> Since all 3 proposals address the default recognizer case without any external dependencies, I think it would be ideal to finalise a concrete recommendation for that without getting blocked on remote recognizers.
>
> Sorry Satish, I disagree.  Unfortunately, we all agree on the least valuable part.  There's no point recommending that.  We should either commit to solving the valuable problems, or defer the work for another year while the microphone and network specs iron out.
>
> -----Original Message-----
> From: public-xg-htmlspeech-request@w3.org 
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Satish 
> Sampath
> Sent: Wednesday, March 16, 2011 3:06 PM
> To: Olli@pettay.fi
> Cc: Olli Pettay; Young, Milan; Robert Brown; 
> public-xg-htmlspeech@w3.org; Bjorn Bringert
> Subject: Re: An early draft of a speech API
>
> There is a good momentum behind the recent WHATWG proposal update for real time communication at http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication.

> The previous <device> tag version of this proposal was already being prototyped and implemented by various vendors in the browser space.
> Notably, Opera released a prototype recently at http://my.opera.com/core/blog/2011/03/14/web-meet-device and Ericsson Labs showed a prototype in webkit at https://labs.ericsson.com/developer-community/blog/beyond-html5-implementing-device-and-stream-management-webkit.

>
> The fact that browser vendors are getting involved in this spec proposal should encourage our XG to build upon this spec for the remote recognizer use cases. I think this would be better than the DAP device API which browser vendors have not picked up. However this proposal is still a moving target and will likely evolve quickly.
>
> Since all 3 proposals address the default recognizer case without any external dependencies, I think it would be ideal to finalise a concrete recommendation for that without getting blocked on remote recognizers. That will allow browser vendors to implement the default recognizers without having to wait for implementations to pick up the DAP or WHATWG proposal for the audio capture part. We should of course work on the remote recognizer proposal in parallel, but I don't see why it should be a reason to gate a proposal for the simpler use case with the default recognizer.
>
> --
> Cheers
> Satish
>
>
>



--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Received on Thursday, 17 March 2011 17:50:11 UTC