Re: An early draft of a speech API

On 03/16/2011 09:58 PM, Young, Milan wrote:
> Inline...
>
> -----Original Message-----
> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
> Sent: Tuesday, March 15, 2011 2:11 PM
> To: Young, Milan
> Cc: Robert Brown; public-xg-htmlspeech@w3.org
> Subject: Re: An early draft of a speech API
>
> On 03/15/2011 09:57 PM, Young, Milan wrote:
>> I agree with Robert that the Mozilla proposal doesn't feel very
>> "open".  I'd further suggest that the Google speech proposal has
>> similar properties.
>>
>> In both cases, there is a tight coupling between the browser and
>> speech service that is outside of W3C and IETF turf.  This closed
>> model has all of the usual implications such as:
>> * A cross-product of integrations across UA and SS
> If Nuance has a public web-based speech service and exposes the
> API for it, browsers could use it as a default speech engine when the
> device is online. Or browsers could use some other engine.
>
> [Milan] If all speech vendors could agree on the same web services
> API, then I agree that this would eliminate my concerns.  But if
> there were variants, you'd still have the same fundamental problem.
>
> The Microsoft approach seems to eliminate most of these issues by
> having the application layer communicate directly with the speech
> service.

Well, Microsoft has many approaches.
The first approach doesn't provide a default speech service.
If speech engine vendors provide public speech servers which
anyone can use, the problem is not that bad, although the
API should still support offline web apps (if the device has local engines).
If speech engine vendors don't provide such public services,
use of the API would be limited to those who could run their own
services.

Microsoft's SpeechRecognizer/<reco> are in many ways very close to my
SpeechRequest. All of them support default engines, and also remote
engines (although in my proposal remote engines come only in v2).
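
To make this concrete, here is a rough sketch of the kind of
request-object API I have in mind. All names below are illustrative
only; nothing here is final:

  // Illustrative sketch only; names are not final.
  var request = new SpeechRequest();
  // An optional grammar; with no engine forced, the default engine is used.
  request.grammar = "http://example.com/commands.grxml";
  request.onresult = function(event) {
    // event.result would carry the recognition hypothesis.
    showResult(event.result);
  };
  request.start(); // start capturing audio and recognizing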


> The browser is effectively only responsible for setting up
> the pipe and marshaling data.  The downside of that approach is that
> the application layer needs to do potentially custom integrations
> with each speech service, but that's pretty much a given anyway.

Yeah, each *web app* would need to do custom integrations.
With the default speech engine, it would be the browser implementation
that does the integration. That would keep the API simpler for
web developers.


>
> Perhaps you could let me know what you don't like about that aspect
> of the Microsoft methodology.  Perhaps something to do with the API
> exposed to the application layer?
As we know, MS has several approaches.
Most of the "Basic Extensions to Existing HTML Designs" is possible
even now by using the Capture API or HTML's
http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication
and XHR/WebSockets/Server-Sent Events.
But it doesn't have support for default engines, which would
limit the audience who might use the API.
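
For example, something along these lines would already be close to
workable, assuming a microphone capture API (the missing piece) and a
speech service that accepts audio over WebSockets. The URL and the
captureMicrophone/handleResult functions are made up for illustration:

  // Illustration only: assumes a hypothetical capture API and service.
  var socket = new WebSocket("wss://speech.example.org/reco");
  socket.onmessage = function(event) {
    // The service would send back recognition results, e.g. as JSON.
    handleResult(JSON.parse(event.data));
  };
  // captureMicrophone() stands in for the capture API we don't have yet.
  captureMicrophone(function(audioChunk) {
    socket.send(audioChunk); // stream audio to the remote engine
  });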



>
>> * Future spec enhancements (such as v2) occurring in lock step with
>> these custom integrations.
> I'm not sure I understand this.
>
> [Milan] Observing that the more custom integrations you have (browser
> and speech cross product), the harder it becomes to revise a
> standard.

I'm not sure how you reach this conclusion. I'm sure browser vendors want
to give users the best experience, and that clearly means
that if there are good use cases for remote engines (which I think
there are), those will be supported.



>>
>>
>> Regarding the v1/v2 debate I agree with Olli that it's best to
>> issue a limping prototype to gain feedback rather than internally
>> iterating on a "perfect" solution that nobody wants.  But my
>> assumption from the start of this WG is that we are designing a
>> framework for commercial and academic applications.
> My assumption has been that we're designing a framework or API for
> the Web. And "the Web" includes commercial and academic entities, but
> also individual web developers and users.
>
> [Milan] I didn't phrase that very well.  My point was that the needs
> of commercial and academic users are largely a superset of those of
> individual web developers.

Which is a reason why "v2" would be a superset of "v1".
Again, I'm not against remote engines. I just want browsers to
have a simple API which all web developers can use.
Remote engines could be used already today if microphone
capture were implemented. (One could even today use Flash or add-ons
like https://mozillalabs.com/rainbow/ to get microphone input.)


>
>
>> Thus it would be inappropriate to issue even a v1 that does not
>> give our target audience something to chew on.
>
> Well, v1 would certainly give some of the target audience something
> to try out.
>
> My assumption for v1 has been that it would be used especially by
> those web developers who want to try out all the latest web
> technologies, even if that would limit which browsers they could
> support (since it takes time before an API is implemented by all the
> UAs). Also, some companies, like Google and Facebook, tend
> to try new technologies rather soon, even before the APIs have
> stabilized.
>
> [Milan] It's fine to target the needs of web developers in v1.  But
> as your proposal is written, it seems like commercial and academic
> users are largely ignored.  These groups need access to large
> grammars and such before prototyping becomes meaningful.

Commercial and academic entities could run their own engines and use
existing technologies to prototype such web apps. They could even write
a script library which gives a v1-like API for remote engines.
Other web developers could just use the native browser implementation.
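
Such a shim might look roughly like the following. Everything here is
hypothetical, just to show that a v1-style object can front a remote
engine; audio capture is deliberately left out:

  // Hypothetical shim: a v1-like object backed by a remote engine.
  function RemoteSpeechRequest(serviceUrl) {
    var self = this;
    this.onresult = null;
    this.start = function() {
      var socket = new WebSocket(serviceUrl);
      socket.onmessage = function(event) {
        // Forward results through the same callback a v1 API would use.
        if (self.onresult)
          self.onresult({ result: JSON.parse(event.data) });
      };
      // Audio capture is omitted; it still needs Flash, an add-on,
      // or a future microphone API.
    };
  }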

If some microphone API were implemented in browsers and
there were public speech services, web developers could just use those
services for prototyping. We wouldn't need to design any API. People
would just read data from the microphone, send it to the remote
speech service, and handle it all as best they could.

...but we do need default engines, IMHO, and we should have a good API.
There is also the question of consistency in user experience.
If the system supports some kind of speech input, as many devices
and OSes do, web apps should be able to use it to keep
speech recognition consistent with native apps.



Björn, Sathish, it would be great to hear what you think of all this.


-Olli


>
> - Milan
>
>
>
> -Olli
>
>>
>>
>> - Milan
>>
>> -----Original Message-----
>> From: public-xg-htmlspeech-request@w3.org
>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli Pettay
>> Sent: Monday, March 14, 2011 12:36 PM
>> To: Robert Brown
>> Cc: public-xg-htmlspeech@w3.org
>> Subject: Re: An early draft of a speech API
>>
>> On 03/14/2011 08:43 PM, Robert Brown wrote:
>>> (I want to separate this feedback from the API design feedback,
>>> since they feel like different topics).
>>>
>>> You already know I disagree with your "v2" notion.  I'll try to
>>> describe why I feel so strongly about this.  Here are the
>>> implications I see of a "v1" that defers everything you've marked
>>> as "only in v2":
>>>
>>> 1.	Each browser manufacturer would have its own proprietary
>>> interaction with particular speech engines.  I'm predicting this
>>> would mean: i) Chrome uses Google's cloud service; ii) IE uses
>>> Microsoft's on-device and cloud service; iii) Firefox and Opera
>>> both use... I don't know... whatever's already on the device?... a
>>> special licensing deal they cut with a speech vendor?
>>>
>>> 2.	Many speech innovators who have large and successful customer
>>> bases will be left out in the cold.  Nuance (for example) has a
>>> thriving business and great brand based on the fact that they
>>> have world-class technology that their customers buy because it
>>> suits their application needs better than any other vendor's.
>>> But in the scheme proposed here, Nuance is excluded from
>>> developing HTML apps, and so are their customers.  This damages a
>>> lot of users, not to mention excludes a lot of world-class speech
>>> innovation from the web. How does Nuance get back into this game?
>>> Build their own browser? Sign deals with all the major browser
>>> manufacturers? Whatever the answer, it's not good.  Replace the
>>> word "Nuance" with any other speech vendor, some of whom are also
>>> participating in the XG, and it's the same story.  If that's not
>>> sad enough, imagine a research institution or startup.  What are
>>> they supposed to do?
>>
>> How is co-operating with a browser vendor a bad thing? IBM used to
>> provide speech engines for Opera.
>>
>>>
>>> 3.	Take a look at all the popular speech apps on smart phones
>>> these days.  None of these could be built.  For example: Google &
>>> Microsoft have search apps that deliver great results using huge
>>> proprietary SLMs, too big to download, and with too much secret
>>> sauce to want to make public.  For example: Siri's virtual
>>> assistant (now owned by Apple) is very cool, and is powered by
>>> Nuance SR using specifically modeled SLMs (which I'm pretty sure
>>> have enough IP that they don't want leaving their secure
>>> servers). There are plenty of other examples.  But the point is
>>> that none of today's popular mobile speech apps can be built with
>>> your "v1" feature set.  So what does that leave?
>>
>> This is the reason for v2. As I've said, we could develop v1 and
>> v2 simultaneously, but since v1 would hopefully be simpler, it
>> could be implemented first (somewhat similarly to XMLHttpRequest v1
>> and v2). In this case one reason for the v1 API would be to get
>> feedback from web developers asap. Also, we need to be able to
>> support non-network engines. Otherwise offline web apps couldn't
>> use any ASR/TTS.
>>
>>
>> I could imagine speech vendors also implementing the browser side
>> of the API. That way even v1 could use their engines - maybe just
>> some basic version. And then with network speech engines (v2) they
>> could sell the services to web sites which want to use higher
>> quality engines. If several speech vendors want to provide default
>> engines, they could design some API which browsers could use
>> internally, and then the user or browser could pick the engine
>> which happens to work best for them. Selecting a default speech
>> engine would become much like selecting a default search engine.
>>
>> And still to clarify - I'm not against network speech engines. I
>> just want something to be implemented rather soon to get feedback
>> and experience with the API. And also to give time for the
>> possible protocol design (although it is possible that the protocol
>> will be just something on top of XHR or WebSockets).
>>
>>
>>
>>
>>>
>>> This just doesn't feel like an "open" standard to me.
>>
>>
>> I don't know what is not "open".
>>
>>
>> -Olli
>>
>>
>>
>>>
>>> -----Original Message-----
>>> From: public-xg-htmlspeech-request@w3.org
>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli Pettay
>>> Sent: Monday, February 28, 2011 12:38 PM
>>> To: public-xg-htmlspeech@w3.org
>>> Subject: An early draft of a speech API
>>>
>>> Hi all,
>>>
>>> here is what I had in mind for a speech API. The text still misses
>>> lots of definitions, but I hope it is somewhat clear how it should
>>> work. (Getting Firefox 4 done has taken most of my time.)
>>>
>>> The main difference from Google's API is that this isn't based
>>> on elements, but on request objects.
>>>
>>> For TTS we could probably use something close to what Björn just
>>> proposed.
>>>
>>>
>>>
>>> -Olli
>>>
>>
>>
>>
>
>
>

Received on Wednesday, 16 March 2011 21:06:00 UTC