RE: An early draft of a speech API

From: Young, Milan <Milan.Young@nuance.com>
Date: Wed, 16 Mar 2011 12:58:43 -0700
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0A8636B6@SUN-EXCH01.nuance.com>
To: <Olli@pettay.fi>
Cc: "Robert Brown" <Robert.Brown@microsoft.com>, <public-xg-htmlspeech@w3.org>
Inline...

-----Original Message-----
From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi] 
Sent: Tuesday, March 15, 2011 2:11 PM
To: Young, Milan
Cc: Robert Brown; public-xg-htmlspeech@w3.org
Subject: Re: An early draft of a speech API

On 03/15/2011 09:57 PM, Young, Milan wrote:
> I agree with Robert that the Mozilla proposal doesn't feel very
> "open".  I'd further suggest that the Google speech proposal has
> similar properties.
>
> In both cases, there is a tight coupling between the browser and
> speech service that is outside of W3C and IETF turf.  This closed
> model has all of the usual implications such as:
> * A cross-product of integrations across UA and SS
If Nuance has a public web-based speech service and exposes
the API for it, browsers could use it as a default speech engine
when the device is online.
Or browsers could use some other engine.

[Milan] If all speech vendors could agree on the same web services API, then I agree that this would eliminate my concerns.  But if there were variants, you'd still have the same fundamental problem.

The Microsoft approach seems to eliminate most of these issues by having the application layer communicate directly with the speech service.  The browser is effectively only responsible for setting up the pipe and marshaling data.  The downside of that approach is that the application layer needs to do potentially custom integrations with each speech service, but that's pretty much a given anyway.

Perhaps you could let me know what you don't like about that aspect of the Microsoft methodology.  Is it something to do with the API exposed to the application layer?



> * Future spec enhancements (such as v2)
> occurring in lock step with these custom integrations.
I'm not sure I understand this.

[Milan] I was observing that the more custom integrations you have (the cross product of browsers and speech services), the harder it becomes to revise a standard.


>
>
> Regarding the v1/v2 debate I agree with Olli that it's best to issue
> a limping prototype to gain feedback rather than iterating
> internally on a "perfect" solution that nobody wants.  But my
> assumption from the start of this WG is that we are designing a
> framework for commercial and academic applications.
My assumption has been that we're designing a framework or API
for the Web. And "the Web" includes commercial and academic entities,
but also individual web developers and users.

[Milan] I didn't phrase that very well.  My point was that the needs of commercial and academic users are largely a superset of those of individual web developers.


>  Thus it would be
> inappropriate to issue even a v1 that does not give our target
> audience something to chew on.

Well, v1 would certainly give some of the target audience something
to try out.

My assumption for v1 has been that it would be used especially by
those web developers who want to try out all the latest
web technologies, even if that would limit which
browsers they could support (since it takes time before an
API is implemented by all the UAs).
Similarly, some companies, like Google and Facebook,
tend to try new technologies early, even before the APIs
have stabilized.

[Milan] It's fine to target the needs of web developers in v1.  But as your proposal is written, it seems like commercial and academic users are largely ignored.  These groups need access to large grammars and such before prototyping becomes meaningful.

- Milan



-Olli

>
>
> - Milan
>
> -----Original Message----- From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli
> Pettay Sent: Monday, March 14, 2011 12:36 PM To: Robert Brown Cc:
> public-xg-htmlspeech@w3.org Subject: Re: An early draft of a speech
> API
>
> On 03/14/2011 08:43 PM, Robert Brown wrote:
>> (I want to separate this feedback from the API design feedback,
>> since they feel like different topics).
>>
>> You already know I disagree with your "v2" notion.  I'll try to
>> describe why I feel so strongly about this.  Here are the
>> implications I see to a "v1" of everything you've marked as "only
>> in v2":
>>
>> 1.	Each browser manufacturer would have its own proprietary
>> interaction with particular speech engines.  I'm predicting this
>> would mean: i) Chrome uses Google's cloud service; ii) IE uses
>> Microsoft's on-device and cloud service; iii) Firefox and Opera
>> both use... I don't know... whatever's already on the device?... a
>> special licensing deal they cut with a speech vendor?
>>
>> 2.	Many speech innovators who have large and successful customer
>> bases will be left out in the cold.  Nuance (for example) has a
>> thriving business and great brand based on the fact that they have
>> world-class technology that their customers buy because it suits
>> their application needs better than any other vendor's.  But in
>> the scheme proposed here, Nuance is excluded from developing HTML
>> apps, and so are their customers.  This damages a lot of users, not
>> to mention excludes a lot of world class speech innovation from the
>> web. How does Nuance get back into this game?  Build their own
>> browser? Sign deals with all the major browser manufacturers?
>> Whatever the answer, it's not good.  Replace the word "Nuance" with
>> any other speech vendor, some of whom are also participating in the
>> XG, and it's the same story.  If that's not sad enough, imagine a
>> research institution or startup.  What are they supposed to do?
>
> How is co-operating with a browser vendor a bad thing? IBM used to
> provide speech engines for Opera.
>
>>
>> 3.	Take a look at all the popular speech apps on smart phones
>> these days.  None of these could be built.  For example: Google &
>> Microsoft have search apps that deliver great results using huge
>> proprietary SLMs, too big to download, and with too much secret
>> sauce to want to make public.  For example: Siri's virtual
>> assistant (now owned by Apple) is very cool, and is powered by
>> Nuance SR using specifically modeled SLMs (which I'm pretty sure
>> have enough IP that they don't want leaving their secure servers).
>> There are plenty of other examples.  But the point is that none of
>> today's popular mobile speech apps can be built with your "v1"
>> feature set.  So what does that leave?
>
> This is the reason for v2. As I've said, we could develop v1 and v2
> simultaneously, but since v1 would be hopefully simpler, it could be
> implemented first (somewhat similarly to XMLHttpRequest v1 and v2).
> In this case one reason for v1 API would be to get feedback from web
> developers asap. Also, we need to be able to support non-network
> engines. Otherwise offline webapps couldn't use any ASR/TTS.
>
>
> I could imagine speech vendors also implementing the browser side
> of the API. That way even v1 could use their engines - maybe just
> some basic version. And then with network speech engines (v2) they
> could sell the services to web sites which want to use higher quality
> engines. If several speech vendors want to provide default engines,
> they could design some API which browsers could internally use and
> then the user or browser could pick the engine which happens to work
> best for them. Default speech engine selection would become similar to
> selecting a default search engine.
>
> And still to clarify - I'm not against network speech engines. I just
> want something to be implemented rather soon to get feedback and
> experience with the API. And also to give time for the possible
> protocol design (although it is possible that the protocol will be
> just something on top of XHR or WebSockets).
>
>
>
>
>>
>> This just doesn't feel like an "open" standard to me.
>
>
> I don't know what is not "open".
>
>
> -Olli
>
>
>
>>
>> -----Original Message----- From:
>> public-xg-htmlspeech-request@w3.org
>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli
>> Pettay Sent: Monday, February 28, 2011 12:38 PM To:
>> public-xg-htmlspeech@w3.org Subject: An early draft of a speech
>> API
>>
>> Hi all,
>>
>> here is what I had in mind for speech API. The text is still missing
>> lots of definitions, but I hope it is still somewhat clear how it
>> should work. (Getting Firefox 4 done has taken most of my time.)
>>
>> The main difference from Google's API is that this isn't based
>> on elements, but on request objects.
>>
>> For TTS we could probably use something close to what Björn just
>> proposed.
>>
>>
>>
>> -Olli
>>
>
>
>
Received on Wednesday, 16 March 2011 19:59:19 GMT
