- From: Young, Milan <Milan.Young@nuance.com>
- Date: Tue, 23 Nov 2010 12:14:56 -0800
- To: "Robert Brown" <Robert.Brown@microsoft.com>, <Olli@pettay.fi>
- Cc: "Bjorn Bringert" <bringert@google.com>, <public-xg-htmlspeech@w3.org>
"Real world" as defined by apps which are currently in production. Lab-ware doesn't count. I'd support that definition. -----Original Message----- From: Robert Brown [mailto:Robert.Brown@microsoft.com] Sent: Tuesday, November 23, 2010 12:11 PM To: Olli@pettay.fi; Young, Milan Cc: Bjorn Bringert; public-xg-htmlspeech@w3.org Subject: RE: Requirement for UA / SS protocol Maybe we're arguing which priority-1 feature is more important. If we agree they're both pri-1 then I'll be quiet. :-) We should define enough to enable real-world apps. Fair? -----Original Message----- From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi] Sent: Tuesday, November 23, 2010 11:29 AM To: Young, Milan Cc: Robert Brown; Bjorn Bringert; public-xg-htmlspeech@w3.org Subject: Re: Requirement for UA / SS protocol On 11/23/2010 08:55 PM, Young, Milan wrote: > I agree with Robert that remote speech engines are the "real world" > target. Because of this, the web API can ultimately only be as strong > as the protocol behind it. I don't disagree with this. > I don't see how we could prioritize one over the other. The API is the thing web developers will use. And we need feedback from the developers, IMO. So getting something to their hands soon would be good. Making reasonable simple v1 API might be enough to get feedback. And if the API is easily extensible, v2 with support for web app specified speech resources could come pretty soon after v1. API-vise the situation might be similar to XMLHttpRequest or Web Notifications, where v1 and v2 are specified in the same time, although it is expected that v1 will be recommendation before v2. I'm not saying we shouldn't define the protocol. We should, if we just have time. But I still think the v1 API has higher priority than the protocol. -Olli > > I'd also like to respond to Bjorn's point about re-recognition being > difficult in the protocol due to resource management. Keep in mind > that we've already agreed to some notion of a session which could span > multiple dialog interactions. Because these sessions would presumably > have a finite lifespan, the server could use the end of session marker > to trigger cleanup. The marker could be either an explicit end event > or implicit timeout. Probably a lot like cookies. > > Thanks > > > -----Original Message----- > From: Robert Brown [mailto:Robert.Brown@microsoft.com] > Sent: Tuesday, November 23, 2010 10:06 AM > To: Olli Pettay > Cc: Bjorn Bringert; Young, Milan; public-xg-htmlspeech@w3.org > Subject: RE: Requirement for UA / SS protocol > > Default speech engines really don't work well at all, except for very > generic things. Sure, simple command and control may work, assuming > all the commands an app chooses to use happen to be in the lexicons of > all default recognizers, and the letter-to-sound rules for > out-of-lexicon commands are consistent across all default recognizers. > And anything requiring a large grammar or SLM, domain-specific > vocabulary, etc, can't be satisfied by a default service. These > aren't corner cases: search, dictating topical messages, grammars that > contain confidential data or are too big to transfer on-demand to > whatever recognizer the UA happens to be using. 
>
> -----Original Message-----
> From: Olli Pettay [mailto:olli.pettay@gmail.com]
> Sent: Tuesday, November 23, 2010 9:54 AM
> To: Robert Brown
> Cc: Bjorn Bringert; Young, Milan; public-xg-htmlspeech@w3.org
> Subject: Re: Requirement for UA / SS protocol
>
> On 11/23/2010 07:36 PM, Robert Brown wrote:
>> I wouldn't say the API is most important. There's enough prior work out there that it should be pretty easy to design a good one.
>
> There is indeed some prior work, but still there isn't any good API. It shouldn't be very difficult to define one, but for some reason no one has done it, at least not in public.
>
>> What's most important is that a speech app works independently of the vendor's UA. This implies a standard protocol that at least supports mainstream scenarios.
>
> Why is that the most important? Or rather, I can see why it is important, but does it really imply a standard protocol? If the default speech engines work well enough, the most common use cases could be handled without app-specified speech services.
>
> -Olli
>
>> -----Original Message-----
>> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
>> Sent: Tuesday, November 23, 2010 9:22 AM
>> To: Bjorn Bringert
>> Cc: Young, Milan; Robert Brown; public-xg-htmlspeech@w3.org
>> Subject: Re: Requirement for UA / SS protocol
>>
>> On 11/23/2010 07:02 PM, Bjorn Bringert wrote:
>>> About priorities, I think that the web app API is definitely most important, since it is needed for both default and app-specified speech services, and because it's the one that is most clearly part of the HTML Speech XG charter. Typically, W3C does not specify client-server protocols at all. Adding a protocol for network speech services is a stretch of our charter, and the only chance to complete it in our timeline would be to keep it super simple.
>>
>> I agree with this. The API is the most important, but it must be extensible enough that support for app-specified speech services can be added easily once the protocol has been defined.
>>
>> And this is an XG. Whatever follow-up WG there is could define the protocol, perhaps with the IETF. The WG will in any case drive the API proposal to Rec (or reject the proposal). Of course the XG could define the protocol too, if there is enough time for that.
>>
>> -Olli
>>
>>> /Bjorn
>>>
>>> On Tue, Nov 23, 2010 at 4:46 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>>>> I think I was missing something :). Although chunking would encode the client events just fine, the server would probably have to delay its response until the client has completed.
>>>>
>>>> -----Original Message-----
>>>> From: Young, Milan
>>>> Sent: Tuesday, November 23, 2010 8:32 AM
>>>> To: 'Bjorn Bringert'; Robert Brown
>>>> Cc: public-xg-htmlspeech@w3.org
>>>> Subject: RE: Requirement for UA / SS protocol
>>>>
>>>> Hello Bjorn,
>>>>
>>>> Perhaps I'm missing something, but sending events from the client to the server isn't any more difficult than sending events from the server to the client. I suspect both could be implemented symmetrically in HTTP 1.1 through chunking.
>>>>
>>>> Regarding priorities, I believe the protocol is just as important as the web API. As long as we have a concrete protocol proposal in hand along a similar timeline as the concrete web API proposals, we should be safe.
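[A minimal sketch of the symmetric chunking Milan describes, assuming a Node-style HTTP client in TypeScript. The speech.example.com endpoint, the multipart framing, and the City=Seattle event syntax are invented for illustration; no such wire format has been agreed.]

```typescript
import * as http from "node:http";

// One chunked HTTP/1.1 request carries both audio frames and client
// events. With no Content-Length set, Node sends the body with
// Transfer-Encoding: chunked, so each write() goes out promptly.
const req = http.request({
  host: "speech.example.com", // placeholder speech service
  path: "/recognize",
  method: "POST",
  headers: { "Content-Type": "multipart/mixed; boundary=frame" },
});

// Stream one captured audio frame to the speech service.
function sendAudio(frame: Buffer): void {
  req.write("--frame\r\nContent-Type: audio/basic\r\n\r\n");
  req.write(frame);
  req.write("\r\n");
}

// Send a client-side event mid-stream, e.g. from the visual UI.
function sendEvent(name: string, value: string): void {
  req.write(`--frame\r\nContent-Type: text/plain\r\n\r\n${name}=${value}\r\n`);
}

sendEvent("City", "Seattle"); // the visual side feeds the recognizer
sendAudio(Buffer.alloc(320)); // ...one write per captured frame...
req.end("--frame--\r\n");     // client is done sending
```

[As the 4:46 PM follow-up above concedes, single-request chunking like this forces the service to delay its response until the client's stream has ended.]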
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>> Sent: Tuesday, November 23, 2010 4:23 AM
>>>> To: Robert Brown
>>>> Cc: Young, Milan; public-xg-htmlspeech@w3.org
>>>> Subject: Re: Requirement for UA / SS protocol
>>>>
>>>> While I agree that those two requirements have use cases, I think that they are not important enough to warrant the increased complexity they would add to any user agent / speech recognizer protocol. I can't see any simple way to implement them, e.g. using HTTP 1.1.
>>>>
>>>> Since the XG has a very aggressive timeline, I think that the first priority should be to come up with a well-specified web app / user agent API. The next priority would be to add a basic user agent / speech service protocol that can handle everything required by the web app API. I'm quite sure that we will not have time to work on any advanced speech service API.
>>>>
>>>> /Bjorn
>>>>
>>>> On Mon, Nov 22, 2010 at 11:59 PM, Robert Brown <Robert.Brown@microsoft.com> wrote:
>>>>> Hi Milan,
>>>>>
>>>>> * App-to-server events:
>>>>>
>>>>> Okay, consider my arm twisted :-) Nice example.
>>>>>
>>>>> * Re-recognition:
>>>>>
>>>>> If we make this a requirement of the protocol, we're going to have to be careful about the design. If the UA doesn't buffer the audio for re-recognition, it means a server somewhere is holding state, which may create an opportunity for evil apps/UAs to do bad things to services. I guess this supports your point about the increased effort of speccing the protocol.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Young, Milan [mailto:Milan.Young@nuance.com]
>>>>> Sent: Friday, November 19, 2010 8:00 PM
>>>>> To: Robert Brown; Bjorn Bringert
>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>> Subject: RE: Requirement for UA / SS protocol
>>>>>
>>>>> Hello Robert,
>>>>>
>>>>> Let's say the user is booking a flight. They are presented with a multi-modal UI in which they can type or speak any of the component facts (date, city, etc.).
>>>>>
>>>>> If the implementation had to start and stop the recognition every time a new piece of data was filled in on the visual side, you'd run the risk of clipping. It would be better if the visual side could feed events into the recognition stream (e.g. City=Seattle), in just the same way the recognition engine can feed events back to the visual side.
>>>>>
>>>>> Keep in mind that although we haven't seen this class of advanced visual/speech integration outside the lab, it is most certainly coming. I believe HTML Speech is going to unleash a flurry of innovation.
>>>>>
>>>>> On the re-recognition front, I do not support the UA buffering and resending the data. That is a needless use of UA memory and bandwidth, and it entails inherent latencies. The only significant downside I can think of is the increased effort of speccing the protocol. But we have to do something similar for text interpretation anyway, so I don't think it's such a big deal.
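[A minimal sketch of the protocol-level re-recognition Milan argues for, assuming the cookie-style session floated earlier in the thread. The endpoint, the audio=previous convention, and the form encoding are invented placeholders.]

```typescript
import * as http from "node:http";

// Second-pass request: no audio body, just new parameters plus a
// reference to the utterance the service retained from the first pass.
function reRecognize(sessionCookie: string, grammarUri: string): void {
  const body = `grammar=${encodeURIComponent(grammarUri)}&audio=previous`;
  const req = http.request({
    host: "speech.example.com", // placeholder speech service
    path: "/recognize",
    method: "POST",
    headers: {
      Cookie: sessionCookie, // continuity across interactions
      "Content-Type": "application/x-www-form-urlencoded",
      "Content-Length": Buffer.byteLength(body),
    },
  });
  req.end(body);
  // The service re-decodes the buffered utterance against the new
  // grammar (e.g. a City-B listing instead of City-A) and discards the
  // audio when the session ends, by explicit end event or timeout.
}
```

[Robert's caveat above still applies: the service is now holding state, so the session lifetime, whether ended explicitly or by timeout, is what bounds that resource commitment.]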
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Brown [mailto:Robert.Brown@microsoft.com]
>>>>> Sent: Friday, November 19, 2010 4:25 PM
>>>>> To: Young, Milan; Bjorn Bringert
>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>> Subject: RE: Requirement for UA / SS protocol
>>>>>
>>>>> * App-to-server events:
>>>>>
>>>>> I think this example illustrates the concept, but I'm skeptical that that particular application would ever be built. If I can scroll through the list, it's probably a short list, so I doubt the recognizer will need any hints. And I may as well just touch the person's name, since I can see it right there on the screen. It feels academic to me.
>>>>>
>>>>> * Re-recognition:
>>>>>
>>>>> Another re-recognition example is the user saying "Starbucks in Redmond, Washington". The initial grammar doesn't have all the business names for all the districts on the entire continent, but it is good at detecting which district the user is talking about. The re-recognition is then done using a business-listing grammar that's specific to that district.
>>>>>
>>>>> But I think there may be an argument that re-recognition is not a hard requirement of the protocol. The UA could just as easily buffer the audio and re-send it for a second recognition if required by the app.
>>>>>
>>>>> Maybe there should just be a requirement: "The web application should be able to re-use the audio recording from one recognition to perform another recognition."
>>>>>
>>>>> -----Original Message-----
>>>>> From: Young, Milan [mailto:Milan.Young@nuance.com]
>>>>> Sent: Friday, November 19, 2010 8:20 AM
>>>>> To: Bjorn Bringert; Robert Brown
>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>> Subject: RE: Requirement for UA / SS protocol
>>>>>
>>>>> First the use cases:
>>>>>
>>>>> Web-app to SS events - The user wants to place a call, but can't quite remember the name of the contact. The visual UI scrolls through the list of contacts and sends an event each time a new contact is displayed. The recognizer uses this information to weight the recognition result, because it's likely the user will speak a name they have just seen.
>>>>>
>>>>> Re-recognition using previous audio - The user has requested directory assistance to find a residential phone number. They thought the contact lived in City-A, but no relevant results were found. They want to try again in City-B, and shouldn't have to be asked for the utterance again.
>>>>>
>>>>> Interpretation over text - A common approach in speech processing is to use a large statistical or speaker-dependent model to identify the lexical nature of the tokens, and then a second pass over the data extracts meaning from the tokens. At present, this second pass seems to be the more difficult task, and sometimes several attempts need to be made, each with a different base context. For example, the word "bill" might refer to a financial transaction, a duck, a hat, or a person.
>>>>>
>>>>> Other responses:
>>>>>
>>>>> FPR11 - I'd like to add a second sentence to this statement (either in the summary or the text): "This includes both TBD standard and extension parameters."
>>>>>
>>>>> FPR28 and 29 - Perhaps these statements could be slightly adjusted with "... fire implementation-specific events to the web app".
>>>>>
>>>>> FPR4 - EMMA seems like a sufficiently flexible standard to handle all foreseeable needs. Perhaps now is not the time, but I'd like to see if we can agree to make this a required part of the protocol.
>>>>>
>>>>> Thanks
>>>>>
>>>>> -----Original Message-----
>>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>>> Sent: Friday, November 19, 2010 3:17 AM
>>>>> To: Robert Brown
>>>>> Cc: Young, Milan; public-xg-htmlspeech@w3.org
>>>>> Subject: Re: Requirement for UA / SS protocol
>>>>>
>>>>> We already had a requirement that there must be a standard protocol. If I understand this list correctly, it adds a number of requirements on what features that standard protocol must support. I propose that we consider each of the bullet points a separate requirement, so that they can be discussed independently.
>>>>>
>>>>> I think that most of them look fine. The only two that I'm not sure about are:
>>>>>
>>>>> - Web-app -> speech service events, with the same objection that Robert raised.
>>>>>
>>>>> - Re-recognition using previous audio streams. What's the use case for this?
>>>>>
>>>>> Also, I think that the following are already covered by existing requirements:
>>>>>
>>>>> - "Both standard and extension parameters passed from the web app to the speech service at the start of the interaction. List of standard parameters TBD." Covered by "FPR11. If the web apps specify speech services, it should be possible to specify parameters."
>>>>>
>>>>> - The speech service -> web app part of the bidirectional events requirement is covered by:
>>>>>
>>>>> FPR21. The web app should be notified that capture starts.
>>>>> FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition.
>>>>> FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition.
>>>>> FPR24. The web app should be notified when recognition results are available.
>>>>> FPR28. Speech recognition implementations should be allowed to fire implementation-specific events.
>>>>> FPR29. Speech synthesis implementations should be allowed to fire implementation-specific events.
>>>>>
>>>>> - "EMMA results passed from the SS to the web app. The syntax of this result is TBD (e.g. XML and/or JSON)." Covered by "FPR4. It should be possible for the web application to get the recognition results in a standard format such as EMMA."
>>>>>
>>>>> - "Interpretation over text." Covered (if I understand it correctly) by "FPR2. Implementations must support the XML format of SRGS and must support SISR."
>>>>>
>>>>> So, the remaining requirements from Milan's list that I support adding are:
>>>>>
>>>>> * At least one standard audio codec. UAs are permitted to advertise alternate codecs at the start of the interaction, and SSs are allowed to select any such alternate (e.g. HTTP Accept).
>>>>>
>>>>> * Transport layer security (e.g. HTTPS) if requested by the web app.
>>>>>
>>>>> * Session identifier that could be used to form continuity across multiple interactions (e.g. HTTP cookies).
>>>>>
>>>>> /Bjorn
>>>>>
>>>>> On Fri, Nov 19, 2010 at 1:49 AM, Robert Brown <Robert.Brown@microsoft.com> wrote:
>>>>>> I mostly agree. But do we need bidirectional events?
>>>>>> I suspect all the interesting ones originate at the server: start-of-speech; hypothesis; partial result; warnings of noise, crosstalk, etc. I'm trying to think why the server would care about events from the client, other than when the client is done sending audio (which it could signal in response to a click or to end-point detection).
>>>>>>
>>>>>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan
>>>>>> Sent: Thursday, November 18, 2010 5:34 PM
>>>>>> To: public-xg-htmlspeech@w3.org
>>>>>> Subject: Requirement for UA / SS protocol
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> On the Nov 18th conference call, I volunteered to send out proposed wording for a new requirement:
>>>>>>
>>>>>> Summary - User agents and speech services are required to support at least one common protocol.
>>>>>>
>>>>>> Description - A common protocol will be defined as part of the final recommendation. It will be built upon some TBD existing application-layer protocol and include support for the following:
>>>>>>
>>>>>> * Streaming audio data (e.g. HTTP 1.1 chunking). This includes both audio streamed from the UA to the SS during recognition and audio streamed from the SS to the UA during synthesis.
>>>>>>
>>>>>> * Bidirectional events, which can occur at any time during the interaction. These events could originate either within the web app (e.g. click) or within the SS (e.g. start-of-speech or mark) and must be transmitted through the UA in a timely fashion. The set of events includes both standard events defined by the final recommendation and extension events.
>>>>>>
>>>>>> * Both standard and extension parameters passed from the web app to the speech service at the start of the interaction. List of standard parameters TBD.
>>>>>>
>>>>>> * EMMA results passed from the SS to the web app. The syntax of this result is TBD (e.g. XML and/or JSON).
>>>>>>
>>>>>> * At least one standard audio codec. UAs are permitted to advertise alternate codecs at the start of the interaction, and SSs are allowed to select any such alternate (e.g. HTTP Accept). [A sketch of this negotiation follows the thread.]
>>>>>>
>>>>>> * Transport layer security (e.g. HTTPS) if requested by the web app.
>>>>>>
>>>>>> * Session identifier that could be used to form continuity across multiple interactions (e.g. HTTP cookies).
>>>>>>
>>>>>> * Interpretation over text.
>>>>>>
>>>>>> * Re-recognition using previous audio streams.
>>>>>>
>>>>>> Thank you
>>>>>
>>>>> --
>>>>> Bjorn Bringert
>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>>>> Registered in England Number: 3977902
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
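[A minimal sketch of the codec-negotiation bullet in Milan's proposal, assuming plain HTTP content negotiation as its parenthetical suggests; the endpoint and the particular media types are placeholders.]

```typescript
import * as http from "node:http";

// The UA advertises the codecs it can accept for synthesized audio,
// listing a baseline codec first and alternates with quality weights;
// the speech service selects one and reports it in Content-Type.
const req = http.request({
  host: "speech.example.com", // placeholder speech service
  path: "/synthesize?text=hello",
  method: "GET",
  headers: {
    Accept: "audio/basic, audio/x-wav;q=0.9, audio/ogg;q=0.8",
  },
});

req.on("response", (res) => {
  console.log("service chose:", res.headers["content-type"]);
  res.pipe(process.stdout); // consume the audio stream as it arrives
});
req.end();
```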
Received on Tuesday, 23 November 2010 20:15:34 UTC