RE: R18. User perceived latency of synthesis must be minimized from Young, Milan on 2010-11-10 (public-xg-htmlspeech@w3.org from November 2010)

From: Young, Milan <Milan.Young@nuance.com>
Date: Tue, 9 Nov 2010 18:46:38 -0800
To: "Bjorn Bringert" <bringert@google.com>
Cc: "Satish Sampath" <satish@google.com>, <Olli@pettay.fi>, "Marc Schroeder" <marc.schroeder@dfki.de>, "Robert Brown" <Robert.Brown@microsoft.com>, "Dan Burnett" <dburnett@voxeo.com>, <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD094A55D4@SUN-EXCH01.nuance.com>

Hello Bjorn,

The first two points sound good.  I'd like to make the last point more concrete in three ways:

  * Define the type of information that will flow through the UA <-> speech service channel, e.g. audio, events, and parameters.

  * Require support for communication with remote speech services based on HTTP 1.1.  Additional protocols could be supported through negotiation.

  * Require that UAs also expose an API for local speech services.  The details of the protocol will not be specified, only the data flow and timing of events.
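
As a rough illustration of the first point, the three kinds of information could be modelled as distinct message types. This is only a sketch under my own assumptions: the class names, fields, and the codec label are hypothetical and come from no proposal on this list.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class AudioChunk:
    """A slice of captured or synthesized audio."""
    data: bytes
    codec: str  # hypothetical label, e.g. "audio/x-wav"


@dataclass
class SpeechEvent:
    """A timed notification, such as start-of-speech or a final result."""
    name: str
    timestamp_ms: int


@dataclass
class Parameters:
    """Standard or service-specific settings as key-value pairs."""
    values: Dict[str, str] = field(default_factory=dict)


# Anything that travels over the UA <-> speech service channel.
ChannelMessage = Union[AudioChunk, SpeechEvent, Parameters]


def describe(message: ChannelMessage) -> str:
    """Return a one-line summary, showing that each kind is distinguishable."""
    if isinstance(message, AudioChunk):
        return f"audio: {len(message.data)} bytes ({message.codec})"
    if isinstance(message, SpeechEvent):
        return f"event: {message.name} at {message.timestamp_ms} ms"
    return f"parameters: {sorted(message.values)}"
```

The same three types would apply whether the service is remote (over HTTP 1.1) or local; only the transport differs.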



Thoughts?


-----Original Message-----
From: Bjorn Bringert [mailto:bringert@google.com] 
Sent: Tuesday, November 09, 2010 3:09 AM
To: Young, Milan
Cc: Satish Sampath; Olli@pettay.fi; Marc Schroeder; Robert Brown; Dan Burnett; public-xg-htmlspeech@w3.org
Subject: Re: R18. User perceived latency of synthesis must be minimized

I agree with providing:

- A standard interface for passing optional implementation-specific
parameters from web apps to speech service implementations (through
the user agent). A simple list of key-value pairs is probably enough.
This is in addition to standard parameters such as language, grammar,
timeouts etc.

- A standard interface for recognizers to return extra information to
web apps in the recognition results (through the user agent). Allowing
EMMA-based results should be enough for this.

- Standard protocols for user agents to communicate with speech
recognizers and synthesizers. If we are to be able to do this within
the scope of the Speech XG, these must be very simple protocols (e.g.
HTTP POST).
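
A minimal sketch of how the first and third points could fit together, assuming the standard parameters and the implementation-specific key-value pairs are sent as one form-encoded HTTP POST body. The function name and the "x-vendor-mode" key are hypothetical, and rejecting clashing keys is just one possible policy.

```python
from typing import Dict
from urllib.parse import urlencode


def build_post_body(standard: Dict[str, str], extra: Dict[str, str]) -> str:
    """Merge standard and implementation-specific parameters into a POST body.

    Extension keys may not shadow standard ones, so a web app cannot
    silently change the meaning of a standard parameter.
    """
    clashes = set(standard) & set(extra)
    if clashes:
        raise ValueError(f"extension keys shadow standard ones: {sorted(clashes)}")
    merged = {**standard, **extra}
    # Sort the pairs so the encoded body is deterministic.
    return urlencode(sorted(merged.items()))


body = build_post_body(
    {"language": "en-US", "timeout": "5000"},
    {"x-vendor-mode": "fast"},  # hypothetical implementation-specific pair
)
```

A speech service that does not recognize an extension key could simply ignore it, which keeps the standard parameters portable across implementations.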

Is there anything else that you think is necessary?

/Bjorn

On Mon, Nov 8, 2010 at 11:58 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> I absolutely agree that the UA to application API should be as consistent as possible.  This includes events (and their sequence) and standard interchange formats like SSML, EMMA, and SISR.
>
> But speech providers must be given flexibility in how the low-level requests are serviced if we are to foster innovation.  We've all talked about chunking, but what about feature extraction, endpointing, codecs, and the ideas we are yet to conceive?
>
> I agree with Satish that we should not be enumerating these options in the standard.  But at the same time we need to accept that protocol negotiation is not a realistic solution for such new technology.
>
> It is for this reason that I'd like to add requirements ensuring that speech providers be given access to the raw data where it makes sense.  I suspect this would amount to a plugin model, but am open to possibilities.
>
>
>
> Bjorn, VoiceXML 2.x was a huge step in the right direction compared to the proprietary IVR languages it replaced.  But it still has portability holes in the areas of parameter definitions, parameter defaults, and interchange formats for information which lacks a standard (like SLMs).
>
> There isn't much we can realistically achieve concerning the lack of speech standards on our timeline.  But there is low-hanging fruit in the way of standard parameter names and default values.  Do you think that should be part of the requirement or proposal phase?
>
>
> Thanks
>
>
>
> -----Original Message-----
> From: Bjorn Bringert [mailto:bringert@google.com]
> Sent: Monday, November 08, 2010 2:17 PM
> To: Satish Sampath
> Cc: Young, Milan; Olli@pettay.fi; Marc Schroeder; Robert Brown; Dan Burnett; public-xg-htmlspeech@w3.org
> Subject: Re: R18. User perceived latency of synthesis must be minimized
>
> It's probably hard to specify exactly what UAs must do with respect
> to recording and playback timing etc. At the requirements stage, I
> think it's most important to specify what any specs must allow. This
> for example includes that any proposals must be streaming-friendly.
>
> As for the fidelity in passing parameters to speech services and
> returning results, the first priority for the UA should be to ensure
> that the UA <-> web app communication follows the HTML Speech API,
> e.g. with respect to the ordering of events, whatever the spec for
> that ends up being. Then there should be a standard API for
> extensions, including for example extra parameters to pass to speech
> services and extra events and metadata to get back. That is, there
> should be a common interface that web apps get, regardless of the
> speech service implementation used, and there may also be
> implementation-specific extensions.
>
> Without this, we are just adding a very thin layer on top of a
> specific speech service implementation, which results in vendor
> lock-in. If a very direct interface to a specific speech service
> implementation is required, it would probably be better to use a
> generic audio capture / playback API instead of an HTML Speech API.
>
> Out of interest, at what level is this sort of thing specified in VoiceXML?
>
> /Bjorn
>
> On Mon, Nov 8, 2010 at 10:00 PM, Satish Sampath <satish@google.com> wrote:
>> Aren't servers already capable of enforcing such requirements? For
>> example, the server could choose to honour only a chunked transfer request
>> instead of a bulk HTTP POST, thereby only allowing UAs that support
>> streaming to work with it. Realistically, though, servers would want
>> to serve as many users as they can in a wide variety of network
>> conditions. For example, many users on mobile networks may not be able to
>> stream out audio due to proxies. I think the spec should be agnostic
>> to the network conditions and not enforce any such requirements.
>>
>> Cheers
>> Satish
>>
>>
>>
>> On Mon, Nov 8, 2010 at 9:11 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>>> Hello Olli,
>>>
>>> The current set of requirements certainly does make provision for techniques such as streaming.  But there is nothing, for example, that *requires* a UA to use streaming if that is what is desired by the speech server.
>>>
>>> The thrust of my new requirements is to afford the speech server authority over the low-level details of the speech dialog.  This includes chunking, endpointing, codecs, parameter/result passing, and event timings/mappings.
>>>
>>> If we can figure out how to roll this concept into the existing requirements, that's fine with me.
>>>
>>> Thank you
>>>
>>>
>>> -----Original Message-----
>>> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
>>> Sent: Monday, November 08, 2010 12:07 PM
>>> To: Young, Milan
>>> Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
>>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>> On 11/08/2010 07:07 PM, Young, Milan wrote:
>>>> I'm glad to see consensus on R18, but I suspect that we have not
>>>> fundamentally addressed Marc's concern regarding communication with
>>>> speech servers.
>>>>
>>>> Up to this point, most requirements have been focused on the application
>>>> author's interface with the UA.  Any requirements around the UA
>>>> integration with the speech server seem inferred at best.  While the
>>>> details of such protocol integrations are probably outside the domain of
>>>> the W3C, I believe it is still incumbent within our recommendation that
>>>> the general framework be addressed.
>>>>
>>>> I'd like to suggest a new set of requirements:
>>>>
>>>> · Speech resources must be allowed to write to the audio buffer for the
>>>> complete duration of the real-time audio rendering.
>>>>
>>>> · Speech resources must be allowed to read from the audio buffer for the
>>>> complete duration of the real-time audio capture.
>>> I don't quite understand these two.
>>> The first one, I think, talks about the "TTS" buffer and the latter
>>> one about "ASR" buffer, right?
>>> If so, the wording could be clearer, and I'm not sure what benefit we
>>> get from such requirements compared to R18 and R17.
>>>
>>>
>>>>
>>>> · User agents must allow passing parameters (structured or otherwise)
>>>> from the application to the speech resource.
>>>
>>> R22 takes care of this, though I'm not now sure whether that got merged
>>> already with some other requirement.
>>> And R22 is neither quite clear nor quite right. By default we should
>>> assume that local speech engines work in a compatible way and no
>>> recognizer-specific parameters are needed (we don't want
>>> browser-specific webapps). Network services are quite a different
>>> thing; there, recognizer-specific parameters make more sense.
>>>
>>>>
>>>> · User agents must not filter results from the speech resource to the
>>>> application.  (Perhaps this is already handled by R6?)
>>> I'd leave this to the UA. Right now I can't see any reason why a UA
>>> would filter anything, but I also don't understand why that should be
>>> prohibited.
>>> Actually, a UA for kids, for example, might want to filter out
>>> some results.
>>> But anyway, I'd leave the decision of filtering to the UA and
>>> wouldn't add any requirement for it.
>>>
>>>
>>>>
>>>> · User agents must not interfere with the timing of speech events from
>>>> the speech resource to the application.
>>> Why would we need this requirement? R17+R18 should be enough, IMO.
>>>
>>>
>>> -Olli
>>>
>>>
>>>
>>>
>>>>
>>>> Thoughts?
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> From: public-xg-htmlspeech-request@w3.org
>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>>> Sent: Friday, November 05, 2010 7:00 AM
>>>> To: Marc Schroeder
>>>> Cc: Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett;
>>>> public-xg-htmlspeech@w3.org
>>>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>>
>>>> Ah, right, this is a symmetrical situation to the possible replacement
>>>> speech recognition requirement that I proposed. As you say, for TTS it
>>>> is the output that must be allowed to be streamed. How about rewording
>>>> R18 to make this more clear:
>>>>
>>>> New R18: "Implementations should be allowed to start playing back
>>>> synthesized speech before the complete results of the speech synthesis
>>>> request are available."
>>>>
>>>> The intention is to rule out proposals that make streaming
>>>> implementations impossible, but not to require implementations to
>>>> implement streaming (since that might not make sense for all
>>>> implementations).
>>>>
>>>> /Bjorn
>>>>
>>>> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder
>>>> <marc.schroeder@dfki.de> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> interesting question whether this is in scope or not. I think I brought
>>>> up the original discussion point, so here is what I had in mind.
>>>>
>>>> Let's assume for the moment that Bjorn's proposal of a <tts> element (or
>>>> some other mechanism for triggering TTS output) was accepted, and that
>>>> furthermore we agreed on the "cloud" part of R20 (letting the web
>>>> application author select a TTS engine on some server).
>>>>
>>>> Then it seems inevitable that the UA and the TTS server need to
>>>> communicate with one another using some protocol: the UA would have to
>>>> send the TTS server the request, and the TTS server would have to send
>>>> the synthesised result back to the UA.
>>>>
>>>> In my mind, the requirement concerns the part of the protocol where the
>>>> TTS server sends the audio to the UA. If that is done in "the right
>>>> way", latency could be minimised, compared to the worst case where the
>>>> entire request would have to be sent in one big audio chunk before
>>>> playback could start.
>>>>
>>>>
>>>> I am aware that this raises a bigger issue, about communicating with
>>>> speech servers. While this is maybe not a direct part of the
>>>> UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it
>>>> is an integral part of what this group needs to address.
>>>>
>>>> The obvious shortcut would be to have TTS/ASR done only in the UA, and I
>>>> hope we can agree that this is not an acceptable perspective...
>>>>
>>>> Kind regards,
>>>> Marc
>>>>
>>>>
>>>>
>>>>
>>>> On 05.11.10 11:17, Bjorn Bringert wrote:
>>>>
>>>>     On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay
>>>>     <Olli.Pettay@helsinki.fi> wrote:
>>>>
>>>>     On 11/05/2010 08:42 AM, Robert Brown wrote:
>>>>
>>>>     Agreed that the server case is out of scope. I wonder if there's
>>>>     anything that could be said about the client. Perhaps it could be
>>>>     rewritten as "user agents should provide/playback rendered TTS audio
>>>>     to the app immediately as it's received from the TTS service".
>>>>
>>>>     This might be actually quite "wrong" wording if we're going to
>>>>     extend HTML5's media elements to provide TTS.
>>>>     The application may want to cache the result from TTS engine before
>>>>     playing it out.
>>>>
>>>>
>>>>     Yes, such wording would prohibit the HTMLMediaElement autobuffer
>>>>     attribute from working, which ironically would increase latency.
>>>>
>>>>     I think that the only point of a latency requirement would be to make
>>>>     sure that spec proposals don't prohibit low-latency processing. For
>>>>     example, a spec that requires that audio capture must finish without the
>>>>     user aborting it before any audio transmission or speech processing is
>>>>     allowed to take place would force high latency in implementations.
>>>>
>>>>     For recognition latency (the next requirement), maybe something like
>>>>     this would be appropriate: "Implementations should be allowed to start
>>>>     processing captured audio before the capture completes." For TTS, I
>>>>     don't think that such a requirement is needed, since the text to
>>>>     synthesize is typically available immediately (as opposed to captured
>>>>     audio which becomes available at a fixed rate).
>>>>
>>>>
>>>>     Hmm... could do with better wording, and may be just stating the
>>>>     obvious.
>>>>
>>>>     -----Original Message-----
>>>>     From: Satish Sampath [mailto:satish@google.com]
>>>>     Sent: Thursday, November 04, 2010 3:08 PM
>>>>     To: Robert Brown
>>>>     Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>>>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>>
>>>>     This seems more of a requirement on the speech service which
>>>>     synthesizes the audio, than the UA, since usually the complexity
>>>>     lies in the synthesizer. I equate this to a requirement like 'web
>>>>     pages must load as fast as possible' which in reality turns into
>>>>     'web servers should process received requests as fast as they can'
>>>>     and the latter is really up to the implementation based on a lot of
>>>>     factors which are not in the control of the UA.
>>>>
>>>>     If we agree that to be the case, I think it is out of scope.
>>>>
>>>>     Cheers Satish
>>>>
>>>>
>>>>
>>>>     On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown
>>>>     <Robert.Brown@microsoft.com> wrote:
>>>>
>>>>
>>>>
>>>>     It may just be a requirement that's really obvious.
>>>>
>>>>
>>>>
>>>>     From: public-xg-htmlspeech-request@w3.org
>>>>     [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>>>     Sent: Thursday, November 04, 2010 1:27 PM
>>>>     To: Dan Burnett
>>>>     Cc: public-xg-htmlspeech@w3.org
>>>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>>
>>>>
>>>>
>>>>     I don't see a need for this to be a requirement. It's up to
>>>>     implementations to be fast, and it's unrealistic to set any
>>>>     specific latency limits.
>>>>
>>>>
>>>>
>>>>     On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com>
>>>>     wrote:
>>>>
>>>>     Group,
>>>>
>>>>     This is the next of the requirements to discuss and prioritize
>>>>     based on our ranking approach [1].
>>>>
>>>>     This email is the beginning of a thread for questions, discussion,
>>>>     and opinions regarding our first draft of Requirement 18 [2].
>>>>
>>>>     Please discuss via email as we agreed at the Lyon f2f meeting.
>>>>     Outstanding points of contention will be discussed live at the next
>>>>     teleconference.
>>>>
>>>>     -- dan
>>>>
>>>>     [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>>
>>>>     [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>>>>
>>>>
>>>>
>>>>     --
>>>>     Bjorn Bringert
>>>>     Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>>     Palace Road, London, SW1W 9TQ
>>>>     Registered in England Number: 3977902
>>>>
>>>>
>>>> --
>>>> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
>>>> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
>>>> Project leader for DFKI in SSPNet http://sspnet.eu
>>>> Project leader PAVOQUE http://mary.dfki.de/pavoque
>>>> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
>>>> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
>>>> Portal Editor http://emotion-research.net
>>>> Team Leader DFKI TTS Group http://mary.dfki.de
>>>>
>>>> Homepage: http://www.dfki.de/~schroed
>>>> Email: marc.schroeder@dfki.de <mailto:marc.schroeder@dfki.de>
>>>> Phone: +49-681-85775-5303
>>>> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
>>>> Saarbrücken, Germany
>>>> --
>>>> Official DFKI coordinates:
>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>>>> Geschaeftsfuehrung:
>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>> Dr. Walter Olthoff
>>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>
>>>>
>>>
>>>
>>
>
>



Received on Wednesday, 10 November 2010 02:47:17 UTC