Re: R18. User perceived latency of synthesis must be minimized from Bjorn Bringert on 2010-11-08 (public-xg-htmlspeech@w3.org from November 2010)

From: Bjorn Bringert <bringert@google.com>
Date: Mon, 8 Nov 2010 22:17:11 +0000
To: Satish Sampath <satish@google.com>
Cc: "Young, Milan" <Milan.Young@nuance.com>, Olli@pettay.fi, Marc Schroeder <marc.schroeder@dfki.de>, Robert Brown <Robert.Brown@microsoft.com>, Dan Burnett <dburnett@voxeo.com>, public-xg-htmlspeech@w3.org
Message-ID: <AANLkTimPS4douDxiFBQOHPhMQjn7dbtMDsX7LNEW-5ja@mail.gmail.com>

It's probably hard to specify exactly what UAs must do with respect
to recording and playback timing etc. At the requirements stage, I
think it's most important to specify what any specs must allow. This
for example includes that any proposals must be streaming-friendly.
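
As a rough illustration of what "streaming-friendly" means here, the
following sketch contrasts a UA that plays synthesized audio chunk by
chunk with one that waits for the complete result. Everything in it
(synthesize_chunks, play_streaming, play_buffered, the fake "audio"
bytes) is hypothetical and not taken from any proposal; it only shows
that a streaming design lets playback start after the first chunk.

```python
from typing import Iterable, Iterator, List


def synthesize_chunks(text: str, chunk_size: int = 4) -> Iterator[bytes]:
    """Hypothetical speech service: yields fake 'audio' one chunk at a time."""
    data = text.encode("utf-8")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def play_streaming(chunks: Iterable[bytes]) -> List[str]:
    """Streaming UA: plays each chunk as soon as it has been received."""
    log: List[str] = []
    for i, _chunk in enumerate(chunks):
        log.append(f"recv {i}")
        log.append(f"play {i}")
    return log


def play_buffered(chunks: Iterable[bytes]) -> List[str]:
    """Non-streaming UA: receives the complete result before playing anything."""
    log: List[str] = []
    received: List[bytes] = []
    for i, chunk in enumerate(chunks):
        received.append(chunk)
        log.append(f"recv {i}")
    for i in range(len(received)):
        log.append(f"play {i}")
    return log


if __name__ == "__main__":
    print(play_streaming(synthesize_chunks("hello world!")))
    print(play_buffered(synthesize_chunks("hello world!")))
```

In the streaming log the first "play" entry follows the first "recv";
in the buffered log it only appears after every chunk has arrived. A
spec that permits the first behaviour still permits the second.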

As for the fidelity in passing parameters to speech services and
returning results, the first priority for the UA should be to ensure
that the UA <-> web app communication follows the HTML Speech API,
e.g. with respect to the ordering of events, whatever the spec for
that ends up being. Then there should be a standard API for
extensions, including for example extra parameters to pass to speech
services and extra events and metadata to get back. That is, there
should be a common interface that web apps get, regardless of the
speech service implementation used, and there may also be
implementation-specific extensions.
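
A minimal sketch of that layering, assuming a made-up interface: the
names SpeechService, RecognitionResult, PlainService, VendorService
and the "x-vendor-nbest" parameter are all invented for illustration
and do not come from any proposal. The web app relies only on the
common part (the transcript), while extra parameters and extra
metadata travel through a separate extensions channel that a service
is free to ignore.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RecognitionResult:
    transcript: str  # common part: every service must provide this
    # Implementation-specific metadata; empty if the service has none.
    extensions: Dict[str, Any] = field(default_factory=dict)


class SpeechService:
    """Hypothetical common interface that web apps see for any service."""

    def recognize(self, audio: bytes,
                  extensions: Optional[Dict[str, Any]] = None) -> RecognitionResult:
        raise NotImplementedError


class PlainService(SpeechService):
    """Service with no extensions: unknown extra parameters are ignored."""

    def recognize(self, audio, extensions=None):
        return RecognitionResult(transcript="hello")


class VendorService(SpeechService):
    """Service that understands one made-up extension parameter."""

    def recognize(self, audio, extensions=None):
        result = RecognitionResult(transcript="hello")
        if extensions and extensions.get("x-vendor-nbest"):
            result.extensions["x-vendor-nbest"] = ["hello", "hallo"]
        return result


def web_app(service: SpeechService) -> str:
    """Web app code that depends only on the common interface."""
    result = service.recognize(b"\x00\x01", {"x-vendor-nbest": True})
    return result.transcript


if __name__ == "__main__":
    print(web_app(PlainService()), web_app(VendorService()))
```

The same web_app function works against both services, which is the
point of the common interface; an app that wants the vendor-specific
data can read result.extensions, at the cost of depending on that one
service.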

Without this, we are just adding a very thin layer on top of a
specific speech service implementation, which results in vendor
lock-in. If a very direct interface to a specific speech service
implementation is required, it would probably be better to use a
generic audio capture / playback API instead of an HTML Speech API.

Out of interest, at what level is this sort of thing specified in VoiceXML?

/Bjorn

On Mon, Nov 8, 2010 at 10:00 PM, Satish Sampath <satish@google.com> wrote:
> Aren't servers already capable of enforcing such requirements? For
> example, the server could choose to honour only a chunked transfer
> request instead of a bulk HTTP POST, thereby only allowing UAs that
> support streaming to work with it. Realistically, though, servers would
> want to service as many users as they can in a wide variety of network
> conditions. For example, many users on mobile networks may not be able
> to stream out audio due to proxies. I think the spec should be agnostic
> to the network conditions and not enforce any such requirements.
>
> Cheers
> Satish
>
>
>
> On Mon, Nov 8, 2010 at 9:11 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>> Hello Olli,
>>
>> The current set of requirements certainly does make provision for techniques such as streaming. But there is nothing, for example, that *requires* a UA to use streaming if that is what is desired by the speech server.
>>
>> The thrust of my new requirements is to afford the speech server authority over the low-level details of the speech dialog. This includes chunking, endpointing, codecs, parameter/result passing, event timings/mappings, etc.
>>
>> If we can figure out how to roll this concept into the existing requirements, that's fine with me.
>>
>> Thank you
>>
>>
>> -----Original Message-----
>> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
>> Sent: Monday, November 08, 2010 12:07 PM
>> To: Young, Milan
>> Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>> On 11/08/2010 07:07 PM, Young, Milan wrote:
>>> I'm glad to see consensus on R18, but I suspect that we have not
>>> fundamentally addressed Marc's concern regarding communication with
>>> speech servers.
>>>
>>> Up to this point, most requirements have been focused on the application
>>> author's interface with the UA.  Any requirements around the UA
>>> integration with the speech server seem inferred at best.  While the
>>> details of such protocol integrations are probably outside the domain of
>>> the W3C, I believe it is still incumbent within our recommendation that
>>> the general framework be addressed.
>>>
>>> I'd like to suggest a new set of requirements:
>>>
>>> · Speech resources must be allowed to write to the audio buffer for the
>>> complete duration of the real-time audio rendering.
>>>
>>> · Speech resources must be allowed to read from the audio buffer for the
>>> complete duration of the real-time audio capture.
>> I don't quite understand these two.
>> The first one, I think, talks about the "TTS" buffer and the latter
>> one about "ASR" buffer, right?
>> If so, the wording could be clearer, and I'm not sure what benefit we
>> get from such requirements compared to R18 and R17.
>>
>>
>>>
>>> · User agents must allow passing parameters (structured or otherwise)
>>> from the application to the speech resource.
>>
>> R22 takes care of this, though I'm not now sure whether that got merged
>> already with some other requirement.
>> And R22 is neither quite clear nor quite right. By default we should
>> assume that local speech engines work in a compatible way and that no
>> recognizer-specific parameters are needed (we don't want
>> browser-specific webapps). Network services are quite a different
>> thing; there, recognizer-specific parameters make more sense.
>>
>>>
>>> · User agents must not filter results from the speech resource to the
>>> application.  (Perhaps this is already handled by R6?)
>> I'd leave this to the UA. Right now I can't see any reason why a UA
>> would filter anything, but I also don't understand why that should be
>> prohibited.
>> Actually, some UAs, for example ones for kids, might want to filter
>> out some results.
>> But anyway, I'd leave the decision of filtering to the UA and
>> wouldn't add any requirement for it.
>>
>>
>>>
>>> · User agents must not interfere with the timing of speech events from
>>> the speech resource to the application.
>> Why would we need this requirement? R17+R18 should be enough, IMO.
>>
>>
>> -Olli
>>
>>
>>
>>
>>>
>>> Thoughts?
>>>
>>> ------------------------------------------------------------------------
>>>
>>> From: public-xg-htmlspeech-request@w3.org
>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>> Sent: Friday, November 05, 2010 7:00 AM
>>> To: Marc Schroeder
>>> Cc: Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett;
>>> public-xg-htmlspeech@w3.org
>>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>> Ah, right, this is a symmetrical situation to the possible replacement
>>> speech recognition requirement that I proposed. As you say, for TTS it
>>> is the output that must be allowed to be streamed. How about rewording
>>> R18 to make this more clear:
>>>
>>> New R18: "Implementations should be allowed to start playing back
>>> synthesized speech before the complete results of the speech synthesis
>>> request are available."
>>>
>>> The intention is to rule out proposals that make streaming
>>> implementations impossible, but not to require implementations to
>>> implement streaming (since that might not make sense for all
>>> implementations).
>>>
>>> /Bjorn
>>>
>>> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder
>>> <marc.schroeder@dfki.de> wrote:
>>>
>>> Dear all,
>>>
>>> interesting question whether this is in scope or not. I think I brought
>>> up the original discussion point, so here is what I had in mind.
>>>
>>> Let's assume for the moment that Bjorn's proposal of a <tts> element (or
>>> some other mechanism for triggering TTS output) was accepted, and that
>>> furthermore we agreed on the "cloud" part of R20 (letting the web
>>> application author select a TTS engine on some server).
>>>
>>> Then it seems inevitable that the UA and the TTS server need to
>>> communicate with one another using some protocol: the UA would have to
>>> send the TTS server the request, and the TTS server would have to send
>>> the synthesised result back to the UA.
>>>
>>> In my mind, the requirement concerns the part of the protocol where the
>>> TTS server sends the audio to the UA. If that is done in "the right
>>> way", latency could be minimised, compared to the worst case where the
>>> entire request would have to be sent in one big audio chunk before
>>> playback could start.
>>>
>>>
>>> I am aware that this raises a bigger issue, about communicating with
>>> speech servers. While this is maybe not a direct part of the
>>> UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it
>>> is an integral part of what this group needs to address.
>>>
>>> The obvious shortcut would be to have TTS/ASR done only in the UA, and I
>>> hope we can agree that this is not an acceptable perspective...
>>>
>>> Kind regards,
>>> Marc
>>>
>>>
>>>
>>>
>>> On 05.11.10 11:17, Bjorn Bringert wrote:
>>>
>>>     On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay
>>>     <Olli.Pettay@helsinki.fi> wrote:
>>>
>>>     On 11/05/2010 08:42 AM, Robert Brown wrote:
>>>
>>>     Agreed that the server case is out of scope. I wonder if there's
>>>     anything that could be said about the client. Perhaps it could be
>>>     rewritten as "user agents should provide/playback rendered TTS audio
>>>     to the app immediately as it's received from the TTS service".
>>>
>>>     This might be actually quite "wrong" wording if we're going to
>>>     extend HTML5's media elements to provide TTS.
>>>     The application may want to cache the result from TTS engine before
>>>     playing it out.
>>>
>>>
>>>     Yes, such wording would prohibit the HTMLMediaElement autobuffer
>>>     attribute from working, which ironically would increase latency.
>>>
>>>     I think that the only point of a latency requirement would be to make
>>>     sure that spec proposals don't prohibit low-latency processing. For
>>>     example, a spec that requires that audio capture must finish without the
>>>     user aborting it before any audio transmission or speech processing is
>>>     allowed to take place would force high latency in implementations.
>>>
>>>     For recognition latency (the next requirement), maybe something like
>>>     this would be appropriate: "Implementations should be allowed to start
>>>     processing captured audio before the capture completes." For TTS, I
>>>     don't think that such a requirement is needed, since the text to
>>>     synthesize is typically available immediately (as opposed to captured
>>>     audio which becomes available at a fixed rate).
>>>
>>>
>>>     Hmm... could do with better wording, and may be just stating the
>>>     obvious.
>>>
>>>     -----Original Message-----
>>>     From: Satish Sampath [mailto:satish@google.com]
>>>     Sent: Thursday, November 04, 2010 3:08 PM
>>>     To: Robert Brown
>>>     Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>>     This seems more of a requirement on the speech service which
>>>     synthesizes the audio, than the UA, since usually the complexity
>>>     lies in the synthesizer. I equate this to a requirement like 'web
>>>     pages must load as fast as possible' which in reality turns into
>>>     'web servers should process received requests as fast as they can'
>>>     and the latter is really up to the implementation based on a lot of
>>>     factors which are not in the control of the UA.
>>>
>>>     If we agree that to be the case, I think it is out of scope.
>>>
>>>     Cheers Satish
>>>
>>>
>>>
>>>     On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown
>>>     <Robert.Brown@microsoft.com> wrote:
>>>
>>>
>>>
>>>     It may just be a requirement that's really obvious.
>>>
>>>
>>>
>>>     From: public-xg-htmlspeech-request@w3.org
>>>     [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>>     Sent: Thursday, November 04, 2010 1:27 PM
>>>     To: Dan Burnett
>>>     Cc: public-xg-htmlspeech@w3.org
>>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>>
>>>
>>>     I don't see a need for this to be a requirement. It's up to
>>>     implementations to be fast, and it's unrealistic to set any
>>>     specific latency limits.
>>>
>>>
>>>
>>>     On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com>
>>>     wrote:
>>>
>>>     Group,
>>>
>>>     This is the next of the requirements to discuss and prioritize
>>>     based on our ranking approach [1].
>>>
>>>     This email is the beginning of a thread for questions, discussion,
>>>     and opinions regarding our first draft of Requirement 18 [2].
>>>
>>>     Please discuss via email as we agreed at the Lyon f2f meeting.
>>>     Outstanding points of contention will be discussed live at the
>>>     next teleconference.
>>>
>>>     -- dan
>>>
>>>     [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>     [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>>>
>>>
>>>
>>>     --
>>>     Bjorn Bringert
>>>     Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>     Palace Road, London, SW1W 9TQ
>>>     Registered in England Number: 3977902
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>     --
>>>     Bjorn Bringert
>>>     Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>     Palace Road, London, SW1W 9TQ
>>>     Registered in England Number: 3977902
>>>
>>>
>>> --
>>> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
>>> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
>>> Project leader for DFKI in SSPNet http://sspnet.eu
>>> Project leader PAVOQUE http://mary.dfki.de/pavoque
>>> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
>>> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
>>> Portal Editor http://emotion-research.net
>>> Team Leader DFKI TTS Group http://mary.dfki.de
>>>
>>> Homepage: http://www.dfki.de/~schroed
>>> Email: marc.schroeder@dfki.de <mailto:marc.schroeder@dfki.de>
>>> Phone: +49-681-85775-5303
>>> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
>>> Saarbrücken, Germany
>>> --
>>> Official DFKI coordinates:
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>> Amtsgericht Kaiserslautern, HRB 2313
>>>
>>>
>>>
>>>
>>> --
>>> Bjorn Bringert
>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>> Palace Road, London, SW1W 9TQ
>>> Registered in England Number: 3977902
>>>
>>
>>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Monday, 8 November 2010 22:17:44 UTC