Re: R18. User perceived latency of synthesis must be minimized

Aren't servers already capable of enforcing such requirements? For
example, the server could choose to honour only a chunked transfer
request instead of a bulk HTTP POST, thereby allowing only UAs that
support streaming to work with it. Realistically, though, servers would
want to service as many users as they can across a wide variety of
network conditions; for example, many users on mobile networks may not
be able to stream audio out because of proxies. I think the spec should
be agnostic to network conditions and not enforce any such requirements.
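
As an illustration of the kind of server-side gate I mean (a minimal
Node.js/TypeScript sketch; the endpoint and behaviour are hypothetical,
not a proposal for the spec):

  import * as http from "http";

  // Hypothetical speech server that accepts only streamed (chunked)
  // audio uploads and rejects bulk POSTs with a declared Content-Length.
  const server = http.createServer((req, res) => {
    if (req.method === "POST" &&
        req.headers["transfer-encoding"] !== "chunked") {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("This service requires chunked (streaming) uploads.");
      return;
    }
    req.on("data", (chunk) => {
      // Feed each audio chunk to the recognizer as it arrives.
    });
    req.on("end", () => {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ status: "ok" }));
    });
  });

  server.listen(8080);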

Cheers
Satish



On Mon, Nov 8, 2010 at 9:11 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> Hello Olli,
>
> The current set of requirements certainly makes provision for techniques such as streaming.  But there is nothing, for example, that *requires* a UA to use streaming if that is what the speech server desires.
>
> The thrust of my new requirements is to afford the speech server authority over the low-level details of the speech dialog.  This includes chunking, endpointing, codecs, parameter/result passing, event timings/mappings, and so on.
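>
> As a rough illustration (names purely hypothetical, not a proposal),
> the kind of knobs I have in mind could be described like this:
>
>   interface ServerDialogPolicy {
>     chunking: { enabled: boolean; chunkMs?: number }; // transfer granularity
>     endpointing: "server" | "client";      // who detects end of speech
>     codec: string;                         // e.g. "audio/flac"
>     parameters: Record<string, string>;    // engine-specific settings
>     eventTiming: "immediate" | "buffered"; // when events reach the app
>   }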
>
> If we can figure out how to roll this concept into the existing requirements, that's fine with me.
>
> Thank you
>
>
> -----Original Message-----
> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
> Sent: Monday, November 08, 2010 12:07 PM
> To: Young, Milan
> Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
> Subject: Re: R18. User perceived latency of synthesis must be minimized
>
> On 11/08/2010 07:07 PM, Young, Milan wrote:
>> I'm glad to see consensus on R18, but I suspect that we have not
>> fundamentally addressed Marc's concern regarding communication with
>> speech servers.
>>
>> Up to this point, most requirements have been focused on the application
>> author's interface with the UA.  Any requirements around the UA's
>> integration with the speech server are implied at best.  While the
>> details of such protocol integrations are probably outside the domain of
>> the W3C, I believe it is still incumbent on our recommendation to
>> address the general framework.
>>
>> I'd like to suggest a new set of requirements:
>>
>> ·Speech resources must be allowed to write to the audio buffer for the
>> complete duration of the real-time audio rendering.
>>
>> ·Speech resources must be allowed to read from the audio buffer for the
>> complete duration of the real-time audio capture.
> I don't quite understand these two.
> The first one, I think, talks about the "TTS" buffer and the latter
> one about the "ASR" buffer, right?
> If so, the wording could be clearer, and I'm not sure what benefit we
> get from such requirements compared to R18 and R17.
>
>
>>
>> ·User agents must allow passing parameters (structured or otherwise)
>> from the application to the speech resource.
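>>
>> For example (purely illustrative, not a proposed API; the names
>> SpeechRecognizer and setParameter are hypothetical):
>>
>>   const reco = new SpeechRecognizer("https://asr.example.com/service");
>>   reco.setParameter("vendor:confidence-threshold", "0.4");
>>   reco.setParameter("vendor:n-best", "5");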
>
> R22 takes care of this, though I'm not sure now whether that got merged
> already with some other requirement.
> And R22 is neither quite clear nor right. By default we should assume
> that local speech engines work in a compatible way and no
> recognizer-specific parameters are needed (we don't want
> browser-specific webapps). Network services are a quite different
> thing; there, recognizer-specific parameters make more sense.
>
>>
>> ·User agents must not filter results from the speech resource to the
>> application.  (Perhaps this is already handled by R6?)
> I'd leave this to the UA. Right now I can't see any reason why a UA
> would filter anything, but I also don't understand why that should be
> prohibited.
> Actually, a UA for kids, for example, might want to filter out
> some results.
> But anyway, I'd leave the decision about filtering to the UA and
> wouldn't add any requirement for it.
>
>
>>
>> ·User agents must not interfere with the timing of speech events from
>> the speech resource to the application.
> Why would we need this requirement? R17+R18 should be enough, IMO.
>
>
> -Olli
>
>
>
>
>>
>> Thoughts?
>>
>> ------------------------------------------------------------------------
>>
>> *From:*public-xg-htmlspeech-request@w3.org
>> [mailto:public-xg-htmlspeech-request@w3.org] *On Behalf Of *Bjorn Bringert
>> *Sent:* Friday, November 05, 2010 7:00 AM
>> *To:* Marc Schroeder
>> *Cc:* Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett;
>> public-xg-htmlspeech@w3.org
>> *Subject:* Re: R18. User perceived latency of synthesis must be minimized
>>
>> Ah, right, this is symmetrical to the replacement speech recognition
>> requirement that I proposed. As you say, for TTS it is the output that
>> must be allowed to be streamed. How about rewording R18 to make this
>> clearer:
>>
>> New R18: "Implementations should be allowed to start playing back
>> synthesized speech before the complete results of the speech synthesis
>> request are available."
>>
>> The intention is to rule out proposals that make streaming
>> implementations impossible, but not to require implementations to
>> implement streaming (since that might not make sense for all
>> implementations).
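>>
>> As a concrete sketch of the kind of implementation this wording is
>> meant to permit (using today's MediaSource API purely as an
>> illustration; the TTS endpoint is hypothetical):
>>
>>   const audio = new Audio();
>>   const mediaSource = new MediaSource();
>>   audio.src = URL.createObjectURL(mediaSource);
>>
>>   mediaSource.addEventListener("sourceopen", async () => {
>>     const buf = mediaSource.addSourceBuffer("audio/mpeg");
>>     const resp = await fetch("https://tts.example.com/synthesize?text=hello");
>>     const reader = resp.body!.getReader();
>>     audio.play();  // playback starts before the full result arrives
>>     for (;;) {
>>       const { done, value } = await reader.read();
>>       if (done) break;
>>       buf.appendBuffer(value);  // queue the next audio chunk
>>       await new Promise((r) =>
>>         buf.addEventListener("updateend", r, { once: true }));
>>     }
>>     mediaSource.endOfStream();
>>   });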
>>
>> /Bjorn
>>
>> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder
>> <marc.schroeder@dfki.de> wrote:
>>
>> Dear all,
>>
>> Interesting question whether this is in scope or not. I think I brought
>> up the original discussion point, so here is what I had in mind.
>>
>> Let's assume for the moment that Bjorn's proposal of a <tts> element (or
>> some other mechanism for triggering TTS output) was accepted, and that
>> furthermore we agreed on the "cloud" part of R20 (letting the web
>> application author select a TTS engine on some server).
>>
>> Then it seems inevitable that the UA and the TTS server need to
>> communicate with one another using some protocol: the UA would have to
>> send the TTS server the request, and the TTS server would have to send
>> the synthesised result back to the UA.
>>
>> In my mind, the requirement concerns the part of the protocol where the
>> TTS server sends the audio to the UA. If that is done in "the right
>> way", latency could be minimised, compared to the worst case where the
>> entire request would have to be sent in one big audio chunk before
>> playback could start.
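>>
>> For instance, "the right way" could look roughly like this on the
>> server side (a hypothetical Node.js/TypeScript sketch;
>> synthesizeIncrementally is an invented placeholder):
>>
>>   import * as http from "http";
>>
>>   http.createServer((req, res) => {
>>     res.writeHead(200, { "Content-Type": "audio/mpeg" });
>>     // With no Content-Length set, Node uses chunked transfer encoding,
>>     // so the UA can start playback as soon as the first chunk arrives.
>>     for (const chunk of synthesizeIncrementally(req)) {  // hypothetical
>>       res.write(chunk);
>>     }
>>     res.end();
>>   }).listen(8080);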
>>
>>
>> I am aware that this raises a bigger issue, about communicating with
>> speech servers. While this is maybe not a direct part of the
>> UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it
>> is an integral part of what this group needs to address.
>>
>> The obvious shortcut would be to have TTS/ASR done only in the UA, and I
>> hope we can agree that this is not an acceptable perspective...
>>
>> Kind regards,
>> Marc
>>
>>
>>
>>
>> On 05.11.10 11:17, Bjorn Bringert wrote:
>>
>>     On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay
>>     <Olli.Pettay@helsinki.fi> wrote:
>>
>>     On 11/05/2010 08:42 AM, Robert Brown wrote:
>>
>>     Agreed that the server case is out of scope. I wonder if there's
>>     anything that could be said about the client. Perhaps it could be
>>     rewritten as "user agents should provide/play back rendered TTS audio
>>     to the app immediately as it's received from the TTS service".
>>
>>     This might actually be quite "wrong" wording if we're going to
>>     extend HTML5's media elements to provide TTS.
>>     The application may want to cache the result from the TTS engine
>>     before playing it out.
>>
>>
>>     Yes, such wording would prohibit the HTMLMediaElement autobuffer
>>     attribute from working, which ironically would increase latency.
>>
>>     I think that the only point of a latency requirement would be to make
>>     sure that spec proposals don't prohibit low-latency processing. For
>>     example, a spec that requires audio capture to finish (without the
>>     user aborting it) before any audio transmission or speech processing
>>     may take place would force high latency in implementations.
>>
>>     For recognition latency (the next requirement), maybe something like
>>     this would be appropriate: "Implementations should be allowed to start
>>     processing captured audio before the capture completes." For TTS, I
>>     don't think that such a requirement is needed, since the text to
>>     synthesize is typically available immediately (as opposed to captured
>>     audio which becomes available at a fixed rate).
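>>
>>     As a sketch of what that wording would permit (modern capture APIs
>>     are used purely for illustration; the recognizer endpoint is
>>     hypothetical):
>>
>>       async function streamMicrophone() {
>>         const stream =
>>           await navigator.mediaDevices.getUserMedia({ audio: true });
>>         const recorder = new MediaRecorder(stream);
>>         recorder.ondataavailable = (e) => {
>>           // Ship each chunk to the recognizer as capture proceeds.
>>           fetch("https://asr.example.com/stream",
>>                 { method: "POST", body: e.data });
>>         };
>>         recorder.start(250);  // timeslice: emit a chunk every 250 ms
>>       }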
>>
>>
>>     Hmm... could do with better wording, and may be just stating the
>>     obvious.
>>
>>     -----Original Message-----
>>     From: Satish Sampath [mailto:satish@google.com]
>>     Sent: Thursday, November 04, 2010 3:08 PM
>>     To: Robert Brown
>>     Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>>     This seems more of a requirement on the speech service which
>>     synthesizes the audio than on the UA, since usually the complexity
>>     lies in the synthesizer. I equate this to a requirement like "web
>>     pages must load as fast as possible", which in reality turns into
>>     "web servers should process received requests as fast as they can",
>>     and the latter is really up to the implementation, based on a lot of
>>     factors which are not in the control of the UA.
>>
>>     If we agree that is the case, I think it is out of scope.
>>
>>     Cheers Satish
>>
>>
>>
>>     On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown
>>     <Robert.Brown@microsoft.com> wrote:
>>
>>
>>
>>     It may just be a requirement that's really obvious.
>>
>>
>>
>>     From: public-xg-htmlspeech-request@w3.org
>>     [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>     Sent: Thursday, November 04, 2010 1:27 PM
>>     To: Dan Burnett
>>     Cc: public-xg-htmlspeech@w3.org
>>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>>
>>
>>     I don't see a need for this to be a requirement. It's up to
>>     implementations to be fast, and it's unrealistic to set any
>>     specific latency limits.
>>
>>
>>
>>     On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett
>>     <dburnett@voxeo.com> wrote:
>>
>>     Group,
>>
>>     This is the next of the requirements to discuss and prioritize
>>     based on our ranking approach [1].
>>
>>     This email is the beginning of a thread for questions,
>>     discussion,
>>     and opinions regarding our first draft of Requirement 18 [2].
>>
>>     Please discuss via email as we agreed at the Lyon f2f meeting.
>>     Outstanding points of contention will be discussed live at
>>     the next
>>     teleconference.
>>
>>     -- dan
>>
>>     [1]
>>     http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.
>>
>>
>>     html [2]
>>
>>     http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0
>>
>>
>>     001/speech.html#r18
>>
>>
>>
>>     --
>>     Bjorn Bringert
>>     Google UK Limited, Registered Office: Belgrave House,
>>     76 Buckingham Palace Road, London, SW1W 9TQ
>>     Registered in England Number: 3977902
>>
>>     --
>>     Bjorn Bringert
>>     Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>     Palace Road, London, SW1W 9TQ
>>     Registered in England Number: 3977902
>>
>>
>> --
>> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
>> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
>> Project leader for DFKI in SSPNet http://sspnet.eu
>> Project leader PAVOQUE http://mary.dfki.de/pavoque
>> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
>> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
>> Portal Editor http://emotion-research.net
>> Team Leader DFKI TTS Group http://mary.dfki.de
>>
>> Homepage: http://www.dfki.de/~schroed
>> Email: marc.schroeder@dfki.de
>> Phone: +49-681-85775-5303
>> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
>> Saarbrücken, Germany
>> --
>> Official DFKI coordinates:
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>> Amtsgericht Kaiserslautern, HRB 2313
>>
>>
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>>
>
>

Received on Monday, 8 November 2010 22:00:54 UTC