RE: R18. User perceived latency of synthesis must be minimized from Young, Milan on 2010-11-08 (public-xg-htmlspeech@w3.org from November 2010)

From: Young, Milan <Milan.Young@nuance.com>
Date: Mon, 8 Nov 2010 13:11:18 -0800
To: <Olli@pettay.fi>
Cc: "Bjorn Bringert" <bringert@google.com>, "Marc Schroeder" <marc.schroeder@dfki.de>, "Robert Brown" <Robert.Brown@microsoft.com>, "Satish Sampath" <satish@google.com>, "Dan Burnett" <dburnett@voxeo.com>, <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD094A4C40@SUN-EXCH01.nuance.com>
Hello Olli,

The current set of requirements certainly does make provision for techniques such as streaming.  But there is nothing, for example, that *requires* a UA to use streaming if that is what is desired by the speech server.

The thrust of my new requirements is to afford the speech server authority over the low-level details of the speech dialog.  This includes chunking, endpointing, codecs, parameter/result passing, and event timings/mappings.

If we can figure out how to roll this concept into the existing requirements, that's fine with me.

Thank you


-----Original Message-----
From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi] 
Sent: Monday, November 08, 2010 12:07 PM
To: Young, Milan
Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
Subject: Re: R18. User perceived latency of synthesis must be minimized

On 11/08/2010 07:07 PM, Young, Milan wrote:
> I'm glad to see consensus on R18, but I suspect that we have not
> fundamentally addressed Marc's concern regarding communication with
> speech servers.
>
> Up to this point, most requirements have been focused on the application
> author's interface with the UA.  Any requirements around the UA
> integration with the speech server seem implied at best.  While the
> details of such protocol integrations are probably outside the domain of
> the W3C, I believe it is still incumbent on our recommendation to
> address the general framework.
>
> I'd like to suggest a new set of requirements:
>
> ·Speech resources must be allowed to write to the audio buffer for the
> complete duration of the real-time audio rendering.
>
> ·Speech resources must be allowed to read from the audio buffer for the
> complete duration of the real-time audio capture.
I don't quite understand these two.
The first one, I think, talks about the "TTS" buffer and the latter
one about the "ASR" buffer, right?
If so, the wording could be clearer, and I'm not sure what benefit we
get from such requirements compared to R18 and R17.


>
> ·User agents must allow passing parameters (structured or otherwise)
> from the application to the speech resource.

R22 takes care of this, though I'm not sure now whether that already got
merged with some other requirement.
And R22 is neither quite clear nor quite right. By default we should assume
that local speech engines work in a compatible way and no
recognizer-specific parameters are needed (we don't want browser-specific
webapps). Network services are quite a different thing; there,
recognizer-specific parameters make more sense.

>
> ·User agents must not filter results from the speech resource to the
> application.  (Perhaps this is already handled by R6?)
I'd leave this to the UA. Right now I can't see any reason why a UA would
filter anything, but I also don't understand why that should be
prohibited.
Actually, some UAs, for example ones for kids, might want to filter out
some results.
But anyway, I'd leave the decision about filtering to the UA and
wouldn't add any requirement for it.


>
> ·User agents must not interfere with the timing of speech events from
> the speech resource to the application.
Why would we need this requirement? R17+R18 should be enough, IMO.


-Olli




>
> Thoughts?
>
> ------------------------------------------------------------------------
>
> *From:*public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] *On Behalf Of *Bjorn Bringert
> *Sent:* Friday, November 05, 2010 7:00 AM
> *To:* Marc Schroeder
> *Cc:* Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett;
> public-xg-htmlspeech@w3.org
> *Subject:* Re: R18. User perceived latency of synthesis must be minimized
>
> Ah, right, this is a symmetrical situation to the possible replacement
> speech recognition requirement that I proposed. As you say, for TTS it
> is the output that must be allowed to be streamed. How about rewording
> R18 to make this clearer:
>
> New R18: "Implementations should be allowed to start playing back
> synthesized speech before the complete results of the speech synthesis
> request are available."
>
> The intention is to rule out proposals that make streaming
> implementations impossible, but not to require implementations to
> implement streaming (since that might not make sense for all
> implementations).
>
> /Bjorn
>
> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder
> <marc.schroeder@dfki.de> wrote:
>
> Dear all,
>
> interesting question whether this is in scope or not. I think I brought
> up the original discussion point, so here is what I had in mind.
>
> Let's assume for the moment that Bjorn's proposal of a <tts> element (or
> some other mechanism for triggering TTS output) was accepted, and that
> furthermore we agreed on the "cloud" part of R20 (letting the web
> application author select a TTS engine on some server).
>
> Then it seems inevitable that the UA and the TTS server need to
> communicate with one another using some protocol: the UA would have to
> send the TTS server the request, and the TTS server would have to send
> the synthesised result back to the UA.
>
> In my mind, the requirement concerns the part of the protocol where the
> TTS server sends the audio to the UA. If that is done in "the right
> way", latency could be minimised, compared to the worst case where the
> entire request would have to be sent in one big audio chunk before
> playback could start.
>
>
> I am aware that this raises a bigger issue, about communicating with
> speech servers. While this is maybe not a direct part of the
> UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it
> is an integral part of what this group needs to address.
>
> The obvious shortcut would be to have TTS/ASR done only in the UA, and I
> hope we can agree that this is not an acceptable perspective...
>
> Kind regards,
> Marc
>
>
>
>
> On 05.11.10 11:17, Bjorn Bringert wrote:
>
>     On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay
>     <Olli.Pettay@helsinki.fi> wrote:
>
>     On 11/05/2010 08:42 AM, Robert Brown wrote:
>
>     Agreed that the server case is out of scope. I wonder if there's
>     anything that could be said about the client. Perhaps it could be
>     rewritten as "user agents should provide/playback rendered TTS audio
>     to the app immediately as it's received from the TTS service".
>
>     This might be actually quite "wrong" wording if we're going to
>     extend HTML5's media elements to provide TTS.
>     The application may want to cache the result from TTS engine before
>     playing it out.
>
>
>     Yes, such wording would prohibit the HTMLMediaElement autobuffer
>     attribute from working, which ironically would increase latency.
>
>     I think that the only point of a latency requirement would be to make
>     sure that spec proposals don't prohibit low-latency processing. For
>     example, a spec that requires that audio capture must finish without the
>     user aborting it before any audio transmission or speech processing is
>     allowed to take place would force high latency in implementations.
>
>     For recognition latency (the next requirement), maybe something like
>     this would be appropriate: "Implementations should be allowed to start
>     processing captured audio before the capture completes." For TTS, I
>     don't think that such a requirement is needed, since the text to
>     synthesize is typically available immediately (as opposed to captured
>     audio which becomes available at a fixed rate).
>
>
>     Hmm... could do with better wording, and may be just stating the
>     obvious.
>
>     -----Original Message-----
>     From: Satish Sampath [mailto:satish@google.com]
>     Sent: Thursday, November 04, 2010 3:08 PM
>     To: Robert Brown
>     Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>
>     This seems more of a requirement on the speech service which
>     synthesizes the audio than on the UA, since usually the complexity lies
>     in the synthesizer. I equate this to a requirement like 'web pages
>     must load as fast as possible' which in reality turns into 'web
>     servers should process received requests as fast as they can', and the
>     latter is really up to the implementation based on a lot of factors
>     which are not in the control of the UA.
>
>     If we agree that to be the case, I think it is out of scope.
>
>     Cheers Satish
>
>
>
>     On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown
>     <Robert.Brown@microsoft.com> wrote:
>
>
>
>     It may just be a requirement that's really obvious.
>
>
>
>     From: public-xg-htmlspeech-request@w3.org
>     [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>     Sent: Thursday, November 04, 2010 1:27 PM
>     To: Dan Burnett
>     Cc: public-xg-htmlspeech@w3.org
>     Subject: Re: R18. User perceived latency of synthesis must be minimized
>
>
>
>     I don't see a need for this to be a requirement. It's up to
>     implementations to be fast, and it's unrealistic to set any
>     specific latency limits.
>
>
>
>     On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett
>     <dburnett@voxeo.com> wrote:
>
>     Group,
>
>     This is the next of the requirements to discuss and prioritize
>     based on our ranking approach [1].
>
>     This email is the beginning of a thread for questions, discussion,
>     and opinions regarding our first draft of Requirement 18 [2].
>
>     Please discuss via email as we agreed at the Lyon f2f meeting.
>     Outstanding points of contention will be discussed live at the next
>     teleconference.
>
>     -- dan
>
>     [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>     [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>
>
>
>     -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave
>     House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in
>     England Number: 3977902
>
>
>
>
>
>
>
>
>     --
>     Bjorn Bringert
>     Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>     Palace Road, London, SW1W 9TQ
>     Registered in England Number: 3977902
>
>
> --
> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
> Project leader for DFKI in SSPNet http://sspnet.eu
> Project leader PAVOQUE http://mary.dfki.de/pavoque
> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
> Portal Editor http://emotion-research.net
> Team Leader DFKI TTS Group http://mary.dfki.de
>
> Homepage: http://www.dfki.de/~schroed
> Email: marc.schroeder@dfki.de
> Phone: +49-681-85775-5303
> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
> Saarbrücken, Germany
> --
> Official DFKI coordinates:
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
>
>
>
>
> --
> Bjorn Bringert
> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> Palace Road, London, SW1W 9TQ
> Registered in England Number: 3977902
>
Received on Monday, 8 November 2010 21:12:09 UTC