Re: R18. User perceived latency of synthesis must be minimized from Marc Schroeder on 2010-11-05 (public-xg-htmlspeech@w3.org from November 2010)

From: Marc Schroeder <marc.schroeder@dfki.de>
Date: Fri, 05 Nov 2010 14:29:48 +0100
To: Bjorn Bringert <bringert@google.com>
CC: Olli@pettay.fi, Robert Brown <Robert.Brown@microsoft.com>, Satish Sampath <satish@google.com>, Dan Burnett <dburnett@voxeo.com>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <4CD406CC.1020603@dfki.de>
Dear all,

It is an interesting question whether this is in scope or not. I think I
brought up the original discussion point, so here is what I had in mind.

Let's assume for the moment that Bjorn's proposal of a <tts> element (or 
some other mechanism for triggering TTS output) was accepted, and that 
furthermore we agreed on the "cloud" part of R20 (letting the web 
application author select a TTS engine on some server).

Then it seems inevitable that the UA and the TTS server need to 
communicate with one another using some protocol: the UA would have to 
send the TTS server the request, and the TTS server would have to send 
the synthesised result back to the UA.

In my mind, the requirement concerns the part of the protocol where the 
TTS server sends the audio to the UA. If that is done in "the right 
way", latency could be minimised, compared to the worst case where the 
entire synthesised result would have to be sent in one big audio chunk 
before playback could start.
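
To make the difference concrete, here is a minimal Python sketch (an
illustration only, not part of any proposal). It assumes a hypothetical
TTS server that synthesises fixed-length audio chunks at a constant rate
and a fixed one-way network delay, and it ignores transfer time. The
names synthesise, first_audio_streamed and first_audio_single_chunk are
made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    ready_at: float  # seconds after the request when the server has this chunk
    duration: float  # seconds of audio contained in the chunk


def synthesise(num_chunks: int, synth_time: float, duration: float) -> List[Chunk]:
    """Hypothetical server: finishes one chunk every `synth_time` seconds."""
    return [Chunk(ready_at=(i + 1) * synth_time, duration=duration)
            for i in range(num_chunks)]


def first_audio_streamed(chunks: List[Chunk], network_delay: float) -> float:
    """Each chunk is sent as soon as it is ready; playback can start
    when the first chunk arrives at the UA."""
    return chunks[0].ready_at + network_delay


def first_audio_single_chunk(chunks: List[Chunk], network_delay: float) -> float:
    """Worst case: the server waits for the last chunk and sends everything
    at once, so playback cannot start before the whole result arrives."""
    return chunks[-1].ready_at + network_delay


if __name__ == "__main__":
    # Assumed figures: 10 chunks, 0.2 s to synthesise each, 50 ms network delay.
    chunks = synthesise(num_chunks=10, synth_time=0.2, duration=1.0)
    print("streamed:    ", first_audio_streamed(chunks, network_delay=0.05))
    print("single chunk:", first_audio_single_chunk(chunks, network_delay=0.05))
```

Under these assumed figures the streamed case can start playback after
the first chunk arrives, while the single-chunk case has to wait for the
synthesis of all ten chunks, so the gap grows with the length of the
utterance.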


I am aware that this raises a bigger issue, about communicating with 
speech servers. While this is maybe not a direct part of the 
UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it 
is an integral part of what this group needs to address.

The obvious shortcut would be to have TTS/ASR done only in the UA, and I 
hope we can agree that this is not an acceptable option...

Kind regards,
Marc


On 05.11.10 11:17, Bjorn Bringert wrote:
> On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
>
>     On 11/05/2010 08:42 AM, Robert Brown wrote:
>
>         Agreed that the server case is out of scope.  I wonder if there's
>         anything that could be said about the client.  Perhaps it could be
>         rewritten as "user agents should provide/playback rendered TTS audio
>         to the app immediately as it's received from the TTS service".
>
>     This might actually be quite "wrong" wording if we're going to
>     extend HTML5's media elements to provide TTS.
>     The application may want to cache the result from the TTS engine
>     before playing it out.
>
>
> Yes, such wording would prohibit the HTMLMediaElement autobuffer
> attribute from working, which ironically would increase latency.
>
> I think that the only point of a latency requirement would be to make
> sure that spec proposals don't prohibit low-latency processing. For
> example, a spec that requires that audio capture must finish without the
> user aborting it before any audio transmission or speech processing is
> allowed to take place would force high latency in implementations.
>
> For recognition latency (the next requirement), maybe something like
> this would be appropriate: "Implementations should be allowed to start
> processing captured audio before the capture completes." For TTS, I
> don't think that such a requirement is needed, since the text to
> synthesize is typically available immediately (as opposed to captured
> audio which becomes available at a fixed rate).
>
>
>         Hmm... could do with better wording, and may be just stating the
>         obvious.
>
>         -----Original Message-----
>         From: Satish Sampath [mailto:satish@google.com]
>         Sent: Thursday, November 04, 2010 3:08 PM
>         To: Robert Brown
>         Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>         Subject: Re: R18. User perceived latency of synthesis must be minimized
>
>         This seems more of a requirement on the speech service which
>         synthesizes the audio, than the UA, since usually the complexity
>         lies in the synthesizer. I equate this to a requirement like
>         'web pages must load as fast as possible' which in reality turns
>         into 'web servers should process received requests as fast as
>         they can' and the latter is really up to the implementation
>         based on a lot of factors which are not in the control of the UA.
>
>         If we agree that to be the case, I think it is out of scope.
>
>         Cheers Satish
>
>
>
>         On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown <Robert.Brown@microsoft.com> wrote:
>
>             It may just be a requirement that's really obvious.
>
>
>
>             From: public-xg-htmlspeech-request@w3.org
>             [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>             Sent: Thursday, November 04, 2010 1:27 PM
>             To: Dan Burnett
>             Cc: public-xg-htmlspeech@w3.org
>             Subject: Re: R18. User perceived latency of synthesis must be minimized
>
>
>
>             I don't see a need for this to be a requirement. It's up to
>             implementations to be fast, and it's unrealistic to set any
>             specific latency limits.
>
>
>
>             On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com> wrote:
>
>             Group,
>
>             This is the next of the requirements to discuss and prioritize
>             based on our ranking approach [1].
>
>             This email is the beginning of a thread for questions,
>             discussion, and opinions regarding our first draft of
>             Requirement 18 [2].
>
>             Please discuss via email as we agreed at the Lyon f2f meeting.
>             Outstanding points of contention will be discussed live at
>             the next teleconference.
>
>             -- dan
>
>             [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>             [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>
>
>
>             -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave
>             House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in
>             England Number: 3977902
>
> --
> Bjorn Bringert
> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> Palace Road, London, SW1W 9TQ
> Registered in England Number: 3977902
>

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Project leader for DFKI in SSPNet http://sspnet.eu
Project leader PAVOQUE http://mary.dfki.de/pavoque
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Team Leader DFKI TTS Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Received on Friday, 5 November 2010 13:30:24 UTC