RE: R18. User perceived latency of synthesis must be minimized

I'm glad to see consensus on R18, but I suspect we have not fundamentally addressed Marc's concern regarding communication with speech servers.


Up to this point, most requirements have focused on the application author's interface with the UA.  Any requirements around the UA's integration with the speech server are implied at best.  While the details of such protocol integrations are probably outside the W3C's domain, I believe it is still incumbent on our recommendation to address the general framework.


I'd like to suggest a new set of requirements (a rough sketch of what they imply follows the list):

* Speech resources must be allowed to write to the audio buffer for the complete duration of real-time audio rendering.

* Speech resources must be allowed to read from the audio buffer for the complete duration of real-time audio capture.

* User agents must allow passing parameters (structured or otherwise) from the application to the speech resource.

* User agents must not filter results from the speech resource to the application. (Perhaps this is already handled by R6?)

* User agents must not interfere with the timing of speech events from the speech resource to the application.
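
To make the intent concrete, here is a rough sketch of the integration points these requirements imply. It is purely illustrative (hypothetical names, TypeScript-style pseudocode), not an API proposal:

    // Hypothetical sketch of the UA <-> speech resource contract implied
    // by the requirements above; all names are made up for illustration.
    type SpeechParams = Record<string, string | object>; // structured or otherwise

    interface AudioBufferStream {
      write(chunk: ArrayBuffer): void; // open for the full duration of rendering
      read(): ArrayBuffer | null;      // open for the full duration of capture
    }

    interface SpeechResource {
      // Parameters from the application reach the resource via the UA unmodified.
      synthesize(text: string, params: SpeechParams, out: AudioBufferStream): void;
      recognize(params: SpeechParams, input: AudioBufferStream): void;
      // Results and events pass through to the application unfiltered,
      // with their timing preserved by the UA.
      onResult: (result: object) => void;
      onSpeechEvent: (name: string, timeMs: number) => void;
    }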


Thoughts?


________________________________

From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
Sent: Friday, November 05, 2010 7:00 AM
To: Marc Schroeder
Cc: Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
Subject: Re: R18. User perceived latency of synthesis must be minimized


Ah, right, this is symmetrical to the possible replacement speech recognition requirement that I proposed. As you say, for TTS it is the output that must be allowed to be streamed. How about rewording R18 to make this clearer:


New R18: "Implementations should be allowed to start playing back synthesized speech before the complete results of the speech synthesis request are available."


The intention is to rule out proposals that make streaming implementations impossible, but not to require implementations to support streaming (since streaming might not make sense for all implementations).
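
For illustration only (hypothetical names, not part of the proposed wording), a streaming implementation could look like the sketch below, while a buffering implementation would remain equally conformant:

    // Hypothetical sketch: the reworded R18 permits this, but does not require it.
    async function playSynthesis(
      chunks: AsyncIterable<ArrayBuffer>,  // audio from the TTS service
      play: (chunk: ArrayBuffer) => void,  // hands audio to the output device
    ): Promise<void> {
      for await (const chunk of chunks) {
        play(chunk); // playback can start as soon as the first chunk arrives
      }
    }
    // A non-streaming implementation stays conformant by collecting all
    // chunks and calling play() once at the end.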


/Bjorn

On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:

Dear all,

It's an interesting question whether this is in scope or not. I think I brought up the original discussion point, so here is what I had in mind.

Let's assume for the moment that Bjorn's proposal of a <tts> element (or some other mechanism for triggering TTS output) was accepted, and that furthermore we agreed on the "cloud" part of R20 (letting the web application author select a TTS engine on some server).

Then it seems inevitable that the UA and the TTS server need to communicate with one another using some protocol: the UA would have to send the TTS server the request, and the TTS server would have to send the synthesised result back to the UA.

In my mind, the requirement concerns the part of the protocol where the TTS server sends the audio to the UA. If that is done in "the right way", latency could be minimised, compared to the worst case where the entire result would have to be sent in one big audio chunk before playback could start.
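
For example ("the right way" sketched hypothetically here, assuming an HTTP-style protocol and Node-style server APIs; the actual framework is exactly what would need defining), the TTS server could flush audio to the UA as it is synthesised:

    // Hypothetical server-side sketch: send audio chunks as they are
    // produced, instead of buffering the complete result.
    import * as http from "http";

    function serveTts(
      res: http.ServerResponse,
      synthesise: () => Iterable<Buffer>, // yields audio as it is produced
    ): void {
      res.writeHead(200, { "Content-Type": "audio/ogg" });
      for (const chunk of synthesise()) {
        res.write(chunk); // each chunk reaches the UA immediately
      }
      res.end();
      // Worst case for latency: collect every chunk first and write once.
    }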


I am aware that this raises a bigger issue, about communicating with speech servers. While this is maybe not a direct part of the UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it is an integral part of what this group needs to address.

The obvious shortcut would be to have TTS/ASR done only in the UA, and I hope we can agree that this is not an acceptable perspective...

Kind regards,
Marc




On 05.11.10 11:17, Bjorn Bringert wrote:

	On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
	
	   On 11/05/2010 08:42 AM, Robert Brown wrote:
	
	       Agreed that the server case is out of scope.  I wonder if there's
	       anything that could be said about the client.  Perhaps it could be
	       rewritten as "user agents should provide/playback rendered TTS audio
	       to the app immediately as it's received from the TTS service".
	
	   This might actually be quite "wrong" wording if we're going to
	   extend HTML5's media elements to provide TTS.
	   The application may want to cache the result from TTS engine before
	   playing it out.
	
	
	Yes, such wording would prohibit the HTMLMediaElement autobuffer
	attribute from working, which ironically would increase latency.
	
	I think that the only point of a latency requirement would be to make
	sure that spec proposals don't prohibit low-latency processing. For
	example, a spec requiring that audio capture finish (without the user
	aborting it) before any audio transmission or speech processing may
	take place would force high latency in implementations.
	
	For recognition latency (the next requirement), maybe something like
	this would be appropriate: "Implementations should be allowed to start
	processing captured audio before the capture completes." For TTS, I
	don't think that such a requirement is needed, since the text to
	synthesize is typically available immediately (as opposed to captured
	audio which becomes available at a fixed rate).
	
	
	       Hmm... could do with better wording, and may be just stating the
	       obvious.
	
	       -----Original Message-----
	       From: Satish Sampath [mailto:satish@google.com]
	       Sent: Thursday, November 04, 2010 3:08 PM
	       To: Robert Brown
	       Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
	       Subject: Re: R18. User perceived latency of synthesis must be minimized
	
	       This seems more of a requirement on the speech service which
	       synthesizes the audio than on the UA, since usually the complexity
	       lies in the synthesizer. I equate this to a requirement like 'web
	       pages must load as fast as possible', which in reality turns into
	       'web servers should process received requests as fast as they
	       can', and the latter is really up to the implementation, based on
	       a lot of factors which are not in the control of the UA.
	
	       If we agree that to be the case, I think it is out of scope.
	
	       Cheers Satish
	
	
	
	       On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown <Robert.Brown@microsoft.com> wrote:

	
	
	           It may just be a requirement that's really obvious.
	
	
	
	           From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
	           Sent: Thursday, November 04, 2010 1:27 PM
	           To: Dan Burnett
	           Cc: public-xg-htmlspeech@w3.org
	           Subject: Re: R18. User perceived latency of synthesis must be minimized
	
	
	
	           I don't see a need for this to be a requirement. It's up to
	           implementations to be fast, and it's unrealistic to set any
	           specific latency limits.
	
	
	
	           On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com> wrote:
	
	           Group,
	
	           This is the next of the requirements to discuss and prioritize
	           based on our ranking approach [1].
	
	           This email is the beginning of a thread for questions,
	           discussion, and opinions regarding our first draft of
	           Requirement 18 [2].

	           Please discuss via email as we agreed at the Lyon f2f meeting.
	           Outstanding points of contention will be discussed live at the
	           next teleconference.
	
	           -- dan
	
	           [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
	           [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18


-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Project leader for DFKI in SSPNet http://sspnet.eu
Project leader PAVOQUE http://mary.dfki.de/pavoque
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Team Leader DFKI TTS Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313




-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Monday, 8 November 2010 17:07:41 UTC