From: Satish Sampath <satish@google.com>
Date: Mon, 8 Nov 2010 22:00:22 +0000
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Olli@pettay.fi, Bjorn Bringert <bringert@google.com>, Marc Schroeder <marc.schroeder@dfki.de>, Robert Brown <Robert.Brown@microsoft.com>, Dan Burnett <dburnett@voxeo.com>, public-xg-htmlspeech@w3.org
Aren't servers already capable of enforcing such requirements? For example, the server could choose to honour only a chunked transfer request instead of a bulk HTTP POST, thereby only allowing UAs that support streaming to work with it. Realistically, though, servers would want to service as many users as they can in a wide variety of network conditions. For example, many users on mobile networks may not be able to stream out audio due to proxies. I think the spec should be agnostic to the network conditions and not enforce any such requirements.

Cheers
Satish

On Mon, Nov 8, 2010 at 9:11 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> Hello Olli,
>
> The current set of requirements certainly makes provision for techniques
> such as streaming. But there is nothing, for example, that *requires* a UA
> to use streaming if that is what is desired by the speech server.
>
> The thrust of my new requirements is to afford the speech server authority
> over the low-level details of the speech dialog. This includes everything
> from chunking, endpointing, codecs, parameter/result passing, event
> timings/mappings, etc.
>
> If we can figure out how to roll this concept into the existing
> requirements, that's fine with me.
>
> Thank you
>
>
> -----Original Message-----
> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
> Sent: Monday, November 08, 2010 12:07 PM
> To: Young, Milan
> Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
> Subject: Re: R18. User perceived latency of synthesis must be minimized
>
> On 11/08/2010 07:07 PM, Young, Milan wrote:
>> I'm glad to see consensus on R18, but I suspect that we have not
>> fundamentally addressed Marc's concern regarding communication with
>> speech servers.
>>
>> Up to this point, most requirements have been focused on the application
>> author's interface with the UA. Any requirements around the UA's
>> integration with the speech server seem inferred at best. While the
>> details of such protocol integrations are probably outside the domain of
>> the W3C, I believe it is still incumbent on our recommendation to address
>> the general framework.
>>
>> I'd like to suggest a new set of requirements:
>>
>> · Speech resources must be allowed to write to the audio buffer for the
>>   complete duration of the real-time audio rendering.
>>
>> · Speech resources must be allowed to read from the audio buffer for the
>>   complete duration of the real-time audio capture.
>
> I don't quite understand these two. The first one, I think, talks about
> the "TTS" buffer and the latter one about the "ASR" buffer, right?
> If so, the wording could be clearer, and I'm not sure what benefit we get
> from such requirements compared to R18 and R17.
>
>> · User agents must allow passing parameters (structured or otherwise)
>>   from the application to the speech resource.
>
> R22 takes care of this, though I'm not sure now whether that got merged
> already with some other requirement. And R22 is not quite clear nor right.
> By default we should assume that local speech engines work in a compatible
> way and that no recognizer-specific parameters are needed (we don't want
> browser-specific web apps). Network services are quite a different thing;
> there, recognizer-specific parameters make more sense.
>
>> · User agents must not filter results from the speech resource to the
>>   application. (Perhaps this is already handled by R6?)
>
> I'd leave this to the UA.
> Right now I can't see any reason why a UA would filter anything, but I
> also don't understand why that should be prohibited. Actually, some UAs,
> for kids for example, might want to filter out some results. But anyway,
> I'd leave the decision about filtering to the UA and wouldn't add any
> requirement for it.
>
>> · User agents must not interfere with the timing of speech events from
>>   the speech resource to the application.
>
> Why would we need this requirement? R17+R18 should be enough, IMO.
>
>
> -Olli
>
>
>> Thoughts?
>>
>> ------------------------------------------------------------------------
>>
>> From: public-xg-htmlspeech-request@w3.org
>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>> Sent: Friday, November 05, 2010 7:00 AM
>> To: Marc Schroeder
>> Cc: Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett;
>> public-xg-htmlspeech@w3.org
>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>> Ah, right, this is a symmetrical situation to the possible replacement
>> speech recognition requirement that I proposed. As you say, for TTS it is
>> the output that must be allowed to be streamed. How about rewording R18
>> to make this clearer:
>>
>> New R18: "Implementations should be allowed to start playing back
>> synthesized speech before the complete results of the speech synthesis
>> request are available."
>>
>> The intention is to rule out proposals that make streaming
>> implementations impossible, but not to require implementations to
>> implement streaming (since that might not make sense for all
>> implementations).
>>
>> /Bjorn
>>
>> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>
>> Dear all,
>>
>> interesting question whether this is in scope or not. I think I brought
>> up the original discussion point, so here is what I had in mind.
>>
>> Let's assume for the moment that Bjorn's proposal of a <tts> element (or
>> some other mechanism for triggering TTS output) was accepted, and that
>> furthermore we agreed on the "cloud" part of R20 (letting the web
>> application author select a TTS engine on some server).
>>
>> Then it seems inevitable that the UA and the TTS server need to
>> communicate with one another using some protocol: the UA would have to
>> send the TTS server the request, and the TTS server would have to send
>> the synthesised result back to the UA.
>>
>> In my mind, the requirement concerns the part of the protocol where the
>> TTS server sends the audio to the UA. If that is done in "the right way",
>> latency could be minimised, compared to the worst case where the entire
>> synthesis result would have to be sent in one big audio chunk before
>> playback could start.
>>
>> I am aware that this raises a bigger issue, about communicating with
>> speech servers. While this is maybe not a direct part of the
>> UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it is
>> an integral part of what this group needs to address.
>>
>> The obvious shortcut would be to have TTS/ASR done only in the UA, and I
>> hope we can agree that this is not an acceptable perspective...
>> Kind regards,
>> Marc
>>
>>
>> On 05.11.10 11:17, Bjorn Bringert wrote:
>>
>> On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
>>
>> On 11/05/2010 08:42 AM, Robert Brown wrote:
>>
>> Agreed that the server case is out of scope. I wonder if there's anything
>> that could be said about the client. Perhaps it could be rewritten as
>> "user agents should provide/play back rendered TTS audio to the app
>> immediately as it's received from the TTS service".
>>
>> This might actually be quite "wrong" wording if we're going to extend
>> HTML5's media elements to provide TTS. The application may want to cache
>> the result from the TTS engine before playing it out.
>>
>> Yes, such wording would prohibit the HTMLMediaElement autobuffer
>> attribute from working, which ironically would increase latency.
>>
>> I think that the only point of a latency requirement would be to make
>> sure that spec proposals don't prohibit low-latency processing. For
>> example, a spec that requires audio capture to finish (without the user
>> aborting it) before any audio transmission or speech processing is
>> allowed to take place would force high latency in implementations.
>>
>> For recognition latency (the next requirement), maybe something like this
>> would be appropriate: "Implementations should be allowed to start
>> processing captured audio before the capture completes." For TTS, I don't
>> think that such a requirement is needed, since the text to synthesize is
>> typically available immediately (as opposed to captured audio, which
>> becomes available at a fixed rate).
>>
>> Hmm... could do with better wording, and may just be stating the obvious.
>>
>> -----Original Message-----
>> From: Satish Sampath [mailto:satish@google.com]
>> Sent: Thursday, November 04, 2010 3:08 PM
>> To: Robert Brown
>> Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>> This seems more like a requirement on the speech service which
>> synthesizes the audio than on the UA, since usually the complexity lies
>> in the synthesizer. I equate this to a requirement like 'web pages must
>> load as fast as possible', which in reality turns into 'web servers
>> should process received requests as fast as they can', and the latter is
>> really up to the implementation, based on a lot of factors which are not
>> under the control of the UA.
>>
>> If we agree that is the case, I think it is out of scope.
>>
>> Cheers
>> Satish
>>
>>
>> On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown <Robert.Brown@microsoft.com> wrote:
>>
>> It may just be a requirement that's really obvious.
>>
>> From: public-xg-htmlspeech-request@w3.org
>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>> Sent: Thursday, November 04, 2010 1:27 PM
>> To: Dan Burnett
>> Cc: public-xg-htmlspeech@w3.org
>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>> I don't see a need for this to be a requirement. It's up to
>> implementations to be fast, and it's unrealistic to set any specific
>> latency limits.
>>
>> On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com> wrote:
>>
>> Group,
>>
>> This is the next of the requirements to discuss and prioritize based on
>> our ranking approach [1].
>>
>> This email is the beginning of a thread for questions, discussion, and
>> opinions regarding our first draft of Requirement 18 [2].
>>
>> Please discuss via email as we agreed at the Lyon f2f meeting.
>> Outstanding points of contention will be discussed live at the next
>> teleconference.
>>
>> -- dan
>>
>> [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>> [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>>
>>
>> --
>> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
>> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
>> Project leader for DFKI in SSPNet http://sspnet.eu
>> Project leader PAVOQUE http://mary.dfki.de/pavoque
>> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
>> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
>> Portal Editor http://emotion-research.net
>> Team Leader DFKI TTS Group http://mary.dfki.de
>>
>> Homepage: http://www.dfki.de/~schroed
>> Email: marc.schroeder@dfki.de
>> Phone: +49-681-85775-5303
>> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
>> Saarbrücken, Germany
>> --
>> Official DFKI coordinates:
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>> Amtsgericht Kaiserslautern, HRB 2313
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
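
To make the chunked-transfer point from Satish's reply at the top of this thread concrete, the sketch below shows a client streaming captured audio to a recognition service with HTTP chunked transfer encoding instead of a single bulk POST, so the service can start processing before capture ends. It is an illustrative sketch only: the endpoint URL, media type, and file-based audio source are assumptions for illustration, not anything defined by this thread or by the requirements under discussion.

    import requests  # third-party HTTP client, used purely for illustration

    def audio_chunks(path, chunk_size=3200):
        """Yield ~100 ms chunks of 16 kHz, 16-bit PCM audio from a file.
        In a real user agent this would come from live microphone capture."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    # Passing a generator as the request body makes the library send the data
    # with "Transfer-Encoding: chunked", so the server receives audio as it is
    # produced. A server that honours only chunked uploads (as in Satish's
    # example) could simply reject a bulk POST carrying a Content-Length.
    response = requests.post(
        "https://speech.example.com/recognize",        # hypothetical endpoint
        data=audio_chunks("utterance.raw"),            # hypothetical capture
        headers={"Content-Type": "audio/L16; rate=16000"},
    )
    print(response.status_code, response.text)

Whether a given speech service accepts such a stream, and in what codec and container, is exactly the kind of UA-to-server protocol detail that the messages above debate keeping out of the UA-facing requirements.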
Received on Monday, 8 November 2010 22:00:54 UTC