- From: Bjorn Bringert <bringert@google.com>
- Date: Mon, 8 Nov 2010 22:17:11 +0000
- To: Satish Sampath <satish@google.com>
- Cc: "Young, Milan" <Milan.Young@nuance.com>, Olli@pettay.fi, Marc Schroeder <marc.schroeder@dfki.de>, Robert Brown <Robert.Brown@microsoft.com>, Dan Burnett <dburnett@voxeo.com>, public-xg-htmlspeech@w3.org
It's probably hard to specify exactly what UAs must do with respect to recording and playback timing etc. At the requirements stage, I think it's most important to specify what any specs must allow. This includes, for example, that any proposals must be streaming-friendly.

As for fidelity in passing parameters to speech services and returning results, the first priority for the UA should be to ensure that the UA <-> web app communication follows the HTML Speech API, e.g. with respect to the ordering of events, whatever the spec for that ends up being. Then there should be a standard API for extensions, including for example extra parameters to pass to speech services and extra events and metadata to get back. That is, there should be a common interface that web apps get regardless of the speech service implementation used, and there may also be implementation-specific extensions. Without this, we are just adding a very thin layer on top of a specific speech service implementation, which results in vendor lock-in. (A hypothetical sketch of what such a common-plus-extensions interface could look like appears after the quoted thread below.)

If a very direct interface to a specific speech service implementation is required, it would probably be better to use a generic audio capture / playback API instead of an HTML Speech API.

Out of interest, at what level is this sort of thing specified in VoiceXML?

/Bjorn

On Mon, Nov 8, 2010 at 10:00 PM, Satish Sampath <satish@google.com> wrote:
> Aren't servers already capable of enforcing such requirements? For example, the server could choose to honour only a chunked transfer request instead of a bulk HTTP POST, thereby only allowing UAs that support streaming to work with it. Realistically, though, servers would want to serve as many users as they can in a wide variety of network conditions; for example, many users on mobile networks may not be able to stream out audio due to proxies. I think the spec should be agnostic to the network conditions and not enforce any such requirements.
>
> Cheers
> Satish
>
> On Mon, Nov 8, 2010 at 9:11 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>> Hello Olli,
>>
>> The current set of requirements certainly does make provision for techniques such as streaming. But there is nothing, for example, that *requires* a UA to use streaming if that is what the speech server desires.
>>
>> The thrust of my new requirements is to afford the speech server authority over the low-level details of the speech dialog. This includes everything from chunking, endpointing, codecs, parameter/result passing, event timings/mappings, etc.
>>
>> If we can figure out how to roll this concept into the existing requirements, that's fine with me.
>>
>> Thank you
>>
>> -----Original Message-----
>> From: Olli Pettay [mailto:Olli.Pettay@helsinki.fi]
>> Sent: Monday, November 08, 2010 12:07 PM
>> To: Young, Milan
>> Cc: Bjorn Bringert; Marc Schroeder; Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>
>> On 11/08/2010 07:07 PM, Young, Milan wrote:
>>> I'm glad to see consensus on R18, but I suspect that we have not fundamentally addressed Marc's concern regarding communication with speech servers.
>>>
>>> Up to this point, most requirements have focused on the application author's interface with the UA. Any requirements around the UA's integration with the speech server seem implied at best. While the details of such protocol integrations are probably outside the domain of the W3C, I believe it is still incumbent on our recommendation to address the general framework.
>>>
>>> I'd like to suggest a new set of requirements:
>>>
>>> · Speech resources must be allowed to write to the audio buffer for the complete duration of the real-time audio rendering.
>>>
>>> · Speech resources must be allowed to read from the audio buffer for the complete duration of the real-time audio capture.
>>
>> I don't quite understand these two. The first one, I think, talks about the "TTS" buffer and the latter about the "ASR" buffer, right? If so, the wording could be clearer, and I'm not sure what benefit we get from such requirements compared to R18 and R17.
>>
>>> · User agents must allow passing parameters (structured or otherwise) from the application to the speech resource.
>>
>> R22 takes care of this, though I'm not sure now whether that has already been merged with some other requirement. And R22 is neither quite clear nor right. By default we should assume that local speech engines work in a compatible way and that no recognizer-specific parameters are needed (we don't want browser-specific web apps). Network services are quite a different thing; there, recognizer-specific parameters make more sense.
>>
>>> · User agents must not filter results from the speech resource to the application. (Perhaps this is already handled by R6?)
>>
>> I'd leave this to the UA. Right now I can't see any reason why a UA would filter anything, but I also don't understand why that should be prohibited. Actually, a UA for kids, for example, might want to filter out some results. But anyway, I'd leave the decision of filtering to the UA and wouldn't add any requirement for it.
>>
>>> · User agents must not interfere with the timing of speech events from the speech resource to the application.
>>
>> Why would we need this requirement? R17+R18 should be enough, IMO.
>>
>> -Olli
>>
>>> Thoughts?
>>>
>>> ------------------------------------------------------------------------
>>>
>>> *From:* public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] *On Behalf Of* Bjorn Bringert
>>> *Sent:* Friday, November 05, 2010 7:00 AM
>>> *To:* Marc Schroeder
>>> *Cc:* Olli@pettay.fi; Robert Brown; Satish Sampath; Dan Burnett; public-xg-htmlspeech@w3.org
>>> *Subject:* Re: R18. User perceived latency of synthesis must be minimized
>>>
>>> Ah, right, this is a symmetrical situation to the possible replaced speech recognition requirement that I proposed. As you say, for TTS it is the output that must be allowed to be streamed. How about rewording R18 to make this clearer:
>>>
>>> New R18: "Implementations should be allowed to start playing back synthesized speech before the complete results of the speech synthesis request are available."
>>>
>>> The intention is to rule out proposals that make streaming implementations impossible, but not to require implementations to implement streaming (since that might not make sense for all implementations).
>>>
>>> /Bjorn
>>>
>>> On Fri, Nov 5, 2010 at 2:29 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>>
>>> Dear all,
>>>
>>> interesting question whether this is in scope or not. I think I brought up the original discussion point, so here is what I had in mind.
>>>
>>> Let's assume for the moment that Bjorn's proposal of a <tts> element (or some other mechanism for triggering TTS output) was accepted, and that furthermore we agreed on the "cloud" part of R20 (letting the web application author select a TTS engine on some server).
>>>
>>> Then it seems inevitable that the UA and the TTS server need to communicate with one another using some protocol: the UA would have to send the TTS server the request, and the TTS server would have to send the synthesised result back to the UA.
>>>
>>> In my mind, the requirement concerns the part of the protocol where the TTS server sends the audio to the UA. If that is done in "the right way", latency could be minimised, compared to the worst case where the entire result would have to be sent in one big audio chunk before playback could start.
>>>
>>> I am aware that this raises a bigger issue, about communicating with speech servers. While this is maybe not a direct part of the UA/DOM/JavaScript API (as illustrated by Google's proposals), to me it is an integral part of what this group needs to address.
>>>
>>> The obvious shortcut would be to have TTS/ASR done only in the UA, and I hope we can agree that this is not an acceptable perspective...
>>>
>>> Kind regards,
>>> Marc
>>>
>>> On 05.11.10 11:17, Bjorn Bringert wrote:
>>>
>>> On Fri, Nov 5, 2010 at 10:57 AM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
>>>
>>> On 11/05/2010 08:42 AM, Robert Brown wrote:
>>>
>>> Agreed that the server case is out of scope. I wonder if there's anything that could be said about the client. Perhaps it could be rewritten as "user agents should provide/play back rendered TTS audio to the app immediately as it's received from the TTS service".
>>>
>>> This might actually be quite "wrong" wording if we're going to extend HTML5's media elements to provide TTS. The application may want to cache the result from the TTS engine before playing it out.
>>>
>>> Yes, such wording would prohibit the HTMLMediaElement autobuffer attribute from working, which ironically would increase latency.
>>>
>>> I think that the only point of a latency requirement would be to make sure that spec proposals don't prohibit low-latency processing. For example, a spec that requires that audio capture must finish, without the user aborting it, before any audio transmission or speech processing is allowed to take place would force high latency in implementations.
>>>
>>> For recognition latency (the next requirement), maybe something like this would be appropriate: "Implementations should be allowed to start processing captured audio before the capture completes." For TTS, I don't think that such a requirement is needed, since the text to synthesize is typically available immediately (as opposed to captured audio, which becomes available at a fixed rate).
>>>
>>> Hmm... could do with better wording, and may be just stating the obvious.
>>>
>>> -----Original Message-----
>>> From: Satish Sampath [mailto:satish@google.com]
>>> Sent: Thursday, November 04, 2010 3:08 PM
>>> To: Robert Brown
>>> Cc: Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
>>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>> This seems more of a requirement on the speech service which synthesizes the audio than on the UA, since the complexity usually lies in the synthesizer. I equate this to a requirement like 'web pages must load as fast as possible', which in reality turns into 'web servers should process received requests as fast as they can', and the latter is really up to the implementation, based on a lot of factors which are not in the control of the UA.
>>>
>>> If we agree that to be the case, I think it is out of scope.
>>>
>>> Cheers
>>> Satish
>>>
>>> On Thu, Nov 4, 2010 at 10:22 PM, Robert Brown <Robert.Brown@microsoft.com> wrote:
>>>
>>> It may just be a requirement that's really obvious.
>>>
>>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>> Sent: Thursday, November 04, 2010 1:27 PM
>>> To: Dan Burnett
>>> Cc: public-xg-htmlspeech@w3.org
>>> Subject: Re: R18. User perceived latency of synthesis must be minimized
>>>
>>> I don't see a need for this to be a requirement. It's up to implementations to be fast, and it's unrealistic to set any specific latency limits.
>>>
>>> On Thu, Nov 4, 2010 at 9:23 PM, Dan Burnett <dburnett@voxeo.com> wrote:
>>>
>>> Group,
>>>
>>> This is the next of the requirements to discuss and prioritize based on our ranking approach [1].
>>>
>>> This email is the beginning of a thread for questions, discussion, and opinions regarding our first draft of Requirement 18 [2].
>>>
>>> Please discuss via email as we agreed at the Lyon f2f meeting. Outstanding points of contention will be discussed live at the next teleconference.
>>>
>>> -- dan
>>>
>>> [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>> [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r18
>>>
>>> --
>>> Bjorn Bringert
>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>> Registered in England Number: 3977902
>>>
>>> --
>>> Bjorn Bringert
>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>> Registered in England Number: 3977902
>>>
>>> --
>>> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
>>> Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
>>> Project leader for DFKI in SSPNet http://sspnet.eu
>>> Project leader PAVOQUE http://mary.dfki.de/pavoque
>>> Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
>>> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
>>> Portal Editor http://emotion-research.net
>>> Team Leader DFKI TTS Group http://mary.dfki.de
>>>
>>> Homepage: http://www.dfki.de/~schroed
>>> Email: marc.schroeder@dfki.de
>>> Phone: +49-681-85775-5303
>>> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
>>> --
>>> Official DFKI coordinates:
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>> Amtsgericht Kaiserslautern, HRB 2313
>>>
>>> --
>>> Bjorn Bringert
>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>> Registered in England Number: 3977902

--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
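To make concrete the point in Bjorn's message above about a common web-app interface with implementation-specific extensions, here is a purely hypothetical TypeScript sketch. Every name in it (SpeechRecognizer, setServiceParameter, the example service URI and vendor parameter) is invented for illustration and does not come from this thread or from any standardized API; it only shows how a service-independent surface could coexist with an explicit extension channel for service-specific parameters and result metadata.

// Hypothetical sketch only: these names are invented for illustration and do
// not come from the thread or from any shipped browser API.

interface RecognitionResult {
  transcript: string;
  confidence: number;  // 0.0 .. 1.0
  isFinal: boolean;    // false for streaming partial results
}

// Common, service-independent surface every UA would expose to web apps.
interface SpeechRecognizer {
  start(): void;
  stop(): void;
  onresult?: (result: RecognitionResult) => void;
  // Extension channel: opaque key/value pairs forwarded verbatim to the
  // speech service, so vendor-specific tuning never leaks into the core API.
  setServiceParameter(name: string, value: string): void;
  // Vendor-specific metadata returned alongside results, if any.
  getServiceMetadata(): ReadonlyMap<string, string>;
}

// Toy in-memory implementation so the sketch runs; a real UA would stream
// captured audio to the configured speech service instead.
function createRecognizer(serviceUri?: string): SpeechRecognizer {
  const params = new Map<string, string>();
  const metadata = new Map<string, string>();
  return {
    start() {
      // Pretend the service produced a partial and then a final result.
      metadata.set("x-service-uri", serviceUri ?? "user-agent-default");
      metadata.set("x-params-sent", String(params.size));
      this.onresult?.({ transcript: "hello", confidence: 0.4, isFinal: false });
      this.onresult?.({ transcript: "hello world", confidence: 0.92, isFinal: true });
    },
    stop() { /* no-op in this toy implementation */ },
    setServiceParameter(name, value) { params.set(name, value); },
    getServiceMetadata() { return metadata; },
  };
}

// Usage: the application code stays the same whichever service the UA talks
// to; only the clearly marked extension call below is service-specific.
const rec = createRecognizer("https://speech.example.com/reco"); // hypothetical endpoint
rec.setServiceParameter("x-vendor-noise-model", "car");          // extension; may be ignored
rec.onresult = (r) => {
  if (r.isFinal) console.log(`heard: ${r.transcript} (${r.confidence.toFixed(2)})`);
};
rec.start();

The split this is meant to illustrate is the one argued for above: the core members behave identically across UAs and speech services, while anything passed through setServiceParameter is explicitly allowed to be ignored, so applications degrade gracefully instead of being tied to one vendor's service.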
Received on Monday, 8 November 2010 22:17:44 UTC