- From: Marc Schroeder <marc.schroeder@dfki.de>
- Date: Mon, 05 Sep 2011 10:12:02 +0200
- To: Robert Brown <Robert.Brown@microsoft.com>
- CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Hi Robert,

I have given this some thought, and I think that, while the TTS mechanics of the protocol will be almost the same in each case, for educational purposes we might want three TTS examples that cover:

(1) unplannable just-in-time TTS, where the next utterance depends on the user's most recent action, e.g. a dialogue system or an in-car navigation system;

(2) bulk TTS requests for pre-determined elements of the web page (say, spoken help texts to be played on mouseover);

(3) a multimodal presentation example, showcasing the requirement for marking points in time in the synthesized audio ("would you like to sit HERE at the window, or rather HERE at the aisle?").

For each of these, the protocol document would give only the use case scenario and the web API integration, in the form of a short explanatory story, but would flesh out the protocol aspects in detail. Examples (1) and (2) could use either plain text or SSML, whereas (3) must be SSML with <ssml:mark> tags.

Does that match what you had in mind? Story text for each example could be as follows (see below). If you think these examples make sense, I would be most grateful if you could write the protocol part that goes with them... it is beyond me at the moment.

Thanks and best,
Marc


(1) The most straightforward use case for TTS is the synthesis of one utterance at a time. This is unavoidable for just-in-time rendering of speech, for example in dialogue systems or in in-car navigation scenarios. Here, the web application sends a single speech synthesis request to the speech service and retrieves the resulting speech output as described (elsewhere). On the protocol level, the synthesis of a single utterance would look as follows.

(a) Plain-text example

The utterance to be spoken can be sent as plain text. In this case, it is necessary to specify the language to use:

(protocol level details here...)

(b) Speech Synthesis Markup example

For richer markup of the text, it is possible to use the SSML format for sending an annotated request, for example to propose an appropriate pronunciation or to indicate where to insert pauses (example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):

  <?xml version="1.0"?>
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    Please make your choice. <break time="3s"/>
    Click any of the buttons to indicate your preference.
  </speak>

(protocol level details here...)

(2) Some use cases require relatively static speech output which is already known at the time the web page is loaded. In these cases, all required speech output can be requested in parallel, as multiple concurrent requests. Callback methods in the web API relate each speech stream to the appropriate place in the web application. On the protocol level, requesting multiple speech streams concurrently is realized as follows. (For educational purposes, maybe use several languages and voices but plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm George." (UK English), or "Hallo, ich heiße Peter." (German).)

(protocol level details here...)
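Since I cannot write the protocol part properly myself, let me at least sketch how I naively imagine the wire exchange for (1)(a), based on my limited understanding of MRCPv2-style messaging. Everything below (the "html-speech/1.0" version token, the method names, the request ID 3257, and all headers) is a placeholder guess of mine, to be replaced by whatever draft-04 actually defines:

  C->S:  html-speech/1.0 SPEAK 3257
         Resource-ID: synthesizer
         Speech-Language: en-US
         Content-Type: text/plain

         Please make your choice.

  S->C:  html-speech/1.0 3257 200 IN-PROGRESS
         Resource-ID: synthesizer

  S->C:  (binary audio frames, correlated to request 3257)

  S->C:  html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
         Resource-ID: synthesizer
         Completion-Cause: 000 normal

For (1)(b), I would expect the only difference to be Content-Type: application/ssml+xml, with the SSML document as the message body and the language taken from its xml:lang attribute rather than from a header. For (2), since each request carries its own request ID, the client could presumably fire off several SPEAK requests back to back (say, 3258 for the Spanish utterance, 3259 for the English one, 3260 for the German one, each with its own language and voice headers) and correlate audio frames and completion events by request ID. But please correct all of this against the draft.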
(3) In order to synchronize the speech content with other events in the web application, it is possible to mark relevant points in time using the SSML <mark> tag. When the speech is played back, a callback method is called for each of these markers, allowing the web application to present, e.g., visual displays synchronously (example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):

  <?xml version="1.0"?>
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    Would you like to sit <mark name="window_seat"/> here at the window,
    or rather <mark name="aisle_seat"/> here at the aisle?
  </speak>

(protocol level details here...)
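For (3), again only my naive guess at the wire level, borrowing the SPEECH-MARKER event and the Speech-Marker header from MRCPv2; the request ID and the timestamps are invented:

  S->C:  html-speech/1.0 SPEECH-MARKER 3261 IN-PROGRESS
         Resource-ID: synthesizer
         Speech-Marker: timestamp=857206027059;window_seat

  (audio frames continue...)

  S->C:  html-speech/1.0 SPEECH-MARKER 3261 IN-PROGRESS
         Resource-ID: synthesizer
         Speech-Marker: timestamp=857206030124;aisle_seat

One point we should probably spell out in the story: if synthesis runs faster than playback, these events reach the client before the corresponding audio is actually heard, so it would be the user agent's job to hold on to the marker timestamps and fire the web API callbacks at the matching playback positions.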
On 30.08.11 15:46, Robert Brown wrote:
> Hi Marc,
>
> No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
>
> Cheers,
>
> /Rob
> ________________________________________
> From: Marc Schroeder [marc.schroeder@dfki.de]
> Sent: Tuesday, August 30, 2011 6:15 AM
> To: Robert Brown
> Subject: Re: more protocol examples for synthesis?
>
> Hi Robert,
>
> sorry for my relative silence recently, time is in short supply at my end at the moment.
>
> I definitely think there should be some more synthesis examples. I can certainly think of some, and attempt, with my limited understanding of MRCP and thus of this protocol, to formulate them.
>
> An issue might be time; until when are they needed?
>
> Best wishes,
> Marc
>
> On 30.08.11 02:13, Robert Brown wrote:
>> Hi Marc,
>>
>> I was wondering if you think the protocol draft needs any additional synthesis examples?
>>
>> (here's the current link:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html )
>>
>> Also, if you think it does, would you like to write them?
>>
>> Let me know what you think.
>>
>> Cheers,
>>
>> /Rob

--
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany

--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313

Received on Monday, 5 September 2011 08:12:43 UTC