- From: Robert Brown <Robert.Brown@microsoft.com>
- Date: Tue, 6 Sep 2011 17:20:25 +0000
- To: T.V Raman <raman@google.com>, "marc.schroeder@dfki.de" <marc.schroeder@dfki.de>
- CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Thanks, guys. I'll put some protocol samples together along the lines Marc
suggested. TV's suggestions would make good scripting examples for the API.
To make that concrete, here are three rough scripting sketches up front;
every name in them is a placeholder invented for illustration, and nothing
in them is agreed API.
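
First, TV's three accessibility points, expressed as a scripting
interface. This is only a sketch: "TTSPort", "speakLetter", "onmark" and
"onend" are names I'm making up here, not anything the group has agreed:

// A rough sketch of TV's three accessibility points as a scripting
// interface. Every name here is an invented placeholder, not agreed API.
interface SpokenUtterance {
  onmark: ((markName: string) => void) | null; // (2) fires per index marker
  onend: (() => void) | null;                  // (3) fires when playback completes
}

interface TTSPort {
  speakLetter(letter: string): void;           // (1) low-latency single-character path
  speak(ssml: string): SpokenUtterance;
}

// Example: a screen reader echoes a typed character immediately, then
// reads a paragraph and reacts to the index markers embedded in it.
function echoAndRead(tts: TTSPort, ch: string, paragraphSsml: string): void {
  tts.speakLetter(ch); // must react near-instantaneously
  const utterance = tts.speak(paragraphSsml);
  utterance.onmark = (markName) => console.log("passed marker:", markName);
  utterance.onend = () => console.log("finished speaking paragraph");
}

The separate speakLetter call could bypass any queueing on the normal
speak path, which is presumably what would give the instantaneous
reaction TV asks for.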
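Second, the scripting side of Marc's example (2) below: bulk concurrent
requests for static help texts, with callbacks relating each stream to
its element. "speakText" and "onready" are likewise invented placeholders:

// A sketch of Marc's example (2): bulk-request static help texts
// concurrently at page load, and relate each stream back to its element.
// "tts", "speakText" and "onready" are invented placeholders.
interface HelpUtterance {
  onready: (() => void) | null; // invented: fires once the stream has arrived
}
declare const tts: { speakText(text: string, lang: string): HelpUtterance };

const helpTexts = [
  { elementId: "btnMaria",  lang: "es-ES", text: "Hola, me llamo Maria." },
  { elementId: "btnGeorge", lang: "en-GB", text: "Hi, I'm George." },
  { elementId: "btnPeter",  lang: "de-DE", text: "Hallo, ich heiße Peter." },
];

// All requests are fired up front and proceed concurrently. Plain text
// is sent, so each request carries an explicit language.
for (const help of helpTexts) {
  const utterance = tts.speakText(help.text, help.lang);
  utterance.onready = () => {
    // relate this stream to its place in the page, e.g. enable the
    // mouseover help sound for the corresponding button
    document.getElementById(help.elementId)?.classList.add("help-ready");
  };
}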
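Third, Marc's example (3): the SSML mark callback synchronizing a visual
display with the audio. Again, "speak" and "onmark" are placeholders:

// A sketch of Marc's example (3): SSML <mark/> tags drive a callback so
// the page can highlight the matching seat while the audio plays.
// "tts", "speak" and "onmark" are the same kind of invented placeholders.
interface MarkedUtterance {
  onmark: ((markName: string) => void) | null;
}
declare const tts: { speak(ssml: string): MarkedUtterance };

const seatPrompt = `<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Would you like to sit <mark name="window_seat"/> here at the window, or
  rather <mark name="aisle_seat"/> here at the aisle?
</speak>`;

const utterance = tts.speak(seatPrompt);
utterance.onmark = (markName) => {
  // markName will be "window_seat" or "aisle_seat"; flash the matching
  // seat in the page's seat map at the moment the word is spoken
  document.getElementById(markName)?.classList.add("highlighted");
};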
-----Original Message-----
From: T.V Raman [mailto:raman@google.com]
Sent: Tuesday, September 06, 2011 8:36 AM
To: marc.schroeder@dfki.de
Cc: Robert Brown; public-xg-htmlspeech@w3.org
Subject: Re: more protocol examples for synthesis?

In addition, accessibility use cases, e.g. screen readers for the blind,
need:

1. Rapid response when speaking individual letters; in general it is
advantageous to have a separate "speakLetter" call that reacts
instantaneously.

2. Index markers, and a callback when an index marker has been processed.

3. A callback when speech synthesis playback is complete.

Marc Schroeder writes:
> Hi Robert,
>
> I have given this some thought, and I think that, while the TTS
> mechanics of the protocol will be almost the same, for educational
> purposes we might want to have three TTS examples that cover:
>
> (1) unplannable just-in-time TTS, where the next utterance depends on
> the most recent user action, e.g. a dialogue system or an in-car
> navigation system;
>
> (2) bulk TTS requests for pre-determined elements of the web page
> (say, spoken help texts to be played on mouseover);
>
> (3) a multimodal presentation example, showcasing the requirement for
> marking points in time in the synthesized audio ("would you like to
> sit HERE at the window or rather HERE at the aisle?").
>
> For each of these, the protocol document would give only the use case
> scenario and the web API integration, in the form of a short
> explanatory story, but would flesh out the protocol aspects in detail.
>
> Examples (1) and (2) could be either plain text or SSML, whereas (3)
> must be SSML with <ssml:mark> tags.
>
> Does that match what you had in mind?
>
> Story text for each of these could be as follows (see below).
>
> If you think these examples make sense, I would be most grateful if
> you could write the protocol part that goes with them... it is beyond
> me at the moment.
>
> Thanks and best,
> Marc
>
>
> (1) The most straightforward use case for TTS is the synthesis of one
> utterance at a time. This is unavoidable for just-in-time rendition of
> speech, for example in dialogue systems or in in-car navigation
> scenarios. Here, the web application sends a single speech synthesis
> request to the speech service and retrieves the resulting speech
> output as described (elsewhere).
>
> On the protocol level, the synthesis of a single utterance would look
> as follows.
>
> (a) Plain-text example
>
> The utterance to be spoken can be sent as plain text. In this case, it
> is necessary to specify the language to use:
>
> (protocol level details here...)
>
> (b) Speech Synthesis Markup example
>
> For richer markup of the text, it is possible to use the SSML format
> for sending an annotated request. For example, it is possible to
> propose an appropriate pronunciation or to indicate where to insert
> pauses.
>
> (Example adapted from
> http://www.w3.org/TR/speech-synthesis11/#edef_break):
>
> <?xml version="1.0"?>
> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
>        xml:lang="en-US">
>   Please make your choice. <break time="3s"/>
>   Click any of the buttons to indicate your preference.
> </speak>
>
> (protocol level details here...)
>
>
> (2) Some use cases require relatively static speech output which can
> be known at the time of loading a web page. In these cases, all the
> required speech output can be requested in parallel, as multiple
> concurrent requests. Callback methods in the web API are responsible
> for relating each speech stream to the appropriate place in the web
> application.
>
> On the protocol level, the request of multiple concurrent speech
> streams is realized as follows.
>
> (For educational purposes, maybe use several languages and voices but
> with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm
> George." (UK English), or "Hallo, ich heiße Peter." (German).)
>
> (protocol level details here...)
>
>
> (3) In order to synchronize the speech content with other events in
> the web application, it is possible to mark relevant points in time
> using the SSML <mark> tag. When the speech is played back, a callback
> method is invoked for each of these markers, allowing the web
> application to present, e.g., visual displays synchronously.
>
> (Example adapted from
> http://www.w3.org/TR/speech-synthesis11/#S3.3.2):
>
> <?xml version="1.0"?>
> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
>        xml:lang="en-US">
>   Would you like to sit <mark name="window_seat"/> here at the window,
>   or rather <mark name="aisle_seat"/> here at the aisle?
> </speak>
>
> (protocol level details here...)
>
>
> On 30.08.11 15:46, Robert Brown wrote:
> > Hi Marc,
> >
> > No problem if you're short of time. If you can suggest the examples,
> > I can create them and send them to you for comment.
> >
> > Cheers,
> >
> > /Rob
> > ________________________________________
> > From: Marc Schroeder [marc.schroeder@dfki.de]
> > Sent: Tuesday, August 30, 2011 6:15 AM
> > To: Robert Brown
> > Subject: Re: more protocol examples for synthesis?
> >
> > Hi Robert,
> >
> > Sorry for my relative silence recently; time is in short supply at
> > my end at the moment.
> >
> > I definitely think there should be some more synthesis examples. I
> > can certainly think of some, and attempt, with my limited
> > understanding of MRCP and thus of this protocol, to formulate them.
> >
> > An issue might be time; by when are they needed?
> >
> > Best wishes,
> > Marc
> >
> > On 30.08.11 02:13, Robert Brown wrote:
> >> Hi Marc,
> >>
> >> I was wondering if you think the protocol draft needs any
> >> additional synthesis examples?
> >>
> >> (Here's the current link:
> >> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
> >> )
> >>
> >> Also, if you think it does, would you like to write them?
> >>
> >> Let me know what you think.
> >>
> >> Cheers,
> >>
> >> /Rob
> >>
>
> --
> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
> Project leader for DFKI in SSPNet http://sspnet.eu
> Team Leader DFKI TTS Group http://mary.dfki.de
> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
> Portal Editor http://emotion-research.net
>
> Homepage: http://www.dfki.de/~schroed
> Email: marc.schroeder@dfki.de
> Phone: +49-681-85775-5303
> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
> Saarbrücken, Germany
> --
> Official DFKI coordinates:
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
Received on Tuesday, 6 September 2011 17:20:59 UTC