RE: more protocol examples for synthesis?

From: Robert Brown <Robert.Brown@microsoft.com>
Date: Tue, 6 Sep 2011 17:20:25 +0000
To: T.V Raman <raman@google.com>, "marc.schroeder@dfki.de" <marc.schroeder@dfki.de>
CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <113BCF28740AF44989BE7D3F84AE18DD1B24B7A4@TK5EX14MBXC112.redmond.corp.microsoft.com>
Thanks guys. I'll put some protocol samples together along the lines Marc suggested. TV's suggestions would make good scripting examples for the API.

-----Original Message-----
From: T.V Raman [mailto:raman@google.com] 
Sent: Tuesday, September 06, 2011 8:36 AM
To: marc.schroeder@dfki.de
Cc: Robert Brown; public-xg-htmlspeech@w3.org
Subject: Re: more protocol examples for synthesis?


In addition, accessibility use cases, e.g. screen readers for the blind, need the following (a script sketch follows the list):

1. Rapid response when speaking individual letters; in general
   it is advantageous to have a separate "speakLetter" call that
   reacts instantaneously.

2. Index markers and a callback when the index mark has been
   processed

3. A callback when speech synthesis playback is complete.
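
Purely for illustration, here is a hypothetical TypeScript sketch of how these three requirements might look to page script; every name below is an assumption made for the sketch, not taken from the draft API:

  // Hypothetical API surface, declared here only to keep the sketch
  // self-contained; none of these names come from the draft API.
  declare class SpeechSynthesizer {
    speak(text: string): void;              // plain-text utterance
    speakSSML(ssml: string): void;          // SSML utterance, may contain <mark/> tags
    cancel(): void;                         // discard any queued or playing speech
    onmark: (e: { name: string }) => void;  // fires as each <mark> is processed
    onend: () => void;                      // fires when playback is complete
  }
  declare function highlight(markName: string): void;  // hypothetical page code

  const tts = new SpeechSynthesizer();

  // 1. Rapid response for individual letters: cancel pending speech so
  //    the new letter is heard immediately.
  function speakLetter(letter: string): void {
    tts.cancel();
    tts.speak(letter);
  }

  // 2. Index markers: react as each marker in the audio is reached.
  tts.onmark = (e) => highlight(e.name);

  // 3. Completion callback.
  tts.onend = () => console.log("speech playback complete");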

Marc Schroeder writes:
 > Hi Robert,
 >
 > I have given this some thought and I think that, while the TTS mechanics for the protocol will be almost the same, for educational purposes we might want to have three TTS examples that cover:
 >
 > (1) unplannable just-in-time TTS, where the next utterance depends on the recent user action, e.g. a dialogue system or an in-car navigation system;
 >
 > (2) bulk requests of TTS for pre-determined elements of the web page (say, spoken help texts to be played on mouseover);
 >
 > (3) a multimodal presentation example, showcasing the requirement for marking points in time in the synthesized audio ("would you like to sit HERE at the window or rather HERE at the aisle?").
 >
 > For each of these, the protocol document would give only the use case scenario and the web API integration in the form of a short explanatory story, but flesh out the protocol aspects in detail.
 >
 > Examples (1) and (2) could be either plain text or SSML, whereas (3) must be SSML with <ssml:mark> tags.
 >
 >
 > Does that match what you had in mind?
 >
 > Story text for each of these could be as follows (see below).
 >
 > If you think these examples make sense, I would be most grateful if you could write the protocol part going with them... it is beyond me at the moment.
 >
 > Thanks and best,
 > Marc
 >
 >
 >
 > (1) The most straightforward use case for TTS is the synthesis of one utterance at a time. This is unavoidable for just-in-time rendition of speech, for example in dialogue systems or in in-car navigation scenarios. Here, the web application will send a single speech synthesis request to the speech service, and retrieve the resulting speech output as described (elsewhere).
 >
 > On the protocol level, the synthesis of a single utterance would look as follows.
 >
 > (a) plain-text example
 >
 > The utterance to be spoken can be sent as plain text. In this case, it is necessary to specify the language to use:
 >
 > (protocol level details here...)
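 >
 > (Purely for illustration, a hypothetical exchange loosely modeled on the MRCPv2 SPEAK method (RFC 6787), from which this draft protocol derives; request IDs and header values are made up, and version/length fields are omitted:)
 >
 >   C->S:  SPEAK 8001
 >          Content-Type: text/plain
 >          Speech-Language: en-US
 >
 >          Please make your choice now.
 >
 >   S->C:  8001 200 IN-PROGRESS
 >          (... audio is streamed to the client ...)
 >   S->C:  SPEAK-COMPLETE 8001 COMPLETE
 >          Completion-Cause: 000 normal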
 >
 >
 > (b) Speech Synthesis Markup example
 >
 > For richer markup of the text, it is possible to use the SSML format for sending an annotated request. For example, it is possible to propose an appropriate pronunciation or to indicate where to insert pauses:
 >
 > (example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):
 >
 > <?xml version="1.0"?>
 > <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
 >         xml:lang="en-US">
 >    Please make your choice. <break time="3s"/>
 >    Click any of the buttons to indicate your preference.
 > </speak>
 >
 > (protocol level details here...)
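 >
 > (Again hypothetical and MRCPv2-flavoured; the SSML document above becomes the message body, and its xml:lang attribute carries the language, so no separate language header is assumed:)
 >
 >   C->S:  SPEAK 8002
 >          Content-Type: application/ssml+xml
 >
 >          <?xml version="1.0"?>
 >          <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
 >                  xml:lang="en-US">
 >             Please make your choice. <break time="3s"/>
 >             Click any of the buttons to indicate your preference.
 >          </speak>
 >
 >   S->C:  8002 200 IN-PROGRESS
 >   S->C:  SPEAK-COMPLETE 8002 COMPLETE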
 >
 >
 > (2) Some use cases require relatively static speech output which can be known at the time of loading a web page. In these cases, all required speech output can be requested in parallel as multiple concurrent requests. Callback methods in the web API are then responsible for relating each speech stream to the appropriate place in the web application.
 >
 > On the protocol level, the request of multiple speech streams concurrently is realized as follows.
 >
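 > (To illustrate just the multiplexing, a hypothetical trace of two requests in flight at once, distinguished by their request IDs; again MRCPv2-flavoured, with all details made up:)
 >
 >   C->S:  SPEAK 9001
 >          Content-Type: text/plain
 >          Speech-Language: en-US
 >
 >          Help for the Search button.
 >
 >   C->S:  SPEAK 9002
 >          Content-Type: text/plain
 >          Speech-Language: en-US
 >
 >          Help for the Cancel button.
 >
 >   (responses, audio and completion events arrive per request ID, possibly interleaved:)
 >   S->C:  9001 200 IN-PROGRESS
 >   S->C:  9002 200 IN-PROGRESS
 >   S->C:  SPEAK-COMPLETE 9001 COMPLETE
 >   S->C:  SPEAK-COMPLETE 9002 COMPLETE
 >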
 > (for educational purposes, maybe use several languages and voices but with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm George." (UK English), or "Hallo, ich heiße Peter." (German).)
 >
 > (protocol level details here...)
 >
 >
 > (3) In order to synchronize the speech content with other events in the web application, it is possible to mark relevant points in time using the SSML <mark> tag. When the speech is played back, a callback method is called for these markers, allowing the web application to present, e.g., visual displays synchronously.
 >
 > (example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):
 >
 > <?xml version="1.0"?>
 > <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
 >         xml:lang="en-US">
 > Would you like to sit <mark name="window_seat"/> here at the window, or rather <mark name="aisle_seat"/> here at the aisle?
 > </speak>
 >
 >
 > (protocol level details here...)
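 >
 > (Hypothetical event flow, once more MRCPv2-flavoured: RFC 6787 signals each mark with a SPEECH-MARKER event whose Speech-Marker header names the mark; the trace below simplifies that header, and the request ID is made up:)
 >
 >   C->S:  SPEAK 8003
 >          Content-Type: application/ssml+xml
 >
 >          (... the SSML document above ...)
 >
 >   S->C:  8003 200 IN-PROGRESS
 >   S->C:  SPEECH-MARKER 8003 IN-PROGRESS
 >          Speech-Marker: window_seat
 >   (the web application highlights the window seat)
 >   S->C:  SPEECH-MARKER 8003 IN-PROGRESS
 >          Speech-Marker: aisle_seat
 >   (the web application highlights the aisle seat)
 >   S->C:  SPEAK-COMPLETE 8003 COMPLETE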
 >
 >
 > On 30.08.11 15:46, Robert Brown wrote:
 > > Hi Marc,
 > >
 > > No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
 > >
 > > Cheers,
 > >
 > > /Rob
 > > ________________________________________
 > > From: Marc Schroeder [marc.schroeder@dfki.de]
 > > Sent: Tuesday, August 30, 2011 6:15 AM
 > > To: Robert Brown
 > > Subject: Re: more protocol examples for synthesis?
 > >
 > > Hi Robert,
 > >
 > > sorry for my relative silence recently, time is in short supply at my end at the moment.
 > >
 > > I definitely think there should be some more synthesis examples. I can certainly think of some, and attempt, with my limited understanding of MRCP and thus of this protocol, to formulate them.
 > >
 > > An issue might be time; by when are they needed?
 > >
 > > Best wishes,
 > > Marc
 > >
 > > On 30.08.11 02:13, Robert Brown wrote:
 > >> Hi Marc,
 > >>
 > >> I was wondering if you think the protocol draft needs any additional synthesis examples?
 > >>
 > >> (here’s the current link:
 > >> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html)
 > >>
 > >> Also, if you think it does, would you like to write them?
 > >>
 > >> Let me know what you think.
 > >>
 > >> Cheers,
 > >>
 > >> /Rob
 > >>
 >
 > --
 > Dr. Marc Schröder, Senior Researcher at DFKI GmbH
 > Project leader for DFKI in SSPNet http://sspnet.eu
 > Team Leader DFKI TTS Group http://mary.dfki.de
 > Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
 > Portal Editor http://emotion-research.net
 >
 > Homepage: http://www.dfki.de/~schroed
 > Email: marc.schroeder@dfki.de
 > Phone: +49-681-85775-5303
 > Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
 > --
 > Official DFKI coordinates:
 > Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 > Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
 > Geschaeftsfuehrung:
 > Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
 > Dr. Walter Olthoff
 > Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 > Amtsgericht Kaiserslautern, HRB 2313

Received on Tuesday, 6 September 2011 17:20:59 GMT
