
Re: more protocol examples for synthesis?

From: T.V Raman <raman@google.com>
Date: Tue, 6 Sep 2011 08:35:30 -0700
Message-ID: <20070.15810.443808.537245@retriever.mtv.corp.google.com>
To: marc.schroeder@dfki.de
Cc: Robert.Brown@microsoft.com, public-xg-htmlspeech@w3.org

In addition, accessibility use cases, e.g. screen readers for the
blind, need:

1. Rapid response when speaking individual letters; in general
   it is advantageous to have a separate "speakLetter" call that
   reacts instantaneously.

2. Index markers and a callback when the index mark has been
   processed

3. A callback when speech synthesis playback is complete.
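
For illustration, the three requirements might surface in a web API
roughly as follows. This is a hypothetical sketch: the names
(MockTTS, speakLetter, onmark, onend) are invented here, not taken
from any draft, and a stub "engine" fires the callbacks synchronously.

```javascript
// Hypothetical sketch of a TTS object meeting the three requirements
// above. All names are illustrative; the stub fires callbacks
// synchronously instead of driving a real synthesizer.
class MockTTS {
  constructor() {
    this.onmark = null; // called when an index mark is reached
    this.onend = null;  // called when playback is complete
  }
  // Requirement 1: a dedicated low-latency call for single letters.
  speakLetter(letter) {
    return `spoke letter ${letter}`;
  }
  // Requirements 2 and 3: index marks and completion via callbacks.
  speak(ssmlText, marks) {
    for (const name of marks) {
      if (this.onmark) this.onmark(name); // Requirement 2
    }
    if (this.onend) this.onend();         // Requirement 3
  }
}

const tts = new MockTTS();
const events = [];
tts.onmark = (name) => events.push(`mark:${name}`);
tts.onend = () => events.push("end");
tts.speakLetter("a");
tts.speak("<speak>...</speak>", ["window_seat", "aisle_seat"]);
console.log(events.join(","));
// → mark:window_seat,mark:aisle_seat,end
```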

Marc Schroeder writes:
 > Hi Robert,
 > 
 > I have given this some thought and I think that, while the TTS mechanics 
 > for the protocol will be almost the same, for educational purposes we 
 > might want to have three TTS examples that cover:
 > 
 > (1) unplannable just-in-time TTS where the next utterance depends on the 
 > recent user action, e.g. a dialogue system or an in-car navigation system;
 > 
 > (2) bulk requests of TTS for pre-determined elements of the web page 
 > (say, spoken help texts to be played on mouseover);
 > 
 > (3) a multimodal presentation example, showcasing the requirement for 
 > marking points in time in the synthesized audio ("would you like to sit 
 > HERE at the window or rather HERE at the aisle?").
 > 
 > For each of these, the protocol document would give only the use 
 > case scenario and the web API integration in the form of a short 
 > explanatory story, but flesh out the protocol aspects in detail.
 > 
 > Examples (1) and (2) could be either plain text or SSML, whereas (3) 
 > must be SSML with <ssml:mark> tags.
 > 
 > 
 > Does that match what you had in mind?
 > 
 > Story text for each of these could be as follows (see below).
 > 
 > If you think these examples make sense, I would be most grateful if you 
 > could write the protocol part going with them... it is beyond me at the 
 > moment.
 > 
 > Thanks and best,
 > Marc
 > 
 > 
 > 
 > (1) The most straightforward use case for TTS is the synthesis of one 
 > utterance at a time. This is necessary for just-in-time rendition of 
 > speech, for example in dialogue systems or in in-car navigation 
 > scenarios. Here, the web application sends a single speech synthesis 
 > request to the speech service and retrieves the resulting speech output 
 > as described (elsewhere).
 > 
 > On the protocol level, the synthesis of a single utterance would look as 
 > follows.
 > 
 > (a) plain-text example
 > 
 > The utterance to be spoken can be sent as plain text. In this case, it 
 > is necessary to specify the language to use:
 > 
 > (protocol level details here...)
 > 
 > 
 > (b) Speech Synthesis Markup example
 > 
 > For richer markup of the text, it is possible to use the SSML format for 
 > sending an annotated request. For example, it is possible to propose an 
 > appropriate pronunciation or to indicate where to insert pauses:
 > 
 > (example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):
 > 
 > <?xml version="1.0"?>
 > <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
 >         xml:lang="en-US">
 >    Please make your choice. <break time="3s"/>
 >    Click any of the buttons to indicate your preference.
 > </speak>
 > 
 > (protocol level details here...)
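
As a purely illustrative stand-in for the placeholder above, the
exchange for a single SSML request might look like the sketch below.
The message names, request IDs, status line, and headers are loosely
modeled on MRCP and are invented here; the real syntax must come from
the protocol draft.

```
C->S:  SPEAK 0001
       Content-Type: application/ssml+xml

       <?xml version="1.0"?>
       <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
               xml:lang="en-US">
          Please make your choice. <break time="3s"/>
          Click any of the buttons to indicate your preference.
       </speak>

S->C:  0001 200 IN-PROGRESS
S->C:  (audio data for request 0001, streamed as it is synthesized)
S->C:  SPEAK-COMPLETE 0001 COMPLETE
```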
 > 
 > 
 > (2) Some use cases require relatively static speech output which can be 
 > known at the time of loading a web page. In these cases, all required 
 > speech output can be requested in parallel as multiple concurrent 
 > requests. Callback methods in the web API are responsible for relating 
 > each speech stream to the appropriate place in the web application.
 > 
 > On the protocol level, the request of multiple speech streams 
 > concurrently is realized as follows.
 > 
 > (for educational purposes, maybe use several languages and voices but 
 > with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm 
 > George." (UK English), or "Hallo, ich heiße Peter." (German).)
 > 
 > (protocol level details here...)
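
Again purely as an illustration (the message framing is invented;
only the example sentences come from the text above), three
concurrent plain-text requests might be interleaved like this:

```
C->S:  SPEAK 0001
       Content-Type: text/plain
       Speech-Language: es-ES

       Hola, me llamo Maria.

C->S:  SPEAK 0002
       Content-Type: text/plain
       Speech-Language: en-GB

       Hi, I'm George.

C->S:  SPEAK 0003
       Content-Type: text/plain
       Speech-Language: de-DE

       Hallo, ich heiße Peter.

S->C:  audio data tagged 0001 / 0002 / 0003, possibly interleaved;
       each request finishes with its own SPEAK-COMPLETE event.
```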
 > 
 > 
 > (3) In order to synchronize the speech content with other events in the 
 > web application, it is possible to mark relevant points in time using 
 > the SSML <mark> tag. When the speech is played back, a callback method 
 > is invoked for each marker, allowing the web application to synchronize, 
 > for example, visual displays with the audio.
 > 
 > (example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):
 > 
 > <?xml version="1.0"?>
 > <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
 >         xml:lang="en-US">
 > Would you like to sit <mark name="window_seat"/> here at the window, or 
 > rather <mark name="aisle_seat"/> here at the aisle?
 > </speak>
 > 
 > 
 > (protocol level details here...)
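
To make the synchronization concrete, here is a minimal sketch of the
web-application side. The mark names match the SSML example above;
everything else (handleMarkEvent, the seatmap object) is invented for
illustration, and the "display" is just a plain object.

```javascript
// Hypothetical sketch: relate SSML <mark> callback events to visual
// updates. The mark names come from the SSML example; the rest is
// invented for illustration.
const seatmap = { window_seat: "unlit", aisle_seat: "unlit" };

// Called once per <mark> as playback passes it.
function handleMarkEvent(markName) {
  if (markName in seatmap) {
    seatmap[markName] = "highlighted"; // e.g. flash the seat on screen
  }
  return seatmap;
}

handleMarkEvent("window_seat");
handleMarkEvent("aisle_seat");
console.log(JSON.stringify(seatmap));
// both seats are highlighted once both marks have fired
```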
 > 
 > 
 > On 30.08.11 15:46, Robert Brown wrote:
 > > Hi Marc,
 > >
 > > No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
 > >
 > > Cheers,
 > >
 > > /Rob
 > > ________________________________________
 > > From: Marc Schroeder [marc.schroeder@dfki.de]
 > > Sent: Tuesday, August 30, 2011 6:15 AM
 > > To: Robert Brown
 > > Subject: Re: more protocol examples for synthesis?
 > >
 > > Hi Robert,
 > >
 > > sorry for my relative silence recently, time is in short supply at my
 > > end at the moment.
 > >
 > > I definitely think there should be some more synthesis examples. I can
 > > certainly think of some, and attempt, with my limited understanding of
 > > MRCP and thus of this protocol, to formulate them.
 > >
 > > An issue might be time; until when are they needed?
 > >
 > > Best wishes,
 > > Marc
 > >
 > > On 30.08.11 02:13, Robert Brown wrote:
 > >> Hi Marc,
 > >>
 > >> I was wondering if you think the protocol draft needs any additional
 > >> synthesis examples?
 > >>
 > >> (here’s the current link:
 > >> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
 > >> )
 > >>
 > >> Also, if you think it does, would you like to write them?
 > >>
 > >> Let me know what you think.
 > >>
 > >> Cheers,
 > >>
 > >> /Rob
 > >>
 > 
 > -- 
 > Dr. Marc Schröder, Senior Researcher at DFKI GmbH
 > Project leader for DFKI in SSPNet http://sspnet.eu
 > Team Leader DFKI TTS Group http://mary.dfki.de
 > Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
 > Portal Editor http://emotion-research.net
 > 
 > Homepage: http://www.dfki.de/~schroed
 > Email: marc.schroeder@dfki.de
 > Phone: +49-681-85775-5303
 > Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
 > Saarbrücken, Germany
 > --
 > Official DFKI coordinates:
 > Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 > Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
 > Geschaeftsfuehrung:
 > Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
 > Dr. Walter Olthoff
 > Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 > Amtsgericht Kaiserslautern, HRB 2313
Received on Tuesday, 6 September 2011 15:36:01 GMT
