- From: Robert Brown <Robert.Brown@microsoft.com>
- Date: Tue, 6 Sep 2011 17:20:25 +0000
- To: T.V Raman <raman@google.com>, "marc.schroeder@dfki.de" <marc.schroeder@dfki.de>
- CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Thanks guys. I'll put some protocol samples together along the lines Marc suggested. TV's suggestions would make good scripting examples for the API.
-----Original Message-----
From: T.V Raman [mailto:raman@google.com]
Sent: Tuesday, September 06, 2011 8:36 AM
To: marc.schroeder@dfki.de
Cc: Robert Brown; public-xg-htmlspeech@w3.org
Subject: Re: more protocol examples for synthesis?
In addition, accessibility use cases, e.g. screenreaders for the blind, need:
1. Rapid response when speaking individual letters; in general
   it is advantageous to have a separate "speakLetter" call that
   reacts instantaneously
2. Index markers and a callback when the index mark has been
processed
3. A callback when speech synthesis playback is complete.
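As a purely illustrative sketch (all names below are invented; the
scripting API itself is defined elsewhere), a screenreader built on
such an API might look like this in TypeScript:

  // Hypothetical API surface covering the three needs above.
  interface TTSMarkEvent { name: string; }
  interface TTS {
    speak(ssml: string): void;
    speakLetter(letter: string): void;  // 1. low-latency single-letter path
    onmark: (e: TTSMarkEvent) => void;  // 2. fired when an index mark is reached
    onend: () => void;                  // 3. fired when playback completes
  }

  declare const tts: TTS;               // assume the user agent provides this

  tts.speakLetter("a");                 // must react near-instantaneously
  tts.onmark = (e) => console.log("reached mark:", e.name);
  tts.onend = () => console.log("done speaking");
  tts.speak('<speak version="1.1" ' +
            'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
            'Chapter <mark name="ch1"/> one.</speak>');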
Marc Schroeder writes:
> Hi Robert,
>
> I have given this some thought and I think that, while the TTS
> mechanics for the protocol will be almost the same, for educational
> purposes we might want to have three TTS examples that cover:
>
> (1) unplannable just-in-time TTS where the next utterance depends on
> the recent user action, e.g. a dialogue system or an in-car navigation
> system;
>
> (2) bulk requests of TTS for pre-determined elements of the web page
> (say, spoken help texts to be played on mouseover);
>
> (3) a multimodal presentation example, showcasing the requirement for
> marking points in time in the synthesized audio ("would you like to
> sit HERE at the window or rather HERE at the aisle?").
>
> In each of these, the protocol document would only give the use case
> scenario and the web API integration in the form of a short
> explanatory story, but flesh out the protocol aspect in detail.
>
> Examples (1) and (2) could be either plain text or SSML, whereas (3)
> must be SSML with <ssml:mark> tags.
>
>
> Does that match what you had in mind?
>
> Story text for each of these could be as follows.
>
> If you think these examples make sense, I would be most grateful if
> you could write the protocol part that goes with them... it is beyond
> me at the moment.
>
> Thanks and best,
> Marc
>
>
>
> (1) The most straightforward use case for TTS is the synthesis of one
> utterance at a time. This is unavoidable for just-in-time rendition of
> speech, for example in dialogue systems or in in-car navigation
> scenarios. Here, the web application will send a single speech
> synthesis request to the speech service, and retrieve the resulting
> speech output as described (elsewhere).
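>
> (As a purely illustrative sketch at the web API level, with invented
> names since the scripting API is specified elsewhere, the just-in-time
> case might look like:
>
>   // hypothetical: one request, one resulting audio stream
>   const req = new SpeechSynthesisRequest("wss://example.org/tts");
>   req.lang = "en-US";
>   req.text = "In 300 metres, turn left.";  // just-in-time prompt
>   req.onend = () => console.log("prompt finished playing");
>   req.start();
> )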
>
> On the protocol level, the synthesis of a single utterance would look
> as follows.
>
> (a) plain-text example
>
> The utterance to be spoken can be sent as plain text. In this case, it
> is necessary to specify the language to use:
>
> (protocol level details here...)
>
>
> (b) Speech Synthesis Markup example
>
> For richer markup of the text, it is possible to use the SSML format
> for sending an annotated request. For example, it is possible to
> propose an appropriate pronunciation or to indicate where to insert
> pauses:
>
> (example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):
>
> <?xml version="1.0"?>
> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
> xml:lang="en-US">
> Please make your choice. <break time="3s"/>
> Click any of the buttons to indicate your preference.
> </speak>
>
> (protocol level details here...)
>
>
> (2) Some use cases require relatively static speech output which is
> already known at the time of loading a web page. In these cases, all
> required speech output can be requested in parallel as multiple
> concurrent requests. Callback methods in the web API are responsible
> for relating each speech stream to the appropriate place in the web
> application.
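>
> (Again as an illustrative sketch with invented web API names, the bulk
> case could fire one request per help text and relate each stream back
> to its element via a callback:
>
>   // hypothetical: request all mouseover help texts up front
>   const helpTexts: { [id: string]: string } = {
>     saveButton: "Saves the current document.",
>     openButton: "Opens an existing document.",
>   };
>   for (const id of Object.keys(helpTexts)) {
>     const req = new SpeechSynthesisRequest("wss://example.org/tts");
>     req.lang = "en-US";
>     req.text = helpTexts[id];
>     // attachMouseoverAudio is an invented helper that plays the
>     // stream when the element with the given id is hovered:
>     req.onaudio = (stream) => attachMouseoverAudio(id, stream);
>     req.start();                       // requests run concurrently
>   }
> )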
>
> On the protocol level, requesting multiple speech streams concurrently
> would be realized as follows.
>
> (for educational purposes, maybe use several languages and voices but
> with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm
> George." (UK English), or "Hallo, ich heiße Peter." (German).)
>
> (protocol level details here...)
>
>
> (3) In order to synchronize the speech content with other events in
> the web application, it is possible to mark relevant points in time
> using the SSML <mark> tag. When the speech is played back, a callback
> method is called for these markers, allowing the web application to
> present, e.g., visual displays synchronously.
>
> (example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):
>
> <?xml version="1.0"?>
> <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
> xml:lang="en-US">
> Would you like to sit <mark name="window_seat"/> here at the window,
> or rather <mark name="aisle_seat"/> here at the aisle?
> </speak>
>
>
> (protocol level details here...)
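>
> (Sketch only, with the same invented web API names as above: on the
> web API side, the mark callback is what lets the application drive the
> visual display in sync with the audio:
>
>   // hypothetical: highlight each seat when its <mark> is reached
>   const req = new SpeechSynthesisRequest("wss://example.org/tts");
>   req.lang = "en-US";
>   req.ssml = seatPromptSsml;           // the <speak> document above
>   req.onmark = (e) => {
>     if (e.name === "window_seat") highlightSeat("11A");  // invented helper
>     if (e.name === "aisle_seat") highlightSeat("11C");
>   };
>   req.start();
> )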
>
>
> On 30.08.11 15:46, Robert Brown wrote:
> > Hi Marc,
> >
> > No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
> >
> > Cheers,
> >
> > /Rob
> > ________________________________________
> > From: Marc Schroeder [marc.schroeder@dfki.de]
> > Sent: Tuesday, August 30, 2011 6:15 AM
> > To: Robert Brown
> > Subject: Re: more protocol examples for synthesis?
> >
> > Hi Robert,
> >
> > sorry for my relative silence recently; time is in short supply at
> > my end at the moment.
> >
> > I definitely think there should be some more synthesis examples. I
> > can certainly think of some, and attempt, with my limited
> > understanding of MRCP and thus of this protocol, to formulate them.
> >
> > An issue might be time; by when are they needed?
> >
> > Best wishes,
> > Marc
> >
> > On 30.08.11 02:13, Robert Brown wrote:
> >> Hi Marc,
> >>
> >> I was wondering if you think the protocol draft needs any
> >> additional synthesis examples?
> >>
> >> (here’s the current link:
> >> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
> >> )
> >>
> >> Also, if you think it does, would you like to write them?
> >>
> >> Let me know what you think.
> >>
> >> Cheers,
> >>
> >> /Rob
> >>
>
> --
> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
> Project leader for DFKI in SSPNet http://sspnet.eu
> Team Leader DFKI TTS Group http://mary.dfki.de
> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
> Portal Editor http://emotion-research.net
>
> Homepage: http://www.dfki.de/~schroed
> Email: marc.schroeder@dfki.de
> Phone: +49-681-85775-5303
> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3,
> D-66123 Saarbrücken, Germany
> --
> Official DFKI coordinates:
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313