- From: Marc Schroeder <marc.schroeder@dfki.de>
- Date: Mon, 05 Sep 2011 10:12:02 +0200
- To: Robert Brown <Robert.Brown@microsoft.com>
- CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Hi Robert,
I have given this some thought and I think that, while the TTS mechanics
for the protocol will be almost the same, for educational purposes we
might want to have three TTS examples that cover:
(1) unplannable just-in-time TTS where the next utterance depends on the
user's most recent action, e.g. a dialogue system or an in-car
navigation system;
(2) bulk requests of TTS for pre-determined elements of the web page
(say, spoken help texts to be played on mouseover);
(3) a multimodal presentation example, showcasing the requirement for
marking points in time in the synthesized audio ("would you like to sit
HERE at the window or rather HERE at the aisle?").
For each of these, the protocol document would give only the use case
scenario and the web API integration, in the form of a short
explanatory story, but would flesh out the protocol aspects in detail.
Examples (1) and (2) could be either plain text or SSML, whereas (3)
must be SSML with <ssml:mark> tags.
Does that match what you had in mind?
Story text for each of these could be as follows.
If you think these examples make sense, I would be most grateful if you
could write the protocol part going with them... it is beyond me at the
moment.
Thanks and best,
Marc
(1) The most straightforward use case for TTS is the synthesis of one
utterance at a time. This is unavoidable for just-in-time rendition of
speech, for example in dialogue systems or in in-car navigation
scenarios. Here, the web application will send a single speech synthesis
request to the speech service, and retrieve the resulting speech output
as described (elsewhere).
On the protocol level, the synthesis of a single utterance would look as
follows.
(a) plain-text example
The utterance to be spoken can be sent as plain text. In this case, it
is necessary to specify the language to use:
(protocol level details here...)
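In the meantime, here is my rough guess at what such an exchange could
look like, modeled on the SPEAK method of MRCPv2 (the channel
identifier, request ID and utterance below are invented, and the actual
header names and syntax of our draft may well differ, so please correct
freely):

C->S: MRCP/2.0 ... SPEAK 8001
      Channel-Identifier:23af1e13@speechsynth
      Speech-Language:en-US
      Content-Type:text/plain
      Content-Length:...

      Turn left onto Main Street in 200 meters.

S->C: MRCP/2.0 ... 8001 200 IN-PROGRESS
      Channel-Identifier:23af1e13@speechsynth

Here the Speech-Language header carries the required language
information, since a plain-text body cannot specify it itself.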
(b) Speech Synthesis Markup (SSML) example
For richer markup of the text, it is possible to use the SSML format
for sending an annotated request, for example to suggest an appropriate
pronunciation or to indicate where to insert pauses:
(example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
Please make your choice. <break time="3s"/>
Click any of the buttons to indicate your preference.
</speak>
(protocol level details here...)
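Again only a guess on my part: the exchange should look like the
plain-text case, except that the SSML document becomes the message body
and no separate language header is needed, since xml:lang is given in
the SSML itself (request ID again invented):

C->S: MRCP/2.0 ... SPEAK 8002
      Channel-Identifier:23af1e13@speechsynth
      Content-Type:application/ssml+xml
      Content-Length:...

      (the SSML document shown above)

S->C: MRCP/2.0 ... 8002 200 IN-PROGRESS
      Channel-Identifier:23af1e13@speechsynth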
(2) Some use cases require relatively static speech output which is
already known at the time a web page is loaded. In these cases, all
required speech output can be requested in parallel as multiple
concurrent requests. Callback methods in the web API are responsible
for relating each speech stream to the appropriate place in the web
application.
On the protocol level, requesting multiple speech streams concurrently
would be realized as follows.
(for educational purposes, maybe use several languages and voices but
with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm
George." (UK English), or "Hallo, ich heiße Peter." (German).)
(protocol level details here...)
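My guess, again in MRCPv2 style, would be several SPEAK requests
distinguished by their request IDs (all values below invented):

C->S: MRCP/2.0 ... SPEAK 9001
      Channel-Identifier:23af1e13@speechsynth
      Speech-Language:es-ES
      Content-Type:text/plain
      Content-Length:...

      Hola, me llamo Maria.

C->S: MRCP/2.0 ... SPEAK 9002
      Channel-Identifier:23af1e13@speechsynth
      Speech-Language:en-GB
      Content-Type:text/plain
      Content-Length:...

      Hi, I'm George.

I am not sure whether such concurrent requests would be multiplexed
over a single channel by request ID, or whether each would need its own
channel; you will know this much better than I do.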
(3) In order to synchronize the speech content with other events in the
web application, it is possible to mark relevant points in time using
the SSML <mark> tag. When the speech is played back, a callback method
is invoked for each of these markers, allowing the web application to
present, for example, visual content in sync with the speech.
(example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
Would you like to sit <mark name="window_seat"/> here at the window, or
rather <mark name="aisle_seat"/> here at the aisle?
</speak>
(protocol level details here...)
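For the protocol part I would again guess at the MRCPv2 mechanism,
where the service sends a SPEECH-MARKER event when playback reaches a
mark (request ID and timestamp below are invented, and the exact
Speech-Marker header format is my reading of the MRCPv2 examples):

S->C: MRCP/2.0 ... SPEECH-MARKER 9003 IN-PROGRESS
      Channel-Identifier:23af1e13@speechsynth
      Speech-Marker:timestamp=857206027059;window_seat

The web API implementation would then map such events onto the
callback methods mentioned above.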
On 30.08.11 15:46, Robert Brown wrote:
> Hi Marc,
>
> No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
>
> Cheers,
>
> /Rob
> ________________________________________
> From: Marc Schroeder [marc.schroeder@dfki.de]
> Sent: Tuesday, August 30, 2011 6:15 AM
> To: Robert Brown
> Subject: Re: more protocol examples for synthesis?
>
> Hi Robert,
>
> sorry for my relative silence recently, time is in short supply at my
> end at the moment.
>
> I definitely think there should be some more synthesis examples. I can
> certainly think of some, and attempt, with my limited understanding of
> MRCP and thus of this protocol, to formulate them.
>
> An issue might be time; until when are they needed?
>
> Best wishes,
> Marc
>
> On 30.08.11 02:13, Robert Brown wrote:
>> Hi Marc,
>>
>> I was wondering if you think the protocol draft needs any additional
>> synthesis examples?
>>
>> (here’s the current link:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
>> )
>>
>> Also, if you think it does, would you like to write them?
>>
>> Let me know what you think.
>>
>> Cheers,
>>
>> /Rob
>>
--
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
Saarbrücken, Germany