Re: more protocol examples for synthesis?

Hi Robert,

I have given this some thought, and I think that, while the protocol 
mechanics for TTS will be almost the same in each case, for educational 
purposes we might want three TTS examples that cover:

(1) unplannable, just-in-time TTS, where the next utterance depends on 
the most recent user action, e.g. in a dialogue system or an in-car 
navigation system;

(2) bulk TTS requests for pre-determined elements of a web page (say, 
spoken help texts to be played on mouseover);

(3) a multimodal presentation example, showcasing the requirement for 
marking points in time in the synthesized audio ("would you like to sit 
HERE at the window or rather HERE at the aisle?").

For each of these, the protocol document would give the use case 
scenario and the web API integration only as a short explanatory story, 
but flesh out the protocol aspects in detail.

Examples (1) and (2) could use either plain text or SSML, whereas (3) 
must use SSML with <ssml:mark> tags.


Does that match what you had in mind?

Draft story text for each of these is included below.

If you think these examples make sense, I would be most grateful if you 
could write the protocol parts to go with them... it is beyond me at the 
moment, though I have pencilled in some rough guesses below for you to 
correct or replace.

Thanks and best,
Marc



(1) The most straightforward use case for TTS is the synthesis of one 
utterance at a time. This is unavoidable for just-in-time rendering of 
speech, for example in dialogue systems or in in-car navigation 
scenarios. Here, the web application sends a single speech synthesis 
request to the speech service and retrieves the resulting speech output 
as described (elsewhere).

On the protocol level, the synthesis of a single utterance would look as 
follows.

(a) plain-text example

The utterance to be spoken can be sent as plain text. In this case, it 
is necessary to specify the language to use:

(protocol level details here...)
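
As a straw man, and with my limited understanding of MRCP, I imagine it 
could look roughly like the MRCPv2 SPEAK request below; the actual 
request line, header names and channel/request identifiers in our draft 
will very likely differ (the values here are made up):

MRCP/2.0 ... SPEAK 0001
Channel-Identifier: abc123@speechsynth
Speech-Language: en-US
Content-Type: text/plain
Content-Length: 24

Please make your choice.

(The Speech-Language header stands in for whatever mechanism the draft 
ends up using to select the language for plain-text input.)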


(b) Speech Synthesis Markup example

For richer markup of the text, the request can be annotated using the 
SSML format. For example, it is possible to propose an appropriate 
pronunciation or to indicate where to insert pauses:

(example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break):

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="en-US">
   Please make your choice. <break time="3s"/>
   Click any of the buttons to indicate your preference.
</speak>

(protocol level details here...)
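
Guessing at the wire format again in MRCPv2 style (same caveats, made-up 
identifiers), the SSML variant would mainly differ in the Content-Type, 
with the language now taken from the xml:lang attribute rather than from 
a header:

MRCP/2.0 ... SPEAK 0002
Channel-Identifier: abc123@speechsynth
Content-Type: application/ssml+xml
Content-Length: ...

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="en-US">
   Please make your choice. <break time="3s"/>
   Click any of the buttons to indicate your preference.
</speak>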


(2) Some use cases require relatively static speech output which is 
already known when the web page is loaded. In these cases, all required 
speech output can be requested up front as multiple concurrent requests. 
Callback methods in the web API are responsible for relating each speech 
stream to the appropriate place in the web application.

On the protocol level, requesting multiple speech streams concurrently 
would look as follows.

(For educational purposes, we could use several languages and voices, 
but with plain text, such as "Hola, me llamo Maria." (Spanish), "Hi, I'm 
George." (UK English), or "Hallo, ich heiße Peter." (German).)

(protocol level details here...)
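
My rough guess at the shape of this, once more borrowing MRCPv2's SPEAK 
request (made-up identifiers and voice names; note that an MRCPv2 
synthesizer would queue a second SPEAK rather than render it in 
parallel, so the draft presumably needs distinct request IDs, or 
separate channels, to keep concurrent streams apart):

MRCP/2.0 ... SPEAK 0003
Channel-Identifier: abc123@speechsynth
Speech-Language: es-ES
Voice-name: maria
Content-Type: text/plain
Content-Length: 21

Hola, me llamo Maria.

MRCP/2.0 ... SPEAK 0004
Channel-Identifier: abc123@speechsynth
Speech-Language: en-GB
Voice-name: george
Content-Type: text/plain
Content-Length: 15

Hi, I'm George.

MRCP/2.0 ... SPEAK 0005
Channel-Identifier: abc123@speechsynth
Speech-Language: de-DE
Voice-name: peter
Content-Type: text/plain
Content-Length: 24

Hallo, ich heiße Peter.

Each response and audio stream would then carry the request ID it 
belongs to, which is what the web API callbacks would use to route each 
stream to the right place in the page.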


(3) To synchronize the speech content with other events in the web 
application, relevant points in time can be marked using the SSML <mark> 
tag. When the speech is played back, a callback method is invoked for 
each of these markers, allowing the web application to present, for 
example, visual displays in sync with the speech.

(example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2):

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
        xml:lang="en-US">
Would you like to sit <mark name="window_seat"/> here at the window, or 
rather <mark name="aisle_seat"/> here at the aisle?
</speak>


(protocol level details here...)
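
Here my guess is that the request itself looks like any other SSML SPEAK 
request, and that the interesting part is an event sent by the server as 
rendering passes each <mark/>. MRCPv2 does this with a SPEECH-MARKER 
event, so perhaps something along these lines (made-up identifiers and 
timestamp):

MRCP/2.0 ... SPEECH-MARKER 0006 IN-PROGRESS
Channel-Identifier: abc123@speechsynth
Speech-Marker: timestamp=857206027059;window_seat

The user agent would translate each such event into the corresponding 
web API callback, so that e.g. the visual display can be updated at 
exactly that point in the audio.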


On 30.08.11 15:46, Robert Brown wrote:
> Hi Marc,
>
> No problem if you're short of time. If you can suggest the examples, I can create them and send them to you for comment.
>
> Cheers,
>
> /Rob
> ________________________________________
> From: Marc Schroeder [marc.schroeder@dfki.de]
> Sent: Tuesday, August 30, 2011 6:15 AM
> To: Robert Brown
> Subject: Re: more protocol examples for synthesis?
>
> Hi Robert,
>
> sorry for my relative silence recently, time is in short supply at my
> end at the moment.
>
> I definitely think there should be some more synthesis examples. I can
> certainly think of some, and attempt, with my limited understanding of
> MRCP and thus of this protocol, to formulate them.
>
> An issue might be time; until when are they needed?
>
> Best wishes,
> Marc
>
> On 30.08.11 02:13, Robert Brown wrote:
>> Hi Marc,
>>
>> I was wondering if you think the protocol draft needs any additional
>> synthesis examples?
>>
>> (here’s the current link:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
>> )
>>
>> Also, if you think it does, would you like to write them?
>>
>> Let me know what you think.
>>
>> Cheers,
>>
>> /Rob
>>

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Management board (Geschaeftsfuehrung):
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the supervisory board: Prof. Dr. h.c. Hans A. Aukes
Commercial register: Amtsgericht Kaiserslautern, HRB 2313
