[HTML Speech] Text to speech from Marc Schroeder on 2010-09-09 (public-xg-htmlspeech@w3.org from September 2010)

From: Marc Schroeder <marc.schroeder@dfki.de>
Date: Thu, 09 Sep 2010 17:17:12 +0200
To: public-xg-htmlspeech@w3.org
Message-ID: <4C88FA78.5060206@dfki.de>

Hi all,

Let me try to bring the TTS topic into the discussion.

I am the core developer of DFKI's open source MARY TTS platform 
http://mary.dfki.de/, written in pure Java. Our TTS server provides an 
HTTP based interface with a simple AJAX user frontend (which you can try 
at http://mary.dfki.de:59125/); we are currently sending synthesis 
results via a GET request into an HTML 5 <audio> tag, which works (in 
Firefox 3.5+) but seems suboptimal in some ways.

I think <audio> is suboptimal even for server-side TTS, for the 
following reasons/requirements:

* <audio> provides no temporal structure of the synthesised speech. One 
feature that you often need is to know the time at which a given word is 
spoken, e.g.,
   - to highlight the word in a visual rendition of the speech;
   - to synchronize with other modalities in a multimodal presentation 
(think of an arrow appearing in a picture when a deictic is used -- 
"THIS person", or of a talking head, or gesture animation in avatars);
   - to know when to interrupt (you might not want to cut off the speech 
in the middle of a sentence)
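
To make the timing requirement concrete, here is a minimal sketch of what a page could do if word-level timing were available. The timing list, its format and the function name are all hypothetical; this is not something <audio> or MARY is claimed to expose in this form.

```python
from bisect import bisect_right

# Hypothetical timing data: (start time in seconds, word).
# A real synthesizer would have to supply these values.
WORD_TIMINGS = [(0.00, "THIS"), (0.42, "person"), (0.95, "is"), (1.10, "speaking")]


def word_at(timings, t):
    """Return the word being spoken at playback time t.

    Returns None if t lies before the first word starts.
    """
    starts = [start for start, _ in timings]
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    return timings[i][1] if i >= 0 else None
```

Calling word_at(WORD_TIMINGS, current_time) on each playback time update would give the word to highlight, or the moment at which to show the arrow for "THIS".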

* For longer stretches of spoken output, it is not obvious to me how to 
do "streaming" with an <audio> tag. Let's say a TTS can process one 
sentence at a time, and is requested to read an email consisting of 
three paragraphs. At the moment we would have to render the full email 
on the server before sending the result, which prolongs time-to-audio 
much more than necessary, for a simple transport/scheduling reason: 
IIRC, we need to indicate the Content-Length when sending the response, 
or else the audio wouldn't be played...
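
For comparison, HTTP/1.1 chunked transfer encoding ("Transfer-Encoding: chunked") lets a server send a body piece by piece without announcing a Content-Length up front. The sketch below only shows the framing of such a body, one chunk per synthesised sentence; synthesize_sentence is a hypothetical placeholder, and whether a given browser's <audio> element would start playing such a response is not something the sketch demonstrates.

```python
def synthesize_sentence(sentence: str) -> bytes:
    # Hypothetical placeholder: a real synthesizer would return audio bytes.
    return sentence.encode("utf-8")


def chunked_body(chunks):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would end the body prematurely
            # Each chunk: size in hex, CRLF, the data, CRLF.
            yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # A zero-length chunk marks the end of the body.
    yield b"0\r\n\r\n"


def stream_sentences(sentences):
    """Synthesise lazily, so each sentence can be sent as soon as it is ready."""
    return chunked_body(synthesize_sentence(s) for s in sentences)
```

Because stream_sentences is a generator pipeline, the first sentence can go out on the wire before the second one has been rendered.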

* There are certain properties of speech output that could be provided 
in an API, such as gender of the voice, language of the text to be 
spoken, preferred pronunciations, etc. -- of course SSML comes to mind 
(http://www.w3.org/TR/speech-synthesis11/ -- congratulations on 
reaching Recommendation status, Dan!)
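
As an illustration of how SSML 1.1 carries these properties, here is a minimal sketch that builds a small SSML document with a language, a voice gender and one explicit pronunciation. The function name, its parameters and the example values are made up for illustration; only the element and attribute names (speak, voice, phoneme, xml:lang, gender, alphabet, ph) come from SSML.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(before, word, ipa, after, lang="en-US", gender="female"):
    """Build an SSML string: text with one word given an IPA pronunciation."""
    # Serialise SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.1", f"{{{XML_NS}}}lang": lang},  # language of the text
    )
    # Requested voice property: gender.
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"gender": gender})
    voice.text = before
    # Preferred pronunciation for a single word.
    phoneme = ET.SubElement(
        voice, f"{{{SSML_NS}}}phoneme", {"alphabet": "ipa", "ph": ipa}
    )
    phoneme.text = word
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")
```

For example, build_ssml("I like ", "tomato", "təˈmɑːtəʊ", " soup.") yields a document that a server-side or in-browser synthesizer could both accept as input.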



BTW, I have seen a number of emails on the whatwg list and here taking 
an a priori stance on the question of whether ASR (and TTS) would 
happen in the browser ("user agent", as you guys seem to call it) or on 
the server. I don't think the choice is clear a priori; I am sure there 
are good use cases for either choice. The question is whether there is a 
way to cater for both in an HTML speech API...

Best for now,
Marc

-- 
please note my NEW phone number: +49-681-85775-5303

Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Project leader for DFKI in SSPNet http://sspnet.eu
Project leader PAVOQUE http://mary.dfki.de/pavoque
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Team Leader DFKI TTS Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Received on Thursday, 9 September 2010 15:17:48 UTC