- From: T.V Raman <raman@google.com>
- Date: Thu, 9 Sep 2010 09:41:36 -0700
- To: david.bolter@gmail.com
- Cc: bringert@google.com, marc.schroeder@dfki.de, public-xg-htmlspeech@w3.org
TTS is *definitely* in scope.

David Bolter writes:

> Hi all,
>
> I am actually more interested in TTS than speech recognition, however
> I wasn't aware TTS is in scope for this group? Perhaps Dan can
> clarify.
>
> cheers,
> David
>
> On 09/09/10 11:56 AM, Bjorn Bringert wrote:
> > On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder
> > <marc.schroeder@dfki.de> wrote:
> >> Hi all,
> >>
> >> let me try and bring the TTS topic into the discussion.
> >>
> >> I am the core developer of DFKI's open source MARY TTS platform
> >> http://mary.dfki.de/, written in pure Java. Our TTS server provides
> >> an HTTP-based interface with a simple AJAX user frontend (which you
> >> can try at http://mary.dfki.de:59125/); we are currently sending
> >> synthesis results via a GET request into an HTML 5 <audio> tag,
> >> which works (in Firefox 3.5+) but seems suboptimal in some ways.
> >
> > I was just going to send out this TTS proposal:
> > http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
> >
> > The basic idea is to add a <tts> element which extends
> > HTMLMediaElement (like <audio> and <video> do). I think that it
> > addresses most of the points that you bring up; see below.
> >
> >> I think <audio> is suboptimal even for server-side TTS, for the
> >> following reasons/requirements:
> >>
> >> * <audio> provides no temporal structure of the synthesised speech.
> >> One feature that you often need is to know the time at which a given
> >> word is spoken, e.g.,
> >> - to highlight the word in a visual rendition of the speech;
> >> - to synchronize with other modalities in a multimodal presentation
> >> (think of an arrow appearing in a picture when a deictic is used --
> >> "THIS person", or of a talking head, or gesture animation in
> >> avatars);
> >> - to know when to interrupt (you might not want to cut off the
> >> speech in the middle of a sentence)
> >
> > The web app is notified when SSML <mark> events are reached (using
> > the HTMLMediaElement timeupdate event).
> >
> >> * For longer stretches of spoken output, it is not obvious to me how
> >> to do "streaming" with an <audio> tag. Let's say a TTS can process
> >> one sentence at a time, and is requested to read an email consisting
> >> of three paragraphs. At the moment we would have to render the full
> >> email on the server before sending the result, which prolongs
> >> time-to-audio much more than necessary, for a simple
> >> transport/scheduling reason: IIRC, we need to indicate the
> >> Content-Length when sending the response, or else the audio wouldn't
> >> be played...
> >
> > While this is outside the scope of the proposal (since the proposal
> > doesn't specify how the browser talks to the synthesizer), streaming
> > from a server-side synthesizer can be done with chunked transfer
> > encoding.
> >
> >> * There are certain properties of speech output that could be
> >> provided in an API, such as gender of the voice, language of the
> >> text to be spoken, preferred pronunciations, etc. -- of course SSML
> >> comes to mind (http://www.w3.org/TR/speech-synthesis11/ --
> >> congratulations for reaching Recommendation status, Dan!)
> >
> > SSML documents can be used as the source in <tts>, so all these
> > parameters are supported.
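A concrete reading of the two replies above, sketched only from what the
thread states: <tts> extends HTMLMediaElement, an SSML document can be
its source, and SSML <mark> positions surface through the timeupdate
event. The file name message.ssml, the markTimes table, and the
highlightWord stub are all hypothetical; the thread does not say how a
page learns each mark's time offset.

    <?xml version="1.0"?>
    <!-- message.ssml (hypothetical file name); the markup itself is
         standard SSML 1.1 -->
    <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      <voice gender="female">
        <mark name="w1"/>THIS <mark name="w2"/>person.
      </voice>
    </speak>

    <!-- The page, assuming <tts> takes a src attribute like <audio> -->
    <tts id="reader" src="message.ssml"></tts>
    <script>
      var tts = document.getElementById("reader");

      // Hypothetical: the time (in seconds) at which each mark above is
      // reached; the email does not specify how the page obtains these.
      var markTimes = [ { time: 0.0, mark: "w1" },
                        { time: 0.4, mark: "w2" } ];

      function highlightWord(mark) {
        // Stand-in for highlighting the word in a visual transcript.
        console.log("reached mark: " + mark);
      }

      // timeupdate fires periodically during playback (inherited from
      // HTMLMediaElement), so compare the position against the marks.
      tts.addEventListener("timeupdate", function () {
        while (markTimes.length && tts.currentTime >= markTimes[0].time) {
          highlightWord(markTimes.shift().mark);
        }
      });

      tts.play();
    </script>

Marc's third bullet, interrupting only at a sensible point, would use the
same mechanism: wait until the next sentence-final mark fires, then call
tts.pause().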
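On the streaming question, a rough Node-style JavaScript sketch of the
chunked-transfer suggestion; getSentences and synthesizeSentence are
hypothetical stand-ins for request parsing and per-sentence synthesis.
The point is simply that omitting Content-Length lets each sentence's
audio go out as soon as it is ready:

    var http = require("http");

    // Hypothetical stand-ins: a real server would parse the request and
    // call an actual synthesizer returning audio for one sentence.
    function getSentences(req) { return ["One.", "Two.", "Three."]; }
    function synthesizeSentence(s) { return Buffer.from(s); }

    http.createServer(function (req, res) {
      // No Content-Length header, so the response is sent with
      // Transfer-Encoding: chunked and each write() is flushed as its
      // own chunk, letting playback start before synthesis finishes.
      res.writeHead(200, { "Content-Type": "audio/mpeg" });
      getSentences(req).forEach(function (sentence) {
        res.write(synthesizeSentence(sentence));
      });
      res.end();
    }).listen(59125);

Nothing in <audio> or <tts> has to change for this; the element just
sees a media resource that keeps arriving.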
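For reference, the setup Marc describes at the top of the thread, a GET
request feeding an HTML 5 <audio> tag, amounts to something like the
snippet below; the query string is invented for illustration and is not
MARY's actual interface:

    <!-- Hypothetical query string; the server synthesizes the text and
         returns audio, which <audio> fetches with a plain GET. -->
    <audio controls
           src="http://mary.dfki.de:59125/process?text=Hello%20world">
    </audio>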
> >> BTW, I have seen a number of emails on the whatwg list and here
> >> taking an a priori stance on the question of whether ASR (and TTS)
> >> would happen in the browser ("user agent", you guys seem to call it)
> >> or on the server. I don't think the choice is a priori clear; I am
> >> sure there are good use cases for either choice. The question is
> >> whether there is a way to cater for both in an HTML speech API...
> >
> > The proposal leaves the choice of client or server synthesis
> > completely up to the browser. The web app just provides the text or
> > SSML to synthesize. The browser may even use both client- and
> > server-side synthesis, for example using a server-side synthesizer
> > for languages that the client-side one doesn't support, or using a
> > simple client-side synthesizer as a fallback if the network
> > connection fails.

--
Best Regards,
--raman

Title:  Research Scientist
Email:  raman@google.com
WWW:    http://emacspeak.sf.net/raman/
Google: tv+raman
GTalk:  raman@google.com
PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
Received on Thursday, 9 September 2010 16:42:10 UTC