Re: [HTML Speech] Text to speech from Olli Pettay on 2010-09-09 (public-xg-htmlspeech@w3.org from September 2010)

From: Olli Pettay <Olli.Pettay@helsinki.fi>
Date: Thu, 09 Sep 2010 23:06:11 +0300
To: Bjorn Bringert <bringert@google.com>
CC: Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
Message-ID: <4C893E33.40700@helsinki.fi>
On 09/09/2010 06:56 PM, Bjorn Bringert wrote:
> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder<marc.schroeder@dfki.de>  wrote:
>> Hi all,
>>
>> let me try and bring the TTS topic into the discussion.
>>
>> I am the core developer of DFKI's open source MARY TTS platform
>> http://mary.dfki.de/, written in pure Java. Our TTS server provides an HTTP
>> based interface with a simple AJAX user frontend (which you can try at
>> http://mary.dfki.de:59125/); we are currently sending synthesis results via
>> a GET request into an HTML 5<audio>  tag, which works (in Firefox 3.5+) but
>> seems suboptimal in some ways.
>
> I was just going to send out this TTS proposal:
> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>
> The basic idea is to add a<tts>  element which extends
> HTMLMediaElement (like<audio>  and<video>  do). I think that it
> addresses most of the points that you bring up, see below.

Why do we need a new element?
If we had a proper JS API, it could be just something like
TTS.play(someElement); and that would synthesize someElement.textContent.
TTS object could support queuing and events, play, pause, etc.

-Olli


>
>
>> I think<audio>  is suboptimal even for server-side TTS, for the following
>> reasons/requirements:
>>
>> *<audio>  provides no temporal structure of the synthesised speech. One
>> feature that you often need is to know the time at which a given word is
>> spoken, e.g.,
>>   - to highlight the word in a visual rendition of the speech;
>>   - to synchronize with other modalities in a multimodal presentation (think
>> of an arrow appearing in a picture when a deictic is used -- "THIS person",
>> or of a talking head, or gesture animation in avatars);
>>   - to know when to interrupt (you might not want to cut off the speech in
>> the middle of a sentence)
>
> The web app is notified when SSML<mark>  events are reached (using the
> HTMLMediaElement timeupdate event).
>
>
>> * For longer stretches of spoken output, it is not obvious to me how to do
>> "streaming" with an<audio>  tag. Let's say a TTS can process one sentence at
>> a time, and is requested to read an email consisting of three paragraphs. At
>> the moment we would have to render the full email on the server before
>> sending the result, which prolongs time-to-audio much more than necessary,
>> for a simple transport/scheduling reason: IIRC, we need to indicate the
>> Content-Length when sending the response, or else the audio wouldn't be
>> played...
>
> While this is outside the scope of the proposal (since the proposal
> doesn't specify how the browser talks to the synthesizer), streaming
> from a server-side synthesizer can be done with chunked transfer
> encoding.
>
>
>> * There are certain properties of speech output that could be provided in an
>> API, such as gender of the voice, language of the text to be spoken,
>> preferred pronounciations, etc. -- of course SSML comes to mind
>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations for reaching
>> Recommendation status, Dan!)
>
> SSML documents can be used as the source in<tts>, so all these
> parameters are supported.
>
>
>> BTW, I have seen a number of emails on the whatwg list and here taking an a
>> priori stance regarding the question whether ASR (and TTS) would happen in
>> the browser ("user agent", you guys seem to call it) or on the server. I
>> don't think the choice is a priori clear, I am sure there are good use cases
>> for either choice. The question is whether there is a way to cater for both
>> in an HTML speech API...
>
> The proposal leaves the choice of client or server synthesis
> completely up to the browser. The web app just provides the text or
> SSML to synthesize. The browser may even use both client- and
> server-side synthesis, for example using a server-side synthesizer for
> languages that the client-side one doesn't support, or using a simple
> client-side synthesizer as a fallback if the network connection fails.
>
>
Received on Thursday, 9 September 2010 20:06:49 UTC