Re: [HTML Speech] Text to speech

On Thu, Sep 9, 2010 at 9:06 PM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
> On 09/09/2010 06:56 PM, Bjorn Bringert wrote:
>>
>> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de>
>> wrote:
>>>
>>> Hi all,
>>>
>>> let me try and bring the TTS topic into the discussion.
>>>
>>> I am the core developer of DFKI's open source MARY TTS platform
>>> http://mary.dfki.de/, written in pure Java. Our TTS server provides an
>>> HTTP-based interface with a simple AJAX user frontend (which you can try
>>> at http://mary.dfki.de:59125/); we are currently sending synthesis
>>> results via a GET request into an HTML 5 <audio> tag, which works (in
>>> Firefox 3.5+) but seems suboptimal in some ways.
>>
>> I was just going to send out this TTS proposal:
>> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>>
>> The basic idea is to add a <tts> element which extends
>> HTMLMediaElement (like <audio> and <video> do). I think that it
>> addresses most of the points that you bring up, see below.
>
> Why do we need a new element?
> If we had a proper JS API, it could be just something like
> TTS.play(someElement); and that would synthesize someElement.textContent.
> TTS object could support queuing and events, play, pause, etc.

Sure, that would work too. But why introduce new APIs when
HTMLMediaElement has pretty much all that's needed? Adding a JS API
would require adding all the methods for playing, pausing, looping,
autobuffering, getting events, changing source, etc. HTMLMediaElement
already has APIs for all that. It really just boils down to the choice
between HTML and JavaScript I guess, and adding a <tts> element seemed
most in line with HTML5.

Also, HTMLMediaElement allows showing UI controls by just setting an
attribute. If it were solely a JavaScript API, web app developers
would have to build their own control UIs.
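
To make this concrete, here is roughly what I have in mind (the element
name is from the proposal; everything else is just the inherited
HTMLMediaElement behaviour, so treat it as a sketch rather than final
syntax):

  <!-- Declarative, with the browser-provided controls: -->
  <tts src="message.ssml" controls></tts>

  <!-- Or driven from script through the HTMLMediaElement API: -->
  <script>
    var tts = document.createElement('tts');
    tts.src = 'message.ssml';   // SSML document or plain text
    tts.autoplay = true;
    tts.addEventListener('ended', function () {
      // playback of the synthesized speech has finished
    }, false);
    document.body.appendChild(tts);
  </script>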


>>> I think <audio> is suboptimal even for server-side TTS, for the following
>>> reasons/requirements:
>>>
>>> * <audio> provides no temporal structure of the synthesised speech. One
>>> feature that you often need is to know the time at which a given word is
>>> spoken, e.g.,
>>>  - to highlight the word in a visual rendition of the speech;
>>>  - to synchronize with other modalities in a multimodal presentation
>>> (think of an arrow appearing in a picture when a deictic is used --
>>> "THIS person", or of a talking head, or gesture animation in avatars);
>>>  - to know when to interrupt (you might not want to cut off the speech in
>>> the middle of a sentence)
>>
>> The web app is notified when SSML <mark> events are reached (using the
>> HTMLMediaElement timeupdate event).
>>
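
For example (a rough sketch -- "lastMark" is only a placeholder for
however the most recently reached mark gets exposed, which the proposal
would still need to pin down):

  <tts id="reader" src="message.ssml" autoplay></tts>
  <script>
    var tts = document.getElementById('reader');
    tts.addEventListener('timeupdate', function () {
      // fires as playback passes each <mark> in message.ssml, so the
      // app can highlight the current word or drive other modalities
      highlightWord(tts.lastMark);   // placeholder property; highlightWord()
                                     // is the app's own function
    }, false);
  </script>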
>>
>>> * For longer stretches of spoken output, it is not obvious to me how to
>>> do "streaming" with an <audio> tag. Let's say a TTS can process one
>>> sentence at a time, and is requested to read an email consisting of
>>> three paragraphs. At the moment we would have to render the full email
>>> on the server before sending the result, which prolongs time-to-audio
>>> much more than necessary, for a simple transport/scheduling reason:
>>> IIRC, we need to indicate the Content-Length when sending the response,
>>> or else the audio wouldn't be played...
>>
>> While this is outside the scope of the proposal (since the proposal
>> doesn't specify how the browser talks to the synthesizer), streaming
>> from a server-side synthesizer can be done with chunked transfer
>> encoding.
>>
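
For reference, such a response could look roughly like this on the wire
(media type and chunk sizes are made up for illustration):

  HTTP/1.1 200 OK
  Content-Type: audio/wav
  Transfer-Encoding: chunked

  2000
  ...audio data for the first synthesized sentence...
  1f40
  ...audio data for the next sentence...
  0

Each chunk is prefixed with its size in hex, and the terminating
zero-size chunk ends the stream, so no Content-Length has to be known
up front.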
>>
>>> * There are certain properties of speech output that could be provided
>>> in an API, such as gender of the voice, language of the text to be
>>> spoken, preferred pronunciations, etc. -- of course SSML comes to mind
>>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations for reaching
>>> Recommendation status, Dan!)
>>
>> SSML documents can be used as the source in <tts>, so all these
>> parameters are supported.
>>
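
For example, an SSML 1.1 document along these lines could be given as
the <tts> source to select the language and a voice gender (preferred
pronunciations could go into a <phoneme> element or a <lexicon>
reference in the same document):

  <?xml version="1.0" encoding="UTF-8"?>
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-GB">
    <voice gender="female">
      You have three new messages.
    </voice>
  </speak>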
>>
>>> BTW, I have seen a number of emails on the whatwg list and here taking
>>> an a priori stance regarding the question whether ASR (and TTS) would
>>> happen in the browser ("user agent", you guys seem to call it) or on the
>>> server. I don't think the choice is a priori clear; I am sure there are
>>> good use cases for either choice. The question is whether there is a way
>>> to cater for both in an HTML speech API...
>>
>> The proposal leaves the choice of client or server synthesis
>> completely up to the browser. The web app just provides the text or
>> SSML to synthesize. The browser may even use both client- and
>> server-side synthesis, for example using a server-side synthesizer for
>> languages that the client-side one doesn't support, or using a simple
>> client-side synthesizer as a fallback if the network connection fails.

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Thursday, 9 September 2010 20:20:12 UTC