- From: David Bolter <david.bolter@gmail.com>
- Date: Thu, 09 Sep 2010 13:52:22 -0400
- To: Satish Sampath <satish@google.com>
- CC: Bjorn Bringert <bringert@google.com>, Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
Indeed it is! Thanks Satish and T.V., and sorry all for my confusion.

cheers,
David

On 09/09/10 12:27 PM, Satish Sampath wrote:
> Hi David,
>
> Speech synthesis is part of this incubator group's scope, as mentioned
> in the charter (http://www.w3.org/2005/Incubator/htmlspeech/charter).
>
> Cheers
> Satish
>
>
> On Thu, Sep 9, 2010 at 5:25 PM, David Bolter <david.bolter@gmail.com> wrote:
>> Hi all,
>>
>> I am actually more interested in TTS than speech recognition; however, I
>> wasn't aware that TTS is in scope for this group? Perhaps Dan can clarify.
>>
>> cheers,
>> David
>>
>> On 09/09/10 11:56 AM, Bjorn Bringert wrote:
>>> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>>> Hi all,
>>>>
>>>> let me try and bring the TTS topic into the discussion.
>>>>
>>>> I am the core developer of DFKI's open source MARY TTS platform
>>>> (http://mary.dfki.de/), written in pure Java. Our TTS server provides an
>>>> HTTP-based interface with a simple AJAX user frontend (which you can try
>>>> at http://mary.dfki.de:59125/); we are currently sending synthesis results
>>>> via a GET request into an HTML5 <audio> tag, which works (in Firefox 3.5+)
>>>> but seems suboptimal in some ways.
>>>
>>> I was just going to send out this TTS proposal:
>>> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>>>
>>> The basic idea is to add a <tts> element which extends HTMLMediaElement
>>> (like <audio> and <video> do). I think that it addresses most of the points
>>> that you bring up; see below.
>>>
>>>> I think <audio> is suboptimal even for server-side TTS, for the following
>>>> reasons/requirements:
>>>>
>>>> * <audio> provides no temporal structure of the synthesised speech. One
>>>> feature that you often need is to know the time at which a given word is
>>>> spoken, e.g.,
>>>> - to highlight the word in a visual rendition of the speech;
>>>> - to synchronize with other modalities in a multimodal presentation
>>>> (think of an arrow appearing in a picture when a deictic is used -- "THIS
>>>> person" -- or of a talking head, or gesture animation in avatars);
>>>> - to know when to interrupt (you might not want to cut off the speech in
>>>> the middle of a sentence).
>>>
>>> The web app is notified when SSML <mark> events are reached (using the
>>> HTMLMediaElement timeupdate event).
>>>
>>>> * For longer stretches of spoken output, it is not obvious to me how to
>>>> do "streaming" with an <audio> tag. Let's say a TTS engine can process one
>>>> sentence at a time, and is requested to read an email consisting of three
>>>> paragraphs. At the moment we would have to render the full email on the
>>>> server before sending the result, which prolongs time-to-audio much more
>>>> than necessary, for a simple transport/scheduling reason: IIRC, we need to
>>>> indicate the Content-Length when sending the response, or else the audio
>>>> wouldn't be played...
>>>
>>> While this is outside the scope of the proposal (since the proposal doesn't
>>> specify how the browser talks to the synthesizer), streaming from a
>>> server-side synthesizer can be done with chunked transfer encoding.
>>>
>>>> * There are certain properties of speech output that could be provided in
>>>> an API, such as the gender of the voice, the language of the text to be
>>>> spoken, preferred pronunciations, etc. -- of course SSML comes to mind
>>>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations on reaching
>>>> Recommendation status, Dan!)
>>>
>>> SSML documents can be used as the source in <tts>, so all these parameters
>>> are supported.
>>>
>>>> BTW, I have seen a number of emails on the whatwg list and here taking an
>>>> a priori stance on the question of whether ASR (and TTS) would happen in
>>>> the browser ("user agent", you guys seem to call it) or on the server. I
>>>> don't think the choice is a priori clear; I am sure there are good use
>>>> cases for either choice. The question is whether there is a way to cater
>>>> for both in an HTML speech API...
>>>
>>> The proposal leaves the choice of client or server synthesis completely up
>>> to the browser. The web app just provides the text or SSML to synthesize.
>>> The browser may even use both client- and server-side synthesis, for
>>> example using a server-side synthesizer for languages that the client-side
>>> one doesn't support, or using a simple client-side synthesizer as a
>>> fallback if the network connection fails.
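To make the mechanism described above concrete, here is a minimal JavaScript
sketch of how a page might drive the proposed <tts> element and react to SSML
<mark>s. Only what Bjorn states in the thread (a <tts> element extending
HTMLMediaElement, SSML as a source, and mark notification through the
timeupdate event) is taken from the proposal; the concrete attribute values,
the way the reached mark is exposed, and the helper function are assumptions
for illustration, with the authoritative details in the linked proposal
document.

    // Illustrative sketch, not the proposal's actual API.
    var tts = document.createElement('tts');   // behaves like <audio>/<video>
    tts.src = 'message.ssml';                  // hypothetical SSML document with <mark name="..."/> elements
    tts.addEventListener('timeupdate', function () {
      // Per the proposal, the page is notified via timeupdate when an SSML
      // <mark> is reached; mapping the current position back to a word is
      // left to the application in this sketch.
      highlightWordAt(tts.currentTime);        // hypothetical application callback
    });
    document.body.appendChild(tts);
    tts.play();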
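The chunked-transfer answer to the streaming question can be sketched in the
same spirit. This assumes a Node.js-style HTTP server and entirely hypothetical
textFor / splitIntoSentences / synthesizeSentence helpers standing in for a
real TTS engine; the only point being illustrated is that omitting
Content-Length lets each sentence's audio be flushed as soon as it has been
synthesized, rather than after the whole text has been rendered
(container-format details are glossed over).

    // Hypothetical sketch of server-side streaming via chunked transfer encoding.
    var http = require('http');
    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'audio/x-wav' });
      // No Content-Length header, so the response goes out with
      // Transfer-Encoding: chunked; each write() below is delivered to the
      // client without waiting for the remaining sentences.
      splitIntoSentences(textFor(req)).forEach(function (sentence) {
        res.write(synthesizeSentence(sentence));  // one audio chunk per sentence
      });
      res.end();
    }).listen(8080);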
Received on Thursday, 9 September 2010 17:52:59 UTC