- From: Satish Sampath <satish@google.com>
- Date: Thu, 9 Sep 2010 17:27:56 +0100
- To: David Bolter <david.bolter@gmail.com>
- Cc: Bjorn Bringert <bringert@google.com>, Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
Hi David,

Speech synthesis is part of this incubator group's scope, as mentioned in the charter (http://www.w3.org/2005/Incubator/htmlspeech/charter).

Cheers
Satish

On Thu, Sep 9, 2010 at 5:25 PM, David Bolter <david.bolter@gmail.com> wrote:
> Hi all,
>
> I am actually more interested in TTS than speech recognition; however, I wasn't
> aware TTS is in scope for this group? Perhaps Dan can clarify.
>
> cheers,
> David
>
> On 09/09/10 11:56 AM, Bjorn Bringert wrote:
>>
>> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>>
>>> Hi all,
>>>
>>> let me try and bring the TTS topic into the discussion.
>>>
>>> I am the core developer of DFKI's open source MARY TTS platform
>>> (http://mary.dfki.de/), written in pure Java. Our TTS server provides an
>>> HTTP-based interface with a simple AJAX user frontend (which you can try at
>>> http://mary.dfki.de:59125/); we are currently sending synthesis results via
>>> a GET request into an HTML 5 <audio> tag, which works (in Firefox 3.5+) but
>>> seems suboptimal in some ways.
>>
>> I was just going to send out this TTS proposal:
>> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>>
>> The basic idea is to add a <tts> element which extends HTMLMediaElement
>> (like <audio> and <video> do). I think that it addresses most of the
>> points that you bring up; see below.
>>
>>> I think <audio> is suboptimal even for server-side TTS, for the following
>>> reasons/requirements:
>>>
>>> * <audio> provides no temporal structure of the synthesised speech. One
>>> feature that you often need is to know the time at which a given word is
>>> spoken, e.g.:
>>> - to highlight the word in a visual rendition of the speech;
>>> - to synchronize with other modalities in a multimodal presentation (think
>>> of an arrow appearing in a picture when a deictic is used -- "THIS person",
>>> or of a talking head, or gesture animation in avatars);
>>> - to know when to interrupt (you might not want to cut off the speech in
>>> the middle of a sentence).
>>
>> The web app is notified when SSML <mark> events are reached (using the
>> HTMLMediaElement timeupdate event).
>>
>>> * For longer stretches of spoken output, it is not obvious to me how to do
>>> "streaming" with an <audio> tag. Let's say a TTS engine can process one
>>> sentence at a time and is requested to read an email consisting of three
>>> paragraphs. At the moment we would have to render the full email on the
>>> server before sending the result, which prolongs time-to-audio much more
>>> than necessary, for a simple transport/scheduling reason: IIRC, we need to
>>> indicate the Content-Length when sending the response, or else the audio
>>> wouldn't be played...
>>
>> While this is outside the scope of the proposal (since the proposal
>> doesn't specify how the browser talks to the synthesizer), streaming
>> from a server-side synthesizer can be done with chunked transfer
>> encoding.
>>
>>> * There are certain properties of speech output that could be provided in
>>> an API, such as the gender of the voice, the language of the text to be
>>> spoken, preferred pronunciations, etc. -- of course SSML comes to mind
>>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations on reaching
>>> Recommendation status, Dan!)
>>
>> SSML documents can be used as the source in <tts>, so all these
>> parameters are supported.
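
A rough usage sketch of the proposed <tts> element, assuming it inherits the usual HTMLMediaElement surface (src, play(), currentTime, the timeupdate event) as described above; the file name and the "lastMark" property are illustrative placeholders only, not part of the proposal document:

  <tts id="reader" src="message.ssml"></tts>

  <script>
    var tts = document.getElementById('reader');

    // Start speaking, as with any other media element.
    tts.play();

    // Per the proposal, the web app is notified via the standard
    // HTMLMediaElement "timeupdate" event when SSML <mark> elements are
    // reached. How the current mark is exposed is not specified here;
    // "lastMark" is just a placeholder name for illustration.
    tts.addEventListener('timeupdate', function () {
      console.log('time ' + tts.currentTime + ', last mark: ' + tts.lastMark);
    });
  </script>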
>>> BTW, I have seen a number of emails on the whatwg list and here taking an
>>> a priori stance on the question of whether ASR (and TTS) would happen in
>>> the browser ("user agent", you guys seem to call it) or on the server. I
>>> don't think the choice is a priori clear; I am sure there are good use
>>> cases for either choice. The question is whether there is a way to cater
>>> for both in an HTML speech API...
>>
>> The proposal leaves the choice of client or server synthesis completely
>> up to the browser. The web app just provides the text or SSML to
>> synthesize. The browser may even use both client- and server-side
>> synthesis, for example using a server-side synthesizer for languages
>> that the client-side one doesn't support, or using a simple client-side
>> synthesizer as a fallback if the network connection fails.
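
And a minimal Node.js sketch of the chunked-transfer streaming idea mentioned earlier in the thread, where the server synthesizes and sends one sentence at a time instead of buffering the whole utterance to compute a Content-Length up front; the port, the sentence splitter, and the synthesizeSentence() stub are hypothetical placeholders, not the interface of any existing TTS server:

  var http = require('http');

  // Placeholder: a real server would call an actual synthesis engine here
  // (e.g. MARY TTS) and return a buffer of encoded audio for one sentence.
  function synthesizeSentence(sentence) {
    return Buffer.from([]); // no real audio in this sketch
  }

  // Naive sentence splitter, for illustration only.
  function splitIntoSentences(text) {
    return text.match(/[^.!?]+[.!?]*\s*/g) || [];
  }

  http.createServer(function (req, res) {
    var text = '';
    req.on('data', function (chunk) { text += chunk; });
    req.on('end', function () {
      // No Content-Length is set, so the response is sent with
      // "Transfer-Encoding: chunked" and each write() goes out immediately.
      res.writeHead(200, { 'Content-Type': 'audio/wav' });
      splitIntoSentences(text).forEach(function (sentence) {
        res.write(synthesizeSentence(sentence)); // one audio chunk per sentence
      });
      res.end();
    });
  }).listen(8080);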
Received on Thursday, 9 September 2010 16:28:26 UTC