- From: David Bolter <david.bolter@gmail.com>
- Date: Thu, 09 Sep 2010 13:52:22 -0400
- To: Satish Sampath <satish@google.com>
- CC: Bjorn Bringert <bringert@google.com>, Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
Indeed it is! Thanks Satish and T.V., and sorry all for my confusion.

cheers,
David

On 09/09/10 12:27 PM, Satish Sampath wrote:
> Hi David,
>
> Speech synthesis is part of this incubator group's scope, as mentioned
> in the charter (http://www.w3.org/2005/Incubator/htmlspeech/charter).
>
> Cheers
> Satish
>
>
> On Thu, Sep 9, 2010 at 5:25 PM, David Bolter <david.bolter@gmail.com> wrote:
>> Hi all,
>>
>> I am actually more interested in TTS than speech recognition; however, I
>> wasn't aware that TTS is in scope for this group? Perhaps Dan can clarify.
>>
>> cheers,
>> David
>>
>> On 09/09/10 11:56 AM, Bjorn Bringert wrote:
>>> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>>> Hi all,
>>>>
>>>> let me try and bring the TTS topic into the discussion.
>>>>
>>>> I am the core developer of DFKI's open source MARY TTS platform
>>>> (http://mary.dfki.de/), written in pure Java. Our TTS server provides an
>>>> HTTP-based interface with a simple AJAX user frontend (which you can try
>>>> at http://mary.dfki.de:59125/); we are currently sending synthesis results
>>>> via a GET request into an HTML5 <audio> tag, which works (in Firefox 3.5+)
>>>> but seems suboptimal in some ways.
>>>
>>> I was just going to send out this TTS proposal:
>>> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>>>
>>> The basic idea is to add a <tts> element which extends HTMLMediaElement
>>> (like <audio> and <video> do). I think that it addresses most of the points
>>> that you bring up; see below.
>>>
>>>> I think <audio> is suboptimal even for server-side TTS, for the following
>>>> reasons/requirements:
>>>>
>>>> * <audio> provides no temporal structure of the synthesised speech. One
>>>> feature that you often need is to know the time at which a given word is
>>>> spoken, e.g.,
>>>> - to highlight the word in a visual rendition of the speech;
>>>> - to synchronize with other modalities in a multimodal presentation
>>>> (think of an arrow appearing in a picture when a deictic is used -- "THIS
>>>> person" -- or of a talking head, or gesture animation in avatars);
>>>> - to know when to interrupt (you might not want to cut off the speech in
>>>> the middle of a sentence).
>>>
>>> The web app is notified when SSML <mark> events are reached (using the
>>> HTMLMediaElement timeupdate event).
>>>
>>>> * For longer stretches of spoken output, it is not obvious to me how to
>>>> do "streaming" with an <audio> tag. Let's say a TTS engine can process one
>>>> sentence at a time, and is requested to read an email consisting of three
>>>> paragraphs. At the moment we would have to render the full email on the
>>>> server before sending the result, which prolongs time-to-audio much more
>>>> than necessary, for a simple transport/scheduling reason: IIRC, we need to
>>>> indicate the Content-Length when sending the response, or else the audio
>>>> wouldn't be played...
>>>
>>> While this is outside the scope of the proposal (since the proposal doesn't
>>> specify how the browser talks to the synthesizer), streaming from a
>>> server-side synthesizer can be done with chunked transfer encoding.
>>>
>>>> * There are certain properties of speech output that could be provided in
>>>> an API, such as the gender of the voice, the language of the text to be
>>>> spoken, preferred pronunciations, etc. -- of course SSML comes to mind
>>>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations on reaching
>>>> Recommendation status, Dan!)
>>>
>>> SSML documents can be used as the source in <tts>, so all these parameters
>>> are supported.
>>>
>>>> BTW, I have seen a number of emails on the whatwg list and here taking an
>>>> a priori stance on the question of whether ASR (and TTS) would happen in
>>>> the browser ("user agent", you guys seem to call it) or on the server. I
>>>> don't think the choice is a priori clear; I am sure there are good use
>>>> cases for either choice. The question is whether there is a way to cater
>>>> for both in an HTML speech API...
>>>
>>> The proposal leaves the choice of client or server synthesis completely up
>>> to the browser. The web app just provides the text or SSML to synthesize.
>>> The browser may even use both client- and server-side synthesis, for
>>> example using a server-side synthesizer for languages that the client-side
>>> one doesn't support, or using a simple client-side synthesizer as a
>>> fallback if the network connection fails.
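To make the mechanism described above concrete, here is a minimal JavaScript
sketch of how a page might drive the proposed <tts> element and react to SSML
<mark>s. Only what Bjorn states in the thread (a <tts> element extending
HTMLMediaElement, SSML as a source, and mark notification through the
timeupdate event) is taken from the proposal; the concrete attribute values,
the way the reached mark is exposed, and the helper function are assumptions
for illustration, with the authoritative details in the linked proposal
document.

    // Illustrative sketch, not the proposal's actual API.
    var tts = document.createElement('tts');   // behaves like <audio>/<video>
    tts.src = 'message.ssml';                  // hypothetical SSML document with <mark name="..."/> elements
    tts.addEventListener('timeupdate', function () {
      // Per the proposal, the page is notified via timeupdate when an SSML
      // <mark> is reached; mapping the current position back to a word is
      // left to the application in this sketch.
      highlightWordAt(tts.currentTime);        // hypothetical application callback
    });
    document.body.appendChild(tts);
    tts.play();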
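The chunked-transfer answer to the streaming question can be sketched in the
same spirit. This assumes a Node.js-style HTTP server and entirely hypothetical
textFor / splitIntoSentences / synthesizeSentence helpers standing in for a
real TTS engine; the only point being illustrated is that omitting
Content-Length lets each sentence's audio be flushed as soon as it has been
synthesized, rather than after the whole text has been rendered
(container-format details are glossed over).

    // Hypothetical sketch of server-side streaming via chunked transfer encoding.
    var http = require('http');
    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'audio/x-wav' });
      // No Content-Length header, so the response goes out with
      // Transfer-Encoding: chunked; each write() below is delivered to the
      // client without waiting for the remaining sentences.
      splitIntoSentences(textFor(req)).forEach(function (sentence) {
        res.write(synthesizeSentence(sentence));  // one audio chunk per sentence
      });
      res.end();
    }).listen(8080);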
Received on Thursday, 9 September 2010 17:52:59 UTC