- From: Satish Sampath <satish@google.com>
- Date: Thu, 9 Sep 2010 17:27:56 +0100
- To: David Bolter <david.bolter@gmail.com>
- Cc: Bjorn Bringert <bringert@google.com>, Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
Hi David,

Speech synthesis is part of this incubator group's scope, as mentioned in the charter (http://www.w3.org/2005/Incubator/htmlspeech/charter).

Cheers
Satish

On Thu, Sep 9, 2010 at 5:25 PM, David Bolter <david.bolter@gmail.com> wrote:
> Hi all,
>
> I am actually more interested in TTS than speech recognition; however, I wasn't
> aware TTS is in scope for this group? Perhaps Dan can clarify.
>
> cheers,
> David
>
> On 09/09/10 11:56 AM, Bjorn Bringert wrote:
>>
>> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>>>
>>> Hi all,
>>>
>>> let me try and bring the TTS topic into the discussion.
>>>
>>> I am the core developer of DFKI's open source MARY TTS platform
>>> (http://mary.dfki.de/), written in pure Java. Our TTS server provides an
>>> HTTP-based interface with a simple AJAX user frontend (which you can try at
>>> http://mary.dfki.de:59125/); we are currently sending synthesis results via
>>> a GET request into an HTML 5 <audio> tag, which works (in Firefox 3.5+) but
>>> seems suboptimal in some ways.
>>
>> I was just going to send out this TTS proposal:
>> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>>
>> The basic idea is to add a <tts> element which extends HTMLMediaElement
>> (like <audio> and <video> do). I think that it addresses most of the
>> points that you bring up; see below.
>>
>>> I think <audio> is suboptimal even for server-side TTS, for the following
>>> reasons/requirements:
>>>
>>> * <audio> provides no temporal structure of the synthesised speech. One
>>> feature that you often need is to know the time at which a given word is
>>> spoken, e.g.:
>>> - to highlight the word in a visual rendition of the speech;
>>> - to synchronize with other modalities in a multimodal presentation (think
>>> of an arrow appearing in a picture when a deictic is used -- "THIS person",
>>> or of a talking head, or gesture animation in avatars);
>>> - to know when to interrupt (you might not want to cut off the speech in
>>> the middle of a sentence).
>>
>> The web app is notified when SSML <mark> events are reached (using the
>> HTMLMediaElement timeupdate event).
>>
>>> * For longer stretches of spoken output, it is not obvious to me how to do
>>> "streaming" with an <audio> tag. Let's say a TTS engine can process one
>>> sentence at a time and is requested to read an email consisting of three
>>> paragraphs. At the moment we would have to render the full email on the
>>> server before sending the result, which prolongs time-to-audio much more
>>> than necessary, for a simple transport/scheduling reason: IIRC, we need to
>>> indicate the Content-Length when sending the response, or else the audio
>>> wouldn't be played...
>>
>> While this is outside the scope of the proposal (since the proposal
>> doesn't specify how the browser talks to the synthesizer), streaming
>> from a server-side synthesizer can be done with chunked transfer
>> encoding.
>>
>>> * There are certain properties of speech output that could be provided in
>>> an API, such as the gender of the voice, the language of the text to be
>>> spoken, preferred pronunciations, etc. -- of course SSML comes to mind
>>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations on reaching
>>> Recommendation status, Dan!)
>>
>> SSML documents can be used as the source in <tts>, so all these
>> parameters are supported.
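
A rough usage sketch of the proposed <tts> element, assuming it inherits the usual HTMLMediaElement surface (src, play(), currentTime, the timeupdate event) as described above; the file name and the "lastMark" property are illustrative placeholders only, not part of the proposal document:

  <tts id="reader" src="message.ssml"></tts>

  <script>
    var tts = document.getElementById('reader');

    // Start speaking, as with any other media element.
    tts.play();

    // Per the proposal, the web app is notified via the standard
    // HTMLMediaElement "timeupdate" event when SSML <mark> elements are
    // reached. How the current mark is exposed is not specified here;
    // "lastMark" is just a placeholder name for illustration.
    tts.addEventListener('timeupdate', function () {
      console.log('time ' + tts.currentTime + ', last mark: ' + tts.lastMark);
    });
  </script>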
>>> BTW, I have seen a number of emails on the whatwg list and here taking an
>>> a priori stance on the question of whether ASR (and TTS) would happen in
>>> the browser ("user agent", you guys seem to call it) or on the server. I
>>> don't think the choice is a priori clear; I am sure there are good use
>>> cases for either choice. The question is whether there is a way to cater
>>> for both in an HTML speech API...
>>
>> The proposal leaves the choice of client or server synthesis completely
>> up to the browser. The web app just provides the text or SSML to
>> synthesize. The browser may even use both client- and server-side
>> synthesis, for example using a server-side synthesizer for languages
>> that the client-side one doesn't support, or using a simple client-side
>> synthesizer as a fallback if the network connection fails.
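
And a minimal Node.js sketch of the chunked-transfer streaming idea mentioned earlier in the thread, where the server synthesizes and sends one sentence at a time instead of buffering the whole utterance to compute a Content-Length up front; the port, the sentence splitter, and the synthesizeSentence() stub are hypothetical placeholders, not the interface of any existing TTS server:

  var http = require('http');

  // Placeholder: a real server would call an actual synthesis engine here
  // (e.g. MARY TTS) and return a buffer of encoded audio for one sentence.
  function synthesizeSentence(sentence) {
    return Buffer.from([]); // no real audio in this sketch
  }

  // Naive sentence splitter, for illustration only.
  function splitIntoSentences(text) {
    return text.match(/[^.!?]+[.!?]*\s*/g) || [];
  }

  http.createServer(function (req, res) {
    var text = '';
    req.on('data', function (chunk) { text += chunk; });
    req.on('end', function () {
      // No Content-Length is set, so the response is sent with
      // "Transfer-Encoding: chunked" and each write() goes out immediately.
      res.writeHead(200, { 'Content-Type': 'audio/wav' });
      splitIntoSentences(text).forEach(function (sentence) {
        res.write(synthesizeSentence(sentence)); // one audio chunk per sentence
      });
      res.end();
    });
  }).listen(8080);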
Received on Thursday, 9 September 2010 16:28:26 UTC