- From: Olli Pettay <Olli.Pettay@helsinki.fi>
- Date: Thu, 09 Sep 2010 23:06:11 +0300
- To: Bjorn Bringert <bringert@google.com>
- CC: Marc Schroeder <marc.schroeder@dfki.de>, public-xg-htmlspeech@w3.org
On 09/09/2010 06:56 PM, Bjorn Bringert wrote:
> On Thu, Sep 9, 2010 at 4:17 PM, Marc Schroeder <marc.schroeder@dfki.de> wrote:
>> Hi all,
>>
>> let me try and bring the TTS topic into the discussion.
>>
>> I am the core developer of DFKI's open source MARY TTS platform
>> http://mary.dfki.de/, written in pure Java. Our TTS server provides an HTTP
>> based interface with a simple AJAX user frontend (which you can try at
>> http://mary.dfki.de:59125/); we are currently sending synthesis results via
>> a GET request into an HTML 5 <audio> tag, which works (in Firefox 3.5+) but
>> seems suboptimal in some ways.
>
> I was just going to send out this TTS proposal:
> http://docs.google.com/View?id=dcfg79pz_4gnmp96cz
>
> The basic idea is to add a <tts> element which extends
> HTMLMediaElement (like <audio> and <video> do). I think that it
> addresses most of the points that you bring up, see below.

Why do we need a new element? If we had a proper JS API, it could be
just something like

  TTS.play(someElement);

and that would synthesize someElement.textContent. The TTS object could
support queuing and events, play, pause, etc. (Rough sketches of such an
API, and of the <mark> and streaming points below, follow the quoted
thread.)

-Olli

>> I think <audio> is suboptimal even for server-side TTS, for the following
>> reasons/requirements:
>>
>> * <audio> provides no temporal structure of the synthesised speech. One
>> feature that you often need is to know the time at which a given word is
>> spoken, e.g.,
>> - to highlight the word in a visual rendition of the speech;
>> - to synchronize with other modalities in a multimodal presentation (think
>> of an arrow appearing in a picture when a deictic is used -- "THIS person",
>> or of a talking head, or gesture animation in avatars);
>> - to know when to interrupt (you might not want to cut off the speech in
>> the middle of a sentence)
>
> The web app is notified when SSML <mark> events are reached (using the
> HTMLMediaElement timeupdate event).
>
>> * For longer stretches of spoken output, it is not obvious to me how to do
>> "streaming" with an <audio> tag. Let's say a TTS can process one sentence
>> at a time, and is requested to read an email consisting of three
>> paragraphs. At the moment we would have to render the full email on the
>> server before sending the result, which prolongs time-to-audio much more
>> than necessary, for a simple transport/scheduling reason: IIRC, we need to
>> indicate the Content-Length when sending the response, or else the audio
>> wouldn't be played...
>
> While this is outside the scope of the proposal (since the proposal
> doesn't specify how the browser talks to the synthesizer), streaming
> from a server-side synthesizer can be done with chunked transfer
> encoding.
>
>> * There are certain properties of speech output that could be provided in
>> an API, such as gender of the voice, language of the text to be spoken,
>> preferred pronunciations, etc. -- of course SSML comes to mind
>> (http://www.w3.org/TR/speech-synthesis11/ -- congratulations for reaching
>> Recommendation status, Dan!)
>
> SSML documents can be used as the source in <tts>, so all these
> parameters are supported.
>
>> BTW, I have seen a number of emails on the whatwg list and here taking an
>> a priori stance regarding the question whether ASR (and TTS) would happen
>> in the browser ("user agent", you guys seem to call it) or on the server.
>> I don't think the choice is a priori clear; I am sure there are good use
>> cases for either choice. The question is whether there is a way to cater
>> for both in an HTML speech API...
>
> The proposal leaves the choice of client or server synthesis
> completely up to the browser. The web app just provides the text or
> SSML to synthesize. The browser may even use both client- and
> server-side synthesis, for example using a server-side synthesizer for
> languages that the client-side one doesn't support, or using a simple
> client-side synthesizer as a fallback if the network connection fails.
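To make the TTS object idea concrete, here is a rough, non-normative
sketch, emulated on top of <audio> and a server-side synthesizer. Every
name in it (the TTS object, its methods, the "/synthesize" endpoint) is
hypothetical -- nothing below is specified anywhere, and a real engine
such as MARY has its own HTTP parameters.

  var TTS = {
    _queue: [],
    _audio: new Audio(),

    // Synthesize an element's textContent, replacing anything queued.
    play: function (element) {
      this._queue = [];
      this._speak(element.textContent);
    },

    // Append an utterance; it plays when the current one ends.
    // (A real implementation would also track the paused-by-user case.)
    queue: function (element) {
      this._queue.push(element.textContent);
      if (this._audio.paused) this._speak(this._queue.shift());
    },

    pause:  function () { this._audio.pause(); },
    resume: function () { this._audio.play(); },

    _speak: function (text) {
      var self = this;
      this._audio.src = "/synthesize?text=" + encodeURIComponent(text);
      this._audio.onended = function () {
        if (self._queue.length > 0) self._speak(self._queue.shift());
      };
      this._audio.play();
    }
  };

  // Usage, as in the mail:
  TTS.play(document.getElementById("message"));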
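Bjorn's <mark> point could look roughly like this on the page side. How
<tts> would actually expose mark positions is defined by his proposal
(linked above), not here; the element lookup and the hard-coded mark
offsets below are illustrative assumptions only.

  var tts = document.querySelector("tts");   // hypothetical element
  var marks = [                              // assumed app-known offsets
    { time: 0.8, name: "this"   },
    { time: 1.2, name: "person" }
  ];
  var next = 0;

  function onMark(name) {
    // App-defined reaction, e.g. highlight the word or move an arrow.
    console.log("reached mark: " + name);
  }

  tts.addEventListener("timeupdate", function () {
    while (next < marks.length && tts.currentTime >= marks[next].time) {
      onMark(marks[next].name);
      next++;
    }
  });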
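And the chunked-transfer point, as a minimal Node.js handler (the
endpoint, the query parameter, and synthesizeSentence() are stand-ins,
not part of any proposal). Because no Content-Length is set, the reply
goes out with Transfer-Encoding: chunked, so each sentence's audio
reaches the browser as soon as it is ready. The container must itself
be streamable, e.g. MP3 frames rather than a WAV whose header states
the total length.

  var http = require("http");
  var url = require("url");

  // Stand-in for a real synthesizer: one sentence in, encoded audio out.
  function synthesizeSentence(sentence) {
    return Buffer.alloc(0);   // placeholder for real audio data
  }

  http.createServer(function (req, res) {
    var text = url.parse(req.url, true).query.text || "";
    res.writeHead(200, { "Content-Type": "audio/mpeg" });
    text.split(/[.!?]+\s*/).forEach(function (sentence) {
      if (sentence) res.write(synthesizeSentence(sentence));
    });
    res.end();
  }).listen(8080);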
Received on Thursday, 9 September 2010 20:06:49 UTC