- From: Dominic Mazzoni <dmazzoni@google.com>
- Date: Wed, 2 Nov 2011 23:22:39 -0700
- To: public-xg-htmlspeech@w3.org
- Message-ID: <CAFz-FYw7Ph2+w=MY2jFknvMn7fDz+GEdtXPCBPG9zJVQPE3nmQ@mail.gmail.com>
Hello,

My apologies for not joining this conversation sooner. I'm a Google Chrome developer working on accessibility, and I recently helped to author the Chrome TTS extension API that we launched a couple of months ago. If you haven't already seen it, the docs are here - http://code.google.com/chrome/extensions/tts.html - it has been live in Chrome since version 14, and there are a number of talking extensions and voices in the Chrome Web Store now.

I'd very much like for this extension API to be compatible with the proposed HTML TTS API; in fact, I'm hoping to help implement it in Chrome and for the two APIs to share a lot of code. Here are some comments and questions on the draft.

For TTS, I don't understand where the content to be spoken is supposed to go if it's not specified in the inner HTML. Are the only options <tts>Hello, world</tts>, which inserts undesired text in older browsers, or <tts src="text.ssml"/>, which forces me to put the text in a separate document or use a cumbersome data URL? The previous draft from a year ago had a value attribute, so I could write <tts value="Hello, world"/> - why was that dropped?

The spec for both reco and TTS now allows the developer to specify a service URL. Could you clarify what the value would be if the developer wishes to use a local (client-side) engine, if available? Some of the spec seems to assume a network speech implementation, but client-side reco and TTS are very much possible and quite desirable for applications that require extremely low latency - accessibility in particular.

Is there any possibility the spec could specify how a user agent could offer local TTS and reco, or give the user or developer a choice of multiple TTS or reco engines, which might be local or remote? Note that the Chrome TTS extension API has a way for the client to query the list of available voices and present the choice to the user, or pick one based on its characteristics (see the sketch below). We've implemented support for OS-native TTS, Native Client TTS, pure-JavaScript TTS (yes, it really works!), and server-based TTS.

I think it's particularly important that whenever possible the user, not the developer, should get to choose the TTS engine and voice. For accessibility, visually impaired users often prefer voices that can be sped up to incredible speeds of 2-3x normal, and low latency is also extremely important. Other users might only want to hear speech if the voice is incredibly realistic, and latency may not matter to them. Still others might prefer a voice that speaks with a particular accent - all male English voices are not interchangeable!

Android is a great example of what can happen when users can choose the TTS engine independently: there are dozens of third-party voices available supporting lots of languages, at a variety of prices. All of the voices are compatible with any Android app that uses the system TTS API, including screen readers, driving-direction apps, book readers, and more.

Right now the proposed spec implies that it's up to the developer to choose an appropriate engine, but ideally that'd be the exception rather than the rule - ideally the developer would just leave this absent, and the user agent would select the most appropriate speech engine based on the language, user preferences, status of the network, etc.
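For reference, here's a minimal sketch of how an app queries voices and speaks with the extension API today. The calls (chrome.tts.getVoices, chrome.tts.speak) are the real API; the voice name and the 2.5x rate are just illustrative values:

    chrome.tts.getVoices(function(voices) {
      for (var i = 0; i < voices.length; i++) {
        // Each voice reports its name, language, and whether it runs
        // remotely (network-based) or locally on the client.
        console.log(voices[i].voiceName, voices[i].lang, voices[i].remote);
      }
    });

    // Speak with a user-chosen voice at a fast rate, e.g. for a
    // screen-reader user who prefers 2.5x speech.
    chrome.tts.speak('Hello, world', {
      voiceName: 'SomeLocalVoice',  // placeholder, chosen by the user
      rate: 2.5
    });

The key design point is that the page never has to know which engine implements the voice - local, Native Client, JavaScript, and server-based voices all look the same through this interface.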
An earlier draft had the ability to set lastMark, but now it looks like it's read-only - is that correct? That actually may be easier to implement, because many speech engines don't support seeking to the middle of a speech stream without first synthesizing the whole thing.

When I posted the initial version of the TTS extension API on the chromium-extensions list, the primary feature request I got from developers was the ability to get sentence-, word-, and even phoneme-level callbacks, so that got added to the API before we launched it. Having callbacks at SSML marks is great, but many applications need to synchronize closely with the speech, and it seems really cumbersome and wasteful to have to add an SSML mark between every word in the source document when what the client really wants is just constant notification at the finest level of detail available. Any chance you could add a way to request more frequent callbacks?
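To illustrate, word-level events in the extension API look roughly like this (onEvent and event.charIndex are the real interface; highlightWordAt is a hypothetical helper in the app):

    chrome.tts.speak('This is a test of word callbacks.', {
      desiredEventTypes: ['word', 'sentence'],
      onEvent: function(event) {
        if (event.type == 'word') {
          // charIndex is the offset of the word about to be spoken,
          // so the app can highlight it in sync with the audio.
          highlightWordAt(event.charIndex);
        }
      }
    });

An equivalent hook in the HTML API would cover the common case without requiring authors to pepper their SSML with <mark> elements.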
Thanks very much for considering my thoughts.

- Dominic

Received on Thursday, 3 November 2011 06:23:16 UTC