- From: Bjorn Bringert <bringert@google.com>
- Date: Thu, 3 Nov 2011 14:03:38 +0000
- To: Dominic Mazzoni <dmazzoni@google.com>
- Cc: public-xg-htmlspeech@w3.org
On Thu, Nov 3, 2011 at 6:22 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:

> Hello,
>
> My apologies for not joining this conversation sooner. I'm a Google Chrome
> developer working on accessibility, and I recently helped to author the
> Chrome TTS extension API that we launched a couple of months ago. If you
> haven't already seen it, check out the docs here -
> http://code.google.com/chrome/extensions/tts.html - this has been live in
> Chrome since version 14, and there are a number of talking extensions and
> voices in the Chrome Web Store now. I'd very much like for this extension
> API to be compatible with the proposed HTML TTS API, and in fact I'm hoping
> to help implement it in Chrome and for the two APIs to share a lot of code.
>
> Here are some comments and questions on the draft.
>
> For TTS, I don't understand where the content to be spoken is supposed to
> go if it's not specified in the inner HTML. Are the only options to use
> <tts>Hello, world</tts>, which inserts undesired text in older browsers, or
> <tts src="text.ssml"/>, which forces me to put the text in a separate
> document or use a cumbersome data URL? The draft from a year ago that I had
> looked at previously had a value attribute, so I could write
> <tts value="Hello, world"/> - why was that deprecated?

I sent the same comment to this list yesterday. We should have value and
lang attributes to allow simple synthesis use cases.
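To make the use case concrete, here is a minimal sketch of what such markup
could look like, assuming the value and lang attributes proposed in this
thread are adopted (the attribute names come from the discussion here, not
from the current draft):

```html
<!-- Hypothetical markup, assuming the proposed value and lang attributes.
     There is no inner text, so older browsers render nothing, and no
     service is specified, so the user agent picks the engine. -->
<tts value="Hello, world" lang="en-US"/>
```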
> The spec for both reco and TTS now allows the user to specify a service
> URL. Could you clarify what the value would be if the developer wishes to
> use a local (client-side) engine, if available? Some of the spec seems to
> assume a network speech implementation, but client-side reco and TTS are
> very much possible and quite desirable for applications that require
> extremely low latency, like accessibility in particular. Is there any
> possibility the spec could specify how a user agent could choose to offer
> local TTS and reco, or to give the user or developer a choice of multiple
> TTS or reco engines, which might be local or remote?

Since the web app can't rely on any particular client-side engine being
installed, there is no explicit way to ask for a client-side engine.
However, if the app doesn't specify a service at all, it's up to the user
agent to select one. This could be a client-side engine, if one is
available. If the user agent wants, it can have a setting that lets the
user pick an engine to use as the default.

> Note that the Chrome TTS extension API has a way for the client to query
> the list of possible voices and present the choice to the user or choose
> one based on its characteristics. We've implemented support for OS-native
> TTS, Native Client TTS, pure-JavaScript TTS (yes, it really works!), and
> server-based TTS.
>
> I think it's particularly important that whenever possible the user, not
> the developer, should get to choose the TTS engine and voice. For
> accessibility, visually impaired users often prefer voices that can be
> sped up to incredible speeds of 2-3x normal, and low latency is also
> extremely important. Other users might only want to hear speech if the
> voice is incredibly realistic, and latency may not matter to them. Still
> others might prefer a voice that speaks with a particular accent - all
> male English voices are not interchangeable!
>
> Android is a great example of what can happen when users can choose the
> TTS engine independently - there are dozens of third-party voices
> available, supporting lots of languages, at a variety of prices. All of
> the voices are compatible with any Android app that uses the system TTS
> API, including screen readers, driving-direction apps, book readers, and
> more. Right now the proposed spec implies that it's up to the developer
> to choose an appropriate engine, but ideally that'd be the exception
> rather than the rule - ideally the developer would just leave this absent
> and the user agent would select the most appropriate speech engine based
> on the language, user preferences, status of the network, etc.

What you describe is how the API is designed to work. App-selected
services are for developers with special needs. Simple speech apps use
the user agent's default engine, which the user agent can let the user
select.

> An earlier draft had the ability to set lastMark, but now it looks like
> it's read-only - is that correct? That may actually be easier to
> implement, because many speech engines don't support seeking to the
> middle of a speech stream without first synthesizing the whole thing.

Yes, I think that lastMark is intentionally read-only.

> When I posted the initial version of the TTS extension API on the
> chromium-extensions list, the primary feature request I got from
> developers was the ability to get sentence-, word-, and even
> phoneme-level callbacks, so that got added to the API before we launched
> it. Having callbacks at SSML markers is great, but many applications
> require synchronizing closely with the speech, and it seems really
> cumbersome and wasteful to have to add an SSML mark between every word
> in the source document, when what the client really wants is just
> constant notification at the finest level of detail available. Any
> chance you could add a way to request more frequent callbacks?

Sounds reasonable. Some other people have brought that up in the past,
IIRC.
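For reference, this is roughly what voice selection and word-level callbacks
look like in the Chrome extension API described above: a minimal sketch
using the documented chrome.tts calls, assuming an extension context with
the "tts" permission.

```js
// Query the available voices so the user, not the developer, can pick one.
chrome.tts.getVoices(function(voices) {
  voices.forEach(function(voice) {
    console.log(voice.voiceName + ' (' + voice.lang + ')');
  });
});

// Speak with per-word callbacks. event.charIndex gives the position in
// the utterance - the information that would otherwise require an SSML
// mark between every word.
chrome.tts.speak('Hello, world', {
  lang: 'en-US',
  rate: 2.5,  // many accessibility users run voices at 2-3x normal speed
  onEvent: function(event) {
    if (event.type === 'word') {
      console.log('word boundary at character ' + event.charIndex);
    } else if (event.type === 'end') {
      console.log('finished speaking');
    }
  }
});
```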
--
Bjorn Bringert

Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace
Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Thursday, 3 November 2011 14:04:07 UTC