Re: Comments on latest speech api proposal from Olli Pettay on 2011-11-03 (public-xg-htmlspeech@w3.org from November 2011)

From: Olli Pettay <Olli.Pettay@helsinki.fi>
Date: Thu, 03 Nov 2011 16:11:33 +0200
To: Bjorn Bringert <bringert@google.com>
CC: Dominic Mazzoni <dmazzoni@google.com>, public-xg-htmlspeech@w3.org
Message-ID: <4EB2A115.8040703@helsinki.fi>
On 11/03/2011 04:03 PM, Bjorn Bringert wrote:
> On Thu, Nov 3, 2011 at 6:22 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>> Hello,
>> My apologies for not joining this conversation sooner. I'm a Google Chrome
>> developer working on accessibility, and I recently helped to author
>> the Chrome TTS extension API that we launched a couple of months ago. If you
>> haven't already seen it, check out the docs here
>> - http://code.google.com/chrome/extensions/tts.html - this has been live in
>> Chrome since version 14 and there are a number of talking extensions and
>> voices in the Chrome web store now. I'd very much like for this extension
>> API to be compatible with the proposed HTML TTS API, and in fact I'm hoping
>> to help implement it in Chrome and for the two APIs to share a lot of code.
>> Here are some comments and questions on the draft.
>
>> For TTS, I don't understand where the content to be spoken is supposed to go
>> if it's not specified in the inner HTML. Are the only options to use
>> <tts>Hello, world</tts>, which inserts undesired text in older browsers, or
>> <tts src="text.ssml"/>, which forces me to put the text in a separate
>> document or use a cumbersome data URL? The draft from a year ago
>> that I had looked at previously had a value attribute, so I could write
>> <tts value="Hello, world"/> - why was that deprecated?
>
> I sent the same comment to this list yesterday. We should have value
> and lang attributes to allow simple synthesis use cases.

(Not a very surprising comment from me:)
lang should obviously be just a hint for the UA/speech services. A web
page should not be able to query the supported languages without user
permission.
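
To make the "hint" idea concrete, here is a minimal sketch of how a UA
might resolve a lang value internally. The Voice type, pickVoice, and
the fallback argument are hypothetical names for illustration only and
are not part of any draft. The point is that an unsupported lang falls
back silently, so the page learns nothing about what is installed.

```typescript
// Hypothetical shape of an installed voice; not taken from the draft.
interface Voice {
  name: string;
  lang: string; // language tag such as "en-US"
}

// Runs inside the user agent. The page supplies only `langHint` and
// never sees `installed` or which voice was chosen.
function pickVoice(
  langHint: string | undefined,
  installed: Voice[],
  fallback: Voice
): Voice {
  if (!langHint) return fallback;
  const want = langHint.toLowerCase();

  // Prefer an exact tag match, e.g. "en-gb".
  const exact = installed.find(v => v.lang.toLowerCase() === want);
  if (exact) return exact;

  // Otherwise accept any voice with the same primary language, e.g. "en".
  const primary = want.split("-")[0];
  const partial = installed.find(
    v => v.lang.toLowerCase().split("-")[0] === primary
  );

  // Unsupported hint: fall back without signalling an error to the page.
  return partial ?? fallback;
}
```

With this shape, exposing the list of voices would be a separate,
permission-gated step rather than a side effect of setting lang.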


>
>
>> The spec for both reco and TTS now allows the user to specify a service URL.
>> Could you clarify what the value would be if the developer wishes to use a
>> local (client-side) engine, if available? Some of the spec seems to assume a
>> network speech implementation, but client-side reco and TTS are very much
>> possible and quite desirable for applications that require extremely low
>> latency, like accessibility in particular. Is there any possibility the spec
>> could specify how a user agent could choose to offer local TTS and reco, or
>> to give the user or developer a choice of multiple TTS or reco engines,
>> which might be local or remote?
>
> Since the web app can't rely on any particular client-side engine to be
> installed, there is no explicit way to ask for a client-side engine.
> However, if the app doesn't specify a service at all, it's up to the
> user-agent to select one. This could be a client-side engine, if one
> is available. If the user agent wants, it can have a setting that lets
> the user pick an engine to use as the default.
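
The selection order described in the quoted paragraph above could be
sketched roughly like this. Engine, Choice, and selectEngine are
hypothetical names, and preferring a local engine when the user has no
stored default is my assumption, not something the draft requires.

```typescript
// Hypothetical description of an engine known to the user agent.
interface Engine {
  id: string;
  local: boolean; // true for a client-side engine
}

type Choice =
  | { kind: "app-service"; url: string }
  | { kind: "ua-engine"; engine: Engine }
  | { kind: "none" };

function selectEngine(
  appServiceUrl: string | null,
  userDefault: Engine | null,
  available: Engine[]
): Choice {
  // 1. An app with special needs names its own service.
  if (appServiceUrl) {
    return { kind: "app-service", url: appServiceUrl };
  }

  // 2. Otherwise honour the user's default, if it is still available.
  if (userDefault !== null) {
    const id = userDefault.id;
    if (available.some(e => e.id === id)) {
      return { kind: "ua-engine", engine: userDefault };
    }
  }

  // 3. Assumption: prefer a local engine, then any remaining engine.
  const engine = available.find(e => e.local) ?? available[0];
  return engine ? { kind: "ua-engine", engine } : { kind: "none" };
}
```

Note that the page never asks for "local" explicitly. It either names
a service or leaves the choice to the UA and the user.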
>
>
>> Note that the Chrome TTS extension API has a way for the client to query the
>> list of possible voices and present the choice to the user or choose one
>> based on its characteristics. We've implemented support for OS-native TTS,
>> native client TTS, pure-javascript TTS (yes, it really works!), and
>> server-based TTS.
>> I think it's particularly important that whenever possible the user, not the
>> developer, should get to choose the TTS engine and voice. For accessibility,
>> visually-impaired users often prefer voices that can be sped up to
>> incredible speeds of 2 - 3x normal, and low latency is also extremely
>> important. Other users might only want to hear speech if the voice is
>> incredibly realistic, and latency may not matter to them. Still others might
>> prefer a voice that speaks with a particular accent - all male English
>> voices are not interchangeable! Android is a great example of what can happen
>> when users can choose the TTS engine independently - there are dozens of
>> third-party voices available supporting lots of languages, at a variety of
>> prices. All of the voices are compatible with any Android app that uses the
>> system TTS API, including screen readers, driving direction apps, book
>> readers, and more. Right now the proposed spec implies that it's up to the
>> developer to choose an appropriate engine, but ideally that'd be the
>> exception rather than the rule - ideally the developer would just leave this
>> absent and the user agent would select the most appropriate speech engine
>> based on the language, user preferences, status of the network, etc.
>
> What you describe is how the API is designed to work. App-selected
> services are for developers who have special needs. Simple speech apps
> use the default user-agent engine, which the user agent can allow the
> user to select.
>
>
>> An earlier draft had the ability to set lastMark, but now it looks like it's
>> read-only, is that correct? That actually may be easier to implement,
>> because many speech engines don't support seeking to the middle of a speech
>> stream without first synthesizing the whole thing.
>
> Yes, I think that lastMark is intentionally read-only.
>
>
>> When I posted the initial version of the TTS extension API on the
>> chromium-extensions list, the primary feature request I got from developers
>> was the ability to get sentence, word, and even phoneme-level callbacks, so
>> that got added to the API before we launched it. Having callbacks at SSML
>> markers is great, but many applications require synchronizing closely with
>> the speech, and it seems really cumbersome and wasteful to have to add an
>> SSML mark between every word in the source document, when what the client
>> really wants is just constant notification at the finest level of detail
>> available. Any chance you could add a way to request more frequent
>> callbacks?
>
> Sounds reasonable. Some other people have brought that up in the past IIRC.
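
As a rough illustration of the word-level callbacks being requested,
the sketch below derives word boundary events from plain text and
hands them to a listener. BoundaryEvent, wordBoundaries, and
speakWithCallbacks are hypothetical names. A real engine would fire
these events in step with the audio; here they are fired immediately,
just to show the shape of the data a page would get without inserting
a mark between every word.

```typescript
// Hypothetical event shape; charIndex is an offset into the input text.
interface BoundaryEvent {
  type: "word";
  charIndex: number;
  word: string;
}

// Split on whitespace and record where each word starts.
function wordBoundaries(text: string): BoundaryEvent[] {
  const events: BoundaryEvent[] = [];
  const re = /\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    events.push({ type: "word", charIndex: m.index, word: m[0] });
  }
  return events;
}

// Stand-in for synthesis: deliver every boundary event to the listener.
function speakWithCallbacks(
  text: string,
  onEvent: (e: BoundaryEvent) => void
): void {
  for (const e of wordBoundaries(text)) {
    onEvent(e);
  }
}
```

A page could use charIndex to highlight the word currently being
spoken, which is the synchronization use case mentioned above.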
>
>
>
Received on Thursday, 3 November 2011 14:12:15 UTC