Comments on latest speech api proposal from Dominic Mazzoni on 2011-11-03 (public-xg-htmlspeech@w3.org from November 2011)

From: Dominic Mazzoni <dmazzoni@google.com>
Date: Wed, 2 Nov 2011 23:22:39 -0700
To: public-xg-htmlspeech@w3.org
Message-ID: <CAFz-FYw7Ph2+w=MY2jFknvMn7fDz+GEdtXPCBPG9zJVQPE3nmQ@mail.gmail.com>
Hello,

My apologies for not joining this conversation sooner. I'm a Google Chrome
developer working on accessibility, and I recently helped to author
the Chrome TTS extension API that we launched a couple of months ago. If
you haven't already seen it, check out the docs here -
http://code.google.com/chrome/extensions/tts.html - this has been live in
Chrome since version 14 and there are a number of talking extensions and
voices in the Chrome web store now. I'd very much like for this extension
API to be compatible with the proposed HTML TTS API, and in fact I'm hoping
to help implement it in Chrome and for the two APIs to share a lot of code.

Here are some comments and questions on the draft.

For TTS, I don't understand where the content to be spoken is supposed to
go if it's not specified in the inner HTML. Are the only options to use
<tts>Hello, world</tts> which inserts undesired text in older browsers, or
<tts src="text.ssml"/>, which forces me to put the text in a separate
document or use a cumbersome data URL? The draft I looked at a year ago
had a value attribute, so I could write
<tts value="Hello, world"/> - why was that deprecated?

The spec for both reco and TTS now allows the user to specify a service URL.
Could you clarify what the value would be if the developer wishes to use a
local (client-side) engine, if available? Some of the spec seems to assume
a network speech implementation, but client-side reco and TTS are very much
possible and quite desirable for applications that require extremely low
latency, like accessibility in particular. Is there any possibility the
spec could specify how a user agent could choose to offer local TTS and
reco, or to give the user or developer a choice of multiple TTS or reco
engines, which might be local or remote?

Note that the Chrome TTS extension API has a way for the client to query
the list of possible voices and present the choice to the user or choose
one based on its characteristics. We've implemented support for OS-native
TTS, Native Client TTS, pure-JavaScript TTS (yes, it really works!), and
server-based TTS.
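
To make the idea of choosing a voice "based on its characteristics" concrete, here is a minimal sketch in TypeScript. The `VoiceInfo` shape is only loosely modeled on the voice fields described in the Chrome TTS extension docs, and `VoicePrefs` and `pickVoice` are hypothetical names introduced for illustration; none of this is part of the proposed HTML spec.

```typescript
// Minimal voice description. The field names are an assumption, loosely
// modeled on what the Chrome TTS extension docs list for a voice.
interface VoiceInfo {
  voiceName: string;
  lang?: string;         // e.g. "en-US"
  eventTypes?: string[]; // e.g. ["start", "word", "end"]
}

// Hypothetical user preferences; not part of any API.
interface VoicePrefs {
  lang: string;            // preferred language, e.g. "en-GB"
  needWordEvents: boolean; // require word-level callbacks?
}

// Return the first voice whose base language matches the preference and,
// if requested, that reports "word" events. Returns undefined if none match.
function pickVoice(voices: VoiceInfo[], prefs: VoicePrefs): VoiceInfo | undefined {
  const base = prefs.lang.split("-")[0].toLowerCase();
  const matches = voices.filter((v) => {
    const langOk =
      v.lang !== undefined && v.lang.split("-")[0].toLowerCase() === base;
    const eventsOk =
      !prefs.needWordEvents || (v.eventTypes || []).indexOf("word") !== -1;
    return langOk && eventsOk;
  });
  return matches[0];
}
```

In an extension, the `voices` array would presumably come from the callback passed to `chrome.tts.getVoices`, and the preferences would ideally come from the user rather than the page author.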

I think it's particularly important that whenever possible the user, not
the developer, should get to choose the TTS engine and voice. For
accessibility, visually-impaired users often prefer voices that can be sped
up to incredible speeds of 2-3x normal, and low latency is also extremely
important. Other users might only want to hear speech if the voice is
incredibly realistic, and latency may not matter to them. Still others
might prefer a voice that speaks with a particular accent - all male
English voices are not interchangeable! Android is a great example of what
can happen when users can choose the TTS engine independently - there are
dozens of third-party voices available supporting lots of languages, at a
variety of prices. All of the voices are compatible with any Android app
that uses the system TTS API, including screen readers, driving direction
apps, book readers, and more. Right now the proposed spec implies that it's
up to the developer to choose an appropriate engine, but ideally that'd be
the exception rather than the rule - ideally the developer would just leave
this absent and the user agent would select the most appropriate speech
engine based on the language, user preferences, status of the network, etc.

An earlier draft had the ability to set lastMark, but now it looks like
it's read-only; is that correct? That actually may be easier to implement,
because many speech engines don't support seeking to the middle of a speech
stream without first synthesizing the whole thing.

When I posted the initial version of the TTS extension API on the
chromium-extensions list, the primary feature request I got from developers
was the ability to get sentence, word, and even phoneme-level callbacks, so
that got added to the API before we launched it. Having callbacks at SSML
markers is great, but many applications require synchronizing closely with
the speech, and it seems really cumbersome and wasteful to have to add an
SSML mark between every word in the source document, when what the client
really wants is just constant notification at the finest level of detail
available. Any chance you could add a way to request more frequent
callbacks?
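
As a rough illustration of what word-level callbacks make possible without any SSML marks, here is a small TypeScript sketch. It assumes events shaped like the ones the Chrome TTS extension API delivers (a `type` such as "word" plus a `charIndex` into the utterance); `SpeechEvent`, `wordAt`, and `spokenWords` are hypothetical names used only for this example.

```typescript
// Assumed event shape: a type string and an optional character offset
// into the utterance text.
interface SpeechEvent {
  type: string;
  charIndex?: number;
}

// Return the run of non-whitespace characters starting at charIndex,
// or "" if charIndex points at whitespace or past the end of the text.
function wordAt(text: string, charIndex: number): string {
  const match = text.slice(charIndex).match(/^\S+/);
  return match ? match[0] : "";
}

// Collect the word associated with each "word" event, in order.
// A real client would highlight or scroll to each word instead.
function spokenWords(text: string, events: SpeechEvent[]): string[] {
  const words: string[] = [];
  for (const e of events) {
    if (e.type === "word" && e.charIndex !== undefined) {
      words.push(wordAt(text, e.charIndex));
    }
  }
  return words;
}
```

With the extension API, the events would presumably arrive one at a time through the `onEvent` option of `chrome.tts.speak`, so a client could call `wordAt` from that handler; the source text stays free of markup.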

Thanks very much for considering my thoughts.

- Dominic
Received on Thursday, 3 November 2011 06:23:16 UTC