Re: Comments on latest speech api proposal from Bjorn Bringert on 2011-11-03 (public-xg-htmlspeech@w3.org from November 2011)

From: Bjorn Bringert <bringert@google.com>
Date: Thu, 3 Nov 2011 14:03:38 +0000
To: Dominic Mazzoni <dmazzoni@google.com>
Cc: public-xg-htmlspeech@w3.org
Message-ID: <CAJtyJaXebcHWDHVSkwtkebvXQBpWCbvxcaJP2jjRgn=wWqmf7w@mail.gmail.com>

On Thu, Nov 3, 2011 at 6:22 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
> Hello,
> My apologies for not joining this conversation sooner. I'm a Google Chrome
> developer working on accessibility, and I recently helped to author
> the Chrome TTS extension API that we launched a couple of months ago. If you
> haven't already seen it, check out the docs here
> - http://code.google.com/chrome/extensions/tts.html - this has been live in
> Chrome since version 14 and there are a number of talking extensions and
> voices in the Chrome web store now. I'd very much like for this extension
> API to be compatible with the proposed HTML TTS API, and in fact I'm hoping
> to help implement it in Chrome and for the two APIs to share a lot of code.
> Here are some comments and questions on the draft.

> For TTS, I don't understand where the content to be spoken is supposed to go
> if it's not specified in the inner HTML. Are the only options to use
> <tts>Hello, world</tts> which inserts undesired text in older browsers, or
> <tts src="text.ssml"/>, which forces me to put the text in a separate
> document or use a cumbersome data URL? The draft I looked at a year ago
> had a value attribute, so I could write <tts value="Hello, world"/> - why
> was that deprecated?

I sent the same comment to this list yesterday. We should have value
and lang attributes to allow simple synthesis use cases.


> The spec for both reco and TTS now allows the user to specify a service URL.
> Could you clarify what the value would be if the developer wishes to use a
> local (client-side) engine, if available? Some of the spec seems to assume a
> network speech implementation, but client-side reco and TTS are very much
> possible and quite desirable for applications that require extremely low
> latency, like accessibility in particular. Is there any possibility the spec
> could specify how a user agent could choose to offer local TTS and reco, or
> to give the user or developer a choice of multiple TTS or reco engines,
> which might be local or remote?

Since the web app can't rely on any particular client-side engine being
installed, there is no explicit way to ask for a client-side engine.
However, if the app doesn't specify a service at all, it's up to the
user-agent to select one. This could be a client-side engine, if one
is available. If the user agent wants, it can have a setting that lets
the user pick an engine to use as the default.
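
The selection order described above could be sketched roughly as follows, in
TypeScript. This is only an illustration under stated assumptions: the names
SpeechEngine, UserAgentConfig, selectEngine and userDefault are hypothetical
and do not appear in the draft, and the "prefer a local engine" fallback in
step 3 is one possible user-agent policy, not something the draft requires.

```typescript
// Hypothetical model of how a user agent might pick a speech engine.
// None of these names come from the draft spec.
interface SpeechEngine {
  name: string;
  location: "local" | "remote";
}

interface UserAgentConfig {
  available: SpeechEngine[]; // engines the user agent knows about
  userDefault?: string;      // engine name chosen in a user setting, if any
}

function selectEngine(
  serviceUrl: string | undefined,
  ua: UserAgentConfig
): SpeechEngine {
  // 1. The app named a service explicitly: honor it.
  if (serviceUrl !== undefined) {
    return { name: serviceUrl, location: "remote" };
  }
  // 2. No service given: use the user's configured default, if installed.
  const preferred = ua.available.find((e) => e.name === ua.userDefault);
  if (preferred !== undefined) {
    return preferred;
  }
  // 3. Assumed policy: prefer a client-side engine, else take the first one.
  const local = ua.available.find((e) => e.location === "local");
  const fallback = local ?? ua.available[0];
  if (fallback === undefined) {
    throw new Error("no speech engine available");
  }
  return fallback;
}
```

The point of the sketch is that the app never asks for "local" directly; it
only omits the service, and the user agent (and the user's setting) decides.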


> Note that the Chrome TTS extension API has a way for the client to query the
> list of possible voices and present the choice to the user or choose one
> based on its characteristics. We've implemented support for OS-native TTS,
> native client TTS, pure-javascript TTS (yes, it really works!), and
> server-based TTS.
> I think it's particularly important that whenever possible the user, not the
> developer, should get to choose the TTS engine and voice. For accessibility,
> visually-impaired users often prefer voices that can be sped up to
> incredible speeds of 2 - 3x normal, and low latency is also extremely
> important. Other users might only want to hear speech if the voice is
> incredibly realistic, and latency may not matter to them. Still others might
> prefer a voice that speaks with a particular accent - all male English
> voices are not interchangeable! Android is a great example of what can happen
> when users can choose the TTS engine independently - there are dozens of
> third-party voices available supporting lots of languages, at a variety of
> prices. All of the voices are compatible with any Android app that uses the
> system TTS API, including screen readers, driving direction apps, book
> readers, and more. Right now the proposed spec implies that it's up to the
> developer to choose an appropriate engine, but ideally that'd be the
> exception rather than the rule - ideally the developer would just leave this
> absent and the user agent would select the most appropriate speech engine
> based on the language, user preferences, status of the network, etc.

What you describe is how the API is designed to work. App-selected
services are for developers who have special needs. Simple speech apps
use the default user-agent engine, which the user agent can allow the
user to select.


> An earlier draft had the ability to set lastMark, but now it looks like it's
> read-only, is that correct? That actually may be easier to implement,
> because many speech engines don't support seeking to the middle of a speech
> stream without first synthesizing the whole thing.

Yes, I think that lastMark is intentionally read-only.


> When I posted the initial version of the TTS extension API on the
> chromium-extensions list, the primary feature request I got from developers
> was the ability to get sentence, word, and even phoneme-level callbacks, so
> that got added to the API before we launched it. Having callbacks at ssml
> markers is great, but many applications require synchronizing closely with
> the speech, and it seems really cumbersome and wasteful to have to add an
> ssml mark between every word in the source document, when what the client
> really wants is just constant notification at the finest level of detail
> available. Any chance you could add a way to request more frequent
> callbacks?

Sounds reasonable. Some other people have brought that up in the past IIRC.
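
To make the request concrete, here is a rough TypeScript sketch of the kind
of notifications an engine could produce at word and sentence granularity
without any SSML marks in the source text. Everything in it is an assumption
for illustration: BoundaryEvent and boundaries are hypothetical names, the
whitespace and punctuation rules are deliberately naive, and a real engine
would report these events while audio is playing rather than up front.

```typescript
// Hypothetical event shape; not taken from the draft spec.
interface BoundaryEvent {
  type: "word" | "sentence";
  charIndex: number; // offset of the unit's first character in the input text
}

// Compute the boundaries an engine could report while speaking `text`.
// Words are runs of non-whitespace; a word ending in '.', '!' or '?'
// is treated as the end of a sentence (a naive rule, for illustration only).
function boundaries(text: string): BoundaryEvent[] {
  const events: BoundaryEvent[] = [];
  const wordPattern = /\S+/g;
  let sentenceStart = true;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    if (sentenceStart) {
      events.push({ type: "sentence", charIndex: match.index });
    }
    events.push({ type: "word", charIndex: match.index });
    sentenceStart = /[.!?]$/.test(match[0]);
  }
  return events;
}

// An app that only wants one granularity can filter the stream.
function boundariesOfType(
  text: string,
  type: BoundaryEvent["type"]
): BoundaryEvent[] {
  return boundaries(text).filter((e) => e.type === type);
}
```

With something like this the app can highlight text in sync with speech by
character offset, instead of inserting a mark between every pair of words.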



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Thursday, 3 November 2011 14:04:07 UTC