Re: Proposal to add start, stop, and update events to TTS from Dominic Mazzoni on 2012-10-02 (public-speech-api@w3.org from October 2012)

From: Dominic Mazzoni <dmazzoni@google.com>
Date: Mon, 1 Oct 2012 23:07:04 -0700
To: Glen Shires <gshires@google.com>
Cc: public-speech-api@w3.org
Message-ID: <CAFz-FYwWcpKuJJ8a5fPgRufZASZhfyHnd=zh-4FE595_JTb7Dg@mail.gmail.com>

On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
> - Should markerName be changed from "DOMString" to "type ( enumerated string
> ["word", "sentence", "marker"] )"?

I'd be fine with an enum, as long as it's clear that we have the
option to expand on this in the future - for example, an engine might
be able to do a callback for each phoneme or syllable.

> If so, how are named markers returned?
> (We could add a "DOMString namedMarker" parameter or add an "onnamedmarker()
> event".)

I'd prefer the DOMString namedMarker over a separate event. I think
one event for all progress updates is simpler for both server and
client.

> - What should be the format for elapsedTime?

Perhaps this should be analogous to currentTime in an HTMLMediaElement?
That would be the time since speech on this utterance began, in
seconds, as a double-precision float.
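
As a rough illustration of the three answers so far, here is a
TypeScript sketch of what a single progress event might carry. This is
not proposed spec text: the names MarkerType, SpeechProgressUpdate and
describeProgress are hypothetical, and the field layout is only one way
the enum, the namedMarker string and a seconds-based elapsedTime could
fit together.

```typescript
// Hypothetical shape for one combined progress event; not spec text.
// The union could grow later (e.g. "phoneme", "syllable") without
// requiring a separate event per kind of update.
type MarkerType = "word" | "sentence" | "marker";

interface SpeechProgressUpdate {
  type: MarkerType;
  // Only meaningful when type is "marker": the name of the marker reached.
  namedMarker?: string;
  // Zero-based index into the original utterance string.
  charIndex: number;
  // Seconds since speech on this utterance began (double precision),
  // by analogy with HTMLMediaElement.currentTime.
  elapsedTime: number;
}

// One handler covers every kind of update. Values outside the current
// union (from a newer engine) fall through to the default branch.
function describeProgress(update: SpeechProgressUpdate): string {
  const at = `${update.elapsedTime.toFixed(2)}s, char ${update.charIndex}`;
  switch (update.type) {
    case "marker":
      return `marker "${update.namedMarker ?? ""}" at ${at}`;
    case "word":
    case "sentence":
      return `${update.type} boundary at ${at}`;
    default:
      return `unrecognized update at ${at}`;
  }
}
```

With this shape a client switches on the type inside one handler
instead of registering a second event just for named markers.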

> I propose the following definition:
>
> SpeechSynthesisMarkerCallback parameters
>
> charIndex parameter
> The zero-based character index into the original utterance string of the
> word, sentence or marker about to be spoken.

I'd word this slightly differently. In my experience, some engines
support callbacks *before* a word, others support callbacks *after* a
word. Practically they're almost the same thing, but not quite - the
time is slightly different due to pauses, and the charIndex is either
before the first letter or after the last letter in the word.

My suggestion: The zero-based character index into the original
utterance string that most closely approximates the current speaking
position of the speech engine. No guarantee is given as to where
charIndex will be with respect to word boundaries (such as at the end
of the previous word or the beginning of the next word), only that all
text before charIndex has already been spoken, and all text after
charIndex has not yet been spoken.
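
To show what that wording means for a client, here is a minimal
TypeScript sketch that relies only on the "before is spoken, after is
not" guarantee. The helper names splitAtCharIndex and wordNear are
hypothetical, and treating whitespace as the word separator is a
simplifying assumption.

```typescript
// Hypothetical client-side helpers; names are illustrative only.

// Split the utterance into the part already spoken and the part still
// to come. This uses only the guarantee in the suggested wording.
function splitAtCharIndex(text: string, charIndex: number): [string, string] {
  const i = Math.max(0, Math.min(charIndex, text.length));
  return [text.slice(0, i), text.slice(i)];
}

// Pick a word to highlight without assuming charIndex is at a word start.
// If charIndex lands on whitespace (for example, just after the previous
// word), skip forward to the next word; otherwise expand outward to the
// word that contains charIndex.
function wordNear(text: string, charIndex: number): string {
  let start = Math.max(0, Math.min(charIndex, text.length));
  while (start < text.length && /\s/.test(text.charAt(start))) start++;
  while (start > 0 && !/\s/.test(text.charAt(start - 1))) start--;
  let end = start;
  while (end < text.length && !/\s/.test(text.charAt(end))) end++;
  return text.slice(start, end);
}
```

For the string "Hello brave world", an engine that reports index 5
(just after "Hello") and one that reports index 6 (just before "brave")
both lead wordNear to "brave", so the client behaves the same with
either style of callback.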

What do you think? Feel free to edit or refine; I'd just like to word
it in such a way that we can support a wide variety of existing speech
engines, and so that clients don't make assumptions about the
callbacks that won't always hold.

- Dominic

Received on Tuesday, 2 October 2012 06:07:31 UTC