- From: Glen Shires <gshires@google.com>
- Date: Tue, 2 Oct 2012 18:48:12 -0700
- To: Dominic Mazzoni <dmazzoni@google.com>
- Cc: public-speech-api@w3.org
- Message-ID: <CAEE5bcgYD8-AXDw=_QjYA8bdEWwoLQH9b81k02BiRKArB-NJMw@mail.gmail.com>
Yes, I agree. Based on this, here's my proposal for the spec. If there's no disagreement, I'll add this to the spec on Thursday.

IDL:

    const unsigned long NAMED_MARKER = 1;
    const unsigned long WORD_MARKER = 2;

    callback SpeechSynthesisMarkerCallback =
        void (unsigned long markerType,
              DOMString markerName,
              unsigned long charIndex,
              double elapsedTime);

Definitions:

SpeechSynthesisMarkerCallback parameters

markerType parameter
An enumeration indicating the type of marker that caused this event, either NAMED_MARKER or WORD_MARKER.

markerName parameter
For events with a markerType of NAMED_MARKER, contains the name of the marker, as defined in SSML as the name attribute of a mark element. For events with a markerType of WORD_MARKER, this value should be undefined.

charIndex parameter
The zero-based character index into the original utterance string that most closely approximates the current speaking position of the speech engine. No guarantee is given as to where charIndex will be with respect to word boundaries (such as at the end of the previous word or the beginning of the next word), only that all text before charIndex has already been spoken, and all text after charIndex has not yet been spoken. The User Agent must return this value if the speech synthesis engine supports it; otherwise the User Agent must return undefined.

elapsedTime parameter
The time, in seconds, at which this marker triggered, relative to when this utterance began to be spoken. The User Agent must return this value if the speech synthesis engine supports it; otherwise the User Agent must return undefined.

SpeechSynthesisUtterance Attributes

text attribute
The text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. For speech synthesis engines that do not support SSML, or that support only certain tags, the User Agent or speech engine must strip away the tags it does not support and speak the remaining text. The text may have a maximum length of 32,767 characters.

SpeechSynthesisUtterance Events

marker event
Fired when the spoken utterance reaches a word boundary or a named marker. The User Agent should fire this event if the speech synthesis engine provides it.

/Glen Shires
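For concreteness, a minimal, non-normative sketch of how a page might consume these marker events. It assumes the callback is attached as an "onmarker" handler on SpeechSynthesisUtterance; the proposal above defines only the callback signature, not the attachment point, so that name is an assumption.

    // Illustrative sketch only. The "onmarker" attachment point is an
    // assumption; the proposal defines just the callback signature.
    var NAMED_MARKER = 1, WORD_MARKER = 2;  // values from the proposed IDL

    var utterance = new SpeechSynthesisUtterance(
        '<speak>Hello <mark name="midpoint"/> world</speak>');

    utterance.onmarker = function(markerType, markerName, charIndex, elapsedTime) {
      if (markerType == NAMED_MARKER) {
        // Fired when the engine reaches the SSML <mark> element.
        console.log('Reached marker "' + markerName + '" at ' + elapsedTime + 's');
      } else if (markerType == WORD_MARKER) {
        // Fired at a word boundary; charIndex approximates the current
        // speaking position in the original utterance string.
        console.log('Word boundary near character ' + charIndex);
      }
    };

    speechSynthesis.speak(utterance);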
On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com> wrote:

> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
> > - Should markerName be changed from "DOMString" to "type (enumerated
> > string ["word", "sentence", "marker"])"?
>
> I'd be fine with an enum, as long as it's clear that we have the
> option to expand on this in the future - for example, an engine might
> be able to do a callback for each phoneme or syllable.
>
> > If so, how are named markers returned?
> > (We could add a "DOMString namedMarker" parameter or add an
> > "onnamedmarker()" event.)
>
> I'd prefer the DOMString namedMarker over a separate event. I think
> one event for all progress updates is simpler for both server and
> client.
>
> > - What should be the format for elapsedTime?
>
> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
> That'd be the time since speech on this utterance began, in seconds,
> as a double-precision float.
>
> > I propose the following definition:
> >
> > SpeechSynthesisMarkerCallback parameters
> >
> > charIndex parameter
> > The zero-based character index into the original utterance string of the
> > word, sentence or marker about to be spoken.
>
> I'd word this slightly differently. In my experience, some engines
> support callbacks *before* a word, others support callbacks *after* a
> word. Practically they're almost the same thing, but not quite - the
> time is slightly different due to pauses, and the charIndex is either
> before the first letter or after the last letter in the word.
>
> My suggestion: The zero-based character index into the original
> utterance string that most closely approximates the current speaking
> position of the speech engine. No guarantee is given as to where
> charIndex will be with respect to word boundaries (such as at the end
> of the previous word or the beginning of the next word), only that all
> text before charIndex has already been spoken, and all text after
> charIndex has not yet been spoken.
>
> What do you think? Feel free to edit / refine, I'd just like to word
> it in such a way that we can support a wide variety of existing speech
> engines and that clients don't make assumptions about the callbacks
> that won't always be true.
>
> - Dominic
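One practical consequence of this wording for clients, sketched below: since charIndex may fall on either side of a word boundary depending on the engine, a text highlighter should split the utterance only into "already spoken" and "not yet spoken" spans rather than trying to infer the current word. (updateHighlight and its element parameters are hypothetical names for illustration.)

    // Sketch of a boundary-agnostic highlighter: it splits the utterance
    // text at charIndex instead of guessing word edges, since charIndex
    // may land before or after the boundary depending on the engine.
    function updateHighlight(utteranceText, charIndex, spokenEl, pendingEl) {
      if (charIndex === undefined) {
        return;  // the engine does not report speaking position
      }
      spokenEl.textContent = utteranceText.slice(0, charIndex);  // already spoken
      pendingEl.textContent = utteranceText.slice(charIndex);    // not yet spoken
    }

Called from the WORD_MARKER branch of a handler like the one above, this stays correct whether the engine reports boundaries before or after each word.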
Received on Wednesday, 3 October 2012 01:49:21 UTC