Re: Proposal to add start, stop, and update events to TTS

I don't think of word callbacks as markers. A marker is something the
client adds to the input, saying they want to be notified when it's
reached. But word callbacks are more general status updates - the client
just wants to know the speaking progress in as much detail as possible, not
only when a particular marker is reached.

How about:

IDL:

  const unsigned long MARKER_EVENT = 1;
  const unsigned long WORD_EVENT = 2;

  callback SpeechSynthesisEventCallback = void(const unsigned long eventType,
      DOMString eventName, unsigned long charIndex, double elapsedTime);

Definitions:

  SpeechSynthesisEventCallback parameters

  eventType parameter
  An enumeration indicating the type of event, either MARKER_EVENT
or WORD_EVENT.

  eventName parameter
  For events with eventType of MARKER_EVENT, contains the name of the
marker, as defined in SSML as the name attribute of a mark element.  For
events with eventType of WORD_EVENT, this value should be undefined.

  charIndex parameter
  The zero-based character index into the original utterance string that
most closely approximates the current speaking position of the speech
engine. No guarantee is given as to where charIndex will be with respect to
word boundaries (such as at the end of the previous word or the beginning
of the next word), only that all text before charIndex has already been
spoken, and all text after charIndex has not yet been spoken.  The User
Agent must return this value if the speech synthesis engine supports it,
otherwise the User Agent must return undefined.

  elapsedTime parameter
  The time, in seconds, at which this event was triggered, relative to
when this utterance began to be spoken. The User Agent must return this
value if the speech synthesis engine supports it, otherwise the User Agent
must return undefined.
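
To make the intended usage concrete, here is a rough TypeScript-flavored
sketch of a client-side handler. The proposal above does not define how the
callback is attached to an utterance, so the "onupdate" hook (named after the
update event defined below), the utterance stand-in object, and the sample
invocation at the end are assumptions for illustration only:

  const MARKER_EVENT = 1;
  const WORD_EVENT = 2;

  // Shape taken from the proposed SpeechSynthesisEventCallback signature.
  type SpeechSynthesisEventCallback = (
    eventType: number,
    eventName: string | undefined,   // marker name for MARKER_EVENT, else undefined
    charIndex: number | undefined,   // all text before charIndex has been spoken
    elapsedTime: number | undefined  // seconds since this utterance began speaking
  ) => void;

  // Hypothetical stand-in for SpeechSynthesisUtterance; "onupdate" is an
  // assumed registration point, not part of the proposal text.
  const utterance: { text: string; onupdate?: SpeechSynthesisEventCallback } = {
    text: "Hello brave new world"
  };

  utterance.onupdate = (eventType, eventName, charIndex, elapsedTime) => {
    if (eventType === MARKER_EVENT) {
      // A named <mark> in the SSML input was reached.
      console.log(`marker "${eventName}" at ${elapsedTime}s`);
    } else if (eventType === WORD_EVENT && charIndex !== undefined) {
      // The only guarantee: text before charIndex is spoken, text after is not.
      console.log(`spoken so far: ${utterance.text.slice(0, charIndex)}`);
    }
  };

  // A User Agent implementing the proposal might then invoke, for example:
  utterance.onupdate?.(WORD_EVENT, undefined, 6, 0.41);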


  SpeechSynthesisUtterance Attributes

  text attribute
  The text to be synthesized and spoken for this utterance. This may
be either plain text or a complete, well-formed SSML document. For speech
synthesis engines that do not support SSML, or only support certain tags,
the User Agent or speech engine must strip away the tags they do not
support and speak the text. The text may be subject to a maximum length of
32,767 characters.
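
For example (illustrative only; the mark name "para1" below is an arbitrary
example), the text attribute could carry a well-formed SSML document
containing a named mark element. Reaching that mark would produce a callback
with eventType of MARKER_EVENT and eventName of "para1":

  // Illustrative SSML input assigned to the text attribute of an utterance.
  const ssmlText = `<?xml version="1.0"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    First sentence.
    <mark name="para1"/>
    Second sentence.
  </speak>`;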

  SpeechSynthesisUtterance Events

  update event
  Fired when the spoken utterance reaches a word boundary or a named
marker. The User Agent should fire this event if the speech synthesis
engine provides it.
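
As a concrete illustration of the sequence of update events (the character
indices below are exact for this string; the timings are invented for the
example), speaking the plain-text utterance "Hello brave new world" might
produce:

  eventType    eventName   charIndex   elapsedTime
  WORD_EVENT   undefined    0          0.00        (before "Hello")
  WORD_EVENT   undefined    6          0.41        (before "brave")
  WORD_EVENT   undefined   12          0.78        (before "new")
  WORD_EVENT   undefined   16          1.02        (before "world")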

On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:

> Yes, I agree. Based on this, here's my proposal for the spec. If there's
> no disagreement, I'll add this to the spec on Thursday.
>
> IDL:
>
>   const unsigned long NAMED_MARKER = 1;
>   const unsigned long WORD_MARKER = 2;
>
>   callback SpeechSynthesisMarkerCallback = void(const unsigned long
> markerType, DOMString markerName, unsigned long charIndex, double
> elapsedTime);
>
> Definitions:
>
>   SpeechSynthesisMarkerCallback parameters
>
>   markerType parameter
>   An enumeration indicating the type of marker that caused this event,
> either NAMED_MARKER or WORD_MARKER.
>
>   markerName parameter
>   For events with markerType of NAMED_MARKER, contains the name of the
> marker, as defined in SSML as the name attribute of a mark element.  For
> events with markerType of WORD_MARKER, this value should be undefined.
>
>   charIndex parameter
>   The zero-based character index into the original utterance string that
> most closely approximates the current speaking position of the speech
> engine. No guarantee is given as to where charIndex will be with respect to
> word boundaries (such as at the end of the previous word or the beginning
> of the next word), only that all text before charIndex has already been
> spoken, and all text after charIndex has not yet been spoken.  The User
> Agent must return this value if the speech synthesis engine supports it,
> otherwise the User Agent must return undefined.
>
>   elapsedTime parameter
>   The time, in seconds, that this marker triggered, relative to when
> this utterance has begun to be spoken. The User Agent must return this
> value if the speech synthesis engine supports it, otherwise the User Agent
> must return undefined.
>
>
>   SpeechSynthesisUtterance Attributes
>
>   text attribute
>   The text to be synthesized and spoken for this utterance. This may
> be either plain text or a complete, well-formed SSML document. For speech
> synthesis engines that do not support SSML, or only support certain tags,
> the User Agent or speech engine must strip away the tags they do not
> support and speak the text. There may be a maximum length of the text of
> 32,767 characters.
>
>   SpeechSynthesisUtterance Events
>
>   marker event
>   Fired when the spoken utterance reaches a word boundary or a named
> marker. User Agent should fire event if the speech synthesis engine
> provides the event.
>
> /Glen Shires
>
>
> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com>
>  wrote:
>
>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
>> > - Should markerName be changed from "DOMString" to "type ( enumerated
>> string
>> > ["word", "sentence", "marker"] )".
>>
>> I'd be fine with an enum, as long as it's clear that we have the
>> option to expand on this in the future - for example, an engine might
>> be able to do a callback for each phoneme or syllable.
>>
>> > If so, how are named markers returned?
>> > (We could add a "DOMString namedMarker" parameter or add an
>> "onnamedmarker()
>> > event".
>>
>> I'd prefer the DOMString namedMarker over a separate event. I think
>> one event for all progress updates is simpler for both server and
>> client.
>>
>> > - What should be the format for elapsedTime?
>>
>> Perhaps this should be analogous to currentTime in a HTMLMediaElement?
>> That'd be the time since speech on this utterance began, in seconds.
>> Double-precision float.
>>
>> > I propose the following definition:
>> >
>> > SpeechSynthesisMarkerCallback parameters
>> >
>> > charIndex parameter
>> > The zero-based character index into the original utterance string of the
>> > word, sentence or marker about to be spoken.
>>
>> I'd word this slightly differently. In my experience, some engines
>> support callbacks *before* a word, others support callbacks *after* a
>> word. Practically they're almost the same thing, but not quite - the
>> time is slightly different due to pauses, and the charIndex is either
>> before the first letter or after the last letter in the word.
>>
>> My suggestion: The zero-based character index into the original
>> utterance string that most closely approximates the current speaking
>> position of the speech engine. No guarantee is given as to where
>> charIndex will be with respect to word boundaries (such as at the end
>> of the previous word or the beginning of the next word), only that all
>> text before charIndex has already been spoken, and all text after
>> charIndex has not yet been spoken.
>>
>> What do you think? Feel free to edit / refine, I'd just like to word
>> it in such a way that we can support a wide variety of existing speech
>> engines and that clients don't make assumptions about the callbacks
>> that won't always be true.
>>
>> - Dominic
>>
>
>

Received on Wednesday, 3 October 2012 07:59:38 UTC