- From: Glen Shires <gshires@google.com>
- Date: Wed, 3 Oct 2012 18:22:13 -0700
- To: Dominic Mazzoni <dmazzoni@google.com>
- Cc: public-speech-api@w3.org
- Message-ID: <CAEE5bchMnjreSea_LNaJL7DNQYUmU7egmUD7f8+B09Sjoz0Gvg@mail.gmail.com>
OK, so no changes to the definitions, just some renaming and minor IDL
changes: replacing the constants with an enumeration, and giving onpause and
onend the same callback parameters. Here's the new IDL:
enum UpdateType { "mark", "word" };
callback SpeechSynthesisCallback = void(unsigned long charIndex, double elapsedTime);
callback SpeechSynthesisUpdateCallback = void(unsigned long charIndex,
    double elapsedTime, UpdateType type, DOMString markerName);
interface SpeechSynthesisUtterance {
    ...
    attribute Function onstart;
    attribute SpeechSynthesisCallback onend;
    attribute SpeechSynthesisCallback onpause;
    attribute Function onresume;
    attribute SpeechSynthesisUpdateCallback onupdate;
};
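
To illustrate how a page might hook up these callbacks, here's a rough usage
sketch (not part of the proposal; the utterance constructor argument and the
speechSynthesis.speak() call are assumptions here, since only a fragment of
the interface is shown above):

var utterance = new SpeechSynthesisUtterance("Hello world");

utterance.onstart = function() {
  // onstart is a plain Function; no parameters are defined for it.
  console.log("speech started");
};

utterance.onupdate = function(charIndex, elapsedTime, type, markerName) {
  if (type == "word") {
    // markerName is undefined for word-boundary updates.
    console.log("word boundary near character " + charIndex +
                " at " + elapsedTime + " seconds");
  } else {
    // type == "mark": fires only for SSML input containing <mark> elements.
    console.log("reached mark '" + markerName + "' at " +
                elapsedTime + " seconds");
  }
};

utterance.onend = function(charIndex, elapsedTime) {
  console.log("finished speaking after " + elapsedTime + " seconds");
};

speechSynthesis.speak(utterance);
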
/Glen Shires
On Wed, Oct 3, 2012 at 12:59 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
> I don't think of word callbacks as markers. A marker is something the
> client adds to the input, saying they want to be notified when it's
> reached. But word callbacks are more general status updates - the client
> just wants to know the progress to the greatest detail possible, not only
> when a particular marker is reached.
>
> How about:
>
> IDL:
>
> const unsigned long MARKER_EVENT = 1;
> const unsigned long WORD_EVENT = 2;
>
> callback SpeechSynthesisEventCallback = void(const unsigned long eventType,
>     DOMString eventName, unsigned long charIndex, double elapsedTime);
>
> Definitions:
>
> SpeechSynthesisEventCallback parameters
>
> eventType parameter
> An enumeration indicating the type of event, either MARKER_EVENT
> or WORD_EVENT.
>
> eventName parameter
> For events with eventType of MARKER_EVENT, contains the name of the
> marker, as defined in SSML as the name attribute of a mark element. For
> events with eventType of WORD_EVENT, this value should be undefined.
>
> charIndex parameter
> The zero-based character index into the original utterance string that
> most closely approximates the current speaking position of the speech
> engine. No guarantee is given as to where charIndex will be with respect to
> word boundaries (such as at the end of the previous word or the beginning
> of the next word), only that all text before charIndex has already been
> spoken, and all text after charIndex has not yet been spoken. The User
> Agent must return this value if the speech synthesis engine supports it,
> otherwise the User Agent must return undefined.
>
> elapsedTime parameter
> The time, in seconds, at which this event was triggered, relative to when
> this utterance began to be spoken. The User Agent must return this
> value if the speech synthesis engine supports it, otherwise the User Agent
> must return undefined.
>
>
> SpeechSynthesisUtterance Attributes
>
> text attribute
> The text to be synthesized and spoken for this utterance. This may
> be either plain text or a complete, well-formed SSML document. For speech
> synthesis engines that do not support SSML, or that support only certain
> tags, the User Agent or speech engine must strip away the tags it does not
> support and speak the text. The text may be subject to a maximum length of
> 32,767 characters.
>
> SpeechSynthesisUtterance Events
>
> update event
> Fired when the spoken utterance reaches a word boundary or a named
> marker. The User Agent should fire this event if the speech synthesis engine
> provides it.
>
> On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:
>
>> Yes, I agree. Based on this, here's my proposal for the spec. If there's
>> no disagreement, I'll add this to the spec on Thursday.
>>
>> IDL:
>>
>> const unsigned long NAMED_MARKER = 1;
>> const unsigned long WORD_MARKER = 2;
>>
>> callback SpeechSynthesisMarkerCallback = void(const unsigned long markerType,
>>     DOMString markerName, unsigned long charIndex, double elapsedTime);
>>
>> Definitions:
>>
>> SpeechSynthesisMarkerCallback parameters
>>
>> markerType parameter
>> An enumeration indicating the type of marker that caused this event,
>> either NAMED_MARKER or WORD_MARKER.
>>
>> markerName parameter
>> For events with markerType of NAMED_MARKER, contains the name of the
>> marker, as defined in SSML as the name attribute of a mark element. For
>> events with markerType of WORD_MARKER, this value should be undefined.
>>
>> charIndex parameter
>> The zero-based character index into the original utterance string that
>> most closely approximates the current speaking position of the speech
>> engine. No guarantee is given as to where charIndex will be with respect to
>> word boundaries (such as at the end of the previous word or the beginning
>> of the next word), only that all text before charIndex has already been
>> spoken, and all text after charIndex has not yet been spoken. The User
>> Agent must return this value if the speech synthesis engine supports it,
>> otherwise the User Agent must return undefined.
>>
>> elapsedTime parameter
>> The time, in seconds, at which this marker was triggered, relative to when
>> this utterance began to be spoken. The User Agent must return this
>> value if the speech synthesis engine supports it, otherwise the User Agent
>> must return undefined.
>>
>>
>> SpeechSynthesisUtterance Attributes
>>
>> text attribute
>> The text to be synthesized and spoken for this utterance. This may
>> be either plain text or a complete, well-formed SSML document. For speech
>> synthesis engines that do not support SSML, or that support only certain
>> tags, the User Agent or speech engine must strip away the tags it does not
>> support and speak the text. The text may be subject to a maximum length of
>> 32,767 characters.
>>
>> SpeechSynthesisUtterance Events
>>
>> marker event
>> Fired when the spoken utterance reaches a word boundary or a named
>> marker. The User Agent should fire this event if the speech synthesis engine
>> provides it.
>>
>> /Glen Shires
>>
>>
>> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com>
>> wrote:
>>
>>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
>>> > - Should markerName be changed from "DOMString" to "type ( enumerated
>>> > string ["word", "sentence", "marker"] )"?
>>>
>>> I'd be fine with an enum, as long as it's clear that we have the
>>> option to expand on this in the future - for example, an engine might
>>> be able to do a callback for each phoneme or syllable.
>>>
>>> > If so, how are named markers returned?
>>> > (We could add a "DOMString namedMarker" parameter or add an
>>> > "onnamedmarker()" event.)
>>>
>>> I'd prefer the DOMString namedMarker over a separate event. I think
>>> one event for all progress updates is simpler for both server and
>>> client.
>>>
>>> > - What should be the format for elapsedTime?
>>>
>>> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
>>> That'd be the time since speech on this utterance began, in seconds.
>>> Double-precision float.
>>>
>>> > I propose the following definition:
>>> >
>>> > SpeechSynthesisMarkerCallback parameters
>>> >
>>> > charIndex parameter
>>> > The zero-based character index into the original utterance string of the
>>> > word, sentence or marker about to be spoken.
>>>
>>> I'd word this slightly differently. In my experience, some engines
>>> support callbacks *before* a word, others support callbacks *after* a
>>> word. Practically they're almost the same thing, but not quite - the
>>> time is slightly different due to pauses, and the charIndex is either
>>> before the first letter or after the last letter in the word.
>>>
>>> My suggestion: The zero-based character index into the original
>>> utterance string that most closely approximates the current speaking
>>> position of the speech engine. No guarantee is given as to where
>>> charIndex will be with respect to word boundaries (such as at the end
>>> of the previous word or the beginning of the next word), only that all
>>> text before charIndex has already been spoken, and all text after
>>> charIndex has not yet been spoken.
>>>
>>> What do you think? Feel free to edit / refine, I'd just like to word
>>> it in such a way that we can support a wide variety of existing speech
>>> engines and that clients don't make assumptions about the callbacks
>>> that won't always be true.
>>>
>>> - Dominic
>>>
>>
>>
Received on Thursday, 4 October 2012 01:23:21 UTC