- From: Glen Shires <gshires@google.com>
- Date: Thu, 4 Oct 2012 13:08:02 -0700
- To: public-speech-api@w3.org, Dominic Mazzoni <dmazzoni@google.com>
- Message-ID: <CAEE5bcj9jPZ0gUXufYWASPYcZQ77PmjYmLudXM7ZuEV8o-Va1g@mail.gmail.com>
I've updated the spec with these changes: https://dvcs.w3.org/hg/speech-api/rev/95fa61bdb089

As always, the current draft spec is at: http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires

On Wed, Oct 3, 2012 at 6:22 PM, Glen Shires <gshires@google.com> wrote:

> OK, so no changes to the definitions, just some renaming and minor IDL
> changes: replacing the constants with an enumeration, and providing the
> same attributes for onpause and onend. Here's the new IDL:
>
> enum UpdateType { "mark", "word" };
>
> callback SpeechSynthesisCallback = void(unsigned long charIndex, double elapsedTime);
>
> callback SpeechSynthesisUpdateCallback = void(unsigned long charIndex, double elapsedTime, UpdateType type, DOMString markerName);
>
> interface SpeechSynthesisUtterance {
>   ...
>   attribute Function onstart;
>   attribute SpeechSynthesisCallback onend;
>   attribute SpeechSynthesisCallback onpause;
>   attribute Function onresume;
>   attribute SpeechSynthesisUpdateCallback onupdate;
> };
>
> /Glen Shires
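A minimal TypeScript sketch of how a page might consume the onupdate callback in the IDL above. The type declarations mirror part of that IDL; the createUtterance and speak helpers are hypothetical stand-ins for whatever constructor and speak() entry point the draft spec exposes, and are not defined in this thread.

// Sketch of a page consuming the proposed onupdate callback.
// The declarations below restate a subset of the IDL quoted above;
// they are not part of any shipped DOM library.

type UpdateType = "mark" | "word";

interface ProposedUtterance {
  text: string;
  onstart: (() => void) | null;
  onend: ((charIndex: number, elapsedTime: number) => void) | null;
  onupdate:
    | ((charIndex: number, elapsedTime: number, type: UpdateType, markerName: string) => void)
    | null;
}

// Hypothetical helpers standing in for the draft spec's constructor and speak().
declare function createUtterance(text: string): ProposedUtterance;
declare function speak(utterance: ProposedUtterance): void;

const u = createUtterance("Hello, world. This is a test.");

u.onupdate = (charIndex, elapsedTime, type, markerName) => {
  if (type === "word") {
    // All text before charIndex has already been spoken.
    console.log(`word boundary near index ${charIndex} at ${elapsedTime}s`);
  } else {
    // "mark": markerName carries the name attribute of the SSML mark element.
    console.log(`reached SSML mark "${markerName}" at ${elapsedTime}s`);
  }
};

u.onend = (charIndex, elapsedTime) => {
  console.log(`finished after ${elapsedTime}s`);
};

speak(u);

A single handler keyed on the UpdateType enum keeps word-boundary progress and named SSML marks on one code path, which matches the preference for a single progress event expressed later in the thread.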
> On Wed, Oct 3, 2012 at 12:59 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>
>> I don't think of word callbacks as markers. A marker is something the
>> client adds to the input, saying they want to be notified when it's
>> reached. But word callbacks are more general status updates - the client
>> just wants to know the progress to the greatest detail possible, not
>> only when a particular marker is reached.
>>
>> How about:
>>
>> IDL:
>>
>> const unsigned long MARKER_EVENT = 1;
>> const unsigned long WORD_EVENT = 2;
>>
>> callback SpeechSynthesisEventCallback = void(const unsigned long eventType, DOMString eventName, unsigned long charIndex, double elapsedTime);
>>
>> Definitions:
>>
>> SpeechSynthesisEventCallback parameters
>>
>> eventType parameter
>> An enumeration indicating the type of event, either MARKER_EVENT or
>> WORD_EVENT.
>>
>> markerName parameter
>> For events with eventType of MARKER_EVENT, contains the name of the
>> marker, as defined in SSML as the name attribute of a mark element. For
>> events with eventType of WORD_EVENT, this value should be undefined.
>>
>> charIndex parameter
>> The zero-based character index into the original utterance string that
>> most closely approximates the current speaking position of the speech
>> engine. No guarantee is given as to where charIndex will be with respect
>> to word boundaries (such as at the end of the previous word or the
>> beginning of the next word), only that all text before charIndex has
>> already been spoken, and all text after charIndex has not yet been
>> spoken. The User Agent must return this value if the speech synthesis
>> engine supports it, otherwise the User Agent must return undefined.
>>
>> elapsedTime parameter
>> The time, in seconds, at which this event triggered, relative to when
>> this utterance began to be spoken. The User Agent must return this value
>> if the speech synthesis engine supports it, otherwise the User Agent
>> must return undefined.
>>
>> SpeechSynthesisUtterance Attributes
>>
>> text attribute
>> The text to be synthesized and spoken for this utterance. This may be
>> either plain text or a complete, well-formed SSML document. For speech
>> synthesis engines that do not support SSML, or only support certain
>> tags, the User Agent or speech engine must strip away the tags they do
>> not support and speak the text. There may be a maximum length of the
>> text of 32,767 characters.
>>
>> SpeechSynthesisUtterance Events
>>
>> update event
>> Fired when the spoken utterance reaches a word boundary or a named
>> marker. The User Agent should fire this event if the speech synthesis
>> engine provides it.
>>
>> On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:
>>
>>> Yes, I agree. Based on this, here's my proposal for the spec. If
>>> there's no disagreement, I'll add this to the spec on Thursday.
>>>
>>> IDL:
>>>
>>> const unsigned long NAMED_MARKER = 1;
>>> const unsigned long WORD_MARKER = 2;
>>>
>>> callback SpeechSynthesisMarkerCallback = void(const unsigned long markerType, DOMString markerName, unsigned long charIndex, double elapsedTime);
>>>
>>> Definitions:
>>>
>>> SpeechSynthesisMarkerCallback parameters
>>>
>>> markerType parameter
>>> An enumeration indicating the type of marker that caused this event,
>>> either NAMED_MARKER or WORD_MARKER.
>>>
>>> markerName parameter
>>> For events with markerType of NAMED_MARKER, contains the name of the
>>> marker, as defined in SSML as the name attribute of a mark element. For
>>> events with markerType of WORD_MARKER, this value should be undefined.
>>>
>>> charIndex parameter
>>> The zero-based character index into the original utterance string that
>>> most closely approximates the current speaking position of the speech
>>> engine. No guarantee is given as to where charIndex will be with
>>> respect to word boundaries (such as at the end of the previous word or
>>> the beginning of the next word), only that all text before charIndex
>>> has already been spoken, and all text after charIndex has not yet been
>>> spoken. The User Agent must return this value if the speech synthesis
>>> engine supports it, otherwise the User Agent must return undefined.
>>>
>>> elapsedTime parameter
>>> The time, in seconds, at which this marker triggered, relative to when
>>> this utterance began to be spoken. The User Agent must return this
>>> value if the speech synthesis engine supports it, otherwise the User
>>> Agent must return undefined.
>>>
>>> SpeechSynthesisUtterance Attributes
>>>
>>> text attribute
>>> The text to be synthesized and spoken for this utterance. This may be
>>> either plain text or a complete, well-formed SSML document. For speech
>>> synthesis engines that do not support SSML, or only support certain
>>> tags, the User Agent or speech engine must strip away the tags they do
>>> not support and speak the text. There may be a maximum length of the
>>> text of 32,767 characters.
>>>
>>> SpeechSynthesisUtterance Events
>>>
>>> marker event
>>> Fired when the spoken utterance reaches a word boundary or a named
>>> marker. The User Agent should fire this event if the speech synthesis
>>> engine provides it.
>>>
>>> /Glen Shires
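To make the named-marker path concrete, here is a short TypeScript sketch pairing an SSML text attribute containing a mark element with the SpeechSynthesisMarkerCallback proposed above. The NAMED_MARKER and WORD_MARKER constants restate the proposal; the MarkerCallback type and the speakMarked helper are hypothetical illustrations, not a shipped API.

// Sketch: an utterance whose text is an SSML document with a named mark,
// handled through a callback shaped like the proposed
// SpeechSynthesisMarkerCallback. Helper names are hypothetical.

const NAMED_MARKER = 1;
const WORD_MARKER = 2;

type MarkerCallback = (
  markerType: number,
  markerName: string | undefined,
  charIndex: number | undefined,
  elapsedTime: number | undefined
) => void;

// The text attribute may be a complete, well-formed SSML document;
// each <mark name="..."> should surface as a NAMED_MARKER callback.
const ssml = `<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Loading results. <mark name="results-ready"/> Here they are.
</speak>`;

const onmarker: MarkerCallback = (markerType, markerName, charIndex, elapsedTime) => {
  if (markerType === NAMED_MARKER && markerName === "results-ready") {
    // React to the named position in the audio, e.g. reveal the results pane.
    console.log(`results marker reached at ${elapsedTime ?? "?"}s`);
  } else if (markerType === WORD_MARKER) {
    // charIndex and elapsedTime may be undefined if the engine cannot supply them.
    console.log(`word boundary near index ${charIndex ?? "?"}`);
  }
};

// Hypothetical helper standing in for constructing an utterance,
// assigning its text and marker callback, and calling speak().
declare function speakMarked(text: string, onmarker: MarkerCallback): void;
speakMarked(ssml, onmarker);

Declaring charIndex and elapsedTime as possibly undefined reflects the "must return undefined if the speech synthesis engine does not support it" language in the proposal.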
>>> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>>>
>>>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
>>>> > - Should markerName be changed from "DOMString" to "type (enumerated
>>>> > string ["word", "sentence", "marker"])"?
>>>>
>>>> I'd be fine with an enum, as long as it's clear that we have the
>>>> option to expand on this in the future - for example, an engine might
>>>> be able to do a callback for each phoneme or syllable.
>>>>
>>>> > If so, how are named markers returned?
>>>> > (We could add a "DOMString namedMarker" parameter or add an
>>>> > "onnamedmarker()" event.)
>>>>
>>>> I'd prefer the DOMString namedMarker over a separate event. I think
>>>> one event for all progress updates is simpler for both server and
>>>> client.
>>>>
>>>> > - What should be the format for elapsedTime?
>>>>
>>>> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
>>>> That'd be the time since speech on this utterance began, in seconds.
>>>> Double-precision float.
>>>>
>>>> > I propose the following definition:
>>>> >
>>>> > SpeechSynthesisMarkerCallback parameters
>>>> >
>>>> > charIndex parameter
>>>> > The zero-based character index into the original utterance string of
>>>> > the word, sentence or marker about to be spoken.
>>>>
>>>> I'd word this slightly differently. In my experience, some engines
>>>> support callbacks *before* a word, others support callbacks *after* a
>>>> word. Practically they're almost the same thing, but not quite - the
>>>> time is slightly different due to pauses, and the charIndex is either
>>>> before the first letter or after the last letter in the word.
>>>>
>>>> My suggestion: The zero-based character index into the original
>>>> utterance string that most closely approximates the current speaking
>>>> position of the speech engine. No guarantee is given as to where
>>>> charIndex will be with respect to word boundaries (such as at the end
>>>> of the previous word or the beginning of the next word), only that all
>>>> text before charIndex has already been spoken, and all text after
>>>> charIndex has not yet been spoken.
>>>>
>>>> What do you think? Feel free to edit / refine, I'd just like to word
>>>> it in such a way that we can support a wide variety of existing speech
>>>> engines and that clients don't make assumptions about the callbacks
>>>> that won't always be true.
>>>>
>>>> - Dominic
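The suggested wording above implies a defensive client pattern: treat charIndex purely as a split point, without assuming it lands exactly on a word boundary. Below is a small TypeScript sketch of that pattern; the ProgressUpdate shape and renderProgress function are illustrative and not part of any proposal in this thread.

// Sketch of a client that follows the suggested wording: it assumes only
// that text before charIndex has been spoken and text after it has not,
// without guessing whether the callback fires before or after a word.

interface ProgressUpdate {
  charIndex?: number;   // may be undefined if the engine cannot supply it
  elapsedTime?: number; // seconds since this utterance started speaking
}

function renderProgress(utteranceText: string, update: ProgressUpdate): void {
  if (update.charIndex === undefined) {
    return; // engine gave no position; nothing to highlight
  }
  const spoken = utteranceText.slice(0, update.charIndex);
  const pending = utteranceText.slice(update.charIndex);
  // A real page might dim `spoken` and emphasize `pending`; here we log.
  console.log(`[${update.elapsedTime ?? "?"}s] spoken: "${spoken}" | pending: "${pending}"`);
}

// Example: successive updates for "Hello world!"; whether a given charIndex
// lands before or after a word boundary is engine-dependent.
renderProgress("Hello world!", { charIndex: 6, elapsedTime: 0.4 });
renderProgress("Hello world!", { charIndex: 12, elapsedTime: 0.9 });

Because the callback guarantees only that everything before charIndex has been spoken, the same rendering code works whether an engine reports its position before or after each word.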
Received on Thursday, 4 October 2012 20:09:10 UTC