Re: Proposal to add start, stop, and update events to TTS

Instead of the callbacks, I propose this cleaner, more extensible IDL
(similar in style to SpeechRecognition).  This also has the benefit that a
web author can write a single function to handle all of the events, simply
by assigning that function to each event of interest.

The definitions and functionality don't change.  If there's no
disagreement, I'll update the spec with this on Monday.


All the SpeechSynthesisUtterance event handlers would be simple Functions
that receive a SpeechSynthesisEvent object, and that object carries all the
attributes (and supports future extensibility):

    interface SpeechSynthesisUtterance {
        ...
        attribute Function onstart;
        attribute Function onend;
        attribute Function onpause;
        attribute Function onresume;
        attribute Function onupdate;
    };

    interface SpeechSynthesisEvent : Event {
        readonly attribute EventType eventType;
        readonly attribute double elapsedTime;
        readonly attribute unsigned long charIndex;
        readonly attribute DOMString name;
    };

    enum EventType { "start", "stop", "pause", "resume", "mark", "word",
                     "sentence" };


/Glen Shires


On Thu, Oct 4, 2012 at 1:08 PM, Glen Shires <gshires@google.com> wrote:

> I've updated the spec with these changes:
> https://dvcs.w3.org/hg/speech-api/rev/95fa61bdb089
>
> As always, the current draft spec is at:
> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>
> /Glen Shires
>
> On Wed, Oct 3, 2012 at 6:22 PM, Glen Shires <gshires@google.com> wrote:
>
>> OK, so no changes to the definitions, just some renaming and minor IDL
>> changes: replacing the constants with an enumeration, and providing the
>> same attributes for onpause and onend.  Here's the new IDL:
>>
>> enum UpdateType { "mark", "word" };
>>
>> callback SpeechSynthesisCallback = void(unsigned long charIndex, double
>> elapsedTime);
>>
>> callback SpeechSynthesisUpdateCallback = void(unsigned long charIndex,
>> double elapsedTime, UpdateType type, DOMString markerName);
>>
>> interface SpeechSynthesisUtterance {
>>   ...
>>   attribute Function onstart;
>>   attribute SpeechSynthesisCallback onend;
>>   attribute SpeechSynthesisCallback onpause;
>>   attribute Function onresume;
>>   attribute SpeechSynthesisUpdateCallback onupdate;
>> };
>>
>> /Glen Shires
>>
>> On Wed, Oct 3, 2012 at 12:59 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>>
>>> I don't think of word callbacks as markers. A marker is something the
>>> client adds to the input, saying they want to be notified when it's
>>> reached. But word callbacks are more general status updates - the client
>>> just wants to know the progress to the greatest detail possible, not only
>>> when a particular marker is reached.
>>>
>>> How about:
>>>
>>> IDL:
>>>
>>>   const unsigned long MARKER_EVENT = 1;
>>>   const unsigned long WORD_EVENT = 2;
>>>
>>>   callback SpeechSynthesisEventCallback = void(const unsigned long
>>> eventType, DOMString eventName, unsigned long charIndex, double
>>> elapsedTime);
>>>
>>> Definitions:
>>>
>>>   SpeechSynthesisEventCallback parameters
>>>
>>>   eventType parameter
>>>   An enumeration indicating the type of event, either MARKER_EVENT
>>> or WORD_EVENT.
>>>
>>>   eventName parameter
>>>   For events with eventType of MARKER_EVENT, contains the name of the
>>> marker, as defined in SSML as the name attribute of a mark element.  For
>>> events with eventType of WORD_EVENT, this value should be undefined.
>>>
>>>   charIndex parameter
>>>   The zero-based character index into the original utterance string that
>>> most closely approximates the current speaking position of the speech
>>> engine. No guarantee is given as to where charIndex will be with respect to
>>> word boundaries (such as at the end of the previous word or the beginning
>>> of the next word), only that all text before charIndex has already been
>>> spoken, and all text after charIndex has not yet been spoken.  The User
>>> Agent must return this value if the speech synthesis engine supports it,
>>> otherwise the User Agent must return undefined.
>>>
>>>   elapsedTime parameter
>>>   The time, in seconds, that this event triggered, relative to when
>>> this utterance has begun to be spoken. The User Agent must return this
>>> value if the speech synthesis engine supports it, otherwise the User Agent
>>> must return undefined.
>>>
>>>
>>>   SpeechSynthesisUtterance Attributes
>>>
>>>   text attribute
>>>   The text to be synthesized and spoken for this utterance. This may
>>> be either plain text or a complete, well-formed SSML document. For speech
>>> synthesis engines that do not support SSML, or only support certain tags,
>>> the User Agent or speech engine must strip away the tags they do not
>>> support and speak the text. The text may be limited to a maximum length
>>> of 32,767 characters.
>>>
>>>   SpeechSynthesisUtterance Events
>>>
>>>   update event
>>>   Fired when the spoken utterance reaches a word boundary or a named
>>> marker. The User Agent should fire this event if the speech synthesis
>>> engine provides it.
>>>
>>> On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:
>>>
>>> Yes, I agree. Based on this, here's my proposal for the spec. If there's
>>>> no disagreement, I'll add this to the spec on Thursday.
>>>>
>>>> IDL:
>>>>
>>>>   const unsigned long NAMED_MARKER = 1;
>>>>   const unsigned long WORD_MARKER = 2;
>>>>
>>>>   callback SpeechSynthesisMarkerCallback = void(const unsigned long
>>>> markerType, DOMString markerName, unsigned long charIndex, double
>>>> elapsedTime);
>>>>
>>>> Definitions:
>>>>
>>>>   SpeechSynthesisMarkerCallback parameters
>>>>
>>>>   markerType parameter
>>>>   An enumeration indicating the type of marker that caused this event,
>>>> either NAMED_MARKER or WORD_MARKER.
>>>>
>>>>   markerName parameter
>>>>   For events with markerType of NAMED_MARKER, contains the name of the
>>>> marker, as defined in SSML as the name attribute of a mark element.  For
>>>> events with markerType of WORD_MARKER, this value should be undefined.
>>>>
>>>>   charIndex parameter
>>>>   The zero-based character index into the original utterance string
>>>> that most closely approximates the current speaking position of the speech
>>>> engine. No guarantee is given as to where charIndex will be with respect to
>>>> word boundaries (such as at the end of the previous word or the beginning
>>>> of the next word), only that all text before charIndex has already been
>>>> spoken, and all text after charIndex has not yet been spoken.  The User
>>>> Agent must return this value if the speech synthesis engine supports it,
>>>> otherwise the User Agent must return undefined.
>>>>
>>>>   elapsedTime parameter
>>>>   The time, in seconds, that this marker triggered, relative to when
>>>> this utterance has begun to be spoken. The User Agent must return this
>>>> value if the speech synthesis engine supports it, otherwise the User Agent
>>>> must return undefined.
>>>>
>>>>
>>>>   SpeechSynthesisUtterance Attributes
>>>>
>>>>   text attribute
>>>>   The text to be synthesized and spoken for this utterance. This may
>>>> be either plain text or a complete, well-formed SSML document. For speech
>>>> synthesis engines that do not support SSML, or only support certain tags,
>>>> the User Agent or speech engine must strip away the tags they do not
>>>> support and speak the text. The text may be limited to a maximum length
>>>> of 32,767 characters.
>>>>
>>>>   SpeechSynthesisUtterance Events
>>>>
>>>>   marker event
>>>>   Fired when the spoken utterance reaches a word boundary or a named
>>>> marker. The User Agent should fire this event if the speech synthesis
>>>> engine provides it.
>>>>
>>>> /Glen Shires
>>>>
>>>>
>>>> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com>
>>>>  wrote:
>>>>
>>>>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com>
>>>>> wrote:
>>>>> > - Should markerName be changed from "DOMString" to "type (
>>>>> enumerated string
>>>>> > ["word", "sentence", "marker"] )".
>>>>>
>>>>> I'd be fine with an enum, as long as it's clear that we have the
>>>>> option to expand on this in the future - for example, an engine might
>>>>> be able to do a callback for each phoneme or syllable.
>>>>>
>>>>> > If so, how are named markers returned?
>>>>> > (We could add a "DOMString namedMarker" parameter or add an
>>>>> "onnamedmarker()
>>>>> > event".
>>>>>
>>>>> I'd prefer the DOMString namedMarker over a separate event. I think
>>>>> one event for all progress updates is simpler for both server and
>>>>> client.
>>>>>
>>>>> > - What should be the format for elapsedTime?
>>>>>
>>>>> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
>>>>> That'd be the time since speech on this utterance began, in seconds.
>>>>> Double-precision float.
>>>>>
>>>>> > I propose the following definition:
>>>>> >
>>>>> > SpeechSynthesisMarkerCallback parameters
>>>>> >
>>>>> > charIndex parameter
>>>>> > The zero-based character index into the original utterance string of
>>>>> the
>>>>> > word, sentence or marker about to be spoken.
>>>>>
>>>>> I'd word this slightly differently. In my experience, some engines
>>>>> support callbacks *before* a word, others support callbacks *after* a
>>>>> word. Practically they're almost the same thing, but not quite - the
>>>>> time is slightly different due to pauses, and the charIndex is either
>>>>> before the first letter or after the last letter in the word.
>>>>>
>>>>> My suggestion: The zero-based character index into the original
>>>>> utterance string that most closely approximates the current speaking
>>>>> position of the speech engine. No guarantee is given as to where
>>>>> charIndex will be with respect to word boundaries (such as at the end
>>>>> of the previous word or the beginning of the next word), only that all
>>>>> text before charIndex has already been spoken, and all text after
>>>>> charIndex has not yet been spoken.
>>>>>
>>>>> What do you think? Feel free to edit / refine, I'd just like to word
>>>>> it in such a way that we can support a wide variety of existing speech
>>>>> engines and that clients don't make assumptions about the callbacks
>>>>> that won't always be true.
>>>>>
>>>>> - Dominic
>>>>>
>>>>
>>>>
>>
>

Received on Friday, 5 October 2012 00:29:02 UTC