- From: Glen Shires <gshires@google.com>
- Date: Thu, 4 Oct 2012 13:08:02 -0700
- To: public-speech-api@w3.org, Dominic Mazzoni <dmazzoni@google.com>
- Message-ID: <CAEE5bcj9jPZ0gUXufYWASPYcZQ77PmjYmLudXM7ZuEV8o-Va1g@mail.gmail.com>
I've updated the spec with these changes: https://dvcs.w3.org/hg/speech-api/rev/95fa61bdb089

As always, the current draft spec is at: http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires

On Wed, Oct 3, 2012 at 6:22 PM, Glen Shires <gshires@google.com> wrote:

> OK, so no changes to the definitions, just some renaming and minor IDL
> changes: replacing the constants with an enumeration, and providing the
> same attributes for onpause and onend. Here's the new IDL:
>
> enum UpdateType { "mark", "word" };
>
> callback SpeechSynthesisCallback = void(unsigned long charIndex, double elapsedTime);
>
> callback SpeechSynthesisUpdateCallback = void(unsigned long charIndex, double elapsedTime, UpdateType type, DOMString markerName);
>
> interface SpeechSynthesisUtterance {
>   ...
>   attribute Function onstart;
>   attribute SpeechSynthesisCallback onend;
>   attribute SpeechSynthesisCallback onpause;
>   attribute Function onresume;
>   attribute SpeechSynthesisUpdateCallback onupdate;
> };
>
> /Glen Shires
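A minimal TypeScript sketch of how a page might consume the onupdate callback in the IDL above. The type declarations mirror part of that IDL; the createUtterance and speak helpers are hypothetical stand-ins for whatever constructor and speak() entry point the draft spec exposes, and are not defined in this thread.

// Sketch of a page consuming the proposed onupdate callback.
// The declarations below restate a subset of the IDL quoted above;
// they are not part of any shipped DOM library.

type UpdateType = "mark" | "word";

interface ProposedUtterance {
  text: string;
  onstart: (() => void) | null;
  onend: ((charIndex: number, elapsedTime: number) => void) | null;
  onupdate:
    | ((charIndex: number, elapsedTime: number, type: UpdateType, markerName: string) => void)
    | null;
}

// Hypothetical helpers standing in for the draft spec's constructor and speak().
declare function createUtterance(text: string): ProposedUtterance;
declare function speak(utterance: ProposedUtterance): void;

const u = createUtterance("Hello, world. This is a test.");

u.onupdate = (charIndex, elapsedTime, type, markerName) => {
  if (type === "word") {
    // All text before charIndex has already been spoken.
    console.log(`word boundary near index ${charIndex} at ${elapsedTime}s`);
  } else {
    // "mark": markerName carries the name attribute of the SSML mark element.
    console.log(`reached SSML mark "${markerName}" at ${elapsedTime}s`);
  }
};

u.onend = (charIndex, elapsedTime) => {
  console.log(`finished after ${elapsedTime}s`);
};

speak(u);

A single handler keyed on the UpdateType enum keeps word-boundary progress and named SSML marks on one code path, which matches the preference for a single progress event expressed later in the thread.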
> On Wed, Oct 3, 2012 at 12:59 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>
>> I don't think of word callbacks as markers. A marker is something the
>> client adds to the input, saying they want to be notified when it's
>> reached. But word callbacks are more general status updates - the client
>> just wants to know the progress to the greatest detail possible, not
>> only when a particular marker is reached.
>>
>> How about:
>>
>> IDL:
>>
>> const unsigned long MARKER_EVENT = 1;
>> const unsigned long WORD_EVENT = 2;
>>
>> callback SpeechSynthesisEventCallback = void(const unsigned long eventType, DOMString eventName, unsigned long charIndex, double elapsedTime);
>>
>> Definitions:
>>
>> SpeechSynthesisEventCallback parameters
>>
>> eventType parameter
>> An enumeration indicating the type of event, either MARKER_EVENT or
>> WORD_EVENT.
>>
>> markerName parameter
>> For events with eventType of MARKER_EVENT, contains the name of the
>> marker, as defined in SSML as the name attribute of a mark element. For
>> events with eventType of WORD_EVENT, this value should be undefined.
>>
>> charIndex parameter
>> The zero-based character index into the original utterance string that
>> most closely approximates the current speaking position of the speech
>> engine. No guarantee is given as to where charIndex will be with respect
>> to word boundaries (such as at the end of the previous word or the
>> beginning of the next word), only that all text before charIndex has
>> already been spoken, and all text after charIndex has not yet been
>> spoken. The User Agent must return this value if the speech synthesis
>> engine supports it, otherwise the User Agent must return undefined.
>>
>> elapsedTime parameter
>> The time, in seconds, at which this event triggered, relative to when
>> this utterance began to be spoken. The User Agent must return this value
>> if the speech synthesis engine supports it, otherwise the User Agent
>> must return undefined.
>>
>> SpeechSynthesisUtterance Attributes
>>
>> text attribute
>> The text to be synthesized and spoken for this utterance. This may be
>> either plain text or a complete, well-formed SSML document. For speech
>> synthesis engines that do not support SSML, or only support certain
>> tags, the User Agent or speech engine must strip away the tags they do
>> not support and speak the text. There may be a maximum length of the
>> text of 32,767 characters.
>>
>> SpeechSynthesisUtterance Events
>>
>> update event
>> Fired when the spoken utterance reaches a word boundary or a named
>> marker. The User Agent should fire this event if the speech synthesis
>> engine provides it.
>>
>> On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:
>>
>>> Yes, I agree. Based on this, here's my proposal for the spec. If
>>> there's no disagreement, I'll add this to the spec on Thursday.
>>>
>>> IDL:
>>>
>>> const unsigned long NAMED_MARKER = 1;
>>> const unsigned long WORD_MARKER = 2;
>>>
>>> callback SpeechSynthesisMarkerCallback = void(const unsigned long markerType, DOMString markerName, unsigned long charIndex, double elapsedTime);
>>>
>>> Definitions:
>>>
>>> SpeechSynthesisMarkerCallback parameters
>>>
>>> markerType parameter
>>> An enumeration indicating the type of marker that caused this event,
>>> either NAMED_MARKER or WORD_MARKER.
>>>
>>> markerName parameter
>>> For events with markerType of NAMED_MARKER, contains the name of the
>>> marker, as defined in SSML as the name attribute of a mark element. For
>>> events with markerType of WORD_MARKER, this value should be undefined.
>>>
>>> charIndex parameter
>>> The zero-based character index into the original utterance string that
>>> most closely approximates the current speaking position of the speech
>>> engine. No guarantee is given as to where charIndex will be with
>>> respect to word boundaries (such as at the end of the previous word or
>>> the beginning of the next word), only that all text before charIndex
>>> has already been spoken, and all text after charIndex has not yet been
>>> spoken. The User Agent must return this value if the speech synthesis
>>> engine supports it, otherwise the User Agent must return undefined.
>>>
>>> elapsedTime parameter
>>> The time, in seconds, at which this marker triggered, relative to when
>>> this utterance began to be spoken. The User Agent must return this
>>> value if the speech synthesis engine supports it, otherwise the User
>>> Agent must return undefined.
>>>
>>> SpeechSynthesisUtterance Attributes
>>>
>>> text attribute
>>> The text to be synthesized and spoken for this utterance. This may be
>>> either plain text or a complete, well-formed SSML document. For speech
>>> synthesis engines that do not support SSML, or only support certain
>>> tags, the User Agent or speech engine must strip away the tags they do
>>> not support and speak the text. There may be a maximum length of the
>>> text of 32,767 characters.
>>>
>>> SpeechSynthesisUtterance Events
>>>
>>> marker event
>>> Fired when the spoken utterance reaches a word boundary or a named
>>> marker. The User Agent should fire this event if the speech synthesis
>>> engine provides it.
>>>
>>> /Glen Shires
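To make the named-marker path concrete, here is a short TypeScript sketch pairing an SSML text attribute containing a mark element with the SpeechSynthesisMarkerCallback proposed above. The NAMED_MARKER and WORD_MARKER constants restate the proposal; the MarkerCallback type and the speakMarked helper are hypothetical illustrations, not a shipped API.

// Sketch: an utterance whose text is an SSML document with a named mark,
// handled through a callback shaped like the proposed
// SpeechSynthesisMarkerCallback. Helper names are hypothetical.

const NAMED_MARKER = 1;
const WORD_MARKER = 2;

type MarkerCallback = (
  markerType: number,
  markerName: string | undefined,
  charIndex: number | undefined,
  elapsedTime: number | undefined
) => void;

// The text attribute may be a complete, well-formed SSML document;
// each <mark name="..."> should surface as a NAMED_MARKER callback.
const ssml = `<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Loading results. <mark name="results-ready"/> Here they are.
</speak>`;

const onmarker: MarkerCallback = (markerType, markerName, charIndex, elapsedTime) => {
  if (markerType === NAMED_MARKER && markerName === "results-ready") {
    // React to the named position in the audio, e.g. reveal the results pane.
    console.log(`results marker reached at ${elapsedTime ?? "?"}s`);
  } else if (markerType === WORD_MARKER) {
    // charIndex and elapsedTime may be undefined if the engine cannot supply them.
    console.log(`word boundary near index ${charIndex ?? "?"}`);
  }
};

// Hypothetical helper standing in for constructing an utterance,
// assigning its text and marker callback, and calling speak().
declare function speakMarked(text: string, onmarker: MarkerCallback): void;
speakMarked(ssml, onmarker);

Declaring charIndex and elapsedTime as possibly undefined reflects the "must return undefined if the speech synthesis engine does not support it" language in the proposal.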
>>> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com> wrote:
>>>
>>>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
>>>> > - Should markerName be changed from "DOMString" to "type (enumerated
>>>> > string ["word", "sentence", "marker"])"?
>>>>
>>>> I'd be fine with an enum, as long as it's clear that we have the
>>>> option to expand on this in the future - for example, an engine might
>>>> be able to do a callback for each phoneme or syllable.
>>>>
>>>> > If so, how are named markers returned?
>>>> > (We could add a "DOMString namedMarker" parameter or add an
>>>> > "onnamedmarker()" event.)
>>>>
>>>> I'd prefer the DOMString namedMarker over a separate event. I think
>>>> one event for all progress updates is simpler for both server and
>>>> client.
>>>>
>>>> > - What should be the format for elapsedTime?
>>>>
>>>> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
>>>> That'd be the time since speech on this utterance began, in seconds.
>>>> Double-precision float.
>>>>
>>>> > I propose the following definition:
>>>> >
>>>> > SpeechSynthesisMarkerCallback parameters
>>>> >
>>>> > charIndex parameter
>>>> > The zero-based character index into the original utterance string of
>>>> > the word, sentence or marker about to be spoken.
>>>>
>>>> I'd word this slightly differently. In my experience, some engines
>>>> support callbacks *before* a word, others support callbacks *after* a
>>>> word. Practically they're almost the same thing, but not quite - the
>>>> time is slightly different due to pauses, and the charIndex is either
>>>> before the first letter or after the last letter in the word.
>>>>
>>>> My suggestion: The zero-based character index into the original
>>>> utterance string that most closely approximates the current speaking
>>>> position of the speech engine. No guarantee is given as to where
>>>> charIndex will be with respect to word boundaries (such as at the end
>>>> of the previous word or the beginning of the next word), only that all
>>>> text before charIndex has already been spoken, and all text after
>>>> charIndex has not yet been spoken.
>>>>
>>>> What do you think? Feel free to edit / refine, I'd just like to word
>>>> it in such a way that we can support a wide variety of existing speech
>>>> engines and that clients don't make assumptions about the callbacks
>>>> that won't always be true.
>>>>
>>>> - Dominic
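The suggested wording above implies a defensive client pattern: treat charIndex purely as a split point, without assuming it lands exactly on a word boundary. Below is a small TypeScript sketch of that pattern; the ProgressUpdate shape and renderProgress function are illustrative and not part of any proposal in this thread.

// Sketch of a client that follows the suggested wording: it assumes only
// that text before charIndex has been spoken and text after it has not,
// without guessing whether the callback fires before or after a word.

interface ProgressUpdate {
  charIndex?: number;   // may be undefined if the engine cannot supply it
  elapsedTime?: number; // seconds since this utterance started speaking
}

function renderProgress(utteranceText: string, update: ProgressUpdate): void {
  if (update.charIndex === undefined) {
    return; // engine gave no position; nothing to highlight
  }
  const spoken = utteranceText.slice(0, update.charIndex);
  const pending = utteranceText.slice(update.charIndex);
  // A real page might dim `spoken` and emphasize `pending`; here we log.
  console.log(`[${update.elapsedTime ?? "?"}s] spoken: "${spoken}" | pending: "${pending}"`);
}

// Example: successive updates for "Hello world!"; whether a given charIndex
// lands before or after a word boundary is engine-dependent.
renderProgress("Hello world!", { charIndex: 6, elapsedTime: 0.4 });
renderProgress("Hello world!", { charIndex: 12, elapsedTime: 0.9 });

Because the callback guarantees only that everything before charIndex has been spoken, the same rendering code works whether an engine reports its position before or after each word.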
Received on Thursday, 4 October 2012 20:09:10 UTC