- From: Glen Shires <gshires@google.com>
- Date: Wed, 3 Oct 2012 18:22:13 -0700
- To: Dominic Mazzoni <dmazzoni@google.com>
- Cc: public-speech-api@w3.org
- Message-ID: <CAEE5bchMnjreSea_LNaJL7DNQYUmU7egmUD7f8+B09Sjoz0Gvg@mail.gmail.com>
OK, so no changes to the definitions, just some renaming and minor IDL changes: replacing the constants with an enumeration, and giving the onpause and onend attributes the same callback parameters. Here's the new IDL:

enum UpdateType { "mark", "word" };

callback SpeechSynthesisCallback = void(unsigned long charIndex, double elapsedTime);

callback SpeechSynthesisUpdateCallback = void(unsigned long charIndex, double elapsedTime, UpdateType type, DOMString markerName);

interface SpeechSynthesisUtterance {
    ...
    attribute Function onstart;
    attribute SpeechSynthesisCallback onend;
    attribute SpeechSynthesisCallback onpause;
    attribute Function onresume;
    attribute SpeechSynthesisUpdateCallback onupdate;
};

(A short usage sketch against this IDL follows the quoted thread below.)

/Glen Shires

On Wed, Oct 3, 2012 at 12:59 AM, Dominic Mazzoni <dmazzoni@google.com> wrote:

> I don't think of word callbacks as markers. A marker is something the
> client adds to the input, saying they want to be notified when it's
> reached. But word callbacks are more general status updates - the client
> just wants to know the progress to the greatest detail possible, not only
> when a particular marker is reached.
>
> How about:
>
> IDL:
>
> const unsigned long MARKER_EVENT = 1;
> const unsigned long WORD_EVENT = 2;
>
> callback SpeechSynthesisEventCallback = void(const unsigned long
> eventType, DOMString markerName, unsigned long charIndex, double
> elapsedTime);
>
> Definitions:
>
> SpeechSynthesisEventCallback parameters
>
> eventType parameter
> An enumeration indicating the type of event, either MARKER_EVENT or
> WORD_EVENT.
>
> markerName parameter
> For events with eventType of MARKER_EVENT, contains the name of the
> marker, as defined in SSML as the name attribute of a mark element. For
> events with eventType of WORD_EVENT, this value should be undefined.
>
> charIndex parameter
> The zero-based character index into the original utterance string that
> most closely approximates the current speaking position of the speech
> engine. No guarantee is given as to where charIndex will be with respect
> to word boundaries (such as at the end of the previous word or the
> beginning of the next word), only that all text before charIndex has
> already been spoken, and all text after charIndex has not yet been spoken.
> The User Agent must return this value if the speech synthesis engine
> supports it; otherwise the User Agent must return undefined.
>
> elapsedTime parameter
> The time, in seconds, at which this event triggered, relative to when
> this utterance began to be spoken. The User Agent must return this value
> if the speech synthesis engine supports it; otherwise the User Agent must
> return undefined.
>
> SpeechSynthesisUtterance Attributes
>
> text attribute
> The text to be synthesized and spoken for this utterance. This may be
> either plain text or a complete, well-formed SSML document. For speech
> synthesis engines that do not support SSML, or only support certain tags,
> the User Agent or speech engine must strip away the tags they do not
> support and speak the text. The text may be subject to a maximum length
> of 32,767 characters.
>
> SpeechSynthesisUtterance Events
>
> update event
> Fired when the spoken utterance reaches a word boundary or a named
> marker. The User Agent should fire this event if the speech synthesis
> engine provides it.
>
> On Tue, Oct 2, 2012 at 6:48 PM, Glen Shires <gshires@google.com> wrote:
>
>> Yes, I agree. Based on this, here's my proposal for the spec. If there's
>> no disagreement, I'll add this to the spec on Thursday.
>>
>> IDL:
>>
>> const unsigned long NAMED_MARKER = 1;
>> const unsigned long WORD_MARKER = 2;
>>
>> callback SpeechSynthesisMarkerCallback = void(const unsigned long
>> markerType, DOMString markerName, unsigned long charIndex, double
>> elapsedTime);
>>
>> Definitions:
>>
>> SpeechSynthesisMarkerCallback parameters
>>
>> markerType parameter
>> An enumeration indicating the type of marker that caused this event,
>> either NAMED_MARKER or WORD_MARKER.
>>
>> markerName parameter
>> For events with markerType of NAMED_MARKER, contains the name of the
>> marker, as defined in SSML as the name attribute of a mark element. For
>> events with markerType of WORD_MARKER, this value should be undefined.
>>
>> charIndex parameter
>> The zero-based character index into the original utterance string that
>> most closely approximates the current speaking position of the speech
>> engine. No guarantee is given as to where charIndex will be with respect
>> to word boundaries (such as at the end of the previous word or the
>> beginning of the next word), only that all text before charIndex has
>> already been spoken, and all text after charIndex has not yet been spoken.
>> The User Agent must return this value if the speech synthesis engine
>> supports it; otherwise the User Agent must return undefined.
>>
>> elapsedTime parameter
>> The time, in seconds, at which this marker triggered, relative to when
>> this utterance began to be spoken. The User Agent must return this value
>> if the speech synthesis engine supports it; otherwise the User Agent must
>> return undefined.
>>
>> SpeechSynthesisUtterance Attributes
>>
>> text attribute
>> The text to be synthesized and spoken for this utterance. This may be
>> either plain text or a complete, well-formed SSML document. For speech
>> synthesis engines that do not support SSML, or only support certain tags,
>> the User Agent or speech engine must strip away the tags they do not
>> support and speak the text. The text may be subject to a maximum length
>> of 32,767 characters.
>>
>> SpeechSynthesisUtterance Events
>>
>> marker event
>> Fired when the spoken utterance reaches a word boundary or a named
>> marker. The User Agent should fire this event if the speech synthesis
>> engine provides it.
>>
>> /Glen Shires
>>
>> On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com>
>> wrote:
>>
>>> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
>>> > - Should markerName be changed from "DOMString" to "type ( enumerated
>>> > string ["word", "sentence", "marker"] )"?
>>>
>>> I'd be fine with an enum, as long as it's clear that we have the option
>>> to expand on this in the future - for example, an engine might be able
>>> to do a callback for each phoneme or syllable.
>>>
>>> > If so, how are named markers returned? (We could add a "DOMString
>>> > namedMarker" parameter or add an "onnamedmarker()" event.)
>>>
>>> I'd prefer the DOMString namedMarker over a separate event. I think one
>>> event for all progress updates is simpler for both server and client.
>>>
>>> > - What should be the format for elapsedTime?
>>>
>>> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
>>> That'd be the time since speech on this utterance began, in seconds.
>>> Double-precision float.
>>>
>>> > I propose the following definition:
>>> >
>>> > SpeechSynthesisMarkerCallback parameters
>>> >
>>> > charIndex parameter
>>> > The zero-based character index into the original utterance string of
>>> > the word, sentence or marker about to be spoken.
>>>
>>> I'd word this slightly differently. In my experience, some engines
>>> support callbacks *before* a word, others support callbacks *after* a
>>> word. Practically they're almost the same thing, but not quite - the
>>> time is slightly different due to pauses, and the charIndex is either
>>> before the first letter or after the last letter in the word.
>>>
>>> My suggestion: The zero-based character index into the original
>>> utterance string that most closely approximates the current speaking
>>> position of the speech engine. No guarantee is given as to where
>>> charIndex will be with respect to word boundaries (such as at the end
>>> of the previous word or the beginning of the next word), only that all
>>> text before charIndex has already been spoken, and all text after
>>> charIndex has not yet been spoken.
>>>
>>> What do you think? Feel free to edit / refine, I'd just like to word it
>>> in such a way that we can support a wide variety of existing speech
>>> engines and that clients don't make assumptions about the callbacks
>>> that won't always be true.
>>>
>>> - Dominic
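To make the proposed callback shapes concrete, here is a minimal usage sketch against the IDL at the top of this message. It is illustrative only: the TypeScript declarations simply mirror the proposed IDL, not any shipped browser API, and createUtterance() plus the speechSynthesis.speak() entry point are hypothetical stand-ins for whatever constructor and speak method the spec ultimately defines.

// Sketch only; mirrors the proposed IDL above. createUtterance() and
// speechSynthesis.speak() are assumed stand-ins, not defined by this proposal.
export {}; // keep these sketch declarations module-local so they don't clash with real DOM globals

type UpdateType = "mark" | "word";

interface SpeechSynthesisUtterance {
  text: string;
  onstart: (() => void) | null;
  onend: ((charIndex: number, elapsedTime: number) => void) | null;
  onpause: ((charIndex: number, elapsedTime: number) => void) | null;
  onresume: (() => void) | null;
  onupdate: ((charIndex: number, elapsedTime: number,
              type: UpdateType, markerName?: string) => void) | null;
}

declare function createUtterance(text: string): SpeechSynthesisUtterance; // hypothetical factory
declare const speechSynthesis: { speak(utterance: SpeechSynthesisUtterance): void }; // assumed entry point

const utterance = createUtterance(
  '<speak>The first half <mark name="midpoint"/> and the second half.</speak>');

// One onupdate callback delivers both word-boundary and named-marker progress.
utterance.onupdate = (charIndex, elapsedTime, type, markerName) => {
  if (type === "mark") {
    console.log(`reached marker "${markerName}" at ${elapsedTime}s`);
  } else {
    // "word": all text before charIndex has been spoken, all text after has not.
    console.log(`word boundary near character ${charIndex} at ${elapsedTime}s`);
  }
};

utterance.onend = (charIndex, elapsedTime) => {
  console.log(`utterance finished after ${elapsedTime}s`);
};

speechSynthesis.speak(utterance);

Per the parameter definitions above, charIndex and elapsedTime may be undefined when the speech engine cannot supply them, so real handlers should guard for that; the sketch omits those checks for brevity.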
Received on Thursday, 4 October 2012 01:23:21 UTC