- From: Glen Shires <gshires@google.com>
- Date: Tue, 2 Oct 2012 18:48:12 -0700
- To: Dominic Mazzoni <dmazzoni@google.com>
- Cc: public-speech-api@w3.org
- Message-ID: <CAEE5bcgYD8-AXDw=_QjYA8bdEWwoLQH9b81k02BiRKArB-NJMw@mail.gmail.com>
Yes, I agree. Based on this, here's my proposal for the spec. If there's no disagreement, I'll add this to the spec on Thursday.

IDL:

    const unsigned long NAMED_MARKER = 1;
    const unsigned long WORD_MARKER = 2;

    callback SpeechSynthesisMarkerCallback =
        void (unsigned long markerType,
              DOMString markerName,
              unsigned long charIndex,
              double elapsedTime);

Definitions:

SpeechSynthesisMarkerCallback parameters

markerType parameter
An enumeration indicating the type of marker that caused this event, either NAMED_MARKER or WORD_MARKER.

markerName parameter
For events with a markerType of NAMED_MARKER, contains the name of the marker, as defined in SSML as the name attribute of a mark element. For events with a markerType of WORD_MARKER, this value should be undefined.

charIndex parameter
The zero-based character index into the original utterance string that most closely approximates the current speaking position of the speech engine. No guarantee is given as to where charIndex will be with respect to word boundaries (such as at the end of the previous word or the beginning of the next word), only that all text before charIndex has already been spoken, and all text after charIndex has not yet been spoken. The User Agent must return this value if the speech synthesis engine supports it; otherwise the User Agent must return undefined.

elapsedTime parameter
The time, in seconds, at which this marker triggered, relative to when this utterance began to be spoken. The User Agent must return this value if the speech synthesis engine supports it; otherwise the User Agent must return undefined.

SpeechSynthesisUtterance Attributes

text attribute
The text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. For speech synthesis engines that do not support SSML, or that support only certain tags, the User Agent or speech engine must strip away the tags it does not support and speak the remaining text. The text may have a maximum length of 32,767 characters.

SpeechSynthesisUtterance Events

marker event
Fired when the spoken utterance reaches a word boundary or a named marker. The User Agent should fire this event if the speech synthesis engine provides it.

/Glen Shires
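For concreteness, a minimal, non-normative sketch of how a page might consume these marker events. It assumes the callback is attached as an "onmarker" handler on SpeechSynthesisUtterance; the proposal above defines only the callback signature, not the attachment point, so that name is an assumption.

    // Illustrative sketch only. The "onmarker" attachment point is an
    // assumption; the proposal defines just the callback signature.
    var NAMED_MARKER = 1, WORD_MARKER = 2;  // values from the proposed IDL

    var utterance = new SpeechSynthesisUtterance(
        '<speak>Hello <mark name="midpoint"/> world</speak>');

    utterance.onmarker = function(markerType, markerName, charIndex, elapsedTime) {
      if (markerType == NAMED_MARKER) {
        // Fired when the engine reaches the SSML <mark> element.
        console.log('Reached marker "' + markerName + '" at ' + elapsedTime + 's');
      } else if (markerType == WORD_MARKER) {
        // Fired at a word boundary; charIndex approximates the current
        // speaking position in the original utterance string.
        console.log('Word boundary near character ' + charIndex);
      }
    };

    speechSynthesis.speak(utterance);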
On Mon, Oct 1, 2012 at 11:07 PM, Dominic Mazzoni <dmazzoni@google.com> wrote:

> On Mon, Oct 1, 2012 at 4:14 PM, Glen Shires <gshires@google.com> wrote:
> > - Should markerName be changed from "DOMString" to "type (enumerated
> > string ["word", "sentence", "marker"])"?
>
> I'd be fine with an enum, as long as it's clear that we have the
> option to expand on this in the future - for example, an engine might
> be able to do a callback for each phoneme or syllable.
>
> > If so, how are named markers returned?
> > (We could add a "DOMString namedMarker" parameter or add an
> > "onnamedmarker()" event.)
>
> I'd prefer the DOMString namedMarker over a separate event. I think
> one event for all progress updates is simpler for both server and
> client.
>
> > - What should be the format for elapsedTime?
>
> Perhaps this should be analogous to currentTime in an HTMLMediaElement?
> That'd be the time since speech on this utterance began, in seconds,
> as a double-precision float.
>
> > I propose the following definition:
> >
> > SpeechSynthesisMarkerCallback parameters
> >
> > charIndex parameter
> > The zero-based character index into the original utterance string of the
> > word, sentence or marker about to be spoken.
>
> I'd word this slightly differently. In my experience, some engines
> support callbacks *before* a word, others support callbacks *after* a
> word. Practically they're almost the same thing, but not quite - the
> time is slightly different due to pauses, and the charIndex is either
> before the first letter or after the last letter in the word.
>
> My suggestion: The zero-based character index into the original
> utterance string that most closely approximates the current speaking
> position of the speech engine. No guarantee is given as to where
> charIndex will be with respect to word boundaries (such as at the end
> of the previous word or the beginning of the next word), only that all
> text before charIndex has already been spoken, and all text after
> charIndex has not yet been spoken.
>
> What do you think? Feel free to edit / refine, I'd just like to word
> it in such a way that we can support a wide variety of existing speech
> engines and that clients don't make assumptions about the callbacks
> that won't always be true.
>
> - Dominic
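One practical consequence of this wording for clients, sketched below: since charIndex may fall on either side of a word boundary depending on the engine, a text highlighter should split the utterance only into "already spoken" and "not yet spoken" spans rather than trying to infer the current word. (updateHighlight and its element parameters are hypothetical names for illustration.)

    // Sketch of a boundary-agnostic highlighter: it splits the utterance
    // text at charIndex instead of guessing word edges, since charIndex
    // may land before or after the boundary depending on the engine.
    function updateHighlight(utteranceText, charIndex, spokenEl, pendingEl) {
      if (charIndex === undefined) {
        return;  // the engine does not report speaking position
      }
      spokenEl.textContent = utteranceText.slice(0, charIndex);  // already spoken
      pendingEl.textContent = utteranceText.slice(charIndex);    // not yet spoken
    }

Called from the WORD_MARKER branch of a handler like the one above, this stays correct whether the engine reports boundaries before or after each word.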
Received on Wednesday, 3 October 2012 01:49:21 UTC