Copyright © 2011 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This proposed API represents the web API for doing speech in HTML. This proposal is the HTML bindings and JS functions that sit on top of the protocol work that is also being proposed by the HTML Speech Incubator Group. This includes:
The section on Design Decisions [DESIGN] covers the design decisions the group agreed to that helped direct this API proposal.
The section on Requirements and Use Cases [REQ] covers the motivation behind this proposal.
This API is designed to be used in conjunction with other APIs and elements on the web platform, including APIs to capture input and APIs to do bidirectional communications with a server (WebSockets).
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is the 13 October 2011 Editor's Draft of the HTML Speech Web API proposal. It is not a web standards track document and does not define a web standard. This proposal, or one similar to it, is likely to be included in the Incubator Group's final report, along with Requirements, Design Decisions, and the Protocol proposal. The hope is that an official web standards group will develop a web standard based on all of these inputs.
This document is produced by the HTML Speech Incubator Group.
This document being an Editor's Draft does not imply endorsement by the W3C Membership nor necessarily the membership of the HTML Speech Incubator Group. It is intended to reflect and collate previous discussions and proposals that have taken place on the public email alias and in group teleconferences. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
Web applications should have the ability to use speech to interact with users. That speech could be for output through synthesized speech, or could be for input through the user speaking to fill form items, the user speaking to control page navigation or many other collected use cases. A web application author should be able to add speech to a web application using methods familiar to web developers and should not require extensive specialized speech expertise. The web application should build on existing W3C web standards and support a wide variety of use cases. The web application author should have the flexibility to control the recognition service the web application uses, but should not have the obligation of needing to support a service. This proposal defines the basic representations for how to use grammars, parameters, and recognition results and how to process them. The interfaces and API defined in this proposal can be used with other interfaces and APIs exposed to the web platform.
Note that privacy and security concerns exist around allowing web applications to do speech recognition. User agents should make sure that end users are aware that speech recognition is occurring, and that the end users have given informed consent for this to occur. The exact mechanism of consent is user agent specific, but the privacy and security concerns have shaped many aspects of the proposal.
In the example below the speech API is used to do basic speech web search.
insert examples
Everything in this proposal is informative since this is not a standards track document. However, RFC2119 normative language is used where appropriate to aid in the future should this proposal be moved into a standards track process.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in Key words for use in RFCs to Indicate Requirement Levels [RFC2119].
The reco element is the way to do speech recognition using markup bindings. The reco element is legal wherever phrasing content is expected, and can contain any phrasing content, but with no descendant recoable elements unless it is the element's reco control, and no descendant reco elements.
This section is based on Michael Bodell's proposal and the meeting discussion.
[NamedConstructor=Reco(),
NamedConstructor=Reco(in DOMString for)]
interface HTMLRecoElement : HTMLElement {
// Attributes
readonly attribute HTMLFormElement? form;
attribute DOMString htmlFor;
readonly attribute HTMLElement? control;
attribute SpeechInputRequest request;
attribute DOMString serviceURI;
};
The reco element represents a speech input in a user interface. The speech input can be associated with a specific form control, known as the reco element's reco control, either using the for attribute, or by putting the form control inside the reco element itself.
Except where otherwise specified by the following rules, a reco element has no reco control.
Some elements are categorized as recoable elements. These are elements that can be associated with a reco element:
The reco element's exact default presentation and behavior, in particular what its activation behavior might be and what implicit grammars might be defined, if any, is unspecified and user agent specific. The activation behavior of a reco element for events targeted at interactive content descendants of a reco element, and any descendants of those interactive content descendants, MUST be to do nothing. When a reco element with a reco control is activated and gets a reco result, the default action of the recognition event MUST be to set the value of the reco control to the top n-best interpretation of the recognition (in the case of single recognition) or an appended latest top n-best interpretation (in the case of dictation mode with multiple inputs). In addition, for input of type checkbox and radiobutton the checked property MUST be set.
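The default action described above can be illustrated with a minimal sketch. It is not part of the proposal: `AlternativeSketch`, `ControlSketch`, and `applyRecoResult` are hypothetical names, the n-best list is assumed to be ordered with the top alternative first, the single-space separator used in dictation mode is an assumption (the proposal does not name one), and "the checked property MUST be set" is read here as setting it to true.

```typescript
// Hypothetical stand-ins for an n-best item and a form control; not the real DOM.
interface AlternativeSketch {
  utterance: string;
  confidence: number;
  interpretation: unknown;
}

interface ControlSketch {
  type: string;     // e.g. "text", "checkbox", "radio"
  value: string;
  checked: boolean;
}

// Applies the default action of a recognition result to the reco control.
// Assumption: nbest[0] is the top alternative.
function applyRecoResult(
  control: ControlSketch,
  nbest: AlternativeSketch[],
  dictationMode: boolean
): void {
  if (nbest.length === 0) {
    return; // nothing recognized, leave the control untouched
  }
  const top = String(nbest[0].interpretation);
  if (dictationMode && control.value !== "") {
    // Dictation: append the latest top interpretation (separator is assumed).
    control.value = control.value + " " + top;
  } else {
    // Single recognition: replace the value.
    control.value = top;
  }
  if (control.type === "checkbox" || control.type === "radio") {
    control.checked = true; // assumed reading of "MUST be set"
  }
}
```

In this sketch a text control holding "hello" that receives "world" in dictation mode ends up holding "hello world", while in single recognition mode it would simply hold "world".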
The form attribute is used to explicitly associate the reco element with its form owner.
The form IDL attribute is part of the element's forms API.
The htmlFor IDL attribute MUST reflect the for content attribute.
The for attribute MAY be specified to indicate a form control with which a speech input is to be associated. If the attribute is specified, the attribute's value MUST be the ID of a recoable element in the same Document as the reco element. If the attribute is specified and there is an element in the Document whose ID is equal to the value of the for attribute, and the first such element is a recoable element, then that element is the reco element's reco control.
If the for attribute is not specified, but the reco element has a recoable element descendant, then the first such descendant in tree order is the reco element's reco control.
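The rules for finding a reco element's reco control can be sketched as follows. This is an illustration under stated assumptions rather than a definitive algorithm: `SketchElement` is a hypothetical stand-in for a DOM element, and `ASSUMED_RECOABLE_TAGS` is an illustrative guess, since the list of recoable elements is not reproduced in this text.

```typescript
// Hypothetical minimal element model; not the real DOM API.
interface SketchElement {
  tag: string;
  id?: string;
  forAttr?: string;            // models the `for` content attribute of a reco element
  children: SketchElement[];
}

// Assumption: illustrative only, the proposal's own list is not shown here.
const ASSUMED_RECOABLE_TAGS: string[] = ["input", "textarea", "select"];

function isRecoable(el: SketchElement): boolean {
  return ASSUMED_RECOABLE_TAGS.indexOf(el.tag) !== -1;
}

// Pre-order, depth-first walk, which yields elements in tree order.
function collectInTreeOrder(root: SketchElement, out: SketchElement[] = []): SketchElement[] {
  out.push(root);
  for (const child of root.children) {
    collectInTreeOrder(child, out);
  }
  return out;
}

function findRecoControl(doc: SketchElement, reco: SketchElement): SketchElement | null {
  if (reco.forAttr !== undefined) {
    // Only the first element with a matching ID counts, and it must be recoable.
    for (const el of collectInTreeOrder(doc)) {
      if (el.id === reco.forAttr) {
        return isRecoable(el) ? el : null;
      }
    }
    return null;
  }
  // No for attribute: the first recoable descendant in tree order, if any.
  for (const child of reco.children) {
    for (const el of collectInTreeOrder(child)) {
      if (isRecoable(el)) {
        return el;
      }
    }
  }
  return null;
}

// Usage: a reco element that points at an input by ID.
const exampleInput: SketchElement = { tag: "input", id: "query", children: [] };
const exampleReco: SketchElement = { tag: "reco", forAttr: "query", children: [] };
const exampleDoc: SketchElement = { tag: "body", children: [exampleInput, exampleReco] };
const exampleControl = findRecoControl(exampleDoc, exampleReco); // the input element
```

Note that when the for attribute is present but does not resolve to a recoable element, the sketch returns null instead of falling back to descendants, which matches the "not specified" condition in the second rule.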
The control attribute returns the form control that is associated with this element. The control IDL attribute MUST return the reco element's reco control, if any, or null if there isn't one.
control . recos returns a NodeList of all the reco elements that the form control is associated with.
Recoable elements have a NodeList object associated with them that represents the list of reco elements, in tree order, whose reco control is the element in question. The recos IDL attribute of recoable elements, on getting, MUST return that NodeList object.
The request attribute represents the SpeechInputRequest associated with this reco element. By default the User Agent sets up the speech service specified by serviceURI and the default speech input request associated with this reco. The author MAY set this attribute to associate a markup reco element with an author-created speech input request. In this way the author has control over the reco involved.
The serviceURI attribute specifies the speech service to use in the constructed default request. If the serviceURI is unset then the User Agent MUST use the User Agent default service.
Two constructors are provided for creating HTMLRecoElement objects (in addition to the factory methods from DOM Core such as createElement()): Reco() and Reco(for). When invoked as constructors, these MUST return a new HTMLRecoElement object (a new reco element). If the for argument is present, the object created MUST have its for content attribute set to the provided value. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.
The TTS element is the way to do speech synthesis using markup bindings. The TTS element is legal where embedded content is expected. If the TTS element has a src attribute, then its content model is zero or more track elements, then transparent, but with no media element descendants. If the element does not have a src attribute, then its content model is one or more source elements, then zero or more track elements, then transparent, but with no media element descendants.
This section is based on Michael Bodell's proposal and the meeting discussion.
[NamedConstructor=TTS(),
NamedConstructor=TTS(in DOMString src)]
interface HTMLTTSElement : HTMLMediaElement {
attribute DOMString serviceURI;
attribute DOMString lastMark;
};
A TTS element represents a synthesized audio stream. A TTS element is a media element whose media data is ostensibly synthesized audio data.
When a TTS element is potentially playing, it must have its TTS data played synchronized with the current playback position, at the element's effective media volume.
When a TTS element is not potentially playing, TTS must not play for the element.
Content MAY be provided inside the TTS element. User agents SHOULD NOT show this content to the user; it is intended for older Web browsers which do not support TTS.
In particular, this content is not intended to address accessibility concerns. To make TTS content accessible to those with physical or cognitive disabilities, authors are expected to provide alternative media streams and/or to embed accessibility aids (such as transcriptions) into their media streams.
Implementations SHOULD support at least UTF-8 encoded text/plain and application/ssml+xml (both SSML 1.0 and 1.1 SHOULD be supported).
The existing timeupdate event is dispatched to report progress through the synthesized speech. If the synthesis is of type application/ssml+xml, timeupdate events should be fired for each mark element that is encountered.
The src, preload, autoplay, mediagroup, loop, muted, and controls attributes are the attributes common to all media elements.
The serviceURI attribute specifies the speech service to use in the constructed default request. If the serviceURI is unset then the User Agent MUST use the User Agent default service.
The new lastMark attribute must, on getting, return the name of the last SSML mark element that was encountered during playback. If no mark has been encountered yet, the attribute must return null.
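The lastMark behavior can be pictured with a small sketch. It assumes, purely for illustration, that the playback offset of each SSML mark is known in advance; `MarkSketch` and `lastMarkAt` are hypothetical names and are not part of the proposed interface.

```typescript
// Hypothetical record of an SSML mark and when playback reaches it.
interface MarkSketch {
  markName: string;
  timeSec: number;   // assumed playback offset of the mark, in seconds
}

// Returns the name of the last mark reached at the given playback position,
// or null if no mark has been encountered yet.
function lastMarkAt(marks: MarkSketch[], currentTimeSec: number): string | null {
  let last: string | null = null;
  let lastTime = -Infinity;
  for (const m of marks) {
    // A mark counts once playback has reached it; later marks win over earlier ones.
    if (m.timeSec <= currentTimeSec && m.timeSec >= lastTime) {
      last = m.markName;
      lastTime = m.timeSec;
    }
  }
  return last;
}

// Usage: two marks in a synthesized prompt.
const exampleMarks: MarkSketch[] = [
  { markName: "greeting", timeSec: 0.5 },
  { markName: "question", timeSec: 2.0 },
];
const beforeAnyMark = lastMarkAt(exampleMarks, 0.1);   // null
const afterGreeting = lastMarkAt(exampleMarks, 1.0);   // "greeting"
```

A page script could call a function like this from a timeupdate handler to decide which part of the prompt is currently being spoken.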
Two constructors are provided for creating HTMLTTSElement objects (in addition to the factory methods from DOM Core such as createElement()): TTS() and TTS(src). When invoked as constructors, these MUST return a new HTMLTTSElement object (a new tts element). The element MUST have its preload attribute set to the literal value "auto". If the src argument is present, the object created MUST have its src content attribute set to the provided value, and the user agent MUST invoke the object's resource selection algorithm before returning. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.
The speech input request interface is the scripted web API for controlling a given recognition.
This section is based on Debbie Dahl's proposal, Bjorn Bringert's proposal, and Olli Pettay's proposal.
[Constructor]
interface SpeechInputRequest {
// recognition property methods
// grammar methods
void resetGrammars();
void addGrammar(in DOMString src,
optional float weight,
optional boolean modal);
void disableGrammar(in DOMString src);
// misc parameter methods
void setmaxnbest(in integer maxnbest);
void setlanguage(in DOMString language);
void setsaveforrereco(in boolean saveforrereco);
void setendpointdetection(in boolean endpointdetection);
void setfinalizebeforeend(in boolean finalizebeforeend);
void setinterimresults(in integer interimresults);
void setconfidencethreshold(in float confidencethreshold);
void setsensitivity(in float sensitivity);
void setspeedversusaccuracy(in float speedvsaccuracy);
void setcompletetimeout(in integer completetimeout);
void setincompletetimeout(in integer incompletetimeout);
void setmaxspeechtimeout(in integer maxspeechtimeout);
// the generic set parameter
void setcustomparameter(in DOMString name, in DOMString value);
// waveform methods
void setinputwaveformURI(in DOMString inputwaveformURI);
// the generic send info method
void sendInfo(in DOMString type, in DOMString value);
// methods to drive the speech interaction
void open();
void start();
void stop();
void abort();
// TODO: Still need to turn methods into attributes.
// attributes
attribute DOMString uri;
attribute MediaStream input;
const unsigned short SPEECH_AUTHORIZATION_UNKNOWN = 0;
const unsigned short SPEECH_AUTHORIZATION_AUTHORIZED = 1;
const unsigned short SPEECH_AUTHORIZATION_NOT_AUTHORIZED = 2;
readonly attribute unsigned short authorizationState;
attribute boolean continuous;
// event methods
attribute Function onaudiostart;
attribute Function onsoundstart;
attribute Function onspeechstart;
attribute Function onspeechend;
attribute Function onsoundend;
attribute Function onaudioend;
attribute Function onresult;
attribute Function onnomatch;
attribute Function onerror;
attribute Function onauthorizationchange;
attribute Function onopen;
attribute Function onstart;
attribute Function onend;
};
SpeechInputRequest implements EventTarget;
interface SpeechInputNomatchEvent : Event {
readonly attribute SpeechInputResult result;
};
interface SpeechInputErrorEvent : Event {
readonly attribute SpeechInputError error;
};
interface SpeechInputError {
const unsigned short SPEECH_INPUT_ERR_OTHER = 0;
const unsigned short SPEECH_INPUT_ERR_NO_SPEECH = 1;
const unsigned short SPEECH_INPUT_ERR_ABORTED = 2;
const unsigned short SPEECH_INPUT_ERR_AUDIO_CAPTURE = 3;
const unsigned short SPEECH_INPUT_ERR_NETWORK = 4;
const unsigned short SPEECH_INPUT_ERR_NOT_ALLOWED = 5;
const unsigned short SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED = 6;
const unsigned short SPEECH_INPUT_ERR_BAD_GRAMMAR = 7;
const unsigned short SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED = 8;
readonly attribute unsigned short code;
readonly attribute DOMString message;
};
// Item in N-best list
interface SpeechInputAlternative {
readonly attribute DOMString utterance;
readonly attribute float confidence;
readonly attribute any interpretation;
};
// A complete one-shot simple response
interface SpeechInputResult {
readonly attribute Document resultEMMAXML;
readonly attribute DOMString resultEMMAText;
readonly attribute unsigned long length;
getter SpeechInputAlternative item(in unsigned long index);
readonly attribute boolean final;
};
// A full response, which could be interim or final, part of a continuous response or not
interface SpeechInputResultEvent : Event {
readonly attribute SpeechInputResult result;
readonly attribute short resultIndex;
readonly attribute SpeechInputResult[] results;
readonly attribute DOMString sessionId;
};
The DOM Level 2 Event Model is used for speech recognition events. The methods in the EventTarget interface should be used for registering event listeners. The SpeechInputRequest interface also contains convenience attributes for registering a single event handler for each event type.
For all these events, the timeStamp attribute defined in the DOM Level 2 Event interface must be set to the best possible estimate of when the real-world event which the event object represents occurred.
Unless specified below, the ordering of the different events is undefined. For example, some implementations may fire audioend before speechstart or speechend if the audio detector is client-side and the speech detector is server-side.
The speech input error object has two attributes: code and message.
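As an illustration of how a page might consume these two attributes, the sketch below maps the numeric code back to the constant names listed in the SpeechInputError interface. `SpeechInputErrorSketch`, `SPEECH_INPUT_ERROR_NAMES`, and `describeSpeechInputError` are hypothetical helper names, not part of the proposal, and the output format is an arbitrary choice.

```typescript
// Plain-object stand-in for the two attributes of a speech input error.
interface SpeechInputErrorSketch {
  code: number;
  message: string;
}

// Constant names indexed by their numeric value, as listed in the IDL (0 through 8).
const SPEECH_INPUT_ERROR_NAMES: string[] = [
  "SPEECH_INPUT_ERR_OTHER",
  "SPEECH_INPUT_ERR_NO_SPEECH",
  "SPEECH_INPUT_ERR_ABORTED",
  "SPEECH_INPUT_ERR_AUDIO_CAPTURE",
  "SPEECH_INPUT_ERR_NETWORK",
  "SPEECH_INPUT_ERR_NOT_ALLOWED",
  "SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED",
  "SPEECH_INPUT_ERR_BAD_GRAMMAR",
  "SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED",
];

// Builds a log-friendly description from the code and the optional message.
function describeSpeechInputError(err: SpeechInputErrorSketch): string {
  const label = SPEECH_INPUT_ERROR_NAMES[err.code];
  const resolved = label !== undefined ? label : "UNKNOWN_CODE_" + err.code;
  return err.message !== "" ? resolved + ": " + err.message : resolved;
}
```

A handler registered through onerror could pass the event's error object to a helper like this before logging or showing a message to the user.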
The SpeechInputAlternative represents a simple view of the response that gets used in an n-best list.
The SpeechInputResult object represents a single one-shot recognition match, either as one small part of a continuous recognition or as the complete return result of a non-continuous recognition.
The Speech Input Result event is the event that is raised each time there is an interim or final result. The event contains both the current most recent recognized bit (in the result object) as well as a history of the complete recognition session so far (in the results object).
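The relationship between result and results can be sketched as follows. This is a simplified model, not the proposed implementation: `ResultSketch`, `ResultEventSketch`, and `RecognitionSessionSketch` are hypothetical names, every result is assumed to be appended to the history, and resultIndex is assumed to be the position of the current result within results (the text above does not define it).

```typescript
// Simplified stand-in for a single recognition result.
interface ResultSketch {
  topUtterance: string;
  final: boolean;
}

// Simplified stand-in for the payload of a result event.
interface ResultEventSketch {
  result: ResultSketch;       // the most recent recognized piece
  resultIndex: number;        // assumed: index of `result` within `results`
  results: ResultSketch[];    // history of the session so far
  sessionId: string;
}

class RecognitionSessionSketch {
  private history: ResultSketch[] = [];
  private sessionId: string;

  constructor(sessionId: string) {
    this.sessionId = sessionId;
  }

  // Records a new interim or final result and builds the event payload for it.
  push(result: ResultSketch): ResultEventSketch {
    this.history.push(result);
    return {
      result: result,
      resultIndex: this.history.length - 1,
      results: this.history.slice(), // snapshot, so earlier events are not mutated
      sessionId: this.sessionId,
    };
  }
}

// Usage: two results arriving during one continuous session.
const exampleSession = new RecognitionSessionSketch("session-1");
const firstEvent = exampleSession.push({ topUtterance: "call", final: false });
const secondEvent = exampleSession.push({ topUtterance: "call home", final: true });
```

Copying the history into each event keeps earlier event objects stable after later results arrive; whether a real implementation would share or copy the array is left open here.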
Here are the design decisions from the XG that are relevant to the Web API proposal:
insert other design decisions as we receive them and review them
This section covers some of the requirements for this API and illustrates some use cases. Note that more extensive information can be found at HTML Speech XG Use Cases and Requirements as well as in the final XG note including requirements and use cases.
Voice Web Search. A user can speak a query and get a result.
Speech Command Interface. A Speech Command and Control Shell that allows multiple commands, many of which take arguments, such as "call [number]", "call [person]", "calculate [math expression]", "play [song]", or "search for [query]".
Speech UI present when no visible UI need be present. Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application.
A Speech Enabled Email Client. The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", "read message". When an email message is received, a summary notification may be raised that displays a small amount of content (for instance the person the email is from and a couple of words of the subject). It is desirable that a speech API be present and listening for the duration of this notification, allowing a user experience of being able to say "Reply to that" or "Read that email message". Note that this recognition UI could not be contingent on the user clicking a button, as that would defeat much of the benefit of this scenario (being able to reply and control the email without using the keyboard or mouse).
This proposal was developed by the HTML Speech XG.
This work builds on the existing work including:
Special thanks to the members of the XG: Andrei Popescu, Andy Mauro, Björn Bringert, Chaitanya Gharpure, Charles Chen, Dan Druta, Daniel Burnett, Dave Burke, David Bolter, Deborah Dahl, Fabio Paternò, Glen Shires, Ingmar Kliche, Jerry Carter, Jim Larson, Kazuyuki Ashimura, Marc Schröder, Markus Gylling, Masahiro Araki, Matt Womer, Michael Bodell, Michael Johnston, Milan Young, Olli Pettay, Paolo Baggia, Patrick Ehlen, Raj Tumuluri, Rania Elnaggar, Ravi Reddy, Robert Brown, Satish Kumar Sampath, Somnath Chandra, and T.V. Raman.