HTML Speech Web API

1. Introduction

Web applications should have the ability to use speech to interact with users. That speech could be for output through synthesized speech, or for input through the user speaking to fill form items, the user speaking to control page navigation, or many other collected use cases. A web application author should be able to add speech to a web application using methods familiar to web developers and should not require extensive specialized speech expertise. The web application should build on existing W3C web standards and support a wide variety of use cases. The web application author should have the flexibility to control the recognition service the web application uses, but should not be obliged to support a service. This proposal defines the basic representations for how to use grammars, parameters, and recognition results and how to process them. The interfaces and API defined in this proposal can be used with other interfaces and APIs exposed to the web platform.

Note that privacy and security concerns exist around allowing web applications to do speech recognition. User agents should make sure that end users are aware that speech recognition is occurring, and that the end users have given informed consent for this to occur. The exact mechanism of consent is user agent specific, but the privacy and security concerns have shaped many aspects of the proposal.

Example

In the example below the speech API is used to do basic speech web search.

Speech Web Search


  <html>
    <head>
      <title>Example Speech Web Search</title>
    </head>
    <body>
      <form id="f" action="/search" method="GET">
        <label for="q">Search</label>
        <reco for="q"/>
        <input id="q" name="q" type="text"/>
        <input type="submit" value="Example Search"/>
      </form>
    </body>
  </html>

2. Conformance

Everything in this proposal is informative since this is not a standards track document. However, RFC2119 normative language is used where appropriate, to help should this proposal be moved into a standards track process in the future.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in Key words for use in RFCs to Indicate Requirement Levels [RFC2119].

3. Reco Element

The reco element is the way to do speech recognition using markup bindings. The reco element is legal wherever phrasing content is expected, and can contain any phrasing content, except with no descendant recoable elements unless it is the element's reco control, and no descendant reco elements.

Reference

This section is based on Michael Bodell's proposal and the meeting discussion.

IDL


  [NamedConstructor=Reco(),
  NamedConstructor=Reco(in DOMString for)]
    interface HTMLRecoElement : HTMLElement {
        // Attributes
        readonly attribute HTMLFormElement? form;
        attribute DOMString htmlFor;
        readonly attribute HTMLElement? control;
        attribute SpeechInputRequest request;

        attribute DOMString grammar;

        // From the SpeechInputRequest
        attribute integer maxNBest;
        attribute DOMString language;
        attribute boolean saveForRereco;
        attribute boolean endpointDetection;
        attribute boolean finalizeBeforeEnd;
        attribute integer interimResults;
        attribute float confidenceThreshold;
        attribute float sensitivity;
        attribute float speedVsAccuracy;
        attribute integer completeTimeout;
        attribute integer incompleteTimeout;
        attribute integer maxSpeechTimeout;
        attribute DOMString inputWaveformURI;
        attribute DOMString serviceURI;
        attribute boolean continuous;

        // event handlers
        attribute Function onaudiostart;
        attribute Function onsoundstart;
        attribute Function onspeechstart;
        attribute Function onspeechend;
        attribute Function onsoundend;
        attribute Function onaudioend;
        attribute Function onresult;
        attribute Function onnomatch;
        attribute Function onerror;
        attribute Function onauthorizationchange;
        attribute Function onopen;
        attribute Function onstart;
    };

The reco element represents a speech input in a user interface. The speech input can be associated with a specific form control, known as the reco element's reco control, either by using the for attribute, or by putting the form control inside the reco element itself.

Except where otherwise specified by the following rules, a reco element has no reco control.

Some elements are categorized as recoable elements. These are elements that can be associated with a reco element:

The reco element's exact default presentation and behavior, in particular what its activation behavior might be, is unspecified and user agent specific. When the reco element is bound to a recoable element and no grammar attribute is specified, the default builtin uri is used. The activation behavior of a reco element for events targeted at interactive content descendants of a reco element, and any descendants of those interactive content descendants, MUST be to do nothing. When a reco element with a reco control is activated and gets a reco result, the default action of the recognition event MUST be to use the value of the top n-best interpretation of the current result event. The exact binding depends on the recoable element in question and is covered in the binding results section.

Warning:

Not all implementors see value in linking the recognition behavior to markup, versus an all scripting API. Some user agents like the possibility of good defaults based on the associations. Some user agents like the idea of different consent bars based on the user clicking a markup button, rather than just relying on scripting. User agents are cautioned to remember click-jacking and SHOULD NOT automatically assume that when a reco element is activated it means the user meant to start recognition in all situations.

3.1. Reco Attributes

form

The form attribute is used to explicitly associate the reco element with its form owner.

The form IDL attribute is part of the element's forms API.

htmlFor

The htmlFor IDL attribute MUST reflect the for content attribute.

The for attribute MAY be specified to indicate a form control with which a speech input is to be associated. If the attribute is specified, the attribute's value MUST be the ID of a recoable element in the same Document as the reco element. If the attribute is specified and there is an element in the Document whose ID is equal to the value of the for attribute, and the first such element is a recoable element, then that element is the reco element's reco control.

If the for attribute is not specified, but the reco element has a recoable element descendant, then the first such descendant in tree order is the reco element's reco control.
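The two rules above for finding the reco control can be sketched as follows. This is a minimal illustration, not part of the proposal: it uses plain objects of the form {tag, id, htmlFor, children} in place of DOM nodes so that it runs outside a browser, and the names RECOABLE_TAGS, descendants, findById, and resolveRecoControl are hypothetical. The set of recoable tag names is an assumption taken from the elements discussed in the Default Binding of Results section.

```javascript
// Hypothetical set of recoable tag names, assumed from the elements covered
// in the default binding section; the proposal's own list may differ.
const RECOABLE_TAGS = new Set([
  "button", "input", "keygen", "meter", "output", "progress", "select", "textarea",
]);

// Depth-first, tree-order walk over a plain-object tree.
function* descendants(node) {
  for (const child of node.children || []) {
    yield child;
    yield* descendants(child);
  }
}

// Returns the first element in tree order with the given ID, or null.
function findById(root, id) {
  if (root.id === id) return root;
  for (const node of descendants(root)) {
    if (node.id === id) return node;
  }
  return null;
}

// Resolves the reco control for a reco node within a document tree.
function resolveRecoControl(documentRoot, reco) {
  if (reco.htmlFor !== undefined) {
    // "for" specified: only the first element with that ID counts,
    // and only if it is a recoable element.
    const target = findById(documentRoot, reco.htmlFor);
    return target && RECOABLE_TAGS.has(target.tag) ? target : null;
  }
  // "for" not specified: first recoable descendant in tree order.
  for (const node of descendants(reco)) {
    if (RECOABLE_TAGS.has(node.tag)) return node;
  }
  return null; // the reco element has no reco control
}
```

When the for attribute names an element that is not recoable, the sketch returns null rather than falling back to descendants, which is one reading of the text above.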

control

The control attribute returns the form control that is associated with this element. The control IDL attribute MUST return the reco element's reco control, if any, or null if there isn't one.

control . recos returns a NodeList of all the reco elements that the form control is associated with.

Recoable elements have a NodeList object associated with them that represents the list of reco elements, in tree order, whose reco control is the element in question. The recos IDL attribute of recoable elements, on getting, MUST return that NodeList object.

request

The request attribute represents the SpeechInputRequest associated with this reco element. By default the User Agent sets up the speech service specified by serviceURI and the default speech input request associated with this reco. The author MAY set this attribute to associate a markup reco element with an author created speech input request. In this way the author has control over the reco involved. When the request is set, the request's speech parameters take priority over the corresponding parameters on the reco attributes.

grammar

The uri of a grammar associated with this reco. If unset, this defaults to the default builtin uri. Note that to use multiple grammars or different weights the user must use the scripted SpeechInputRequest API.

The other attributes are all defined identically to how they appear in the SpeechInputRequest section.

3.2. Reco Constructors

Two constructors are provided for creating HTMLRecoElement objects (in addition to the factory methods from DOM Core such as createElement()): Reco() and Reco(for). When invoked as constructors, these MUST return a new HTMLRecoElement object (a new reco element). If the for argument is present, the object created MUST have its for content attribute set to the provided value. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.

3.3. Builtin Default Grammars

When the user agent needs to create a default grammar from a recoable element it builds a uri using the builtin scheme. The uri has the form builtin:<tag-name>?<tag-attributes>, where the tag-name is just the name of the recoable element, and the tag-attributes are the content attributes in the form name=value&name=value. Since this is a uri, both name and value must be properly uri escaped. Note the ? character may be omitted when there are no tag-attributes. For example:

A simple textarea like <textarea /> would produce: builtin:textarea
A simple number like <input type="number" max="3" /> would produce: builtin:input?type=number&max=3
A zip code like <input type="text" pattern="[0-9]{5}" /> would produce: builtin:input?type=text&pattern=%5B0-9%5D%7B5%7D
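The construction of these uris can be sketched as below. This is only an illustration: the function name builtinGrammarUri is hypothetical, and using encodeURIComponent for the escaping is an assumption, since the proposal only says that names and values must be properly uri escaped.

```javascript
// Builds a builtin grammar uri from a tag name and its content attributes.
// `attributes` is an array of [name, value] pairs so that order is preserved.
// encodeURIComponent is an assumed choice of escaping function.
function builtinGrammarUri(tagName, attributes = []) {
  const query = attributes
    .map(([name, value]) => `${encodeURIComponent(name)}=${encodeURIComponent(value)}`)
    .join("&");
  // The "?" is omitted when there are no tag-attributes.
  return query ? `builtin:${tagName}?${query}` : `builtin:${tagName}`;
}

console.log(builtinGrammarUri("textarea"));
console.log(builtinGrammarUri("input", [["type", "number"], ["max", "3"]]));
console.log(builtinGrammarUri("input", [["type", "text"], ["pattern", "[0-9]{5}"]]));
```

The three calls reproduce the three example uris listed above.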

Speech services may define other builtin grammars as well. It is recommended that speech services allow a builtin:dictation to represent a "say anything" grammar and builtin:websearch to represent a speech web search.

Builtin uris for grammars can be used even when the reco element is not bound to any particular element, and may also be used by the SpeechInputRequest object and as a rule reference in an SRGS grammar.

In addition to the content attributes, other parameters may be specified. It is recommended that speech services support a filter parameter that can be set to the value noOffensiveWords to represent a desire to not recognize offensive words. Speech services may define other extension parameters.

Note the exact grammar that is generated from any builtin uri is specific to the recognition service and the content attributes are best thought of as hints for the service.

3.4. Default Binding of Results

When a reco element is bound to a recoable element and does not have an onresult attribute set then the default binding is used. The default binding can also be used if the SpeechInputRequest's outputToElement method is called. In both cases the exact binding depends on the recoable element in question. In general, the binding will use the value associated with the interpretation of the top n-best element.

When the recoable element is a button and the button is not disabled, the result of a speech recognition is to activate the button.

When the recoable element is an input element then the exact binding depends on the control type. For basic text fields (input elements with a type attribute of text, search, tel, url, email, password, and number) the value of the result should be assigned to the input (inserted at the cursor and replacing any selected text). For button controls (submit, image, reset, button) the act of recognition just activates the input. For type checkbox, the input should be set to a checkedness of true. For type radio, the input should be set to a checkedness of true, and all other inputs in the radio button group must be set to a checkedness of false. For date and time types (datetime, date, month, week, time, datetime-local) the value should be assigned unless the value represents a non-empty string that is not valid for that type as described here. For type color the value should be assigned unless the value does not represent a valid lowercase simple color. For type range the assignment is only allowed if the value is a valid floating point number, and before being assigned, it must undergo the value sanitization algorithm as described here.

When the recoable element is a keygen element then the element should regenerate the key.

When the recoable element is a meter element then the value of the meter should be set to the best representation of the value as a floating point number.

When the recoable element is an output element it assigns the recognized value to the output's value (which also must set the value mode flag to value).

When the recoable element is a progress element then the value of the progress bar should be set to the best representation of the value as a floating point number.

When the recoable element is a select element then the recognition result will be used to select any options that are named the same as the interpretations value (that is any that are returned by namedItem(value)).

When the recoable element is a textarea element then the recognized value is inserted into the textarea at the text cursor, if the cursor is in the textarea. If text in the textarea is selected then the new value replaces the selected text. If the text cursor is not in the textarea then the value is appended to the end of the textarea.
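A few of the bindings described in this section can be sketched as follows. This is a partial illustration under stated assumptions, not the binding algorithm itself: controls are plain objects rather than DOM elements, the names insertAtCursor and applyDefaultBinding and the group parameter are hypothetical, and only button, textarea, and some input types are covered. Appending when an input has no cursor is an assumption carried over from the textarea rule.

```javascript
// Input types treated as basic text fields or button controls in the text above.
const TEXT_TYPES = new Set(["text", "search", "tel", "url", "email", "password", "number"]);
const BUTTON_TYPES = new Set(["submit", "image", "reset", "button"]);

// Inserts `text` at the selection, replacing any selected text; appends
// when the control has no cursor (selectionStart is null or undefined).
function insertAtCursor(control, text) {
  const value = control.value || "";
  if (control.selectionStart == null) {
    control.value = value + text;
    return;
  }
  const start = control.selectionStart;
  const end = control.selectionEnd ?? start;
  control.value = value.slice(0, start) + text + value.slice(end);
  control.selectionStart = control.selectionEnd = start + text.length;
}

// Applies a recognized value to a mock control. `group` lists the other
// radio inputs in the same radio button group (hypothetical parameter).
function applyDefaultBinding(control, value, group = []) {
  if (control.tag === "textarea") return insertAtCursor(control, value);
  if (control.tag === "button") {
    if (!control.disabled) control.activate();
    return;
  }
  if (control.tag !== "input") return; // other elements omitted from this sketch
  if (TEXT_TYPES.has(control.type)) return insertAtCursor(control, value);
  if (BUTTON_TYPES.has(control.type)) return control.activate();
  if (control.type === "checkbox") control.checked = true;
  if (control.type === "radio") {
    for (const other of group) other.checked = false;
    control.checked = true;
  }
}
```

For example, a textarea holding "hello world" with "world" selected would hold "hello there" after applyDefaultBinding(textarea, "there").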

4. TTS Element

The TTS element is the way to do speech synthesis using markup bindings. The TTS element is legal where embedded content is expected. If the TTS element has a src attribute, then its content model is zero or more track elements, then transparent, but with no media element descendants. If the element does not have a src attribute, then its content model is one or more source elements, then zero or more track elements, then transparent, but with no media element descendants.

Reference

This section is based on Michael Bodell's proposal and the meeting discussion.

IDL


  [NamedConstructor=TTS(),
  NamedConstructor=TTS(in DOMString src)]
    interface HTMLTTSElement : HTMLMediaElement {
        attribute DOMString serviceURI;
        attribute DOMString lastMark;
    };

A TTS element represents a synthesized audio stream. A TTS element is a media element whose media data is ostensibly synthesized audio data.

When a TTS element is potentially playing, it must have its TTS data played synchronized with the current playback position, at the element's effective media volume.

When a TTS element is not potentially playing, TTS must not play for the element.

Content MAY be provided inside the TTS element. User agents SHOULD NOT show this content to the user; it is intended for older Web browsers which do not support TTS.

In particular, this content is not intended to address accessibility concerns. To make TTS content accessible to those with physical or cognitive disabilities, authors are expected to provide alternative media streams and/or to embed accessibility aids (such as transcriptions) into their media streams.

Implementations SHOULD support at least UTF-8 encoded text/plain and application/ssml+xml (both SSML 1.0 and 1.1 SHOULD be supported).

The existing timeupdate event is dispatched to report progress through the synthesized speech. If the synthesis is of type application/ssml+xml, timeupdate events should be fired for each mark element that is encountered.

4.1. Attributes

The src, preload, autoplay, mediagroup, loop, muted, and controls attributes are the attributes common to all media elements.

serviceURI: The serviceURI attribute specifies the speech service to use in the constructed default request. If the serviceURI is unset then the User Agent MUST use the User Agent default service.
lastMark: The new lastMark attribute must, on getting, return the name of the last SSML mark element that was encountered during playback. If no mark has been encountered yet, the attribute must return null.
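The interaction between lastMark and the timeupdate event can be sketched with a small mock. This is not a media element: the class MockTTSPlayback and its encounterMark method are hypothetical stand-ins for whatever the synthesizer does when playback reaches an SSML mark element.

```javascript
// Mock of the mark-tracking part of a TTS element; not a real media element.
class MockTTSPlayback {
  constructor() {
    this.lastMark = null; // null until a mark has been encountered
    this.ontimeupdate = null;
  }

  // Called by the (hypothetical) synthesizer when playback reaches
  // an SSML <mark name="..."/> element.
  encounterMark(name) {
    this.lastMark = name;
    if (typeof this.ontimeupdate === "function") {
      this.ontimeupdate({ type: "timeupdate", target: this });
    }
  }
}

// Usage: collect the mark names as playback progresses.
const tts = new MockTTSPlayback();
const marksSeen = [];
tts.ontimeupdate = (event) => marksSeen.push(event.target.lastMark);
tts.encounterMark("intro");
tts.encounterMark("step1");
console.log(marksSeen.join(","));
```

A page could use the same pattern on a real TTS element to highlight the text that is currently being spoken.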

4.2. Constructors

Two constructors are provided for creating HTMLTTSElement objects (in addition to the factory methods from DOM Core such as createElement()): TTS() and TTS(src). When invoked as constructors, these MUST return a new HTMLTTSElement object (a new tts element). The element MUST have its preload attribute set to the literal value "auto". If the src argument is present, the object created MUST have its src content attribute set to the provided value, and the user agent MUST invoke the object's resource selection algorithm before returning. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.

5. The Speech Input Request Interface

The speech input request interface is the scripted web API for controlling a given recognition.

Reference

This section is based on Debbie Dahl's proposal, Bjorn Bringert's proposal, and Olli Pettay's proposal.

IDL


    [Constructor]
    interface SpeechInputRequest {
        // recognition parameters
        attribute SpeechGrammar[] grammars;

        // misc parameter attributes
        attribute integer maxNBest;
        attribute DOMString language;
        attribute boolean saveForRereco;
        attribute boolean endpointDetection;
        attribute boolean finalizeBeforeEnd;
        attribute integer interimResults;
        attribute float confidenceThreshold;
        attribute float sensitivity;
        attribute float speedVsAccuracy;
        attribute integer completeTimeout;
        attribute integer incompleteTimeout;
        attribute integer maxSpeechTimeout;
        attribute DOMString inputWaveformURI;

        // the generic set of parameters
        attribute SpeechParameter[] parameters;

        // other attributes
        attribute DOMString serviceURI;
        attribute MediaStream input;
        const unsigned short SPEECH_AUTHORIZATION_UNKNOWN = 0;
        const unsigned short SPEECH_AUTHORIZATION_AUTHORIZED = 1;
        const unsigned short SPEECH_AUTHORIZATION_NOT_AUTHORIZED = 2;
        readonly attribute unsigned short authorizationState;
        attribute boolean continuous;

        // the generic send info method
        void sendInfo(in DOMString type, in DOMString value);

        // Default markup binding methods
        void addGrammarFrom(in Element inputElement, optional float weight, optional boolean modal);
        void outputToElement(in Element outputElement);
        
        // methods to drive the speech interaction
        void open();
        void start();
        void stop();
        void abort();
        void interpret(in DOMString text);

        // event methods
        attribute Function onaudiostart;
        attribute Function onsoundstart;
        attribute Function onspeechstart;
        attribute Function onspeechend;
        attribute Function onsoundend;
        attribute Function onaudioend;
        attribute Function onresult;
        attribute Function onnomatch;
        attribute Function onerror;
        attribute Function onauthorizationchange;
        attribute Function onopen;
        attribute Function onstart;
        attribute Function onend;
    };
    SpeechInputRequest implements EventTarget;

    interface SpeechInputNomatchEvent : Event {
        readonly attribute SpeechInputResult result;
    };

    interface SpeechInputErrorEvent : Event {
        readonly attribute SpeechInputError error;
    };

    interface SpeechInputError {
        const unsigned short SPEECH_INPUT_ERR_OTHER = 0;
        const unsigned short SPEECH_INPUT_ERR_NO_SPEECH = 1;
        const unsigned short SPEECH_INPUT_ERR_ABORTED = 2;
        const unsigned short SPEECH_INPUT_ERR_AUDIO_CAPTURE = 3;
        const unsigned short SPEECH_INPUT_ERR_NETWORK = 4;
        const unsigned short SPEECH_INPUT_ERR_NOT_ALLOWED = 5;
        const unsigned short SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED = 6;
        const unsigned short SPEECH_INPUT_ERR_BAD_GRAMMAR = 7;
        const unsigned short SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED = 8;

        readonly attribute unsigned short code;
        readonly attribute DOMString message;
    };

    // Item in N-best list
    interface SpeechInputAlternative {
        readonly attribute DOMString utterance;
        readonly attribute float confidence;
        readonly attribute any interpretation;
    };

    // A complete one-shot simple response
    interface SpeechInputResult {
        readonly attribute Document resultEMMAXML;
        readonly attribute DOMString resultEMMAText;
        readonly attribute unsigned long length;
        getter SpeechInputAlternative item(in unsigned long index);
        readonly attribute boolean final;
    };

    // A full response, which could be interim or final, part of a continuous response or not
    interface SpeechInputResultEvent : Event {
        readonly attribute SpeechInputResult result;
        readonly attribute short resultIndex;
        readonly attribute SpeechInputResult[] results;
        readonly attribute DOMString sessionId;
    };

    // The object representing a speech grammar
    [Constructor]
    interface SpeechGrammar {
        attribute DOMString src;
        attribute float weight;
        attribute boolean modal;
    };

    // The object representing a speech parameter
    [Constructor]
    interface SpeechParameter {
        attribute DOMString name;
        attribute DOMString value;
    };

5.1. Speech Input Request Attributes

grammars attribute: The grammars attribute stores the array of SpeechGrammar objects which represent the grammars that are active for this recognition.
maxNBest attribute: This attribute will set the maximum number of recognition results that should be returned. The default value is 1.
language attribute: This attribute will set the language of the recognition for the request, using a valid BCP 47 language tag. If unset it remains unset for getting in script, but will default to the lang of its recoable element, if tied to an html element; the lang of the html document root element and associated hierarchy is used when the SpeechInputRequest is not associated with a recoable element. This default value is computed and used when the input request opens a connection to the recognition service.
saveForRereco attribute: This attribute instructs the speech recognition service if the utterance should be saved for later use in a rerecognition (true means save). The default value is false.
endpointDetection attribute: This attribute instructs the user agent if it should do a low latency endpoint detection (true means do endpointing). The user agent default SHOULD be true.
finalizeBeforeEnd attribute: This attribute instructs the recognition service if it should send final results when it gets them, even if the user is not done talking (true means yes it should send the results early). The user agent default SHOULD be true.
interimResults attribute: If interimResults is set to 0, that instructs the recognition service that it MUST NOT send any interim results. Other values represent a hint to the service that the web application would like interim results every this many milliseconds. The service MAY not follow the hint, as the exact interval between interim results depends on a combination of the recognition service, the grammars in use, and the utterance being recognized. The user agent default value SHOULD be 0.
confidenceThreshold attribute: This attribute represents the degree of confidence the recognition system needs in order to return a recognition match instead of a nomatch. The confidence threshold is a value between 0.0 (least confidence needed) and 1.0 (most confidence) with 0.5 as the default.
sensitivity attribute: This attribute represents the sensitivity to quiet input. The sensitivity is a value between 0.0 (least sensitive) and 1.0 (most sensitivity) with 0.5 as the default.
speedVsAccuracy attribute: This attribute instructs the recognition service on the desired trade off between low latency and high accuracy. The speedVsAccuracy is a value between 0.0 (least accurate) and 1.0 (most accurate) with 0.5 as the default.
completeTimeout attribute: This attribute represents the amount of silence, in milliseconds, needed to match a grammar when a hypothesis is at a complete match of the grammar (that is the hypothesis matches a grammar, and no larger input can possibly match a grammar).
incompleteTimeout attribute: This attribute represents the amount of silence, in milliseconds, needed to match a grammar when a hypothesis is not at a complete match of the grammar (that is the hypothesis does not match a grammar, or it does match a grammar but so could a larger input).
maxSpeechTimeout attribute: This attribute represents how much speech, in milliseconds, the recognition service should process before an end of speech or an error occurs.
inputWaveformURI attribute: This attribute, if set, instructs the speech recognition service to recognize from this URI instead of from the input MediaStream attribute.
parameters attribute: This attribute holds an array of arbitrary extension parameters. These parameters could set user specific information (such as profile, gender, or age information) or could be used to set recognition parameters specific to the recognition service in use.
serviceURI attribute: The serviceURI attribute specifies the location of the speech service the web application wishes to connect to. If this attribute is unset at the time of the open call, then the user agent MUST use the user agent default speech service.
input attribute: The input attribute is the MediaStream that we are recognizing against. If input is not set, the Speech Input Request uses the default UA provided capture (which MAY be nothing), in which case the value of input will be null. In cases where the MediaStream is set but the SpeechInputRequest hasn't yet called start, the User Agent SHOULD NOT buffer the audio. The semantics are that the web application wants to start listening to the MediaStream at the moment it calls start, and not earlier than that.
authorizationState attribute: The authorizationState variable tracks if the web application is authorized to do speech recognition. The UA SHOULD start in SPEECH_AUTHORIZATION_UNKNOWN if the user agent can not determine if the web application is able to be authorized. The state variable may change values in response to policies of the user agent and possibly security interactions with the end user. If the web application is authorized then the user agent MUST set this variable to SPEECH_AUTHORIZATION_AUTHORIZED. If the web application is not authorized then the user agent MUST set this variable to SPEECH_AUTHORIZATION_NOT_AUTHORIZED. Any time this state variable changes in value the user agent MUST raise an authorizationchange event.
continuous attribute: When the continuous attribute is set to false the service MUST only return a single simple recognition response as a result of starting recognition. This represents a request/response single turn pattern of interaction. When the continuous attribute is set to true the service MUST return a set of recognitions, more like a dictation of multiple recognitions, in response to a single start of recognition. The user agent default value SHOULD be false.
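The defaults and ranges given in the attribute descriptions above can be collected in a short sketch. The function names defaultRequestParameters, outOfRangeParameters, and effectiveLanguage are hypothetical, and the language fallback is a simplification of the lang resolution described for the language attribute: it takes the already computed lang values as plain strings.

```javascript
// Defaults stated or recommended in the attribute descriptions above.
// Attributes without a stated default (timeouts, language, serviceURI,
// and so on) are left out.
function defaultRequestParameters() {
  return {
    maxNBest: 1,
    saveForRereco: false,
    endpointDetection: true,
    finalizeBeforeEnd: true,
    interimResults: 0,
    confidenceThreshold: 0.5,
    sensitivity: 0.5,
    speedVsAccuracy: 0.5,
    continuous: false,
  };
}

// Returns the names of the 0.0 to 1.0 attributes whose values are out of range.
function outOfRangeParameters(params) {
  return ["confidenceThreshold", "sensitivity", "speedVsAccuracy"].filter(
    (name) => !(params[name] >= 0.0 && params[name] <= 1.0)
  );
}

// Language used when the connection opens: the explicit value if set,
// else the lang of the bound recoable element, else the document lang.
function effectiveLanguage(request, recoableLang, documentLang) {
  return request.language || recoableLang || documentLang;
}
```

Note that effectiveLanguage does not write back to the request, which matches the statement that an unset language remains unset for getting in script.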

5.2. Speech Input Request Methods

The sendInfo method: The method allows one to pass arbitrary information to the recognition service, even while recognition is ongoing. Each sendInfo call is transmitted immediately to the recognition service. The type argument specifies the content-type of the info message and the value argument specifies the payload of the info message.
The addGrammarFrom method: This method allows one to create a builtin grammar uri from a recoable element as outlined in the builtin uri description. This grammar is then appended to the grammars array parameter. The element in question is provided by the inputElement argument. The optional weight and modal arguments set the corresponding attribute values in the created SpeechGrammar object.
The outputToElement method: This method defines a result method handler that will perform the default binding of recognition matches to the recoable element passed in as the outputElement method argument. This is the same default binding that occurs when a <reco> element is bound, as described in the Default Binding of Results section.
The open method: When the open method is called the user agent MUST connect to the speech service. All of the attributes and parameters of the SpeechInputRequest (i.e., languages, grammars, service uri, etc.) MUST be set before this method is called, because they will be fixed with the values they have at the time open is called, at least until open is called again. Note that the user agent MAY need to have a permissions dialog at this point to ensure that the end user has given informed consent for the web application to listen to the user and recognize. Errors MAY be raised at this point for a variety of reasons including: not authorized to do recognition, failure to connect to the service, the service can not handle the languages or grammars needed for this turn, etc. When the service has successfully completed the open with no errors the user agent MUST raise an open event.
The start method: When the start method is called it represents the moment in time the web application wishes to begin recognition. When the speech input is streaming live through the input media stream, then this start call represents the moment in time that the service MUST begin to listen and try to match the grammars associated with this request. If open has not yet been called on the SpeechInputRequest before the start call is made, a call to open is made by the start call (complete with the open event being raised). Once the system is successfully listening for recognition the user agent MUST raise a start event.
The stop method: The stop method represents an instruction to the recognition service to stop listening to more audio, and to try and return a result using just the audio that it has received to date. A typical use of the stop method might be for a web application where the end user is doing the end pointing, similar to a walkie-talkie. The end user might press and hold the space bar to talk to the system and on the space down press the start call would have occurred and when the space bar is released the stop method is called to ensure that the system is no longer listening to the user. Once the stop method is called the speech service MUST NOT collect additional audio and MUST NOT continue to listen to the user. The speech service MUST attempt to return a recognition result (or a nomatch) based on the audio that it has collected to date.
The abort method: The abort method is a request to immediately stop listening and stop recognizing, and to return no information other than that the system is done. When the abort method is called the speech service MUST stop recognizing. The user agent MUST raise an end event once the speech service is no longer connected.
The interpret method: The interpret method provides a mechanism to request recognition using text, rather than audio. The text parameter is the string of text to recognize against. When bypassing audio recognition a number of the normal parameters MAY be ignored and the sound and audio events SHOULD NOT be generated. Other normal SpeechInputRequest events SHOULD be generated.
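The relationship between open, start, stop, and abort can be sketched with a mock object that only records the events it would raise. This is an illustration under assumptions: MockSpeechInputRequest is hypothetical, connecting always succeeds, stop always yields a result rather than a nomatch, and the session is assumed to end after that single result (as in the non-continuous case).

```javascript
// Mock request that records the events it would raise. No audio or network
// is involved; "connecting" always succeeds in this sketch.
class MockSpeechInputRequest {
  constructor() {
    this.opened = false;
    this.listening = false;
    this.events = []; // event names in the order raised
  }

  raise(type) {
    this.events.push(type);
    const handler = this["on" + type];
    if (typeof handler === "function") handler({ type });
  }

  open() {
    this.opened = true;
    this.raise("open");
  }

  start() {
    if (!this.opened) this.open(); // start implies open
    this.listening = true;
    this.raise("start");
  }

  stop() {
    if (!this.listening) return;
    this.listening = false;
    // After stop the service must still attempt a result or a nomatch;
    // this mock always reports a result (assumption).
    this.raise("result");
    this.disconnect();
  }

  abort() {
    // Abort returns no information other than that the system is done.
    this.listening = false;
    this.disconnect();
  }

  disconnect() {
    if (!this.opened) return;
    this.opened = false;
    this.raise("end");
  }
}

// Walkie-talkie style usage: start on key down, stop on key up.
const request = new MockSpeechInputRequest();
request.start();
request.stop();
console.log(request.events.join(","));
```

Calling abort instead of stop in the usage above would skip the result event and go straight to end.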

5.3. Speech Input Request Events

The DOM Level 2 Event Model is used for speech recognition events. The methods in the EventTarget interface should be used for registering event listeners. The SpeechInputRequest interface also contains convenience attributes for registering a single event handler for each event type.

For all these events, the timeStamp attribute defined in the DOM Level 2 Event interface must be set to the best possible estimate of when the real-world event which the event object represents occurred.

Unless specified below, the ordering of the different events is undefined. For example, some implementations may fire audioend before speechstart or speechend if the audio detector is client-side and the speech detector is server-side.

audiostart event: Fired when the user agent has started to capture audio.
soundstart event: Some sound, possibly speech, has been detected. This MUST be fired with low latency, e.g. by using a client-side energy detector.
speechstart event: The speech that will be used for speech recognition has started.
speechend event: The speech that will be used for speech recognition has ended. speechstart MUST always have been fired before speechend.
soundend event: Some sound is no longer detected. This MUST be fired with low latency, e.g. by using a client-side energy detector. soundstart MUST always have been fired before soundend.
audioend event: Fired when the user agent has finished capturing audio. audiostart MUST always have been fired before audioend.
result event: Fired when the speech recognizer returns a result. See section 5.7, Speech Input Result Event, for more information.
nomatch event: Fired when the speech recognizer returns a final result with no recognition hypotheses that meet or exceed the confidence threshold. The result field in the event MAY contain speech recognition results that are below the confidence threshold or MAY be null.
error event: Fired when a speech recognition error occurs. The error attribute MUST be set to a SpeechInputError object.
authorizationchange event: Fired whenever the state variable tracking if the web application is authorized to listen to the user and do speech recognition changes its value.
open event: Fired whenever the SpeechInputRequest has successfully connected to the speech service and the various parameters of the request can be satisfied with the service.
start event: Fired when the recognition service has begun to listen to the audio with the intention of recognizing.
end event: Fired when the service has disconnected. The event MUST always be generated when the session ends no matter the reason for the end.
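
The three ordering guarantees stated in the list above (speechstart before speechend, soundstart before soundend, audiostart before audioend) can be checked against a recorded sequence of event types. The sketch below is an illustration only; findOrderingViolations and REQUIRED_BEFORE are hypothetical names, and the event sequences are made-up sample data.

```javascript
// Each key MUST be preceded by its value, per the event descriptions.
const REQUIRED_BEFORE = new Map([
  ["speechend", "speechstart"],
  ["soundend", "soundstart"],
  ["audioend", "audiostart"],
]);

// Hypothetical helper: takes event type names in the order they fired
// and returns a description of every ordering rule that was broken.
function findOrderingViolations(eventTypes) {
  const seen = new Set();
  const violations = [];
  for (const type of eventTypes) {
    const prerequisite = REQUIRED_BEFORE.get(type);
    if (prerequisite !== undefined && !seen.has(prerequisite)) {
      violations.push(type + " fired before " + prerequisite);
    }
    seen.add(type);
  }
  return violations;
}

// audioend arriving before speechend is allowed: that ordering is undefined.
const allowedSequence = [
  "audiostart", "soundstart", "speechstart",
  "audioend", "speechend", "soundend",
];
console.log(findOrderingViolations(allowedSequence).length); // 0

// speechend without an earlier speechstart breaks a MUST.
console.log(findOrderingViolations(["audiostart", "speechend"]));
```

Only the start/end pairs are checked, because the ordering between different pairs is explicitly left undefined.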

5.4. Speech Input Error

The speech input error object has two attributes: code and message.

code

The code is a numeric error code for what has gone wrong. The values are:

SPEECH_INPUT_ERR_OTHER (numeric code 0): This is the catch all error code.
SPEECH_INPUT_ERR_NO_SPEECH (numeric code 1): No speech was detected.
SPEECH_INPUT_ERR_ABORTED (numeric code 2): Speech input was aborted somehow, maybe by some UA-specific behavior such as UI that lets the user cancel speech input.
SPEECH_INPUT_ERR_AUDIO_CAPTURE (numeric code 3): Audio capture failed.
SPEECH_INPUT_ERR_NETWORK (numeric code 4): Some network communication that was required to complete the recognition failed.
SPEECH_INPUT_ERR_NOT_ALLOWED (numeric code 5): The user agent is not allowing any speech input to occur for reasons of security, privacy or user preference.
SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED (numeric code 6): The user agent is not allowing the speech service requested by the web application to be used, although it would allow some other speech service, either because the user agent doesn't support the selected one or because of reasons of security, privacy or user preference.
SPEECH_INPUT_ERR_BAD_GRAMMAR (numeric code 7): There was an error in the speech recognition grammar.
SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED (numeric code 8): The language was not supported.

message

The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
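
As an illustration of using code rather than message to drive application behavior, the sketch below mirrors the numeric codes listed above and picks user-facing text from the code alone. The names SpeechInputErrorCode, errorCodeName and userFacingText, the sample error object, and the wording of the strings are all assumptions made for this example; the proposal does not define them.

```javascript
// Numeric codes exactly as listed in this section.
const SpeechInputErrorCode = Object.freeze({
  SPEECH_INPUT_ERR_OTHER: 0,
  SPEECH_INPUT_ERR_NO_SPEECH: 1,
  SPEECH_INPUT_ERR_ABORTED: 2,
  SPEECH_INPUT_ERR_AUDIO_CAPTURE: 3,
  SPEECH_INPUT_ERR_NETWORK: 4,
  SPEECH_INPUT_ERR_NOT_ALLOWED: 5,
  SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED: 6,
  SPEECH_INPUT_ERR_BAD_GRAMMAR: 7,
  SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED: 8,
});

// Hypothetical helper: constant name for a numeric code, falling back
// to the catch-all name for values that are not listed.
function errorCodeName(code) {
  const match = Object.keys(SpeechInputErrorCode)
    .find((name) => SpeechInputErrorCode[name] === code);
  return match === undefined ? "SPEECH_INPUT_ERR_OTHER" : match;
}

// Hypothetical policy: choose UI text from the code only. The
// implementation-specific message is never shown to the end user.
function userFacingText(error) {
  switch (error.code) {
    case SpeechInputErrorCode.SPEECH_INPUT_ERR_NO_SPEECH:
      return "No speech was detected. Please try again.";
    case SpeechInputErrorCode.SPEECH_INPUT_ERR_NOT_ALLOWED:
    case SpeechInputErrorCode.SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED:
      return "Speech input is not permitted.";
    default:
      return "Speech input failed.";
  }
}

// Made-up error object with the two attributes described above.
const sampleError = { code: 1, message: "implementation-specific detail" };
console.log(errorCodeName(sampleError.code)); // SPEECH_INPUT_ERR_NO_SPEECH
console.log(userFacingText(sampleError));
```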

5.5. Speech Input Alternative

The SpeechInputAlternative represents a simple view of the response that gets used in an n-best list.

utterance: The utterance string represents the raw words that the user spoke.
confidence: The confidence represents a numeric estimate between 0 and 1 of how confident the recognition system is that the recognition is correct. A higher number means the system is more confident.
interpretation: The interpretation represents the semantic meaning of what the user said. This might be determined, for instance, through the SISR specification of semantics in a grammar.

5.6. Speech Input Result

The SpeechInputResult object represents a single one-shot recognition match, either as one small part of a continuous recognition or as the complete return result of a non-continuous recognition.

resultEMMAXML: The resultEMMAXML is a Document that contains the complete EMMA document the recognition service returned from the recognition. The Document supports all the normal XML DOM processing for inspecting the content.
resultEMMAText: The resultEMMAText is a text representation of the resultEMMAXML.
length: The length attribute represents how many n-best alternatives are represented in the item array. The user agent MUST NOT return more SpeechInputAlternatives than the value of the maxNBest attribute on the recognition request.
item: The item getter returns a SpeechInputAlternative from the index into an array of n-best values. The user agent MUST ensure that there are not more elements in the array than the value the maxNBest attribute was set to. The user agent MUST ensure that the length attribute is set to the number of elements in the array. The user agent MUST ensure that the n-best list is sorted in non-increasing confidence order (the confidence of each element must be less than or equal to the confidence of the preceding elements).
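
The constraints on the n-best list can be expressed as a small check. The sketch below is an illustration under assumptions: the alternatives are modeled as a plain array of objects with the utterance, confidence and interpretation fields of section 5.5, checkNBest is a hypothetical name, and the sample data is invented.

```javascript
// Hypothetical checker: returns a list of problems found in an n-best
// list, given the maxNBest value that was set on the request.
function checkNBest(alternatives, maxNBest) {
  const problems = [];
  if (alternatives.length > maxNBest) {
    problems.push("more than maxNBest alternatives");
  }
  for (let i = 1; i < alternatives.length; i++) {
    // Non-increasing order: equal confidences are allowed.
    if (alternatives[i].confidence > alternatives[i - 1].confidence) {
      problems.push("item " + i + " is more confident than item " + (i - 1));
    }
  }
  return problems;
}

// Invented sample data for illustration.
const sampleNBest = [
  { utterance: "call bob", confidence: 0.9, interpretation: null },
  { utterance: "call rob", confidence: 0.6, interpretation: null },
  { utterance: "call bobby", confidence: 0.6, interpretation: null },
];

console.log(checkNBest(sampleNBest, 3).length); // 0
console.log(checkNBest(sampleNBest, 2));
```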

5.7. Speech Input Result Event

The Speech Input Result event is the event that is raised each time there is an interim or final result. The event contains both the most recently recognized portion (in the result object) and a history of the complete recognition session so far (in the results object).

result: The result element is the one single SpeechInputResult that is new as of this request.
resultIndex: The resultIndex MUST be set to the place in the results array that this particular new result goes. The resultIndex MAY refer to a previously occupied array index from a previous SpeechInputResultEvent. When this is the case this new result overwrites the earlier result and is a more accurate result; however, when this is the case the previous value MUST NOT have been a final result. When continuous is false, the resultIndex MUST always be 0.
results: The array of all of the recognition results that have been returned as part of this session. This array MUST be identical to the array that was present when the last SpeechInputResultEvent was raised, with the exception of the new result value.
sessionId: The sessionId is a unique identifier of this SpeechInputRequest object that identifies the session. This id MAY be used to correlate logging and also as part of rerecognition.
final: The final boolean MUST be set to true if this is the final time the speech service will return this particular index value. If the value is false, then this represents an interim result that could still be changed.
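
The interaction between resultIndex, results and final can be sketched as a small state update. This is an illustration only: applyResultEvent and the session object (with its finalized array) are hypothetical bookkeeping invented for the example, and plain strings stand in for SpeechInputResult objects.

```javascript
// Hypothetical reducer: returns the session state after one result event.
// An interim result at an index may be overwritten; a final one may not.
function applyResultEvent(session, event) {
  if (session.finalized[event.resultIndex]) {
    throw new Error("result " + event.resultIndex + " was already final");
  }
  const results = session.results.slice();
  const finalized = session.finalized.slice();
  results[event.resultIndex] = event.result;
  finalized[event.resultIndex] = event.final;
  return { results, finalized };
}

// Invented event sequence from a continuous recognition.
let session = { results: [], finalized: [] };
session = applyResultEvent(session,
  { resultIndex: 0, result: "call bob", final: false });
session = applyResultEvent(session,
  { resultIndex: 0, result: "call Bob Smith", final: true });
session = applyResultEvent(session,
  { resultIndex: 1, result: "at home", final: false });

console.log(session.results.join(" | ")); // call Bob Smith | at home
```

Apart from the entry at resultIndex, the array is copied unchanged, matching the requirement that results be identical to the previous event's array with the exception of the new result value.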

5.8. Speech Grammar Interface

The SpeechGrammar object represents a container for a grammar. This structure has the following attributes:

src attribute: The required src attribute is the URI for the grammar. Note that some services may support builtin grammars that can be specified using a builtin URI scheme.
weight attribute: The optional weight attribute controls the weight that the speech recognition service should use with this grammar. By default, a grammar has a weight of 1. Weight values larger than 1 weight the grammar more strongly, while smaller values weight it less strongly.
modal attribute: The optional modal parameter determines if all other active grammars should be disabled. If the modal parameter is true then all other grammars should be disabled. The default value is false.
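
The defaults and the modal behavior described above might be modeled as follows. This is a sketch under assumptions: makeGrammar and activeGrammars are hypothetical helpers, the grammar URIs are invented, and the handling of more than one modal grammar (all of them stay active) is a guess, since the text does not address that case.

```javascript
// Hypothetical factory applying the documented defaults:
// src is required, weight defaults to 1, modal defaults to false.
function makeGrammar(src, options = {}) {
  if (typeof src !== "string" || src === "") {
    throw new TypeError("src is required");
  }
  return {
    src,
    weight: options.weight === undefined ? 1 : options.weight,
    modal: options.modal === undefined ? false : options.modal,
  };
}

// Hypothetical helper: if any grammar is modal, only modal grammars
// remain active; otherwise every grammar is active.
function activeGrammars(grammars) {
  const modal = grammars.filter((grammar) => grammar.modal);
  return modal.length > 0 ? modal : grammars;
}

// Invented URIs for illustration.
const grammarList = [
  makeGrammar("http://example.org/commands.grxml"),
  makeGrammar("http://example.org/contacts.grxml", { weight: 2 }),
  makeGrammar("http://example.org/confirm.grxml", { modal: true }),
];

console.log(grammarList[0].weight); // 1
console.log(activeGrammars(grammarList).map((grammar) => grammar.src));
```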

5.9. Speech Parameter Interface

The SpeechParameter object represents the container for arbitrary name/value parameters. This extensible mechanism allows developers to take advantage of extensions that recognition services may allow.

name attribute: The required name attribute is the name of the custom parameter.
value attribute: The required value attribute is the value of the custom parameter.

6. Extension Events

Some speech services may want to raise custom extension interim events either while doing speech recognition or while synthesizing audio. An example of this kind of event might be viseme events that encode lip and mouth positions while speech is being synthesized that can help with the creation of avatars and animation. These extension events MUST begin with "speech-x", so the hypothetical viseme event might be something like "speech-x-viseme".
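
A web application that wants to separate extension events from the standard ones could test the event type against the required prefix. The helper name isExtensionEventType below is hypothetical, and "speech-x-viseme" is the hypothetical event named above.

```javascript
// Prefix that every extension event type MUST begin with.
const EXTENSION_EVENT_PREFIX = "speech-x";

// Hypothetical helper: true when the event type names an extension event.
function isExtensionEventType(type) {
  return typeof type === "string" && type.startsWith(EXTENSION_EVENT_PREFIX);
}

console.log(isExtensionEventType("speech-x-viseme")); // true
console.log(isExtensionEventType("result"));          // false
```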

7. Design Decisions

Here are the design decisions from the XG that are relevant to the Web API proposal:

DD9. It must be possible to reference ASR grammars by URI.
DD10. It must be possible to select the ASR language using language tags.
DD11. It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
DD21. A standard set of common-task grammars must be supported. The details of what those are are TBD.
DD28. A low-latency endpoint detector must be available. It should be possible for a web app to enable and disable it, although the default setting (enabled/disabled) is TBD. The detector detects both start of speech and end of speech and fires an event in each case.
DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.
DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
DD34. It must be possible for the recognizer to return a final result before the user is done speaking.
DD36. Maxresults should be an ASR parameter representing the maximum number of results to return.
DD55. The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar.
DD72. In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD).
DD73. In Javascript, it will be possible to set parameters as dot properties and also via a getParameters method. The browser should also allow service-specific parameters to be set this way.
DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.

To do

Need other design decisions for the Face-to-face

8. Requirements and Use Cases

This section covers some of the requirements for this API and illustrates some use cases. Note that more extensive information can be found at HTML Speech XG Use Cases and Requirements as well as in the final XG note including requirements and use cases.

Voice Web Search. A user can speak a query and get a result.
Speech Command Interface. A Speech Command and Control Shell that allows multiple commands, many of which take arguments, such as "call [number]", "call [person]", "calculate [math expression]", "play [song]", or "search for [query]".
Speech UI present when no visible UI need be present. Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application.
A Speech Enabled Email Client. The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", "read message". When an email message is received, a summary notification may be raised that displays a small amount of content (for instance the person the email is from and a couple of words of the subject). It is desirable that a speech API be present and listening for the duration of this notification, allowing a user experience of being able to say "Reply to that" or "Read that email message". Note that this recognition UI could not be contingent on the user clicking a button, as that would defeat much of the benefit of this scenario (being able to reply and control the email without using the keyboard or mouse).

9. Acknowledgements

This proposal was developed by the HTML Speech XG.

This work builds on the existing work including:

Special thanks to the members of the XG: Andrei Popescu, Andy Mauro, Björn Bringert, Chaitanya Gharpure, Charles Chen, Dan Druta, Daniel Burnett, Dave Burke, David Bolter, Deborah Dahl, Fabio Paternò, Glen Shires, Ingmar Kliche, Jerry Carter, Jim Larson, Kazuyuki Ashimura, Marc Schröder, Markus Gylling, Masahiro Araki, Matt Womer, Michael Bodell, Michael Johnston, Milan Young, Olli Pettay, Paolo Baggia, Patrick Ehlen, Raj Tumuluri, Rania Elnaggar, Ravi Reddy, Robert Brown, Satish Kumar Sampath, Somnath Chandra, and T.V. Raman.

10. References

RFC2119: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF.
HTML5: HTML 5: A vocabulary and associated APIs for HTML and XHTML (work in progress), I. Hickson. W3C.

HTML Speech Web API

Non-standards track internal editor's draft of webapi proposal, 27 October 2011
