HTML Speech Web API

1. Introduction

Web applications should have the ability to use speech to interact with users. That speech could be for output through synthesized speech, or could be for input through the user speaking to fill form items, the user speaking to control page navigation, or many other collected use cases. A web application author should be able to add speech to a web application using methods familiar to web developers and should not require extensive specialized speech expertise. The web application should build on existing W3C web standards and support a wide variety of use cases. The web application author should have the flexibility to control the recognition service the web application uses, but should not have the obligation of needing to support a service. This proposal defines the basic representations for how to use grammars, parameters, and recognition results and how to process them. The interfaces and API defined in this proposal can be used with other interfaces and APIs exposed to the web platform.

Note that privacy and security concerns exist around allowing web applications to do speech recognition. User agents should make sure that end users are aware that speech recognition is occurring, and that the end users have given informed consent for this to occur. The exact mechanism of consent is user agent specific, but the privacy and security concerns have shaped many aspects of the proposal.

Example

In the example below the speech API is used to do basic speech web search.

Speech Web Search

To do

insert examples

2. Conformance

Everything in this proposal is informative since this is not a standards track document. However, RFC2119 normative language is used where appropriate, to help should this proposal later be moved into a standards track process.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in Key words for use in RFCs to Indicate Requirement Levels [RFC2119].

3. The Speech Service Interface

The Speech Service interface represents the API to query and bind to the underlying speech service.

Reference

This section is based on Dan Druta's proposal and the meeting discussion.

IDL


    interface SpeechService {
        // attributes
        readonly attribute unsigned short serviceState;
        attribute unsigned short serviceType;
        attribute DOMString serviceURI;
        attribute DOMString serviceName;

        // states
        const unsigned short INITIALIZING = 0;
        const unsigned short AVAILABLE = 1;
        const unsigned short UNAVAILABLE = 2;

        // types
        const unsigned short TTS = 0;
        const unsigned short ASR = 1;
        const unsigned short TTSASR = 2;

        // methods for attaching to service
        void bind();
        void unbind();
    };

Web applications get SpeechServices by using the SpeechServiceQuery interface.

3.1. Attributes

serviceState

Represents the state of the current service.

The serviceState can be in one of 3 states. The serviceState attribute, on getting, MUST return the current state, which MUST be one of the following values:

INITIALIZING (numeric value 0): The service is in the process of binding.
AVAILABLE (numeric value 1): The service is connected and available for operation.
UNAVAILABLE (numeric value 2): The service is not connected. This should be the default starting value in advance of a bind.

serviceType

Represents what sort of speech service capabilities are desired and supported.

The serviceType can be one of 3 types. The serviceType attribute, on getting, MUST return the type that the service supports, which MUST be one of the following values:

TTS (numeric value 0): The service need only support speech synthesis.
ASR (numeric value 1): The service need only support speech recognition.
TTSASR (numeric value 2): The service needs to support both speech synthesis and speech recognition.

serviceURI

Represents the URI of the speech engine.

serviceName

Represents the name of the service as it can be presented to the user.

3.2. Methods

The bind method: Connects to the speech service. When binding starts, the serviceState MUST be set to INITIALIZING. Once the connection succeeds, the serviceState is set to AVAILABLE.
Question:
Should there not be an event raised when things are successful? Presumably there are also events raised on errors.
The unbind method: Method that disconnects from the speech service. When successful the serviceState MUST be set to UNAVAILABLE.
Question:
Do we even need this method? Can't the disconnect be done by deleting the SpeechService object? Or is the idea that if we wanted to reuse the object (perhaps because it is a complex one that just implements this interface) and we wanted to change services, we'd first call unbind, then change the serviceURI, then call bind to connect to the different service?
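
The state transitions described for bind and unbind can be sketched with a small mock. This is an illustration only, not the proposed implementation: the MockSpeechService class, its connect callback, the history array, and the fallback to UNAVAILABLE on a failed connection are all assumptions made here, since the draft leaves open how success and failure are signalled.

```javascript
// Hypothetical mock of the SpeechService state machine; not part of the proposal.
const INITIALIZING = 0, AVAILABLE = 1, UNAVAILABLE = 2;

class MockSpeechService {
  // connect is an assumed callback returning true when the service is reachable.
  constructor(serviceURI, connect) {
    this.serviceURI = serviceURI;
    this.connect = connect;
    this.serviceState = UNAVAILABLE; // default value in advance of a bind
    this.history = [];               // every state entered, kept for inspection
  }

  setState(state) {
    this.serviceState = state;
    this.history.push(state);
  }

  bind() {
    this.setState(INITIALIZING);
    // Assumption: a failed connection returns the service to UNAVAILABLE.
    this.setState(this.connect(this.serviceURI) ? AVAILABLE : UNAVAILABLE);
  }

  unbind() {
    this.setState(UNAVAILABLE);
  }
}

// Reuse of one object: bind, then unbind before pointing at another service.
const service = new MockSpeechService("https://speech.example/asr", () => true);
service.bind();
service.unbind();
```

A real binding would presumably complete asynchronously, which is one reason the question about success and error events matters.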

Question:

How does this class play with the other reco element or reco JS object that we have discussed? Usually there we've said the service is just specified by URI. Does specifying that URI implicitly instantiate a SpeechService? Should that object implement the SpeechService API? If so, when does binding and unbinding happen? Can it be implicit?

4. Speech Service Query Interface

The Speech Service Query Interface provides the developer with a runtime capability to query the service, obtain specific information about the features it implements and allow the developer to implement a deterministic and satisfying user experience. This lets a web application find out if speech recognition is supported before displaying a microphone button to do recognition. It also allows the web application author to query about specific pieces of information (like is a certain language supported, or is a certain service supported). This is used in the same way to ensure the UI doesn't suggest a mode of input that is going to fail.

Reference

This section is based on Dan Druta's proposal and the meeting discussion.

IDL


    interface SpeechServiceQuery {
        // methods
        SpeechService query(optional Criteria filter,
                            optional QueryOptions options);
    };

    interface Criteria {
        // Question: Nothing is specified yet for how these filters or
        // criteria are expressed.
    };

    interface QueryOptions {
        // attributes
        attribute unsigned short timeout; // 0 means forever
        // Question: Not sure what other options we need; if it is just
        // timeout, we don't need a class for it.
    };

Window implements SpeechServiceQuery;

4.1. Attributes

timeout: The timeout attribute of QueryOptions defines how long the query should take before timing out. A value of 0 means no timeout.
Question:
What are the units? Milliseconds? Seconds?

4.2. Methods

The query method: This is the method that may be called to see if a speech service satisfies a set of filter criteria. If it does it returns the speech service.
Questions:
What happens if a service doesn't meet the criteria? Can you pass a URI of the service in the criteria? Somewhere here we need the bit about how there is a possible security issue if you ask about the supported languages, and how we wanted "maybe" to be a possible answer from the query. And again, how does this play with the reco element? Can you query the reco element, or do you query the window and then assume the reco element is the same (or set the reco element serviceURI to the one you get from the window)?
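
The intended use of query, deciding whether to show a microphone button, can be sketched as follows. Because Criteria is not yet specified, this sketch assumes a plain object with a serviceType field; mockQuery, supportsType, the list of installed services, and the choice to return null when nothing matches are likewise assumptions for illustration, not behavior defined by this proposal.

```javascript
// Hypothetical sketch: the shape of Criteria is an assumption, see the text above.
const TTS = 0, ASR = 1, TTSASR = 2; // values from the SpeechService IDL

// Assumed rule: a TTSASR service satisfies a request for either single capability.
function supportsType(offered, wanted) {
  return offered === wanted || offered === TTSASR;
}

// Stand-in for query(): returns the first matching service, or null when none
// match (the draft does not yet say what happens in that case).
function mockQuery(services, filter = {}) {
  const match = services.find(
    (s) => filter.serviceType === undefined || supportsType(s.serviceType, filter.serviceType)
  );
  return match || null;
}

// Only a synthesis service is installed, so recognition UI should stay hidden.
const installed = [{ serviceName: "Example TTS", serviceType: TTS }];
const recognizer = mockQuery(installed, { serviceType: ASR });
const showMicrophoneButton = recognizer !== null;
```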

5. Reco Element

The reco element is the way to do speech recognition using markup bindings. The reco element is legal wherever phrasing content is expected, and can contain any phrasing content, except with no descendant recoable elements unless it is the element's reco control, and no descendant reco elements.

Reference

This section is based on Michael Bodell's proposal and the meeting discussion.

IDL


  [NamedConstructor=Reco(),
  NamedConstructor=Reco(in DOMString for)]
    interface HTMLRecoElement : HTMLElement {
        // Attributes
        readonly attribute HTMLFormElement? form;
        attribute DOMString htmlFor;
        readonly attribute HTMLElement? control;
    };

HTMLRecoElement implements SpeechInputRequest;

The reco element represents a speech input in a user interface. The speech input can be associated with a specific form control, known as the reco element's reco control, either using the for attribute, or by putting the form control inside the reco element itself.

Except where otherwise specified by the following rules, a reco element has no reco control.

Some elements are categorized as recoable elements. These are elements that can be associated with a reco element.

The reco element's exact default presentation and behavior, in particular what its activation behavior might be and what implicit grammars might be defined, if any, is unspecified and user agent specific. The activation behavior of a reco element for events targeted at interactive content descendants of a reco element, and any descendants of those interactive content descendants, MUST be to do nothing. When a reco element with a reco control is activated and gets a reco result, the default action of the recognition event SHOULD be to set the value of the reco control to the top n-best interpretation of the recognition (in the case of single recognition) or an appended latest top n-best interpretation (in the case of dictation mode with multiple inputs).
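
The default action described above can be sketched as a small function. SpeechInputResult is not yet defined in this draft, so the sketch assumes an n-best list modelled as an array of interpretation strings, best first; the function name applyRecoResult and the single-space separator used in dictation mode are also assumptions.

```javascript
// Hypothetical helper; the n-best representation is an assumption (see above).
function applyRecoResult(control, nbest, dictation = false) {
  if (nbest.length === 0) return; // no interpretation, leave the control alone
  const top = nbest[0];
  if (dictation && control.value !== "") {
    control.value += " " + top;   // dictation mode: append the latest top result
  } else {
    control.value = top;          // single recognition: replace the value
  }
}

// A plain object stands in for the reco control here.
const searchBox = { value: "" };
applyRecoResult(searchBox, ["pizza near me", "pisa near me"]);
```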

Warning:

Not all implementors see value in linking the recognition behavior to markup, versus an all scripting API. Some user agents like the possibility of good defaults based on the associations. Some user agents like the idea of different consent bars based on the user clicking a markup button, rather than just relying on scripting. User agents are cautioned to remember click-jacking and SHOULD NOT automatically assume that when a reco element is activated it means the user meant to start recognition in all situations.

Question:

Does the above note sufficiently capture the concerns Olli or others had about the markup binding?

5.1. Attributes

form

The form attribute is used to explicitly associate the reco element with its form owner.

The form IDL attribute is part of the element's forms API.

htmlFor

The htmlFor IDL attribute MUST reflect the for content attribute.

The for attribute MAY be specified to indicate a form control with which a speech input is to be associated. If the attribute is specified, the attribute's value MUST be the ID of a recoable element in the same Document as the reco element. If the attribute is specified and there is an element in the Document whose ID is equal to the value of the for attribute, and the first such element is a recoable element, then that element is the reco element's reco control.

If the for attribute is not specified, but the reco element has a recoable element descendant, then the first such descendant in tree order is the reco element's reco control.
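
The rules above for finding a reco element's reco control can be sketched on a minimal tree model. This is an illustration under stated assumptions: nodes are plain objects with tag, id, htmlFor, and children fields rather than real DOM nodes, and the RECOABLE set is invented here because the draft does not enumerate which elements are recoable.

```javascript
// Assumption: the draft does not list recoable elements; this set is illustrative.
const RECOABLE = new Set(["input", "textarea", "select"]);

// Depth-first, pre-order traversal, which corresponds to tree order.
function* treeOrder(node) {
  yield node;
  for (const child of node.children || []) yield* treeOrder(child);
}

function recoControl(documentRoot, reco) {
  if (reco.htmlFor !== undefined) {
    for (const el of treeOrder(documentRoot)) {
      if (el.id === reco.htmlFor) {
        // Only the first element with that ID counts, recoable or not.
        return RECOABLE.has(el.tag) ? el : null;
      }
    }
    return null; // for is specified but matches nothing: no reco control
  }
  // No for attribute: the first recoable descendant in tree order.
  for (const el of treeOrder(reco)) {
    if (el !== reco && RECOABLE.has(el.tag)) return el;
  }
  return null;
}

const queryField = { tag: "input", id: "q" };
const recoWithFor = { tag: "reco", htmlFor: "q", children: [{ tag: "textarea", id: "inner" }] };
const recoNested = { tag: "reco", children: [{ tag: "span", children: [{ tag: "textarea", id: "notes" }] }] };
const page = { tag: "body", children: [queryField, recoWithFor, recoNested] };
```

With this data, recoControl(page, recoWithFor) resolves to the input with ID "q" even though a textarea is nested inside, because the for attribute takes precedence over descendants.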

control

The control attribute returns the form control that is associated with this element. The control IDL attribute MUST return the reco element's reco control, if any, or null if there isn't one.

control.recos: Returns a NodeList of all the reco elements that the form control is associated with.

Recoable elements have a NodeList object associated with them that represents the list of reco elements, in tree order, whose reco control is the element in question. The recos IDL attribute of recoable elements, on getting, MUST return that NodeList object.

5.2. Constructors

Two constructors are provided for creating HTMLRecoElement objects (in addition to the factory methods from DOM Core such as createElement()): Reco() and Reco(for). When invoked as constructors, these MUST return a new HTMLRecoElement object (a new reco element). If the for argument is present, the object created MUST have its for content attribute set to the provided value. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.

Questions:

Is the HTMLRecoElement implements SpeechInputRequest enough to hook this section to the events and parameters section that Bjorn and Debbie defined?

6. TTS Element

The TTS element is the way to do speech synthesis using markup bindings. The TTS element is legal where embedded content is expected. If the TTS element has a src attribute, then its content model is zero or more track elements, then transparent, but with no media element descendants. If the element does not have a src attribute, then its content model is one or more source elements, then zero or more track elements, then transparent, but with no media element descendants.

Reference

This section is based on Michael Bodell's proposal and the meeting discussion.

IDL


  [NamedConstructor=TTS(),
  NamedConstructor=TTS(in DOMString src)]
    interface HTMLTTSElement : HTMLMediaElement {};

HTMLTTSElement implements SpeechOutputRequest;

A TTS element represents a synthesized audio stream. A TTS element is a media element whose media data is ostensibly synthesized audio data.

When a TTS element is potentially playing, it must have its TTS data played synchronized with the current playback position, at the element's effective media volume.

When a TTS element is not potentially playing, TTS must not play for the element.

Content MAY be provided inside the TTS element. User agents SHOULD NOT show this content to the user; it is intended for older Web browsers which do not support TTS.

In particular, this content is not intended to address accessibility concerns. To make TTS content accessible to those with physical or cognitive disabilities, authors are expected to provide alternative media streams and/or to embed accessibility aids (such as transcriptions) into their media streams.

6.1. Attributes

The src, preload, autoplay, mediagroup, loop, muted, and controls attributes are the attributes common to all media elements.

6.2. Constructors

Two constructors are provided for creating HTMLTTSElement objects (in addition to the factory methods from DOM Core such as createElement()): TTS() and TTS(src). When invoked as constructors, these MUST return a new HTMLTTSElement object (a new tts element). The element MUST have its preload attribute set to the literal value "auto". If the src argument is present, the object created MUST have its src content attribute set to the provided value, and the user agent MUST invoke the object's resource selection algorithm before returning. The element's document MUST be the active document of the browsing context of the Window object on which the interface object of the invoked constructor is found.

Question:

Is the connection to Charles's details with implements SpeechOutputRequest sufficient? Will that object specify all the bits we've talked about at the F2F with respect to marks, and timing information, as well as bargein for a service that supports both TTS and reco?

7. The Speech Input Request Interface

The speech input request interface is the scripted web API for controlling a given recognition.

Reference

This section is based on Debbie Dahl's proposal, Bjorn Bringert's proposal, and Olli Pettay's proposal.

IDL


    interface SpeechInputRequest {
        // recognition property methods
        // grammar methods
        void resetGrammars();
        void addGrammar(in DOMString src,
                        optional float weight,
                        optional boolean modal);
        void addGrammarName(in DOMString name,
                        optional float weight,
                        optional boolean modal);
        void disableGrammar(in DOMString src);

        // misc parameter methods
        void setmaxnbest(in integer maxnbest);
        void setlanguage(in DOMString language);
        void setsaveforrereco(in boolean saveforrereco);
        void setendpointdetection(in boolean endpointdetection);
        void setfinalizebeforeend(in boolean finalizebeforeend);
        void setinterimresults(in boolean interimresults);
        void setinterimresultsfreq(in integer interimresultsfreq);
        void setconfidencethreshold(in float confidencethreshold);
        void setsensitivity(in float sensitivity);
        void setspeedversusaccuracy(in float speedvsaccuracy);
        void setcompletetimeout(in integer completetimeout);
        void setincompletetimeout(in integer incompletetimeout);
        void setmaxspeechtimeout(in integer maxspeechtimeout);

        // the generic set parameter
        void setparameter(in DOMString name, in DOMString value);

        // waveform methods
        void setsavewaveformURI(in DOMString savewaveformURI);
        void setinputwaveformURI(in DOMString inputwaveformURI);

        // Question: The properties proposal from Debbie had all of these as
        // methods (modulo some renaming I did); is this all we want? I feel
        // like usually the attribute representation of this data would be
        // reflected in the API, and someone could set the attributes directly
        // or use these helper functions. I.e., if there is a maxresults
        // attribute, it could be set directly or through a call to the
        // setmaxresults method. But right now we just have the methods.

        // attributes
        attribute MediaStream input;

        // event methods
        attribute Function onaudiostart;
        attribute Function onsoundstart;
        attribute Function onspeechstart;
        attribute Function onspeechend;
        attribute Function onsoundend;
        attribute Function onaudioend;
        attribute Function onresult;
        attribute Function onnomatch;
        attribute Function onerror;
    };
    SpeechInputRequest implements EventTarget;

    interface SpeechInputResultEvent : Event {
        readonly attribute SpeechInputResult result;
    };

    interface SpeechInputNomatchEvent : Event {
        readonly attribute SpeechInputResult result;
    };

    interface SpeechInputErrorEvent : Event {
        readonly attribute SpeechInputError error;
    };

    interface SpeechInputError {
        const unsigned short SPEECH_INPUT_ERR_OTHER = 0;
        const unsigned short SPEECH_INPUT_ERR_NO_SPEECH = 1;
        const unsigned short SPEECH_INPUT_ERR_ABORTED = 2;
        const unsigned short SPEECH_INPUT_ERR_AUDIO_CAPTURE = 3;
        const unsigned short SPEECH_INPUT_ERR_NETWORK = 4;
        const unsigned short SPEECH_INPUT_ERR_NOT_ALLOWED = 5;
        const unsigned short SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED = 6;
        const unsigned short SPEECH_INPUT_ERR_BAD_GRAMMAR = 7;
        const unsigned short SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED = 8;

        readonly attribute unsigned short code;
        readonly attribute DOMString message;
    };

    interface SpeechInputResult {
        // To do: Need to fill in this set of inputs, including how
        // interim results work.
    };

    // Question: Should there be some sort of endpointing event that describes
    // the timing information for the onsoundstart type events, giving some
    // sort of offset or other information? We say that the DOM 2 timestamp
    // can be used, but does that capture all the bits we talked about, namely
    // that these events should be in the user agent's clock but represent the
    // audio stream position, not the wall clock time?

7.1. Speech Input Request Recognition Property Methods

The resetGrammars method: This means remove all explicitly set grammars and just "use the default language model" of the implementation.
The addGrammar method: This method adds a grammar to the set of active grammars. The src parameter gives the URI of the grammar. If the weight parameter is present, it represents this grammar's weight relative to the other grammars. If the weight parameter is not present, the default value of 1.0 is used. If the modal parameter is set to true, then all other already active grammars are disabled. If the modal parameter is not present, the default value is false.
The addGrammarName method: This method adds a builtin grammar to the set of active grammars. The name parameter gives the name of the builtin grammar. If the weight parameter is present, it represents this grammar's weight relative to the other grammars. If the weight parameter is not present, the default value of 1.0 is used. If the modal parameter is set to true, then all other already active grammars are disabled. If the modal parameter is not present, the default value is false.
The disableGrammar method: This method disables a grammar with the URI matching the src parameter.
The setmaxnbest method: This method will set the maximum number of recognition results that should be returned to the value of the maxnbest parameter. The default value is 1.
The setlanguage method: This method will set the language of the recognition to the language parameter, using the ISO language codes.
Default?
The setsaveforrereco method: This method will save the utterance for later use in a rerecognition depending on the value of the saveforrereco boolean value (true means save). The default value is false.
The setendpointdetection method: This method will determine whether the user agent should do low latency endpoint detection, depending on the value of the endpointdetection parameter (true means do endpointing).
Default?
The setfinalizebeforeend method: This method sets whether final results can be returned before the user is done talking, based on the value of the finalizebeforeend parameter (true means yes).
Default?
The setinterimresults method: This method sets whether interim results should be sent or whether the recognizer should wait for final results only, based on the value of the interimresults parameter (true means send interim results).
Default?
The setinterimresultsfreq method: This method sets the frequency with which interim results are desired from the recognition service. The recognition service may not be able to meet exactly this frequency as it depends on the details of the grammars and utterances that are being used how often an interim result is likely to occur or change. The value of the interimresultsfreq is the number of milliseconds desired between successive interim results.
The setconfidencethreshold method: This method sets the confidence threshold to the value of the confidencethreshold parameter, which is a value between 0.0 (least confidence needed) and 1.0 (most confidence needed), with 0.5 as the default.
The setsensitivity method: This method sets the sensitivity to the value of the sensitivity parameter, which is a value between 0.0 (least sensitive) and 1.0 (most sensitive), with 0.5 as the default.
The setspeedversusaccuracy method: This method sets the trade-off desired between low latency and high accuracy to the value of the speedvsaccuracy parameter, which is a value between 0.0 (least accurate) and 1.0 (most accurate), with 0.5 as the default.
The setcompletetimeout method: This method sets the completetimeout to the value of the completetimeout parameter. This represents the amount of silence needed to match a grammar when a hypothesis is at a complete match of the grammar (that is the hypothesis matches a grammar, and no larger input can possibly match a grammar).
The setincompletetimeout method: This method sets the incompletetimeout to the value of the incompletetimeout parameter. This represents the amount of silence needed to match a grammar when a hypothesis is not at a complete match of the grammar (that is the hypothesis does not match a grammar, or it does match a grammar but so could a larger input).
The setmaxspeechtimeout method: This method sets the maxspeechtimeout to the value of the maxspeechtimeout parameter. This represents how much speech we should have before an end of speech or an error.
The setparameter method: This method allows arbitrary recognition service parameters to be set. The name of the parameter is given by the name parameter and the value by the value parameter.
The setsavewaveformURI method: The setsavewaveformURI method specifies where the web application would like the utterance to be stored, if it is to be stored. The value of the parameter savewaveformURI is this URI.
The setinputwaveformURI method: This method specifies that the input waveform is to be read from the URI given in the inputwaveformURI parameter.
When do we start recognizing in this case? There is no reco() function, as we assume the capture does that.
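
The interaction of the grammar methods, in particular the weight default and the effect of modal, can be sketched with a small model of the active-grammar set. Only the method names and defaults come from the text above; the GrammarSet class, its enabled flag, the active() helper, and the example grammar URIs are assumptions made for illustration.

```javascript
// Hypothetical model of the set of active grammars; not the proposed implementation.
class GrammarSet {
  constructor() {
    this.grammars = [];
  }

  // Back to "use the default language model": no explicit grammars remain.
  resetGrammars() {
    this.grammars = [];
  }

  addGrammar(src, weight = 1.0, modal = false) {
    if (modal) {
      // A modal grammar disables every grammar that was already active.
      for (const g of this.grammars) g.enabled = false;
    }
    this.grammars.push({ src, weight, enabled: true });
  }

  disableGrammar(src) {
    for (const g of this.grammars) {
      if (g.src === src) g.enabled = false;
    }
  }

  // Assumed helper: the URIs of the grammars currently enabled.
  active() {
    return this.grammars.filter((g) => g.enabled).map((g) => g.src);
  }
}

const grammars = new GrammarSet();
grammars.addGrammar("https://example.com/cities.grxml");
grammars.addGrammar("https://example.com/dates.grxml", 2.0);
grammars.addGrammar("https://example.com/confirm.grxml", 1.0, true);
```

After the third call only the confirm grammar is active, since it was added with modal set to true.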

7.2. Speech Input Request Attributes

input: The input attribute is the MediaStream that we are recognizing against. If input is not set, the Speech Input Request uses the default UA provided capture (which MAY be nothing), in which case the value of input will be null.

7.3. Speech Input Request Events

The DOM Level 2 Event Model is used for speech recognition events. The methods in the EventTarget interface should be used for registering event listeners. The SpeechInputRequest interface also contains convenience attributes for registering a single event handler for each event type.

For all these events, the timeStamp attribute defined in the DOM Level 2 Event interface must be set to the best possible estimate of when the real-world event which the event object represents occurred.

Unless specified below, the ordering of the different events is undefined. For example, some implementations may fire audioend before speechstart or speechend if the audio detector is client-side and the speech detector is server-side.

audiostart event: Fired when the user agent has started to capture audio.
soundstart event: Some sound, possibly speech, has been detected. This MUST be fired with low latency, e.g. by using a client-side energy detector.
speechstart event: The speech that will be used for speech recognition has started.
speechend event: The speech that will be used for speech recognition has ended. speechstart MUST always have been fired before speechend.
soundend event: Some sound is no longer detected. This MUST be fired with low latency, e.g. by using a client-side energy detector. soundstart MUST always have been fired before soundend.
audioend event: Fired when the user agent has finished capturing audio. audiostart MUST always have been fired before audioend.
result event: Fired when the speech recognizer returns a final result with at least one recognition hypothesis that meets or exceeds the confidence threshold. The result field in the event MUST contain the speech recognition result. All the following events MUST have been fired before result is fired: audiostart, soundstart, speechstart, speechend, soundend, audioend.
Is it really true all events need be received? What about if this is an interim result? How are they returned if not through this event?
nomatch event: Fired when the speech recognizer returns a final result with no recognition hypothesis that meets or exceeds the confidence threshold. The result field in the event MAY contain speech recognition results that are below the confidence threshold or MAY be null.
error event: Fired when a speech recognition error occurs. The error attribute MUST be set to a SpeechInputError object.
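
The ordering constraints in the list above can be expressed as a checker over a sequence of event names, together with the rule that separates result from nomatch. This is a sketch only: orderingViolations and finalEventType are hypothetical helper names, and modelling hypotheses as a bare array of confidence values is an assumption because SpeechInputResult is not yet defined.

```javascript
// Each end event requires that its start event was fired earlier.
const PAIRS = [
  ["audiostart", "audioend"],
  ["soundstart", "soundend"],
  ["speechstart", "speechend"],
];
// Events that must all have been fired before a result event.
const BEFORE_RESULT = ["audiostart", "soundstart", "speechstart", "speechend", "soundend", "audioend"];

// Returns a list of human-readable violations; empty when the sequence is valid.
function orderingViolations(events) {
  const seen = new Set();
  const problems = [];
  for (const name of events) {
    for (const [start, end] of PAIRS) {
      if (name === end && !seen.has(start)) problems.push(`${end} before ${start}`);
    }
    if (name === "result") {
      for (const required of BEFORE_RESULT) {
        if (!seen.has(required)) problems.push(`result before ${required}`);
      }
    }
    seen.add(name);
  }
  return problems;
}

// result if any hypothesis meets or exceeds the threshold, otherwise nomatch.
function finalEventType(confidences, threshold = 0.5) {
  return confidences.some((c) => c >= threshold) ? "result" : "nomatch";
}
```

Note that the checker accepts audioend arriving before speechend, which the text explicitly allows when the audio detector is client-side and the speech detector is server-side.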

7.4. Speech Input Error

The speech input error object has two attributes: code and message.

code

The code is a numeric error code for what has gone wrong. The values are:

SPEECH_INPUT_ERR_OTHER (numeric code 0): This is the catch all error code.
SPEECH_INPUT_ERR_NO_SPEECH (numeric code 1): No speech was detected.
SPEECH_INPUT_ERR_ABORTED (numeric code 2): Speech input was aborted somehow, maybe by some UA-specific behavior such as UI that lets the user cancel speech input.
SPEECH_INPUT_ERR_AUDIO_CAPTURE (numeric code 3): Audio capture failed.
SPEECH_INPUT_ERR_NETWORK (numeric code 4): Some network communication that was required to complete the recognition failed.
SPEECH_INPUT_ERR_NOT_ALLOWED (numeric code 5): The user agent is not allowing any speech input to occur for reasons of security, privacy or user preference.
SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED (numeric code 6): The user agent is not allowing the web application requested speech service, but would allow some speech service, to be used either because the user agent doesn't support the selected one or because of reasons of security, privacy or user preference.
SPEECH_INPUT_ERR_BAD_GRAMMAR (numeric code 7): There was an error in the speech recognition grammar.
SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED (numeric code 8): The language was not supported.

message

The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
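
A web application might handle these errors along the following lines. The constant names and numeric codes are taken from the SpeechInputError IDL; everything else is an assumption for illustration, including the handleSpeechError helper, the choice of which codes are worth retrying, and treating unknown codes as SPEECH_INPUT_ERR_OTHER.

```javascript
// Index in this array equals the numeric code from the IDL.
const ERROR_NAMES = [
  "SPEECH_INPUT_ERR_OTHER",
  "SPEECH_INPUT_ERR_NO_SPEECH",
  "SPEECH_INPUT_ERR_ABORTED",
  "SPEECH_INPUT_ERR_AUDIO_CAPTURE",
  "SPEECH_INPUT_ERR_NETWORK",
  "SPEECH_INPUT_ERR_NOT_ALLOWED",
  "SPEECH_INPUT_ERR_SERVICE_NOT_ALLOWED",
  "SPEECH_INPUT_ERR_BAD_GRAMMAR",
  "SPEECH_INPUT_ERR_LANGUAGE_NOT_SUPPORTED",
];

// Hypothetical application policy: retry on NO_SPEECH (1) and NETWORK (4) only.
const RETRYABLE = new Set([1, 4]);

function handleSpeechError(error, log = console.error) {
  // Assumption: codes outside the known range are treated as OTHER.
  const name = ERROR_NAMES[error.code] || ERROR_NAMES[0];
  // message goes to the log for debugging, never into the user interface.
  log(`${name}: ${error.message}`);
  return { name, retry: RETRYABLE.has(error.code) };
}

const outcome = handleSpeechError({ code: 4, message: "socket closed" }, () => {});
```

The application then chooses its own user-facing wording from the returned name rather than displaying the implementation-specific message.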

7.5. Speech Input Result

To do

Need to fill this in

8. The Speech Output Request Interface

The speech output request interface is where we hang all the TTS specific information (similar to the Speech Input Request Interface, but for synthesis).

Reference

This section is based on Charles Hemphill's proposal.

Questions:

Is there a more complete version of this? I wasn't able to easily incorporate Charles's proposal. I think the only thing I could see was the event handlers onspeechstart, onspeechend, onerror, and some discussion about a handler for the recognition result (does that have to do with bargein?).

Questions:

Are there any other major content sections missing? What about anything that the protocol requires of us? Do the above sections fit together with each other and with the protocol work?

9. Design Decisions

Here are the design decisions from the XG that are relevant to the Web API proposal:

DD9. It must be possible to reference ASR grammars by URI.
DD10. It must be possible to select the ASR language using language tags.
DD11. It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
DD21. A standard set of common-task grammars must be supported. The details of what those are is TBD.
DD28. A low-latency endpoint detector must be available. It should be possible for a web app to enable and disable it, although the default setting (enabled/disabled) is TBD. The detector detects both start of speech and end of speech and fires an event in each case.
DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.
DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
DD34. It must be possible for the recognizer to return a final result before the user is done speaking.
DD36. Maxresults should be an ASR parameter representing the maximum number of results to return.
DD55. The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar.
DD72. In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD).
DD73. In JavaScript, it will be possible to set parameters as dot properties and also via a getParameters method. The browser should also allow service-specific parameters to be set this way.
DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.

To do

insert other design decisions as we receive them and review them

10. Requirements and Use Cases

This section covers what some of the requirements were for this API, as well as illustrates some use cases. Note more extensive information can be found at HTML Speech XG Use Cases and Requirements as well as in the final XG note including requirements and use cases.

Voice Web Search. A user can speak a query and get a result.
Speech Command Interface. A Speech Command and Control Shell that allows multiple commands, many of which take arguments, such as "call [number]", "call [person]", "calculate [math expression]", "play [song]", or "search for [query]".
Speech UI present when no visible UI need be present. Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen. For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application.
A Speech Enabled Email Client. The application reads out subjects and contents of email and also listens for commands, for instance, "archive", "reply: ok, let's meet at 2 pm", "forward to bob", "read message". When an email message is received, a summary notification may be raised that displays a small amount of content (for instance the person the email is from and a couple of words of the subject). It is desirable that a speech API be present and listening for the duration of this notification, allowing a user experience of being able to say "Reply to that" or "Read that email message". Note that this recognition UI could not be contingent on the user clicking a button, as that would defeat much of the benefit of this scenario (being able to reply and control the email without using the keyboard or mouse).

11. Acknowledgements

This proposal was developed by the HTML Speech XG.

This work builds on the existing work including:

Special thanks to the members of the XG: Andrei Popescu, Andy Mauro, Björn Bringert, Chaitanya Gharpure, Charles Chen, Dan Druta, Daniel Burnett, Dave Burke, David Bolter, Deborah Dahl, Fabio Paternò, Ingmar Kliche, Jerry Carter, Jim Larson, Kazuyuki Ashimura, Marc Schröder, Markus Gylling, Masahiro Araki, Matt Womer, Michael Bodell, Michael Johnston, Milan Young, Olli Pettay, Paolo Baggia, Patrick Ehlen, Raj Tumuluri, Rania Elnaggar, Ravi Reddy, Robert Brown, Satish Kumar Sampath, Somnath Chandra, and T.V. Raman.

12. References

RFC2119: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF.
HTML5: HTML 5: A vocabulary and associated APIs for HTML and XHTML (work in progress), I. Hickson. W3C.


Non-standards track internal editor's draft of webapi proposal, 25 August 2011
