Speech Input API Specification

Editor's Draft 18 October 2010

Latest Editor's Draft: http://dev.w3.org/...
Editors: Satish Sampath, Google Inc.; Bjorn Bringert, Google Inc.

Abstract

This specification extends HTML and defines an API that provides speech recognition and input to web pages.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is an API proposal from Google Inc. to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives).

All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

This section is non-normative.

The HTML Speech Input API aims to provide web developers with features that are typically not available when using standard speech recognition software to replace keyboard input in a web browser. The API itself is agnostic of the underlying speech recognition implementation and can support both server-based and embedded recognizers. The API is designed to enable both one-off speech input and continuous speech input requests. Speech recognition results are provided to the web page as a list of hypotheses along with other relevant information for each hypothesis.

Automatic actions at the end of a spoken input

This API allows web applications to be notified at the completion of successful and failed speech input. For example, at the end of a successful speech input session, a web application could perform:

1. Safe and idempotent actions, e.g. search or web site navigation. The actions have no side effects, and the cost of correcting the action is very low, e.g. by editing the query on the search results page or going back in the browser history.
2. Undoable actions, e.g. archiving e-mail or editing a document. It is ok to take action immediately, since actions can be easily undone, e.g. by an undo option in the app.
3. Final, time critical, but not dangerous actions, e.g. game inputs. These actions cannot be undone, but ease or speed of input is more important than correctness.
4. Final and dangerous actions, e.g. composing and sending a message (e-mail, SMS, IM, etc). Actions cannot be undone, so the user must be able to verify the action before it is taken.

Types 1, 2 & 3 work best when the web page can take action immediately at the end of speech input. This requires some facility other than a purely transparent speech input method that simulates keyboard input, since web pages typically do not want to take actions directly on text 'change' events.

Speech recognition grammars

This API allows web applications to specify grammars that the speech recognizer should use when recognizing the user's speech. Specifying a grammar is useful for apps which have a limited vocabulary, e.g. commands, navigation within a page, maps etc. Such applications would not work as well with free-form text input. The existing HTML attribute pattern can be used to restrict the allowed inputs, but regular expressions are less expressive than context-free grammars. Also, SRGS grammars can include semantic annotations.

Application-specific handling of speech recognition hypotheses

This API gives the web application access to more information than just the most likely recognized utterance. Some applications can provide a better user experience when they have access to the list of recognition hypotheses produced by the speech recognizer. For example:

A web search application can accept speech input, and perform a search immediately when the input is recognized. If it has access to the additional recognition hypotheses (aka N-best list), it can display them on the search results page and let the user choose the correct query if the input was misrecognized. For example, Google search might display search results for "recognize speech", and show a link with the text "Did you say 'wreck a nice beach'?".
An application may accept input that can only be validated programmatically (i.e. that can't be expressed with a regular expression or a context free grammar). Examples may include credit card numbers, user names etc. In order to check that a certain combination of digits form a valid credit card number one can apply a simple algorithm to the input. Given a set of recognition hypotheses, the app can eliminate the invalid inputs by applying the algorithm and looking at the result.
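
The first case above can be sketched as follows. This is only an illustration, not part of the API: the helper name alternativeQueries and the plain-object result shape (an utterance property per hypothesis) are assumptions that mirror the SpeechInputResult interface described later in this document.

```javascript
// Hypothetical helper: given an N-best list ordered from most to least
// likely, return the utterances a results page could offer as
// "Did you say ...?" links. The top hypothesis is skipped because it is
// the query that was already submitted, and duplicates are dropped.
function alternativeQueries(results, maxAlternatives) {
  var alternatives = [];
  if (results.length === 0) {
    return alternatives;
  }
  var top = results[0].utterance;
  for (var i = 1; i < results.length && alternatives.length < maxAlternatives; i++) {
    var utterance = results[i].utterance;
    if (utterance !== top && alternatives.indexOf(utterance) === -1) {
      alternatives.push(utterance);
    }
  }
  return alternatives;
}
```

A search page could call alternativeQueries(results, 1) to obtain the single suggestion used in the "Did you say..." example below.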

Examples

The following code extracts illustrate how to use speech input in various cases:

Web search by voice

    <script type="text/javascript">
      function startSearch(event) {
        event.target.form.submit();
      }
    </script>

    <form action="http://www.google.com/search">
    <input type="search" name="q" speech required onspeechchange="startSearch(event)">
    </form>

Behavior

User clicks speech input element.
Speech input element shows that it's active, starts capturing audio as user speaks.
Endpointer detects end of speech once user stops speaking.
Speech recognizer returns results and the onspeechchange event is dispatched.
startSearch() is called by onspeechchange event handler on speech input element.
Search results are loaded.

Web search by voice, with "Did you say..."

This example uses the second best result. The search results page will display a link with the text "Did you say $second_best?".

    <script type="text/javascript">
      function startSearch(event) {
        if (event.target.results.length > 1) {
          var second = event.target.results[1].utterance;
          document.getElementById("second_best").value = second;
        }
        event.target.form.submit();
      }
    </script>

    <form action="http://www.google.com/search">
    <input type="search" name="q" speech required onspeechchange="startSearch(event)">
    <input type="hidden" name="second_best" id="second_best">
    </form>

Speech translator

    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">
      google.load("language", "1");  // Load the translator JS library.

      // These will most likely be set in a UI.
      var fromLang = "en";
      var toLang = "es";

      function handleSpeechInput(event) {
        var text = event.target.value;
        var callback = function(result) {
          if (result.translation)
            speak(result.translation, toLang);
        };
        google.language.translate(text, fromLang, toLang, callback);
      }

      function speak(output, lang) {
        // (Use <audio> or a TTS API to speak output in lang) 
      }
    </script>

    <form>
    <input speech onspeechchange="handleSpeechInput(event)">
    </form>

Behavior

User clicks speech input element and speaks in English.
System recognizes the text in English.
A web service translates the text from English to Spanish.
System synthesizes and speaks the translated text in Spanish.

Card number input with n-best list validation

This example picks the first valid input from the n-best list (results).

    <script type="text/javascript">
      function pickValidCardNumber(event) {
        var results = event.target.results;
        for (var i = 0; i < results.length; i++) {
          if (isValidCardNumber(results[i].interpretation)) {
            event.target.value = results[i].interpretation;
            break;
          }
        }
      }
      function isValidCardNumber(number) {
        // Checks and returns true if the number is valid.
      }
    </script>
    <form>
    <input type="text" name="cardNumber" speech required pattern="[0-9]{16}" onspeechchange="pickValidCardNumber(event)">
    </form>
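
The body of isValidCardNumber is left open above. One possibility, shown here only as a sketch, is the Luhn checksum that many payment card numbers satisfy. The choice of Luhn, the 16-digit length check (taken from the pattern attribute in the example) and the helper name pickFirstValid are assumptions made for illustration, not requirements of this specification.

```javascript
// Luhn checksum: starting from the rightmost digit, double every second
// digit, subtract 9 from any doubled value above 9, and require the total
// to be a multiple of 10.
function isValidCardNumber(number) {
  var digits = String(number).replace(/[\s-]/g, "");
  if (!/^[0-9]{16}$/.test(digits)) {
    return false;
  }
  var sum = 0;
  for (var i = 0; i < digits.length; i++) {
    var d = Number(digits.charAt(digits.length - 1 - i));
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) {
        d -= 9;
      }
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical helper mirroring pickValidCardNumber without touching the
// DOM: returns the first interpretation that passes the check, or null.
function pickFirstValid(results) {
  for (var i = 0; i < results.length; i++) {
    if (isValidCardNumber(results[i].interpretation)) {
      return results[i].interpretation;
    }
  }
  return null;
}
```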

Turn-by-turn navigation

HTML:

    <script type="text/javascript">
      var directions;

      function handleSpeechInput(event) {
        var results = event.target.results;
        if (results && results.length > 0) {
          var dest = results[0].interpretation.destination;
          directions = getDirectionsTo(dest);  // Get directions from database/server.
          speakNextInstruction();
        }
      }

      function speakNextInstruction() {
        var instruction = directions.pop();
        // (Use <audio> tag or a TTS API to speak the instruction)
        // Start a wait/notify mechanism to speak next instruction later.
      }
    </script>

    <form>
    <input speech grammar="grammar-nav-en.grxml" onspeechchange="handleSpeechInput(event)">
    </form>

English SRGS XML Grammar (grammar-nav-en.grxml):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                      "http://www.w3.org/TR/speech-grammar/grammar.dtd">
    <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
             xsi:schemaLocation="http://www.w3.org/2001/06/grammar 
                                 http://www.w3.org/TR/speech-grammar/grammar.xsd"
             version="1.0" mode="voice" root="nav_cmd"
             tag-format="semantics/1.0">

    <rule id="nav_cmd" scope="public">
      <example> navigate to 76 Buckingham Palace Road, London </example>
      <example> go to Phantom of the Opera </example>
      <item>
        <ruleref uri="#nav_action" />
        <ruleref uri="builtin:search" />
        <tag>out.action="navigate_to"; out.destination=rules.latest();</tag>
      </item>
    </rule>

    <rule id="nav_action">
      <one-of>
        <item>navigate to</item>
        <item>go to</item>
      </one-of>
    </rule>

    </grammar>

Speech shell

This uses an SRGS grammar to declare the commands that are supported, and uses SISR semantics so that the JavaScript code does not have to care about the language-specific representation.
Other similar examples: Speech-controlled E-mail client.

HTML:

    <script type="text/javascript">
      function doCommand(event) {
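        // The command object comes from the SISR interpretation of the top
        // hypothesis; value only holds the recognized text.
        var command = event.target.results[0].interpretation;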
        if (command.action == "call_contact") {
          var number = getContactNumber(command.contact);
          callNumber(number);
        } else if (command.action == "call_number") {
          callNumber(command.number);
        } else if (command.action == "calculate") {
          say(evaluate(command.expression));
        } else if (command.action == "search") {
          search(command.query);
        }
      }
      function callNumber(number) {
        window.location = "tel:" + number;
      }
      function search(query) {
        // Start web search for query.
      }
      function getContactNumber(contact) {
        // Get the phone number of the contact.
      }
      function say(text) {
        // Speak text.
      }
    </script>

    <form>
    <input speech grammar="commands.grxml" onspeechchange="doCommand(event)">
    </form>

English SRGS XML Grammar (commands.grxml):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                      "http://www.w3.org/TR/speech-grammar/grammar.dtd">
    <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
             xsi:schemaLocation="http://www.w3.org/2001/06/grammar 
                                 http://www.w3.org/TR/speech-grammar/grammar.xsd"
             version="1.0" mode="voice" root="command"
             tag-format="semantics/1.0">

    <rule id="command" scope="public">
      <example> call Bob </example>
      <example> calculate 4 plus 3 </example>
      <example> search for pictures of the Golden Gate bridge </example>

      <one-of>
        <item>
          call <ruleref uri="contacts.grxml" />
          <tag>out.action="call_contact"; out.contact=rules.latest()</tag>
        </item>
        <item>
          call <ruleref uri="phonenumber.grxml" />
          <tag>out.action="call_number"; out.number=rules.latest()</tag>
        </item>
        <item>
          calculate <ruleref uri="#expression" />
          <tag>out.action="calculate"; out.expression=rules.latest()</tag>
        </item>
        <item>
          search for <ruleref uri="http://grammar.example.com/search-ngram-model.xml" />
          <tag>out.action="search"; out.query=rules.latest()</tag>
        </item>
      </one-of>
    </rule>
    </grammar>

3 Scope

This section is non-normative.

This specification is limited to extending existing HTML elements with attributes and methods for allowing speech input. Speech recognition grammars are specified using SRGS and semantic interpretation of hypotheses is specified using SISR.

The scope of this specification does not include providing a new markup language of any kind.

The scope of this specification does not include defining or interpreting dialog management instructions.

The scope of this specification does not include interfacing with telephony systems of any kind.

4 Security and privacy considerations

User agents must only start speech input sessions with explicit, informed user consent. User consent can include, for example:
- User click on a visible speech input element which has an obvious graphical representation showing that it will start speech input.
- Accepting a permission prompt shown as the result of a call to startSpeechInput.
- Consent previously granted to always allow speech input for this web page.

User agents must give the user an obvious indication when audio is being recorded.

In a graphical user agent, this could be a mandatory notification displayed by the UA as part of its chrome and not accessible by the web page. This could for example be a pulsating/blinking record icon as part of the browser chrome/address bar, an indication in the status bar, an audible notification, or anything else relevant and accessible to the user. This UI element must also allow the user to stop recording.
In a speech-only user agent, the indication may for example take the form of the system speaking the label of the speech input element, followed by a short beep.

The user agent may also give the user a longer explanation the first time speech input is used, to let the user know what it is and how they can adjust their privacy settings to disable speech recording if required.
To minimize the chance of users unwittingly allowing web pages to record speech without their knowledge, implementations must abort an active speech input session if the web page loses input focus to another window or to another tab within the same user agent.

Implementation considerations

This section is non-normative.

Spoken password inputs can be problematic from a security perspective, but it is up to the user to decide if they want to speak their password.
Speech input could potentially be used to eavesdrop on users. Malicious webpages could use tricks such as hiding the input element or otherwise making the user believe that it has stopped recording speech while continuing to do so. They could also potentially style the input element to appear as something else and trick the user into clicking it. An example of styling the file input element can be seen at http://www.quirksmode.org/dom/inputfile.html. The above recommendations are intended to reduce the risk of such attacks.

5 API Description

5.1 Extending HTML elements

The API adds a set of new attributes and methods to HTMLInputElement and HTMLTextAreaElement.

  interface HTMLInputElement : HTMLElement {
    ...

    // speech input attributes
    attribute boolean speech;
    attribute DOMString grammar;
    attribute short maxresults;
    attribute long nospeechtimeout;

    // speech input event handler IDL attributes
    attribute Function oncapturestart();
    attribute Function onspeechstart();
    attribute Function onspeechchange(in SpeechInputEvent event);
    attribute Function onspeechend();
    attribute Function onspeecherror(in SpeechInputError error);

    // speech input methods
    void startSpeechInput();
    void stopSpeechInput();
    void cancelSpeechInput();
  };

The speech attribute indicates to the user agent that it should accept speech input for that element. This attribute is applicable for HTMLInputElement in the following states:

Text
Search
URL
Telephone
E-mail
Password

The grammar attribute gives the address of an external application-specific grammar, e.g. "http://example.com/grammars/pizza-order.grxml". The attribute, if present, must contain a valid non-empty URL potentially surrounded by spaces. The implementation should use the grammar to guide the speech recognizer. Implementations must support SRGS grammars and SISR annotations.

The maxresults attribute specifies the maximum number of items that the implementation should place in the results sequence. If not set, the maximum number of elements in the results sequence is implementation dependent.

The nospeechtimeout attribute specifies the time in milliseconds to wait for start of speech, after which the audio capture times out. If not set, the timeout is implementation dependent.

The capturestart event is dispatched when audio capture starts for the active speech input session.

The speechstart event is dispatched when the active speech input session detects that the user has started speaking.

The speechchange event is dispatched when a set of complete and valid utterances has been recognized. This event bubbles. A complete utterance ends when the implementation detects end-of-speech or when the stopSpeechInput() method is invoked.

Some implementations may dispatch the change event for elements when their value changes. When the new value was obtained as the result of a speech input session, such implementations must dispatch the speechchange event prior to the change event.

The speechend event is dispatched when the active speech input session detects that the user has stopped speaking.

The speecherror event is dispatched when the active speech input session resulted in an error.

When the startSpeechInput() method is invoked, a new speech input session is started. If there was already an active speech input session, that session is cancelled, as if cancelSpeechInput had been called. If the user has not already given consent to this web page to accept speech input, implementations must not start the speech input session until the user has given consent.

When the stopSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must gracefully stop the session as if end-of-speech was detected. The user agent must perform speech recognition on audio that has already been recorded and the relevant events must be fired if necessary. If there was no active speech input session, if this element did not initiate the active speech input session or if end-of-speech was already detected, this method must return without doing anything.

When the cancelSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must abort the session, discard any pending/buffered audio data and fire no events for the pending data. If there was no active speech input session or if this element did not initiate it, this method must return without doing anything.
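
The interaction between the three methods can be modelled without a browser. The sketch below is a toy model, not an implementation: the name SpeechSessionModel, the string element identifiers, the hear() method standing in for recorded audio, and the event log are all assumptions made for illustration. User consent and real audio capture are left out, and the relative order of speechend and speechchange is assumed from the "Behavior" lists in the examples.

```javascript
function SpeechSessionModel() {
  this.active = null; // { owner, audio } while a session is running, else null
  this.log = [];      // names of the events that would be dispatched, in order
}

// startSpeechInput(): an already active session is cancelled first.
SpeechSessionModel.prototype.start = function (owner) {
  if (this.active) {
    this.cancel(this.active.owner);
  }
  this.active = { owner: owner, audio: [] };
  this.log.push(owner + ":capturestart");
};

// Stand-in for the user speaking while a session is active.
SpeechSessionModel.prototype.hear = function (word) {
  if (!this.active) {
    return;
  }
  if (this.active.audio.length === 0) {
    this.log.push(this.active.owner + ":speechstart");
  }
  this.active.audio.push(word);
};

// stopSpeechInput(): behave as if end-of-speech was detected, then recognize
// what was recorded. Does nothing unless `owner` started the active session.
SpeechSessionModel.prototype.stop = function (owner) {
  var session = this.active;
  if (!session || session.owner !== owner) {
    return;
  }
  this.active = null;
  this.log.push(owner + ":speechend");
  if (session.audio.length > 0) {
    this.log.push(owner + ":speechchange(" + session.audio.join(" ") + ")");
  }
};

// cancelSpeechInput(): discard buffered audio and fire no events for it.
SpeechSessionModel.prototype.cancel = function (owner) {
  if (this.active && this.active.owner === owner) {
    this.active = null;
  }
};
```

In this model, starting a session for a second element silently drops whatever the first element had recorded, and a later stop() from the first element is ignored because it no longer owns the active session.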

5.2 SpeechInputEvent interface

  [NoInterfaceObject]
  interface SpeechInputEvent {
    readonly attribute SpeechInputResultList results;
    void feedback(DOMString correctUtterance);
  };

The results attribute returns a SpeechInputResultList object containing all the recognition results.

The feedback(correctUtterance) method is used to give feedback on the speech recognition results. Implementations may use this to improve the accuracy of future speech recognition requests.

Setting correctUtterance to null indicates that none of the items in results were a satisfactory match to the original speech.
Setting correctUtterance to the utterance of an item in results indicates that the item is the most satisfactory match to the original speech.
Setting correctUtterance to a value not matching any of the utterance values in results indicates that the given value is the expected match to the original speech.
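
The three cases above can be told apart mechanically. The following sketch shows how an implementation might classify the argument before using it; the function name classifyFeedback and the shape of the returned object are hypothetical and not defined by this specification.

```javascript
// `results` is an array of objects that each carry an `utterance` string.
function classifyFeedback(results, correctUtterance) {
  if (correctUtterance === null) {
    // None of the hypotheses matched what the user said.
    return { kind: "none-satisfactory" };
  }
  for (var i = 0; i < results.length; i++) {
    if (results[i].utterance === correctUtterance) {
      // The page confirms one of the existing hypotheses.
      return { kind: "confirmed", index: i };
    }
  }
  // The page supplies the text that should have been recognized.
  return { kind: "corrected", utterance: correctUtterance };
}
```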

5.3 SpeechInputResultList interface

  [NoInterfaceObject]
  interface SpeechInputResultList {
    readonly attribute unsigned long length;
    SpeechInputResult item(in unsigned long index);
  };

The length attribute must return the number of results represented by the list.

The item(index) method must return the indexth result in the list. If there is no indexth result in the list, then the method must return null.
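
As a rough illustration of these two requirements, a list with the same observable behaviour can be built on top of an array. The factory name makeResultList is an assumption, and the index conversion only approximates the unsigned long conversion performed by the ECMAScript bindings.

```javascript
// Hypothetical stand-in for a SpeechInputResultList backed by an array.
function makeResultList(results) {
  var items = results.slice(); // copy, so later changes to the input do not leak in
  return {
    get length() {
      return items.length;
    },
    item: function (index) {
      var i = index >>> 0; // coerce to an unsigned 32-bit integer
      return i < items.length ? items[i] : null;
    }
  };
}
```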

5.4 SpeechInputResult interface

In ECMAScript, SpeechInputResult objects are represented as regular native objects with optional properties named utterance, confidence and interpretation.

  [NoInterfaceObject]
  interface SpeechInputResult {
    readonly attribute DOMString utterance;
    readonly attribute float confidence;
    readonly attribute object interpretation;
  };

The utterance attribute must, on getting, return the text of recognized speech.

The confidence attribute must, on getting, return a value in the inclusive range [0.0, 1.0] indicating the quality of the match. The higher the value, the more confident the recognizer is that this matches what the user spoke.

The interpretation attribute must, on getting, return the result of semantic interpretation of the recognized speech, using semantic annotations in the grammar. If no grammar was specified or if the grammar contained no semantic annotations for the utterance, this value should be the same as utterance.
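
The fallback rule for interpretation and the range restriction on confidence could be enforced when a result object is created. The factory below is a sketch under that assumption; makeResult is a hypothetical name, and throwing a RangeError is one possible way to reject an out-of-range confidence, not behaviour required by this specification.

```javascript
// Hypothetical factory for a SpeechInputResult-like object.
function makeResult(utterance, confidence, interpretation) {
  if (typeof confidence !== "number" || isNaN(confidence) ||
      confidence < 0 || confidence > 1) {
    throw new RangeError("confidence must be within [0.0, 1.0]");
  }
  return {
    utterance: utterance,
    confidence: confidence,
    // Fall back to the utterance when no semantic interpretation exists.
    interpretation: interpretation === undefined ? utterance : interpretation
  };
}
```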

5.5 SpeechInputError interface

  [NoInterfaceObject]
  interface SpeechInputError {
    const unsigned short ABORTED = 1;
    const unsigned short AUDIO = 2;
    const unsigned short NETWORK = 3;
    const unsigned short NO_SPEECH = 4;
    const unsigned short NO_MATCH = 5;
    const unsigned short BAD_GRAMMAR = 6;
    const unsigned short PERMISSION_DENIED = 7;
    const unsigned short UNSUPPORTED_LANGUAGE = 8;
    readonly attribute unsigned short code;
    readonly attribute DOMString message;
  };

The code attribute must return the appropriate code from the following list:

ABORTED (numeric value 1): The user or a script aborted speech input.
AUDIO (numeric value 2): There was an error with recording audio.
NETWORK (numeric value 3): There was a network error, for implementations that use server-side recognition.
NO_SPEECH (numeric value 4): No speech heard before nospeechtimeout.
NO_MATCH (numeric value 5): Speech was heard, but could not be interpreted in the specified language and language model.
BAD_GRAMMAR (numeric value 6): There was an error in the speech recognition grammar.
PERMISSION_DENIED (numeric value 7): The user did not consent to starting speech input.
UNSUPPORTED_LANGUAGE (numeric value 8): The requested speech input language is not supported.

The message attribute must return an error message describing the details of the error encountered. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
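
Since message is not meant for end users, a page would typically branch on code instead. The mapping below is a sketch of that idea; the function name describeSpeechError, the wording of the texts and the retry flag are illustrative assumptions, while the numeric values are copied from the interface above.

```javascript
var SpeechInputErrorCode = {
  ABORTED: 1, AUDIO: 2, NETWORK: 3, NO_SPEECH: 4,
  NO_MATCH: 5, BAD_GRAMMAR: 6, PERMISSION_DENIED: 7, UNSUPPORTED_LANGUAGE: 8
};

// Returns the text a page might show and whether offering a retry makes sense.
function describeSpeechError(error) {
  switch (error.code) {
    case SpeechInputErrorCode.ABORTED:
      return { text: "", retry: false }; // deliberate abort, nothing to report
    case SpeechInputErrorCode.NO_SPEECH:
    case SpeechInputErrorCode.NO_MATCH:
      return { text: "Sorry, that was not understood. Please try again.", retry: true };
    case SpeechInputErrorCode.AUDIO:
      return { text: "The microphone could not be used.", retry: false };
    case SpeechInputErrorCode.NETWORK:
      return { text: "Speech recognition is unavailable right now.", retry: true };
    case SpeechInputErrorCode.PERMISSION_DENIED:
      return { text: "Speech input was not allowed for this page.", retry: false };
    default:
      // BAD_GRAMMAR, UNSUPPORTED_LANGUAGE and unknown codes are treated as
      // problems on the page's side that a retry will not fix.
      return { text: "Speech input is not available.", retry: false };
  }
}
```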

5.6 Notes about existing attributes

The lang attribute, if present, sets the speech input language. If this attribute is not set the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document.
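
The fallback chain can be written as a short walk up the tree. In this sketch nodes are plain objects with lang and parent properties rather than DOM elements, and resolveSpeechLanguage and documentLanguage are hypothetical names. An empty lang value is treated as "not set", which is a simplification.

```javascript
// Returns the first non-empty lang found on the element or its ancestors,
// or the document language when none of them sets one.
function resolveSpeechLanguage(element, documentLanguage) {
  for (var node = element; node; node = node.parent) {
    if (node.lang) {
      return node.lang;
    }
  }
  return documentLanguage;
}
```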

The pattern attribute, if present, should be used to guide the speech recognizer. If grammar was also specified, grammar takes precedence over pattern.

For HTMLInputElement the value attribute must be the utterance of the most likely recognition result after a successful recognition. This is equivalent to results[0].utterance. In the case of unsuccessful recognition value must remain unaffected.

5.7 Differences between single line and multi-line elements

Successful speech input to HTMLInputElement must set the value attribute to the utterance of the most likely speech recognition hypothesis.
Successful speech input to HTMLTextAreaElement must insert the utterance of the most likely speech recognition hypothesis at the current caret position.
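
The difference between the two element types comes down to replacing versus splicing a string. The functions below sketch both on plain values; their names are hypothetical, and placing the caret directly after the inserted text is an assumption, since this section does not say where the caret ends up.

```javascript
// Single-line input: the utterance replaces the current value.
function applyToInput(currentValue, utterance) {
  return utterance;
}

// Multi-line text area: the utterance is spliced in at the caret position.
function applyToTextArea(currentValue, caret, utterance) {
  var position = Math.max(0, Math.min(caret, currentValue.length));
  return {
    value: currentValue.slice(0, position) + utterance + currentValue.slice(position),
    caret: position + utterance.length
  };
}
```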

5.8 Backwards Compatibility

A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "SpeechInput" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.

Implementations that don't support speech input will ignore the additional attributes and events defined in this module and the HTML elements with these attributes will continue to work with other forms of input.
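
A page could wrap the feature test described above in a small function. In this sketch the document object is passed in as a parameter so that the function can also be exercised outside a browser; supportsSpeechInput is a hypothetical name.

```javascript
// Returns true only when the DOMImplementation reports the feature string
// and version given in this section.
function supportsSpeechInput(doc) {
  var impl = doc && doc.implementation;
  return !!(impl &&
            typeof impl.hasFeature === "function" &&
            impl.hasFeature("SpeechInput", "1.0"));
}
```

In a page this would be called as supportsSpeechInput(document). Because some user agents answer true for every feature string, a page may additionally want to check that the element actually exposes startSpeechInput before relying on it.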

6 Implementation

6.1 Control Flow

This section is non-normative.

The following diagram indicates the transitions and events relevant to an element with speech input in a typical scenario.

7 Use-Cases and Requirements

This section outlines the use cases and requirements that are covered by this specification.

7.1 Use-Cases

2.1 Speech Recognition

Yes
No
- U4. Continuous Recognition of Open Dialog
- U7. Rerecognition

2.2 Speech Synthesis

No
- U9. Temporal Structure of Synthesis to Provide Visual Feedback
- U10. Hello World Use Case

2.3 Integrated Speech Recognition and Synthesis

Yes, only for the speech input part.

7.2 Requirements

3.1 Web Authoring Feature Requirements
3.2 Web Authoring Convenience Requirements
3.3 Security and Privacy Requirements
- 3.3.1 Security and Privacy Speech System Requirements
  
  Yes
  - FPR16. User consent should be informed consent.
  - FPR20. The spec should not unnecessarily restrict the UA's choice in privacy policy.
  No
  - FPR55. Web application must be able to encrypt communications to remote speech service.
- 3.3.2 Security and Privacy Recognition Requirements
  
  Yes
  No
  - FPR37. Web application should be given captured audio access only after explicit consent from the user.

Acknowledgments

Andrei Popescu, Dave Burke, Jeremy Orlow

References

[RFC3066]: Tags for the Identification of Languages, Harald Tveit Alvestrand. Internet Engineering Task Force, January 2001. See http://www.ietf.org/rfc/rfc3066.txt
[WEBIDL]: Web IDL, Cameron McCormack, Editor. World Wide Web Consortium, 19 December 2008. See http://dev.w3.org/2006/webapi/WebIDL/
