Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This specification extends HTML and defines an API that provides speech recognition and input to web pages.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is an API proposal from Google Inc. to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives).
All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]
This section is non-normative.
The HTML Speech Input API aims to provide web developers with features that are typically not available when using standard speech recognition software to replace keyboard input in a web browser. The API itself is agnostic of the underlying speech recognition implementation and can support both server-based and embedded recognizers. The API is designed to enable both one-off speech input and continuous speech input requests. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.
The pattern attribute can be used to restrict the allowed inputs, but regular expressions are less expressive than context-free grammars. Also, SRGS grammars can include semantic annotations.
The following code extracts illustrate how to use speech input in various cases:
<script type="text/javascript">
function startSearch(event) {
  event.target.form.submit();
}
</script>
<form action="http://www.google.com/search">
  <input type="search" name="q" speech required onspeechchange="startSearch">
</form>

Behavior: when recognition completes, the onspeechchange event is dispatched and startSearch() is called by the onspeechchange event handler on the speech input element.

<script type="text/javascript">
function startSearch(event) {
  if (event.target.results.length > 1) {
    var second = event.target.results[1].utterance;
    document.getElementById("second_best").value = second;
  }
  event.target.form.submit();
}
</script>
<form action="http://www.google.com/search">
  <input type="search" name="q" speech required onspeechchange="startSearch">
  <input type="hidden" name="second_best" id="second_best">
</form>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
google.load("language", "1");  // Load the translator JS library.

// These will most likely be set in a UI.
var fromLang = "en";
var toLang = "es";

function handleSpeechInput(event) {
  var text = event.target.value;
  var callback = function(result) {
    if (result.translation)
      speak(result.translation, toLang);
  };
  google.language.translate(text, fromLang, toLang, callback);
}

function speak(output, lang) {
  // (Use <audio> or a TTS API to speak output in lang)
}
</script>
<form>
  <input speech onspeechchange="handleSpeechInput">
</form>

Behavior: the recognized text is translated and spoken aloud in the target language (alternative hypotheses remain available in results).
<script type="text/javascript">
function pickValidCardNumber(event) {
  var results = event.target.results;
  for (var i = 0; i < results.length; i++) {
    if (isValidCardNumber(results[i].interpretation)) {
      event.target.value = results[i].interpretation;
      break;
    }
  }
}

function isValidCardNumber(number) {
  // Checks and returns true if the number is valid.
}
</script>
<form>
  <input type="number" name="cardNumber" speech required pattern="[0-9]{16}" onspeechchange="pickValidCardNumber">
</form>
<script type="text/javascript">
var directions;

function handleSpeechInput(event) {
  var results = event.target.results;
  if (results) {
    var dest = results[0].interpretation.destination;
    directions = getDirectionsTo(dest);  // Get directions from database/server.
    speakNextInstruction();
  }
}

function speakNextInstruction() {
  var instruction = directions.pop();
  // (Use <audio> tag or a TTS API to speak the instruction)
  // Start a wait/notify mechanism to speak next instruction later.
}
</script>
<form>
  <input speech grammar="grammar-nav-en.grxml" onspeechchange="handleSpeechInput">
</form>

English SRGS XML Grammar (grammar-nav-en.grxml):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         version="1.0" mode="voice" root="nav_cmd" tag-format="semantics/1.0">
  <rule id="nav_cmd" scope="public">
    <example> navigate to 76 Buckingham Palace Road, London </example>
    <example> go to Phantom of the Opera </example>
    <item>
      <ruleref uri="#nav_action"/>
      <ruleref uri="builtin:search"/>
      <tag>out.action="navigate_to"; out.destination=rules.latest();</tag>
    </item>
  </rule>
  <rule id="nav_action">
    <one-of>
      <item>navigate to</item>
      <item>go to</item>
    </one-of>
  </rule>
</grammar>
<script type="text/javascript">
function doCommand(event) {
  var command = event.target.results[0].interpretation;
  if (command.action == "call_contact") {
    var number = getContactNumber(command.contact);
    callNumber(number);
  } else if (command.action == "call_number") {
    callNumber(command.number);
  } else if (command.action == "calculate") {
    say(evaluate(command.expression));
  } else if (command.action == "search") {
    search(command.query);
  }
}

function callNumber(number) {
  window.location = "tel:" + number;
}

function search(query) {
  // Start web search for query.
}

function getContactNumber(contact) {
  // Get the phone number of the contact.
}

function say(text) {
  // Speak text.
}
</script>
<form>
  <input speech grammar="commands.grxml" onspeechchange="doCommand">
</form>

English SRGS XML Grammar (commands.grxml):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         version="1.0" mode="voice" root="command" tag-format="semantics/1.0">
  <rule id="command" scope="public">
    <example> call Bob </example>
    <example> calculate 4 plus 3 </example>
    <example> search for pictures of the Golden Gate bridge </example>
    <one-of>
      <item>
        call <ruleref uri="contacts.grxml"/>
        <tag>out.action="call_contact"; out.contact=rules.latest()</tag>
      </item>
      <item>
        call <ruleref uri="phonenumber.grxml"/>
        <tag>out.action="call_number"; out.number=rules.latest()</tag>
      </item>
      <item>
        calculate <ruleref uri="#expression"/>
        <tag>out.action="calculate"; out.expression=rules.latest()</tag>
      </item>
      <item>
        search for <ruleref uri="http://grammar.example.com/search-ngram-model.xml"/>
        <tag>out.action="search"; out.query=rules.latest()</tag>
      </item>
    </one-of>
  </rule>
</grammar>
This section is non-normative.
This specification is limited to extending existing HTML elements with attributes and methods for allowing speech input. Speech recognition grammars are specified using SRGS, and semantic interpretation of hypotheses is specified using SISR.
The scope of this specification does not include providing a new markup language of any kind.
The scope of this specification does not include defining or interpreting dialog management instructions.
The scope of this specification does not include interfacing with telephony systems of any kind.
startSpeechInput().

This section is non-normative.
interface HTMLInputElement : HTMLElement {
  ...

  // speech input attributes
  attribute boolean speech;
  attribute DOMString grammar;
  attribute short maxresults;
  attribute long nospeechtimeout;

  // speech input event handler IDL attributes
  attribute Function oncapturestart();
  attribute Function onspeechstart();
  attribute Function onspeechchange(in SpeechInputEvent event);
  attribute Function onspeechend();
  attribute Function onspeecherror(in SpeechInputError error);

  // speech input methods
  void startSpeechInput();
  void stopSpeechInput();
  void cancelSpeechInput();
};
The speech attribute indicates to the user agent that it should accept speech input for that element. This attribute is applicable for HTMLInputElement in the following states:
The grammar attribute gives the address of an external application-specific grammar, e.g. "http://example.com/grammars/pizza-order.grxml". The attribute, if present, must contain a valid non-empty URL potentially surrounded by spaces. The implementation should use the grammar to guide the speech recognizer. Implementations must support SRGS grammars and SISR annotations.
The maxresults attribute specifies the maximum number of items that the implementation should place in the results sequence. If not set, the maximum number of elements in the results sequence is implementation dependent.
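A recognizer-side sketch of this cap, assuming a plain array of hypotheses (capResults and its default behavior are hypothetical illustrations, not part of this API):

```javascript
// Cap the hypotheses placed in the results sequence at maxresults.
// When maxresults is unset (undefined), the cap is implementation
// dependent; this hypothetical default keeps every hypothesis.
function capResults(hypotheses, maxresults) {
  if (maxresults === undefined) return hypotheses.slice();
  return hypotheses.slice(0, maxresults);
}
```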
The nospeechtimeout attribute specifies the time in milliseconds to wait for start of speech, after which the audio capture times out. If not set, the timeout is implementation dependent.
The capturestart event is dispatched when audio capture starts for the active speech input session.
The speechstart event is dispatched when the active speech input session detects that the user has started speaking.
The speechchange event is dispatched when a set of complete and valid utterances has been recognized. This event bubbles. A complete utterance ends when the implementation detects end-of-speech or when the stopSpeechInput() method is invoked.
Some implementations may dispatch the change event for elements when their value changes. When the new value was obtained as the result of a speech input session, such implementations must dispatch the speechchange event prior to the change event.
The speechend event is dispatched when the active speech input session detects that the user has stopped speaking.
The speecherror event is dispatched when the active speech input session results in an error.
When the startSpeechInput() method is invoked, a new speech input session is started. If there was already an active speech input session, that session is cancelled, as if cancelSpeechInput() had been called. If the user has not already given consent to this web page to accept speech input, implementations must not start the speech input session until the user has given consent.
When the stopSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must gracefully stop the session as if end-of-speech was detected. The user agent must perform speech recognition on audio that has already been recorded, and the relevant events must be fired if necessary. If there was no active speech input session, if this element did not initiate the active speech input session, or if end-of-speech was already detected, this method must return without doing anything.
When the cancelSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must abort the session, discard any pending/buffered audio data, and fire no events for the pending data. If there was no active speech input session or if this element did not initiate it, this method must return without doing anything.
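The interplay between these three methods can be summarized with a minimal state model. This is a non-normative sketch: the SpeechSession object, its method names, and its event log are hypothetical stand-ins, not part of this API.

```javascript
// Hypothetical model of the session rules above: start cancels any
// active session, stop recognizes buffered audio and fires events,
// cancel discards buffered audio and fires nothing.
function SpeechSession() {
  this.active = false;
  this.events = [];  // records which events would be dispatched
}

SpeechSession.prototype.start = function() {
  if (this.active) this.cancel();   // implicit cancel of the prior session
  this.active = true;
  this.events.push("capturestart");
};

SpeechSession.prototype.stop = function() {
  if (!this.active) return;         // no active session: do nothing
  this.active = false;
  // Recognize already-recorded audio and fire the relevant events.
  this.events.push("speechend", "speechchange");
};

SpeechSession.prototype.cancel = function() {
  if (!this.active) return;
  this.active = false;              // discard audio, fire no events
};
```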
[NoInterfaceObject]
interface SpeechInputEvent {
  readonly attribute SpeechInputResultList results;

  void feedback(DOMString correctUtterance);
};
The results attribute returns a SpeechInputResultList object containing all the recognition results.
The feedback(correctUtterance) method is used to give feedback on the speech recognition results. Implementations may use this to improve the accuracy of future speech recognition requests.

- Setting correctUtterance to null indicates that none of the items in results were a satisfactory match to the original speech.
- Setting correctUtterance to the utterance of an item in results indicates that the item is the most satisfactory match to the original speech.
- Setting correctUtterance to a value not matching any of the utterance values in results indicates that the given value is the expected match to the original speech.

[NoInterfaceObject]
interface SpeechInputResultList {
  readonly attribute unsigned long length;

  SpeechInputResult item(in unsigned long index);
};
The length attribute must return the number of results represented by the list.
The item(index) method must return the indexth result in the list. If there is no indexth result in the list, then the method must return null.
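Iterating the list with length and item() can be sketched as follows. This is non-normative: makeResultList is a hypothetical array-backed stand-in for a real SpeechInputResultList.

```javascript
// Minimal stand-in for SpeechInputResultList backed by an array.
function makeResultList(resultsArray) {
  return {
    length: resultsArray.length,
    item: function(index) {
      // Out-of-range access returns null, per the definition above.
      if (index >= resultsArray.length) return null;
      return resultsArray[index];
    }
  };
}

// Collect every utterance from such a list.
function allUtterances(list) {
  var out = [];
  for (var i = 0; i < list.length; i++) out.push(list.item(i).utterance);
  return out;
}
```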
SpeechInputResult objects are represented as regular native objects with optional properties named utterance, confidence and interpretation.
[NoInterfaceObject]
interface SpeechInputResult {
  readonly attribute DOMString utterance;
  readonly attribute float confidence;
  readonly attribute object interpretation;
};
The utterance attribute must, on getting, return the text of recognized speech.
The confidence attribute must, on getting, return a value in the inclusive range [0.0, 1.0] indicating the quality of the match. The higher the value, the more confident the recognizer is that this matches what the user spoke.
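For example, a page might keep only hypotheses above a confidence threshold. This is a non-normative sketch over plain objects shaped like SpeechInputResult; confidentResults is hypothetical.

```javascript
// Return the results whose confidence meets a threshold, best first.
// Each result mimics SpeechInputResult: { utterance, confidence }.
function confidentResults(results, threshold) {
  return results
    .filter(function(r) { return r.confidence >= threshold; })
    .sort(function(a, b) { return b.confidence - a.confidence; });
}
```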
The interpretation attribute must, on getting, return the result of semantic interpretation of the recognized speech, using semantic annotations in the grammar. If no grammar was specified or if the grammar contained no semantic annotations for the utterance, this value should be the same as utterance.
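Because interpretation falls back to the raw utterance, consuming code should handle both shapes. A non-normative sketch, using the destination field from the navigation grammar earlier in this document (destinationOf is hypothetical):

```javascript
// interpretation is structured data when the grammar carried semantic
// annotations, and the plain utterance string otherwise.
function destinationOf(result) {
  var interp = result.interpretation;
  if (interp && typeof interp === "object" && "destination" in interp)
    return interp.destination;     // annotated grammar result
  return String(interp);           // plain utterance fallback
}
```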
[NoInterfaceObject]
interface SpeechInputError {
  const unsigned short ABORTED = 1;
  const unsigned short AUDIO = 2;
  const unsigned short NETWORK = 3;
  const unsigned short NO_SPEECH = 4;
  const unsigned short NO_MATCH = 5;
  const unsigned short BAD_GRAMMAR = 6;
  const unsigned short PERMISSION_DENIED = 7;
  const unsigned short UNSUPPORTED_LANGUAGE = 8;

  readonly attribute unsigned short code;
  readonly attribute DOMString message;
};
The code attribute must return the appropriate code from the following list:

- ABORTED (numeric value 1)
- AUDIO (numeric value 2)
- NETWORK (numeric value 3)
- NO_SPEECH (numeric value 4): no speech was detected within the period specified by nospeechtimeout.
- NO_MATCH (numeric value 5)
- BAD_GRAMMAR (numeric value 6)
- PERMISSION_DENIED (numeric value 7)
- UNSUPPORTED_LANGUAGE (numeric value 8)

The message attribute must return an error message describing the details of the error encountered. This attribute is primarily intended for debugging, and developers should not use it directly in their application user interface.
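A page's onspeecherror handler might branch on these codes. In this non-normative sketch the numeric constants are restated from the IDL above; adviceFor and its return values are hypothetical UI policy, not part of this API.

```javascript
// Error codes restated from the SpeechInputError IDL.
var NO_SPEECH = 4, NO_MATCH = 5, PERMISSION_DENIED = 7;

// Decide what a hypothetical UI should do for a given error code.
function adviceFor(code) {
  switch (code) {
    case NO_SPEECH:         return "prompt";   // ask the user to speak again
    case NO_MATCH:          return "retry";    // re-listen with the same grammar
    case PERMISSION_DENIED: return "fallback"; // offer keyboard input instead
    default:                return "report";   // log via the message attribute
  }
}
```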
The lang attribute, if present, sets the speech input language. If this attribute is not set, the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document.
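The fallback chain can be illustrated with a walk up a mock element tree. This is non-normative: real implementations use the document's own lang handling, and resolveLang with plain { lang, parent } nodes is a hypothetical stand-in.

```javascript
// Resolve the speech input language per the fallback rules above.
// Nodes are plain objects: { lang: "..." or undefined, parent: node }.
function resolveLang(element, documentLang) {
  for (var node = element; node; node = node.parent) {
    if (node.lang) return node.lang;  // closest ancestor with lang wins
  }
  return documentLang;                // finally, the document language
}
```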
The pattern attribute, if present, should be used to guide the speech recognizer. If grammar was also specified, grammar takes precedence over pattern.
For HTMLInputElement, the value attribute must be the utterance of the most likely recognition result after a successful recognition. This is equivalent to results[0].utterance. In the case of unsuccessful recognition, value must remain unaffected.

A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "SpeechInput" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.
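Feature detection might look like this. A non-normative sketch: supportsSpeechInput is hypothetical, and the mock object in the usage below stands in for a real document.

```javascript
// Returns true when the implementation reports SpeechInput 1.0 support
// via DOMImplementation.hasFeature, and false otherwise.
function supportsSpeechInput(doc) {
  return !!(doc.implementation &&
            doc.implementation.hasFeature &&
            doc.implementation.hasFeature("SpeechInput", "1.0"));
}
```

A page would typically call supportsSpeechInput(document) and fall back to keyboard input when it returns false.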
Implementations that do not support speech input will ignore the additional attributes and events defined in this module, and HTML elements with these attributes will continue to work with other forms of input.
This section is non-normative.
The following diagram indicates the transitions and events relevant to an element with speech input in a typical scenario.
This section outlines the use cases and requirements that are covered by this specification.
onspeechstart event is received.)

Andrei Popescu, Dave Burke, Jeremy Orlow