Copyright © 2011 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This specification extends HTML to enable pages to incorporate speech recognition and synthesis. Firstly, it defines extensions to existing open web platform interfaces and objects that have general benefit across a range of scenarios, including speech recognition and synthesis. Secondly, it defines a set of speech-specific APIs that enable a richer set of speech semantics and user-agent-provided capabilities. Thirdly, it suggests a design approach for incorporating speech semantics into HTML markup. These three design stages do not all need to be stabilized and implemented simultaneously. For example, the first and second design stages could be stabilized and implemented well before the third.
This specification does not define any new protocols nor require any new protocols to be defined by the IETF or W3C. However, we are aware that more sophisticated protocols will enable a richer set of speech functionality. Hence the speech-specific portions of the API design proposal are designed to be as loosely-coupled to the underlying speech service delivery mechanism as possible (i.e. whether it is proprietary to the browser, delivered by an HTTP/REST service, or delivered by a future to-be-defined protocol).
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is an API proposal from Microsoft to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives). All feedback is encouraged.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. [RFC2119]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]
This section is non-normative.
The API design is presented in three conceptual layers, each of which builds upon the previous: extensions to existing open web platform interfaces and objects, speech-specific object interfaces, and an approach for incorporating speech semantics into HTML markup.
The Speech APIs presented attempt to enable both basic and advanced speech applications. This means that there needs to be an acceptable default speech experience provided by the user agent for basic applications. However, due to the nature of speech technologies, the large investment that goes into the construction and tuning of grammars and acoustic models, and the proprietary nature of statistical language model grammars, it is also necessary to allow the web developer to select a speech recognition service of their own choosing. This proposal enables both of these scenarios.
While the interaction pattern of many applications will enable speech services in response to the end user clicking a button to start speech, or changing the focus to an input field and starting speech, there are other applications that will be doing speech recognition more globally without that pattern (for example, a page that visually displays a map but still allows the user to say "zoom in" or "zoom out"). This proposal supports both these interaction patterns.
In addition to speech input, audio output including text-to-speech synthesis is often necessary as part of a compelling speech experience. The processing of the audio may need to be coordinated with other output (such as changing the visual user experience when a certain word is synthesized) or with the speech input (such as stopping the output in response to the start of speech, i.e., barge-in). This proposal supports both of these use cases as well.
This proposal reuses a number of W3C speech standards including [SSML], [SRGS], [SISR], and [EMMA].
This section is non-normative.
This specification is mainly limited to extending existing HTML interfaces, objects, and elements. The proposal also provides new markup elements with attributes and the corresponding object model. Where possible, existing standards are used: speech recognition grammars are specified using SRGS, semantic interpretation of hypotheses is specified using SISR, recognition results are represented with EMMA, and TTS synthesis is specified using SSML.
The scope of this specification does not include providing any new markup languages or new low level protocols.
User agents must thoughtfully balance the needs of the web application to have access to the user's voice with the user's expectation of privacy. User agents should provide a mechanism for the user to allow a web application to be trusted with the speech input. It is up to the user agent at what granularity it makes sense to provide this authorization, since the appropriate authorization and user experience provided by a web browser may differ depending upon the deployment scenario. Such scenarios may include, but are not limited to, the following:
The user agent may allow all access as part of installation, or may allow all access as part of user configuration or a result of previous dialogs with the user, or may allow only certain domains to have access to the speech input, or may allow access only after initiating a dialog with the user once per domain or once per page, or may require a user dialog on each and every access of speech input, or may use whatever other security and privacy settings that it deems most appropriate. A user agent should provide the user with some indication that speech is being captured and should provide the user with some way to deny or revoke a speech capture.
Two examples are illustrated here. The first example illustrates an approach where the user agent interacts with the user to authorize microphone input to pages from a specific domain ("This site is voice enabled..."). The second example illustrates a global choice ("Do you want to control the web with your voice..."), which is conceivably more appropriate on some devices with more constrained opportunities for manual interaction. Some user agents may wish to persist the authorization decision indefinitely, whereas others may ask for re-authorization after a specific event or period. The key point is that this is an important user agent design decision, where the design parameters may vary from one agent to the next. The API should place no restriction on these design options, and there should be no particular microphone consent design assumptions in the API.
[Figure: Example 1, Site Authorization; Example 2, Global Authorization]
This proposal is sufficient to enable a wide variety of scenarios, including, among others:
Examples of each of these scenario types follow.
It should also be noted that although the illustrations show a button on the page for invoking speech recognition, this will not always be the case. Some devices will have a hardware button for invoking recognition, and some user agents may choose to invoke recognition via a button in the chrome. Some apps, such as those designed for use while driving, or in a living room with open microphone ("10-foot" apps), will not have a button at all.
Web search is typical of the current trend in mobile speech applications. The user interface design is deceptively simple: the user states their search term, and the application presents a list of potential answers. However, the application is more complicated than it appears. The language model for the search terms is completely dependent on the knowledge of the back-end search engine (which isn't available to the user agent), and tends to be enormous and continually changing (hence inappropriate to be included in the page itself). Furthermore, the user doesn't want a transcription of what they said - they actually want the resource they've requested, and speech is just a by-product of the process. In addition to the user's utterance, the search engine will incorporate a number of critical data points in order to achieve this, such as device capabilities, GPS coordinates, camera input, etc. At its simplest, the output may be a list of web links. But this is a rapidly evolving application type, and where possible, much richer experiences will be provided.
[Figure: Step 1, User Initiates a Search; Step 2, Begin Capture; Step 3, Remote Recognition; Step 4, Results Display]
This sort of application occurs when a developer tries to layer speech on top of an existing design. This is far from "good" speech application design, but is still valuable when a developer wants to provide convenient input options on devices that do not have effective keyboards; or when a developer wants a "cheap" approach to speech-enabling their application without redesigning the user interface.
[Figure: Step 1, User Initiates Input Into the First Field; Step 2, Speech Capture & Transcription; Step 3, User Initiates Input in the Next Field; Step 4, Speech Capture & Transcription; Step N, Form Complete]
In this sort of application, the user states their intent using natural language, and the speech system determines what task they are trying to perform, as well as extracting pertinent data for that task from the user's utterance.
[Figure: Step 1, User Initiates a Request; Step 2, User States Natural Language Intent; Step 3, Appropriate Action is Taken]
Driving directions are a good example of a class of applications in which the user and application talk to each other, using recognition and synthesis, over a number of dialog turns. These sorts of applications tend to be always-listening (i.e. open-mic) without requiring the user to manually initiate speech input for each utterance, and make use of barge-in to terminate speech output, and pick from lists of options.
[Figure: Step 1, User Initiates Input of Destination; Step 2, User Says Destination Name; Step 3, Cloud Determines Location; Step 4, New Map, Directions and Spoken Summary; Step 5, User Provides More Instruction; Step 6, Application Recalculates and Speaks Confirmation; Step 7, App Issues Turn-By-Turn Instructions]
There are many applications where speech input and output combine with tactile input and visual output, to provide a more natural experience while reducing display clutter and manual input complexity.
[Figure: Step 1, User Switches Modes; Step 2, Game Confirms New Mode; Step 3, User Takes Action in the New Mode]
The most fundamental work we advocate is to make a minor set of changes to existing and emerging web platform capabilities: the Media Capture API, XMLHttpRequest Level 2, and a proposed Stream type.
The Speech Object Interfaces section builds on this work: the speech interfaces can consume these objects, and/or utilize built-in microphone and speech services provided by the user agent, and/or use alternative speech service protocols that are yet to be determined.
An example flow of events is illustrated in this sequence diagram:
The Media Capture API draft defines an API for accessing the audio, image and video capture capabilities of a device. With some enhancements, the design forms a strong basis for capturing audio input from a microphone to be used in speech recognition, as well as other scenarios such as video capture for streaming to social media sites. Reuse of an existing microphone API design such as this is preferable to defining an alternative speech-specific design. The specific security/privacy requirements of speech are assumed to be similar to those of general microphone or video capture, and any specific functional requirements for speech recognition can be added to the API without breaking other semantics.
We suggest the following modifications to the Media Capture API. Note that since the same API applies to capturing audio, video, and image input, some of the changes don't directly pertain to speech, but are included for completeness and consistency.
- In the current Media Capture API, the navigator.device.capture.supported* properties can be accessed without user intervention. Proposed change: navigator.device.openCapture() returns asynchronously with the capture device object if and only if the user allows access through a UA-defined mechanism such as those described in Security. navigator.device.openCapture() returns asynchronously with the capture device the user prefers (through preference or UI).
- The Capture.capture[Image|Video|Audio] operations launch an asynchronous UI that returns one or more captures. This means the user has to do something in the webapp to launch the UI and then do the capture, which makes it impossible to build capture UI directly into the web application. Not only would this be unusable for a speech recognition application, but it also places unnecessary user interface constraints on other media capture scenarios. Proposed change: the Capture API should directly capture from the device and return a Blob. An application can control the duration and manage multiple captures. If a WebApp wants a picker interface with capture, the HTML Media Capture extensions to <input type="file"> provide that support (this sort of scenario is unlikely for speech recognition, but more common in other media capture scenarios). Note that for privacy reasons some user agents will choose to display some notification in their surrounding chrome or hardware to make it readily apparent to the user that capture is occurring, together with the option to cancel the capture.
- Proposed change: add a preview capability that provides a video stream which can be assigned a URL via URL.createObjectURL(). This URL can then be used as the value for the src attribute on an <audio> or <video> element. (Although this particular change is motivated by the need to preview video, it is reasonable to conceive of applications that combine both speech and image recognition.)
The following IDL shows conceptual additions to the Media Capture API that would satisfy the changes outlined above. The specific IDL is not a formal proposal, but indicative of a viable approach.
... [Use all IDL from Media Capture API, with the following modifications and additions.]

// StoppableOperation is like PendingOperation, but the app can stop it, rather than having to wait for built-in UI.
interface StoppableOperation {
    void cancel();
    void stop();
};

// end-point parameters
interface EndPointParams {
    attribute float sensitivity;             // 0.0-1.0
    attribute unsigned long initialTimeout;  // milliseconds
    attribute unsigned long endTimeout;      // milliseconds
};

// end-point call-back
interface EndPointCB {
    const unsigned short INITIAL_SILENCE_DETECTED = 0;
    const unsigned short SPEECH_DETECTED = 1;
    const unsigned short END_SILENCE_DETECTED = 2;
    const unsigned short NOISE_DETECTED = 3;
    const unsigned short NONSPEECH_TIMEOUT = 4;
    void endpoint(in unsigned short endtype);
};

// preview call-back
interface PreviewCB {
    void onPreview(in Stream previewDevice);
};

// modification of existing Capture interface
interface Capture {
    readonly attribute ConfigurationData[] supportedImageModes;
    readonly attribute ConfigurationData[] supportedVideoModes;
    readonly attribute ConfigurationData[] supportedAudioModes;

    PendingOperation captureImage(in CaptureCB successCB, in optional CaptureErrorCB errorCB, in optional CaptureImageOptions options);

    // Use StoppableOperation rather than PendingOperation for audio & video recordings
    // Additional end-pointing parameters to respond to speech in audio & video recordings
    StoppableOperation captureAudio(in CaptureCB successCB, in optional CaptureErrorCB errorCB, in optional CaptureAudioOptions options, in optional EndPointCB endCB, in optional EndPointParams endparams);
    StoppableOperation captureVideo(in CaptureCB successCB, in optional CaptureErrorCB errorCB, in optional CaptureVideoOptions options, in optional EndPointCB endCB, in optional EndPointParams endparams);

    // The preview() function is separate from the actual record() function
    // because preview quality & format will be different (usually device specific)
    void preview(in PreviewCB previewCB);
};
The StoppableOperation
interface is returned by the captureAudio()
and captureVideo()
methods on the
Capture
interface. It is similar
to a PendingOperation
, with the addition
of a stop()
function that the application can call to stop & finish
a recording.
The EndPointParams
interface
is used to define the settings used to detect the beginning and end of speech during
an audio recording. The sensitivity
attribute can be set to a value between 0.0 (least sensitive) and 1.0 (most sensitive),
and has a default value of 0.5. The endTimeout
attribute specifies the time, in milliseconds, that the user agent should wait after
the user stops speaking before declaring end of speech. It has a default
value of 400. The initialTimeout
attribute specifies the time the user agent should wait for speech to be detected
after recording begins, before declaring that initial silence has occurred. It is
measured in milliseconds and has a default value of 3,000. The specific end-pointing
algorithm a user agent uses depends on the capabilities of the hardware device and
user agent, so further fine-grained parameters are not presented. The user agent
is also free to ignore these settings if they are not appropriate to the local implementation.
It is also conceivable that some applications or devices may have access to speech
recognition technology with more sophisticated end-pointing capabilities, in which
case, they may choose to not use the Capture API's end-pointing at all.
The EndPointCB
interface defines the
endpoint()
callback method that the user agent calls to notify the
application of an end-point event. This callback provides a numeric value that indicates
the type of end-point that has occurred:
- INITIAL_SILENCE_DETECTED (numeric value 0): no speech was detected within the period specified by initialTimeout. The user agent continues recording after this event. The event allows the application to take appropriate action, such as prompting the user or cancelling the recording.
- SPEECH_DETECTED (numeric value 1): the user agent has detected the start of speech.
- END_SILENCE_DETECTED (numeric value 2): the user has stopped speaking for at least the period specified by endTimeout.
- NOISE_DETECTED (numeric value 3): the audio contains too much noise for speech to be reliably detected.
- NONSPEECH_TIMEOUT (numeric value 4): an extended period of non-speech audio has occurred.
The Capture
interface is
already defined in the Media Capture API draft. We
suggest that the captureAudio()
and captureVideo()
methods be adjusted
to return a StoppableOperation so they
can be stopped by the app, and to optionally accept
EndPointCB and EndPointParams arguments,
so that they can provide end-point events back to the app. We also suggest the addition
of a preview()
that provides a video-only
Stream (via a call-back) that can be used in conjunction with a <video> element
to provide an in-page view of what the camera can see.
Usage Example for Capture API
For this example, imagine a multimodal application where the user points their cell phone camera at an object, and uses voice to issue some query about that object (such as "what is this", "where can I buy one", or "do these also come in pink"). The application needs to preview the image using a video stream, listen for a voice query, and take a photo. Assume the voice and image input are dispatched to appropriate speech and image processing services.
[Figure: Step 1, Point and Speak; Step 2, Use Voice to Further Refine]
The sample code uses the capture API to provide microphone input as an audio stream to the recognizer; send a camera video stream to an in-page <video> element as a view-finder; and then take a photograph when the user asks a question.
<body onload="init()">
<div id="message"/>
<video id="viewfinder" width="480" height="600" type="video/mp4"/>
<script type="text/javascript">
var recordingSession;
var message;
var captureDevice; // saved so we can take a photo after recognition completes
function init() {
message = document.getElementById("message");
navigator.device.openCapture(onOpenCapture);
}
// openCapture() results in a callback when the UA has a device the user authorized for capture.
function onOpenCapture(device) {
captureDevice = device;
// start previewing video
captureDevice.preview(onPreview);
// start listening
var audioOptions = { duration: 15, // max duration in seconds
limit: 1, // only need one recording
mode: { type: "audio/x-wav"} // no need to specify width & height
};
var endpointParams = { sensitivity: 0.5, endTimeout: 300, initialTimeout: 5000 };
recordingSession = captureDevice.captureAudio(onRecordStarted, onFail, audioOptions, onEndpoint, endpointParams);
}
// when we're given the video preview stream, feed it to the <video> element
function onPreview(previewDevice) {
document.getElementById("viewfinder").src = window.URL.createObjectURL(previewDevice);
}
function onRecordStarted(stream) {
//place-holder: send stream to recognizer...
}
function onFail(error) {
message.innerHTML = error.toString();
recordingSession = null;
}
function onEndpoint(endevent) {
switch (endevent) {
case 0: //initial silence
message.innerHTML = "please start speaking";
break;
case 1: //started speaking
message.innerHTML = "listening...";
break;
case 2: //finished speaking
recordingSession.stop();
// presumably the recognizer will fire a reco event sometime soon
message.innerHTML = "processing...";
captureDevice.captureImage(onPhotoTaken);
break;
case 3: //noise
message.innerHTML = "can't hear you, too noisy";
break;
case 4: //extended non-speech
message.innerHTML = "giving up...";
recordingSession.cancel();
recordingSession = null;
break;
default: break;
}
}
function onPhotoTaken() {
// place-holder: send it to the image processing service
}
</script>
</body>
We propose the addition of a Stream type. While this document does not present a detailed design for this type, we assume a Stream is an object that represents media data as it is being captured or produced, and that can be assigned a URL via URL.createObjectURL().
XMLHttpRequest Level 2 (XHR2) is a key enabling HTML API which is used extensively by applications to access remote services and resources, and has undergone enhancement since its inception in 1999, as a result of its fundamental value to apps, and the evolving needs of those apps. It is an established and successful design.
One of the key functional improvements in XHR2 over previous versions of XHR is the ability to make cross-domain requests. This is particularly important to speech, since many speech applications will not want to run their own speech recognition or synthesis web services, and there will likely be some speech recognition services offered to a variety of applications on different domains. Fortunately, XHR2 follows the Cross-Origin Resource Sharing specification and enables this scenario with no changes.
The following enhancements are suggested in order to satisfy speech-related scenarios:
- XHR2 currently provides a send(Blob data) method. For the purposes of speech recognition, it is important to send the audio stream as it is captured. However the semantics of Blob imply that sending could only commence once capture is complete. Proposed change: add a send(Stream stream) method so that the microphone capture stream can be sent to the recognition service without delay. This enhancement also helps with the related scenario of uploading video recordings to video sharing sites in real time.
- Some speech services use alternative authentication schemes that depend on time-sensitive information in the request. Proposed change: add an overload of the open() method that takes a callback that's used to notify the app that a send is about to be initiated. When the app receives this callback, it can perform whatever timestamp, crypto, CAPTCHA, or other mechanism it needs to, set the appropriate headers, then signal that the send should commence.

The following IDL only shows additions to the IDL already specified in the XHR2 draft. It is not a formal proposal, but is indicative of the sorts of enhancement that could be made.
// existing interface
interface XMLHttpRequestEventTarget : EventTarget {
    ...
    // multipart response notification
    attribute Function onpartreceived;
};

// new callback interface for alternative service auth
interface XHRAuthCB {
    void authneeded(in Function signal);
};

// new interface for a body part in a multipart response
interface ResponsePart {
    readonly attribute DOMString mimeType;
    readonly attribute DOMString encoding;
    readonly attribute Blob blobPart;
    readonly attribute DOMString textPart;
    readonly attribute Document XMLPart;
};

// new interface for accessing the collection of response body parts
interface ResponsePartCollection {
    omittable getter ResponsePart item(in unsigned short index);
    readonly attribute unsigned short count;
};

// existing interface
interface XMLHttpRequest : XMLHttpRequestEventTarget {
    ...
    // open with alternative auth callback
    void open(DOMString method, DOMString url, boolean async, XHRAuthCB authCB);

    // send stream
    void send(in Stream data);

    // multipart expected (from Mozilla design)
    attribute boolean multipart;

    // multipart response
    readonly attribute ResponsePartCollection responseparts;
};
The XMLHttpRequest
interface is extended
in three important ways.
Firstly, there is an additional send
()
method that accepts a Stream
object (similar
to the existing method that accepts Blob). For speech
recognition, this method would be used to stream microphone input to a web service.
Secondly, it provides access to multipart responses. To enable multipart responses,
the app sets the multipart
attribute
to TRUE. When a part is received, the onpartreceived
callback is fired. The app can access each received part through the
responseparts
collection, which is added to as each part is received.
Each ResponsePart has mimeType and encoding attributes that the app can use to disambiguate the parts, as well as a variety of accessors to examine the part.
Thirdly, it provides an overload of open
that enables alternative authentication schemes, by taking a call-back function
to be invoked by the user agent prior to sending a request. The call-back allows
the application to perform whatever time-sensitive authentication steps it needs
to perform (e.g. calculate a timestamped hash, or fetch a limited-use auth-token,
and place it in a particular header) and then call a signal() function to indicate
that the send operation can proceed.
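As a non-normative illustration of this overload, the sketch below supplies an object implementing XHRAuthCB; its authneeded() callback fetches a short-lived token before signalling that the send may proceed. The service URL, the x-auth-token header name, and the fetchAuthToken() helper are illustrative assumptions, not part of the proposal.

var xhr = new XMLHttpRequest();
xhr.open("POST", "http://webreco.contoso.com/search", true, {
    authneeded: function(signal) {
        // hypothetical app-specific helper that obtains a time-sensitive token
        fetchAuthToken(function(token) {
            xhr.setRequestHeader("x-auth-token", token);
            signal(); // tell the user agent that the send may now commence
        });
    }
});
// any subsequent xhr.send(...) is held by the user agent until signal() has been called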
This section presents a small number of conventions to aid with interoperability between diverse speech services and user agents using HTTP 1.1.
A large portion of current speech scenarios, as well as most of the HTML Speech XG's requirements, are feasible and addressable with existing HTTP 1.1 technology. For speech recognition, the basic pattern is to issue an HTTP POST request containing audio input streamed using chunked transfer encoding ([RFC2616]), and then receiving a 200 OK response from the service, containing EMMA results, and potentially further mime body parts containing additional information. For speech synthesis, the basic pattern is to issue an HTTP POST request, with SSML in the body, and receiving a 200 OK response containing the rendered audio and mark timing.
We acknowledge that while this approach satisfies many scenarios, further innovation in protocols may be necessary for some of the richer scenarios in the longer term. Indeed, new work is being done in the IETF and W3C around real time communication and web sockets that may well become useful and germane to certain speech scenarios, although at this stage it is too early to tell. Ultimately, we envisage that the Speech Object Interfaces presented in this document will work with a variety of underlying implementations (both local and protocol), and have designed those interfaces accordingly to allow for future service and protocol innovations.
HTTP Input Parameters

Input parameters to speech requests are expressed by the user agent using any of these standard techniques:

- as HTTP headers;
- as query parameters on the request URL;
- in the request body, using application/x-www-form-urlencoded encoding;
- in the request body, using multipart/form-data or multipart/mixed encoding.

Headers and query parameters are the easiest options for apps that use XHR-2. The other options are suitable for other approaches and are included for completeness. A user agent is not required to use the same technique for all of the parameters it passes. For example, it may choose to pass most parameters on the URL string, but include the audio stream in the body.
The recognition API has a number of attributes that an application may provide values for. These have corresponding HTTP parameters with the following reserved names. All of these parameters are optional in the HTTP request, and if absent their default behavior is determined by the speech recognition service.
- grammars: as a whitespace-delimited string of URLs. This must not be used with the grammarN parameters.
- grammar1: the 1st grammar in priority order. If of type text/plain or string, then it represents a URL reference. It can be the contents of a grammar if in some other format (for instance if of type application/srgs+xml it would be a grammar file).
- grammar2: the 2nd grammar in priority order. If of type text/plain or string, then it represents a URL reference. It can be the contents of a grammar if in some other format (for instance if of type application/srgs+xml it would be a grammar file).
- grammarN: the n-th grammar in priority order (where N isn't literal but is some number). If of type text/plain or string, then it represents a URL reference. It can be the contents of a grammar if in some other format (for instance if of type application/srgs+xml it would be a grammar file).
- maxnbest: as a positive integer.
- speechtimeout: as a CSS2 time.
- completetimeout: as a CSS2 time.
- incompletetimeout: as a CSS2 time.
- confidence: as a decimal number between 0.0 and 1.0.
- sensitivity: as a decimal number between 0.0 and 1.0.
- speedvsaccuracy: as a decimal number between 0.0 and 1.0.
- speechparams: the string value from the speechparams attribute.
- contextblock: Given that speech services will generally have more than one recognition server instance, it's important that recognizer adaptation data can travel with client requests from one server instance to the next. Whenever this property is present in a request, the recognizer may return acoustic adaptation data to improve the accuracy of future transactions. The application would then assign the returned data to the contextblock property in the next transaction. When the application is making its first request, and thus does not have any data to submit from a previous transaction, the contextblock property should be included without any value. Its content is opaque to the application, and is generally a base-64 encoded opaque data string. This is analogous to the MRCP recognizer-context-block property.
In addition to this, all recognition service HTTP requests MUST include the following parameters:

- audio: this is a mandatory input parameter for recognition. Unlike other parameters, it MUST be in the request body, and MAY be chunked. It may be a URL or raw audio data in a format specified by its mime-type.

Speech synthesis has fewer reserved input parameters:

- src: a URL with the http:, https:, data:, or file: schemes, pointing to a source that the service has access to.
- content: either an SSML document, or plain text to be synthesized. Cannot be used if src is also used.

All synthesis HTTP requests MUST include either src or content, but MUST NOT include both.
In addition to this, a synthesis HTTP request MAY include:
- audioformat: the MIME audio format descriptor required by the user agent. If this is not specified, audio/basic should be assumed.

Service implementers may define additional service-specific input parameters. For example, they may define input properties for functions such as logging, session coordination, engine selection, and so forth. Applications can most easily provide these as part of the service URL, or in additional headers. A suitable naming convention should be used, for example service-specific parameter names could be prefixed with "x-". In addition, services should ignore any parameters they do not understand or expect.
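To make these conventions concrete, the following non-normative sketch issues a synthesis request with XHR2, passing audioformat as a header and the content parameter in an application/x-www-form-urlencoded body. The service URL and the choice of parameter-passing technique are illustrative; a service could equally accept these values as query parameters.

var tts = new XMLHttpRequest();
tts.open("POST", "http://tts.contoso.com/synthesize", true); // hypothetical synthesis service
tts.setRequestHeader("audioformat", "audio/amr-wb");         // requested output encoding
tts.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
tts.onreadystatechange = function() {
    if (tts.readyState == 4 && tts.status == 200) {
        // the content-type response header names the audio format; the body is the rendered audio
    }
};
var ssml = '<speak version="1.0" xml:lang="en-US">Turn left on Main Street.</speak>';
tts.send("content=" + encodeURIComponent(ssml));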
HTTP Output Data

When operating correctly, a speech recognition request returns a 200 OK response. If the response contains a recognition result (which is normally the case), either of the following MUST be true:

- the content-type header MUST be "application/emma+xml", and the response body MUST be an Extensible MultiModal Annotation (EMMA) document; or
- the content-type header is multipart, and the first part MUST have a MIME type of "application/emma+xml" and MUST contain an EMMA document. Subsequent parts may be of any type, as determined by the service.

The response MAY also include the following header:

- contextblock: the block of adaptation data returned by the service.

Service implementers MAY provide proprietary information (for example session tokens, adaptation data, or lattices) in the EMMA document, provided such information is expressed using standard XML extension conventions (such as placing proprietary tags in a separate namespace).
When operating correctly, a speech synthesis request returns a 200 OK response with these characteristics:
- the content-type header specifies the audio format returned by the service. The audio itself is encoded into the message body.
- a speech-ssmlmark header MAY be present, and if so, it indicates the timing of SSML mark events within the audio stream. This is a space-delimited string of space-delimited pairs, where each pair is the mark name in double-quotes and the time, in CSS2 time format.

Errors are communicated to the user agent via the speech-error header, in the format of an error number, followed by a space and then the error message. In the case of an error, a recognition or synthesis service will generally still return 200 OK unless the error was with the HTTP request itself. Some errors may not prevent the generation of results, and the service may still provide them.
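A receiving application might interpret these headers as in the fragment below; the variable xhr is assumed to hold a completed request to a speech service, and the example mark string in the comment is illustrative.

var err = xhr.getResponseHeader("speech-error");
if (err != null) {
    var space = err.indexOf(" ");
    var errorNumber = err.substring(0, space);
    var errorMessage = err.substring(space + 1);
    // decide whether any results that were still returned are usable
}
var marks = xhr.getResponseHeader("speech-ssmlmark");
if (marks != null) {
    // e.g. '"intro" 0s "street" 1.2s' -- pairs of quoted mark name and CSS2 time
}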
In this example, the user presses a button to start doing a search using their voice, similar to the Web Search example above. The application uses Capture and XHR to stream audio from the microphone to a web service. The first part of the response from the web service is the EMMA document containing the recognition result. The application displays the top result and the alternates. A short time later, the second part of the response arrives, which is an HTML snippet containing the search results presented in whatever manner is most appropriate, which the app then inserts into the page.
<body onload="init()">
<input type="button" id="btnListen" onclick="onListenClicked()" value="Listen" />
<div id="feedback"></div>
<div id="alternates"></div>
<div id="results"></div>
<script type="text/javascript">
var feedback;
var alternates;
var results;
var recordingSession;
var xhr;
var recoReceived = false;
var searchResultsReceived = false;
function init() {
feedback = document.getElementById("feedback");
alternates = document.getElementById("alternates");
results = document.getElementById("results");
xhr = new XMLHttpRequest();
xhr.open("POST","http://webreco.contoso.com/search",true);
xhr.setRequestHeader("maxnbest","5");
xhr.setRequestHeader("confidence","0.8");
xhr.onreadystatechange = onReadyStateChanged;
xhr.onpartreceived = onResponsePartReceived;
}
// readyState transitions are not used to drive the UI in this example
function onReadyStateChanged() {
}
function onListenClicked() {
feedback.innerHTML = "preparing...";
navigator.device.openCapture(onOpenCapture);
}
function onOpenCapture(captureDevice) {
var audioOptions = { duration: 10, // max duration in seconds
limit: 1, // only need one recording
mode: { type: "audio/amr-wb"} // use wideband amr encoding
};
var endpointParams = { sensitivity: 0.5, initialTimeout: 3000, endTimeout:500};
recordingSession = captureDevice.captureAudio(onRecordStarted, onRecordFail, audioOptions, onEndpoint, endpointParams);
}
function onRecordFail(error) {
feedback.innerHTML = "Could not start recording.";
recordingSession = null;
}
function onRecordStarted(stream) {
feedback.innerHTML = "listening...";
xhr.send(stream);
}
function onEndpoint(endevent) {
switch (endevent) {
case 0: //initial silence
feedback.innerHTML = "Please start speaking...";
break;
case 1: //started speaking
feedback.innerHTML = "Mmm...hmmm...I'm listening intently...";
break;
case 2: //finished speaking
feedback.innerHTML = "One moment...";
recordingSession.stop();
recordingSession = null;
// now xhr will reach the end of the input stream, and complete the request.
break;
case 3: //noise
feedback.innerHTML = "Too noisy. Try again later...";
recordingSession.cancel();
recordingSession = null;
xhr.abort();
break;
case 4: //extended non-speech
feedback.innerHTML = "Still can't hear you. Try again later...";
recordingSession.cancel();
recordingSession = null;
xhr.abort();
break;
default: break;
}
}
function onResponsePartReceived() {
if (!recoReceived) {
if ("application/emma+xml" == xhr.responseparts(0).mimeType) {
// use a distinct name so we don't shadow the "results" <div>
var terms = xhr.responseparts(0).XMLPart.getElementsByTagName("searchterm");
feedback.innerHTML = "You asked for '" + terms[0].textContent + "'";
if (terms.length > 1) {
alternates.innerHTML = "<div>Or you may have said one of these...</div>";
for (var i = 1; i < terms.length; i++) {
alternates.innerHTML = alternates.innerHTML + "<div>" + terms[i].textContent + "</div>";
}
}
results.innerHTML = "Fetching results...";
recoReceived = true;
}
}
else if ((xhr.responseparts.count > 1) && !searchResultsReceived) {
if ("application/html+searchresults" == xhr.responseparts(1).mimeType) {
results.innerHTML = xhr.responseparts(1).textPart;
searchResultsReceived = true;
}
}
}
</script>
</body>
The basic design approach is to define recognition and synthesis APIs that are accessible from script. These APIs express the semantics of recognition and synthesis, while being loosely coupled to the implementation of those services. The API abstracts microphone input and interaction with the underlying speech services, so that the same API can be used whether the related mechanisms are proprietary to the device/user-agent, accessible over XHR2, or accessible over a future to-be-defined mechanism, without having to modify the API surface.
interface GrammarCollection {
    omittable getter DOMString item(in unsigned short index);
    attribute unsigned short count;
    void add(DOMString grammarURI, in optional float weight); // typically http:, but could be data: for inline grammars
};

[NamedConstructor=SpeechRecognizer(),                          // uses the default recognizer provided by UA
 NamedConstructor=SpeechRecognizer(DOMString selectionParams), // for specifying the desired characteristics of a built-in recognizer
 NamedConstructor=SpeechRecognizer(XMLHttpRequest xhr)         // for specifying a recognizer
 // NamedConstructor=SpeechRecognizer(OtherServiceProvider other) // future service providers/protocols can be added
]
interface SpeechRecognizer {
    // audio input configuration
    void SetInputDevice( in CaptureDevice device,
                         in optional CaptureAudioOptions options,
                         in optional EndPointParams endparams);
    void SetInputDevice( in CaptureDevice device,
                         in CaptureCB successCB,
                         in optional CaptureErrorCB errorCB,
                         in optional CaptureAudioOptions options,
                         in optional EndPointParams endparams,
                         in optional EndPointCB endCB);
    attribute boolean defaultUI;

    // type of recognition
    const unsigned short SIMPLERECO = 0;
    const unsigned short COMPLEXRECO = 1;
    const unsigned short CONTINUOUSRECO = 2;
    attribute unsigned short speechtype;
    readonly attribute boolean supportedtypes[]; // e.g. check supportedtypes[COMPLEXRECO] to determine whether interim events are supported.

    // speech parameters
    attribute GrammarCollection grammars;
    attribute short maxnbest;
    attribute long speechtimeout;
    attribute long completetimeout;
    attribute long incompletetimeout;
    attribute float confidence;
    attribute float sensitivity;
    attribute float speedvsaccuracy;
    attribute Blob contextblock;

    attribute Function onspeechmatch(in SpeechRecognitionResultCollection results);
    attribute Function onspeecherror(in SpeechError error);
    attribute Function onspeechnomatch();
    attribute Function onspeechnoinput();
    attribute Function onspeechstart();
    attribute Function onspeechend();

    // speech input methods
    void startSpeechInput(in optional Blob context);
    void stopSpeechInput();
    void cancelSpeechInput();
    void emulateSpeechInput(DOMString input);

    // states
    const unsigned short READY = 0;
    const unsigned short LISTENING = 1;
    const unsigned short WAITING = 2;
    readonly attribute unsigned short speechrecostate;
};
A SpeechRecognizer can use a variety of different underlying services. The specific service depends on how the object is constructed:

- The SpeechRecognizer() constructor uses the default recognition service provided by the user agent. In some cases this may be an implementation that is local to the device, and limited by the capabilities of the device. In other cases it may be a remote service that is accessed over the network using mechanisms that are hidden to the application. In other cases it may be a smart hybrid of the two. The responsiveness, accuracy, language modeling capacity, acoustic language support, adaptation to the user, and other characteristics of the default recognition service will vary greatly between user agents and between devices.
- The SpeechRecognizer(DOMString selectionParams) constructor causes the user agent to use one of its built-in recognizers, selecting the particular recognizer that matches the parameters listed in selectionParams. This is a URL-encoded string of label-value pairs. It is up to the user agent to determine how closely it chooses to honor the requested parameters. Suggested labels and their corresponding values include:
- The SpeechRecognizer(XMLHttpRequest xhr) constructor uses an application-provided instance of XMLHttpRequest and the recommended HTTP conventions to access the speech recognition service. When this constructor is used, the user agent should consider displaying a notification in its surrounding chrome to notify the user of the particular speech service that is being used.

In the event that additional service access mechanisms are designed and standardized, such as new objects for accessing new real-time communication protocols, an additional constructor could be added, without changing the API.
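The following non-normative sketch shows the three construction options side by side. The "language" selection label is purely illustrative (the suggested labels are not enumerated here), and the service URL is the same hypothetical one used in the earlier XHR example.

// 1. Default recognizer chosen by the user agent
var defaultReco = new SpeechRecognizer();

// 2. Built-in recognizer matching requested characteristics (label is illustrative)
var frenchReco = new SpeechRecognizer("language=fr-FR");

// 3. Recognizer reached through XHR2 and the recommended HTTP conventions
var xhr = new XMLHttpRequest();
xhr.open("POST", "http://webreco.contoso.com/search", true);
var remoteReco = new SpeechRecognizer(xhr);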
By default, microphone input will be provided by the user agent's default device. However,
the application may use the SetInputDevice()
functions
to provide a particular Capture object, with particular configuration settings, in order to exercise
more control over the recording operation.
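For example, a page could combine the Capture additions proposed earlier with a recognizer as sketched below; the option values are illustrative and not defaults.

navigator.device.openCapture(function(captureDevice) {
    var reco = new SpeechRecognizer();
    reco.SetInputDevice(captureDevice,
                        { duration: 15, limit: 1, mode: { type: "audio/x-wav" } },    // CaptureAudioOptions
                        { sensitivity: 0.5, initialTimeout: 3000, endTimeout: 400 }); // EndPointParams
    reco.startSpeechInput();
});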
User agents are expected to provide a default user experience for controlling speech
recognition. However, apps can choose to provide their own user interface by setting
the defaultUI
attribute to FALSE.
The speechtype
attribute indicates the type of
speech recognition that should occur. The value must be one of the following:
SIMPLERECO
(numeric value 0): SIMPLERECO means that the request for recognition must raise just one final speechmatch, speechnomatch, speechnoinput, or speecherror event and must not produce any interim events or results. If the speech service is a remote service accessible over HTTP/HTTPS this means the recognition may be a simple request-response. SIMPLERECO must be the default value.
COMPLEXRECO
(numeric value 1): COMPLEXRECO means that the speech recognition request must produce all of the interim events in addition to speechstart, speechend, as well as the final recognition result. Interim events may include partial results. COMPLEXRECO must only produce one final recognition result, nomatch, noinput, or error as a COMPLEXRECO is still one utterance from the end user resulting in one result.
CONTINUOUSRECO
(numeric value 2): CONTINUOUSRECO represents a conversation or dialogue or dictation scenario where, in
addition to interim events, numerous final recognition results must be
able to be produced. A CONTINUOUSRECO speech interaction once started must
not stop until the stopSpeechInput
or cancelSpeechInput
method is
invoked.
Because audio must be streamed while results are returned, a remote
service doing COMPLEXRECO or CONTINUOUSRECO should use a more sophisticated
paradigm than regular HTTP request-response. Hence, although COMPLEXRECO and CONTINUOUSRECO
could be specified in conjunction with the SpeechRecognizer(XMLHttpRequest
xhr)
, no interim events would be received, and only a single result would
be received.
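A brief non-normative sketch of selecting a recognition type, using the constants and the supportedtypes array defined above:

var reco = new SpeechRecognizer();
if (reco.supportedtypes[SpeechRecognizer.CONTINUOUSRECO]) {
    // dictation/dialog: many final results until stopSpeechInput() or cancelSpeechInput()
    reco.speechtype = SpeechRecognizer.CONTINUOUSRECO;
} else {
    // the default: a single final event and no interim events
    reco.speechtype = SpeechRecognizer.SIMPLERECO;
}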
The optional grammars
attribute is
a collection of URLs that give the address of one or more
application-specific grammars. A weight between 0.0 and 1.0 can optionally be provided
for each grammar (when not specified, weight defaults to 1.0). For example
grammars.add("http://example.com/grammars/pizza-order.grxml", 0.75)
. Some
applications may wish to provide SRGS directly, in which case they can use the data: URI scheme, e.g.
grammars.add("data:,<?xml version=... ...</grammar>")
. The implementation
of recognition systems should use the list of grammars to guide
the speech recognizer. Implementations must support
SRGS grammars and SISR annotations. Note that the order
of the grammars must define a priority order used to resolve
ties where an earlier listed grammar takes higher priority.
If the grammar attribute is absent the recognition service may provide a default grammar. For instance, services that perform recognition within specific domains (e.g. web search, e-commerce catalog search, etc) have an implicit language model, and do not necessarily need the application to specify a grammar.
The optional maxnbest
attribute specifies that
the implementation must not return a number of items greater than the
maxnbest value. If the maxnbest is not set it must default to 1.
The optional speechtimeout
attribute
specifies the time in milliseconds to wait for start of speech, after which the audio capture
must stop and a speechnoinput event must be returned. If not set, the timeout
is speech service dependent.
The optional completetimeout
attribute
specifies the time in milliseconds the recognizer must wait to finalize a
result (either accepting it or throwing a nomatch event for too low confidence results), when the
speech is a complete match of all active grammars. If not set, the timeout is speech service
dependent.
The optional incompletetimeout
attribute
specifies the time in milliseconds the recognizer must wait to finalize a
result (either accepting it or throwing a nomatch event for too low confidence results), when the
speech is an incomplete match (i.e., anything that is not a complete match) of all active
grammars. If not set, the timeout is implementation dependent.
The optional confidence
attribute specifies a
confidence level. The recognition service must reject any recognition result
with a confidence less than the confidence level. The confidence level must
be a float between 0.0 and 1.0 inclusive and must have a default value of
0.5.
The optional sensitivity
attribute specifies
how sensitive the recognition system should be to noise. The recognition service must treat a higher value as a request to be more sensitive to noise. The sensitivity
must be a float between 0.0 and 1.0 inclusive and must
have a default value of 0.5.
The optional speedvsaccuracy
attribute
specifies how much the recognition system should prioritize a speedy low-latency result and how
much it should prioritize getting the most accurate recognition. The recognition service
must treat a higher value as a request to have a more accurate result and
must treat a lower value as a request to have a faster response. The
speedvsaccuracy must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.
The optional contextblock
attribute
is used to convey additional recognizer-defined context data between the application
and the recognizer. For example, recognizers may use it to convey adaptation data
back to the application. The application could persist this context data to be re-used
in a future session. It is automatically updated whenever a speechmatch occurs.
The following events are targeted at the SpeechRecognizer object (or any object that inherits from it), do not bubble, and are not cancelable.
The speechmatch
event must
be dispatched when a set of complete and valid utterances have been matched. A complete utterance ends when the
implementation detects end-of-speech or if the stopSpeechInput()
method was invoked.
Note that if the recognition was of type SIMPLERECO
or COMPLEXRECO
there must be only one speechmatch result returned and this must end the speech recognition.
The speecherror
event must be dispatched when the active speech input session resulted in an error. This
error may be a result of a user agent denying the speech session, or parts of
it, due to security or privacy issues or may be the result of
an error in the web author's specification of the speech request or could be the
result of an error in the recognition system.
The speechnomatch
event must be raised when a complete utterance has failed to match the active
grammars, or has only matched with a confidence less than the specified
confidence value. A complete utterance ends when the
implementation detects end-of-speech or if the stopSpeechInput()
method was invoked.
Note that if the recognition was of type SIMPLERECO
or COMPLEXRECO
there must be only one speechnomatch returned and this must end the speech recognition.
The speechnoinput
event must be raised when the recognizer has detected no speech and the
speechtimeout has expired. Note that if the
recognition was of type SIMPLERECO
or COMPLEXRECO
there must be only one speechnoinput returned and this must end the
speech recognition.
The speechstart
event must
be raised when the recognition service detects that a user has started speaking. This event must not be
raised if the speechtype was SIMPLERECO
but must be generated if
the speechtype is either COMPLEXRECO
or CONTINUOUSRECO
.
The speechend
event must
be raised when the recognition service detects that a user has stopped speaking. This event must not be
raised if the speechtype was SIMPLERECO
but must be generated if
the speechtype is either COMPLEXRECO
or CONTINUOUSRECO
.
When the startSpeechInput()
method is
invoked then a speech recognition turn must be started. It is an error to
call startSpeechInput()
when the speechrecostate is anything but READY
and a user agent must raise a speecherror
should this occur.
Note that user agents should have privacy and security policies that specify whether
scripted speech should be allowed, and user agents may prevent
startSpeechInput() from succeeding when called on certain elements, applications, or sessions. If a
recognition turn is not begun with the recognition service then a speecherror
event
must be raised.
When the stopSpeechInput()
method is
invoked, if there was an active speech input session and this element had initiated it, the user
agent must gracefully stop the session as if end-of-speech was detected. The
user agent must perform speech recognition on audio that has already been
recorded and the relevant events must be fired if necessary. If there was no active speech input
session, if this element did not initiate the active speech input session or if end-of-speech was
already detected, this method must return without doing anything.
When the cancelSpeechInput()
method is
invoked, if there was an active speech input session and this element had initiated it, the user
agent must abort the session, discard any pending/buffered audio data and fire no events for the
pending data. If there was no active speech input session or if this element did not initiate it,
this method must return without doing anything.
When the emulateSpeechInput(DOMString
input)
method is invoked then the speech service treats the input string as the text
utterance that was spoken and returns the resulting recognition.
The speechrecostate
readonly variable
tracks the state of the recognition. At the beginning the state is
READY
meaning that the element is ready to start a speech
interaction. Upon starting the recognition turn the state changes to
LISTENING
. Once the system has stopped capturing from the
user, either due to the speech system detecting the end of speech or the web application calling
stopSpeechInput()
, but before the results
have been returned the system is in WAITING
. Once the
system has received the results the state returns to READY
.
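Putting these pieces together, the following non-normative example wires a recognizer to a grammar, the events described above, and the start/cancel methods. The grammar URL and the showStatus() page function are illustrative.

var reco = new SpeechRecognizer();   // user agent's default recognizer
reco.grammars.add("http://example.com/grammars/pizza-order.grxml", 0.75);
reco.maxnbest = 3;
reco.confidence = 0.5;
reco.speechtype = SpeechRecognizer.COMPLEXRECO;   // interim events require COMPLEXRECO or CONTINUOUSRECO

reco.onspeechstart = function() { showStatus("listening..."); };
reco.onspeechend = function() { showStatus("processing..."); };
reco.onspeechnoinput = function() { showStatus("I didn't hear anything."); };
reco.onspeechnomatch = function() { showStatus("I didn't understand that."); };
reco.onspeecherror = function(error) { showStatus("Speech error."); };
reco.onspeechmatch = function(results) { showStatus("You said: " + results.item(0).utterance); };

function onMicButtonClicked() {
    if (reco.speechrecostate == SpeechRecognizer.READY) {
        reco.startSpeechInput();
    } else {
        reco.cancelSpeechInput();
    }
}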
[NamedConstructor=SpeechRecognitionResultCollection(Document responseEMMAXML),
 NamedConstructor=SpeechRecognitionResultCollection(DOMString responseEMMAText)]
interface SpeechRecognitionResultCollection {
    readonly attribute Document responseEMMAXML;
    readonly attribute DOMString responseEMMAText;
    readonly attribute unsigned short length;
    omittable getter SpeechRecognitionResult item(in unsigned short index);
    void feedbackcorrection(DOMString correctUtterance);
    void feedbackselection(in unsigned short index);
};
The responseEMMAXML
attribute must be generated from the EMMA document returned by the
recognition service. The value of responseEMMAXML
is the result of parsing the
response entity body into a document tree following the rules from the XML
specifications. If this fails (unsupported character encoding, namespace well-formedness error et
cetera) the responseEMMAXML must be null.
The responseEMMAText
attribute
must be generated from the EMMA document returned by the
recognition service. The value of responseEMMAText
is the result of parsing the
response entity body into a text response entity body as
defined in XMLHTTPRequest.
The length
attribute must return the number of results represented by the collection.
The item(index)
method
must return the indexth result in the collection. If there is no indexth result
in the collection, then the method must return null. Since this
is an "omittable getter", the "item" accessor is optional: script can use results(index)
as a short-hand for results.item(index).
The feedbackcorrection(correctUtterance)
method is used to give feedback on the speech recognition results by providing the text value
that the application feels was more correct for the last turn of the recognition. The application
should not use feedbackcorrection
if one of the selections was
correct and should instead use feedbackselection
.
The feedbackselection(in unsigned short
index)
method is used to give feedback on the speech recognition results by
providing the item index that the application feels was more correct for the last turn of the
recognition. Passing in a value that is beyond the maximum returned values from the last turns
should be interpreted as the application thinking that the entire list was
incorrect, but that the application is not sure what was correct (since it didn't use
feedbackcorrection).
The results
attribute returns a sequence of
SpeechRecognitionResult
objects. In
ECMAScript, SpeechRecognitionResult
objects are represented as regular native objects with properties named utterance
,
confidence
and interpretation
. Note that this may not be sufficient for
applications that want more information and insight to the recognition such as timing of
various words and phrases, confidences on individual semantics or parts of utterances, and many
other features that might be part of a complex dictation use case. This is fine because in these
applications the raw EMMA will be available in the
SpeechRecognitionResultCollection.
[NoInterfaceObject]
interface SpeechRecognitionResult {
    readonly attribute DOMString utterance;
    readonly attribute float confidence;
    readonly attribute object interpretation;
};
The utterance
attribute must return the text of recognized speech.
The confidence
attribute must return a value in the inclusive range [0.0, 1.0] indicating the quality of the
match. The higher the value, the more confident the recognizer is that this matches what the user
spoke.
The interpretation
attribute must return the result of semantic interpretation of the recognized speech, using
semantic annotations in the grammar. If the grammar used contained no semantic annotations for
the utterance, then this value must be the same as utterance.
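The following non-normative sketch shows how an onspeechmatch handler might walk the n-best list and report feedback; the recognizer object and the applyCommand() helper are hypothetical application code.
// Non-normative sketch. "recognizer" and "applyCommand" are hypothetical application objects.
recognizer.onspeechmatch = function (results) {
    for (var i = 0; i < results.length; i++) {
        var r = results(i); // short-hand for results.item(i)
        console.log(r.utterance + " (confidence " + r.confidence + ") -> " + r.interpretation);
    }
    if (results.length > 0 && results(0).confidence > 0.7) {
        applyCommand(results(0).interpretation);
        results.feedbackselection(0);                      // the top result was the one used
    } else {
        results.feedbackcorrection("turn off the lights"); // the application knew the intended text
    }
};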
The SpeechSynthesizer interface generates synthesized voice (or spliced audio recordings) using SSML input (or raw text). It has its own audio playback capabilities, along with a mark event to help with UI synchronization and barge-in. It also produces an audio stream that can be used as the source for an <audio> element.
The interface could be extended with other progress events, such as word and sentence boundaries, phoneme rendering, and even visual cue events such as visemes and facial expressions.
[NamedConstructor=SpeechSynthesizer(),                          // default built-in synthesizer
 NamedConstructor=SpeechSynthesizer(DOMString selectionParams), // selection parameters for built-in synthesizer
 NamedConstructor=SpeechSynthesizer(XMLHttpRequest xhr)         // synthesizer via a REST service
 // NamedConstructor=SpeechSynthesizer(OtherServiceProvider other) // future service providers/protocols can be added
]
interface SpeechSynthesizer {
    // error handling
    attribute Function onspeecherror(in SpeechError error);

    // content specification
    attribute DOMString src;
    attribute DOMString content;

    // audio buffering state
    const unsigned short BUFFER_EMPTY = 0;
    const unsigned short BUFFER_WAITING = 1;
    const unsigned short BUFFER_LOADING = 2;
    const unsigned short BUFFER_COMPLETE = 3;
    readonly attribute unsigned short readyState;
    attribute boolean preload;
    void load();
    readonly attribute TimeRanges timeBuffered;
    readonly attribute Stream audioBuffer;

    // playback controls
    void play();
    void pause();
    void cancel();
    attribute double rate;
    attribute unsigned short volume;

    // playback state
    const unsigned short PLAYBACK_EMPTY = 0;
    const unsigned short PLAYBACK_PAUSED = 1;
    const unsigned short PLAYBACK_PLAYING = 2;
    const unsigned short PLAYBACK_COMPLETE = 3;
    const unsigned short PLAYBACK_STALLED = 4;
    readonly attribute unsigned short playbackState;

    // progress
    attribute double currentTime;
    readonly attribute DOMString lastMark;
    attribute Function onmark(in DOMString mark, in unsigned long time);
    attribute Function onplay();
    attribute Function onpause();
    attribute Function oncomplete();
    attribute Function oncancel();
};

Instantiation
A SpeechSynthesizer can use a variety of different underlying services. The specific service depends on how the object is constructed:
The SpeechSynthesizer() constructor uses the default synthesis service provided by the user agent. In some cases this may be an implementation that is local to the device, and limited by the capabilities of the device. In other cases it may be a remote service that is accessed over the network using mechanisms that are hidden from the application. The responsiveness, expressiveness, suitability for specific domains, and other characteristics of the default synthesis service will vary greatly between user agents and between devices.
The SpeechSynthesizer(DOMString selectionParams) constructor causes the user agent to use one of its built-in synthesizers, selecting the particular synthesizer that matches the parameters listed in selectionParams. This is a URL-encoded string of label-value pairs. It is up to the user agent to determine how closely it chooses to honor the requested parameters. Suggested labels and their corresponding values include:
The SpeechSynthesizer(XMLHttpRequest xhr) constructor uses an application-provided instance of XMLHttpRequest and the recommended HTTP conventions to access the speech synthesis service. When this constructor is used, the user agent should consider displaying a notification in its surrounding chrome to notify the user of the particular speech service that is being used.
In the event that additional service access mechanisms are designed and standardized, such as new objects for accessing new real-time communication protocols, an additional constructor could be added without changing the API.
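The following non-normative sketch shows the three constructor forms described above; the service URL and selection parameters are illustrative only.
var tts1 = new SpeechSynthesizer();                           // the user agent's default synthesizer
var tts2 = new SpeechSynthesizer("lang=en-au&gender=female"); // a built-in synthesizer matching these parameters
var xhr = new XMLHttpRequest();                               // a hypothetical REST synthesis service
xhr.open("POST", "https://tts.example.com/synthesize", true);
var tts3 = new SpeechSynthesizer(xhr);                        // synthesizer accessed through the XMLHttpRequest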
Error Handling
The speecherror event must be dispatched when an error occurs. See 5.4 SpeechError Interface.
Either raw text (DOMString uses UTF-16) or SSML can be synthesized. Applications provide content to be
synthesized by setting the src
attribute to a URL
from which the synthesizer service can fetch the content, or by assigning the content directly to
the content
attribute. Setting one of these
attributes resets the other to ECMA undefined.
Synthesized audio is buffered, along with timing information for mark events. In typical use,
an application will specify the content to be rendered, then call play()
, at which
point the synthesizer service will fetch the content and synthesize it to audio, which is
buffered by the user agent (along with timing info for mark events as they occur), and played to
the audio output device.
If the preload attribute is set, the synthesizer service will be invoked to begin synthesizing content into audio as soon as content is specified. Otherwise, if preload isn't set, the service won't be invoked until play() is called.
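For example, the following non-normative sketch shows both ways of supplying content; the SSML URL is illustrative only.
var confirmTts = new SpeechSynthesizer();
confirmTts.content = "Your flight has been booked.";  // raw text; setting content resets src
confirmTts.play();                                    // without preload, synthesis begins here

var welcome = new SpeechSynthesizer();
welcome.preload = true;                               // begin synthesizing as soon as content is specified
welcome.src = "https://www.example.com/prompts/welcome.ssml"; // illustrative SSML resource
// ... later, when the prompt is actually needed:
welcome.play();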
Applications can query the readyState attribute to determine the status of the buffer. For example, when loading a lengthy sequence of audio, an application may display some sort of feedback to the user, or disable some of the interaction controls, until the buffer is ready. There are four readyState values:
BUFFER_EMPTY (numeric value 0)
BUFFER_WAITING (numeric value 1)
BUFFER_LOADING (numeric value 2); applications can use the timeBuffered attribute to track progress.
BUFFER_COMPLETE (numeric value 3)
Some applications may also want to use the raw synthesized audio for other purposes. Provided the readyState is BUFFER_COMPLETE, applications can fetch the raw audio data from the audioBuffer attribute.
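As a non-normative illustration, the sketch below polls readyState until the buffer is complete and then retrieves the raw audio; the spec defines no buffering-progress event, so polling is used here, and the numeric value 3 corresponds to BUFFER_COMPLETE.
var tts = new SpeechSynthesizer();
tts.preload = true;
tts.content = "Thank you for calling. Please hold.";
var poll = setInterval(function () {
    if (tts.readyState === 3) {          // BUFFER_COMPLETE
        clearInterval(poll);
        var rawAudio = tts.audioBuffer;  // Stream containing the synthesized audio
        // e.g. hand rawAudio to an <audio> element or upload it elsewhere
    }
}, 100);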
Playback is initiated or resumed by calling play()
, and paused by calling
pause()
. The playbackState
attribute is useful
for applications to coordinate playback state with the rest of the application.
playbackState
has five values:
PLAYBACK_EMPTY (numeric value 0)
PLAYBACK_PAUSED (numeric value 1)
PLAYBACK_PLAYING (numeric value 2)
PLAYBACK_COMPLETE (numeric value 3)
PLAYBACK_STALLED (numeric value 4)
The current position of playback can be determined either by checking the currentTime attribute, which returns the time since the beginning of the buffer, in milliseconds; or by checking the lastMark attribute, which contains the label of the last mark reached in SSML prior to the current audio position. The same attributes can be used to seek to specific positions. For example, setting the mark attribute to the label value of a mark in the SSML content should move the playback to the corresponding position in the audio buffer.
Playback speed is controlled by setting rate
,
which has a default value of 1.0. Similarly, volume
controls the audio amplitude, and can be set to any
value between 0 and 100, with 50 as the default.
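A non-normative sketch of the playback and progress controls in use; the SSML URL and mark name are illustrative, and the onmark callback is described below.
var menu = new SpeechSynthesizer();
menu.src = "https://www.example.com/prompts/menu.ssml"; // SSML containing <mark name="options"/>
menu.rate = 1.2;     // slightly faster than normal
menu.volume = 75;    // louder than the default of 50
menu.onmark = function (mark, time) {
    if (mark === "options") showMenuOptions();  // hypothetical UI helper
};
menu.play();
// ... if the user asks to hear the prompt again:
menu.currentTime = 0;  // seek back to the beginning of the buffer
// ... if the user interrupts:
menu.pause();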
To cease all playback, clear the buffer, and cancel synthesis, call the
cancel()
function.
An application can respond to other key events. These are targeted at the SpeechSynthesizer object (or any object that inherits from it), do not bubble, and are not cancelable.
mark: raised when the audio position corresponds to an SSML mark. The particular mark that was reached can be determined by the argument to mark or by checking the mark attribute.
play: raised whenever the object begins or resumes sending audio to the speaker.
pause: raised whenever audio output is paused (typically in response to pause() being called).
complete: raised when the end of the audio buffer has been output to the speaker, and no more audio is left to be synthesized.
cancel: raised when the synthesis session has been cancelled, which means any outstanding transaction with the underlying service is discarded, and the buffer is emptied without being played.

[NoInterfaceObject]
interface SpeechError {
    const unsigned short ABORTED = 1;
    const unsigned short AUDIO = 2;
    const unsigned short NETWORK = 3;
    const unsigned short NOT_AUTHORIZED = 4;
    const unsigned short REJECTED_SPEECHSERVICE = 5;
    const unsigned short BAD_GRAMMAR = 6;
    const unsigned short BAD_SSML = 7;
    const unsigned short BAD_STATE = 8;
    readonly attribute unsigned short code;
    readonly attribute DOMString message;
};
The code
attribute must
return the appropriate code from the following list:
ABORTED (numeric value 1)
AUDIO (numeric value 2)
NETWORK (numeric value 3)
NOT_AUTHORIZED (numeric value 4)
REJECTED_SPEECHSERVICE (numeric value 5)
BAD_GRAMMAR (numeric value 6)
BAD_SSML (numeric value 7)
BAD_STATE (numeric value 8)
The message attribute must return an error message describing the details of the error encountered. The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
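For example, a non-normative sketch of an error handler that branches on the code; the recovery helpers are hypothetical, and the numeric values correspond to the list above.
synthesizer.onspeecherror = function (error) {
    switch (error.code) {
        case 3:  // NETWORK
            retryLater();            // hypothetical application recovery
            break;
        case 4:  // NOT_AUTHORIZED
            promptForCredentials();  // hypothetical application recovery
            break;
        default:
            console.log("speech error " + error.code + ": " + error.message); // debugging only
    }
};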
Multimodal Video Game Example
This example illustrates how an application like the multimodal video game example could be built.
<body onload="init()">
<!-- Assume lots of markup to layout the screen.
Somewhere in this markup is an image the user touches to cast
whatever spell they've selected with speech.
-->
<img src="castspell.jpg" onclick="onSpellcastTouched()" />
<script type="text/javascript">
// Assume lots of script for the game logic,
// which we'll abstract as an object called "gameController"
var recognizer;
var synthesizer;
var currentSpell; //remembers the name of the currently selected spell
function init() {
recognizer = new SpeechRecognizer("lang=en-au"); // English, preferably Australian
recognizer.speechtype = 2; //continuous
recognizer.defaultUI = false; //using custom in-game GUI
recognizer.grammars.add("spell-list.grxml");
recognizer.onspeechmatch = onReco;
recognizer.startSpeechInput();
synthesizer = new SpeechSynthesizer("lang=en-au&gender=female");
// assume lots of other script to initiate the game logic
currentSpell = "invisibility";
}
function onReco(results) {
if (results(0).confidence > 0.8) {
currentSpell = results(0).interpretation;
synthesizer.cancel(); //barge-in any existing output
synthesizer.content = results(0).interpretation + " spell armed";
synthesizer.play();
}
}
function onSpellcastTouched() {
gameController.castSpell(currentSpell);
}
</script>
</body>
HMIHY Flight Booking Example
This example illustrates how an application like the HMIHY flight booking example might be implemented, using a HTTP-based speech recognition service.
<script type="text/javascript">
var recognizer;
var xhr;
function onMicrophoneClicked() {
navigator.device.opencapture(onOpenCapture);
}
function onOpenCapture(captureDevice) {
xhr = new XMLHttpRequest();
xhr.open("POST","https://webrecognizer.contoso.com/reco",true);
recognizer = new SpeechRecognizer(xhr);
recognizer.maxnbest = 4;
recognizer.confidence = 0.2;
recognizer.grammars.add("http://www.contosoair.com/grammarlib/HMIHY.srgs");
var audioOptions = { duration: 10, // max duration in seconds
limit: 1, // only need one recording
mode: { type: "audio/amr-wb"} // use wideband amr encoding
};
var endpointParams = { sensitivity: 0.5, initialTimeout: 3000, endTimeout:500};
recognizer.setInputDevice(captureDevice, audioOptions, endpointParams);
recognizer.defaultUI = true; // use the built-in speech recognition UI
recognizer.onspeechmatch = onspeechmatch;
recognizer.startSpeechInput();
}
function onspeechmatch(results) {
var emma = results.responseEMMAXML;
switch (emma.getElementsByTagName("task")[0].textContent) {
case "flight-booking":
// assume code to display flight booking form fields
// ...
// If a field is present in the reco results, fill it in the form:
assignField(emma,"from");
assignField(emma,"to");
assignField(emma,"departdate");
assignField(emma,"departampm");
assignField(emma,"returndate");
assignField(emma,"returnampm");
assignField(emma,"numadults");
assignField(emma,"numkids");
break;
case "frequent-flyer-program":
// etc
}
}
function assignField(emma, fieldname) {
var elements = emma.getElementsByTagName(fieldname);
if (elements.length != 0) {
document.getElementById("id-" + fieldname).value = elements[0].textContent;
}
}
</script>
This section describes an approach that could be used to integrate speech as first-class functionality in HTML markup, making a variety of speech scenarios achievable by the novice developer, while still enabling experienced developers to create sophisticated speech applications. Where possible the user agent will provide some lowest-common-denominator speech support (either locally or by selecting a default remote speech service). For example, sometimes the speech recognition is tied to a specific HTMLElement, often an HTMLInputElement. In these cases it may be possible to create default grammars and default speech behaviors specific to that input using the other attributes of the input element (type, pattern, etc.). At other times the web developer may need to specify an application-specific grammar. This proposal supports both.
The basic idea behind the proposal is to create two new markup elements <reco>
and
<tts>
. The <reco>
element ties itself to its parent containing
HTML element. So <input type="text" pattern="[A-Za-z]{3}"><reco .../></input>
would define a recognition element that is tied to the enclosing input element. This structure would mean
that the reco element can inspect and use the basic information from the input element (like type and pattern)
to automatically provide some speech grammars and behaviors and default user interaction idioms.
One approach to the <reco>
element is to make HTMLInputElement and others allow
<reco>
as a child. Alternatively, if it is deemed too radical to allow HTMLInputElement
or other elements to take non-empty content, the <reco>
could be associated with the
HTMLInputElement in question using the for
attribute similar to the label
element
today in HTML 5. The rest of this section will assume the former approach, but the latter approach could work
as well if desired.
The API for the HTMLRecoElement is as follows.
interface HTMLRecoElement : HTMLElement {
    // type of speech
    const unsigned short SIMPLERECO = 0;
    const unsigned short COMPLEXRECO = 1;
    const unsigned short CONTINUOUSRECO = 2;
    attribute unsigned short speechtype;

    // type of autofill
    const unsigned short NOFILL = 0;
    const unsigned short FILLUTTERANCE = 1;
    const unsigned short FILLSEMANTIC = 2;
    attribute unsigned short autofill;

    // speech parameter attributes
    attribute boolean speechonfocus;
    attribute DOMString grammar;
    attribute short maxnbest;
    attribute long speechtimeout;
    attribute long completetimeout;
    attribute long incompletetimeout;
    attribute float confidence;
    attribute float sensitivity;
    attribute float speedvsaccuracy;

    // speech input event handler IDL attributes
    attribute Function onspeechmatch(in SpeechRecognitionResultCollection results);
    attribute Function onspeecherror(in SpeechError error);
    attribute Function onspeechnomatch();
    attribute Function onspeechnoinput();
    attribute Function onspeechstart();
    attribute Function onspeechend();

    // speech input methods
    void startSpeechInput();
    void stopSpeechInput();
    void cancelSpeechInput();
    void emulateSpeechInput(DOMString input);

    // service configuration
    void SetSpeechService(DOMString url, DOMString? lang, DOMString? parameters);
    void SetSpeechService(DOMString url, DOMString user, DOMString password, DOMString? lang, DOMString? params);
    void SetSpeechService(DOMString url, DOMString authHeader, Function onCustomAuth, DOMString? lang, DOMString? params);
    void SetCustomAuth(DOMString authValue);
    attribute DOMString speechservice;
    attribute DOMString speechparams;
    attribute DOMString authHeader;

    // speech response variables
    readonly attribute Stream capture;

    // states
    const unsigned short READY = 0;
    const unsigned short LISTENING = 1;
    const unsigned short WAITING = 2;
    readonly attribute unsigned short speechrecostate;
};
User agents must support <reco>
as a child of HTMLTextAreaElement and
HTMLInputElement. User agents may support <reco>
as a child of
other elements such as HTMLAnchorElement, HTMLImageElement, HTMLAreaElement, HTMLFormElement, HTMLFieldSetElement,
HTMLLabelElement, HTMLButtonElement, HTMLSelectElement, HTMLDataListElement, HTMLOptionElement.
The speechtype
attribute indicates the type of speech recognition that should
occur. The values must be one of the following values:
SIMPLERECO (numeric value 0)
SIMPLERECO means that the request for recognition must raise just one final speechmatch, speechnomatch, speechnoinput, or speecherror event and must not produce any interim events or results. If the speechservice is a remote service accessible over HTTP/HTTPS this means the recognition may be a simple request-response. SIMPLERECO must be the default value.
COMPLEXRECO (numeric value 1)
COMPLEXRECO means that the speech recognition request must produce all of the interim events (in addition to speechstart and speechend), as well as the final recognition result. COMPLEXRECO must only produce one final recognition result, nomatch, noinput, or error, as a COMPLEXRECO is still one utterance from the end user resulting in one result. Because audio must be streamed while results come back, a remote service doing COMPLEXRECO should use a more sophisticated paradigm than simple request-response, such as WebSockets.
CONTINUOUSRECO (numeric value 2)
CONTINUOUSRECO represents a conversation, dialogue, or dictation scenario where, in addition to interim events, numerous final recognition results must be able to be produced. A CONTINUOUSRECO speech interaction, once started, must not stop until the stopSpeechInput or cancelSpeechInput method is invoked.
The autofill attribute indicates whether action should be taken on the recognition automatically, and if so, upon what it should be based. Usually the default action will be to put the recognition result value into the parent element, although the exact details are specific to the parent element. The values must be one of the following values:
NOFILL (numeric value 0)
NOFILL means that upon a recognition match automatic behavior must not occur. NOFILL must be the default value.
FILLUTTERANCE (numeric value 1)
FILLUTTERANCE means that upon a recognition match the automatic behavior, if any, must prefer the use of the utterance of the recognition. For example, if the user says "San Francisco International Airport" and the semantic interpretation of the match is "SFO", the user agent uses "San Francisco International Airport" as its result in the default action.
FILLSEMANTIC (numeric value 2)
FILLSEMANTIC means that upon a recognition match the automatic behavior, if any, must prefer the use of the semantic interpretation of the recognition. This means if the user says "Er, um, I'd like 4 of those please" and the semantic interpretation is 4, the user agent uses 4 as its result in the default action.
The boolean speechonfocus
attribute, if true, specifies that
speech must automatically start upon the containing element receiving focus. If the attribute
is false then the speech must not start upon focus but must instead wait for an explicit
startSpeechInput
method call. The default value must be false.
The optional grammar attribute is a whitespace-separated list of URLs that give the addresses of one or more external application-specific grammars, e.g. "http://example.com/grammars/pizza-order.grxml". The attribute, if present, must contain a whitespace-separated list of valid non-empty URLs. Implementations of recognition systems should use the list of grammars to guide the speech recognizer. Implementations must support SRGS grammars and SISR annotations. Note that the order of the grammars must define a priority order used to resolve ties, where an earlier-listed grammar is considered higher priority.
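For instance, a non-normative sketch of setting the grammar list from script; the URLs are illustrative, and the first grammar has the higher priority.
var reco = document.getElementsByTagName("reco")[0];
reco.grammar = "https://www.example.com/grammars/pizza-order.grxml " +
               "https://www.example.com/grammars/common-commands.grxml";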
If the grammar attribute is absent the user agent should provide a reasonable context-aware default grammar. For instance, if the <reco> element is on an input element with a type and pattern attribute defined, then a user agent may use the same pattern to define a default grammar in the absence of an application-specific grammar attribute. Likewise, if the <reco> element is on a form element, then a default grammar may be constructed by taking the grammars (default or application specified) of all the child or descendent elements inside the form.
The optional maxnbest
attribute specifies that the implementation must not return a number of items greater than the maxnbest value. If the maxnbest is not set it must default to 1.
The optional speechtimeout
attribute specifies the time in milliseconds to wait
for start of speech, after which the audio capture must stop and a speechnoinput event must be
returned. If not set, the timeout is speech service dependent.
The optional completetimeout
attribute specifies the time in milliseconds the
recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for
too low confidence results), when the speech is a complete match of all active grammars. If not set, the timeout is
speech service dependent.
The optional incompletetimeout
attribute specifies the time in milliseconds the
recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for
too low confidence results), when the speech is an incomplete match (i.e., anything that is not a complete match) of
all active grammars. If not set, the timeout is implementation dependent.
The optional confidence
attribute specifies a confidence level. The recognition
service must reject any recognition result with a confidence less than the confidence level. The
confidence level must be a float between 0.0 and 1.0 inclusive and must have
a default value of 0.5.
The optional sensitivity
attribute specifies how sensitive the recognition
system should be to noise. The recognition service must treat a higher value as a request to be
more sensitive to noise. The sensitivity must be a float between 0.0 and 1.0 inclusive and
must have a default value of 0.5.
The optional speedvsaccuracy attribute specifies how much the recognition system should prioritize a speedy low-latency result versus how much it should prioritize getting the most accurate recognition. The recognition service must treat a higher value as a request to have a more accurate result and must treat a lower value as a request to have a faster response. The speedvsaccuracy must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.
The speechmatch
event must be dispatched when a set of
complete and valid utterances have been matched. This event must bubble and be cancelable. A
complete utterance ends when the implementation detects end-of-speech or if the stopSpeechInput()
method
was invoked. Note that if the recognition was of type SIMPLERECO
or COMPLEXRECO
there
must be only one speechmatch result returned and this must end the speech
recognition.
The default action associated with the speechmatch event may differ depending on the element with which the speech is associated. Often this results in a value being set or a selection being made based on the value of the autofill attribute and the corresponding most likely interpretation or utterance.
Some implementations may dispatch the change event for elements when their value changes. When the new value was obtained as the result of a speech input session, such implementations must dispatch the speechmatch event prior to the change event.
The speecherror event must be dispatched when the active speech input session resulted in an error. This error may be a result of a user agent denying the speech session, or parts of it, due to security or privacy issues, or may be the result of an error in the web author's specification of the speech request, or could be the result of an error in the recognition system. This event must bubble and be cancelable.
The speechnomatch
event must be raised when a complete
utterance has failed to match the active grammars, or has only matched with a confidence less than the specified
confidence value. This event must bubble and be cancelable. A complete utterance ends when the
implementation detects end-of-speech or if the stopSpeechInput()
method was invoked. Note that if the
recognition was of type SIMPLERECO
or COMPLEXRECO
there must be only
one speechnomatch returned and this must end the speech recognition.
The speechnoinput
event must be raised when the recognizer
has detected no speech and the speechtimeout has expired. This event must bubble and be
cancelable. Note that if the recognition was of type SIMPLERECO
or COMPLEXRECO
there
must be only one speechnoinput returned and this must end the speech
recognition.
The speechstart
event must be raised when the recognition
service detects that a user has started speaking. This event must bubble and be cancelable. This
event must not be raised if the speechtype was SIMPLERECO
but must be generated if the speechtype is either COMPLEXRECO
or CONTINUOUSRECO
.
The speechend
event must be raised when the recognition
service detects that a user has stopped speaking. This event must bubble and be cancelable. This
event must not be raised if the speechtype was SIMPLERECO
but must be generated if the speechtype is either COMPLEXRECO
or CONTINUOUSRECO
.
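To illustrate, a non-normative sketch that wires these events to a COMPLEXRECO interaction; the element id and UI helpers are hypothetical, and the numeric values correspond to the constants defined above.
var reco = document.getElementById("search-reco");  // a <reco> element
reco.speechtype = 1;  // COMPLEXRECO
reco.onspeechstart = function () { showListeningIndicator(true); };   // hypothetical UI helper
reco.onspeechend = function () { showListeningIndicator(false); };
reco.onspeechnomatch = function () { showHint("Sorry, I didn't catch that."); };
reco.onspeechmatch = function (results) { useResult(results(0)); };
if (reco.speechrecostate === 0) {  // READY
    reco.startSpeechInput();       // calling this in any other state raises a speecherror
}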
When the startSpeechInput()
method is invoked then a speech recognition turn
must be started. It is an error to call startSpeechInput()
when the speechrecostate
is anything but READY
and a user agent must raise a speecherror
should
this occur. Note that user agents should have privacy and security settings that specify whether scripted speech should be allowed, and user agents may prevent startSpeechInput() from succeeding when called on certain elements, applications, or sessions. If a recognition turn is not begun with the recognition
service then a speecherror
event must be raised.
When the stopSpeechInput()
method is invoked, if there was an active speech
input session and this element had initiated it, the user agent must gracefully stop the session
as if end-of-speech was detected. The user agent must perform speech recognition on audio that
has already been recorded and the relevant events must be fired if necessary. If there was no active speech input
session, if this element did not initiate the active speech input session or if end-of-speech was already detected,
this method must return without doing anything.
When the cancelSpeechInput()
method is invoked, if there was an active speech
input session and this element had initiated it, the user agent must abort the session, discard any pending/buffered
audio data and fire no events for the pending data. If there was no active speech input session or if this element
did not initiate it, this method must return without doing anything.
When the emulateSpeechInput(DOMString input)
method is invoked then the speech
service treats the input string as the text utterance that was spoken and returns the resulting recognition.
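For example, a test harness could drive the recognition path without any audio; this non-normative sketch assumes a <reco> element with the id shown.
var reco = document.getElementById("where_from_reco");
reco.onspeechmatch = function (results) {
    console.log(results(0).utterance, results(0).interpretation);
};
reco.emulateSpeechInput("from Seattle to Brisbane"); // treated as if the user had spoken it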
The capture attribute defines a readonly variable that is a stream that accumulates the user's speech as it occurs. If this stream is uploaded to a recognition service in either an XMLHttpRequest or WebSocket request, the contents should be updated as more speech becomes available.
The speechrecostate
readonly variable tracks the state of the recognition. At
the beginning the state is READY
meaning that the element is ready to start a
speech interaction. Upon starting the recognition turn the state changes to LISTENING
. Once the system has stopped capturing from the user, either due to the
speech system detecting the end of speech or the web application calling stopSpeechInput()
, but before the results have been returned the system
is in WAITING
. Once the system has received the results the state returns to
READY
.
The result object & collection are the same as the API in section 7.
The concept of having a media-player style of experience like those of <audio>
or
<video>
doesn't map well to speech synthesis scenarios.
Although speech synthesis does produce audio output, its use cases are rather different from those of
HTMLMediaElement subclasses <audio>
and <video>
. Listening to music or watching
a movie tends to be a relatively passive experience, where the media is a central purpose of the app: the media is the
content. Contrast this with speech synthesis, which tends to be used as a UI component that works with other UI
components to help a user interact with an app - the media isn't the content, it's just part of the UI used to access
the content. So while there are some common semantics between speech synthesis and media playback APIs (such as
volume, rate, play and pause controls), there are many differences. Synthesis apps tend to be very reactive, with the
media generated in response to user action (it's synthesis, not recording). And concepts like
<source>
and <track>
have no natural analogy in synthesis applications.
The design presented here borrows from the HTMLMediaElement for consistency where it makes sense to do so. But it is not a subclass of that interface, since many of the inherited semantics and usage patterns would be of peripheral value, or outright confusing with TTS.
interface HTMLTTSElement : HTMLElement {
    // error handling
    attribute Function onspeecherror(in SpeechError error);

    // service configuration
    void SetSpeechService(DOMString url, DOMString? lang, DOMString? parameters);
    void SetSpeechService(DOMString url, DOMString user, DOMString password, DOMString? lang, DOMString? params);
    void SetSpeechService(DOMString url, DOMString authHeader, Function onCustomAuth, DOMString? lang, DOMString? params);
    void SetCustomAuth(DOMString authValue);
    attribute DOMString speechservice;
    attribute DOMString speechparams;
    attribute DOMString authHeader;

    // content specification
    attribute DOMString src;
    attribute DOMString content;

    // audio buffering state
    const unsigned short BUFFER_EMPTY = 0;
    const unsigned short BUFFER_WAITING = 1;
    const unsigned short BUFFER_LOADING = 2;
    const unsigned short BUFFER_COMPLETE = 3;
    readonly attribute unsigned short readyState;
    attribute boolean preload;
    readonly attribute TimeRanges timeBuffered;
    readonly attribute Stream audioBuffer;

    // playback controls
    void play();
    void pause();
    void cancel();

    // playback state
    const unsigned short PLAYBACK_EMPTY = 0;
    const unsigned short PLAYBACK_PAUSED = 1;
    const unsigned short PLAYBACK_PLAYING = 2;
    const unsigned short PLAYBACK_COMPLETE = 3;
    const unsigned short PLAYBACK_STALLED = 4;
    readonly attribute unsigned short playbackState;
    attribute double rate;
    attribute unsigned short volume;
    attribute double currentTime;
    attribute DOMString mark;
    attribute Function onmark(in DOMString mark);
    attribute Function onplay();
    attribute Function onpause();
    attribute Function oncomplete();
    attribute Function oncancel();
};
The interface is called HTMLTTSElement and is represented in the markup as the <tts>
element.
The speecherror
event must be dispatched when
an error occurs. See 7.4 SpeechErrorInterface.
Either raw text (DOMString uses UTF-16) or SSML can be synthesized. Applications provide content to be synthesized by setting the
src
attribute to a URL from which the synthesizer service can fetch the content,
or by assigning the content directly to the content
attribute. Setting one of
these attributes resets the other to an empty string.
Synthesized audio is buffered, along with timing information for mark events. In typical use, an application will
specify the content to be rendered, then call play()
, at which point the synthesizer service will fetch
the content and synthesize it to audio, which is buffered by the user agent (along with timing info for mark events
as they occur), and played to the audio output device.
If the preload attribute is set, the synthesizer service will be invoked to begin synthesizing content into audio as soon as content is specified. Otherwise, if preload isn't set, the service won't be invoked until play() is called.
Applications can query the readyState attribute to determine the status of the buffer. For example, when loading a lengthy sequence of audio, an application may display some sort of feedback to the user, or disable some of the interaction controls, until the buffer is ready. There are four readyState values:
BUFFER_EMPTY (numeric value 0)
BUFFER_WAITING (numeric value 1)
BUFFER_LOADING (numeric value 2); applications can use the timeBuffered attribute to track progress.
BUFFER_COMPLETE (numeric value 3)
Some applications may also want to use the raw synthesized audio for other purposes. Provided the readyState is BUFFER_COMPLETE, applications can fetch the raw audio data from the audioBuffer attribute.
Playback is initiated or resumed by calling play()
, and paused by calling
pause()
. The playbackState
attribute is useful for applications to
coordinate playback state with the rest of the application. playbackState
has five
values:
PLAYBACK_EMPTY (numeric value 0)
PLAYBACK_PAUSED (numeric value 1)
PLAYBACK_PLAYING (numeric value 2)
PLAYBACK_COMPLETE (numeric value 3)
PLAYBACK_STALLED (numeric value 4)
The current position of playback can be determined either by checking the currentTime attribute, which returns the time since the beginning of the buffer, in milliseconds; or by checking the mark attribute, which contains the label of the last mark reached in SSML prior to the current audio position. The same attributes can be used to seek to specific positions. For example, setting the mark attribute to the label value of a mark in the SSML content should move the playback to the corresponding position in the audio buffer.
Playback speed is controlled by setting rate
, which has a default value of 1.0.
Similarly, volume
controls the audio amplitude, and can be set to any value
between 0 and 100, with 50 as the default.
To cease all playback, clear the buffer, and cancel synthesis, call the cancel()
function.
An application can respond to other key events:
mark: raised when the audio position corresponds to an SSML mark. The particular mark that was reached can be determined by the argument to mark or by checking the mark attribute.
play: raised whenever the object begins or resumes sending audio to the speaker.
pause: raised whenever audio output is paused (typically in response to pause() being called).
complete: raised when the end of the audio buffer has been output to the speaker, and no more audio is left to be synthesized.
cancel: raised when the synthesis session has been cancelled, which means any outstanding transaction with the underlying service is discarded, and the buffer is emptied without being played.

The user agent must provide a default speech service for both recognition and speech synthesis. However, due to the varying needs of the application, the varying capabilities of recognizers and synthesizers, the desire to keep application-specific intellectual property relating to voices or grammars or other technology, or the desire to provide a consistent user experience across different user agents, it must be possible to use speech services other than the default.
To use a specific service, the application calls SetSpeechService(url,
lang, params)
. The url
parameter specifies the service to be invoked for either recognition
or synthesis (see also 6.4 HTTP Conventions). The optional lang
parameter can be used
to specify the language the application desires. The service must use the language specified in
the standard content (such as specified in SRGS or SSML) if one is
specified. If the content doesn't specify a language and the language parameter is omitted or blank then the service
uses its own proprietary logic (such as a default setting, part of the service URL, examining the content, etc.). If
the lang
parameter is specified, it must be in standard language code format. If the service is unable
to use the language specified an error must be raised. Many services also accept additional
proprietary parameters that govern their operation. These can be supplied with the optional params
parameter, in the form of a string of URL-encoded name-value pairs.
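For example, a non-normative sketch of selecting a non-default recognition service; the URL, language, and parameters are illustrative only.
var reco = document.getElementById("search-reco");
reco.SetSpeechService(
    "https://speech.example.com/reco",          // hypothetical recognition service
    "en-AU",                                    // desired language, in standard language code format
    "acousticmodel=mobile&profile=shortquery"   // URL-encoded service-specific parameters
);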
Some services will require authentication using standard HTTP challenge/response. When this happens, the user
agent should invoke its regular authentication logic. Often this will result in a dialog box prompting the user for
a user name and password. In many cases, this is not a desirable user experience for the application, and can be
circumvented by providing the username
and password
beforehand as parameters to
SetSpeechService()
by calling SetSpeechService(url, user, password, lang, params)
.
Some services will use proprietary authentication schemes that require placing a special value in a proprietary
header. To specify such a service, applications should call SetSpeechService(url, authHeader, onCustomAuth,
lang, parameters)
, where authHeader
is the name of the proprietary header, and
onCustomAuth
is a function provided by the application. In this case, when the user agent invokes the
service, it should call the application function provided in onCustomAuth
and then wait until the
application calls SetCustomAuth(authValue)
, where
authValue
is the value to be assigned to the custom auth header, at which point the user agent should go
ahead and invoke the service. The reason for this call-back-and-wait logic is that a common web service authentication pattern involves using the current clock time as an input to the authentication token.
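A non-normative sketch of this call-back-and-wait flow; the header name, token computation, and element id are illustrative.
var reco = document.getElementById("search-reco");
reco.SetSpeechService(
    "https://speech.example.com/reco",  // hypothetical recognition service
    "X-Example-Auth",                   // proprietary authentication header
    function onCustomAuth() {
        // Called by the user agent just before it invokes the service; compute a
        // fresh token (e.g. based on the current time) and hand it back.
        var token = computeAuthToken(Date.now());  // hypothetical helper
        reco.SetCustomAuth(token);                 // the user agent now proceeds with the request
    },
    "en-AU",
    null
);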
The optional speechservice
attribute is a URL to the recognition service to be used. This attribute can either be set directly or through the
setSpeechService()
method calls. This recognition service could be local to the user agent or a remote
service separate from the user agent. The user agent must attempt to use the specified
recognition service. If the user agent cannot use the specified recognition service it must
notify the web application by raising a speecherror event and may attempt to use a different
recognition service. Note that user agents should have privacy and security settings that specify
if the speech should be allowed to be delivered to a specific recognition service and may prevent
certain speech services from being used. Note also that because running a high quality speech service is difficult it
is expected that many applications may want to use networked recognition services that are at a different domain than
the web application and user agents should enable this with the trusted recognition services. If
not present then the user agent must use its default recognition service.
The optional speechparams
attribute is designed to take
extensible and custom parameters particular to a given speech service. This attribute can either be set directly or
as a result of the setSpeechService()
method. For instance, one recognition service may require account
information for authorization while a different recognition service may allow recognizer specific tuning parameters
or recognizer specific acoustic models or recognizer context blocks to be specified. Because there may be many such
parameters the speechparams must be specified as a set of name=value pairs in a URI encoded query
parameters string which may use either ampersands or semicolons as pair separators.
The optional authHeader
attribute is used to set the authorization
header to be used with the speech service. This can either be set directly or through the
setSpeechService()
method.
A DOM application can use the hasFeature(feature, version)
method of the
DOMImplementation
interface with parameter values "SpeechInput" and "1.0" (respectively) to determine
whether or not this module is supported by the implementation.
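For example:
if (document.implementation.hasFeature("SpeechInput", "1.0")) {
    // The speech input markup and script APIs are available.
} else {
    // Fall back to keyboard and pointer input only.
}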
Implementations that don't support speech input will ignore the additional attributes and events defined in this module and the HTML elements with these attributes will continue to work with other forms of input.
This example illustrates how speech markup could be used to implement a web search page that uses speech.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=us-ascii" http-equiv="content-type" />
<title>Bing</title>
</head>
<body>
<form action="/search" id="sb_form" name="sb_form">
<input class="sw_qbox" id="sb_form_q" name="q" title="Enter your search term" type="text" value="" >
<reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.bing.com/speech" onspeechmatch="document.sb_form.submit()" />
</input>
<input class="sw_qbtn" id="sb_form_go" name="go" tabindex="0" title="Search" type="submit" value="" />
<input name="form" type="hidden" value="QBLH" />
</form>
</body>
</html>
This example illustrates how speech markup could be used to implement a simple flight booking form.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=us-ascii" http-equiv="content-type" />
<title>Contso Flight Booking</title>
</head>
<body>
<script>
function submit_page_if_full () {
if (
document.getElementById("where_from").value != "" &&
document.getElementById("where_to").value != "" &&
document.getElementById("departing").value != "" &&
document.getElementById("returning").value != ""
) {
document.flight_form.submit();
}
}
document.getElementById("flight_form").addEventListener("speechmatch", submit_page_if_full, false);
</script>
<form action="/search" id="flight_form" name="flight_form">
<input id="where_from" name="where_from" title="Where from?" type="text" value="">
<reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
</input>
<input id="where_to" name="where_to" title="Where to?" type="text" value="">
<reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
</input>
<input id="departing" name="departing" title="Departing" type="datetime-local" value="">
<reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
</input>
<input id="returning" name="returning" title="Returning" type="datetime-local" value="">
<reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
</input>
<input id="flight_form_go" name="go" title="Search" type="submit" value="" />
</form>
</body>
</html>
There are three sub-proposals in this specification, each of which is cumulative on the previous, and each of which is scored against the Scenario Examples, and against the HTML Speech XG Use Cases and Requirements:
A "four star" score is used to evaluate each sub-proposal against the requirements or scenarios:
This section outlines the use cases and requirements that are covered by this specification.
This proposal builds on some earlier proposals to the incubator group, including especially: the Speech Input API Specification from Satish Sampath and Bjorn Bringert; and the HTML TTS API Specification from Bjorn Bringert.