W3C

HTML Speech XG Speech API Proposal

Editor's Draft 28 February 2011

Latest Version:
Posted to the www-archive public mailing list on 1 March 2011: http://lists.w3.org/Archives/Public/www-archive/2011Mar/
Editors:
Michael Bodell, Microsoft
Robert Brown, Microsoft
Shane Landry, Microsoft

Abstract

This specification extends HTML to enable pages to incorporate speech recognition and synthesis. Firstly, it defines extensions to existing open web platform interfaces and objects that have general benefit across a range of scenarios, including speech recognition and synthesis. Secondly, it defines a set of speech-specific APIs that enable a richer set of speech semantics and user-agent-provided capabilities. Thirdly, it suggests a design approach for incorporating speech semantics into HTML markup. These three design stages do not all need to be stabilized and implemented simultaneously. For example, the first and second design stages could be stabilized and implemented well before the third.

This specification does not define any new protocols nor require any new protocols to be defined by the IETF or W3C. However, we are aware that more sophisticated protocols will enable a richer set of speech functionality. Hence the speech-specific portions of the API design proposal are designed to be as loosely-coupled to the underlying speech service delivery mechanism as possible (i.e. whether it is proprietary to the browser, delivered by an HTTP/REST service, or delivered by a future to-be-defined protocol).

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is an API proposal from Microsoft to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives). All feedback is encouraged.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

This section is non-normative.

The API design is presented in three conceptual layers, each of which builds upon the previous. The design approach is to:

  1. First, make a small set of enhancements to implemented or proposed mechanisms in the existing open web platform that will enable many of the HTML Speech XG's requirements to be satisfied without having to inject special speech semantics into HTML. These enhancements are not specific to speech, but enable speech semantics to be implemented in web applications using familiar programming practices and objects. Furthermore, the design changes are broadly applicable to multimodal scenarios in general, not just speech. We consider this a necessary foundation: it satisfies many of the central use cases, enables the HTML equivalent of many widely used speech applications, solves problems shared across a number of scenarios, and does so without introducing designs that diverge from what is already on the web.
  2. Second, introduce scriptable speech objects that implement a rich set of speech semantics. These objects are designed in such a way that they are resilient to changes in the underlying implementation. In this proposal, the objects can utilize the enhancements specified in #1, as well as built-in speech services provided by the user-agent (which may be local to the device, or accessed over the web via proprietary mechanisms, at the user agent's discretion). When future work between the W3C and IETF results in other mechanisms for accessing speech services, these too could be implemented without changing the surface interfaces of the speech objects. We consider these APIs an enabling convenience for HTML speech developers since they enable a powerful set of options for application developers and user agents to deliver on a broad set of speech scenarios.
  3. Third, introduce speech-specific markup tags and attributes to expose speech features as first class citizens in an HTML application. The chief benefit is to make basic speech scenarios very achievable for less experienced developers, while still allowing a more experienced speech developer the ability to customize the speech experience and to enable truly compelling speech-enabled HTML applications. The design principle is to make the easy, easy and the hard, possible while minimizing any discontinuity in moving from the basic to the advanced. This enables a developer to start off with a basic application and evolve it into having more complex interactions without needing to rewrite the application. We consider the markup approach to have considerable value, but it is also the most intrusive to current HTML designs, and if not pursued immediately, should be strongly investigated in a longer term project.

The Speech APIs presented attempt to enable both basic and advanced speech applications. This means that there needs to be an acceptable default speech experience provided by the user agent for basic applications. However, given the large investment that goes into the construction and tuning of grammars and acoustic models, and the proprietary nature of statistical language model grammars, it is also necessary to allow web developers to choose their own speech recognition service. This proposal enables both of these scenarios.

While the interaction pattern of many applications will enable speech services in response to the end user clicking a button to start speech, or changing the focus to an input field and starting speech, there are other applications that will be doing speech recognition more globally without that pattern (for example, a page that visually displays a map but still allows the user to say "zoom in" or "zoom out"). This proposal supports both these interaction patterns.

In addition to speech input, audio output including text-to-speech synthesis is often necessary as part of a compelling speech experience. The playback of the audio may need to be coordinated with other output (such as changing the visual user experience when a certain word is synthesized) or with the speech input (such as stopping the output in response to the start of speech, i.e., barge-in). This proposal supports both of these use cases as well.

This proposal reuses a number of W3C speech standards including [SSML], [SRGS], [SISR], and [EMMA].

3 Scope

This section is non-normative.

This specification is mainly limited to extending existing HTML interfaces, objects, and elements. The proposal also provides new markup elements with attributes and the corresponding object model. Where possible, existing standards are used: speech recognition grammars are specified using SRGS, semantic interpretation of hypotheses is specified using SISR, recognition results are represented with EMMA, and TTS synthesis is specified using SSML.

The scope of this specification does not include providing any new markup languages or new low level protocols.

4 Security and privacy considerations

User agents must thoughtfully balance the needs of the web application to have access to the user's voice with the user's expectation of privacy. User agents should provide a mechanism for the user to allow a web application to be trusted with the speech input. It is up to the user agent at what granularity it makes sense to provide this authorization, since the appropriate authorization and user experience provided by a web browser may differ depending upon the deployment scenario. Such scenarios may include, but are not limited to, the following:

  1. The user agent may allow all access as part of installation, or as part of user configuration or as a result of previous dialogs with the user.
  2. It may allow only certain domains to have access to the speech input.
  3. It may allow access only after initiating a dialog with the user once per domain or once per page.
  4. It may require a user dialog on each and every access of speech input.
  5. It may use whatever other security and privacy settings that it deems most appropriate.

A user agent should provide the user with some indication that speech is being captured and should provide the user with some way to deny or revoke a speech capture.

Two examples are illustrated here. The first example illustrates an approach where the user agent interacts with the user to authorize microphone input to pages from a specific domain ("This site is voice enabled..."). The second example illustrates a global choice ("Do you want to control the web with your voice..."), which is conceivably more appropriate on some devices with more constrained opportunities for manual interaction. Some user agents may wish to persist the authorization decision indefinitely, whereas others may ask for re-authorization after a specific event or period. The key point is that this is an important user agent design decision, where the design parameters may vary from one agent to the next. The API should place no restriction on these design options, and there should be no particular microphone consent design assumptions in the API.

Example 1: Site Authorization
Site level authorization

Example 2: Global Authorization
Global level authorization

5 Scenario Examples

This proposal is sufficient to enable a wide variety of scenarios, including, among others:

  1. Search, where an utterance is captured at the client, and recognized by a remote service using large and sophisticated models of the Web.
  2. Form Filling, where utterances are captured to fill in individual fields on a form.
  3. How May I Help You (HMIHY) or Intent-to-Action applications, where a user expresses their need using natural language, and a sophisticated model infers the appropriate task and context.
  4. Dialog applications, where a user interacts over multiple spoken dialog turns.
  5. Multimodal, where a user provides input with any of touch, speech recognition, or video recognition.

Examples of each of these scenario types follow.

It should also be noted that although the illustrations show a button on the page for invoking speech recognition, this will not always be the case. Some devices will have a hardware button for invoking recognition, and some user agents may choose to invoke recognition via a button in the chrome. Some apps, such as those designed for use while driving, or in a living room with open microphone ("10-foot" apps), will not have a button at all.

5.1 Web Search Example

Web search is typical of the current trend in mobile speech applications. The user interface design is deceptively simple: the user states their search term, and the application presents a list of potential answers. However, the application is more complicated than it appears. The language model for the search terms is completely dependent on the knowledge of the back-end search engine (which isn't available to the user agent), and tends to be enormous and continually changing (hence inappropriate to be included in the page itself). Furthermore, the user doesn't want a transcription of what they said - they actually want the resource they've requested, and speech is just a by-product of the process. In addition to the user's utterance, the search engine will incorporate a number of critical data points in order to achieve this, such as device capabilities, GPS coordinates, camera input, etc. At its simplest, the output may be a list of web links. But this is a rapidly evolving application type, and where possible, much richer experiences will be provided.

Step 1: User Initiates a Search
User initiates search
  1. User launches search application
  2. User taps or selects the microphone button

Step 2: Begin Capture
User begins capture

  1. Voice capture listening experience is started
  2. User hears a sound to signify the start of listening
  3. VU meter is displayed and reacts in real-time to audio input
  4. User states search term as audio input
  5. User has the option to cancel the capture or set their own capture end-point
  6. Voice capture can auto end-point capture based on a number of factors
  7. At the end of the capture session a stop listening sound is played
Step 3: Remote Recognition
Remote recognition occurs
  1. Animation is displayed while cloud recognition is performed
  2. A sound is played to complement the animation
  3. User can cancel the action and return to the previous screen

Step 4: Results Display
Results are displayed
  1. A results sound is played to signify the return of results
  2. Recognized text for the best reco match is displayed as editable text in the search box
  3. Search results for the reco text are displayed

5.2 GUI Form Filling Example (Flight Booking)

This sort of application occurs when a developer tries to layer speech on top of an existing design. This is far from "good" speech application design, but is still valuable when a developer wants to provide convenient input options on devices that do not have effective keyboards; or when a developer wants a "cheap" approach to speech-enabling their application without redesigning the user interface.

Step 1: User Initiates Input Into the First Field
User taps first field
  1. User taps or selects microphone button in a single form field

Step 2: Speech Capture & Transcription
User says Springfield
  1. Voice capture listening experience is started in-line with the first form field
  2. User hears a sound to signify the start of listening
  3. VU meter is displayed and reacts in real-time to audio input
  4. User states location as audio input
  5. User has the option to set their own capture end-point
  6. Voice capture can auto end-point capture based on a number of factors
  7. At the end of the capture session a stop listening sound is played
  8. Result is delivered as text to the form field
Step 3: User Initiates Input in the Next Field
User taps next field
  1. User taps or selects microphone button in a single form field

Step 4: Speech Capture & Transcription
User says Detroit
  1. Voice capture listening experience is started in-line with the next form field
  2. User hears a sound to signify the start of listening
  3. VU meter is displayed and reacts in real-time to audio input
  4. User states location as audio input
  5. User has the option to set their own capture end-point
  6. Voice capture can auto end-point capture based on a number of factors
  7. At the end of the capture session a stop listening sound is played
  8. Result is delivered as text to the form field
Step N: Form Complete
The form is complete
  1. After many repeats of this sequence, the user has finally filled in the form. The experience is cumbersome, but if it was on a device without a keyboard, it was probably better than the alternative.
  2. Once complete the form can be edited with text input, form item selection, or audio input

5.3 Intent-to-Action/HMIHY Example (Flight Booking)

In this sort of application, the user states their intent using natural language, and the speech system determines what task they are trying to perform, as well as extracting pertinent data for that task from the user's utterance.

Step 1: User Initiates a Request
User initiates request
  1. User taps or selects microphone button

Step 2: User States Natural Language Intent
User says Springfield to Detroit on the morning of December 11th
  1. Voice capture listening experience is started in-line within the form field
  2. User hears a sound to signify the start of listening
  3. VU meter is displayed and reacts in real-time to audio input
  4. User states trip information as a single utterance as audio input
  5. User has the option to set their own capture end-point
  6. Voice capture can auto end-point capture based on a number of factors
  7. At the end of the capture session a stop listening sound is played
Step 3: Appropriate Action is Taken
System takes appropriate action
  1. Once complete the utterance is displayed as a form that can be edited with text input, form item selection, or audio input

5.4 Dialog Example (Driving Directions)

Driving directions are a good example of a class of applications in which the user and application talk to each other, using recognition and synthesis, over a number of dialog turns. These sorts of applications tend to be always-listening (i.e. open-mic) without requiring the user to manually initiate speech input for each utterance, and make use of barge-in to terminate speech output, and pick from lists of options.

Step 1: User Initiates Input of Destination
User taps destination
  1. User taps or selects microphone button for the destination/to field

Step 2: User Says Destination Name
User says city farmer's market
  1. Voice capture listening experience is started
  2. User hears a sound to signify the start of listening
  3. VU meter is displayed and reacts in real-time to audio input
  4. User states destination as audio input
  5. User has the option to cancel the capture or set their own capture end-point
  6. Voice capture can auto end-point capture based on a number of factors
  7. At the end of the capture session a stop listening sound is played
Step 3: Cloud Determines Location
System determines location
  1. Animation is displayed while cloud recognition is performed
  2. A sound is played to complement the animation
  3. User can cancel the action and return to the previous screen

Step 4: New Map, Directions and Spoken Summary
New map and directions appear
  1. Driving directions are displayed
  2. Application announces destination and estimated travel time
Step 5: User Provides More Instruction
User says more instructions
  1. Application is always listening for audio input from user
  2. User states a modification to the provided directions
  3. VU meter is displayed on screen to signify to user that audio input is being captured

Step 6: Application Recalculates and Speaks Confirmation
System recalculates route and speaks changes
  1. Updated driving directions are displayed
  2. Application announces destination modification
Step 7: App Issues Turn-By-Turn Instructions
Application speaks turn by turn directions
  1. Application announces first direction for user to navigate based on physical location context

5.5 Multimodal Example (Video Game)

There are many applications where speech input and output combine with tactile input and visual output, to provide a more natural experience while reducing display clutter and manual input complexity.

Step 1: User Switches Modes
User switches video game mode
  1. User is playing a game
  2. User is using one finger to navigate using the touchscreen
  3. User is using another finger to perform game actions (attack or defend)
  4. Application is always listening
  5. User states inventory item as audio input

Step 2: Game Confirms New Mode
System confirms new spell
  1. Application announces focus has been switched to user input inventory item
  2. Inventory item is switched
  3. User can now use that item via touch
  4. User is still using two fingers on the touchscreen
Step 3: User Takes Action in the New Mode
User takes action based on previous speech
  1. User takes action with inventory item
  2. User is still using two fingers on the touchscreen

6 Basic Extensions to Existing HTML Designs

The most fundamental work we advocate is to make a minor set of changes to:

  1. Media Capture API [CAPTUREAPI], in order to enable capture and streaming of microphone input to a speech recognizer.
  2. XMLHttpRequest Level 2 [XHR2], in order to enable streaming of microphone input to a remote speech recognizer, and the fetching of rich data, such as audio from a remote speech synthesizer.

This work is built on in the Speech Object Interfaces section, where the speech interfaces can consume these objects, utilize built-in microphone and speech services provided by the user agent, or use alternative speech service protocols that are yet to be determined, in any combination.

An example flow of events is illustrated in this sequence diagram:

sequence diagram of speech system

6.1 Capture API Extensions

The Media Capture API draft defines an API for accessing the audio, image and video capture capabilities of a device. With some enhancements, the design forms a strong basis for capturing audio input from a microphone to be used in speech recognition, as well as other scenarios such as video capture for streaming to social media sites. Reuse of an existing microphone API design such as this is preferable to defining an alternative speech-specific design. The specific security/privacy requirements of speech are assumed to be similar to those of general microphone or video capture, and any specific functional requirements for speech recognition can be added to the API without breaking other semantics.

We suggest the following modifications to the Media Capture API. Note that since the same API applies to capturing audio, video, and image input, some of the changes don't directly pertain to speech, but are included for completeness and consistency.

Privacy of supported* API
In the current design, privacy-sensitive information about the device can be leaked to an application because the navigator.device.capture.supported* properties can be accessed without user intervention. Proposed change: navigator.device.openCapture() returns asynchronously with the capture device object if and only if the user allows access through a UA-defined mechanism such as those described in Security.
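The consent-gated behavior can be sketched with a minimal model. This is an illustration only, not the proposed API: the real openCapture() returns asynchronously (the model below is synchronous for clarity), and the consent check, the error code, and all names other than openCapture are assumptions.

```javascript
// Minimal model of the proposed consent gate. The capture device
// object is handed to the page only after the user agent's own
// authorization mechanism (dialog, per-domain grant, etc.) succeeds,
// so no supported* device properties leak without user consent.
// `userGrantsAccess` stands in for that UA-defined mechanism.
function openCapture(userGrantsAccess, successCB, errorCB) {
  if (userGrantsAccess()) {
    // Only now does the page learn anything about the device.
    successCB({ supportedAudioModes: ["pcm-16khz-mono"] });
  } else {
    errorCB({ code: "PERMISSION_DENIED" });
  }
}

// Usage: the page sees either a capture device or a denial, nothing else.
openCapture(
  function () { return true; },                        // UA-defined consent check
  function (capture) { console.log("device opened"); },
  function (err) { console.log("denied: " + err.code); }
);
```

The point of the gate is that the supported* configuration data is only reachable through the object delivered after authorization.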
Multiple Devices
The current W3C design doesn't support multiple devices (especially where they support different formats). Proposed change: navigator.device.openCapture() returns asynchronously with the capture device the user prefers (through preference or UI).
Direct Capture
The Capture.capture[Image|Video|Audio] operations launch an asynchronous UI that returns one or more captures. This means the user has to do something in the webapp to launch the UI and then do the capture, which makes it impossible to build capture UI directly into the web application. Not only would this be unusable for a speech recognition application, but it also places unnecessary user interface constraints on other media capture scenarios. Proposed change: the Capture API should directly capture from the device and return a Blob. An application can control the duration and manage multiple captures. If a web app wants a picker interface with capture, the HTML Media Capture extensions to <input type="file"> provide that support (this sort of scenario is unlikely for speech recognition, but more common in other media capture scenarios). Note that for privacy reasons some user agents will choose to display some notification in their surrounding chrome or hardware to make it readily apparent to the user that capture is occurring, together with the option to cancel the capture.
Streaming
For speech recognition, captured audio needs to be sent directly to the recognition service. However, the current design only supports capturing to Blobs, which are not useful for speech scenarios. Proposed change: Starting a capture asynchronously returns a Stream object containing the captured data. This change would also be useful in video recording scenarios. For example, using a capture stream, an app could stream a recording to a video sharing site, as it is recorded.
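The streaming proposal can be modeled with a toy push-based stream. The names startCapture, onData, and push are illustrative assumptions, not the proposed API, and a real user agent would deliver microphone buffers rather than strings.

```javascript
// Toy model of stream-based capture: starting a capture returns a
// stream object immediately, and audio chunks are delivered to any
// attached reader as they are "recorded", instead of being buffered
// into a single Blob that only becomes available when capture ends.
function startCapture() {
  const listeners = [];
  return {
    onData: function (cb) { listeners.push(cb); },
    // Stand-in for the user agent pushing captured microphone buffers.
    push: function (chunk) {
      listeners.forEach(function (cb) { cb(chunk); });
    }
  };
}

// A recognizer client can forward each chunk as it arrives (e.g. over
// an XMLHttpRequest Level 2 upload) instead of waiting for the end.
const stream = startCapture();
const sent = [];
stream.onData(function (chunk) { sent.push(chunk); });
stream.push("audio-frame-1");
stream.push("audio-frame-2");
```

This is the property the proposed change provides: the consumer observes audio incrementally, which is what low-latency recognition requires.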
Preview
In the case of video capture, live preview within the application is important. Proposed change: a URL reference to the capture Stream object can be created with URL.createObjectURL(). This URL can then be used as the value for the src attribute on an <audio> or <video> element. (Although this particular change is motivated by the need to preview video, it is reasonable to conceive of applications that combine both speech and image recognition.)
End-Pointing
For speech recognition, it's important to know when the user starts and stops talking. For example, if the app starts recording but the user doesn't start talking, the app may wish to indicate that it can't hear the user. More importantly, when the user stops talking, the app will generally want to stop recording, and transition into working on the recognition results. This sort of capability may also be of some use during non-speech scenarios, to provide prompts to users who are recording videos. Proposed solution: add attributes to set the sensitivity level for voice/non-voice detection, and the required period of silence (or at least, non-speech noise) before the user is determined to be not speaking. This timeout value will vary from scenario to scenario, and so needs to be settable by the app. The specific signal processing algorithm to be used is at the discretion of the device or UA implementor. Some devices may have access to onboard silicon or software that provides a sophisticated and highly reliable measure. Others may rely on very simple filtering.

The following IDL shows conceptual additions to the Media Capture API that would satisfy the changes outlined above. The specific IDL is not a formal proposal, but indicative of a viable approach.

      ... [Use all IDL from Media Capture API, with the following modifications and additions.]

      // StoppableOperation is like PendingOperation, but the app can stop it, rather than having to wait for built-in UI.
      interface StoppableOperation {
          void cancel();
          void stop();
      };

      // end-point parameters
      interface EndPointParams {
          attribute float sensitivity; // 0.0-1.0
          attribute unsigned long initialTimeout;  //milliseconds
          attribute unsigned long endTimeout;  //milliseconds
      };

      // end-point call-back
      interface EndPointCB {
          const unsigned short INITIAL_SILENCE_DETECTED = 0;
          const unsigned short SPEECH_DETECTED          = 1;
          const unsigned short END_SILENCE_DETECTED     = 2;
          const unsigned short NOISE_DETECTED           = 3;
          const unsigned short NONSPEECH_TIMEOUT        = 4;
          void endpoint(in unsigned short endtype);
      };

      // preview call-back
      interface PreviewCB {
          void onPreview(in Stream previewStream);
      };

      // modification of existing Capture interface
      interface Capture {  

          readonly attribute ConfigurationData[] supportedImageModes;
          readonly attribute ConfigurationData[] supportedVideoModes;
          readonly attribute ConfigurationData[] supportedAudioModes;
          
          PendingOperation captureImage (in CaptureCB successCB, in optional CaptureErrorCB errorCB, in optional CaptureImageOptions options);

          // Use StoppableOperation rather than PendingOperation for audio & video recordings
          // Additional end-pointing parameters to respond to speech in audio & video recordings
          StoppableOperation captureAudio (in CaptureCB successCB, 
                                           in optional CaptureErrorCB errorCB, 
                                           in optional CaptureAudioOptions options, 
                                           in optional EndPointCB endCB, 
                                           in optional EndPointParams endparams);
          StoppableOperation captureVideo (in CaptureCB successCB, 
                                           in optional CaptureErrorCB errorCB, 
                                           in optional CaptureVideoOptions options, 
                                           in optional EndPointCB endCB, 
                                           in optional EndPointParams endparams);

          // The preview() function is separate from the actual record() function
          // because preview quality & format will be different (usually device specific)
          void preview(in PreviewCB previewCB);
        };

    

The StoppableOperation interface is returned by the captureAudio() and captureVideo() methods on the Capture interface. It is similar to a PendingOperation, with the addition of a stop() function that the application can call to stop & finish a recording.

The EndPointParams interface is used to define the settings used to detect the beginning and end of speech during an audio recording. The sensitivity attribute can be set to a value between 0.0 (least sensitive) and 1.0 (most sensitive), and has a default value of 0.5. The endTimeout attribute specifies the time, in milliseconds, that the user agent should wait after the user stops speaking before declaring end of speech. It has a default value of 400. The initialTimeout attribute specifies the time the user agent should wait for speech to be detected after recording begins, before declaring that initial silence has occurred. It is measured in milliseconds and has a default value of 3,000. The specific end-pointing algorithm a user agent uses depends on the capabilities of the hardware device and user agent, so further fine-grained parameters are not presented. The user agent is also free to ignore these settings if they are not appropriate to the local implementation. It is also conceivable that some applications or devices may have access to speech recognition technology with more sophisticated end-pointing capabilities, in which case they may choose not to use the Capture API's end-pointing at all.
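The defaults described above can be collected in a small helper. Only the default values (sensitivity 0.5, initialTimeout 3,000 ms, endTimeout 400 ms) come from this section; the helper function and its name are illustrative.

```javascript
// Fill in the EndPointParams defaults described in the text.
// The helper is a sketch; only the default values are from the proposal.
function withEndPointDefaults(params) {
  params = params || {};
  return {
    // 0.0 = least sensitive, 1.0 = most sensitive
    sensitivity:    params.sensitivity    !== undefined ? params.sensitivity    : 0.5,
    // ms to wait for speech before initial silence is declared
    initialTimeout: params.initialTimeout !== undefined ? params.initialTimeout : 3000,
    // ms of silence after speech before end of speech is declared
    endTimeout:     params.endTimeout     !== undefined ? params.endTimeout     : 400
  };
}
```

An application that only cares about the end-of-speech timeout, for example, would pass `{ endTimeout: 800 }` and rely on the defaults for the rest.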

The EndPointCB interface defines the endpoint() callback method that the user agent calls to notify the application of an end-point event. This callback provides a numeric value that indicates the type of end-point that has occurred:

INITIAL_SILENCE_DETECTED (numeric value 0)
Indicates the user has not spoken at all since the beginning of the recording, for the period specified in initialTimeout. The user agent continues recording after this event. The event allows the application to take appropriate action, such as prompting the user or cancelling the recording.
SPEECH_DETECTED (numeric value 1)
Indicates that the user has started speaking. Many apps will use this event to provide feedback to the user that the user is being heard.
END_SILENCE_DETECTED (numeric value 2)
Indicates that the user has stopped speaking for at least the period specified in endTimeout.
NOISE_DETECTED (numeric value 3)
Indicates that the speech detection algorithm has determined the microphone input is too noisy to reliably detect the presence or absence of the user's speech.
NONSPEECH_TIMEOUT (numeric value 4)
Indicates that the user has not spoken at all since the beginning of the recording, for an extended period. The period is defined by the user agent since it may depend on the context in which the device is used. Applications could assume that either the user really doesn't want to talk, or that an environmental factor is preventing them from being heard (e.g. microphone unplugged), and take alternative action.

The Capture interface is already defined in the Media Capture API draft. We suggest that the captureAudio() and captureVideo() methods be adjusted to return a StoppableOperation so they can be stopped by the app, and to optionally accept EndPointCB and EndPointParams arguments, so that they can provide end-point events back to the app. We also suggest the addition of a preview() method that provides a video-only Stream (via a call-back) that can be used in conjunction with a <video> element to provide an in-page view of what the camera can see.

Usage Example for Capture API

For this example, imagine a multimodal application where the user points their cell phone camera at an object, and uses voice to issue some query about that object (such as "what is this", "where can I buy one", or "do these also come in pink"). The application needs to preview the image using a video stream, listen for a voice query, and take a photo. Assume the voice and image input are dispatched to appropriate speech and image processing services.

Step 1: Point and Speak

The user points the camera and says "where can I buy that":

  1. The user taps or selects the microphone button or a designated hardware key to initiate speech recognition.
  2. The user uses the built-in viewfinder on the page to point their camera at an interesting object, while asking a question about the object.
  3. The application takes a still shot and the user's speech utterance, and processes the query.

Step 2: Use Voice to Further Refine

The system returns the search results from the multimodal input and allows further refinement by speech:

  1. The application displays possible results.
  2. The user continues to use speech to further refine the results.

The sample code uses the Capture API to provide microphone input as an audio stream to the recognizer; send a camera video stream to an in-page <video> element as a view-finder; and then take a photograph when the user asks a question.

  

  <body onload="init()">
    <div id="message"></div>
    <video id="viewfinder" width="480" height="600" type="video/mp4"></video>

    <script type="text/javascript">

      var captureDevice;
      var recordingSession;
      var message;

      function init() {
        message = document.getElementById("message");
        navigator.device.opencapture(onOpenCapture);
      }

      // opencapture() results in a callback when the UA has a device the user authorized for capture.
      function onOpenCapture(device) {
        captureDevice = device;

        // start previewing video
        captureDevice.preview(onPreview);

        // start listening
        var audioOptions = {  duration: 15, // max duration in seconds
                              limit: 1,     // only need one recording
                              mode: { type: "audio/x-wav"}  // no need to specify width & height
                           };
        var endpointParams = { sensitivity: 0.5, endTimeout: 300, initialTimeout: 5000 };
        recordingSession = captureDevice.captureAudio(onRecordStarted, onFail, audioOptions, onEndpoint, endpointParams);
      }
  
      // when we're given the video preview stream, feed it to the <video> element
      function onPreview(previewDevice) {
        document.getElementById("viewfinder").src = window.URL.createObjectURL(previewDevice);
      }

      function onRecordStarted(stream) {
        //place-holder: send stream to recognizer...
      }

      function onFail(error) {
        message.innerHTML = error.toString();
        recordingSession = null;
      }

      function onEndpoint(endevent) {
        switch (endevent) {
          case 0: //initial silence
              message.innerHTML = "please start speaking";  
              break;
  
          case 1: //started speaking
              message.innerHTML = "listening...";
              break;
  
          case 2: //finished speaking
              recordingSession.stop();
              // presumably the recognizer will fire a reco event sometime soon
              message.innerHTML = "processing...";
              captureDevice.captureImage(onPhotoTaken);
              break;
  
          case 3: //noise
              message.innerHTML = "can't hear you, too noisy";
              break;
  
          case 4: //extended non-speech
              message.innerHTML = "giving up...";
              recordingSession.cancel();
              recordingSession = null;
              break;

          default: break;
        }
      }

      function onPhotoTaken() {
        // place-holder: send it to the image processing service
      }

    </script>
  </body>
  
  

6.2 Streams

We propose the addition of a Stream type. While this document does not present a detailed design for this type, we assume a Stream is an object that:

  1. Has a content type;
  2. Has unspecified length;
  3. Can generally be used in the same places a Blob can be used, for example URL.createObjectURL().
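As a sketch of point 3, the helper below hands a capture Stream to a media element exactly as one would hand it a Blob today. This assumes the proposed Stream type is accepted by URL.createObjectURL(); the function is purely illustrative and not normative.

```javascript
// Sketch only: assumes the proposed Stream type is accepted by
// URL.createObjectURL() in the same way a Blob is.
function attachStreamToVideo(stream, videoElement) {
  // Works identically for Blob today and, under this proposal, for Stream.
  videoElement.src = window.URL.createObjectURL(stream);
}
```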

6.3 XHR L2 Extensions

XMLHttpRequest Level 2 (XHR2) is a key enabling HTML API, used extensively by applications to access remote services and resources. It has undergone continual enhancement since its inception in 1999, as a result of its fundamental value to apps and the evolving needs of those apps. It is an established and successful design.

One of the key functional improvements in XHR2 over previous versions of XHR is the ability to make cross-domain requests. This is particularly important to speech, since many speech applications will not want to run their own speech recognition or synthesis web services, and there will likely be some speech recognition services offered to a variety of applications on different domains. Fortunately, XHR2 follows the Cross-Origin Resource Sharing specification and enables this scenario with no changes.

The following enhancements are suggested in order to satisfy speech-related scenarios:

Streaming
XHR2 already has the ability to send a data Blob as a chunked-transfer stream, which is exposed via the send(Blob data) method. For the purposes of speech recognition, it is important to send the audio stream as it is captured. However, the semantics of Blob imply that sending could only commence once capture is complete. Proposed change: add a send(Stream stream) method so that the microphone capture stream can be sent to the recognition service without delay. This enhancement also helps with the related scenario of uploading video recordings to video sharing sites in real time.
Multipart Responses
For apps with simple interaction models, such as one-field-at-a-time form filling, or dictation of a message, it is sufficient for the response to a recognition request to simply be the recognition result. However, in many speech recognition apps, recognition is not the end, but just the means to an end. If a user of an app speaks a request, what they really want is the resource or action they requested, not a transcription of what they said. Speech recognition services will use complex heuristics that cannot be easily and efficiently duplicated in the app, either because they are proprietary, data-intensive, network-intensive, or too complicated to practically express in script on a page. For this reason, many speech services will also return additional information. Even simple apps like web search will evolve in this direction: when the user speaks a query, they need not only a transcription of their query, but also the search results and a variety of other decision aids that may include images, HTML applets, or advertisements. Proposed change: Add support for multipart responses to XHR2. A design approach is suggested below. It should be noted that multipart extensions to XHR have also been explored by Mozilla.
Evolving Authentication Schemes
As web services evolve into the mainstream, the authentication methods they use have diverged considerably from the traditional HTTP basic auth, or 401/digest pattern that XMLHttpRequest uses. This divergence is generally due to the architectural scaling requirements of services that weren't envisaged in the original HTTP authentication design. We envisage that many speech services will follow this trend, and use different authentication schemes. Proposed change: Add an overload of the open() method that takes a callback that's used to notify the app that a send is about to be initiated. When the app receives this callback, it can perform whatever timestamp, crypto, CAPTCHA, or other mechanism it needs to, set the appropriate headers, then signal that the send should commence.

The following IDL only shows additions to the IDL already specified in the XHR2 draft. It is not a formal proposal, but is indicative of the sorts of enhancement that could be made.


  // existing interface
  interface XMLHttpRequestEventTarget : EventTarget {
  ...
    // multipart response notification
    attribute Function onpartreceived;
  };

  // new callback interface for alternative service auth
  interface XHRAuthCB {
    void authneeded(in Function signal);
  };

  // new interface for a body part in a multipart response
  interface ResponsePart {
    readonly attribute DOMString mimeType;
    readonly attribute DOMString encoding;
    readonly attribute Blob blobPart;
    readonly attribute DOMString textPart;
    readonly attribute Document XMLPart;
  };

  // new interface for accessing the collection of response body parts
  interface ResponsePartCollection {
    omittable getter ResponsePart item(in unsigned short index);
    readonly attribute unsigned short count;
  };

  // existing interface
  interface XMLHttpRequest : XMLHttpRequestEventTarget {
  ...

    // open with alternative auth callback
    void open(DOMString method, DOMString url, boolean async, XHRAuthCB authCB);

    // send stream
    void send(in Stream data);

    // multipart expected (from Mozilla design)
    attribute boolean multipart;

    // multipart response
    readonly attribute ResponsePartCollection responseparts;
  };

  

The XMLHttpRequest interface is extended in three important ways.

Firstly, there is an additional send() method that accepts a Stream object (similar to the existing method that accepts Blob). For speech recognition, this method would be used to stream microphone input to a web service.

Secondly, it provides access to multipart responses. To enable multipart responses, the app sets the multipart attribute to TRUE. When a part is received, the onpartreceived callback is fired. The app can access each received part through the responseparts collection, which is added to as each part is received. Each ResponsePart has mimeType and encoding attributes that the app can use to disambiguate the parts, as well as a variety of accessors to examine the part.

Thirdly, it provides an overload of open that enables alternative authentication schemes, by taking a call-back function to be invoked by the user agent prior to sending a request. The call-back allows the application to perform whatever time-sensitive authentication steps it needs to perform (e.g. calculate a timestamped hash, or fetch a limited-use auth-token, and place it in a particular header) and then call a signal() function to indicate that the send operation can proceed.
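No sample in this document exercises the proposed open() overload, so the following sketch shows how an application might use it. The "X-Auth-Token" header name, the makeAuthToken() helper, and the token scheme are illustrative assumptions, not part of the proposal.

```javascript
// Hypothetical helper: derive a short-lived token from a shared secret
// and a one-minute time bucket (illustrative scheme only).
function makeAuthToken(secret, timestampMs) {
  var bucket = Math.floor(timestampMs / 60000);
  return secret + ":" + bucket.toString(16);
}

// Open a request using the proposed XHRAuthCB overload. The user agent
// invokes authneeded() just before each send; the app sets its headers
// and then calls signal() to let the send proceed.
function openWithAuth(xhr, url, secret) {
  xhr.open("POST", url, true, {
    authneeded: function (signal) {
      xhr.setRequestHeader("X-Auth-Token", makeAuthToken(secret, Date.now()));
      signal();
    }
  });
}
```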

6.4 HTTP Conventions

This section presents a small number of conventions to aid with interoperability between diverse speech services and user agents using HTTP 1.1.

A large portion of current speech scenarios, as well as most of the HTML Speech XG's requirements, are feasible and addressable with existing HTTP 1.1 technology. For speech recognition, the basic pattern is to issue an HTTP POST request containing audio input streamed using chunked transfer encoding ([RFC2616]), and then receiving a 200 OK response from the service, containing EMMA results, and potentially further mime body parts containing additional information. For speech synthesis, the basic pattern is to issue an HTTP POST request, with SSML in the body, and receiving a 200 OK response containing the rendered audio and mark timing.

We acknowledge that while this approach satisfies many scenarios, further innovation in protocols may be necessary for some of the richer scenarios in the longer term. Indeed, new work is being done in the IETF and W3C around real time communication and web sockets that may well become useful and germane to certain speech scenarios, although at this stage it is too early to tell. Ultimately, we envisage that the Speech Object Interfaces presented in this document will work with a variety of underlying implementations (both local and protocol), and have designed those interfaces accordingly to allow for future service and protocol innovations.

HTTP Input Parameters

Input parameters to speech requests are expressed by the user agent using any of these standard techniques:

  1. Additional query parameters on the URL.
  2. HTTP headers.
  3. An HTTP POST entity body with application/x-www-form-urlencoded encoding.
  4. One or more body parts of an HTTP POST entity body with multipart/form-data or multipart/mixed encoding.

Headers and query parameters are the easiest options for apps that use XHR2. The other options are suitable for other approaches and are included for completeness. A user agent is not required to use the same technique for all of the parameters it passes. For example, it may choose to pass most parameters on the URL string, but include the audio stream in the body.

The recognition API has a number of attributes that an application may provide values for. These have corresponding HTTP parameters with the following reserved names. All of these parameters are optional in the HTTP request, and if absent their default behavior is determined by the speech recognition service.

In addition to this, all recognition service HTTP requests MUST include the following parameters.

Speech synthesis has fewer reserved input parameters:

All synthesis HTTP requests MUST include either src or content, but MUST NOT include both.

In addition to this, a synthesis HTTP request MAY include:

Service implementers may define additional service-specific input parameters. For example, they may define input properties for functions such as logging, session coordination, engine selection, and so forth. Applications can most easily provide these as part of the service URL, or in additional headers. A suitable naming convention should be used, for example service-specific parameter names could be prefixed with "x-". In addition, services should ignore any parameters they do not understand or expect.
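For instance, an application using XHR2 might pass some parameters on the URL and others as headers. The helper below composes a service URL from label-value pairs (technique 1 above); the maxnbest and confidence names mirror the code sample later in this document, while the "x-loglevel" parameter and service URL are hypothetical.

```javascript
// Sketch: serialize input parameters as URL query parameters.
function buildServiceUrl(base, params) {
  var pairs = [];
  for (var name in params) {
    pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(params[name]));
  }
  return base + (base.indexOf("?") < 0 ? "?" : "&") + pairs.join("&");
}

// Reserved parameters plus a hypothetical service-specific "x-" parameter.
var url = buildServiceUrl("http://webreco.example.com/search",
                          { maxnbest: 5, confidence: 0.8, "x-loglevel": "verbose" });
// url: "http://webreco.example.com/search?maxnbest=5&confidence=0.8&x-loglevel=verbose"
```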

HTTP Output Data

When operating correctly, a speech recognition request returns a 200 OK response. If the response contains a recognition result (which is normally the case), either of the following MUST be true:

  1. The content-type header MUST be "application/emma+xml", and the response body MUST be an Extensible MultiModal Annotation (EMMA) document.
  2. The content-type header is multipart, and the first part MUST have a MIME type of "application/emma+xml" and MUST contain an EMMA document. Subsequent parts may be of any type, as determined by the service.

The response MAY also include the following output parameter:

contextblock: The block of adaptation data returned by the service.

Service implementers MAY provide proprietary information (for example session tokens, adaptation data, or lattices) in the EMMA document, provided such information is expressed using standard XML extension conventions (such as placing proprietary tags in a separate namespace).

When operating correctly, a speech synthesis request returns a 200 OK response with these characteristics:

Errors are communicated to the user agent via the speech-error header, in the format of an error number, followed by a space and then the error message. In the case of an error, a recognition or synthesis service will generally still return 200 OK unless the error was with the HTTP request itself. Some errors may not prevent the generation of results, and the service may still provide them.
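Because the speech-error header has a simple textual format, a client can parse it with a few lines of script. The error number and message in this sketch are illustrative, since this document does not enumerate specific error codes.

```javascript
// Sketch: split a speech-error header value of the form
// "<number> <message>" into its two components.
function parseSpeechError(headerValue) {
  var space = headerValue.indexOf(" ");
  return {
    code: parseInt(headerValue.substring(0, space), 10),
    message: headerValue.substring(space + 1)
  };
}

var err = parseSpeechError("103 no grammar could be loaded"); // hypothetical error
// err.code === 103, err.message === "no grammar could be loaded"
```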

6.5 Code Samples For Basic Extensions

In this example, the user presses a button to start doing a search using their voice, similar to the Web Search example. The application uses Capture and XHR to stream audio from the microphone to a web service. The first part of the response from the web service is the EMMA document containing the recognition result. The application displays the top result and the alternates. A short time later, the second part of the response arrives, which is an HTML snippet containing the search results presented in whatever manner is most appropriate, which the app then inserts into the page.


<body onload="init()">

<input type="button" id="btnListen" onclick="onListenClicked()" value="Listen" />
<div id="feedback"></div>
<div id="alternates"></div>
<div id="results"></div>

<script type="text/javascript">

  var feedback;
  var alternates;
  var results;
  var recordingSession;
  var xhr;
  var recoReceived = false;
  var searchResultsReceived = false;

  function init() {
    feedback = document.getElementById("feedback");
    alternates = document.getElementById("alternates");
    results = document.getElementById("results");
    xhr = new XMLHttpRequest();
    xhr.open("POST","http://webreco.contoso.com/search",true);
    xhr.setRequestHeader("maxnbest","5");
    xhr.setRequestHeader("confidence","0.8");
    xhr.onreadystatechange = onReadyStateChanged;
    xhr.onpartreceived = onResponsePartReceived;
  }

  // place-holder: handle overall request state changes & errors
  function onReadyStateChanged() {
  }

  function onListenClicked() {
    feedback.innerHTML = "preparing...";
    navigator.device.opencapture(onOpenCapture);
  }

  function onOpenCapture(captureDevice) {
    var audioOptions = {  duration: 10, // max duration in seconds
                          limit: 1,     // only need one recording
                          mode: { type: "audio/amr-wb"}  // use wideband amr encoding
                       };
    var endpointParams = { sensitivity: 0.5, initialTimeout: 3000, endTimeout:500};
    recordingSession = captureDevice.captureAudio(onRecordStarted, onRecordFail, audioOptions, onEndpoint, endpointParams);
  }
  
  function onRecordStarted(stream) {
    feedback.innerHTML = "listening...";
    xhr.send(stream);
  }

  function onRecordFail(error) {
    feedback.innerHTML = error.toString();
    recordingSession = null;
  }

  function onEndpoint(endevent) {
    switch (endevent) {
      case 0: //initial silence
          feedback.innerHTML = "Please start speaking...";  
          break;
  
      case 1: //started speaking
          feedback.innerHTML = "Mmm...hmmm...I'm listening intently...";
          break;
  
      case 2: //finished speaking
          feedback.innerHTML = "One moment...";
          recordingSession.stop();
          recordingSession = null;
          // now xhr will reach the end of the input stream, and complete the request.
          break;
  
      case 3: //noise
          feedback.innerHTML = "Too noisy.  Try again later...";
          recordingSession.cancel();
          recordingSession = null;
          xhr.abort();
          break;
  
      case 4: //extended non-speech
          feedback.innerHTML = "Still can't hear you.  Try again later...";
          recordingSession.cancel();
          recordingSession = null;
          xhr.abort();
          break;

      default: break;
    }
  }

  function onResponsePartReceived() {
    if (!recoReceived) {
      if ("application/emma+xml" == xhr.responseparts.item(0).mimeType) {
        var terms = xhr.responseparts.item(0).XMLPart.getElementsByTagName("searchterm");
        feedback.innerHTML = "You asked for '" + terms[0].textContent + "'";
        if (terms.length > 1) {
          alternates.innerHTML = "<div>Or you may have said one of these...</div>";
          for (var i = 1; i < terms.length; i++) {
            alternates.innerHTML = alternates.innerHTML + "<div>" + terms[i].textContent + "</div>";
          }
        }
        results.innerHTML = "Fetching results...";
        recoReceived = true;
      }
    }
    else if ((xhr.responseparts.count > 1) && !searchResultsReceived) {
      if ("application/html+searchresults" == xhr.responseparts.item(1).mimeType) {
        results.innerHTML = xhr.responseparts.item(1).textPart;
        searchResultsReceived = true;
      }
    }
  }

</script>


</body>

7 Speech Object Interfaces

The basic design approach is to define recognition and synthesis APIs that are accessible from script. These APIs express the semantics of recognition and synthesis, while being loosely coupled to the implementation of those services. The API abstracts microphone input and interaction with the underlying speech services, so that the same API can be used whether the related mechanisms are proprietary to the device/user-agent, accessible over XHR2, or accessible over a future to-be-defined mechanism, without having to modify the API surface.

7.1 SpeechRecognizer Interface


  interface GrammarCollection {
    omittable getter DOMString item(in unsigned short index);
    readonly attribute unsigned short count;
    void add(in DOMString grammarURI, in optional float weight); 
          //typically http:, but could be data: for inline grammars
  };

  [NamedConstructor=SpeechRecognizer(), //uses the default recognizer provided by UA
   NamedConstructor=SpeechRecognizer(DOMString selectionParams),  //for specifying the desired characteristics of a built-in recognizer
   NamedConstructor=SpeechRecognizer(XMLHttpRequest xhr) //for specifying a recognizer
   // NamedConstructor=SpeechRecognizer(OtherServiceProvider other) //future service providers/protocols can be added
  ] 
  interface SpeechRecognizer {

    // audio input configuration
    void SetInputDevice( in CaptureDevice device, 
                         in optional CaptureAudioOptions options, 
                         in optional EndPointParams endparams);
    void SetInputDevice( in CaptureDevice device, 
                         in CaptureCB successCB,
                         in optional CaptureErrorCB errorCB, 
                         in optional CaptureAudioOptions options, 
                         in optional EndPointParams endparams,
                         in optional EndPointCB endCB);
    attribute boolean defaultUI;

    // type of recognition
    const unsigned short SIMPLERECO = 0;
    const unsigned short COMPLEXRECO = 1;
    const unsigned short CONTINUOUSRECO = 2;
    attribute unsigned short speechtype;   
    readonly attribute boolean supportedtypes[]; 
      //e.g. check supportedtypes[COMPLEXRECO] to determine whether interim events are supported.

    // speech parameters
    attribute GrammarCollection grammars;
    attribute short maxnbest;
    attribute long speechtimeout;
    attribute long completetimeout;
    attribute long incompletetimeout;
    attribute float confidence;
    attribute float sensitivity;
    attribute float speedvsaccuracy;
    attribute Blob contextblock;

    attribute Function onspeechmatch;    // callback(in SpeechRecognitionResultCollection results)
    attribute Function onspeecherror;    // callback(in SpeechError error)
    attribute Function onspeechnomatch;  // callback()
    attribute Function onspeechnoinput;  // callback()
    attribute Function onspeechstart;    // callback()
    attribute Function onspeechend;      // callback()

    // speech input methods
    void startSpeechInput(in optional Blob context);
    void stopSpeechInput();
    void cancelSpeechInput();
    void emulateSpeechInput(DOMString input);

    // states
    const unsigned short READY = 0;
    const unsigned short LISTENING = 1;
    const unsigned short WAITING = 2;

    readonly attribute unsigned short speechrecostate;

  };

  

A SpeechRecognizer can use a variety of different underlying services. The specific service depends on how the object is constructed:

  1. The SpeechRecognizer() constructor uses the default recognition service provided by the user agent. In some cases this may be an implementation that is local to the device, and limited by the capabilities of the device. In other cases it may be a remote service that is accessed over the network using mechanisms that are hidden to the application. In other cases it may be a smart hybrid of the two. The responsiveness, accuracy, language modeling capacity, acoustic language support, adaptation to the user, and other characteristics of the default recognition service will vary greatly between user agents and between devices.
  2. The SpeechRecognizer(DOMString selectionParams) constructor causes the user agent to use one of its built-in recognizers, selecting the particular recognizer that matches the parameters listed in selectionParams. This is a URL-encoded string of label-value pairs. It is up to the user agent to determine how closely it chooses to honor the requested parameters. Suggested labels and their corresponding values include:
  3. The SpeechRecognizer(XMLHttpRequest xhr) constructor uses an application-provided instance of XMLHttpRequest and the recommended HTTP conventions to access the speech recognition service. When this constructor is used, the user agent should consider displaying a notification in its surrounding chrome to notify the user of the particular speech service that is being used.

In the event that additional service access mechanisms are designed and standardized, such as new objects for accessing new real-time communication protocols, an additional constructor could be added, without changing the API.
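The three constructor forms can be sketched as follows. The function is illustrative and never invoked here, and the "lang" selection label is an assumed example, since the suggested label list is not reproduced in this section.

```javascript
// Sketch: the three proposed ways to obtain a SpeechRecognizer.
function makeRecognizers() {
  // 1. The user agent's default recognition service.
  var r1 = new SpeechRecognizer();

  // 2. A built-in recognizer selected by URL-encoded label-value pairs
  //    ("lang" is a hypothetical label name).
  var r2 = new SpeechRecognizer("lang=en-US");

  // 3. An application-specified HTTP service, accessed via the
  //    recommended HTTP conventions on an app-configured XMLHttpRequest.
  var xhr = new XMLHttpRequest();
  xhr.open("POST", "http://webreco.example.com/reco", true);
  var r3 = new SpeechRecognizer(xhr);

  return [r1, r2, r3];
}
```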

By default, microphone input will be provided by the user agent's default device. However, the application may use the SetInputDevice() functions to provide a particular Capture object, with particular configuration settings, in order to exercise more control over the recording operation.

User agents are expected to provide a default user experience for controlling speech recognition. However, apps can choose to provide their own user interface by setting the defaultUI attribute to FALSE.

The speechtype attribute indicates the type of speech recognition that should occur. The values must be one of the following values:

SIMPLERECO (numeric value 0)

SIMPLERECO means that the request for recognition must raise just one final speechmatch, speechnomatch, speechnoinput, or speecherror event and must not produce any interim events or results. If the speech service is a remote service accessible over HTTP/HTTPS, this means the recognition may be a simple request-response. SIMPLERECO must be the default value.

COMPLEXRECO (numeric value 1)

COMPLEXRECO means that the speech recognition request must produce all of the interim events in addition to speechstart, speechend, as well as the final recognition result. Interim events may include partial results. COMPLEXRECO must only produce one final recognition result, nomatch, noinput, or error as a COMPLEXRECO is still one utterance from the end user resulting in one result.

CONTINUOUSRECO (numeric value 2)

CONTINUOUSRECO represents a conversation or dialogue or dictation scenario where in addition to interim events numerous final recognition results must be able to be produced. A CONTINUOUSRECO speech interaction once started must not stop until the stopSpeechInput or cancelSpeechInput method is invoked.

Due to the need to stream audio while getting results back, a remote service doing COMPLEXRECO or CONTINUOUSRECO should use a more sophisticated paradigm than regular HTTP request-response. Hence, although COMPLEXRECO and CONTINUOUSRECO could be specified in conjunction with SpeechRecognizer(XMLHttpRequest xhr), no interim events would be received, and only a single result would be received.

The optional grammars attribute is a collection of URLs that give the address of one or more application-specific grammars. A weight between 0.0 and 1.0 can optionally be provided for each grammar (when not specified, weight defaults to 1.0). For example, grammars.add("http://example.com/grammars/pizza-order.grxml", 0.75). Some applications may wish to provide SRGS directly, in which case they can use the data: URI scheme, e.g. grammars.add("data:,<?xml version=... ...</grammar>"). The implementation of recognition systems should use the list of grammars to guide the speech recognizer. Implementations must support SRGS grammars and SISR annotations. Note that the order of the grammars must define a priority order used to resolve ties, with an earlier-listed grammar taking higher priority.

If the grammars attribute is absent, the recognition service may provide a default grammar. For instance, services that perform recognition within specific domains (e.g. web search, e-commerce catalog search, etc.) have an implicit language model, and do not necessarily need the application to specify a grammar.

The optional maxnbest attribute specifies that the implementation must not return a number of items greater than the maxnbest value. If the maxnbest is not set it must default to 1.

The optional speechtimeout attribute specifies the time in milliseconds to wait for start of speech, after which the audio capture must stop and a speechnoinput event must be returned. If not set, the timeout is speech service dependent.

The optional completetimeout attribute specifies the time in milliseconds the recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for too low confidence results), when the speech is a complete match of all active grammars. If not set, the timeout is speech service dependent.

The optional incompletetimeout attribute specifies the time in milliseconds the recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for too low confidence results), when the speech is an incomplete match (i.e., anything that is not a complete match) of all active grammars. If not set, the timeout is implementation dependent.

The optional confidence attribute specifies a confidence level. The recognition service must reject any recognition result with a confidence less than the confidence level. The confidence level must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.

The optional sensitivity attribute specifies how sensitive the recognition system should be to noise. The recognition service must treat a higher value as a request to be more sensitive to noise. The sensitivity must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.

The optional speedvsaccuracy attribute specifies how much the recognition system should prioritize a speedy low-latency result and how much it should prioritize getting the most accurate recognition. The recognition service must treat a higher value as a request to have a more accurate result and must treat a lower value as a request to have a faster response. The speedvsaccuracy must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.

The optional contextblock attribute is used to convey additional recognizer-defined context data between the application and the recognizer. For example, recognizers may use it to convey adaptation data back to the application. The application could persist this context data to be re-used in a future session. It is automatically updated whenever a speechmatch occurs.

The following events are targeted at the SpeechRecognizer object (or any object that inherits from it), do not bubble, and are not cancelable.

The speechmatch event must be dispatched when a set of complete and valid utterances have been matched. A complete utterance ends when the implementation detects end-of-speech or if the stopSpeechInput() method was invoked. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechmatch result returned and this must end the speech recognition.

The speecherror event must be dispatched when the active speech input session resulted in an error. The error may result from the user agent denying the speech session (or parts of it) due to security or privacy concerns, from an error in the web author's specification of the speech request, or from an error in the recognition system.

The speechnomatch event must be raised when a complete utterance has failed to match the active grammars, or has matched only with a confidence less than the specified confidence value. A complete utterance ends when the implementation detects end-of-speech or if the stopSpeechInput() method was invoked. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechnomatch returned and this must end the speech recognition.

The speechnoinput event must be raised when the recognizer has detected no speech and the speechtimeout has expired. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechnoinput returned and this must end the speech recognition.

The speechstart event must be raised when the recognition service detects that a user has started speaking. This event must not be raised if the speechtype was SIMPLERECO but must be generated if the speechtype is either COMPLEXRECO or CONTINUOUSRECO.

The speechend event must be raised when the recognition service detects that a user has stopped speaking. This event must not be raised if the speechtype was SIMPLERECO but must be generated if the speechtype is either COMPLEXRECO or CONTINUOUSRECO.

When the startSpeechInput() method is invoked, a speech recognition turn must be started. It is an error to call startSpeechInput() when the speechrecostate is anything but READY, and a user agent must raise a speecherror should this occur. Note that user agents should have privacy and security policies that specify whether scripted speech should be allowed, and user agents may prevent startSpeechInput() from succeeding when called on certain elements, applications, or sessions. If a recognition turn is not begun with the recognition service then a speecherror event must be raised.

When the stopSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must gracefully stop the session as if end-of-speech was detected. The user agent must perform speech recognition on audio that has already been recorded and the relevant events must be fired if necessary. If there was no active speech input session, if this element did not initiate the active speech input session or if end-of-speech was already detected, this method must return without doing anything.

When the cancelSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must abort the session, discard any pending/buffered audio data and fire no events for the pending data. If there was no active speech input session or if this element did not initiate it, this method must return without doing anything.

When the emulateSpeechInput(DOMString input) method is invoked then the speech service treats the input string as the text utterance that was spoken and returns the resulting recognition.

The speechrecostate readonly variable tracks the state of the recognition. At the beginning the state is READY meaning that the element is ready to start a speech interaction. Upon starting the recognition turn the state changes to LISTENING. Once the system has stopped capturing from the user, either due to the speech system detecting the end of speech or the web application calling stopSpeechInput(), but before the results have been returned the system is in WAITING. Once the system has received the results the state returns to READY.
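The state cycle described above can be modeled with a small sketch (a hypothetical helper, not part of the proposed API):

```javascript
// Hypothetical model of the speechrecostate lifecycle:
// READY -> LISTENING -> WAITING -> READY.
const READY = 0, LISTENING = 1, WAITING = 2;

function createRecoStateMachine() {
  let state = READY;
  return {
    get state() { return state; },
    start() {              // startSpeechInput(): only legal from READY
      if (state !== READY) throw new Error("speecherror: state is not READY");
      state = LISTENING;
    },
    stop() {               // end-of-speech detected or stopSpeechInput()
      if (state === LISTENING) state = WAITING;
    },
    receiveResults() {     // results delivered: the turn is over
      if (state === WAITING) state = READY;
    }
  };
}
```

Calling start() while a turn is already in flight models the speecherror the specification requires.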

7.2 SpeechRecognitionResultCollection Interface

  [NamedConstructor=SpeechRecognitionResultCollection(Document responseEMMAXML),
   NamedConstructor=SpeechRecognitionResultCollection(DOMString responseEMMAText)] 
  interface SpeechRecognitionResultCollection {
    readonly attribute Document responseEMMAXML;
    readonly attribute DOMString responseEMMAText;
    readonly attribute unsigned short length;
    omittable getter SpeechRecognitionResult item(in unsigned short index);
    void feedbackcorrection(DOMString correctUtterance);
    void feedbackselection(in unsigned short index);
  };
  
  

The responseEMMAXML attribute must be generated from the EMMA document returned by the recognition service. The value of responseEMMAXML is the result of parsing the response entity body into a document tree following the rules from the XML specifications. If this fails (unsupported character encoding, namespace well-formedness error et cetera) the responseEMMAXML must be null.

The responseEMMAText attribute must be generated from the EMMA document returned by the recognition service. The value of responseEMMAText is the result of parsing the response entity body into a text response entity body as defined in XMLHttpRequest.

The length attribute must return the number of results represented by the collection.

The item(index) method must return the indexth result in the collection. If there is no indexth result in the collection, then the method must return null. Since this is an "omittable getter", the "item" accessor is optional: script can use results(index) as a short-hand for results.item(index).

The feedbackcorrection(correctUtterance) method is used to give feedback on the speech recognition results by providing the text value that the application believes was correct for the last turn of the recognition. The application should not use feedbackcorrection if one of the returned selections was correct; it should use feedbackselection instead.

The feedbackselection(in unsigned short index) method is used to give feedback on the speech recognition results by providing the index of the item that the application believes was correct for the last turn of the recognition. Passing in an index beyond the number of results returned in the last turn should be interpreted as the application believing that the entire list was incorrect, but not knowing what the correct result was (since it didn't use feedbackcorrection).
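The choice between the two feedback methods can be illustrated with a hypothetical helper, using a plain array to stand in for a SpeechRecognitionResultCollection:

```javascript
// Hypothetical illustration of the feedback rules above. Returns which
// feedback call the application should make, and with what argument.
function chooseFeedback(collection, correctUtterance) {
  // Prefer feedbackselection when one of the n-best items was correct.
  for (let i = 0; i < collection.length; i++) {
    if (collection[i].utterance === correctUtterance) {
      return { method: "feedbackselection", index: i };
    }
  }
  // Otherwise fall back to feedbackcorrection with the correct text.
  return { method: "feedbackcorrection", utterance: correctUtterance };
}
```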

7.3 SpeechRecognitionResult Interface

The results attribute returns a sequence of SpeechRecognitionResult objects. In ECMAScript, SpeechRecognitionResult objects are represented as regular native objects with properties named utterance, confidence and interpretation. Note that this may not be sufficient for applications that want more information and insight into the recognition, such as the timing of individual words and phrases, confidences on individual semantics or parts of utterances, and many other features that might be part of a complex dictation use case. This is acceptable because in these applications the raw EMMA will be available in the SpeechRecognitionResultCollection.

  [NoInterfaceObject]
  interface SpeechRecognitionResult {
    readonly attribute DOMString utterance;
    readonly attribute float confidence;
    readonly attribute object interpretation;
  };
  
  

The utterance attribute must return the text of recognized speech.

The confidence attribute must return a value in the inclusive range [0.0, 1.0] indicating the quality of the match. The higher the value, the more confident the recognizer is that this matches what the user spoke.

The interpretation attribute must return the result of semantic interpretation of the recognized speech, using semantic annotations in the grammar. If the grammar used contained no semantic annotations for the utterance, then this value must be the same as utterance.
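As a non-normative illustration of consuming these attributes, an application might select an interpretation from the n-best list using the confidence values (the helper below and its plain-object result shape are hypothetical):

```javascript
// Hypothetical post-processing sketch. Each element mirrors the
// SpeechRecognitionResult shape: { utterance, confidence, interpretation }.
function bestInterpretation(results, minConfidence) {
  // Keep only results at or above the application's confidence threshold.
  const accepted = results.filter(r => r.confidence >= minConfidence);
  if (accepted.length === 0) return null; // treat as a nomatch
  // Results are assumed to be ordered best-first; take the top one.
  return accepted[0].interpretation;
}
```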

7.4 SpeechSynthesizer Interface

The SpeechSynthesizer interface generates synthesized voice (or spliced audio recordings) using SSML input (or raw text). It has its own audio playback capabilities, along with a mark event to help with UI synchronization and barge-in. It also produces an audio stream that can be used as the source for an <audio> element.

The interface could be extended with other progress events, such as word and sentence boundaries, phoneme rendering, and even visual cue events such as visemes and facial expressions.


  [NamedConstructor=SpeechSynthesizer(), //default built-in synthesizer
   NamedConstructor=SpeechSynthesizer(DOMString selectionParams), //selection parameters for built-in synthesizer
   NamedConstructor=SpeechSynthesizer(XMLHttpRequest xhr) //synthesizer via a REST service
// NamedConstructor=SpeechSynthesizer(OtherServiceProvider other) //future service providers/protocols can be added
  ]
  interface SpeechSynthesizer {

    // error handling
    attribute Function onspeecherror(in SpeechError error);

    // content specification
    attribute DOMString src;
    attribute DOMString content;

    // audio buffering state
    const unsigned short BUFFER_EMPTY = 0;
    const unsigned short BUFFER_WAITING = 1;
    const unsigned short BUFFER_LOADING = 2;
    const unsigned short BUFFER_COMPLETE = 3;
    readonly attribute unsigned short readyState;
      
    void load();
    readonly attribute TimeRanges timeBuffered;
    readonly attribute Stream audioBuffer;

    // playback controls
    void play();
    void pause();
    void cancel();
    attribute double rate;
    attribute unsigned short volume;

    // playback state
    const unsigned short PLAYBACK_EMPTY = 0;
    const unsigned short PLAYBACK_PAUSED = 1;
    const unsigned short PLAYBACK_PLAYING = 2;
    const unsigned short PLAYBACK_COMPLETE = 3;
    const unsigned short PLAYBACK_STALLED = 4;
    readonly attribute unsigned short playbackState;

    // progress         
    attribute double currentTime;
    readonly attribute DOMString lastMark;

    attribute Function onmark(in DOMString mark, in unsigned long time);
    attribute Function onplay();
    attribute Function onpause();
    attribute Function oncomplete();
    attribute Function oncancel();
};

Instantiation

A SpeechSynthesizer can use a variety of different underlying services. The specific service depends on how the object is constructed:

  1. The SpeechSynthesizer() constructor uses the default synthesis service provided by the user agent. In some cases this may be an implementation that is local to the device, and limited by the capabilities of the device. In other cases it may be a remote service that is accessed over the network using mechanisms that are hidden to the application. The responsiveness, expressiveness, suitability for specific domains, and other characteristics of the default synthesis service will vary greatly between user agents and between devices.
  2. The SpeechSynthesizer(DOMString selectionParams) constructor causes the user agent to use one of its built-in synthesizers, selecting the particular synthesizer that matches the parameters listed in selectionParams. This is a URL-encoded string of label-value pairs. It is up to the user agent to determine how closely it chooses to honor the requested parameters. Suggested labels and their corresponding values include:
  3. The SpeechSynthesizer(XMLHttpRequest xhr) constructor uses an application-provided instance of XMLHttpRequest and the recommended HTTP conventions to access the speech synthesis service. When this constructor is used, the user agent should consider displaying a notification in its surrounding chrome to notify the user of the particular speech service that is being used.

In the event that additional service access mechanisms are designed and standardized, such as new objects for accessing new real-time communication protocols, an additional constructor could be added, without changing the API.
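For illustration, the selectionParams argument accepted by the second constructor is an ordinary URL-encoded list of label-value pairs; a hypothetical helper (not part of the proposal) could assemble it like this:

```javascript
// Hypothetical helper that builds the URL-encoded label-value string
// accepted by the SpeechSynthesizer(selectionParams) constructor. The
// labels used here (lang, gender) follow the examples in this document.
function buildSelectionParams(params) {
  return Object.keys(params)
    .map(k => encodeURIComponent(k) + "=" + encodeURIComponent(params[k]))
    .join("&");
}

// e.g. buildSelectionParams({ lang: "en-au", gender: "female" })
//      yields "lang=en-au&gender=female"
```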

Error Handling

The speecherror event must be dispatched when an error occurs. See 7.5 SpeechError Interface.

Content Specification

Either raw text (DOMString uses UTF-16) or SSML can be synthesized. Applications provide content to be synthesized by setting the src attribute to a URL from which the synthesizer service can fetch the content, or by assigning the content directly to the content attribute. Setting one of these attributes resets the other to ECMA undefined.
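The mutual-reset behavior can be sketched with plain ECMAScript accessors (a hypothetical model, not the user agent's implementation):

```javascript
// Hypothetical model of the rule above: assigning src clears content,
// and assigning content clears src.
function createContentHolder() {
  let src, content; // both start as undefined
  return {
    get src() { return src; },
    set src(v) { src = v; content = undefined; },
    get content() { return content; },
    set content(v) { content = v; src = undefined; }
  };
}
```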

Audio Buffering

Synthesized audio is buffered, along with timing information for mark events. In typical use, an application will specify the content to be rendered, then call play(), at which point the synthesizer service will fetch the content and synthesize it to audio, which is buffered by the user agent (along with timing info for mark events as they occur), and played to the audio output device.

If the preload attribute is set, the synthesizer service will be invoked to begin synthesizing content into audio as soon as content is specified. Otherwise, if preload isn't set, the service won't be invoked until play() is called.

Applications can query the readyState attribute to determine the status of the buffer. For example, when loading a lengthy sequence of audio, an application may display some sort of feedback to the user, or disable some of the interaction controls, until the buffer is ready. There are four readyState values:

BUFFER_EMPTY (numeric value 0)
indicates there is no data in the buffer, and the service has not been invoked to provide any data.
BUFFER_WAITING (numeric value 1)
indicates the service has been invoked, but no data has been received yet.
BUFFER_LOADING (numeric value 2)
indicates data is being received and is available in the buffer. Applications can also check the timeBuffered attribute to track progress.
BUFFER_COMPLETE (numeric value 3)
indicates that all data has been received and placed in the buffer.

Some applications may also want to use the raw synthesized audio for other purposes. Provided the readyState is BUFFER_COMPLETE, applications can fetch the raw audio data from the audioBuffer attribute.

Playback Controls

Playback is initiated or resumed by calling play(), and paused by calling pause(). The playbackState attribute is useful for applications to coordinate playback state with the rest of the application. playbackState has five values:

PLAYBACK_EMPTY (numeric value 0)
indicates that playback cannot occur because there is no audio in the buffer.
PLAYBACK_PAUSED (numeric value 1)
indicates the playback is paused.
PLAYBACK_PLAYING (numeric value 2)
indicates the audio buffer is being played to the speaker.
PLAYBACK_COMPLETE (numeric value 3)
indicates playback has reached the end of the buffer, and the buffer is complete.
PLAYBACK_STALLED (numeric value 4)
indicates playback has reached the end of the buffer, but the buffer is incomplete. Once there is more audio in the buffer, playback will automatically resume.

The current position of playback can be determined either by checking the currentTime attribute, which returns the time since the beginning of the buffer in milliseconds, or by checking the lastMark attribute, which contains the label of the last mark reached in the SSML prior to the current audio position. The same attributes can be used to seek to specific positions. For example, setting the lastMark attribute to the label value of a mark in the SSML content should move playback to the corresponding position in the audio buffer.

Playback speed is controlled by setting rate, which has a default value of 1.0. Similarly, volume controls the audio amplitude, and can be set to any value between 0 and 100, with 50 as the default.

To cease all playback, clear the buffer, and cancel synthesis, call the cancel() function.

An application can respond to other key events. These are targeted at the SpeechSynthesizer object (or any object that inherits from it), do not bubble, and are not cancelable.

  1. mark: raised when the audio position corresponds to an SSML mark. The particular mark that was reached can be determined by the argument to the event handler or by checking the lastMark attribute.
  2. play: raised whenever the object begins or resumes sending audio to the speaker.
  3. pause: raised whenever audio output is paused (typically in response to pause() being called).
  4. complete: raised when the end of the audio buffer has been output to the speaker, and no more audio is left to be synthesized.
  5. cancel: raised when the synthesis session has been cancelled, which means any outstanding transaction with the underlying service is discarded, and the buffer is emptied without being played.

7.5 SpeechError Interface

  [NoInterfaceObject]
  interface SpeechError {
    const unsigned short ABORTED = 1;
    const unsigned short AUDIO = 2;
    const unsigned short NETWORK = 3;
    const unsigned short NOT_AUTHORIZED = 4;
    const unsigned short REJECTED_SPEECHSERVICE = 5;
    const unsigned short BAD_GRAMMAR = 6;
    const unsigned short BAD_SSML = 7;
    const unsigned short BAD_STATE = 8;
    readonly attribute unsigned short code;
    readonly attribute DOMString message;
  };
  

The code attribute must return the appropriate code from the following list:

ABORTED (numeric value 1)
The user or a script aborted speech service.
AUDIO (numeric value 2)
There was an error with audio.
NETWORK (numeric value 3)
There was a network error, for implementations that use server-side speech services.
NOT_AUTHORIZED (numeric value 4)
The user agent is not allowing any speech services to occur for reasons of security or privacy.
REJECTED_SPEECHSERVICE (numeric value 5)
The user agent is not allowing the speech service requested by the web application, but would allow some speech service to be used, either because the user agent doesn't support the requested service or because of reasons of security or privacy.
BAD_GRAMMAR (numeric value 6)
There was an error in the speech recognition grammar.
BAD_SSML (numeric value 7)
There was an error in the SSML given to the synthesizer.
BAD_STATE (numeric value 8)
There was an error as the state was wrong for the task that was requested.

The message attribute must return an error message describing the details of the error encountered. The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
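A non-normative sketch of how an application might turn these codes into debugging messages (the constant values follow the list above; the message strings are illustrative, not mandated):

```javascript
// Hypothetical error-handling sketch for the SpeechError codes above.
const ABORTED = 1, AUDIO = 2, NETWORK = 3, NOT_AUTHORIZED = 4,
      REJECTED_SPEECHSERVICE = 5, BAD_GRAMMAR = 6, BAD_SSML = 7, BAD_STATE = 8;

function describeSpeechError(code) {
  switch (code) {
    case ABORTED:                return "speech service was aborted";
    case AUDIO:                  return "audio capture problem";
    case NETWORK:                return "network failure reaching the speech service";
    case NOT_AUTHORIZED:         return "speech is disabled by the user agent";
    case REJECTED_SPEECHSERVICE: return "requested speech service not allowed";
    case BAD_GRAMMAR:            return "error in the recognition grammar";
    case BAD_SSML:               return "error in the SSML content";
    case BAD_STATE:              return "operation not valid in the current state";
    default:                     return "unknown speech error";
  }
}
```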

7.6 Code Samples for Speech Interfaces

Multimodal Video Game Example

This example illustrates how an application like the multimodal video game example could be built.


<body onload="init()">

<!-- Assume lots of markup to layout the screen.
     Somewhere in this markup is an image the user touches to cast
     whatever spell they've selected with speech.
-->

  <img src="castspell.jpg" onclick="onSpellcastTouched()" />

<script type="text/javascript">

// Assume lots of script for the game logic,
// which we'll abstract as an object called "gameController"

  var recognizer;
  var synthesizer;
  var currentSpell; //remembers the name of the currently selected spell

  function init() {
    recognizer = new SpeechRecognizer("lang=en-au"); // English, preferably Australian
    recognizer.speechtype = 2; //continuous
    recognizer.defaultUI = false; //using custom in-game GUI
    recognizer.grammars.add("spell-list.grxml");
    recognizer.onspeechmatch = onReco;
    recognizer.startSpeechInput();

    synthesizer = new SpeechSynthesizer("lang=en-au&gender=female");
    
    // assume lots of other script to initiate the game logic
    currentSpell = "invisibility";
  }

  function onReco(results) {
    if (results(0).confidence > 0.8) {
      currentSpell = results(0).interpretation;
      synthesizer.cancel(); //barge-in any existing output
      synthesizer.content = results(0).interpretation + " spell armed";
      synthesizer.play();
    }
  }

  function onSpellcastTouched() {
    gameController.castSpell(currentSpell);
  }

</script>

</body>

HMIHY Flight Booking Example

This example illustrates how an application like the HMIHY flight booking example might be implemented, using a HTTP-based speech recognition service.


<script type="text/javascript">
  var recognizer;
  var xhr;

  function onMicrophoneClicked() {
    navigator.device.opencapture(onOpenCapture);
  }

  function onOpenCapture(captureDevice) {
    xhr = new XMLHttpRequest();
    xhr.open("POST","https://webrecognizer.contoso.com/reco",true);
    recognizer = new SpeechRecognizer(xhr);
    recognizer.maxnbest = 4;
    recognizer.confidence = 0.2;
    recognizer.grammars.add("http://www.contosoair.com/grammarlib/HMIHY.srgs");
    var audioOptions = {  duration: 10, // max duration in seconds
                          limit: 1,     // only need one recording
                          mode: { type: "audio/amr-wb"}  // use wideband amr encoding
                       };
    var endpointParams = { sensitivity: 0.5, initialTimeout: 3000, endTimeout:500};
    recognizer.setInputDevice(captureDevice, audioOptions, endpointParams);
    recognizer.defaultUI = true; // use the built-in speech recognition UI
    recognizer.onspeechmatch = onspeechmatch;
    recognizer.startSpeechInput();
  }

  function onspeechmatch(results) {
    var emma = results.responseEMMAXML;
    switch (emma.getElementsByTagName("task")[0].textContent) {
      case "flight-booking":
        // assume code to display flight booking form fields
        // ...
        // If a field is present in the reco results, fill it in the form:
        assignField(emma,"from");
        assignField(emma,"to");
        assignField(emma,"departdate");
        assignField(emma,"departampm");
        assignField(emma,"returndate");
        assignField(emma,"returnampm");
        assignField(emma,"numadults");
        assignField(emma,"numkids");
        break;
       
      case "frequent-flyer-program":
          // etc
    }
  }

  function assignField(emma, fieldname) {
    var elements = emma.getElementsByTagName(fieldname);
    if (elements.length != 0) {
      document.getElementById("id-" + fieldname).value = elements[0].textContent;
    }
  }
</script>

8 (Future) Markup Enhancements

This section describes an approach that could be used to integrate speech as first-class functionality in HTML markup, making a variety of speech scenarios achievable by the novice developer, while still enabling experienced developers to create sophisticated speech applications. Where possible the user agent will provide some lowest-common-denominator speech support (either locally or by selecting a default remote speech service). For example, sometimes the speech recognition is tied to a specific HTMLElement, often an HTMLInputElement. In these cases it may be possible to create default grammars and default speech behaviors specific to that input using the other input elements (type, pattern, etc.). At other times the web developer may need to specify an application specific grammar. This proposal supports both.

The basic idea behind the proposal is to create two new markup elements <reco> and <tts>. The <reco> element ties itself to its parent containing HTML element. So <input type="text" pattern="[A-Za-z]{3}"><reco .../></input> would define a recognition element that is tied to the enclosing input element. This structure means that the reco element can inspect and use the basic information from the input element (such as type and pattern) to automatically provide default speech grammars, behaviors, and user interaction idioms.

One approach to the <reco> element is to make HTMLInputElement and other elements allow <reco> as a child. Alternatively, if it is deemed too radical to allow HTMLInputElement or other elements to take non-empty content, the <reco> could be associated with the HTMLInputElement in question using the for attribute, similar to the label element in HTML5 today. The rest of this section will assume the former approach, but the latter approach could work as well if desired.

8.1 Speech Recognition Markup

The API for the HTMLRecoElement.

  interface HTMLRecoElement : HTMLElement {
    // type of speech
    const unsigned short SIMPLERECO = 0;
    const unsigned short COMPLEXRECO = 1;
    const unsigned short CONTINUOUSRECO = 2;
    attribute unsigned short speechtype;

    // type of autofill
    const unsigned short NOFILL = 0;
    const unsigned short FILLUTTERANCE = 1;
    const unsigned short FILLSEMANTIC = 2;
    attribute unsigned short autofill;

    // speech parameter attributes
    attribute boolean speechonfocus;
    attribute DOMString grammar;
    attribute short maxnbest;
    attribute long speechtimeout;
    attribute long completetimeout;
    attribute long incompletetimeout;
    attribute float confidence;
    attribute float sensitivity;
    attribute float speedvsaccuracy;

    // speech input event handler IDL attributes
    attribute Function onspeechmatch(in SpeechRecognitionResultCollection results);
    attribute Function onspeecherror(in SpeechError error);
    attribute Function onspeechnomatch();
    attribute Function onspeechnoinput();
    attribute Function onspeechstart();
    attribute Function onspeechend();

    // speech input methods
    void startSpeechInput();
    void stopSpeechInput();
    void cancelSpeechInput();
    void emulateSpeechInput(DOMString input);

    // service configuration
    void SetSpeechService(DOMString url, DOMString? lang, DOMString? parameters);
    void SetSpeechService(DOMString url, DOMString user, DOMString password, DOMString? lang, DOMString? params);
    void SetSpeechService(DOMString url, DOMString authHeader, Function onCustomAuth, DOMString? lang, DOMString? params);
    void SetCustomAuth(DOMString authValue);
    attribute DOMString speechservice;
    attribute DOMString speechparams;
    attribute DOMString authHeader;


    // speech response variables
    readonly attribute Stream capture;

    // states
    const unsigned short READY = 0;
    const unsigned short LISTENING = 1;
    const unsigned short WAITING = 2;

    readonly attribute unsigned short speechrecostate;

  };
  

User agents must support <reco> as a child of HTMLTextAreaElement and HTMLInputElement. User agents may support <reco> as a child of other elements such as HTMLAnchorElement, HTMLImageElement, HTMLAreaElement, HTMLFormElement, HTMLFieldSetElement, HTMLLabelElement, HTMLButtonElement, HTMLSelectElement, HTMLDataListElement, and HTMLOptionElement.

The speechtype attribute indicates the type of speech recognition that should occur. The value must be one of the following:

SIMPLERECO (numeric value 0)

SIMPLERECO means that the request for recognition must raise just one final speechmatch, speechnomatch, speechnoinput, or speecherror event and must not produce any interim events or results. If the speechservice is a remote service accessible over HTTP/HTTPS, this means the recognition may be a simple request-response. SIMPLERECO must be the default value.

COMPLEXRECO (numeric value 1)

COMPLEXRECO means that the speech recognition request must produce all of the interim events, including speechstart and speechend, in addition to the final recognition result. COMPLEXRECO must only produce one final recognition result, nomatch, noinput, or error, as a COMPLEXRECO is still one utterance from the end user resulting in one result. Because audio must be streamed while results are returned, a remote service performing COMPLEXRECO should use a more sophisticated paradigm than simple request-response, such as WebSockets.

CONTINUOUSRECO (numeric value 2)

CONTINUOUSRECO represents a conversation, dialogue, or dictation scenario in which, in addition to interim events, numerous final recognition results must be able to be produced. A CONTINUOUSRECO speech interaction, once started, must not stop until the stopSpeechInput() or cancelSpeechInput() method is invoked.

The autofill attribute indicates whether action should be taken on the recognition automatically and, if so, upon what it should be based. Usually the default action will be to put the recognition result value into the parent element, although the exact details are specific to the parent element. The value must be one of the following:

NOFILL (numeric value 0)

NOFILL means that upon a recognition match automatic behavior must not occur. NOFILL must be the default value.

FILLUTTERANCE (numeric value 1)

FILLUTTERANCE means that upon a recognition match the automatic behavior, if any, must prefer the use of the utterance of the recognition. For example, if the user says "San Francisco International Airport" and the semantic interpretation of the match is "SFO", the user agent uses "San Francisco International Airport" as its result in the default action.

FILLSEMANTIC (numeric value 2)

FILLSEMANTIC means that upon a recognition match the automatic behavior, if any, must prefer the use of the semantic interpretation of the recognition. This means if the user says "Er, um, I'd like 4 of those please" and the semantic interpretation is 4 that the user agent uses 4 as its result in the default action.
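The three autofill behaviors can be summarized with a hypothetical helper that picks the value the default action would use (the result object mirrors the SpeechRecognitionResult shape):

```javascript
// Hypothetical sketch of the autofill selection rule described above.
const NOFILL = 0, FILLUTTERANCE = 1, FILLSEMANTIC = 2;

function autofillValue(autofill, result) {
  switch (autofill) {
    case FILLUTTERANCE: return result.utterance;       // raw spoken text
    case FILLSEMANTIC:  return result.interpretation;  // semantic value
    default:            return null;                   // NOFILL: no action
  }
}
```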

The boolean speechonfocus attribute, if true, specifies that speech must automatically start upon the containing element receiving focus. If the attribute is false then the speech must not start upon focus but must instead wait for an explicit startSpeechInput method call. The default value must be false.

The optional grammar attribute is a whitespace-separated list of URLs giving the address of one or more external application-specific grammars, e.g. "http://example.com/grammars/pizza-order.grxml". The attribute, if present, must contain a whitespace-separated list of valid non-empty URLs. Recognition systems should use the list of grammars to guide the speech recognizer. Implementations must support SRGS grammars and SISR annotations. Note that the order of the grammars must define a priority order used to resolve ties, where an earlier listed grammar is considered higher priority.

If the grammar attribute is absent the user agent should provide a reasonable context aware default grammar. For instance, if the speech attribute is on an input element with a type and pattern attribute defined then a user agent may use the same pattern to define a default grammar in the absence of an application-specific grammar attribute. Likewise, if the speech attribute is on a form element then a default grammar may be constructed by taking the grammars (default or application specified) of all the child or descendent elements inside the form.
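As a non-normative sketch of the markup shape this section proposes, a text field might pair an application-specific grammar with the pattern-derived default the user agent could otherwise construct (the grammar URL and attribute values below are illustrative, not defined by this proposal):

```html
<!-- Hypothetical sketch; grammar URL and attribute values are illustrative -->
<input type="text" id="city" name="city" pattern="[A-Za-z ]+">
  <reco speechonfocus="true"
        grammar="http://example.com/grammars/cities.grxml">
  </reco>
</input>
```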

The optional maxnbest attribute specifies that the implementation must not return more results than the maxnbest value. If maxnbest is not set, it must default to 1.

The optional speechtimeout attribute specifies the time in milliseconds to wait for start of speech, after which the audio capture must stop and a speechnoinput event must be returned. If not set, the timeout is speech service dependent.

The optional completetimeout attribute specifies the time in milliseconds the recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for too low confidence results), when the speech is a complete match of all active grammars. If not set, the timeout is speech service dependent.

The optional incompletetimeout attribute specifies the time in milliseconds the recognizer must wait to finalize a result (either accepting it or throwing a nomatch event for too low confidence results), when the speech is an incomplete match (i.e., anything that is not a complete match) of all active grammars. If not set, the timeout is implementation dependent.

The optional confidence attribute specifies a confidence level. The recognition service must reject any recognition result with a confidence less than the confidence level. The confidence level must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.

The optional sensitivity attribute specifies how sensitive the recognition system should be to noise. The recognition service must treat a higher value as a request to be more sensitive to noise. The sensitivity must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.

The optional speedvsaccuracy attribute specifies how much the recognition system should prioritize a speedy, low-latency result versus the most accurate recognition. The recognition service must treat a higher value as a request for a more accurate result and a lower value as a request for a faster response. The speedvsaccuracy value must be a float between 0.0 and 1.0 inclusive and must have a default value of 0.5.
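The confidence, sensitivity, and speedvsaccuracy attributes share the same value space. A minimal sketch of how a user agent might normalize such an attribute (the helper name is hypothetical, and treating out-of-range values as malformed rather than clamping them is an assumption):

```javascript
// Normalize a [0.0, 1.0] speech attribute value. Absent or malformed
// values fall back to the default of 0.5, as the prose above requires.
// Out-of-range values are treated as malformed here (an assumption;
// a user agent could instead choose to clamp them).
function normalizeUnitAttr(value) {
  if (value === null || value === undefined) return 0.5;
  var n = parseFloat(value);
  if (isNaN(n) || n < 0.0 || n > 1.0) return 0.5;
  return n;
}
```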

The speechmatch event must be dispatched when a set of complete and valid utterances have been matched. This event must bubble and be cancelable. A complete utterance ends when the implementation detects end-of-speech or if the stopSpeechInput() method was invoked. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechmatch result returned and this must end the speech recognition.

The default action associated with speechmatch may differ depending on the element with which the speech is associated. Often this results in a value being set or a selection being made, based on the value of the autofill attribute and the corresponding most likely interpretation or utterance.

Some implementations may dispatch the change event for elements when their value changes. When the new value was obtained as the result of a speech input session, such implementations must dispatch the speechmatch event prior to the change event.

The speecherror event must be dispatched when the active speech input session results in an error. This error may result from the user agent denying the speech session, or parts of it, due to security or privacy concerns; from an error in the web author's specification of the speech request; or from an error in the recognition system. This event must bubble and be cancelable.

The speechnomatch event must be raised when a complete utterance has failed to match the active grammars, or has matched only with a confidence less than the specified confidence value. This event must bubble and be cancelable. A complete utterance ends when the implementation detects end-of-speech or when the stopSpeechInput() method is invoked. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechnomatch event returned, and this must end the speech recognition.

The speechnoinput event must be raised when the recognizer has detected no speech and the speechtimeout has expired. This event must bubble and be cancelable. Note that if the recognition was of type SIMPLERECO or COMPLEXRECO there must be only one speechnoinput returned and this must end the speech recognition.

The speechstart event must be raised when the recognition service detects that a user has started speaking. This event must bubble and be cancelable. This event must not be raised if the speechtype was SIMPLERECO but must be generated if the speechtype is either COMPLEXRECO or CONTINUOUSRECO.

The speechend event must be raised when the recognition service detects that a user has stopped speaking. This event must bubble and be cancelable. This event must not be raised if the speechtype was SIMPLERECO but must be generated if the speechtype is either COMPLEXRECO or CONTINUOUSRECO.

When the startSpeechInput() method is invoked, a speech recognition turn must be started. It is an error to call startSpeechInput() when speechrecostate is anything but READY, and a user agent must raise a speecherror should this occur. Note that user agents should have privacy and security settings that specify whether scripted speech should be allowed, and may prevent startSpeechInput() from succeeding when called on certain elements, applications, or sessions. If a recognition turn cannot be begun with the recognition service, a speecherror event must be raised.

When the stopSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must gracefully stop the session as if end-of-speech was detected. The user agent must perform speech recognition on audio that has already been recorded and the relevant events must be fired if necessary. If there was no active speech input session, if this element did not initiate the active speech input session or if end-of-speech was already detected, this method must return without doing anything.

When the cancelSpeechInput() method is invoked, if there was an active speech input session and this element had initiated it, the user agent must abort the session, discard any pending/buffered audio data and fire no events for the pending data. If there was no active speech input session or if this element did not initiate it, this method must return without doing anything.

When the emulateSpeechInput(DOMString input) method is invoked, the speech service must treat the input string as the text of the utterance that was spoken and return the resulting recognition.

The capture attribute is a read-only stream that accumulates the user's speech as it occurs. If this stream is uploaded to a recognition service in either an XMLHttpRequest or WebSocket request, its contents should be updated as more speech becomes available.

The speechrecostate read-only variable tracks the state of the recognition. Initially the state is READY, meaning the element is ready to start a speech interaction. Upon starting a recognition turn the state changes to LISTENING. Once the system has stopped capturing from the user, either because the speech system detected the end of speech or because the web application called stopSpeechInput(), but before the results have been returned, the system is in the WAITING state. Once the system has received the results the state returns to READY.
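The lifecycle above can be sketched as a small state machine (illustrative only; the real states live on the element's speechrecostate attribute, and the factory name is an assumption):

```javascript
// Sketch of the speechrecostate lifecycle:
//   READY -> LISTENING  (startSpeechInput)
//   LISTENING -> WAITING (end-of-speech or stopSpeechInput)
//   WAITING -> READY     (results received)
function createRecoStateMachine() {
  var state = "READY";
  return {
    get state() { return state; },
    startSpeechInput: function () {
      // Per the prose, starting in any state but READY is an error.
      if (state !== "READY") throw new Error("speecherror: not READY");
      state = "LISTENING";
    },
    stopSpeechInput: function () {
      if (state === "LISTENING") state = "WAITING";
    },
    resultsReceived: function () {
      if (state === "WAITING") state = "READY";
    }
  };
}
```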

The result object & collection are the same as the API in section 7.

8.2 Speech Synthesis Markup

The concept of a media-player style of experience, like that of <audio> or <video>, doesn't map well to speech synthesis scenarios.

Although speech synthesis does produce audio output, its use cases are rather different from those of the HTMLMediaElement subclasses <audio> and <video>. Listening to music or watching a movie tends to be a relatively passive experience, where the media is a central purpose of the app: the media is the content. Contrast this with speech synthesis, which tends to be used as a UI component that works with other UI components to help a user interact with an app. The media isn't the content; it's just part of the UI used to access the content. So while there are some common semantics between speech synthesis and media playback APIs (such as volume, rate, play, and pause controls), there are many differences. Synthesis apps tend to be very reactive, with the media generated in response to user action (it's synthesis, not recording). And concepts like <source> and <track> have no natural analogy in synthesis applications.

The design presented here borrows from HTMLMediaElement for consistency where it makes sense to do so. But it is not a subclass of that interface, since many of the inherited semantics and usage patterns would be of peripheral value, or outright confusing, with TTS.


  interface HTMLTTSElement : HTMLElement {

    // error handling
    attribute Function onspeecherror(in SpeechError error);

    // service configuration
    void SetSpeechService(DOMString url, DOMString? lang, DOMString? parameters);
    void SetSpeechService(DOMString url, DOMString user, DOMString password, DOMString? lang, DOMString? params);
    void SetSpeechService(DOMString url, DOMString authHeader, Function onCustomAuth, DOMString? lang, DOMString? params);
    void SetCustomAuth(DOMString authValue);
    attribute DOMString speechservice;
    attribute DOMString speechparams;
    attribute DOMString authHeader;

    // content specification
    attribute DOMString src;
    attribute DOMString content;

    // audio buffering state
    const unsigned short BUFFER_EMPTY = 0;
    const unsigned short BUFFER_WAITING = 1;
    const unsigned short BUFFER_LOADING = 2;
    const unsigned short BUFFER_COMPLETE = 3;
    readonly attribute unsigned short readyState;
      
    attribute boolean preload;
    readonly attribute TimeRanges timeBuffered;
    readonly attribute Stream audioBuffer;

    // playback controls
    void play();
    void pause();
    void cancel();

    // playback state
    const unsigned short PLAYBACK_EMPTY = 0;
    const unsigned short PLAYBACK_PAUSED = 1;
    const unsigned short PLAYBACK_PLAYING = 2;
    const unsigned short PLAYBACK_COMPLETE = 3;
    const unsigned short PLAYBACK_STALLED = 4;
    readonly attribute unsigned short playbackState;
         
    attribute double rate;
    attribute unsigned short volume;
    attribute double currentTime;
    attribute DOMString mark;

    attribute Function onmark(in DOMString mark);
    attribute Function onplay();
    attribute Function onpause();
    attribute Function oncomplete();
    attribute Function oncancel();
};

Instantiation

The interface is called HTMLTTSElement and is represented in the markup as the <tts> element.

Error Handling

The speecherror event must be dispatched when an error occurs. See 7.4 SpeechErrorInterface.

Content Specification

Either raw text (DOMString uses UTF-16) or SSML can be synthesized. Applications provide content to be synthesized by setting the src attribute to a URL from which the synthesizer service can fetch the content, or by assigning the content directly to the content attribute. Setting one of these attributes resets the other to an empty string.
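The mutual-reset behavior of src and content can be sketched with a plain object standing in for an HTMLTTSElement (illustrative only; the factory name is an assumption, not part of the proposal):

```javascript
// Sketch of the rule "setting one of these attributes resets the
// other to an empty string", using accessor properties.
function createTtsContentHolder() {
  var src = "", content = "";
  return {
    get src() { return src; },
    set src(v) { src = v; content = ""; },      // src set -> content cleared
    get content() { return content; },
    set content(v) { content = v; src = ""; }   // content set -> src cleared
  };
}
```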

Audio Buffering

Synthesized audio is buffered, along with timing information for mark events. In typical use, an application will specify the content to be rendered, then call play(), at which point the synthesizer service will fetch the content and synthesize it to audio, which is buffered by the user agent (along with timing info for mark events as they occur), and played to the audio output device.

If the preload attribute is set, the synthesizer service will be invoked to begin synthesizing content into audio as soon as content is specified. Otherwise, if preload isn't set, the service won't be invoked until play() is called.

Applications can query the readyState attribute to determine the status of the buffer. For example, when loading a lengthy sequence of audio, an application may display some sort of feedback to the user, or disable some of the interaction controls, until the buffer is ready. There are four readyState values:

BUFFER_EMPTY (numeric value 0)
indicates there is no data in the buffer, and the service has not been invoked to provide any data.
BUFFER_WAITING (numeric value 1)
indicates the service has been invoked, but no data has been received yet.
BUFFER_LOADING (numeric value 2)
indicates data is being received and is available in the buffer. Applications can also check the timeBuffered attribute to track progress.
BUFFER_COMPLETE (numeric value 3)
indicates that all data has been received and placed in the buffer.

Some applications may also want to use the raw synthesized audio for other purposes. Provided the readyState is BUFFER_COMPLETE, applications can fetch the raw audio data from the audioBuffer attribute.

Playback Controls

Playback is initiated or resumed by calling play(), and paused by calling pause(). The playbackState attribute is useful for applications to coordinate playback state with the rest of the application. playbackState has five values:

PLAYBACK_EMPTY (numeric value 0)
indicates that playback cannot occur because there is no audio in the buffer.
PLAYBACK_PAUSED (numeric value 1)
indicates the playback is paused.
PLAYBACK_PLAYING (numeric value 2)
indicates the audio buffer is being played to the speaker.
PLAYBACK_COMPLETE (numeric value 3)
indicates playback has reached the end of the buffer, and the buffer is complete.
PLAYBACK_STALLED (numeric value 4)
indicates playback has reached the end of the buffer, but the buffer is incomplete. Once there is more audio in the buffer, playback will automatically resume.

The current position of playback can be determined either by checking the currentTime attribute, which returns the time since the beginning of the buffer, in milliseconds; or by checking the mark attribute, which contains the label of the last mark reached in the SSML prior to the current audio position. The same attributes can be used to seek to specific positions. For example, setting the mark attribute to the label value of a mark in the SSML content should move playback to the corresponding position in the audio buffer.
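The relationship between currentTime and mark can be illustrated with a hypothetical helper that, given the mark timings collected during buffering, returns the label of the last mark at or before a playback position (the function name and data shape are assumptions for illustration):

```javascript
// marks: array of { label, time } entries, sorted by time (ms), as
// they might be collected while the synthesizer streams audio.
// Returns the label of the last mark reached at currentTime, or
// null if no mark has been reached yet.
function lastMarkReached(marks, currentTime) {
  var result = null;
  for (var i = 0; i < marks.length; i++) {
    if (marks[i].time <= currentTime) result = marks[i].label;
    else break; // sorted input: later marks are all in the future
  }
  return result;
}
```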

Playback speed is controlled by setting rate, which has a default value of 1.0. Similarly, volume controls the audio amplitude, and can be set to any value between 0 and 100, with 50 as the default.

To cease all playback, clear the buffer, and cancel synthesis, call the cancel() function.

An application can respond to other key events:

  1. mark: raised when the audio position corresponds to an SSML mark. The particular mark that was reached can be determined by the argument to mark or by checking the mark attribute.
  2. play: raised whenever the object begins or resumes sending audio to the speaker.
  3. pause: raised whenever audio output is paused (typically in response to pause() being called).
  4. complete: raised when the end of the audio buffer has been output to the speaker, and no more audio is left to be synthesized.
  5. cancel: raised when the synthesis session has been cancelled, which means any outstanding transaction with the underlying service is discarded, and the buffer is emptied without being played.

8.3 Speech Service Specification

The user agent must provide a default speech service for both recognition and speech synthesis. However, it must be possible to use speech services other than the default, due to the varying needs of applications, the varying capabilities of recognizers and synthesizers, the desire to protect application-specific intellectual property relating to voices, grammars, or other technology, or the desire to provide a consistent user experience across different user agents.

To use a specific service, the application calls SetSpeechService(url, lang, params). The url parameter specifies the service to be invoked for either recognition or synthesis (see also 6.4 HTTP Conventions). The optional lang parameter can be used to specify the language the application desires. The service must use the language specified in the standard content (such as in SRGS or SSML) if one is specified. If the content doesn't specify a language and the lang parameter is omitted or blank, then the service uses its own proprietary logic (such as a default setting, part of the service URL, or examination of the content). If the lang parameter is specified, it must be in standard language-code format. If the service is unable to use the language specified, an error must be raised. Many services also accept additional proprietary parameters that govern their operation. These can be supplied with the optional params parameter, in the form of a string of URL-encoded name-value pairs.
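The language-selection precedence described above (content-specified language first, then the lang parameter, then service-proprietary logic) can be sketched as follows (hypothetical helper, not part of the proposal):

```javascript
// Resolve the language a service should use:
// 1. a language specified in the content itself (SRGS or SSML xml:lang)
// 2. the lang parameter passed to SetSpeechService()
// 3. null, meaning the service falls back to its own proprietary logic
function resolveLanguage(contentLang, paramLang) {
  if (contentLang) return contentLang;
  if (paramLang && paramLang.trim() !== "") return paramLang;
  return null; // service decides
}
```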

Some services will require authentication using standard HTTP challenge/response. When this happens, the user agent should invoke its regular authentication logic. Often this will result in a dialog box prompting the user for a username and password. In many cases this is not a desirable user experience for the application, and it can be avoided by providing the username and password beforehand as parameters, by calling SetSpeechService(url, user, password, lang, params).

Some services will use proprietary authentication schemes that require placing a special value in a proprietary header. To specify such a service, applications should call SetSpeechService(url, authHeader, onCustomAuth, lang, parameters), where authHeader is the name of the proprietary header, and onCustomAuth is a function provided by the application. In this case, when the user agent invokes the service, it should call the application function provided in onCustomAuth and then wait until the application calls SetCustomAuth(authValue), where authValue is the value to be assigned to the custom auth header, at which point the user agent should go ahead and invoke the service. The reason for this call-back and wait logic is because a common web service authentication pattern involves using the current clock time as an input to the authentication token.
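The call-back-and-wait handshake can be sketched with a mocked user agent (everything except the SetCustomAuth and onCustomAuth names is an assumption for illustration):

```javascript
// Sketch of the custom-auth handshake: the "user agent" calls the
// application's onCustomAuth function, then waits until the
// application supplies a token via SetCustomAuth before invoking
// the service with the proprietary header set.
function createAuthGate(authHeader, onCustomAuth, invokeService) {
  return {
    // User agent side: needs to contact the service, so it asks the
    // application for a fresh (possibly clock-derived) token.
    requestService: function () { onCustomAuth(); },
    // Application side: supplies the token value; the user agent then
    // goes ahead and invokes the service.
    SetCustomAuth: function (authValue) {
      var headers = {};
      headers[authHeader] = authValue;
      invokeService(headers);
    }
  };
}
```

The asynchrony matters because, as noted above, a common pattern derives the token from the current clock time, so it must be generated at the moment of invocation rather than once up front.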

The optional speechservice attribute is a URL for the recognition service to be used. This attribute can be set either directly or through the SetSpeechService() method calls. The recognition service could be local to the user agent or a remote service separate from the user agent. The user agent must attempt to use the specified recognition service. If the user agent cannot use the specified recognition service, it must notify the web application by raising a speecherror event, and may attempt to use a different recognition service. Note that user agents should have privacy and security settings that specify whether speech should be allowed to be delivered to a specific recognition service, and may prevent certain speech services from being used. Note also that, because running a high-quality speech service is difficult, many applications are expected to use networked recognition services at a different domain than the web application, and user agents should enable this for trusted recognition services. If this attribute is not present, the user agent must use its default recognition service.

The optional speechparams attribute is designed to take extensible and custom parameters particular to a given speech service. This attribute can be set either directly or as a result of the SetSpeechService() method. For instance, one recognition service may require account information for authorization, while a different recognition service may allow recognizer-specific tuning parameters, acoustic models, or context blocks to be specified. Because there may be many such parameters, speechparams must be specified as a set of name=value pairs in a URI-encoded query-parameter string, which may use either ampersands or semicolons as pair separators.
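A sketch of parsing such a speechparams string into name/value pairs, accepting either separator (the helper name and the choice to skip malformed pairs are assumptions):

```javascript
// Parse "a=1&b=2" or "a=1;b=2" (names and values URI-encoded)
// into a plain object of decoded name/value pairs.
function parseSpeechParams(params) {
  var result = {};
  if (!params) return result;
  var pairs = params.split(/[&;]/);
  for (var i = 0; i < pairs.length; i++) {
    if (pairs[i] === "") continue;
    var eq = pairs[i].indexOf("=");
    if (eq < 0) continue; // skip malformed pairs (an assumption)
    var name = decodeURIComponent(pairs[i].slice(0, eq));
    var value = decodeURIComponent(pairs[i].slice(eq + 1));
    result[name] = value;
  }
  return result;
}
```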

The optional authHeader attribute is used to set the authorization header to be used with the speech service. It can be set either directly or through the SetSpeechService() method.

8.4 Backwards compatibility

A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "SpeechInput" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.

Implementations that don't support speech input will ignore the additional attributes and events defined in this module and the HTML elements with these attributes will continue to work with other forms of input.

8.5 Code Samples for (Future) Markup Enhancements

This example illustrates how speech markup could be used to implement a web search page that uses speech.



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html; charset=us-ascii" http-equiv="content-type" />
    <title>Bing</title>
  </head>

  <body>
    <form action="/search" id="sb_form" name="sb_form">
      <input class="sw_qbox" id="sb_form_q" name="q" title="Enter your search term" type="text" value="" >
        <reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.bing.com/speech" onspeechmatch="document.sb_form.submit()" />
      </input>
      <input class="sw_qbtn" id="sb_form_go" name="go" tabindex="0" title="Search" type="submit" value="" />
      <input name="form" type="hidden" value="QBLH" />
    </form>
  </body>
</html>


  

This example illustrates how speech markup could be used to implement a simple flight booking form.



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html; charset=us-ascii" http-equiv="content-type" />
    <title>Contso Flight Booking</title>
  </head>

  <body>
    <script>
      function submit_page_if_full () {
          if (
              document.getElementById("where_from").value != "" &&
              document.getElementById("where_to").value != "" &&
              document.getElementById("departing").value != "" &&
              document.getElementById("returning").value != "" 
             ) {
              document.flight_form.submit();
          }
      }

      window.addEventListener("load", function () {
          document.getElementById("flight_form").addEventListener("speechmatch", submit_page_if_full, false);
      }, false);
    </script>
    <form action="/search" id="flight_form" name="flight_form">
      <input id="where_from" name="where_from" title="Where from?" type="text" value="">
        <reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
      </input>
      <input id="where_to" name="where_to" title="Where to?" type="text" value="">
        <reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
      </input>
      <input id="departing" name="departing" title="Departing" type="datetime-local" value="">
        <reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
      </input>
      <input id="returning" name="returning" title="Returning" type="datetime-local" value="">
        <reco speechtype="0" autofill="2" speechonfocus="true" speechservice="http://www.contso.com/speech" />
      </input>
      <input id="flight_form_go" name="go" title="Search" type="submit" value="" />
    </form>
  </body>
</html>

  

9 Evaluation Against Scenarios, Use-Cases and Requirements

There are three sub-proposals in this specification, each of which is cumulative on the previous, and each of which is scored against the Scenario Examples, and against the HTML Speech XG Use Cases and Requirements:

  1. Basic Extensions to Existing HTML Designs.
  2. Speech Object Interfaces, and assuming the existence of the Basic Extensions.
  3. (Future) Markup Enhancements, included for discussion purposes.

A "four star" score is used to evaluate each sub-proposal against the requirements or scenarios:

This section outlines the use cases and requirements that are covered by this specification.

Each row below lists a requirement followed by three scores, in order: Basic Extensions to existing HTML interfaces, Speech Object Interfaces, and (Future) Markup Enhancements.
FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking). ***- ***- ***-
FPR4. It should be possible for the web application to get the recognition results in a standard format such as EMMA. **** **** ****
FPR24. The web app should be notified when recognition results are available. **** **** ****
FPR50. Web applications must not be prevented from integrating input from multiple modalities. **** **** ****
FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process. **** **** ****
FPR52. The web app should be notified when TTS playback finishes. ***- **** ****
FPR60. Web application must be able to programmatically abort TTS output. **** **** ****
FPR38. Web application must be able to specify language of recognition. ***- **** ****
FPR45. Applications should be able to specify the grammars (or lack thereof) separately for each recognition. ***- **** ****
FPR1. Web applications must not capture audio without the user's consent. **** **** ****
FPR19. User-initiated speech input should be possible. **** **** ****
FPR21. The web app should be notified that capture starts. **** **** ****
FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition. **** **** ****
FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition. **** **** ****
FPR25. Implementations should be allowed to start processing captured audio before the capture completes. **** **** ****
FPR26. The API to do recognition should not introduce unneeded latency. **** **** ****
FPR34. Web application must be able to specify domain specific custom grammars. ***- **** ****
FPR35. Web application must be notified when speech recognition errors or non-matches occur. ***- **** ****
FPR42. It should be possible for user agents to allow hands-free speech input. **** **** ****
FPR48. Web application author must be able to specify a domain specific statistical language model. ***- ***- ***-
FPR54. Web apps should be able to customize all aspects of the user interface for speech recognition, except where such customizations conflict with security and privacy requirements in this document, or where they cause other security or privacy problems. **** **** ****
FPR51. The web app should be notified when TTS playback starts. ***- **** ****
FPR53. The web app should be notified when the audio corresponding to a TTS element is played back. **-- **** ****
FPR5. It should be easy for web apps to get access to the most common pieces of recognition results such as utterance, confidence, and nbests. ***- **** ****
FPR39. Web application must be able to be notified when the selected language is not available. ***- **** ****
FPR13. It should be easy to assign recognition results to a single input field. ***- **** ****
FPR14. It should not be required to fill an input field every time there is a recognition result. **** **** ****
FPR15. It should be possible to use recognition results to fill multiple input fields. **** **** ****
FPR16. User consent should be informed consent. **** **** ****
FPR18. It must be possible for the user to revoke consent. **** **** ****
FPR11. If the web apps specify speech services, it should be possible to specify parameters. **** **** ****
FPR12. Speech services that can be specified by web apps must include network speech services. **** **** ****
FPR2. Implementations must support the XML format of SRGS and must support SISR. ***- **** ****
FPR27. Speech recognition implementations should be allowed to add implementation specific information to speech recognition results. **** **** ****
FPR3. Implementation must support SSML. ***- **** ****
FPR46. Web apps should be able to specify which voice is used for TTS. ***- **** ****
FPR7. Web apps should be able to request speech service different from default. **** **** ****
FPR9. If browser refuses to use the web application requested speech service, it must inform the web app. **** **** ****
FPR17. While capture is happening, there must be an obvious way for the user to abort the capture and recognition process. **** **** ****
FPR37. Web application should be given captured audio access only after explicit consent from the user. **** **** ****
FPR49. End users need a clear indication whenever microphone is listening to the user. **** **** ****
FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. ---- ---- ----
FPR28. Speech recognition implementations should be allowed to fire implementation specific events. *--- *--- *---
FPR41. It should be easy to extend the standard without affecting existing speech applications. **** **** ****
FPR36. User agents must provide a default interface to control speech recognition. *--- **** ****
FPR44. Recognition without specifying a grammar should be possible. **** **** ****
FPR61. Aborting the TTS output should be efficient. ***- ***- ***-
FPR32. Speech services that can be specified by web apps must include local speech services. *--- **** ****
FPR47. When speech input is used to provide input to a web app, it should be possible for the user to select alternative input methods. **** **** ****
FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). ***- **** ****
FPR30. Web applications must be allowed at least one form of communication with a particular speech service that is supported in all UAs. **** **** ****
FPR55. Web application must be able to encrypt communications to remote speech service.