HTML Speech Incubator Group Final Report

1 Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [IETF RFC 2119].

2 Overview

This document presents the deliverables of the HTML Speech Incubator Group. First, it presents the requirements developed by the group, ordered by the group members' priority of interest. Next, it briefly describes and points to the major individual proposals submitted to the group as proof-of-concept examples, which helped the group understand both the possibilities and the tradeoffs. It then presents design possibilities on important topics, giving decisions where the group reached consensus and alternatives where strongly differing opinions existed, with a focus on satisfying the high-interest requirements. Finally, the document contains all or part of a proposed solution that addresses the high-interest requirements and the design decisions.

This document records the major steps the group took in working towards API recommendations, not just the final decisions, so that any future standards-track effort can understand the motivations behind the recommendations. Thus, even if a final standards-track document differs from the API recommendations in this document, the final standard should address the requirements and design decisions laid out by this Incubator Group.

3 Deliverables

According to the charter, the group is to produce one deliverable, this document. The charter goes on to state that the document may include:

Requirements
Use cases
Change requests to HTML5 and, as appropriate, other specifications, e.g., capture API, CSS, Audio XG, EMMA, SRGS, VoiceXML 3

The group has developed requirements, some with use cases, and has made progress towards one or more API proposals that are effectively change requests to other existing standard specifications. These subdeliverables follow.

3.1 Prioritized Requirements

The HTML Speech Incubator Group developed and prioritized requirements as described in the Requirements and use cases document. A summary of the results is presented below with requirements listed in priority order, and segmented into those with strong interest, those with moderate interest, and those with mild interest. Each requirement is linked to its description in the requirements document.
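As an illustration of the segmentation, the thresholds given in Sections 3.1.1 to 3.1.3 (at least 80%, at least 50% but less than 80%, and less than 50%) can be expressed as a small function. This is only a sketch: the function name `interestLevel` and its inputs (a count of members in favor and the group size) are assumptions made for this example, and the report does not describe how the votes were actually tallied.

```javascript
// Hypothetical helper; the name and the input shape are assumptions.
// Integer comparisons avoid floating-point rounding at the boundaries.
function interestLevel(votesFor, groupSize) {
  if (groupSize <= 0) throw new RangeError("groupSize must be positive");
  if (votesFor * 5 >= groupSize * 4) return "strong";   // at least 80%
  if (votesFor * 2 >= groupSize) return "moderate";     // at least 50%, under 80%
  return "mild";                                        // under 50%
}

console.log(interestLevel(8, 10)); // strong
console.log(interestLevel(5, 10)); // moderate
console.log(interestLevel(4, 10)); // mild
```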

3.1.1 Strong Interest

A requirement was classified as having "strong interest" if at least 80% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.2 Moderate Interest

A requirement was classified as having "moderate interest" if less than 80% but at least 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.1.3 Mild Interest

A requirement was classified as having "mild interest" if less than 50% of the group believed it needs to be addressed by any specification developed based on the work of this group. These requirements are:

3.2 Individual Proposals

The following individual proposals were sent in to the group to help drive discussion.

From Google, a speech input API with a modification and a text-to-speech (TTS) proposal.
From Mozilla, a speech input proposal.
From Microsoft, a speech and TTS proposal.
From Voxeo, a description of Voxeo's JavaScript Tropo API.

3.3 Solution Design Agreements and Alternatives

This section attempts to capture the major design decisions the group made. Where substantial disagreements existed, the relevant alternatives are presented rather than a decision. Text was added to this section only if it represented either group consensus or an accurate description of a specific alternative, as appropriate.

3.3.1 General Design Decisions

There are three aspects to the solution which must be addressed: communication with and control of speech services, a script-level API, and markup-level hooks and capabilities.
The script API will be JavaScript.
The scripting API is the primary focus, with all key functionality available via scripting. Any HTML markup capabilities, if present, will be based completely on the scripting capabilities.
Notifications from the user agent to the web application should be in the form of JavaScript events/callbacks.
For automatic speech recognition (ASR), there must be at least these three logical functions:
1. start speech input and start processing
2. stop speech input and get result
3. cancel (stop speech input and ignore result)
For TTS, there must be at least these two logical functions:
1. play
2. pause
There is agreement that it should be possible to stop playback, but there is no agreement on the need for an explicit stop function.
It must be possible for a web application to specify the speech engine.
Speech service implementations must be referenceable by URI.
It must be possible to reference ASR grammars by URI.
It must be possible to select the ASR language using language tags.
It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
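
The logical functions and selection requirements listed above can be sketched as follows. This is a minimal illustration, not a proposed API: the class names `RecognizerSketch` and `SynthesizerSketch`, the option names `serviceUri`, `grammarUri`, and `lang`, the `feed` method that stands in for audio capture, and the example URI are all assumptions made for this sketch and do not come from any of the proposals.

```javascript
// Hypothetical sketch; none of these names are defined by the report.
class RecognizerSketch {
  constructor({ serviceUri, grammarUri = null, lang = "en-US" } = {}) {
    this.serviceUri = serviceUri; // speech service referenced by URI
    this.grammarUri = grammarUri; // null models "grammar left unspecified"
    this.lang = lang;             // language tag, e.g. "en-US"
    this.state = "idle";
    this.captured = [];
  }

  // 1. Start speech input and start processing.
  start() {
    if (this.state !== "idle") throw new Error("already listening");
    this.state = "listening";
    this.captured = [];
  }

  // Stand-in for audio capture so the sketch runs without a microphone.
  feed(chunk) {
    if (this.state === "listening") this.captured.push(chunk);
  }

  // 2. Stop speech input and get the result.
  stop() {
    if (this.state !== "listening") throw new Error("not listening");
    this.state = "idle";
    return { lang: this.lang, transcript: this.captured.join(" ") };
  }

  // 3. Cancel: stop speech input and ignore the result.
  cancel() {
    this.state = "idle";
    this.captured = [];
  }
}

// TTS with only the two agreed functions; an explicit stop is left out
// because the group did not agree that one is needed.
class SynthesizerSketch {
  constructor() {
    this.state = "paused";
  }
  play() {
    this.state = "playing";
  }
  pause() {
    this.state = "paused";
  }
}

const asr = new RecognizerSketch({
  serviceUri: "https://speech.example/asr", // placeholder URI
  lang: "en-GB",
});
asr.start();
asr.feed("hello");
asr.feed("world");
const result = asr.stop();
console.log(result.transcript); // hello world

const tts = new SynthesizerSketch();
tts.play();
tts.pause();
console.log(tts.state); // paused
```

In this sketch, leaving `grammarUri` as `null` represents the unspecified-grammar case; because the report leaves the behavior in that case undefined, the sketch does not model it.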

3.3.2 Speech Service Communication and Control Design Decisions

This is where design decisions regarding control of and communication with remote speech services, including media negotiation and control, will be recorded.

3.3.3 Script API Design Decisions

This is where design decisions regarding the script API capabilities and realization will be recorded.

It must be possible to define at least the following handlers (names TBD):
- onspeechstart (not yet clear precisely what start of speech means)
- onspeechend (not yet clear precisely what end of speech means)
- onerror (one or more handlers for errors)
- a handler for when the recognition result is available
Note: significant work is needed to get interoperability here.
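
To make the handler list concrete, the following sketch shows how a web application might register the handlers as JavaScript callbacks. It is illustrative only: the class `RecognitionEventsSketch`, its `fire` and `simulate` methods, the placeholder handler name `onresult` (the report leaves the result handler unnamed), and the `no-match` error code are assumptions made for this example. The order in which the events fire here is also an assumption, since the meaning of start and end of speech is not yet settled.

```javascript
// Hypothetical sketch; "onresult" and "no-match" are placeholder names.
class RecognitionEventsSketch {
  constructor() {
    this.onspeechstart = null;
    this.onspeechend = null;
    this.onerror = null;
    this.onresult = null;
  }

  // Invoke a handler only if the application registered one.
  fire(name, payload) {
    const handler = this["on" + name];
    if (typeof handler === "function") handler(payload);
  }

  // Simulate one utterance; a real user agent would drive these
  // notifications from captured audio and the speech service.
  simulate(transcript) {
    this.fire("speechstart");
    this.fire("speechend");
    if (transcript === null) {
      this.fire("error", { code: "no-match" });
    } else {
      this.fire("result", { transcript });
    }
  }
}

const eventLog = [];
const events = new RecognitionEventsSketch();
events.onspeechstart = () => eventLog.push("speechstart");
events.onspeechend = () => eventLog.push("speechend");
events.onresult = (r) => eventLog.push("result:" + r.transcript);
events.onerror = (e) => eventLog.push("error:" + e.code);

events.simulate("call home"); // successful recognition
events.simulate(null);        // recognition failure
console.log(eventLog.join(", "));
```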

3.3.4 Markup API Design Decisions

This is where design decisions regarding the markup changes and/or enhancements will be recorded.

3.4 Proposed Solution

TBD after we make substantial progress on the design decisions.

W3C Note 19 April 2011
