HTML Speech XG
Proposed Protocol Approach

Draft Version 3, 6th July, 2011

This version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Latest version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Previous version:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0065/speech-protocol-basic-approach-02.html
Editor:
Robert Brown, Microsoft
Contributors:
Milan Young, Nuance
Michael Johnston, AT&T
Patrick Ehlen, AT&T
And other contributions from HTML Speech XG participants, http://www.w3.org/2005/Incubator/htmlspeech/

Abstract

The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2]. However, since the use cases for HTML Speech applications are in some places considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not simply a transcription of MRCP: it shares some design concepts, but simplifies some details and adds others. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.

Status of this Document

This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.

Contents

  1. Architecture
  2. Definitions
  3. Protocol Basics
    1. Session Establishment
    2. Signaling
      1. Generic Headers
      2. Request Messages
      3. Status Messages
      4. Event Messages
    3. Media Transmission
  4. General Capabilities
    1. Getting and Setting Parameters
    2. Interim Events
    3. Requestless Notifications
  5. Recognition
    1. Recognition Requests
    2. Recognition Events
    3. Recognition Headers
    4. Predefined Grammars
    5. Recognition Examples
  6. Synthesis
    1. Synthesis Requests
    2. Synthesis Events
    3. Synthesis Headers
    4. Synthesis Examples
  7. References

1. Architecture


             Client
|-----------------------------|
|       HTML Application      |                                            Server
|-----------------------------|                                 |--------------------------|
|       HTML Speech API       |                                 | Synthesizer | Recognizer |
|-----------------------------|                                 |--------------------------|
| HTML-Speech Protocol Client |---html-speech/1.0 subprotocol---|     HTML-Speech Server   |
|-----------------------------|                                 |--------------------------|
|      WebSockets Client      |-------WebSockets protocol-------|     WebSockets Server    |
|-----------------------------|                                 |--------------------------|

TODO: Write up design concept with some example sequence diagrams to illustrate how the protocol would work. This shouldn't just be a repeat of MRCP examples. It should also show some of the things that could happen in an HTML application that wouldn't happen in an IVR application. *Especially* synthesis. For example: synthesizing multiple prompts in parallel for playback in the UA when the app needs them, rather than synthesis occurring on a prompt queue; or recognition from both an audio stream and a gesture stream; or recognition from two streams with alternate encodings (one of which is decodable to be listened to by humans, the other specifically encoded for the recognizer); or recognition from multiple beams of an array mic, with multiple simultaneous users.

2. Definitions

Recognizer

A Recognizer performs speech recognition, with the following characteristics:

  1. Support for one or more spoken languages and acoustic scenarios.
  2. Processing of one or more input streams. The typical scenario consists of a single stream of encoded audio. But some scenarios will involve multiple audio streams, such as multiple beams from an array microphone picking up different speakers in a room; or streams of multimodal input such as gesture or motion, in addition to speech.
  3. Support for multiple simultaneous grammars/language models, including but not limited to [SRGS].
  4. Support for continuous recognition, generating match events, and other events, as appropriate.
  5. Support for at least one "dictation" language model, enabling essentially unconstrained spoken input by the user.

Because continuous recognition plays an important role in HTML Speech scenarios, a Recognizer is a resource that essentially acts as a filter on its input streams. Its grammars/language models can be specified and changed at any time, as needed by the application, and the recognizer adapts its processing accordingly. Single-shot recognition (e.g. a user on a web search page presses a button and utters a single web-search query) is a special case of this general pattern, where the application specifies its model once, and is only interested in one match event, after which it stops sending audio (if it hasn't already).

"Recognizers" are not strictly required to perform speech recognition, and may perform additional or alternative functions, such as speaker verification, emotion detection, or audio recording.

Synthesizer

A Synthesizer generates audio streams from textual input, with the following characteristics:

  1. Support for [SSML] input, although specific implementations may support other formats for other audio-rendering purposes.
  2. Generation of interim events, such as those corresponding to SSML marks, with precise timing.

Because playback is done asynchronously by the user agent if and when the application deems appropriate, a Synthesizer resource does not provide any form of shuttle control (pausing or skipping) or volume control. Nor does it have any concept of queuing. It simply services each synthesis request it receives, rendering the audio at least as rapidly as would be needed to support real-time playback, and preferably faster. The client may make multiple simultaneous requests, which the server must service simultaneously, and for which it may stream the rendered audio simultaneously.

3. Protocol Basics

3.1 Session Establishment

The WebSockets session is established through the standard WebSockets HTTP handshake, with the client offering the "html-speech/1.0" sub-protocol in the Sec-WebSocket-Protocol header, and the server confirming it in its response.

For example:


C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
      Host: examplespeechservice.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: OIUSDGY67SDGLjkSD&g2 (for example)
      Sec-WebSocket-Version: 9
      Sec-WebSocket-Protocol: html-speech/1.0, x-proprietary-speech

S->C: HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
      Sec-WebSocket-Protocol: html-speech/1.0

Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, status messages, or (in the case of TTS) media.

NOTE: In MRCP, session negotiation also involves negotiating unique channel IDs (e.g. 128397521@recognizer) for the various resource types the client will need (recognizer, synthesizer, etc). In html-speech/1.0 this is unnecessary, since the WebSockets connection itself provides a unique shared context between the client and server, and resources are referred to directly by type, without the need for channel-IDs.

3.2 Signaling

The signaling design borrows its basic pattern from [MRCPv2], where there are three classes of control messages:

Requests
C->S requests from the UA to the service. The client requests a "method" from a particular remote speech resource. These methods are assumed to be defined as in [MRCPv2] except where noted.
Status Notifications
S->C general status notification messages from the service to the UA, marked as either PENDING, IN-PROGRESS or COMPLETE.
Events
S->C named events from the service to the UA, that are essentially special cases of 'IN-PROGRESS' request-state notifications.

The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.


control-message =   start-line ; i.e. use the typical MIME message format
                  *(header CRLF)
                    CRLF
                   [body]
start-line      =   request-line | status-line | event-line
header          =  <Standard MIME header format> ; actual headers depend on the type of message
body            =  *OCTET                        ; depends on the type of message
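
For example, a request message (here a hypothetical SET-PARAMS, with illustrative header values) consists of a start-line, zero or more headers, a blank line, and an optional body (omitted here):


C->S: html-speech/1.0 SET-PARAMS 11
      Resource-ID: recognizer
      Confidence-Threshold: 0.8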

3.2.1 Generic Headers


generic-header  =   accept
                | accept-charset
                | content-base
                | logging-tag
                | resource-id
                | vendor-specific
                | active-request-id-list
                | cache-control
                | channel-identifier
                | content-type
                | content-id
                | content-encoding
                | content-location
                | content-length
                | fetch-timeout
                | proxy-sync-id
                | set-cookie
                | set-cookie2

resource-id     = "Resource-ID:" ("recognizer" | "synthesizer" | vendor-resource)
vendor-resource = "x-" 1*UTFCHAR

accept          = <same as [MRCPv2]>
accept-charset  = <same as [MRCPv2]>
content-base    = <same as [MRCPv2]>
logging-tag     = <same as [MRCPv2]>
vendor-specific = <same as [MRCPv2]>

NOTE: This is mostly a strict subset of the [MRCPv2] generic headers, many of which have been omitted as either unnecessary or inappropriate for HTML speech client/server scenarios.

Resource-ID
The Resource-ID header is included in all signaling messages. In requests, it indicates the resource to which the request is directed. In status messages and events, it indicates the resource from which the message originated.
Accept
The Accept header is similar to its namesake in [MRCPv2]. It MAY be included in any message to indicate the content types that will be accepted by the sender of the message from the receiver of the message. When absent, the following defaults should be assumed: clients will accept "application/emma+xml" from recognizers; recognizers will accept "application/srgs+xml"; synthesizers will accept "application/ssml+xml".
Accept-Charset
The Accept-Charset header is similar to its namesake in [MRCPv2]. When absent, any charset may be used. This header has two general purposes: so the client can indicate the charset it will accept in recognition results; and so the synthesizer can indicate the charset it will accept for SSML documents.
Content-Base
The Content-Base header is similar to its namesake in [MRCPv2]. When a message contains an entity that includes relative URIs, Content-Base provides the absolute URI against which they are based.
Logging-Tag
The Logging-Tag header is similar to its namesake in [MRCPv2]. It is generally only used in requests, or in response to GET-PARAMS.
Vendor-Specific-Parameters
The Vendor-Specific-Parameters header is similar to its namesake in [MRCPv2].

3.2.2 Request Messages

Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets html-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have an integer value between 0 and 65535.

NOTE: This upper value is less than that used in MRCP, and was chosen so that request-ids can be associated with audio streams using a 16-bit field in binary audio messages.


request-line   = version SP method-name SP request-id CRLF
version        = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
method-name    = general-method | synth-method | reco-method | proprietary-method
request-id     = <Decimal integer between 0 and 65535>

NOTE: In MRCP, all messages include their message length, so that they can be framed in what is otherwise an open stream of data. In html-speech/1.0, framing is already provided by WebSockets, and message length is not needed, and therefore not included.

3.2.3 Status Messages

Status messages are sent by the server, to indicate the state of a request.


status-line   =  version SP request-id SP status-code SP request-state CRLF
status-code   =  3DIGIT       ; Specific codes TBD, but probably similar to those used in MRCP

; All communication from the server is labeled with a request state.
request-state = "COMPLETE"    ; Processing of the request has completed.
              | "IN-PROGRESS" ; The request is being fulfilled.
              | "PENDING"     ; Processing of the request has not begun.

TODO: Determine status code values.
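
For example, assuming MRCP-like status codes (the specific values are still TBD, per the note above), a status message indicating that request 8322 is being fulfilled might look like this:


S->C: html-speech/1.0 8322 200 IN-PROGRESS
      Resource-ID: recognizer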

3.2.4 Event Messages

Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results. They are essentially specialized status messages.


event-line    =  version SP event-name SP request-id SP request-state CRLF
event-name    =  synth-event | reco-event | proprietary-event
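
For example, an illustrative recognizer event (the header values are hypothetical):


S->C: html-speech/1.0 START-OF-INPUT 8322 IN-PROGRESS
      Resource-ID: recognizer
      Source-Time: 12753432512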

3.3 Media Transmission

HTML Speech applications feature a wide variety of audio transmission scenarios. The number of audio streams at any given time is not fixed. A recognizer may accept one or more input streams, which may start and end at any time as microphones or other input devices are activated/deactivated by the application or the user. Recognizers do not require their data in real-time, and will generally prefer to wait for delayed packets where a human listener would rather just skip the data and put it down to a "bad line". Applications may, and often will, request the synthesis of multiple SSML documents at the same time, for playback at the application's discretion. The synthesizer needs to return rendered data to the client rapidly (generally faster than real time), and MAY render multiple requests in parallel if it has the capacity to do so.

In html-speech/1.0, audio (or other media) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session used for the control messages.


audio-packet        =  binary-message-type
                       binary-request-id
                       binary-reserved
                       binary-data
binary-message-type =  OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-request-id   = 2OCTET ; Matches the request-id for a SPEAK or START-MEDIA-STREAM request
binary-reserved     =  OCTET ; Pads out to the 32-bit boundary for now, but may be useful later.
binary-data         = *OCTET

The binary-request-id field is used to correlate the audio stream with the corresponding SPEAK or START-MEDIA-STREAM request. It is a 16-bit unsigned integer.

The binary-message-type field has these defined values:

0x01: Audio
The message is an audio packet, and contains encoded audio data.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+
0x02: Skip
The message indicates a period of silence, and contains the new timestamp from which all future audio packets in the stream are based. Use of this message type is optional, and determined by the UA. If a UA knows that it is using a CODEC that does not efficiently encode silence, and the UA has its own silence-detection logic, it MAY use this message to avoid transmitting unnecessary audio packets.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |0 1 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                           timestamp                           |
        |1 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|
        +---------------------------------------------------------------+
0x03: End-of-stream
The message indicates the end of the audio stream. Any future audio messages with the same request-id are invalid.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+

There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. Implementations SHOULD seek to minimize interference with the flow of other messages, by sending messages that encode between 20 and 80 milliseconds of media. Since a WebSockets frame header is only a few bytes, overhead is minimal, and implementations should err on the side of sending smaller packets more frequently.

NOTE: The design does not permit the transmission of media as text messages. Because base-64 encoding of audio data in text incurs a 33% transmission overhead, and WebSockets provides native support for binary messages, there is no need for text-encoding of audio.

A sequence of audio messages represents an in-order contiguous stream of audio data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the decoder. Timing is calculated by decoded offset from the beginning of the stream.

A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.

A client MAY send audio to a recognition service slower than real time. While this generally causes undesirable user interface latency, it may be necessary due to practical limitations of the network. A recognition service MUST be prepared to receive slower-than-real-time audio.

Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.

Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, living-room devices with microphone arrays may send separate streams in order to capture the speech of multiple individuals within the room. Or, for example, some devices may send parallel streams with alternative encodings that may not be human-consumable (like standard codecs) but contain information that is of particular value to a recognition service. The protocol accommodates this scenario by labeling each binary message with the request-id of the request that initiated its stream, so that separate media streams can be multiplexed onto the same session. Since each stream is started by its own request (e.g. START-MEDIA-STREAM), its identifier is unique within the session.

4. General Capabilities

4.1 Getting and Setting Parameters

ISSUE: do we really need GET/SET-PARAMS? The precise set of headers that are sticky parameters that can be set with SET-PARAMS is ambiguous, and there's no harm setting them in each message that needs them.

The GET-PARAMS and SET-PARAMS requests are the same as their [MRCPv2] counterparts. They are used to discover and set the configuration parameters of a resource (recognizer or synthesizer). Like all messages, they must always include the Resource-ID header.


general-method = "SET-PARAMS"
               | "GET-PARAMS"

capability-query-headers =
                 "Supported-Media:" mime-media-type *("," mime-media-type) ; See [RFC3555]
               | "Supported-Languages:" lang-tag *("," lang-tag) ; See [RFC5646]

Two additional headers, "Supported-Media" and "Supported-Languages", are introduced in html-speech/1.0 to provide a way for the application/UA to determine whether a resource supports the basic language and codec capabilities it needs. In most cases applications will know a service's resource capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine resource capabilities, the UA sends a GET-PARAMS request to the resource, containing a set of capabilities, to which the resource responds with the specific subset it actually supports.

For example:


C->S: html-speech/1.0 GET-PARAMS 34132
      resource-id: recognizer
      supported-media: audio/basic, audio/amr-wb,
                       audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                       audio/dsr-es202212; rate:8000; maxptime:40
      supported-languages: en-AU, en-GB, en-US, en (The application has UK, US and Australian English speakers, and possibly speakers of other English dialects)

S->C: html-speech/1.0 34132 200 COMPLETE
      resource-id: recognizer
      supported-media: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      supported-languages: en-GB, en (The recognizer supports UK English, but will work with any English)

C->S: html-speech/1.0 GET-PARAMS 48223
      resource-id: synthesizer
      supported-media: audio/ogg, audio/flac, audio/basic
      supported-languages: en-AU, en-GB

S->C: html-speech/1.0 48223 200 COMPLETE
      resource-id: synthesizer
      supported-media: audio/flac, audio/basic
      supported-languages: en-GB

Supported-Languages
This read-only property is used by the client to discover whether a resource supports a particular set of languages. Unlike most headers, when a blank value is used in GET-PARAMS, the resource will respond with a blank header rather than the full set of languages it supports. This avoids the resource having to respond with a potentially cumbersome and possibly ambiguous list of languages and dialects. Instead, the client must include the set of languages in which it is interested as the value of the Supported-Languages header in the GET-PARAMS request. The service will respond with the subset of these languages that it actually supports.
Supported-Media
This read-only property is used to discover whether a resource supports particular media encoding formats. Given the broad variety of codecs, and the large set of parameter permutations for each codec, it is impractical for a resource to advertise all media encodings it could possibly support. Hence, when a blank value is used in GET-PARAMS, the resource will respond with a blank value. Instead, the client must supply the set of media encodings it is interested in. The resource responds with the subset it actually supports.

4.2 Interim Events

Speech services may send optional vendor-specific interim events during the processing of a request. For example: some recognizers are capable of providing additional information as they process input audio; and some synthesizers are capable of firing progress events on word, phoneme, and viseme boundaries. These are exposed through the HTML Speech API as events that the webapp can listen for if it knows to do so. A service vendor MAY require a vendor-specific value to be set with SET-PARAMS before it starts to fire certain events.


interim-event =   version SP "INTERIM-EVENT" SP request-id SP request-state CRLF
                *(header CRLF)
                  CRLF
                 [body]

The Request-ID and Content-Type headers are required, and any data conveyed by the event must be contained in the body.
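
As an illustrative sketch, a synthesizer that exposes viseme timing might fire something like the following (the event body and its content type are hypothetical, vendor-defined values):


S->C: html-speech/1.0 INTERIM-EVENT 3257 IN-PROGRESS
      Resource-ID: synthesizer
      Content-Type: application/x-example-visemes+json
      Content-Length: ...

      {"viseme": "AH", "offset": 1234}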

4.3 Requestless Notifications

Generally, a service may only send a message to the client in response to a client-originating request. However, in some cases, a service-originating notification mechanism is useful.


server-notification =   version SP NOTIFY CRLF 
                      *(header CRLF)
                        CRLF
                       [body]

The NOTIFY message is sent from the service to the client. It has no Request-ID, and there is no response from the client to the service. It MAY include a Resource-ID if relevant. It MUST include a Content-Type header if it contains a body. It MAY include Vendor-Specific-Parameters. Other headers are not supported, since it is unclear how the client would make sense of them. If webapps require support for request-ids, parameters, etc., they would probably be best encoded within the message body.

If the [body] is detected as being XML or JSON, it would be nice if the client browser could automatically reflect the data as a DOM or ECMAScript object.
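
As an illustrative sketch (the content type and body below are hypothetical, vendor-defined values), a service might use NOTIFY to warn the client of a pending shutdown:


S->C: html-speech/1.0 NOTIFY
      Content-Type: application/x-example-service-status+json
      Content-Length: ...

      {"status": "draining", "retry-after": 300}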

ISSUE: Is there a clear scenario for requestless notifications? If so, we need to be more specific about what the scenario is, and how the NOTIFY message is used. If not, we should delete this message from the draft.

5. Recognition

A recognizer resource is either in the "listening" state, or the "idle" state. Because continuous recognition scenarios often don't have dialog turns or other down-time, most functions can be performed in either state: grammars can be loaded, rules activated and deactivated, media streams started, and text-input interpreted, regardless of whether the recognizer is listening or idle. For example: text dictation applications commonly have a variety of command grammars that are activated and deactivated to enable editing and correction modes; in open-microphone multimodal applications, the application will listen continuously, but change the set of active grammars based on the user's other non-speech interactions with the app. The key distinction between the idle and listening states is the obvious one: when listening, the recognizer processes incoming media and produces results; whereas when idle, the recognizer SHOULD buffer audio but will not process it.

Recognition is accomplished with a set of messages and events, some of which are borrowed from [MRCPv2] and have the same or similar semantics, while others are unique to html-speech/1.0 and are needed to support HTML Speech scenarios.

ISSUE: If we're going so far as to replace RECOGNIZE with LISTEN + listen-mode, then why not replace the "recognizer" resource with a "listener", and call the whole topic "listening" rather than "recognition"?


Idle State                 Listening State
    |                            |
    |--\                         |--\
    |  START-MEDIA-STREAM        |  START-MEDIA-STREAM
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  DEFINE-GRAMMAR            |  DEFINE-GRAMMAR
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  SET-GRAMMAR               |  SET-GRAMMAR
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  INTERPRET                 |  INTERPRET
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  INTERPRETATION-COMPLETE   |  INTERPRETATION-COMPLETE
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  STOP                      |  LISTEN
    |<-/                         |<-/
    |                            |--\
    |                            |  INTERIM-EVENT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-OF-INPUT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-INPUT-TIMERS
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  END-OF-INPUT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  INTERMEDIATE-RESULT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  RECOGNITION-COMPLETE
    |                            | (when mode = recognize-continuous)
    |                            |<-/
    |                            |
    |<---RECOGNITION-COMPLETE----|
    |(when mode = recognize-once)|
    |                            |
    |<----------STOP-------------|
    |                            |
    |<---some 4xx/5xx errors-----|

5.1 Recognition Requests


reco-method  = "CLEAR-GRAMMARS"     ; Unloads all grammars, whether active or inactive
             | "DEFINE-GRAMMAR"     ; Pre-loads & compiles a grammar, assigns a temporary URI for reference in other methods
             | "INTERPRET"          ; Interprets input text as though it was spoken
             | "LISTEN"             ; Transitions Idle -> Listening
             | "SET-GRAMMAR"        ; Activates and deactivates grammars and rules
             | "START-INPUT-TIMERS" ; Starts the timer for the various input timeout conditions
             | "START-MEDIA-STREAM" ; Starts an input media stream for the recognizer to listen to
             | "STOP"               ; Transitions Listening -> Idle

CLEAR-GRAMMARS

In continuous recognition, a variety of grammars may be loaded over time, potentially resulting in unused grammars consuming memory resources in the recognizer. The CLEAR-GRAMMARS method unloads all grammars, whether active or inactive.

DEFINE-GRAMMAR

The DEFINE-GRAMMAR method is similar to its namesake in [MRCPv2]. DEFINE-GRAMMAR does not activate a grammar, it simply causes the recognizer to pre-load and compile it, and associates it with a temporary URI that can then be used to activate or deactivate the grammar or one of its rules. DEFINE-GRAMMAR is not required in order to use a grammar, since the recognizer can load grammars as needed. However, it is useful when an application wants to ensure a large grammar is pre-loaded and ready for use prior to the recognizer entering the listening state. DEFINE-GRAMMAR can be used when the recognizer is in either the listening or idle state.
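
A sketch of how this might look, assuming (as in [MRCPv2]) that the temporary URI is derived from a client-supplied Content-ID header; the grammar content, timestamps and request-id are illustrative:


C->S: html-speech/1.0 DEFINE-GRAMMAR 5001
      Resource-ID: recognizer
      Source-Time: 12753248500
      Content-ID: <pizza-order@example.org>
      Content-Type: application/srgs+xml
      Content-Length: ...

      <?xml version="1.0"?>
      <grammar xmlns="http://www.w3.org/2001/06/grammar" root="order">
      ...

S->C: html-speech/1.0 5001 200 COMPLETE
      Resource-ID: recognizer
      Completion-Cause: 000 success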

INTERPRET

The INTERPRET method is similar to its namesake in [MRCPv2], and processes the input text according to the set of grammar rules that are active at the time it is received by the recognizer. It MUST include the Interpret-Text header. The use of INTERPRET is orthogonal to any audio processing the recognizer may be doing, and will not affect any audio processing. The recognizer can be in either the listening or idle state.
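
An illustrative sketch, assuming the text to interpret is carried directly in the Interpret-Text header (the draft does not yet pin down whether the text is carried inline or referenced as a body part, as in [MRCPv2]); all values are hypothetical:


C->S: html-speech/1.0 INTERPRET 5002
      Resource-ID: recognizer
      Source-Time: 12753249000
      Interpret-Text: one large pepperoni pizza

S->C: html-speech/1.0 INTERPRETATION-COMPLETE 5002 COMPLETE
      Resource-ID: recognizer
      Completion-Cause: 000 success
      Content-Type: application/emma+xml
      Content-Length: ...

      <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      ...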

LISTEN

The LISTEN method transitions the recognizer from the idle state to the listening state. The recognizer then processes the media input streams against the set of active grammars. The request MUST include the Source-Time header, which is used by the Recognizer to determine the point in the input stream(s) that the recognizer should start processing from. The request MUST also include the Listen-Mode header to indicate whether the recognizer should perform continuous recognition, a single recognition, or vendor-specific processing.

A LISTEN request MAY also activate or deactivate grammars and rules using the Grammar-Activate and Grammar-Deactivate headers. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.

NOTE: LISTEN does not use the same grammar specification technique as the MRCP RECOGNIZE method. In html-speech/1.0 this would add unnecessary and redundant complexity, since all the necessary functionality is already present in other html-speech/1.0 methods.

If or when there are no input media streams, the recognizer automatically transitions to the idle state, and issues a RECOGNITION-COMPLETE event, with Completion-Cause set to ???TBD.

TODO: Specify Completion-Cause value for no input stream.

If the recognizer is already in the listening state when it receives a LISTEN request, it remains in the listening state, changes the Listen-Mode to match that of the new request, and activates/deactivates any grammars/rules specified in the new request. Any grammars/rules active prior to the request, but not explicitly deactivated by the request, remain active.

SET-GRAMMAR

The SET-GRAMMAR method is used to activate and deactivate grammars and rules, using the Grammar-Activate and Grammar-Deactivate headers. The Source-Time header MUST be used, and activations/deactivations are considered to take place at precisely that time in the input stream(s).
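
For example (the grammar URIs are illustrative; the "session:" URI assumes a grammar previously defined with DEFINE-GRAMMAR):


C->S: html-speech/1.0 SET-GRAMMAR 5003
      Resource-ID: recognizer
      Source-Time: 12753250120
      Grammar-Activate: <session:pizza-order@example.org#size 0.8>
      Grammar-Deactivate: <builtin:dictation?context=sms-message>

S->C: html-speech/1.0 5003 200 COMPLETE
      Resource-ID: recognizer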

ISSUE: Do we need an explicit method for this, or is SET-PARAMS enough?

START-INPUT-TIMERS

This is identical to the [MRCPv2] method with the same name. It is useful, for example, when the application wants to enable voice barge-in during a prompt, but doesn't want to start the time-out clock until after the prompt has completed.

START-MEDIA-STREAM

The START-MEDIA-STREAM request defines and initiates an audio input stream. It MUST include the Source-Time header, which the recognizer will use as the start-time for the input stream, so that other requests and events can refer to precise times in the input stream. It MUST also include the Audio-Codec header, which specifies the codec and parameters used to encode the media stream. The Request-ID for a START-MEDIA-STREAM request is used in the associated audio messages. The recognizer should respond to the START-MEDIA-STREAM request with a 200 IN-PROGRESS status message if it is able to accept the stream. When the stream ends, the recognizer should send a 200 COMPLETE status message.

STOP

The STOP method transitions the recognizer from the listening state to the idle state. No RECOGNITION-COMPLETE event is generated by the STOP itself. The Source-Time header MUST be used, since the recognizer may still fire a RECOGNITION-COMPLETE event for any completion it encounters prior to that point in the input stream.

5.2 Recognition Events

Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.


reco-event   = "END-OF-INPUT"         ; End of speech has been detected
             | "INTERIM-EVENT"        ; See Interim Events above
             | "INTERMEDIATE-RESULT"  ; A partial hypothesis
             | "INTERPRETATION-COMPLETE"
             | "RECOGNITION-COMPLETE" ; Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
             | "START-OF-INPUT"       ; Start of speech has been detected

ISSUE: "RECOGNITION-COMPLETE" seems like an inappropriate name, since in continuous recognition, it isn't complete. "RECOGNITION-RESULT" would be a better name.

END-OF-INPUT

END-OF-INPUT is the logical counterpart to START-OF-INPUT, and indicates that speech has ended. The event MUST include the Source-Time header, which corresponds to the point in the input stream where the recognizer estimates speech to have ended, NOT when the endpointer finally decided that speech ended, which is a number of milliseconds later.

INTERIM-EVENT

See Interim Events above.

INTERMEDIATE-RESULT

Continuous speech (aka dictation) often requires feedback about what has been recognized thus far. Waiting for a RECOGNITION-COMPLETE event prevents this sort of user interface. INTERMEDIATE-RESULT provides this intermediate feedback. As with RECOGNITION-COMPLETE, contents are assumed to be EMMA unless an alternate Content-Type is provided.

INTERPRETATION-COMPLETE

This event is identical to the [MRCPv2] event with the same name.

RECOGNITION-COMPLETE

This event is similar to the [MRCPv2] event with the same name, except that application/emma+xml (EMMA) is the default Content-Type. The Source-Time header must be included, to indicate the point in the input stream when the event occurred. When the Listen-Mode is reco-once, the recognizer transitions from the listening state to the idle state when this event is fired, and the Recognizer-State header in the event is set to "idle".

TODO: Describe how final results can be replaced in continuous recognition.

START-OF-INPUT

Indicates that start of speech has been detected. The Source-Time header MUST correspond to the point in the input stream(s) where speech was estimated to begin, NOT when the endpointer finally decided that speech began (a number of milliseconds later).

5.3 Recognition Headers

The list of valid headers for the recognizer resource includes a subset of the [MRCPv2] Recognizer Header Fields, where they make sense for HTML Speech requirements, as well as a handful of headers that are required for HTML Speech.


reco-header =  Confidence-Threshold
             | Sensitivity-Level
             | Speed-Vs-Accuracy
             | N-Best-List-Length
             | No-Input-Timeout
             | Recognition-Timeout
             | Waveform-URI
             | Media-Type
             | Input-Waveform-URI
             | Completion-Cause
             | Completion-Reason
             | Recognizer-Context-Block
             | Start-Input-Timers
             | Speech-Complete-Timeout
             | Speech-Incomplete-Timeout
             | Failed-URI
             | Failed-URI-Cause
             | Save-Waveform
             | Speech-Language
             | Hotword-Min-Duration
             | Hotword-Max-Duration
             | Interpret-Text
             | audio-codec           ; The audio codec used in an input media stream
             | grammar-activate      ; Specifies a grammar or specific rule to activate.
             | grammar-deactivate    ; Specifies a grammar or specific rule to deactivate.
             | hotword               ; Whether to listen in "hotword" mode (i.e. ignore out-of-grammar speech)
             | listen-mode           ; Whether to do continuous or one-shot recognition
             | partial               ; Whether to send partial results
             | partial-interval      ; Suggested interval between partial results, in milliseconds.
             | recognizer-state      ; Indicates whether the recognizer is listening or idle
             | source-time           ; The UA's local time at which the request was initiated
             | user-id               ; Unique identifier for the user, so that adaptation can be used to improve accuracy.

hotword            = "Hotword:" BOOLEAN
listen-mode        = "Listen-Mode:" ("reco-once" | "reco-continuous" | vendor-listen-mode)
vendor-listen-mode = "x-" 1*UTFCHAR
recognizer-state   = "Recognizer-State:" ("listening" | "idle")
source-time        = "Source-Time:" 1*19DIGIT
audio-codec        = "Audio-Codec:" mime-media-type ; see [RFC3555]
partial            = "Partial:" BOOLEAN
partial-interval   = "Partial-Interval:" 1*5DIGIT
grammar-activate   = "Grammar-Activate:" "<" URI ["#" rule-name] [SP weight] ">" *("," "<" URI ["#" rule-name] [SP weight] ">")
rule-name          = 1*UTFCHAR
weight             = "0." 1*3DIGIT
grammar-deactivate = "Grammar-Deactivate:" "<" URI ["#" rule-name] ">" *("," "<" URI ["#" rule-name] ">")
user-id            = "User-ID:" 1*UTFCHAR

ISSUE: Does this imply further API requirements? For example, START-INPUT-TIMERS is there for a good reason, but AFAIK we haven't spoken about it. Similarly, Early-No-Match seems useful to developers. Is it?

Headers with the same names as their [MRCPv2] counterparts are considered to have the same specification. Other headers are described as follows:

Audio-Codec

The Audio-Codec header is used in the START-MEDIA-STREAM request, to specify the codec and parameters used to encode the input stream, using the MIME media type encoding scheme specified in [RFC3555].

Grammar-Activate

The Grammar-Activate header specifies a list of grammars, and optionally specific rules within those grammars, to be activated. If no rule is specified for a grammar, the root rule is activated. It may also specify the weight of the rule. The Grammar-Activate header MAY be used in both the SET-GRAMMAR and LISTEN methods.
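
For example, the following illustrative header value activates the root rule of one grammar with a weight of 0.6, and a specific rule of another:


Grammar-Activate: <http://example.org/pizza.grxml 0.6>, <http://example.org/commands.grxml#help>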

ISSUE: Including a header multiple times, rather than trying to encode a list with delimiters, works great for SIP's Record-Route header. I don't know if there are any gotchas with this approach.

ISSUE: Grammar-Activate/Deactivate probably don't make sense in GET/SET-PARAMS. Is this an issue? Perhaps this would be better achieved in the message body? The same format could be used.

Grammar-Deactivate

The Grammar-Deactivate header specifies a list of grammars, and optionally specific rules within those grammars, to be deactivated. If no rule is specified, all rules in the grammar are deactivated, including the root rule. The Grammar-Deactivate header MAY be used in both the SET-GRAMMAR and LISTEN methods.

Hotword

The Hotword header is analogous to the [MRCPv2] Recognition-Mode header; however, it has a different name and a boolean type in html-speech/1.0 in order to avoid confusion with the Listen-Mode header. When true, the recognizer functions in "hotword" mode, which essentially means that out-of-grammar speech is ignored.

Listen-Mode

Listen-Mode is used in the LISTEN request to specify whether the recognizer should listen continuously, or return to the idle state after the first RECOGNITION-COMPLETE event. It MUST NOT be used in any request other than LISTEN. When the recognizer is in the listening state, it should include Listen-Mode in all event and status messages it sends.

Partial

This header is required to support the continuous speech scenario on the recognizer resource. When sent by the client in a LISTEN or SET-PARAMS request, this header controls whether or not the client is interested in partial results from the service. In this context, the term 'partial' describes mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes such as an NBest list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on both regular command-and-control recognition requests and dictation sessions, because at the API level there is no syntactic difference between the two: both are simply recognition requests over an SRGS grammar or set of URL(s). Partial results can also be useful in command-and-control scenarios, for example: open-microphone applications, dictation enrollment applications, and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, and in response to a GET-PARAMS request.

Partial-Interval

A suggestion from the client to the service on the frequency at which partial results should be sent. It is an integer value representing the desired interval in milliseconds. The recognizer does not need to precisely honor the requested interval, but SHOULD provide something close, if it is within the operating parameters of the implementation.
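
A sketch of how Partial and Partial-Interval might be used together (all values are illustrative):


C->S: html-speech/1.0 LISTEN 5004
      Resource-ID: recognizer
      Source-Time: 12753260000
      Listen-Mode: reco-continuous
      Grammar-Activate: <builtin:dictation?context=message>
      Partial: true
      Partial-Interval: 300

S->C: html-speech/1.0 INTERMEDIATE-RESULT 5004 IN-PROGRESS
      Resource-ID: recognizer
      Recognizer-State: listening
      Source-Time: 12753261200
      Partial: true
      Content-Type: application/emma+xml
      Content-Length: ...

      <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      ... (best guess at the speech so far, e.g. "dear father")
      </emma:emma>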

Recognizer-State

Indicates whether the recognizer is listening or idle. This MUST NOT be included by the client in any requests, and MUST be included by the recognizer in all status and event messages it sends.

Source-Time

Indicates the timestamp of a message using the client's local time. All requests sent from the client to the recognizer MUST include the Source-Time header, which must faithfully specify the client's local system time at the moment it sends the request. This enables the recognizer to correctly synchronize requests with the precise point in the input stream at which they were actually sent by the client. All event messages sent by the recognizer MUST include the Source-Time, calculated by the recognizer service based on the point in the input stream at which the event occurred, and expressed in the client's local clock time (since the recognizer knows what this was at the start of the input stream). By expressing all times in client-time, the user agent or application is able to correctly sequence events, and implement timing-sensitive scenarios, that involve other objects outside the knowledge of the recognizer service (for example, media playback objects or videogame states).
User-ID

Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, since vocabularies are so large. The User-ID header allows the recognizer to establish the user's identity, if the webapp decides to supply this information.

ISSUE: There were some additional headers proposed a few weeks ago: Return-Punctuation; Gender-Number-Pronoun; Return-Formatting; Filter-Profanity. There was pushback that these shouldn't be in a standard because they're very vendor-specific. Thus I haven't included them in this draft. Do we agree these are appropriate to omit?

5.4 Predefined Grammars

Speech services MAY support pre-defined grammars that can be referenced through a 'builtin:' URI. For example <builtin:dictation?context=email&lang=en_US>, <builtin:date>, or <builtin:search?context=web>. These can be used as top-level grammars in the Grammar-Activate/Deactivate headers, or in rule references within other grammars. If a speech service does not support the referenced builtin, or does not support it in combination with the other active grammars, it should return a grammar compilation error.

TODO: Note somewhere that vendors are free to support other language model file formats beyond SRGS.

5.5 Recognition Examples

TODO: Write some examples of one-shot and continuous recognition, EMMA documents, partial results, vendor extensions, grammar/rule activation/deactivation, etc.


C->S: html-speech/1.0 START-MEDIA-STREAM 41201
      Resource-ID: recognizer
      Audio-codec: audio/dsr-es202212; rate:8000; maxptime:40
      Source-Time: 12753248231 (source's local time at the start of the first packet)

C->S: binary audio packet #1 (request-id = 41201 = 1010000011110001)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

S->C: html-speech/1.0 41201 200 IN-PROGRESS (i.e. the service is accepting the audio)

C->S: binary audio packets...

C->S: html-speech/1.0 LISTEN 8322
      Resource-ID: recognizer
      Confidence-Threshold:0.9
      Grammar-Activate: <builtin:dictation?context=sms-message>
      Listen-Mode: reco-once
      Source-time: 12753432234 (where in the input stream recognition should start)

S->C: html-speech/1.0 START-OF-INPUT 8322 IN-PROGRESS (i.e. the recognizer has detected that the user started talking)

C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...

S->C: html-speech/1.0 END-OF-INPUT 8322 IN-PROGRESS (i.e. the recognizer has detected that the user stopped talking)

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE (i.e. since Listen-Mode is reco-once, the recognizer returns a result and transitions to the idle state)

C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled the end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+

S->C: html-speech/1.0 41201 200 COMPLETE (i.e. the service has received the end of stream)
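
A second sketch, illustrating continuous recognition (most headers are omitted for brevity, the grammar URI and timestamps are illustrative, and the request-states shown on continuous-mode events are one plausible interpretation):


C->S: html-speech/1.0 LISTEN 8323
      Resource-ID: recognizer
      Source-Time: 12753438000
      Listen-Mode: reco-continuous
      Grammar-Activate: <http://example.org/media-commands.grxml>

S->C: html-speech/1.0 START-OF-INPUT 8323 IN-PROGRESS
S->C: html-speech/1.0 END-OF-INPUT 8323 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8323 IN-PROGRESS (e.g. a match on "pause"; the recognizer keeps listening)

S->C: html-speech/1.0 START-OF-INPUT 8323 IN-PROGRESS
S->C: html-speech/1.0 END-OF-INPUT 8323 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8323 IN-PROGRESS (e.g. a match on "play")

C->S: html-speech/1.0 STOP 8324
      Resource-ID: recognizer
      Source-Time: 12753501000

S->C: html-speech/1.0 8323 200 COMPLETE (i.e. the LISTEN request ends; the recognizer is now idle)
S->C: html-speech/1.0 8324 200 COMPLETE (i.e. the STOP request has been processed)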

6. Synthesis

In HTML speech applications, the synthesizer does not participate in the user interface; it merely provides rendered audio upon request, similar to any media server, along with interim events such as marks. The UA buffers the rendered audio, and the application may choose to play it to the user at some point completely unrelated to the synthesizer. It is the synthesizer's role to render the audio stream in a timely manner, at least rapidly enough to support real-time playback. The synthesizer MAY also render and transmit the stream faster than required for real-time playback, or render multiple streams in parallel, in order to reduce latency in the application. This is in stark contrast to IVR, where the synthesizer essentially renders directly to the user's telephone, and is an active part of the user interface.

6.1 Synthesis Requests


synth-method = "SPEAK"
             | "STOP"
             | "DEFINE-LEXICON"

The set of synthesizer request methods is a subset of those defined in [MRCPv2].

SPEAK

The SPEAK method operates similarly to its [MRCPv2] namesake. The primary difference is that SPEAK results in a new audio stream being sent from the server to the client, using the same Request-ID. A SPEAK request MUST include the Audio-Codec header. When the rendering has completed, and the end-of-stream message has been sent, the synthesizer sends a SPEAK-COMPLETE event.

STOP

When the synthesizer receives a STOP request, it ceases rendering the requests specified in the Active-Request-Id-List header. If the Active-Request-Id-List header is missing, it ceases rendering all active SPEAK requests. For any SPEAK request that is ceased, the synthesizer sends an end-of-stream message, and a SPEAK-COMPLETE event.
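
An illustrative sketch, assuming a SPEAK request with request-id 3265 is currently being rendered (the Completion-Cause value follows [MRCPv2]'s synthesizer codes and is illustrative):


C->S: html-speech/1.0 STOP 3270
      Resource-ID: synthesizer
      Active-Request-Id-List: 3265

S->C: binary audio packet: end of stream for request 3265 ( message type = 0x03 )

S->C: html-speech/1.0 SPEAK-COMPLETE 3265 COMPLETE
      Resource-ID: synthesizer
      Completion-Cause: 001 barge-in

S->C: html-speech/1.0 3270 200 COMPLETE
      Resource-ID: synthesizer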

DEFINE-LEXICON

This is identical to its namesake in [MRCPv2].

6.2 Synthesis Events

Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the synthesizer resource.


synth-event  = "INTERIM-EVENT"  ; See Interim Events above
             | "SPEECH-MARKER"  ; An SSML mark has been rendered
             | "SPEAK-COMPLETE"

INTERIM-EVENT

See Interim Events above.

SPEECH-MARKER

Similar to its namesake in [MRCPv2], except that the Speech-Marker header contains a relative timestamp indicating the elapsed time from the start of the stream.

SPEAK-COMPLETE

The same as its [MRCPv2] namesake.

6.3 Synthesis Headers

The synthesis headers used in html-speech/1.0 are mostly a subset of those in [MRCPv2], with some minor modification and additions.


synth-header = active-request-id-list
             | Completion-Cause
             | Completion-Reason
             | Voice-Gender
             | Voice-Age
             | Voice-Variant
             | Voice-Name
             | Prosody-parameter ; Actually a collection of prosody headers
             | Speech-Marker
             | Speech-Language
             | Failed-URI
             | Failed-URI-Cause
             | Load-Lexicon
             | Lexicon-Search-Order
             | Audio-Codec

Audio-Codec  = "Audio-Codec:" mime-media-type ; See [RFC3555]

Audio-Codec

Because an audio stream is created in response to a SPEAK request, the audio codec and parameters must be specified in the SPEAK request, or in SET-PARAMS, using the Audio-Codec header. If the synthesizer is unable to encode with this codec, it terminates the request with a 4xx COMPLETE status message.
Speech-Marker

Similar to its namesake in [MRCPv2], except that the clock is defined as starting at zero at the beginning of the output stream. By using a relative time, the UA can calculate when to raise events based on where it is in the playback of the rendered stream.

6.4 Synthesis Examples

TODO: insert more synthesis examples


C->S: html-speech/1.0 SPEAK 3257
        Resource-ID: synthesizer
        Voice-gender:neutral
        Voice-Age:25
        Audio-codec:audio/flac
        Prosody-volume:medium
        Content-Type:application/ssml+xml
        Content-Length:...

        <?xml version="1.0"?>
        <speak version="1.0">
        ...

S->C: html-speech/1.0 3257 200 IN-PROGRESS
        Resource-ID: synthesizer
        Speech-Marker:timestamp=0

S->C: binary audio packet #1 (request-id = 3257 = 110010111001)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

S->C: binary audio packets...

S->C: html-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
        Resource-ID: synthesizer
        Speech-Marker:timestamp=2059000;marker-1

S->C: binary audio packets...

S->C: binary audio packet: end of stream ( message type = 0x03 )
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+

S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
        Resource-ID: synthesizer
        Completion-Cause:000 normal
        Speech-Marker:timestamp=5011000
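
A further sketch, showing two SPEAK requests being rendered in parallel (SSML bodies and most headers are omitted for brevity; all values are illustrative):


C->S: html-speech/1.0 SPEAK 3258
        Resource-ID: synthesizer
        Audio-codec:audio/flac
        Content-Type:application/ssml+xml
        Content-Length:...

        <?xml version="1.0"?>
        <speak version="1.0">
        ...

C->S: html-speech/1.0 SPEAK 3259
        Resource-ID: synthesizer
        Audio-codec:audio/flac
        Content-Type:application/ssml+xml
        Content-Length:...

        <?xml version="1.0"?>
        <speak version="1.0">
        ...

S->C: html-speech/1.0 3258 200 IN-PROGRESS
S->C: html-speech/1.0 3259 200 IN-PROGRESS

S->C: binary audio packets for requests 3258 and 3259, interleaved...

S->C: binary audio packet: end of stream for request 3258 ( message type = 0x03 )
S->C: html-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE

S->C: binary audio packets for request 3259...

S->C: binary audio packet: end of stream for request 3259 ( message type = 0x03 )
S->C: html-speech/1.0 SPEAK-COMPLETE 3259 COMPLETE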

7. References

[EMMA]
EMMA: Extensible MultiModal Annotation markup language http://www.w3.org/TR/emma/
[MRCPv2]
MRCP version 2 http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24
[REQUIREMENTS]
Protocol Requirements http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0030/protocol-reqs-commented.html
[HTTP1.1]
Hypertext Transfer Protocol -- HTTP/1.1 http://www.w3.org/Protocols/rfc2616/rfc2616.html
[RFC3555]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[RFC5646]
Tags for Identifying Languages http://tools.ietf.org/html/rfc5646
[SRGS]
Speech Recognition Grammar Specification Version 1.0 http://www.w3.org/TR/speech-grammar/
[SSML]
Speech Synthesis Markup Language (SSML) http://www.w3.org/TR/speech-synthesis/
[WS-API]
Web Sockets API, http://www.w3.org/TR/websockets/
[WS-PROTOCOL]
Web Sockets Protocol http://tools.ietf.org/pdf/draft-ietf-hybi-thewebsocketprotocol-09.pdf