HTML Speech XG
Proposed Protocol Approach

Draft Version 4, August 2nd, 2011

This version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Latest version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Previous version:
Draft Version 3, revision 3: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jul/att-0025/speech-protocol-draft-03-r3.html
Draft Version 2: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0065/speech-protocol-basic-approach-02.html
Editor:
Robert Brown, Microsoft
Contributors:
Dan Burnett, Voxeo
Marc Schroeder, DFKI
Milan Young, Nuance
Michael Johnston, AT&T
Patrick Ehlen, AT&T
And other contributions from HTML Speech XG participants, http://www.w3.org/2005/Incubator/htmlspeech/

Changes Since Draft 3

In addition to minor overall editing to aid readability, the following changes were incorporated in response to feedback on the third draft:

2. Definitions
Clarified synthesizer description.
3.1 Session Establishment
Clarified that service parameters may be specified in the query string, but may be overridden using messages in the html-speech/1.0 websockets protocol once the websockets session has been established.
Clarified that advanced scenarios involving multiple engines of the same resource type, or using the same input audio stream for consumption by different types of vendor-specific resources, are out of scope.
3.2 Signaling
Changed the request-ID definition to match SRGS: 1-10 decimal digits.
3.3 Media Transmission
Removed "skip" message.
Added "start of stream" message, which removes the purpose of the START-MEDIA-STREAM request on the Recognizer (thus removing an area of confusion from section 5).
Removed Request-ID from the header, replacing it with Stream-ID, also to remove some of the confusion in section 5.
Clarified multiplexing.
Generalized from "audio" to "media" and added some text about supported media formats.
Simplified the header to just be an 8-bit message type and 24-bit stream-ID.
4.1 Getting and Setting Parameters
Rewrote the capability query headers to make them more flexible (and in theory less unwieldy if more capabilities are added in the future).
Added a header for subscribing to interim events.
4.3 Requestless Notifications
Deleted this section.
4.3 Resource Selection
Added this section to explain how resources are selected based on language and other characteristics.
5. Recognition
Clarified that grammar/rule state can only change when the recognizer is idle.
Corrected a number of errors in the state diagram.
5.1 Recognition Requests
Removed START-MEDIA-STREAM.
Added GET-GRAMMARS (and changed SET-GRAMMAR to SET-GRAMMARS).
Added METADATA.
5.2 Recognition Events
Changed START/END-OF-INPUT to START/END-OF-SPEECH.
5.3 Recognition Headers
Changed grammar-activate/grammar-deactivate to active-grammars/inactive-grammars.
5.4 Recording and Re-Recognizing
Added this section, which also includes re-recognition.
5.5 Predefined Grammars
Was previously numbered 5.4.
Clarified that the specific set of grammars is TBD later, and is optional.
5.6 Recognition Examples
Was previously numbered 5.5.
Corrected the existing one-shot example to match the changes.
Added a continuous reco example.
6. Synthesis
Clarified that SSML and plain text MUST be supported, and other input formats are permitted.
6.3 Synthesis Headers
Tried to be more specific about how the clock works.
Added a Stream-ID header to associate a SPEAK request with an output stream.
6.4 Synthesis Examples
Cleaned up the examples.

Abstract

The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2]. However, since the use cases for HTML Speech applications are in some places considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not merely a transliteration of MRCPv2: it shares some design concepts, while simplifying some details and adding others. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.

TODO: Add a sentence or two about the higher level motivation.

Status of this Document [Michael Johnston]

This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.

Contents

  1. Architecture
  2. Definitions
  3. Protocol Basics
    1. Session Establishment
    2. Signaling
      1. Generic Headers
      2. Request Messages
      3. Status Messages
      4. Event Messages
    3. Media Transmission
  4. General Capabilities
    1. Getting and Setting Parameters
    2. Interim Events
    3. Resource Selection
  5. Recognition
    1. Recognition Requests
    2. Recognition Events
    3. Recognition Headers
    4. Recording and Re-Recognizing
    5. Predefined Grammars
    6. Recognition Examples
  6. Synthesis
    1. Synthesis Requests
    2. Synthesis Events
    3. Synthesis Headers
    4. Synthesis Examples
  7. References

1. Architecture


             Client
|-----------------------------|
|       HTML Application      |                                            Server
|-----------------------------|                                 |--------------------------|
|       HTML Speech API       |                                 | Synthesizer | Recognizer |
|-----------------------------|                                 |--------------------------|
| HTML-Speech Protocol Client |---html-speech/1.0 subprotocol---|     HTML-Speech Server   |
|-----------------------------|                                 |--------------------------|
|      WebSockets Client      |-------WebSockets protocol-------|     WebSockets Server    |
|-----------------------------|                                 |--------------------------|

2. Definitions

Recognizer

A Recognizer performs speech recognition, with the following characteristics:

  1. Support for one or more spoken languages and acoustic scenarios.
  2. Processing of one or more input streams. The typical scenario consists of a single stream of encoded audio. But some scenarios will involve multiple audio streams, such as multiple beams from an array microphone picking up different speakers in a room; or streams of multimodal input such as gesture or motion, in addition to speech.
  3. Support for multiple simultaneous grammars/language models, including but not limited to application/srgs+xml [SRGS]. Implementations MAY support additional formats, such as ABNF SRGS or an SLM format.
  4. Support for continous recognition, generating events as appropriate such as match/no-match, detection of the start/end of speech, etc.
  5. Support for at least one "dictation" language model, enabling essentially unconstrained spoken input by the user.
  6. Support for "hotword" recognition, where the recognizer ignores speech that is out of grammar. This is particularly useful for open-mic scenarios.
  7. Support for slower than real-time recognition, since network conditions can and will introduce delays in the delivery of media.

Because continuous recognition plays an important role in HTML Speech scenarios, a Recognizer is a resource that essentially acts as a filter on its input streams. Its grammars/language models can be specified and changed, as needed by the application, and the recognizer adapts its processing accordingly. Single-shot recognition (e.g. a user on a web search page presses a button and utters a single web-search query) is a special case of this general pattern, where the application specifies its model once, and is only interested in one match event, after which it stops sending audio (if it hasn't already).

"Recognizers" are not strictly required to perform speech recognition, and may perform additional or alternative functions, such as speaker verification, emotion detection, or audio recording.

Synthesizer

A Synthesizer generates audio streams from textual input. It essentially provides a media stream with additional events, which the client buffers and plays back as required by the application. A Synthesizer service has the following characteristics:

  1. Rendering of application/ssml+xml [SSML] and text/plain input to an output audio stream. Implementations MAY support additional formats.
  2. Each synthesis request results in a separate output stream that is terminated once rendering is complete, or if it has been canceled by the client.
  3. Rendering must be performed and transmitted at least as rapidly as would be needed to support real-time playback, and preferably faster. In some cases, network conditions between service and UA may result in slower-than-real-time delivery of the stream, and the UA or application will need to cope with this appropriately.
  4. Generation of interim events, such as those corresponding to SSML marks, with precise timing. Events are transmitted by the service as closely as possible to the corresponding audio packet, to enable real-time playback by the client if required by the application.

Because a Synthesizer resource only renders a stream, and is not responsible for playback of that stream to a user, it does NOT:

  1. Provide any form of shuttle control (pausing or skipping), since this is performed by the client.
  2. Provide any control over volume, rate, pitch, etc, other than as specified in the SSML input document.
  3. Need to queue synthesis requests. It MAY service multiple simultaneous requests in parallel or in series, as deemed appropriate by the implementor.

TODO: There were some clarifying questions around this in the spec review. Robert Brown to expand on this, perhaps with an example.

3. Protocol Basics

TODO: add a section on security. Include authentication, encryption, transitive authorization to fetch resources.

3.1 Session Establishment

The WebSockets session is established through the standard WebSockets HTTP handshake, with these specifics:

Service parameters MAY be specified in the query string of the request URI. However, they may be overridden using messages in the html-speech/1.0 protocol once the WebSockets session has been established.

For example:


C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
      Host: examplespeechservice.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: OIUSDGY67SDGLjkSD&g2 (for example)
      Sec-WebSocket-Version: 9
      Sec-WebSocket-Protocol: html-speech/1.0, x-proprietary-speech

S->C: HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
      Sec-WebSocket-Protocol: html-speech/1.0

Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, responses or media.

A session may have a maximum of one synthesizer resource and one recognizer resource. If an application requires multiple resources of the same type (such as, for example, two synthesizers from different vendors), it must use separate WebSocket sessions.

NOTE: In MRCP, session negotiation also involves negotiating unique channel IDs (e.g. 128397521@recognizer) for the various resource types the client will need (recognizer, synthesizer, etc). In html-speech/1.0 this is unnecessary, since the WebSockets connection itself provides a unique shared context between the client and server, and resources are referred to directly by type, without the need for channel-IDs.

There is no association of state between sessions. If a service wishes to provide a special association between separate sessions, it may do so behind the scenes (for example, re-using audio input from one session in another session without resending it, or causing service-side barge-in of TTS in one session by recognition in another session, would be service-specific extensions).

Advanced scenarios that involve multiple engines of the same resource type, or that feed the same input media stream to different types of vendor-specific resources, are out of scope for this protocol. Services MAY implement such behavior behind the scenes.

3.2 Signaling

The signaling design borrows its basic pattern from [MRCPv2], where there are three classes of control messages:

Requests
C->S requests from the UA to the service. The client requests a method (SPEAK, LISTEN, STOP, etc) from a particular remote speech resource.
Status Notifications
S->C general status notification messages from the service to the UA, marked as either PENDING, IN-PROGRESS or COMPLETE.
Events
S->C named events from the service to the UA, that are essentially special cases of 'IN-PROGRESS' request-state notifications.

control-message =   start-line ; i.e. use the typical MIME message format
                  *(header CRLF)
                    CRLF
                   [body]
start-line      =   request-line | status-line | event-line
header          =  <Standard MIME header format> ; actual headers depend on the type of message
body            =  *OCTET                        ; depends on the type of message

The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.

For example:


C->S: html-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Content-Type:text/plain

        Hello world! I speak therefore I am.

S->C: html-speech/1.0 3257 200 IN-PROGRESS

S->C: media for 3257

C->S: html-speech/1.0 SPEAK 3258
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Content-Type:text/plain

        As for me, all I know is that I know nothing.

S->C: html-speech/1.0 3258 200 IN-PROGRESS

S->C: media for 3258

S->C: more media for 3257

S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE

S->C: more media for 3258

S->C: html-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE

The service MAY choose to serialize its processing of certain requests (such as only rendering one SPEAK request at a time), but MUST still accept multiple active requests.

3.2.1 Generic Headers


generic-header  = accept
                | accept-charset
                | content-base
                | logging-tag
                | resource-id
                | vendor-specific
                | content-type
                | content-encoding

resource-id     = "Resource-ID:" ("recognizer" | "synthesizer" | vendor-resource)
vendor-resource = "x-" 1*UTFCHAR

accept           = <same as [MRCPv2]>
accept-charset   = <same as [MRCPv2]>
content-base     = <same as [MRCPv2]>
content-type     = <same as [MRCPv2]>
content-encoding = <same as [MRCPv2]>
logging-tag      = <same as [MRCPv2]>
vendor-specific  = <same as [MRCPv2]>

NOTE: This is mostly a strict subset of the [MRCPv2] generic headers, many of which have been omitted as either unnecessary or inappropriate for HTML speech client/server scenarios.

Resource-ID
The Resource-ID header is included in all signaling messages. In requests, it indicates the resource to which the request is directed. In status messages and events, it indicates the resource from which the message originated.
Accept
The Accept header is similar to its namesake in [MRCPv2]. It MAY be included in any message to indicate the content types that the sender is willing to accept from the receiver. When absent, the following defaults should be assumed: clients will accept "application/emma+xml" from recognizers; recognizers will accept "application/srgs+xml"; synthesizers will accept "application/ssml+xml".
Accept-Charset
The Accept-Charset header is similar to its namesake in [MRCPv2]. When absent, any charset may be used. This header has two general purposes: so the client can indicate the charset it will accept in recognition results; and so the synthesizer can indicate the charset it will accept for SSML documents.
Content-Base
The Content-Base header is similar to its namesake in [MRCPv2]. When a message contains an entity that includes relative URIs, Content-Base provides the absolute URI against which they are based.
Logging-Tag
The Logging-Tag header is similar to its namesake in [MRCPv2]. It is generally only used in requests, or in response to GET-PARAMS.
Vendor-Specific-Parameters
The Vendor-Specific-Parameters header is similar to its namesake in [MRCPv2].

3.2.2 Request Messages

Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets html-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have an integer value between 0 and 10^10-1 (i.e. 1-10 decimal digits).


request-line   = version SP method-name SP request-id CRLF
version        = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
method-name    = general-method | synth-method | reco-method | proprietary-method
request-id     = 1*10DIGIT

NOTE: In MRCP, all messages include their message length, so that they can be framed in what is otherwise an open stream of data. In html-speech/1.0, framing is already provided by WebSockets, and message length is not needed, and therefore not included.

For example, to request the recognizer to interpret text as if it were spoken:


C->S: html-speech/1.0 INTERPRET 8322
      Resource-ID: recognizer
      Active-Grammars: <http://myserver/mygrammar.grxml>
      Interpret-Text: Send a dozen yellow roses and some expensive chocolates to my mother

3.2.3 Status Messages

Status messages are sent by the server, to indicate the state of a request.


status-line   =  version SP request-id SP status-code SP request-state CRLF								
status-code   =  3DIGIT       ; Specific codes TBD, but probably similar to those used in MRCP

; All communication from the server is labeled with a request state.
request-state = "COMPLETE"    ; Processing of the request has completed.
              | "IN-PROGRESS" ; The request is being fulfilled.
              | "PENDING"     ; Processing of the request has not begun.

Specific status code values would follow the general pattern used in [MRCPv2]:

2xx Success Codes
4xx Client Failure Codes
5xx Server Failure

TODO: Determine status code values.
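
For illustration only (the specific status code values are still TBD, as noted above), a request and its terminal status message might look like the following, assuming an MRCP-like 200 success code:

C->S: html-speech/1.0 STOP 8330
      Resource-ID: recognizer
      Source-Time: 12753437000

S->C: html-speech/1.0 8330 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle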

3.2.4 Event Messages

Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results. They are essentially specialized status messages.


event-line   =  version SP event-name SP request-id SP request-state CRLF								
event-name    =  synth-event | reco-event | proprietary-event

For example, an event indicating that the recognizer has detected the start of speech:

S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
      Resource-ID: recognizer
      Source-time: 12753439912 (when speech was detected)

3.3 Media Transmission

HTML Speech applications feature a wide variety of media transmission scenarios. The number of media streams at any given time is not fixed. A recognizer may accept one or more input streams, which may start and end at any time as microphones or other input devices are activated/deactivated by the application or the user. Recognizers do not require their data in real-time, and will generally prefer to wait for delayed packets in order to maintain accuracy, whereas a human listener would rather just tolerate the clicks and pops of missing packets so they can continue listening in real time. Applications may, and often will, request the synthesis of multiple SSML documents at the same time, which are buffered by the UA for playback at the application's discretion. The synthesizer needs to return rendered data to the client rapidly (generally faster than real time), and MAY render multiple requests in parallel if it has the capacity to do so.

Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, living-room devices with microphone arrays may send separate streams in order to capture the speech of multiple individuals within the room. Or, for example, some devices may send parallel streams with alternative encodings that may not be human-consumable (like standard codecs) but contain information that is of particular value to a recognition service.

In html-speech/1.0, audio (or other media) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session used for the control messages.


media-packet        =  binary-message-type
                       binary-stream-id
                       binary-data
binary-message-type =  OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-stream-id    = 3OCTET ; Unique identifier for the stream, 0..2^24-1
binary-data         = *OCTET 

The binary-stream-id field is used to identify the messages for a particular stream. It is a 24-bit unsigned integer. Its value for any given stream is assigned by the sender (client or server) in the first message of that stream, and must be unique to the sender within the WebSockets session.

The binary-message-type field has these defined values:

0x01: Start of Stream
The message indicates the start of a new stream. This MUST be the first message in any stream. The stream-ID must be new and unique to the session. This same stream-ID is used in all future messages in the stream. The message body contains a 64-bit NTP timestamp containing the local time at the stream originator, which is used as the base time for calculating event timing. The timestamp is followed by the ASCII-encoded MIME media type (see [RFC3555]) describing the format that the stream's media is encoded in. This is usually some sort of audio encoding, and at a minimum all implementations MUST support 8kHz single-channel mulaw (audio/basic) and 8kHz single-channel 16-bit linear PCM (audio/L16;rate=8000). Implementations MAY support other content types, for example: to recognize from a video stream; to provide a stream of recognition preprocessing coefficients; to provide textual metadata streams; or to provide auxiliary multimodal input streams such as touches/strokes, gestures, clicks, compass bearings, etc.

TODO: note that at least muLaw/aLaw/PCM must be supported.


         message type = 0x01; stream-id = 112233; media-type = audio/amr-wb
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+
0x02: Media
The message is a media packet, and contains encoded media data.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+
The encoding format is specified in the start of stream message (0x01).

NOTE: The design does not permit the transmission of media as text messages. WebSockets already provides native support for binary messages, and base-64 encoding of binary data into text would incur an unnecessary 33% transmission overhead.

TODO: Change "Audio" to "Media". Clarify that its encoding is specified by a mime content type in the request that initiated the stream (SPEAK or START-MEDIA-STREAM), and while it will usually be some form of audio encoding, it MAY be any content type, including text, pen strokes, touches/clicks, compass bearings, etc.

TODO: DELETE the Skip message type.

0x03: End-of-stream
The 0x03 end-of-stream message indicates the end of the media stream, and MUST be used to terminate a media stream. Any future media stream messages with the same stream-id are invalid.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

A sequence of media messages with the same stream-ID represents an in-order contiguous stream of data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the consumer (e.g. the recognizer's decoder, or the TTS playback buffer). Timing of coordinated events is calculated by decoded offset from the beginning of the stream.

Media streams are multiplexed with signaling messages. Multiple media streams can also be multiplexed on the same socket. The WebSockets stack de-multiplexes text and binary messages, thus separating signaling from media, while the stream-ID on each media message is used to de-multiplex the messages into separate media streams.
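
For instance, a client might interleave signaling messages and the media for two input streams on the one socket, in a sequence such as the following (stream-IDs and request-IDs are arbitrary):

C->S: binary 0x01 (start of stream 7, audio/basic)
C->S: html-speech/1.0 LISTEN 4001 (text message)
C->S: binary 0x02 (media for stream 7)
C->S: binary 0x01 (start of stream 8, audio/basic)
C->S: binary 0x02 (media for stream 8)
C->S: binary 0x02 (media for stream 7)
S->C: html-speech/1.0 START-OF-SPEECH 4001 IN-PROGRESS (text message)
C->S: binary 0x03 (end of stream 7)
C->S: binary 0x03 (end of stream 8)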

There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. However, implementations SHOULD seek to minimize interference with the flow of other messages on the same socket, by sending messages that encode between 20 and 80 milliseconds of media. Since a WebSockets frame header is typically only 4 bytes, overhead is minimal and implementations SHOULD err on the side of sending smaller packets more frequently.

A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.

A recognition service MUST be prepared to receive slower-than-real-time audio due to practical throughput limitations of the network.

Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.

TODO: Write the rationale for why we mix media and signal in the same session. [Michael Johnston]

4. General Capabilities

TODO: There is an open issue to do with transitive access control. The client sends a URI to the service, which the client can access, but the service cannot, because it is not authorized to do so. How does the client grant access to the resource to the service? There are two design contenders. The first is to use the cookie technique that MRCP uses. The second is to use a virtual tag, which we discussed briefly at the F2F - Michael Bodell owes a write-up. In the absence of that write-up, perhaps the default position should be to use cookies.

4.1 Getting and Setting Parameters

TODO: Specify which headers are sticky. URI request parameters aren't standardized.

The GET-PARAMS and SET-PARAMS requests are the same as their [MRCPv2] counterparts. They are used to discover and set the configuration parameters of a resource (recognizer or synthesizer). Like all messages, they must always include the Resource-ID header. GET-PARAMS and SET-PARAMS work with global parameter settings. Individual requests may set different values that apply only to that request.


general-method = "SET-PARAMS"
               | "GET-PARAMS"

header         = capability-query-header
               | interim-event-header
               | reco-header
               | synth-header

capability-query-header =
                 "Supported-Content:" mime-type *("," mime-type)
               | "Supported-Languages:" lang-tag *("," lang-tag) ; See [RFC5646]
               | "Builtin-Grammars:" "<" URI ">" *("," "<" URI ">")

interim-event-header =
                 "Interim-Events:" event-name *("," event-name)
event-name = 1*UTFCHAR

Additional headers are introduced in html-speech/1.0 to provide a way for the application/UA to determine whether a resource supports the basic capabilities it needs. In most cases applications will know a service's resource capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine resource capabilities, the UA sends a GET-PARAMS request to the resource, containing a set of capabilities, to which the resource responds with the specific subset it actually supports.

TODO: how do we check for other configuration settings? e.g. what grammars are available? e.g. supported grammar format (srgs-xml vs srgs-ebnf vs some-slm-format).

ISSUE: this could become unwieldy as more parameters are added. Is there a more generic approach?

Supported-Languages
This read-only property is used by the client to discover whether a resource supports a particular set of languages. Unlike most headers, when a blank value is used in GET-PARAMS, the resource will respond with a blank header rather than the full set of languages it supports. This avoids the resource having to respond with a potentially cumbersome and possibly ambiguous list of languages and dialects. Instead, the client must include the set of languages in which it is interested as the value of the Supported-Languages header in the GET-PARAMS request. The service will respond with the subset of these languages that it actually supports.
Supported-Content
This read-only property is used to discover whether a resource supports particular encoding formats for input or output data. Given the broad variety of codecs, and the large set of parameter permutations for each codec, it is impractical for a resource to advertise all media encodings it could possibly support. Hence, when a blank value is used in GET-PARAMS, the resource will respond with a blank value. Instead, the client must supply the set of data encodings it is interested in, and the resource responds with the subset it actually supports. This is used not only to discover supported media encoding formats, but also to discover other input and output data formats, such as alternatives to SRGS, EMMA and SSML.
Builtin-Grammars
This read-only property is used by the client to discover whether a recognizer has a particular set of built-in grammars. The client provides a list of builtin: URIs in the GET-PARAMS request, to which the recognizer responds with the subset of URIs it actually supports.
Interim-Events
This read/write property contains the set of interim events the client would like the service to send (see Interim Events below).

For example:


C->S: html-speech/1.0 GET-PARAMS 34132
      resource-id: recognizer
      supported-content: audio/basic, audio/amr-wb, 
                         audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                         audio/dsr-es202212; rate:8000; maxptime:40,
                         application/x-ngram+xml
      supported-languages: en-AU, en-GB, en-US, en (A variety of English dialects are desired)
      builtin-grammars: <builtin:dictation?topic=websearch>, 
                        <builtin:dictation?topic=message>, 
                        <builtin:ordinals>, 
                        <builtin:datetime>, 
                        <builtin:cities?locale=USA>

S->C: html-speech/1.0 34132 200 COMPLETE
      resource-id: recognizer
      supported-content: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      supported-languages: en-GB, en (The recognizer supports UK English, but will work with any English)
      builtin-grammars: <builtin:dictation?topic=websearch>, <builtin:dictation?topic=message>

C->S: html-speech/1.0 GET-PARAMS 48223
      resource-id: synthesizer
      supported-content: audio/ogg, audio/flac, audio/basic
      supported-languages: en-AU, en-GB

S->C: html-speech/1.0 48223 200 COMPLETE
      resource-id: synthesizer
      supported-content: audio/flac, audio/basic
      supported-languages: en-GB
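
SET-PARAMS follows the same pattern. For example, a client might subscribe to a set of interim events (the event names shown are hypothetical vendor-specific examples; see Interim Events below):

C->S: html-speech/1.0 SET-PARAMS 48224
      resource-id: synthesizer
      interim-events: x-acme-word-boundary, x-acme-viseme

S->C: html-speech/1.0 48224 200 COMPLETE
      resource-id: synthesizer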

4.2 Interim Events

Speech services may wish to send optional vendor-specific interim events during the processing of a request. For example: some recognizers are capable of providing additional information as they process input audio; and some synthesizers are capable of firing progress events on word, phoneme, and viseme boundaries. These are exposed through the HTML Speech API as events that the webapp can listen for if it knows to do so. A service vendor MAY require a vendor-specific value to be set with SET-PARAMS before it starts to fire certain events.


interim-event =   version SP "INTERIM-EVENT" SP request-id SP request-state CRLF
                *(header CRLF)
                  CRLF
                 [body]
event-name-header = "Event-Name:" event-name

The Event-Name header is required and must contain a value that was previously subscribed to with the Interim-Events header.

The Request-ID and Content-Type headers are required, and any data conveyed by the event must be contained in the body.
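
For example, a recognizer might deliver a vendor-specific event to which the client previously subscribed (the event name and JSON body are purely illustrative):

S->C: html-speech/1.0 INTERIM-EVENT 8322 IN-PROGRESS
      Resource-ID: recognizer
      Event-Name: x-acme-emotion
      Source-Time: 12753440200
      Content-Type: application/json

      {"emotion": "neutral", "confidence": 0.82}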

4.3 Resource Selection

Applications will generally want to select resources with certain capabilities, such as the ability to recognize certain languages, work well in specific acoustic conditions, work well with specific genders or ages, speak particular languages, speak with a particular style, age or gender, etc.

There are three ways in which resource selection can be achieved, each of which has relevance:

By URI

Any service may enable applications to encode resource requirements as query string parameters in the URI, or by using specific URIs with known resources. The specific URI format and parameter scheme is by necessity not standardized and is defined by the implementer based on their architecture and service offerings. For example:


ws://example1.net:2233/webreco/de-de/cell-phone
ws://example2.net/?reco-lang=en-UK&reco-acoustic=10-foot-open-room&sample-rate=16kHz&channels=2
ws://example3.net/speech?reco-lang=es-es&tts-lang=es-es&tts-gender=female&tts-vendor=acmesynth
ws://example4.com/profile=af3e-239e-9a01-66c0
By Request Header

Request headers may also be used to select specific resource capabilities. Synthesizer parameters are set through SET-PARAMS or SPEAK, whereas recognizer parameters are set through SET-PARAMS or LISTEN. There is a small set of standard headers that can be used with each resource: the Speech-Language header may be used with both the recognizer and synthesizer, and the synthesizer may also accept a variety of voice selection parameters as headers. However, a resource does not need to support these headers where it does not have the ability to do so. If a particular header value is unsupported, the request should fail with a status of 407 "Unsupported Header Field Value". For example:


C->S: html-speech/1.0 LISTEN 8322
      Resource-ID: Recognizer
      Speech-Language: fr-CA

C->S: html-speech/1.0 SET-PARAMS 8323
      Resource-ID: Recognizer
      Speech-Language: pt-BR

C->S: html-speech/1.0 SPEAK 8324
      Resource-ID: Synthesizer
      Speech-Language: ko-KR
      Voice-Age: 35
      Voice-Gender: female

C->S: html-speech/1.0 SET-PARAMS 8325
      Resource-ID: Synthesizer
      Speech-Language: sv-SE
      Voice-Name: Kiana
By Input Document

The [SRGS] and [SSML] input documents for the recognizer and synthesizer will specify the language for the overall document, and MAY specify languages for specific subsections of the document. The resource consuming these documents SHOULD honor these language assignments when they occur. If a resource is unable to do so, it should error with a 4xx status "Unsupported content language". (It should be noted that at the time of writing, most currently available recognizer and synthesizer implementations will be unable to support this capability.)

Generally speaking, unless a service is unusually adaptable, applications are better off using specific URLs that encode the abilities they need, so that the appropriate resources can be allocated during session initiation.

5. Recognition

A recognizer resource is either in the "listening" state, or the "idle" state. Because continuous recognition scenarios often don't have dialog turns or other down-time, all functions are performed in series on the same input stream(s). The key distinction between the idle and listening states is the obvious one: when listening, the recognizer processes incoming media and produces results; whereas when idle, the recognizer SHOULD buffer audio but will not process it. For example: text dictation applications commonly have a variety of command grammars that are activated and deactivated to enable editing and correction modes; in open-microphone multimodal applications, the application will listen continuously, but change the set of active grammars based on the user's other non-speech interactions with the app. Grammars can be loaded, and rules activated or deactivated, while the recognizer is idle (but not while it is listening).

Recognition is accomplished with a set of messages and events, to a certain extent inspired by those in [MRCPv2].


Idle State                 Listening State
    |                            |
    |--\                         |
    |  DEFINE-GRAMMAR            |
    |<-/                         |
    |                            |
    |--\                         |
    |  SET-GRAMMARS              |
    |<-/                         |
    |                            |
    |--\                         |--\
    |  GET-GRAMMARS              |  GET-GRAMMARS
    |<-/                         |<-/
    |                            |
    |--\                         |
    |  INFO                      |
    |<-/                         |
    |                            |
    |---------LISTEN------------>|
    |                            |
    |                            |--\
    |                            |  INTERIM-EVENT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-INPUT-TIMERS
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  END-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  INFO
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  INTERMEDIATE-RESULT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  RECOGNITION-COMPLETE
    |                            | (when mode = recognize-continuous)
    |                            |<-/
    |                            |
    |<---RECOGNITION-COMPLETE----|
    |(when mode = recognize-once)|
    |                            |
    |                            |
    |<--no media streams remain--|
    |                            |
    |                            |
    |<----------STOP-------------|
    |                            |
    |                            |
    |<---some 4xx/5xx errors-----|
    |                            |
    |--\                         |--\
    |  INTERPRET                 |  INTERPRET
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  INTERPRETATION-COMPLETE   |  INTERPRETATION-COMPLETE
    |<-/                         |<-/
    |                            |

5.1 Recognition Requests


reco-method  = "LISTEN"             ; Transitions Idle -> Listening
             | "START-INPUT-TIMERS" ; Starts the timer for the various input timeout conditions
             | "STOP"               ; Transitions Listening -> Idle
             | "DEFINE-GRAMMAR"     ; Pre-loads & compiles a grammar, assigns a temporary URI for reference in other methods
             | "SET-GRAMMARS"       ; Activates and deactivates grammars and rules
             | "GET-GRAMMARS"       ; Returns the current grammar and rule state
             | "CLEAR-GRAMMARS"     ; Unloads all grammars, whether active or inactive
             | "INTERPRET"          ; Interprets input text as though it was spoken
             | "INFO"               ; Sends metadata to the recognizer
LISTEN

The LISTEN method transitions the recognizer from the idle state to the listening state. The recognizer then processes the media input streams against the set of active grammars. The request MUST include the Source-Time header, which is used by the Recognizer to determine the point in the input stream(s) that the recognizer should start processing from (which won't necessarily be the start of the stream). The request MUST also include the Listen-Mode header to indicate whether the recognizer should perform continuous recognition, a single recognition, or vendor-specific processing.

A LISTEN request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.

NOTE: LISTEN does NOT use the same grammar specification technique as the MRCP RECOGNIZE method. In html-speech/1.0 this would add unnecessary and redundant complexity, since all the necessary functionality is already present in other html-speech/1.0 methods.

When there are no input media streams, and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the listen request will fail (4xx). When in the listening state, and all input streams have ended, the recognizer automatically transitions to the idle state, and issues a RECOGNITION-COMPLETE event, with Completion-Cause set to 4xx (tbd).

TODO: Specify Completion-Cause value for no input stream.

A LISTEN request that is made while the recognizer is already listening results in a 402 error ("Method not valid in this state", since it is already listening).
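
For example, a minimal one-shot LISTEN request and its acknowledgement might look like this (header values are illustrative):

C->S: html-speech/1.0 LISTEN 8340
      Resource-ID: recognizer
      Source-Time: 12753439000
      Listen-Mode: reco-once
      Active-Grammars: <http://myserver/mygrammar.grxml>

S->C: html-speech/1.0 8340 200 IN-PROGRESS
      Resource-ID: recognizer
      Recognizer-State: listening
      Listen-Mode: reco-once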

START-INPUT-TIMERS

This is identical to the [MRCPv2] method with the same name. It is useful, for example, when the application wants to enable voice barge-in during a prompt, but doesn't want to start the time-out clock until after the prompt has completed.

STOP

The STOP method transitions the recognizer from the listening state to the idle state. No RECOGNITION-COMPLETE event is sent. The Source-Time header MUST be used, since the recognizer may still fire a RECOGNITION-COMPLETE event for any completion state it encounters prior to that time in the input stream.

A STOP request that is sent while the recognizer is idle results in a 402 response (method not valid in this state, since there is nothing to stop).

DEFINE-GRAMMAR

The DEFINE-GRAMMAR method is similar to its namesake in [MRCPv2]. DEFINE-GRAMMAR does not activate a grammar, it simply causes the recognizer to pre-load and compile it, and associates it with a temporary URI that can then be used to activate or deactivate the grammar or one of its rules. DEFINE-GRAMMAR is not required in order to use a grammar, since the recognizer can load grammars on demand as needed. However, it is useful when an application wants to ensure a large grammar is pre-loaded and ready for use prior to the recognizer entering the listening state. DEFINE-GRAMMAR can be used when the recognizer is in either the listening or idle state.

All recognizer services MUST support grammars in the SRGS XML format, and MAY support additional alternative grammar/language-model formats.
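
For example, a client might pre-load a small inline grammar. This sketch assumes the [MRCPv2] convention of using a Content-ID header to name the grammar so that it can be referenced later:

C->S: html-speech/1.0 DEFINE-GRAMMAR 8341
      Resource-ID: recognizer
      Source-Time: 12753438500
      Content-ID: <cities@example.org>
      Content-Type: application/srgs+xml

      <?xml version="1.0" encoding="UTF-8"?>
      <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
               xml:lang="en-US" root="city">
        <rule id="city">
          <one-of>
            <item>Seattle</item>
            <item>Madrid</item>
          </one-of>
        </rule>
      </grammar>

S->C: html-speech/1.0 8341 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle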

SET-GRAMMARS

The SET-GRAMMARS method is used to activate and deactivate grammars and rules, using the Active-Grammars and Inactive-Grammars headers. The Source-Time header MUST be used, and activations/deactivations are considered to take place at precisely that time in the input stream(s).

SET-GRAMMARS may only be requested when the recognizer is in the idle state. It will fail (4xx) if requested in the listening state.

The recognizer MUST support grammars in the [SRGS] XML format, and may support grammars (or other forms of language model) in other formats.

ISSUE: Do we need an explicit method for this, or is SET-PARAMS enough? One option is to not allow them on set/get-params. Another is to say that if get/set-params does exactly the same thing, then there's no need for this method. If there's a default set of active grammars, then get-params might be required. Get-params may also be useful for defensive programming. Inline grammars don't have URIs. Suggestion is to add a GET-GRAMMARS, and disallow get/set-params.

GET-GRAMMARS

The GET-GRAMMARS method is used to query the set of active grammars and rules. The recognizer should respond with a 200 COMPLETE status message, containing an Active-Grammars header that lists all of the currently active grammars and rules.

TODO: Should GET-GRAMMARS also return the list of inactive grammars/rules? It's not clear how that would be useful. Also, the list of inactive rules could be rather long and unwieldy.
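
For example, a client might activate two grammars while the recognizer is idle, and later query the active set (the URIs and weight are illustrative):

C->S: html-speech/1.0 SET-GRAMMARS 8342
      Resource-ID: recognizer
      Source-Time: 12753438800
      Active-Grammars: <http://myserver/pizza.grxml#size 0.8>, <http://myserver/commands.grxml>
      Inactive-Grammars: <http://myserver/help.grxml>

S->C: html-speech/1.0 8342 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle

C->S: html-speech/1.0 GET-GRAMMARS 8343
      Resource-ID: recognizer
      Source-Time: 12753438900

S->C: html-speech/1.0 8343 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle
      Active-Grammars: <http://myserver/pizza.grxml#size 0.8>, <http://myserver/commands.grxml>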

CLEAR-GRAMMARS

In continuous recognition, a variety of grammars may be loaded over time, potentially resulting in unused grammars consuming memory resources in the recognizer. The CLEAR-GRAMMARS method unloads all grammars, whether active or inactive. Any URIs previously defined with DEFINE-GRAMMAR become invalid.

INTERPRET

The INTERPRET method is similar to its namesake in [MRCPv2], and processes the input text according to the set of grammar rules that are active at the time it is received by the recognizer. It MUST include the Interpret-Text header. The use of INTERPRET is orthogonal to any audio processing the recognizer may be doing, and will not affect any audio processing. The recognizer can be in either the listening or idle state.

INFO

In multimodal applications, some recognizers will benefit from additional context. Clients can use the INFO request to send this context. The Content-Type header should specify the type of data, and the data itself is contained in the message body.
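
For example, a multimodal client might tell the recognizer what is currently displayed on screen (the content type and body here are hypothetical):

C->S: html-speech/1.0 INFO 8344
      Resource-ID: recognizer
      Source-Time: 12753439500
      Content-Type: application/json

      {"visible-items": ["pepperoni", "mushroom", "olives"]}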

5.2 Recognition Events

Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.


reco-event   = "START-OF-SPEECH"      ; Start of speech has been detected
             | "END-OF-SPEECH"        ; End of speech has been detected
             | "INTERIM-EVENT"        ; See Interim Events above
             | "INTERMEDIATE-RESULT"  ; A partial hypothesis
             | "RECOGNITION-COMPLETE" ; Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
             | "INTERPRETATION-COMPLETE"

END-OF-SPEECH

END-OF-SPEECH is the logical counterpart to START-OF-SPEECH, and indicates that speech has ended. The event MUST include the Source-Time header, which corresponds to the point in the input stream where the recognizer estimates speech to have ended, NOT when the endpointer finally decided that speech ended (which will be a number of milliseconds later).

INTERIM-EVENT

See Interim Events above.

INTERMEDIATE-RESULT

Continuous speech (e.g. dictation) often requires feedback about what has been recognized thus far. Waiting for a RECOGNITION-COMPLETE event would preclude this sort of user interface. INTERMEDIATE-RESULT provides this intermediate feedback. As with RECOGNITION-COMPLETE, contents are assumed to be EMMA unless an alternate Content-Type is provided.
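
For example, an intermediate dictation result might be delivered as follows (a minimal sketch: the EMMA body shown is illustrative, and real results will typically carry more annotation):

S->C: html-speech/1.0 INTERMEDIATE-RESULT 8345 IN-PROGRESS
      Resource-ID: recognizer
      Recognizer-State: listening
      Listen-Mode: reco-continuous
      Source-Time: 12753441200
      Partial: true
      Content-Type: application/emma+xml

      <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
        <emma:interpretation id="partial1" emma:medium="acoustic" emma:mode="voice"
                             emma:tokens="dear father">
          <text>dear father</text>
        </emma:interpretation>
      </emma:emma>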

INTERPRETATION-COMPLETE

This event is identical to the [MRCPv2] event with the same name.

RECOGNITION-COMPLETE

This event is similar to the [MRCPv2] event with the same name, except that application/emma+xml (EMMA) is the default Content-Type. The Source-Time header must be included, to indicate the point in the input stream where the event occurred. When the Listen-Mode is reco-once, the recognizer transitions from the listening state to the idle state when this message is fired, and the Recognizer-State header in the event is set to "idle".

TODO: Describe how final results can be replaced in continuous recognition.

TODO: When no match is returned, is an EMMA no-match document required?

TODO: Insert some EMMA document examples.

START-OF-SPEECH

Indicates that start of speech has been detected. The Source-Time header MUST correspond to the point in the input stream(s) where speech was estimated to begin, NOT when the endpointer finally decided that speech began (a number of milliseconds later).

5.3 Recognition Headers

The list of valid headers for the recognizer resource includes a subset of the [MRCPv2] Recognizer Header Fields, where they make sense for HTML Speech requirements, as well as a handful of headers that are required for HTML Speech.


reco-header =  ; Headers borrowed from MRCP
               Confidence-Threshold
             | Sensitivity-Level
             | Speed-Vs-Accuracy
             | N-Best-List-Length
             | No-Input-Timeout
             | Recognition-Timeout
             | Waveform-URI
             | Media-Type
             | Input-Waveform-URI
             | Completion-Cause
             | Completion-Reason
             | Recognizer-Context-Block
             | Start-Input-Timers
             | Speech-Complete-Timeout
             | Speech-Incomplete-Timeout
             | Failed-URI
             | Failed-URI-Cause
             | Save-Waveform
             | Speech-Language
             | Hotword-Min-Duration
             | Hotword-Max-Duration
             | Interpret-Text
             ; Headers added for html-speech/1.0
             | audio-codec           ; The audio codec used in an input media stream
             | active-grammars       ; Specifies a grammar or specific rule to activate.
             | inactive-grammars     ; Specifies a grammar or specific rule to deactivate.
             | hotword               ; Whether to listen in "hotword" mode (i.e. ignore out-of-grammar speech)
             | listen-mode           ; Whether to do continuous or one-shot recognition
             | partial               ; Whether to send partial results
             | partial-interval      ; Suggested interval between partial results, in milliseconds.
             | recognizer-state      ; Indicates whether the recognizer is listening or idle
             | source-time           ; The UA's local time at which the request was initiated
             | user-id               ; Unique identifier for the user, so that adaptation can be used to improve accuracy.
             | Wave-Start-Time       ; The start point of a recognition in the audio referred to by Waveform-URI.
             | Wave-End-Time         ; The end point of a recognition in the audio referred to by Waveform-URI.

hotword            = "Hotword:" BOOLEAN
listen-mode        = "Listen-Mode:" ("reco-once" | "reco-continuous" | vendor-listen-mode)
vendor-listen-mode = "x-" 1*UTFCHAR
recognizer-state   = "Recognizer-State:" ("listening" | "idle")
source-time        = "Source-Time:" 1*20DIGIT
audio-codec        = "Audio-Codec:" mime-media-type ; see [RFC3555]
partial            = "Partial:" BOOLEAN
partial-interval   = "Partial-Interval:" 1*5DIGIT
active-grammars    = "Grammar-Activate:" "<" URI ["#" rule-name] [SP weight] ">" *("," "<" URI ["#" rule-name] [SP weight] ">")
rule-name          = 1*UTFCHAR
weight             = "0." 1*3DIGIT
inactive-grammars  = "Grammar-Deactivate:" "<" URI ["#" rule-name] ">" *("," "<" URI ["#" rule-name] ">")
user-id            = "User-ID:" 1*UTFCHAR
wave-start-time    = "Wave-Start-Time:" 1*DIGIT ["." 1*DIGIT]
wave-end-time      = "Wave-End-Time:"  1*DIGIT ["." 1*DIGIT]

TODO: discuss how recognition from file would work.

Headers with the same names as their [MRCPv2] counterparts are considered to have the same specification. Other headers are described as follows:

Audio-Codec

The Audio-Codec header specifies the codec and parameters used to encode an input media stream, using the MIME media type encoding scheme specified in [RFC3555].

Active-Grammars

The Active-Grammars header specifies a list of grammars, and optionally specific rules within those grammars. The header is used in SET-GRAMMARS or LISTEN to activate grammars/rules, and in GET-GRAMMARS to list the active grammars/rules. If no rule is specified for a grammar, the root rule is activated. This header may also specify the weight of the rule.

This header cannot be used in GET/SET-PARAMS

ISSUE: Grammar-Activate/Deactivate probably don't make sense in GET/SET-PARAMS. Is this an issue? Perhaps this would be better achieved in the message body? The same format could be used.

Inactive-Grammars

The Inactive-Grammars header specifies a list of grammars, and optionally specific rules within those grammars, to be deactivated. If no rule is specified, all rules in the grammar are deactivated, including the root rule. The Inactive-Grammars header MAY be used in both the SET-GRAMMARS and LISTEN methods.

This header cannot be used in GET/SET-PARAMS

Hotword

The Hotword header is analogous to the [MRCPv2] Recognition-Mode header; however, it has a different name and a boolean type in html-speech/1.0, in order to avoid confusion with the Listen-Mode header. When true, the recognizer functions in "hotword" mode, which essentially means that out-of-grammar speech is ignored.

Listen-Mode

Listen-Mode is used in the LISTEN request to specify whether the recognizer should listen continuously, or return to the idle state after the first RECOGNITION-COMPLETE event. It MUST NOT be used in any request other than LISTEN. When the recognizer is in the listening state, it should include Listen-Mode in all event and status messages it sends.

Partial

This header is required to support the continuous speech scenario on the recognizer resource. When sent by the client in a LISTEN or SET-PARAMS request, this header controls whether or not the client is interested in partial results from the service. In this context, the term 'partial' describes mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes like an NBest list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on regular command-and-control recognition requests as well as dictation sessions, because at the API level there is no syntactic difference between the recognition types: they are both simply recognition requests over an SRGS grammar or set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, for example: open-microphone applications, dictation enrollment applications, and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, as well as in response to GET-PARAMS.

Partial-Interval

A suggestion from the client to the service on the frequency at which partial results should be sent. It is an integer value that represents the desired interval, expressed in milliseconds. The recognizer does not need to honor the requested interval precisely, but SHOULD provide something close, if it is within the operating parameters of the implementation.
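
For illustration, a minimal client-side sketch (TypeScript) of a continuous LISTEN request that asks for partial results. The headers mirror the continuous-recognition example in section 5.6; the 300 ms Partial-Interval value and the millisecond Source-Time notation are assumptions for the sketch.

// Sketch only. Assumes "ws" is an open html-speech/1.0 WebSocket session.
function startContinuousDictation(ws: WebSocket, requestId: number): void {
  const listen = [
    `html-speech/1.0 LISTEN ${requestId}`,
    `Resource-Identifier: recognizer`,
    `Active-Grammars: <builtin:dictation?context=message>`,
    `Listen-Mode: reco-continuous`,
    `Partial: TRUE`,         // request mid-utterance best guesses
    `Partial-Interval: 300`, // suggest a partial roughly every 300 ms (service may deviate)
    `Source-Time: ${Date.now()}`,
    ``,
  ].join("\r\n");
  ws.send(listen);
}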

Recognizer-State

Indicates whether the recognizer is listening or idle. This MUST NOT be included by the client in any requests, and MUST be included by the recognizer in all status and event messages it sends.

Source-Time

Indicates the timestamp of a message using the client's local time. All requests sent from the client to the recognizer MUST include the Source-Time header, which must faithfully specify the client's local system time at the moment it sends the request. This enables the recognizer to correctly synchronize requests with the precise point in the input stream at which they were actually sent by the client. All event messages sent by the recognizer MUST include Source-Time, calculated by the recognizer service based on the point in the input stream at which the event occurred, and expressed in the client's local clock time (since the recognizer knows what this was at the start of the input stream). By expressing all times in client time, the user agent or application is able to correctly sequence events and implement timing-sensitive scenarios that involve other objects outside the knowledge of the recognizer service (for example, media playback objects or videogame states).

TODO: What notation should be used? The Media Fragments Draft, "Temporal Dimensions" section has some potentially viable formats, such as the "wall clock" Zulu-time format.
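
To show why client-clock timestamps are useful, here is a small sketch (TypeScript) that merges recognizer events with other local timeline items. Since the timestamp notation is still open (see the TODO above), a plain numeric millisecond value on the client's clock is assumed.

// Sketch only; assumes Source-Time values are numeric milliseconds on the client's clock.
interface RecognizerEvent {
  name: string;       // e.g. "START-OF-SPEECH", "RECOGNITION-COMPLETE"
  sourceTime: number; // client-clock time at which the event occurred in the input stream
}

// Because the recognizer expresses event times on the client's clock, the UA can
// order speech events against unrelated local events (media playback positions,
// game state changes, etc.) without knowing anything about network latency.
function mergeWithLocalTimeline(
  speechEvents: RecognizerEvent[],
  localEvents: { name: string; time: number }[],
): { name: string; time: number }[] {
  const merged = [
    ...speechEvents.map(e => ({ name: e.name, time: e.sourceTime })),
    ...localEvents,
  ];
  return merged.sort((a, b) => a.time - b.time);
}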

User-ID

Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, since vocabularies are so large. The User-ID header allows the recognizer to establish the user's identity if the webapp chooses to supply this information.

Wave-Start/End-Time
See Recording and Re-Recognizing.

ISSUE: There were some additional headers proposed a few weeks ago: Return-Punctuation; Gender-Number-Pronoun; Return-Formatting; Filter-Profanity. There was pushback that these shouldn't be in a standard because they're very vendor-specific. Thus I haven't included them in this draft. Do we agree these are appropriate to omit? NOTE: we decided to omit them.

5.4 Recording and Re-Recognizing

Some applications will wish to re-recognize an utterance using different grammars. For example, an application may accept a broad range of input and use the first round of recognition simply to classify an utterance, so that a more focused grammar can be used on the second round. Other applications will wish to record an utterance for future use. For example, an application that transcribes an utterance to text may store a recording so that untranscribed information (tone, emotion, etc.) is not lost. While these are not mainstream scenarios, they are both valid and inevitable, and may be achieved using the headers provided for recognition.

If the Save-Waveform header is set to true (with SET-PARAMS or LISTEN), then the recognizer will save the input audio. Consequent RECOGNITION-COMPLETE events sent by the recognizer will contain a URI in the Waveform-URI header which refers to the stored audio. In the case of continuous recognition, the Waveform-URI header refers to all of the audio captured so far. The application may fetch the audio from this URI, assuming it has appropriate credentials (the credential policy is determined by the service provider). The application may also use the URI as input to future LISTEN requests by passing the URI in the Input-Waveform-URI header.

When RECOGNITION-COMPLETE returns a Waveform-URI header, it also returns the time interval within the recorded waveform that the recognition result applies to, in the Wave-Start-Time and Wave-End-Time headers, which indicate offsets in seconds from the start of the waveform. A client MAY also use the Source-Time header of other events such as START-OF-SPEECH and END-OF-SPEECH to calculate other intervals of interest. When using the Input-Waveform-URI header, the client may suffix the URI with an "interval" parameter to indicate that the recognizer should only decode that particular interval of the audio:


interval = "interval=" start "," end
start    = seconds | "start"
end      = seconds | "end"
seconds  = 1*DIGIT ["." 1*DIGIT]

For example:


http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86
http://example.com/temp44235.wav?interval=0.65,end
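
The following sketch (TypeScript) ties these pieces together: it takes the Waveform-URI and Wave-Start/End-Time values from a RECOGNITION-COMPLETE event and builds a follow-up LISTEN request that re-recognizes just that interval with a more focused grammar. The grammar URI, the angle-bracket quoting of Input-Waveform-URI, and the Source-Time notation are assumptions for illustration.

// Sketch only. Assumes the RECOGNITION-COMPLETE headers were already parsed into
// the fields below (Save-Waveform must have been true for Waveform-URI to exist).
interface RecoCompleteInfo {
  waveformUri: string;    // from the Waveform-URI header
  waveStartTime: number;  // seconds, from the Wave-Start-Time header
  waveEndTime: number;    // seconds, from the Wave-End-Time header
}

function buildReRecognizeListen(info: RecoCompleteInfo, requestId: number): string {
  // Only decode the portion of the stored audio that the first result applied to.
  const uri = `${info.waveformUri}?interval=${info.waveStartTime},${info.waveEndTime}`;
  return [
    `html-speech/1.0 LISTEN ${requestId}`,
    `Resource-Identifier: recognizer`,
    `Input-Waveform-URI: <${uri}>`,
    `Active-Grammars: <http://example.com/grammars/pizza-details.grxml>`, // a more focused grammar
    `Listen-Mode: reco-once`,
    `Source-Time: ${Date.now()}`,
    ``,
  ].join("\r\n");
}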

TODO: does the Waveform-URI return a URI for each input stream, or are all input streams magically encoded into a single stream?

TODO: does the Input-Waveform-URI cause any existing input streams to be ignored?

5.5 Predefined Grammars

Speech services MAY support predefined grammars that can be referenced through a 'builtin:' URI, for example <builtin:dictation?context=email&lang=en_US>, <builtin:date>, or <builtin:search?context=web>. These can be used as top-level grammars in the Active-Grammars/Inactive-Grammars headers, or in rule references within other grammars. If a speech service does not support the referenced builtin, or does not support it in combination with the other active grammars, it should return a grammar compilation error.

The specific set of predefined grammars is to be defined later. However, a user agent's default speech recognizer MUST support a certain small set of predefined grammars. For non-default recognizers, support for predefined grammars is optional, and the supported set is defined by the service provider (and may include proprietary grammars, e.g. builtin:x-acme-parts-catalog).

TODO: perhaps the specific set of grammars should be a MUST for the default built-in user agent, for a certain small set of grammars, but MAY for 3rd-party services. Can't solve this now - but note as an issue to be solved later.

5.6 Recognition Examples

TODO: Add further examples covering EMMA documents, partial results, vendor extensions, and grammar/rule activation/deactivation.

Example of reco-once


C->S: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+

C->S: binary message: media packet (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

C->S: more binary media packets...

C->S: html-speech/1.0 LISTEN 8322
      Resource-Identifier: recognizer
      Confidence-Threshold: 0.9
      Active-Grammars: <builtin:dictation?context=message>
      Listen-Mode: reco-once
      Source-Time: 12753432234 (where in the input stream recognition should start)

S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS

C->S: more binary media packets...

C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...
 
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: binary message: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE (because mode = reco-once, the request completes when recognition completes)
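
For clarity, a sketch (TypeScript) of how an endpoint (the service here, or the UA in the synthesis case) might dispatch incoming WebSocket messages into signaling and media. The 1-octet message type followed by a 24-bit stream-ID matches the diagrams above; the numeric type codes and the byte order of the stream-ID are assumptions here, and section 3.3 remains normative.

// Sketch only. Type codes assumed: 0x01 start-of-stream, 0x02 media, 0x03 end-of-stream.
const START_OF_STREAM = 0x01;
const MEDIA_PACKET = 0x02;
const END_OF_STREAM = 0x03;

function handleWebSocketMessage(data: string | ArrayBuffer): void {
  if (typeof data === "string") {
    // Text frames carry html-speech/1.0 signaling (LISTEN, events, responses, ...).
    handleSignaling(data);
    return;
  }
  const view = new DataView(data);
  const messageType = view.getUint8(0);
  // 24-bit stream-ID in the three octets after the type (byte order assumed).
  const streamId =
    (view.getUint8(1) << 16) | (view.getUint8(2) << 8) | view.getUint8(3);
  switch (messageType) {
    case START_OF_STREAM:
      // Record the stream's codec and timestamp, and create a buffer for streamId.
      break;
    case MEDIA_PACKET:
      // Append the payload (octets after the 4-octet header) to streamId's buffer.
      break;
    case END_OF_STREAM:
      // Close streamId's buffer.
      break;
  }
}

function handleSignaling(message: string): void {
  // Parse the start line and headers of an html-speech/1.0 text message.
}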

Example of continuous reco with intermediate results


C->S: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+

C->S: binary message: media packet (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

C->S: more binary media packets...

C->S: html-speech/1.0 LISTEN 8322
      Resource-Identifier: recognizer
      Confidence-Threshold: 0.9
      Active-Grammars: <builtin:dictation?context=message>
      Listen-Mode: reco-continuous
      Partial: TRUE
      Source-Time: 12753432234 (where in the input stream recognition should start)

C->S: more binary media packets...

S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
      Source-Time: 12753439912 (when speech was detected)

C->S: more binary media packets...

S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

C->S: more binary media packets...
 
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: more binary media packets...

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)

C->S: more binary media packets...

S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS

S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS

S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)

S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: binary message: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE
      Recognizer-State:idle
      Completion-Cause: XXX (TBD)
      Completion-Reason: No Input Streams
      

6. Synthesis

In HTML speech applications, the synthesizer service does not participate directly in the user interface. Rather, it simply provides rendered audio upon request, similar to any media server, plus interim events such as marks. The UA buffers the rendered audio, and the application may choose to play it to the user at some point completely unrelated to the synthesizer service. It is the synthesizer's role to render the audio stream in a timely manner, at least rapidly enough to support real-time feedback. The synthesizer MAY also render and transmit the stream faster than required for real-time playback, or render multiple streams in parallel, in order to reduce latency in the application. This is in stark contrast to IVR, where the synthesizer essentially renders directly to the user's telephone and is an active part of the user interface.

The synthesizer MUST support [SSML] and plain text input. A synthesizer MAY also accept other input formats. In all cases, the client should use the Content-Type header to indicate the input format.


6.1 Synthesis Requests


synth-method = "SPEAK"
             | "STOP"
             | "DEFINE-LEXICON"

The set of synthesizer request methods is a subset of those defined in [MRCPv2].

SPEAK

The SPEAK method operates similarly to its [MRCPv2] namesake. The primary difference is that a SPEAK request results in a new audio stream being sent from the server to the client; the Stream-ID header in the synthesizer's response identifies that stream (see 6.3). A SPEAK request MUST include the Audio-Codec header, unless the codec has already been specified with SET-PARAMS. When rendering has completed, and the end-of-stream message has been sent, the synthesizer sends a SPEAK-COMPLETE event.
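
As a non-normative illustration, a minimal sketch (TypeScript) of issuing a SPEAK request. The header values follow the example in section 6.4; the SSML body and the choice of audio/flac are placeholders.

// Sketch only. Assumes "ws" is an open html-speech/1.0 WebSocket session and the
// client allocated "requestId". The synthesizer's IN-PROGRESS response will carry
// a Stream-ID header identifying the binary stream that holds the rendered audio.
function speak(ws: WebSocket, requestId: number, ssml: string): void {
  const message = [
    `html-speech/1.0 SPEAK ${requestId}`,
    `Resource-ID: synthesizer`,
    `Audio-Codec: audio/flac`,            // codec for the rendered stream
    `Content-Type: application/ssml+xml`, // or text/plain for plain text input
    ``,                                   // blank line separates headers from the body
    ssml,
  ].join("\r\n");
  ws.send(message);
}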

STOP

When the synthesizer receives a STOP request, it ceases rendering the requests specified in the Active-Request-Id-List header. If that header is missing, it ceases rendering all active SPEAK requests. For any SPEAK request that is stopped, the synthesizer sends an end-of-stream message and a SPEAK-COMPLETE event.

DEFINE-LEXICON

This is identical to its namesake in [MRCPv2].

6.2 Synthesis Events

Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the synthesizer resource.


synth-event  = "INTERIM-EVENT"  ; See Interim Events above
             | "SPEECH-MARKER"  ; An SSML mark has been rendered
             | "SPEAK-COMPLETE"
INTERIM-EVENT

See Interim Events above.

SPEECH-MARKER

Similar to its namesake in [MRCPv2], except that the Speech-Marker header contains a relative timestamp indicating the elapsed time from the start of the stream.

Implementations should send the SPEECH-MARKER event as close as possible to the corresponding media packet, so that clients can play the media and fire events in real time if needed.

SPEAK-COMPLETE

The same as its [MRCPv2] namesake.

6.3 Synthesis Headers

The synthesis headers used in html-speech/1.0 are mostly a subset of those in [MRCPv2], with some minor modifications and additions.


synth-header = ; headers borrowed from [MRCPv2]
               active-request-id-list
             | Completion-Cause
             | Completion-Reason
             | Voice-Gender
             | Voice-Age
             | Voice-Variant
             | Voice-Name
             | Prosody-parameter ; Actually a collection of prosody headers
             | Speech-Marker
             | Speech-Language
             | Failed-URI
             | Failed-URI-Cause
             | Load-Lexicon
             | Lexicon-Search-Order
               ; new headers for html-speech/1.0
             | Audio-Codec
             | Stream-ID


Audio-Codec  = "Audio-Codec:" mime-media-type ; See [RFC3555]
Stream-ID    = "Stream-ID:" 1*8DIGIT ; decimal representation of 24-bit stream-ID

Audio-Codec

Because an audio stream is created in response to a SPEAK request, the audio codec and parameters must be specified in the SPEAK request, or in SET-PARAMS, using the Audio-Codec header. If the synthesizer is unable to encode with this codec, it terminates the request with a 4xx COMPLETE status message.
Speech-Marker

Similar to its namesake in [MRCPv2], except that the clock is defined as the local time at the service. By using the timestamp from the beginning of the stream, and the timestamp of this event, the UA can calculate when to raise the event to the application based on where it is in the playback of the rendered stream.
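
For example, a UA could schedule delivery of a mark to the application with a calculation like the following sketch (TypeScript). The timestamps are the service-clock values carried in Speech-Marker headers; treating them as microseconds matches the 6.4 example but is an assumption, and only the difference between the two values matters.

// Sketch only.
// streamStartTimestamp: Speech-Marker value from the SPEAK response (start of the stream)
// markerTimestamp:      Speech-Marker value from the SPEECH-MARKER event
// playbackStartMs:      client time (ms) at which the UA started playing the rendered audio
function scheduleMarkEvent(
  streamStartTimestamp: number,
  markerTimestamp: number,
  playbackStartMs: number,
  raise: () => void,
): void {
  const offsetMs = (markerTimestamp - streamStartTimestamp) / 1000; // µs -> ms (assumed units)
  const delay = Math.max(0, playbackStartMs + offsetMs - Date.now());
  setTimeout(raise, delay);
}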

Stream-ID

Specifies the ID of the stream that contains the rendered audio, so that the UA can associate the audio streams it receives with particular SPEAK requests.

6.4 Synthesis Examples

TODO: insert more synthesis examples

TODO: synthesizing multiple prompts in parallel for playback in the UA when the app needs them


C->S: html-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Voice-Gender:neutral
        Voice-Age:25
        Audio-Codec:audio/flac
        Prosody-Volume:medium
        Content-Type:application/ssml+xml

        <?xml version="1.0"?>
        <speak version="1.0">
        ...

S->C: html-speech/1.0 3257 200 IN-PROGRESS
        Resource-ID:synthesizer
        Stream-ID: 112233
        Speech-Marker:timestamp=0

S->C: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              66              6C       | o / f l
        |      61              63       +-------------------------------+ a c
        +-------------------------------+                                


S->C: binary message: media packet (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

S->C: more binary media packets...

S->C: html-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
        Resource-ID:synthesizer
        Speech-Marker:timestamp=2059000;marker-1

S->C: more binary media packets...

S->C: more binary media packets...

S->C: binary message: end of stream (message type = 0x03)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
        Resource-ID:synthesizer
        Completion-Cause:000 normal
        Speech-Marker:timestamp=5011000

7. References

[EMMA]
EMMA: Extensible MultiModal Annotation markup language, http://www.w3.org/TR/emma/
[MIME-RTP]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[MRCPv2]
MRCP version 2 http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24
[REQUIREMENTS]
Protocol Requirements http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0030/protocol-reqs-commented.html
[HTTP1.1]
Hypertext Transfer Protocol -- HTTP/1.1 http://www.w3.org/Protocols/rfc2616/rfc2616.html
[RFC3555]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[RFC5646]
Tags for Identifying Languages http://tools.ietf.org/html/rfc5646
[SRGS]
Speech Recognition Grammar Specification Version 1.0 http://www.w3.org/TR/speech-grammar/
[SSML]
Speech Synthesis Markup Language (SSML) Version 1.0 http://www.w3.org/TR/speech-synthesis/
[WS-API]
Web Sockets API, http://www.w3.org/TR/websockets/
[WS-PROTOCOL]
Web Sockets Protocol http://tools.ietf.org/pdf/draft-ietf-hybi-thewebsocketprotocol-09.pdf