HTML Speech XG
Proposed Protocol Approach

Draft Version 6, Revision 1, October 18th, 2011

This version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Latest version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Previous versions:
Draft Version 6: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0033/speech-protocol-draft-06.htm
Draft Version 5: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/att-0012/speech-protocol-draft-05.htm
Draft Version 4: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/att-0004/speech-protocol-draft-04.html
Draft Version 3, revision 3: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jul/att-0025/speech-protocol-draft-03-r3.html
Draft Version 2: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0065/speech-protocol-basic-approach-02.html
Editor:
Robert Brown, Microsoft
Contributors:
Dan Burnett, Voxeo
Marc Schroeder, DFKI
Milan Young, Nuance
Michael Johnston, AT&T
Patrick Ehlen, AT&T
And other contributions from HTML Speech XG participants, http://www.w3.org/2005/Incubator/htmlspeech/

Status of this Document

This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.

Abstract

Multimodal interfaces enable users to interact with web applications using multiple different modalities. The HTML Speech protocol, and associated HTML Speech API, are designed to enable speech modalities as part of a common multimodal user experience combining spoken and graphical interaction across browsers. The specific goal of the HTML Speech protocol is to enable a web application to utilize the same network-based speech resources regardless of the browser used to render the application. The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2] where it makes sense to borrow from that prior art. However, since the use cases for HTML Speech applications are in many cases considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not a direct transcript of MRCP. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.

Contents

  1. Architecture
  2. Definitions
  3. Protocol Basics
    1. Session Establishment
    2. Control Messages
      1. Request Messages
      2. Status Messages
      3. Event Messages
    3. Media Transmission
    4. Security
    5. Time Stamps
  4. General Capabilities
    1. Generic Headers
    2. Capabilities Discovery
    3. Interim Events
    4. Resource Selection
  5. Recognition
    1. Recognition Requests
    2. Recognition Events
    3. Recognition Headers
    4. Predefined Grammars
    5. Recognition Examples
  6. Synthesis
    1. Synthesis Requests
    2. Synthesis Events
    3. Synthesis Headers
    4. Synthesis Examples
  7. References

1. Architecture


             Client
|-----------------------------|
|       HTML Application      |                                            Server
|-----------------------------|                                 |--------------------------|
|       HTML Speech API       |                                 | Synthesizer | Recognizer |
|-----------------------------|                                 |--------------------------|
|  Web-Speech Protocol Client |----web-speech/1.0 subprotocol---|     Web-Speech Server    |
|-----------------------------|                                 |--------------------------|
|      WebSockets Client      |-------WebSockets protocol-------|     WebSockets Server    |
|-----------------------------|                                 |--------------------------|

2. Definitions

Recognizer

Because continuous recognition plays an important role in HTML Speech scenarios, a Recognizer is a resource that essentially acts as a filter on its input streams. Its grammars/language models can be specified and changed, as needed by the application, and the recognizer adapts its processing accordingly. Single-shot recognition (e.g. a user on a web search page presses a button and utters a single web-search query) is a special case of this general pattern, where the application specifies its model once, and is only interested in one match event, after which it stops sending audio (if it hasn't already).

A Recognizer performs speech recognition, with the following characteristics:

  1. Support for one or more spoken languages and acoustic scenarios.
  2. Processing of one or more input streams. The typical scenario consists of a single stream of encoded audio. But some scenarios will involve multiple audio streams, such as multiple beams from an array microphone picking up different speakers in a room; or streams of multimodal input such as gesture or motion, in addition to speech.
  3. Support for multiple simultaneous grammars/language models, including but not limited to application/srgs+xml [SRGS]. Implementations MAY support additional formats, such as ABNF SRGS or an SLM format.
  4. Support for continuous recognition, generating events as appropriate such as match/no-match, detection of the start/end of speech, etc.
  5. Support for at least one "dictation" language model, enabling essentially unconstrained spoken input by the user.
  6. Support for "hotword" recognition, where the recognizer ignores speech that is out of grammar. This is particularly useful for open-mic scenarios.
  7. Support for recognition of media delivered slower than real-time, since network conditions can and will introduce delays in the delivery of media.

"Recognizers" are not strictly required to perform speech recognition, and may perform additional or alternative functions, such as speaker verification, emotion detection, or audio recording.

Synthesizer

A Synthesizer generates audio streams from textual input. It essentially produces a media stream with additional events, which the user agent buffers and plays back as required by the application. A Synthesizer service has the following characteristics:

  1. Rendering of application/ssml+xml [SSML] or text/plain input to an output audio stream. Implementations MAY support additional input text formats.
  2. Each synthesis request results in a separate output stream that is terminated once rendering is complete, or if it has been canceled by the client.
  3. Rendering must be performed and transmitted at least as rapidly as would be needed to support real-time playback, and preferably faster. In some cases, network conditions between service and UA may result in slower-than-real-time delivery of the stream, and the UA or application will need to cope with this appropriately.
  4. Generation of interim events, such as those corresponding to SSML marks, with precise timing. Events are transmitted by the service as closely as possible to the corresponding audio packet, to enable real-time playback by the client if required by the application.
  5. Multiple Synthesis requests may be issued concurrently rather than queued serially. This is because HTML pages will need to be able to simultaneously prepare multiple synthesis objects in anticipation of a variety of user-generated events. A server MAY service simultaneous requests in parallel or in series, as deemed appropriate by the implementer, but it MUST accept them in parallel (i.e. support having multiple outstanding requests).
  6. Because a Synthesizer resource only renders a stream, and is not responsible for playback of that stream to a user, it does NOT provide any form of shuttle control (pausing or skipping), since this is performed by the client; nor does it provide any control over volume, rate, pitch, etc, other than as specified in the SSML input document.

3. Protocol Basics

In the HTML Speech protocol, the control signals and the media itself are transported over the same WebSocket connection. Earlier implementations utilized a simple HTTP connection for speech recognition and synthesis; use cases involving continuous recognition motivated the move to WebSockets. This simple design avoids all the normal media problems of session negotiation, packet delivery, port & IP address assignments, NAT-traversal, etc, since the underlying WebSocket already satisfies these requirements. A beneficial side-effect of this design is that by limiting the protocol to WebSockets over HTTP there should be fewer problems with firewalls compared to having a separate RTP or other connection for the media transport. This design is different from MRCP, which is oriented around telephony/IVR and all its impediments, rather than HTML and web services, and is motivated by simplicity and the desire to keep the protocol within HTTP.

3.1 Session Establishment

The WebSockets session is established through the standard WebSockets HTTP handshake, with the client offering the web-speech/1.0 subprotocol in the Sec-WebSocket-Protocol header and the server confirming it in its response.

For example:


C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
      Host: examplespeechservice.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ== (for example)
      Sec-WebSocket-Version: 13
      Sec-WebSocket-Protocol: web-speech/1.0, x-proprietary-speech

S->C: HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
      Sec-WebSocket-Protocol: web-speech/1.0

Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, responses or media.

A session has at most one synthesizer resource and at most one recognizer resource. If an application requires multiple resources of the same type (for example, two synthesizers from different vendors), it MUST use separate WebSocket sessions.

There is no association of state between sessions. If a service wishes to provide a special association between separate sessions, it may do so behind the scenes as a service-specific extension (for example, re-using audio input from one session in another session without resending it, or causing service-side barge-in of TTS in one session when recognition occurs in another).

3.2 Control Messages

The signaling design borrows its basic pattern from [MRCPv2], where there are three classes of control messages:

Requests
C->S requests from the UA to the service. The client requests a method (SPEAK, LISTEN, STOP, etc) from a particular remote speech resource.
Status Notifications
S->C general status notification messages from the service to the UA, marked as either PENDING, IN-PROGRESS or COMPLETE.
Named Events
S->C named events from the service to the UA, that are essentially special cases of 'IN-PROGRESS' request-state notifications.

control-message =   start-line ; i.e. use the typical MIME message format
                  *(header CRLF)
                    CRLF
                   [body]
start-line      =   request-line | status-line | event-line
header          =  <Standard MIME header format> ; case-insensitive. Actual headers depend on the type of message
body            =  *OCTET                        ; depends on the type of message

The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.

For example:


C->S: web-speech/1.0 SPEAK 3257                    ; request synthesis of string
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Content-Type:text/plain

        Hello world! I speak therefore I am.

S->C: web-speech/1.0 3257 200 IN-PROGRESS         ; server confirms it will start synthesizing

S->C: media for 3257                               ; receive synthesized media

S->C: web-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE ; done!

3.2.1 Request Messages

Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets web-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have a non-negative integer value of 1-10 decimal digits.


request-line   = version SP method-name SP request-id CRLF
version        = "web-speech/" 1*DIGIT "." 1*DIGIT ; web-speech/1.0
method-name    = general-method | synth-method | reco-method | proprietary-method
request-id     = 1*10DIGIT

NOTE: In some other protocols, messages also include their message length, so that they can be framed in what is otherwise an open stream of data. In web-speech/1.0, framing is already provided by WebSockets, and message length is not needed, and therefore not included.

For example, to request the recognizer to interpret text as if it were spoken:


C->S: web-speech/1.0 INTERPRET 8322
      Resource-ID: recognizer
      Active-Grammars: <http://myserver/mygrammar.grxml>
      Interpret-Text: Send a dozen yellow roses and some expensive chocolates to my mother

3.2.2 Status Messages

Status messages are sent from the server to the client, to indicate the state of a request.


status-line   =  version SP request-id SP status-code SP request-state CRLF
status-code   =  3DIGIT       ; Specific codes TBD, but probably similar to those used in MRCP

; All communication from the server is labeled with a request state.
request-state = "COMPLETE"    ; Processing of the request has completed.
              | "IN-PROGRESS" ; The request is being fulfilled.
              | "PENDING"     ; Processing of the request has not begun.

Specific status code values follow a pattern similar to [MRCPv2]:

2xx Success Codes
4xx Client Failure Codes
5xx Server Failure
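
For example, a recognizer might acknowledge a LISTEN request as follows (the request-id, times, and grammar URI are purely illustrative):

C->S: web-speech/1.0 LISTEN 8320
      Resource-ID: recognizer
      Listen-Mode: reco-once
      Source-Time: 2011-09-06T21:47:25.015+01:30
      Active-Grammars: <http://myserver/mygrammar.grxml>

S->C: web-speech/1.0 8320 200 IN-PROGRESS          ; request accepted, recognizer is listening
      Resource-ID: recognizer
      Recognizer-State: listening
      Listen-Mode: reco-once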

3.2.3 Event Messages

Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results. They are essentially specialized status messages.


event-line    =  version SP event-name SP request-id SP request-state CRLF
event-name    =  synth-event | reco-event | proprietary-event

For example, an event indicating that the recognizer has detected the start of speech:

S->C: web-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:31.981+01:30 (when speech was detected)

3.3 Media Transmission

HTML Speech applications feature a wide variety of media transmission scenarios, and the number of media streams in use at any given time is not fixed.

Whereas a human listener will tolerate the clicks and pops of missing packets so they can continue listening in real time, recognizers do not require their data in real-time, and will generally prefer to wait for delayed packets in order to maintain accuracy.

Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, a living-room device with a microphone array may send separate streams capturing the speech of multiple individuals within the room. Or, for example, a device may send parallel streams with alternative encodings that may not be human-consumable but contain information that is of particular value to a recognition service.

In web-speech/1.0, audio (or other media) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session used for the control messages.


media-packet        =  binary-message-type
                       binary-stream-id
                       binary-data
binary-message-type =  OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-stream-id    = 3OCTET ; Unique identifier for the stream (0..2^24-1)
binary-data         = *OCTET 


         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        +---------------+-----------------------------------------------+
        |                              ...                              |
        |                             Data                              |
        |                              ...                              |
        +---------------------------------------------------------------+

The binary-stream-id field is used to identify the messages for a particular stream. It is a 24-bit unsigned integer. Its value for any given stream is assigned by the sender (client or server) in the initial message of the stream, and must be unique to the sender within the WebSockets session.

A sequence of media messages with the same stream-ID represents an in-order contiguous stream of data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the consumer (e.g. the recognizer's decoder, or the TTS playback buffer). Timing of coordinated events is calculated by decoded offset from the beginning of the stream.

The WebSockets stack de-multiplexes text and binary messages, thus separating signaling from media, while the stream-ID on each media message is used to de-multiplex the messages into separate media streams.

The binary-message-type field has these defined values:

0x01: Start of Stream
The message indicates the start of a new stream. This MUST be the first message in any stream. The stream-ID must be new and unique to the session. This same stream-ID is used in all future messages in the stream. The message body contains a 64-bit NTP timestamp [RFC 1305] containing the local time at the stream originator, which is used as the base time for calculating event timing. The timestamp is followed by the ASCII-encoded MIME media type (see [RFC 3555]) describing the format that the stream's media is encoded in (usually an audio encoding). All implementations MUST support 8kHz single-channel mulaw (audio/basic) and 8kHz single-channel 16-bit linear PCM (audio/L16;rate=8000). Implementations MAY support other content types, for example: to recognize from a video stream; to provide a stream of recognition preprocessing coefficients; to provide textual metadata streams; or to provide auxiliary multimodal input streams such as touches/strokes, gestures, clicks, compass bearings, etc.

message type = 0x01; stream-id = 112233; media-type = audio/amr-wb
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+
0x02: Media
The message is a media packet, and contains encoded media data.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+
The encoding format is specified in the start of stream message (0x01).

There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. However, implementations SHOULD seek to minimize interference with the flow of other messages on the same socket, by sending messages that encode between 20 and 80 milliseconds of media. Since a WebSockets frame header is typically only 4 bytes, overhead is minimal and implementations SHOULD err on the side of sending smaller packets more frequently.
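
As a non-normative illustration, the mandatory codecs yield the following approximate payload sizes per message (excluding the 4-byte binary header and WebSockets framing):

        audio/basic (8kHz, 8-bit mu-law):        20ms = 160 bytes,   80ms = 640 bytes
        audio/L16;rate=8000 (16-bit linear PCM): 20ms = 320 bytes,   80ms = 1280 bytes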

A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.

A recognition service MUST be prepared to receive slower-than-real-time audio due to practical throughput limitations of the network.

The design does not permit the transmission of binary media as base-64 text messages, since WebSockets already provides native support for binary messages. Base-64 encoding would incur an unnecessary 33% transmission overhead.

0x03: End-of-stream
The 0x03 end-of-stream message indicates the end of the media stream, and MUST be used to terminate a media stream. Any future media stream messages with the same stream-id are invalid.

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.

3.4 Security

Both the signaling and media transmission aspects of the web-speech/1.0 protocol inherit a number of security features from the underlying WebSockets protocol [WS-PROTOCOL]:

Server Authentication

Clients may authenticate servers using standard TLS, simply by using the WSS: uri scheme rather than the WS: scheme in the service URI. This is standard WebSockets functionality in much the same way as HTTP specifies TLS by using the HTTPS: scheme.

Encryption

Similarly, all traffic (media and signaling) is encrypted by TLS, when using the WSS: uri scheme.

In practice, to prevent man-in-the-middle snooping of a user's voice, user agents SHOULD NOT use the WS: scheme, and SHOULD ONLY use the WSS: scheme. In non-mainstream cases, such as service-to-service mashups, or specialized user agents for secured networks, the unencrypted WS: scheme MAY be used.

User Authentication

User authentication, when required by a server, will commonly be done using the standard [HTTP] challenge-response mechanism in the initial websocket bootstrap. A server may also choose to use TLS client authentication, and although this will probably be uncommon, WebSockets stacks should support it.

Application Authentication

Web services may wish to authenticate the application requesting service. There is no standardized way to do this. However, the Vendor-Specific-Parameters header can be used to perform proprietary authentication using a key-value-pair scheme defined by the service.

HTML speech network scenarios also have security boundaries outside of signaling and media:

Transitive Access to Resources

A client may require a server to access resources hosted at a third location. Such resources may include SRGS documents, SSML documents, audio files, etc. This may be the result either of the application referring to the resource by URI, or of an already-loaded resource containing a URI reference to a separate resource. In these cases the server will need permission to access these resources. There are three ways in which this may be accomplished:

  1. Out-of-band configuration. The developer, administrator, or a provisioning system may control both the speech server and the server containing the referenced resource. In this case they may configure appropriate accounts and network access permissions beforehand.
  2. Limited-use URIs. The server containing the resource may issue limited-use URIs, that may be valid for a small finite number of uses, or for a limited period, in order to minimise the exposure of the resource. The developer would obtain these URIs as needed, using a mechanism that is proprietary to the resource server.
  3. Cookies. The client may have a cookie containing a secret that is used to authorize access to the resource. In this case, the cookie may be passed to the speech server using cookie headers in a request. The speech server would then use this cookie when accessing resources required by that request.

Access to Retained Media

Through the use of certain headers during speech recognition, the client may request the server to retain a recording of the input media, and make this recording available at a URL for retrieval. The server that holds the recording MAY secure this recording by using standard HTTP security mechanisms: it MAY authenticate the client using standard HTTP challenge/response; it MAY use TLS to encrypt the recording when transmitting it back to the client; and it MAY use TLS to authenticate the client. The server that holds a recording MAY also discard the recording after a reasonable period, as determined by the server.

3.5 Time Stamps

Timestamps are used in a variety of headers in the protocol. Binary messages use the 64-bit NTP timestamp format, as defined in [RFC 1305]. Text messages use the encoding format defined in [RFC 3339] "Date and Time on the Internet: Timestamps", and reproduced here:



   date-time       = full-date "T" full-time
   ; For example: 2011-09-06T10:33:16.612Z
   ;          or: 2011-09-06T21:47:31.981+01:30

   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-fullyear   = 4DIGIT
   date-month      = 2DIGIT  ; 01-12
   date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                             ; month/year

   partial-time    = time-hour ":" time-minute ":" time-second
                     [time-secfrac]
   time-hour       = 2DIGIT  ; 00-23
   time-minute     = 2DIGIT  ; 00-59
   time-second     = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                             ; rules
   time-secfrac    = "." 1*DIGIT
   time-numoffset  = ("+" / "-") time-hour ":" time-minute
   time-offset     = "Z" / time-numoffset

4. General Capabilities

4.1 Generic Headers

These headers may be used in any control message. All header names are case-insensitive.


generic-header  =   accept
                | accept-charset
                | accept-charset
                | content-base
                | logging-tag
                | resource-id
                | vendor-specific
                | content-type
                | content-encoding

resource-id     = "Resource-ID:" ("recognizer" | "synthesizer" | vendor-resource)
vendor-resource = "x-" 1*UTFCHAR

accept           = <indicates the content-types the sender will accept>
accept-charset   = <indicates the character set the sender will accept>
content-base     = <the base for relative URIs>
content-type     = <the type of content contained in the message body>
content-encoding = <the encoding of message body content>
logging-tag      = <a tag to be inserted into server logs>
vendor-specific  = "Vendor-Specific-Parameters:" vendor-specific-av-pair 
                   *[";" vendor-specific-av-pair] CRLF
vendor-specific-av-pair = vendor-av-pair-name "=" vendor-av-pair-value


Resource-ID
The Resource-ID header is included in all signaling messages. In requests, it indicates the resource to which the request is directed. In status messages and events, it indicates the resource from which the message originated.
Accept
The Accept header MAY be included in any message to indicate the content types that will be accepted by the sender of the message from the receiver of the message. When absent, the following defaults should be assumed: clients will accept "application/emma+xml" from recognizers; recognizers will accept "application/srgs+xml"; synthesizers will accept "application/ssml+xml".
Accept-Charset
When absent, any charset may be used. This header has two general purposes: so the client can indicate the charset it will accept in recognition results; and so the synthesizer can indicate the charset it will accept for SSML documents.
Content-Base
When a message contains an entity that includes relative URIs, Content-Base provides the absolute URI against which they are based.
Logging-Tag
Specifies a tag to be inserted into server logs. It is generally only used in requests, or in response to GET-PARAMS.
Vendor-Specific-Parameters
A catch-all header for vendors to include their own name/value pairs.
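
For example, a hypothetical request exercising several generic headers (the vendor parameter names, logging tag, and grammar URI are illustrative only):

C->S: web-speech/1.0 LISTEN 7201
      Resource-ID: recognizer
      Listen-Mode: reco-once
      Source-Time: 2011-09-06T21:47:25.015+01:30
      Content-Base: http://myserver/app1/
      Logging-Tag: order-page-042
      Active-Grammars: <grammars/pizza.grxml>   ; resolved against Content-Base
      Vendor-Specific-Parameters: com.example.noise-model=car;com.example.priority=high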

4.2 Capabilities Discovery

The web-speech/1.0 protocol provides a way for the application/UA to determine whether a resource supports the basic capabilities it needs. In most cases applications will know a service's resource capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine resource capabilities, the UA sends a GET-PARAMS request to the resource, containing a set of capabilities, to which the resource responds with the specific subset it actually supports.

The GET-PARAMS request is much more limited in scope than its [MRCPv2] counterpart. It is only used to discover the capabilities of a resource. Like all messages, GET-PARAMS requests must always include the Resource-ID header.


general-method = "GET-PARAMS"

header         = capability-query-header
               | interim-event-header
               | reco-header
               | synth-header

capability-query-header =
                 "Supported-Content:" mime-type *("," mime-type)
               | "Supported-Languages:" lang-tag *("," lang-tag) ; See [RFC5646]
               | "Builtin-Grammars:" "<" URI ">" *("," "<" URI ">")

interim-event-header =
                 "Interim-Events:" event-name *("," event-name)
event-name = 1*UTFCHAR

Supported-Languages
This read-only property is used by the client to discover whether a resource supports a particular set of languages. Unlike most headers, when a blank value is used in GET-PARAMS, the resource will respond with a blank header rather than the full set of languages it supports. This avoids the resource having to respond with a potentially cumbersome and possibly ambiguous list of languages and dialects. Instead, the client must include the set of languages in which it is interested as the value of the Supported-Languages header in the GET-PARAMS request. The service will respond with the subset of these languages that it actually supports.
Supported-Content
This read-only property is used to discover whether a resource supports particular encoding formats for input or output data. Given the broad variety of codecs, and the large set of parameter permutations for each codec, it is impractical for a resource to advertise all media encodings it could possibly support. Hence, when a blank value is used in GET-PARAMS, the resource will respond with a blank value. Instead, the client must supply the set of data encodings it is interested in, and the resource responds with the subset it actually supports. This is used not only to discover supported media encoding formats, but also to discover other input and output data formats, such as alternatives to SRGS, EMMA and SSML.
Builtin-Grammars
This read-only property is used by the client to discover whether a recognizer has a particular set of built-in grammars. The client provides a list of builtin: URIs in the GET-PARAMS request, to which the recognizer responds with the subset of URIs it actually supports.
Interim-Events
This read/write property contains the set of interim events the client would like the service to send (see Interim Events below).

For example, discovering whether the recognizer supports the desired CODECs, grammar format, languages/dialects and built-in grammars:


C->S: web-speech/1.0 GET-PARAMS 34132
      resource-id: recognizer
      supported-content: audio/basic, audio/amr-wb, 
                         audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                         audio/dsr-es202212; rate:8000; maxptime:40,
                         application/x-ngram+xml
      supported-languages: en-AU, en-GB, en-US, en (A variety of English dialects are desired)
      builtin-grammars: <builtin:dictation?topic=websearch>, 
                        <builtin:dictation?topic=message>, 
                        <builtin:ordinals>, 
                        <builtin:datetime>, 
                        <builtin:cities?locale=USA>

S->C: web-speech/1.0 34132 200 COMPLETE
      resource-id: recognizer
      supported-content: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      supported-languages: en-GB, en (The recognizer supports UK English, but will work with any English)
      builtin-grammars: <builtin:dictation?topic=websearch>, <builtin:dictation?topic=message>

For example, discovering whether the synthesizer supports the desired CODECs, languages/dialects, and content markup format:


C->S: web-speech/1.0 GET-PARAMS 48223
      resource-id: synthesizer
      supported-content: audio/ogg, audio/flac, audio/basic, application/ssml+xml
      supported-languages: en-AU, en-GB

S->C: web-speech/1.0 48223 200 COMPLETE
      resource-id: synthesizer
      supported-content: audio/basic, application/ssml+xml
      supported-languages: en-GB

4.3 Interim Events

Speech services may wish to send optional vendor-specific interim events during the processing of a request. For example, some recognizers are capable of providing additional information as they process input audio, and some synthesizers are capable of firing progress events on word, phoneme, and viseme boundaries. These are exposed through the HTML Speech API as events that the webapp can listen for if it knows to do so. A service vendor MAY require a vendor-specific value to be set in a LISTEN or SPEAK request before it starts to fire certain events.


interim-event =   version SP "INTERIM-EVENT" SP request-id SP request-state CRLF
                *(header CRLF)
                  CRLF
                 [body]
event-name-header = "Event-Name:" event-name

The Event-Name header is required and must contain a value that was previously subscribed to with the Interim-Events header.

The Request-ID and Content-Type headers are required, and any data conveyed by the event must be contained in the body.

For example, a synthesis service might choose to communicate visemes through interim events:


C->S: web-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Content-Type:text/plain

        Hello world! I speak therefore I am.

S->C: web-speech/1.0 3257 200 IN-PROGRESS

S->C: media for 3257

S->C: web-speech/1.0 INTERIM-EVENT 3257 IN-PROGRESS
        Resource-ID:synthesizer
        Event-Name:x-viseme-event
        Content-Type:application/x-viseme-list
        
        "Hello"
        0.500 H
        0.850 A
        1.050 L
        1.125 OW
        1.800 SILENCE

S->C: more media for 3257

S->C: web-speech/1.0 INTERIM-EVENT 3257 IN-PROGRESS
        Resource-ID:synthesizer
        Event-Name:x-viseme-event
        Content-Type:application/x-viseme-list
        
        "World"
        2.200 W 
        2.350 ER
        2.650 L
        2.800 D
        3.100 SILENCE

S->C: etc

4.4 Resource Selection

Applications will generally want to select resources with certain capabilities, such as the ability to recognize certain languages, work well in specific acoustic conditions, work well with specific genders or ages, speak particular languages, speak with a particular style, age or gender, etc.

There are three ways in which resource selection can be achieved, each of which has relevance:

By URI

This is the preferred mechanism.

Any service may enable applications to encode resource requirements as query string parameters in the URI, or by using specific URIs with known resources. The specific URI format and parameter scheme is by necessity not standardized and is defined by the implementer based on their architecture and service offerings.

For example, a German recognizer with a cell-phone acoustic environment model:

ws://example1.net:2233/webreco/de-de/cell-phone

A UK English recognizer for a two-beam living-room array microphone:

ws://example2.net/?reco-lang=en-UK&reco-acoustic=10-foot-open-room&sample-rate=16kHz&channels=2

Spanish recognizer and synthesizer, where the synthesizer uses a female voice provided by AcmeSynth:

ws://example3.net/speech?reco-lang=es-es&tts-lang=es-es&tts-gender=female&tts-vendor=AcmeSynth

A pre-defined profile specified by an ID string:

ws://example4.com/profile=af3e-239e-9a01-66c0
By Request Header

Request headers may also be used to select specific resource capabilities. Synthesizer parameters are set through the SPEAK request, whereas recognition parameters are set through the LISTEN request. There is a small set of standard headers that can be used with each resource: the Speech-Language header may be used with both the recognizer and synthesizer, and the synthesizer may also accept a variety of voice selection parameters as headers. A resource MAY honor these headers, but does not need to where it does not have the ability to do so. If a particular header value is unsupported, the request should fail with a status of 409 "Unsupported Header Field Value".

For example, a client requires Canadian French recognition but it isn't available:


C->S: web-speech/1.0 LISTEN 8322
      Resource-ID: Recognizer
      Speech-Language: fr-CA

S->C: web-speech/1.0 8322 409 COMPLETE ; 409, since fr-CA isn't supported.
      resource-id: Recognizer

A client requires Brazilian Portuguese, and is successful:


C->S: web-speech/1.0 LISTEN 8323
      Resource-ID: Recognizer
      Speech-Language: pt-BR

S->C: web-speech/1.0 8323 200 COMPLETE
      resource-id: Recognizer

Speak with the voice of a Korean woman in her mid-thirties:


C->S: web-speech/1.0 SPEAK 8324
      Resource-ID: Synthesizer
      Speech-Language: ko-KR
      Voice-Age: 35
      Voice-Gender: female

Speak with a Swedish voice named "Kiana":


C->S: web-speech/1.0 SPEAK 8325
      Resource-ID: Synthesizer
      Speech-Language: sv-SE
      Voice-Name: Kiana

This approach is very versatile. However some implementations will be incapable of this kind of versatility in practice.

By Input Document

The [SRGS] and [SSML] input documents for the recognizer and synthesizer will specify the language for the overall document, and MAY specify languages for specific subsections of the document. The resource consuming these documents SHOULD honor these language assignments when they occur. If a resource is unable to do so, it should fail with a 481 status "Unsupported content language". (It should be noted that at the time of writing, most currently available recognizer and synthesizer implementations will be unable to support this capability.)

Generally speaking, given the current typical state of speech technology, unless a service is unusually adaptable, applications will be most successful using specific proprietary URLs that encode the abilities they need, so that the appropriate resources can be allocated during session initiation.

5. Recognition

A recognizer resource is either in the "listening" state, or the "idle" state. Because continuous recognition scenarios often don't have dialog turns or other down-time, all functions are performed in series on the same input stream(s). The key distinction between the idle and listening states is the obvious one: when listening, the recognizer processes incoming media and produces results; whereas when idle, the recognizer SHOULD buffer audio but will not process it.

Recognition is accomplished with a set of messages and events, to a certain extent inspired by those in [MRCPv2].


Idle State                 Listening State
    |                            |
    |--\                         |
    |  DEFINE-GRAMMAR            |
    |<-/                         |
    |                            |
    |--\                         |
    |  INFO                      |
    |<-/                         |
    |                            |
    |---------LISTEN------------>|
    |                            |
    |                            |--\
    |                            |  INTERIM-EVENT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  START-INPUT-TIMERS
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  END-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  INFO
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  INTERMEDIATE-RESULT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |  RECOGNITION-COMPLETE
    |                            | (when mode = recognize-continuous)
    |                            |<-/
    |                            |
    |<---RECOGNITION-COMPLETE----|
    |(when mode = recognize-once)|
    |                            |
    |                            |
    |<--no media streams remain--|
    |                            |
    |                            |
    |<----------STOP-------------|
    |                            |
    |                            |
    |<---some 4xx/5xx errors-----|
    |                            |
    |--\                         |--\
    |  INTERPRET                 |  INTERPRET
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |  INTERPRETATION-COMPLETE   |  INTERPRETATION-COMPLETE
    |<-/                         |<-/
    |                            |

5.1 Recognition Requests


reco-method  = "LISTEN"             ; Transitions Idle -> Listening
             | "START-INPUT-TIMERS" ; Starts the timer for the various input timeout conditions
             | "STOP"               ; Transitions Listening -> Idle
             | "DEFINE-GRAMMAR"     ; Pre-loads & compiles a grammar, assigns a temporary URI for reference in other methods
             | "CLEAR-GRAMMARS"     ; Unloads all grammars, whether active or inactive
             | "INTERPRET"          ; Interprets input text as though it was spoken
             | "INFO"               ; Sends metadata to the recognizer
LISTEN

The LISTEN method transitions the recognizer from the idle state to the listening state. The recognizer then processes the media input streams against the set of active grammars. The request MUST include the Source-Time header, which is used by the Recognizer to determine the point in the input stream(s) that the recognizer should start processing from (which won't necessarily be the start of the stream). The request MUST also include the Listen-Mode header to indicate whether the recognizer should perform continuous recognition, a single recognition, or vendor-specific processing.

A LISTEN request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.

When there are no input media streams, and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the LISTEN request will fail (480). When in the listening state, and all input streams have ended, the recognizer automatically transitions to the idle state, and issues a RECOGNITION-COMPLETE event, with Completion-Cause set to 080 ("no-input-stream").

A LISTEN request that is made while the recognizer is already listening results in a 402 error ("Method not valid in this state", since it is already listening).
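
For example, a sketch of a LISTEN request that starts continuous recognition against two grammars (request-id, times, URIs and weights are illustrative only):

C->S: web-speech/1.0 LISTEN 8400
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:25.015+01:30
      Listen-Mode: reco-continuous
      Active-Grammars: <http://myserver/commands.grxml#main 0.8>, <builtin:dictation?topic=message>

S->C: web-speech/1.0 8400 200 IN-PROGRESS
      Resource-ID: recognizer
      Recognizer-State: listening
      Listen-Mode: reco-continuous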

START-INPUT-TIMERS

This is used to indicate when the input timeout clock should start. For example, when the application wants to enable voice barge-in during a prompt, but doesn't want to start the time-out clock until after the prompt has completed, it will delay sending this request until it's finished playing the prompt.
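
For example, a client might send this request as soon as its prompt finishes playing (values illustrative):

C->S: web-speech/1.0 START-INPUT-TIMERS 8401
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:29.660+01:30

S->C: web-speech/1.0 8401 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: listening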

STOP

The STOP method transitions the recognizer from the listening state to the idle state. No RECOGNITION-COMPLETE event is sent. The Source-Time header MUST be used, since the recognizer may still fire a RECOGNITION-COMPLETE event for any completion state it encounters prior to that time in the input stream.

A STOP request that is sent while the recognizer is idle results in a 402 response (method not valid in this state, since there is nothing to stop).
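
For example, stopping an active LISTEN (values illustrative):

C->S: web-speech/1.0 STOP 8402
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:40.112+01:30

S->C: web-speech/1.0 8402 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle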

DEFINE-GRAMMAR

The DEFINE-GRAMMAR method does not activate a grammar. It simply causes the recognizer to pre-load and compile it, and associates it with a temporary URI that can then be used to activate or deactivate the grammar or one of its rules. DEFINE-GRAMMAR is not required in order to use a grammar, since the recognizer can load grammars on demand as needed. However, it is useful when an application wants to ensure a large grammar is pre-loaded and ready for use prior to the recognizer entering the listening state. DEFINE-GRAMMAR can be used when the recognizer is in either the listening or idle state.

All recognizer services MUST support grammars in the SRGS XML format, and MAY support additional alternative grammar/language-model formats.

The client SHOULD remember the temporary URIs, but if it loses track, it can always re-issue the DEFINE-GRAMMAR request, which MUST NOT result in a service error as long as the mapping is consistent with the original request. Once in place, the URI MUST be honored by the service for the duration of the session. If the service runs low on resources, it is free to unload the URI's payload, but it must always continue to honor the URI, even if that means reloading the grammar (performance notwithstanding).

Refer to [MRCPv2] for more details on this method.
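
As an illustrative sketch, the exchange below assumes the [MRCPv2] convention of naming the grammar with a Content-ID header and later referencing it via the session: URI scheme; since this document defers the details to [MRCPv2], treat that convention (and all values shown) as an assumption:

C->S: web-speech/1.0 DEFINE-GRAMMAR 8403
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:10.000+01:30
      Content-ID: <toppings@example.app>
      Content-Type: application/srgs+xml

      <?xml version="1.0" encoding="UTF-8"?>
      <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
               version="1.0" root="topping">
        <rule id="topping">
          <one-of>
            <item>mushrooms</item>
            <item>olives</item>
            <item>peppers</item>
          </one-of>
        </rule>
      </grammar>

S->C: web-speech/1.0 8403 200 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle

C->S: web-speech/1.0 LISTEN 8404
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:25.015+01:30
      Listen-Mode: reco-once
      Active-Grammars: <session:toppings@example.app>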

CLEAR-GRAMMARS

In continuous recognition, a variety of grammars may be loaded over time, potentially resulting in unused grammars consuming memory resources in the recognizer. The CLEAR-GRAMMARS method unloads all grammars, whether active or inactive. Any URIs previously defined with DEFINE-GRAMMAR become invalid.

INTERPRET

The INTERPRET method processes the input text according to the set of grammar rules that are active at the time it is received by the recognizer. It MUST include the Interpret-Text header. The use of INTERPRET is orthogonal to any audio processing the recognizer may be doing, and will not affect any audio processing. The recognizer can be in either the listening or idle state.

An INTERPRET request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers, but only if the recognizer is in the idle state. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.

INFO

In multimodal applications, some recognizers will benefit from additional context. Clients can use the INFO request to send this context. The Content-Type header should specify the type of data, and the data itself is contained in the message body.
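
For example, a client might pass display context to the recognizer ahead of recognition (the content type and body shown here are hypothetical; a recognizer is free to ignore them):

C->S: web-speech/1.0 INFO 8405
      Resource-ID: recognizer
      Source-Time: 2011-09-06T21:47:12.430+01:30
      Content-Type: text/plain

      The user is viewing the checkout page of the shopping application.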

5.2 Recognition Events

Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.


reco-event   = "START-OF-SPEECH"      ; Start of speech has been detected
             | "END-OF-SPEECH"        ; End of speech has been detected
             | "INTERIM-EVENT"        ; See Interim Events above
             | "INTERMEDIATE-RESULT"  ; A partial hypothesis
             | "RECOGNITION-COMPLETE" ; Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
             | "INTERPRETATION-COMPLETE"
END-OF-SPEECH

END-OF-SPEECH is the logical counterpart to START-OF-SPEECH, and indicates that speech has ended. The event MUST include the Source-Time header, which corresponds to the point in the input stream where the recognizer estimates speech to have ended, NOT when the endpointer finally decided that speech ended (which will be a number of milliseconds later).

INTERIM-EVENT

See Interim Events above. For example, a recognition service may send interim events to indicate it's begun to recognize a phrase, or to indicate that noise or cross-talk on the input channel is degrading accuracy.

INTERMEDIATE-RESULT

Continuous speech (aka dictation) often requires feedback about what has been recognized thus far. Waiting for a RECOGNITION-COMPLETE event would preclude this sort of user interface. INTERMEDIATE-RESULT provides this intermediate feedback. As with RECOGNITION-COMPLETE, contents are assumed to be EMMA unless an alternate Content-Type is provided.

INTERPRETATION-COMPLETE

This event contains the result of an INTERPRET request.

RECOGNITION-COMPLETE

This event is similar to the [MRCPv2] event with the same name, except that application/emma+xml (EMMA) is the default Content-Type. The Source-Time header must be included, to indicate the point in the input stream when the event occurred. When the Listen-Mode is reco-once, the recognizer transitions from the listening state to the idle state when this message is fired, and the Recognizer-State header in the event is set to "idle".

Where applicable, the body of the message SHOULD contain an EMMA document that is consistent with the Completion-Cause.
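
For illustration, a RECOGNITION-COMPLETE event for a reco-once LISTEN might look as follows (the EMMA body is abbreviated and purely illustrative, and the Completion-Cause value is assumed to follow the [MRCPv2] pattern):

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8406 COMPLETE
      Resource-ID: recognizer
      Recognizer-State: idle
      Completion-Cause: 000 success
      Source-Time: 2011-09-06T21:47:33.107+01:30
      Content-Type: application/emma+xml

      <emma:emma version="1.0"
                 xmlns:emma="http://www.w3.org/2003/04/emma">
        <emma:interpretation id="int1" emma:confidence="0.92"
                             emma:tokens="send a dozen yellow roses">
          <action>send</action>
          <item>yellow roses</item>
          <quantity>12</quantity>
        </emma:interpretation>
      </emma:emma>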

START-OF-SPEECH

Indicates that start of speech has been detected. The Source-Time header MUST correspond to the point in the input stream(s) where speech was estimated to begin, NOT when the endpointer finally decided that speech began (a number of milliseconds later).

5.3 Recognition Headers

The list of valid headers for the recognizer resource include a subset of the [MRCPv2] Recognizer Header Fields, where they make sense for HTML Speech requirements, as well as a handful of headers that are required for HTML Speech.


reco-header =  ; Headers borrowed from MRCP
               Confidence-Threshold
             | Sensitivity-Level
             | Speed-Vs-Accuracy
             | N-Best-List-Length
             | No-Input-Timeout
             | Recognition-Timeout
             | Media-Type
             | Input-Waveform-URI
             | Completion-Cause
             | Completion-Reason
             | Recognizer-Context-Block
             | Start-Input-Timers
             | Speech-Complete-Timeout
             | Speech-Incomplete-Timeout
             | Failed-URI
             | Failed-URI-Cause
             | Save-Waveform
             | Speech-Language
             | Hotword-Min-Duration
             | Hotword-Max-Duration
             | Interpret-Text
             | Vendor-Specific       ; see Generic Headers

             ; Headers added for web-speech/1.0
             | audio-codec           ; The audio codec used in an input media stream
             | active-grammars       ; Specifies a grammar or specific rule to activate.
             | inactive-grammars     ; Specifies a grammar or specific rule to deactivate.
             | hotword               ; Whether to listen in "hotword" mode (i.e. ignore out-of-grammar speech)
             | listen-mode           ; Whether to do continuous or one-shot recognition
             | partial               ; Whether to send partial results
             | partial-interval      ; Suggested interval between partial results, in milliseconds.
             | recognizer-state      ; Indicates whether the recognizer is listening or idle
             | source-time           ; The UA local time at which the request was initiated
             | user-id               ; Unique identifier for the user, so that adaptation can be used to improve accuracy.
             | Wave-Start-Time       ; The start point of a recognition in the audio referred to by Waveform-URIs.
             | Wave-End-Time         ; The end point of a recognition in the audio referred to by Waveform-URIs.
             | Waveform-URIs         ; List of URIs to recorded input streams

hotword            = "Hotword:" BOOLEAN
listen-mode        = "Listen-Mode:" ("reco-once" | "reco-continuous" | vendor-listen-mode)
vendor-listen-mode = "x-" 1*UTFCHAR
recognizer-state   = "Recognizer-State:" ("listening" | "idle")
source-time        = "Source-Time:" 1*20DIGIT
audio-codec        = "Audio-Codec:" mime-media-type ; see [RFC3555]
partial            = "Partial:" BOOLEAN
partial-interval   = "Partial-Interval:" 1*5DIGIT
active-grammars    = "Active-Grammars:" "<" URI ["#" rule-name] [SP weight] ">" *("," "<" URI ["#" rule-name] [SP weight] ">")
rule-name          = 1*UTFCHAR
weight             = 1*3DIGIT ["." 1*3DIGIT]
inactive-grammars  = "Inactive-Grammars:" "<" URI ["#" rule-name] ">" *("," "<" URI ["#" rule-name] ">")
user-id            = "User-ID:" 1*UTFCHAR
wave-start-time    = "Wave-Start-Time:" 1*DIGIT ["." 1*DIGIT]
wave-end-time      = "Wave-End-Time:"  1*DIGIT ["." 1*DIGIT]
waveform-URIs      = "Waveform-URIs:" "<" URI ">" *("," "<" URI ">")

TODO: discuss how recognition from file would work.

Headers with the same names as their [MRCPv2] counterparts are considered to have the same specification. Other headers are described as follows:

Audio-Codec

The Audio-Codec header is used in the START-MEDIA-STREAM request, to specify the codec and parameters used to encode the input stream, using the MIME media type encoding scheme specified in [RFC3555].

Active-Grammars

The Active-Grammars header specifies a list of grammars, and optionally specific rules within those grammars. The header is used in LISTEN to activate grammars/rules. If no rule is specified for a grammar, the root rule is activated. This header may also specify the relative weight of the rule. If weight is not specified, the default weight is "1".

Inactive-Grammars

The Inactive-Grammars header specifies a list of grammars, and optionally specific rules within those grammars, to be deactivated. If no rule is specified, all rules in the grammar are deactivated, including the root rule. The Inactive-Grammars header MAY be used in the LISTEN method.

Hotword

The Hotword header is analogous to the [MRCPv2] Recognition-Mode header, however it has a different name and boolean type in web-speech/1.0 in order to avoid confusion with the Listen-Mode header. When true, the recognizer functions in "hotword" mode, which essentially means that out-of-grammar speech is ignored.

Listen-Mode

Listen-Mode is used in the LISTEN request to specify whether the recognizer should listen continuously, or return to the idle state after the first RECOGNITION-COMPLETE event. It MUST NOT be used in any request other than LISTEN. When the recognizer is in the listening state, it should include Listen-Mode in all event and status messages it sends.

Partial

This header is required to support the continuous speech scenario on the recognizer resource. When sent by the client in a LISTEN request, this header controls whether or not the client is interested in partial results from the service. In this context, the term 'partial' describes mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes such as an N-best list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on both regular command-and-control recognition requests and dictation sessions, because at the API level there is no syntactic difference between the recognition types: both are simply recognition requests over an SRGS grammar or set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, for example: open-microphone applications, dictation enrollment applications, and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header in INTERMEDIATE-RESULT and RECOGNITION-COMPLETE messages, and in response to GET-PARAMS.

Partial-Interval

A suggestion from the client to the service on the frequency at which partial results should be sent. It is an integer value that represents the desired interval in milliseconds. The recognizer does not need to honor the requested interval precisely, but SHOULD provide something close, if it is within the operating parameters of the implementation.
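As a sketch, a continuous dictation LISTEN that asks for partial results roughly every 300 milliseconds might look as follows (the request-id and interval value are illustrative):

C->S: web-speech/1.0 LISTEN 8401
      Resource-Identifier: recognizer
      Active-Grammars: <builtin:dictation?context=message>
      Listen-Mode: reco-continuous
      Partial: TRUE
      Partial-Interval: 300
      Source-Time: 2011-09-06T21:47:31.981+01:30

The recognizer would then indicate Partial: TRUE on mid-utterance INTERMEDIATE-RESULT events it sends, while messages carrying complete results would indicate Partial: FALSE.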

Recognizer-State

Indicates whether the recognizer is listening or idle. This MUST NOT be included by the client in any requests, and MUST be included by the recognizer in all status and event messages it sends.

Source-Time

Indicates the timestamp of a message using the client's local time. All requests sent from the client to the recognizer MUST include the Source-Time header, which must faithfully specify the client's local system time at the moment it sends the request. This enables the recognizer to correctly synchronize requests with the precise point in the input stream at which they were actually sent by the client. All event messages sent by the recognizer MUST include the Source-Time, calculated by the recognizer service based on the point in the input stream at which the event occurred, and expressed in the client's local clock time (since the recognizer knows what this was at the start of the input stream). By expressing all times in client-time, the user agent or application is able to correctly sequence events, and implement timing-sensitive scenarios that involve other objects outside the knowledge of the recognizer service (for example, media playback objects or videogame states).
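For example (the timestamps here are illustrative), if the client sends a LISTEN with Source-Time: 2011-09-06T21:47:31.981+01:30 and the recognizer detects speech 860 milliseconds later in the input stream, the resulting START-OF-SPEECH event would carry Source-Time: 2011-09-06T21:47:32.841+01:30, i.e. the detection point expressed on the client's clock, which the client can compare directly against its own media playback or application-state timeline.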

User-ID

Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, since vocabularies are so large. The User-ID header allows the recognizer to establish the user's identity if the web application decides to supply this information.

Wave-Start-Time, Wave-End-Time, Input-Waveform-URI, and Waveform-URIs

Some applications will wish to re-recognize an utterance using different grammars. For example, an application may accept a broad range of input, and use the first round of recognition simply to classify an utterance so that it can use a more focused grammar on the second round. Others will wish to record an utterance for future use. For example, an application that transcribes an utterance to text may store a recording so that untranscribed information (tone, emotion, etc.) is not lost. While these are not mainstream scenarios, they are both valid and inevitable, and may be achieved using the headers provided for recognition.

If the Save-Waveform header is set to true (with LISTEN), then the recognizer will save the input audio. Subsequent RECOGNITION-COMPLETE events sent by the recognizer will contain a URI in the Waveform-URIs header which refers to the stored audio (multiple URIs when multiple input streams are present). In the case of continuous recognition, the Waveform-URIs header refers to all of the audio captured so far. The application may fetch the audio from this URI, assuming it has appropriate credentials (the credential policy is determined by the service provider). The application may also use the URI as input to future LISTEN requests by passing the URI in the Input-Waveform-URI header.

When RECOGNITION-COMPLETE returns a Waveform-URIs header, it also returns the time interval within the recorded waveform that the recognition result applies to, in the Wave-Start-Time and Wave-End-Time headers, which indicate the offsets in seconds from the start of the waveform. A client MAY also use the Source-Time header of other events such as START-OF-SPEECH and END-OF-SPEECH to calculate other intervals of interest. When using the Input-Waveform-URI header, the client may suffix the URI with an "interval" parameter to indicate that the recognizer should only decode that particular interval of the audio (see the re-recognition sketch at the end of this section):


interval = "interval=" start "," end
start    = seconds | "start"
end      = seconds | "end"
seconds  = 1*DIGIT ["." 1*DIGIT]

For example:


http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86
http://example.com/temp44235.wav?interval=0.65,end

When the Input-Waveform-URI header is used, all other input streams are ignored.
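As a sketch of the re-recognition scenario described above (the request-id and grammar URI are hypothetical, the retained-audio URI is one previously returned in a Waveform-URIs header, and the angle-bracket value syntax is assumed to mirror the other URI-valued headers):

C->S: web-speech/1.0 LISTEN 8402
      Resource-Identifier: recognizer
      Active-Grammars: <http://example.com/flightquery.grxml>
      Listen-Mode: reco-once
      Input-Waveform-URI: <http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86>
      Source-Time: 2011-09-06T21:52:04.118+01:30

Because Input-Waveform-URI is present, the recognizer decodes only the 0.3-2.86 second interval of the stored audio and ignores any live input streams.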

5.4 Predefined Grammars

Speech services MAY support predefined grammars that can be referenced through a 'builtin:' URI. For example:

builtin:dictation?context=email&lang=en_US
builtin:date
builtin:search?context=web

These can be used as top-level grammars in the Active-Grammars and Inactive-Grammars headers, or in rule references within other grammars, as in the sketch below. If a speech service does not support the referenced builtin, or does not support it in combination with the other active grammars, it should return a grammar compilation error.
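For instance, an [SRGS] grammar might reference a predefined grammar through an ordinary ruleref (a sketch; whether a particular service resolves builtin: URIs inside grammars is service-dependent):

<rule id="when">
   on <ruleref uri="builtin:date"/>
</rule>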

The specific set of predefined grammars is to be defined later. However, a certain small set of predefined grammars MUST be supported by a user agent's default speech recognizer. For non-default recognizers, support for predefined grammars is optional, and the set that is supported is also defined by the service provider and may include proprietary grammars (e.g. builtin:x-acme-parts-catalog).

5.5 Recognition Examples

Example of reco-once


Start streaming audio:

C->S: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+

C->S: binary message: media packet (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

C->S: more binary media packets...

Send the LISTEN request:

C->S: web-speech/1.0 LISTEN 8322
      Resource-Identifier: recognizer
      Confidence-Threshold:0.9
      Active-Grammars: <builtin:dictation?context=message>
      Listen-Mode: reco-once
      Source-Time: 2011-09-06T21:47:31.981+01:30 (where in the input stream recognition should start)

S->C: web-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS

C->S: more binary media packets...

C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...
 
S->C: web-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE (because mode = reco-once, the request completes when recognition completes)
      Resource-Identifier: recognizer
      
      <emma:emma version="1.0"
      ...etc

Example of continuous reco with intermediate results


C->S: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              61              6D       | o / a m
        |      72              2D              77              62       | r - w b
        +---------------------------------------------------------------+

C->S: binary message: media packet (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

C->S: more binary media packets...

C->S: web-speech/1.0 LISTEN 8322
      Resource-Identifier: recognizer
      Confidence-Threshold:0.9
      Active-Grammars: <builtin:dictation?context=message>
      Listen-Mode: reco-continuous
      Partial: TRUE
      Source-Time: 2011-09-06T21:47:31.981+01:30 (where in the input stream recognition should start)

C->S: more binary media packets...

S->C: web-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
      Source-Time: 2011-09-06T21:47:32.614+01:30 (when speech was detected)

C->S: more binary media packets...

S->C: web-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

C->S: more binary media packets...
 
S->C: web-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: more binary media packets...

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)

C->S: more binary media packets...

S->C: web-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS

S->C: web-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS

S->C: web-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)

S->C: web-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)

C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: web-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE
      Recognizer-State:idle
      Completion-Cause: 080
      Completion-Reason: No Input Streams

1-Best EMMA Document

Example showing a 1-best result with XML semantics contained within the emma:interpretation element. The 'utterance' is the value of emma:tokens and the 'confidence' is the value of emma:confidence.


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
    <!-- grammar-type is from EMMA 1.1 -->
    <emma:grammar id="gram1" 
    grammar-type="application/srgs-xml" 
    ref="http://acme.com/flightquery.grxml"/>
  <emma:interpretation id="int1" 
	emma:start="1087995961542" 
	emma:end="1087995963542"
	emma:medium="acoustic" 
	emma:mode="voice"
	emma:confidence="0.75"
   emma:lang="en-US"
   emma:grammar-ref="gram1"
   emma:media-type="audio/x-wav; rate:8000;"
   emma:signal="http://example.com/signals/145.wav"
	emma:tokens="flights from boston to denver"
   emma:process="http://example.com/my_asr.xml">
      <origin>Boston</origin>
      <destination>Denver</destination>

  </emma:interpretation>
</emma:emma>

N-Best EMMA Document

Example showing multiple recognition results and their associated interpretations.


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
    <emma:grammar id="gram1" 
    grammar-type="application/srgs-xml" 
    ref="http://acme.com/flightquery.grxml"/>
    <emma:grammar id="gram2" 
    grammar-type="application/srgs-xml" 
    ref="http://acme.com/pizzaorder.grxml"/>
  <emma:one-of id="r1" 
	emma:start="1087995961542"
	emma:end="1087995963542"
	emma:medium="acoustic" 
	emma:mode="voice"
   emma:lang="en-US"
   emma:media-type="audio/x-wav; rate:8000;"
   emma:signal="http://example.com/signals/789.wav"
   emma:process="http://example.com/my_asr.xml">
    <emma:interpretation id="int1" 
    	emma:confidence="0.75"
    	emma:tokens="flights from boston to denver"
       emma:grammar-ref="gram1">
      		<origin>Boston</origin>
      		<destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" 
    	emma:confidence="0.68"
    	emma:tokens="flights from austin to denver"
		emma:grammar-ref="gram1">
      		<origin>Austin</origin>
      		<destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

No-match EMMA Document

In the case of a no-match, the EMMA result returned MUST be annotated with emma:uninterpreted="true".


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
    http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1" 
  	emma:uninterpreted="true"
    emma:medium="acoustic" 
    emma:mode="voice"
    emma:process="http://example.com/my_asr.xml"/>
</emma:emma>

No-input EMMA Document

In the case of no input, the EMMA interpretation returned MUST be annotated with emma:no-input="true" and the <emma:interpretation> element MUST be empty.


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="int1" 
	emma:no-input="true"
	emma:medium="acoustic"
	emma:mode="voice"
   emma:process="http://example.com/my_asr.xml"/>
</emma:emma>

Multimodal EMMA Document

Example showing a multimodal interpretation resulting from the combination of speech input with a touch event passed in through a control metadata message.


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      emma:medium="acoustic tactile" 
      emma:mode="voice touch"
      emma:lang="en-US"
      emma:start="1087995963542"
	   emma:end="1087995964542"
      emma:process="http://example.com/myintegrator.xml">
    <emma:derived-from resource="voice1" composite="true"/>
    <emma:derived-from resource="touch1" composite="true"/>
    <command>
       <action>zoom</action>
       <location>
         <point>42.1345 -37.128</point>
        </location>
     </command>
  </emma:interpretation>
   <emma:derivation>
  		<emma:interpretation id="voice1"
			emma:medium="acoustic"
			emma:mode="voice"
           emma:lang="en-US"
           emma:start="1087995963542"
	        emma:end="1087995964542"
           emma:media-type="audio/x-wav; rate:8000;"
			emma:tokens="zoom in here"
           emma:signal="http://example.com/signals/456.wav"
           emma:process="http://example.com/my_asr.xml">
 			<command>
       		 <action>zoom</action>
       		 <location/>
     		</command>  
        </emma:interpretation>
        <emma:interpretation id="touch1"
			emma:medium="tactile"
			emma:mode="touch"
           emma:start="1087995964000"
	        emma:end="1087995964000">
             <point>42.1345 -37.128</point>
        </emma:interpretation>
   </emma:derivation>
</emma:emma>

Lattice EMMA Document

As an example of a lattice of recognition alternatives, in a travel application where the destination is either "Boston" or "Austin" and the origin is either "Portland" or "Oakland", the possibilities might be represented in a lattice as follows:


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" 
    grammar-type="application/srgs-xml" 
    ref="http://acme.com/flightquery.grxml"/>
  <emma:interpretation id="interp1"
	emma:medium="acoustic" 
	emma:mode="voice"
	emma:start="1087995961542" 
	emma:end="1087995963542"
	emma:medium="acoustic" 
	emma:mode="voice"
	emma:confidence="0.75"
	emma:lang="en-US"
	emma:grammar-ref="gram1"
   emma:signal="http://example.com/signals/123.wav"
	emma:media-type="audio/x-wav; rate:8000;"
   emma:process="http://example.com/my_asr.xml">
     <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>
      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>
      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

6. Synthesis

In HTML speech applications, the synthesizer service does not participate directly in the user interface. Rather, it simply provides rendered audio upon request, similar to any media server, plus interim events such as marks. The UA buffers the rendered audio, and the application may choose to play it to the user at some point completely unrelated to the synthesizer service. It is the synthesizer's role to render the audio stream in a timely manner, at least rapidly enough to support real-time playback. The synthesizer MAY also render and transmit the stream faster than required for real-time playback, or render multiple streams in parallel, in order to reduce latency in the application. This is in stark contrast to the IVR model implemented by [MRCPv2], where the synthesizer essentially renders directly to the user's telephone and is an active part of the user interface.

The synthesizer MUST support [SSML] and plain text input. A synthesizer MAY also accept other input formats. In all cases, the client should use the Content-Type header to indicate the input format.

6.1 Synthesis Requests


synth-method = "SPEAK"
             | "STOP"
             | "DEFINE-LEXICON"

The set of synthesizer request methods is a subset of those defined in [MRCPv2].

SPEAK

The SPEAK method operates similarly to its [MRCPv2] namesake. The primary difference is that SPEAK results in a new audio stream being sent from the server to the client, using the same Request-ID. A SPEAK request MUST include the Audio-Codec header. When the rendering has completed, and the end-of-stream message has been sent, the synthesizer sends a SPEAK-COMPLETE event.

STOP

When the synthesizer receives a STOP request, it ceases rendering the requests specified in the Active-Request-Id-List header. If the Active-Request-Id-List header is missing, it ceases rendering all active SPEAK requests. For any SPEAK request that is ceased, the synthesizer sends an end-of-stream message and a SPEAK-COMPLETE event.
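For example, a client might cancel a single outstanding SPEAK request as follows (the request-ids are hypothetical, the Active-Request-Id-List header is as defined in [MRCPv2], and the relative ordering of the STOP response and the SPEAK-COMPLETE event is illustrative):

C->S: web-speech/1.0 STOP 3260
        Resource-ID:synthesizer
        Active-Request-Id-List: 3257

S->C: binary audio packet: end of stream (for the ceased SPEAK 3257)

S->C: web-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
        Resource-ID:synthesizer

S->C: web-speech/1.0 3260 200 COMPLETE
        Resource-ID:synthesizer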

DEFINE-LEXICON

This is used to load or unload a lexicon, and is identical to its namesake in [MRCPv2].

6.2 Synthesis Events

Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the synthesizer resource.


synth-event  = "INTERIM-EVENT"  ; See Interim Events above
             | "SPEECH-MARKER"  ; An SSML mark has been rendered
             | "SPEAK-COMPLETE"
INTERIM-EVENT

See Interim Events above.

SPEECH-MARKER

This event indicates that an SSML mark has been rendered. It uses the Speech-Marker header, which contains a timestamp indicating where in the stream the mark occurred, and the label associated with the mark.

Implementations should send the SPEECH-MARKER event as close as possible to the corresponding media packet, so that clients can play the media and fire events in real time if needed.

SPEAK-COMPLETE

Indicates that rendering of the SPEAK request has completed.

6.3 Synthesis Headers

The synthesis headers used in web-speech/1.0 are mostly a subset of those in [MRCPv2], with some minor modification and additions.


synth-header = ; headers borrowed from [MRCPv2]
               active-request-id-list
             | Completion-Cause
             | Completion-Reason
             | Voice-Gender
             | Voice-Age
             | Voice-Variant
             | Voice-Name
             | Prosody-parameter ; Actually a collection of headers, see [MRCPv2]
             | Speech-Marker
             | Speech-Language
             | Failed-URI
             | Failed-URI-Cause
             | Load-Lexicon
             | Lexicon-Search-Order
             | Vendor-Specific       ; see Generic Headers
               ; new headers for web-speech/1.0
             | Audio-Codec
             | Stream-ID ; read-only

Speech-Marker = "Speech-Marker:" "timestamp" "=" date-time [";" 1*(UTFCHAR)] 
                ; e.g. Speech-Marker:timestamp=2011-09-06T10:33:16.612Z;banana
Audio-Codec   = "Audio-Codec:" mime-media-type ; See [RFC3555]
Stream-ID     = "Stream-ID:" 1*8DIGIT ; decimal representation of 24-bit stream-ID

Audio-Codec

Because an audio stream is created in response to a SPEAK request, the audio codec and parameters must be specified in the SPEAK request using the Audio-Codec header. If the synthesizer is unable to encode with this codec, it terminates the request with a 409 (unsupported header field) COMPLETE status message.
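For instance, if the client requests a codec the service cannot produce, the exchange might look like this (a sketch; the codec name and request-id below are deliberately fictitious):

C->S: web-speech/1.0 SPEAK 3260
        Resource-ID:synthesizer
        Audio-codec:audio/x-unsupported-codec
        Content-Type:text/plain

        Hello.

S->C: web-speech/1.0 3260 409 COMPLETE
        Resource-ID:synthesizer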

Speech-Marker

This header indicates when an SSML mark was rendered. It is similar to its namesake in [MRCPv2], except that the clock is defined as the local time at the service, and the timestamp format is as defined in this document. By using the timestamp from the beginning of the stream and the timestamp of this event, the UA can calculate when to raise the event to the application based on where it is in the playback of the rendered stream.
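For instance, using the values from the bookmark example in 6.4: the IN-PROGRESS response to the SPEAK carries Speech-Marker:timestamp=2011-09-06T10:33:16.612Z for the start of the stream, and the first SPEECH-MARKER event carries timestamp=2011-09-06T10:33:18.310Z;window_seat, so the UA would raise the "window_seat" mark to the application 1.698 seconds into playback of the rendered stream, whenever that playback actually occurs.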

Stream-ID

Specifies the ID of the stream that contains the rendered audio, so that the UA can associate audio streams it receives with particular SPEAK requests. This is a read-only parameter, returned in responses to the SPEAK request.

6.4 Synthesis Examples

Simple Rendering of Plain Text

The most straightforward use case for TTS is the synthesis of one utterance at a time. This is necessary for just-in-time rendering of speech, for example in dialogue systems or in in-car navigation scenarios. Here, the web application sends a single SPEAK request to the speech service.


C->S: web-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Audio-codec:audio/flac
        Speech-Language: de-DE
        Content-Type:text/plain

        Hallo, ich heiße Peter.

S->C: web-speech/1.0 3257 200 IN-PROGRESS
        Resource-ID:synthesizer
        Stream-ID: 112233
        Speech-Marker:timestamp=2011-09-06T10:33:16.612Z

S->C: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              66              6C       | o / f l
        |      61              63       +-------------------------------+ a c
        +-------------------------------+                                

S->C: more binary media packets...

S->C: binary audio packets...

S->C: binary audio packet: end of stream ( message type = 0x03 )
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: web-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
        Resource-ID:Synthesizer
        Completion-Cause:000 normal
        Speech-Marker:timestamp=2011-09-06T10:33:26.922Z

Simple Rendering of SSML

For richer markup of the text, it is possible to use the SSML format for sending an annotated request. For example, it is possible to propose an appropriate pronunciation or to indicate where to insert pauses. (SSML example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break)


C->S: web-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Voice-gender:neutral
        Voice-Age:25
        Audio-codec:audio/flac
        Prosody-volume:medium
        Content-Type:application/ssml+xml

        <?xml version="1.0"?>
        <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
                xml:lang="en-US">
           Please make your choice. <break time="3s"/>
           Click any of the buttons to indicate your preference.
        </speak>

Remainder of example as above

Bulk Requests for Synthesis

Some use cases require relatively static speech output which can be known at the time of loading a web page. In these cases, all required speech output can be requested in parallel as multiple concurrent requests. Callback methods in the web API are responsible for relating each speech stream to the appropriate place in the web application.

At the protocol level, requesting multiple speech streams concurrently is realized as follows.


C->S: web-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Speech-Language: es-ES
        Content-Type:text/plain

        Hola, me llamo Maria.

C->S: web-speech/1.0 SPEAK 3258
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Speech-Language: en-GB
        Content-Type:text/plain

        Hi, I'm George.

C->S: web-speech/1.0 SPEAK 3259
        Resource-ID:synthesizer
        Audio-codec:audio/basic
        Speech-Language: de-DE
        Content-Type:text/plain

        Hallo, ich heiße Peter.

S->C: web-speech/1.0 3257 200 IN-PROGRESS

S->C: media for 3257

S->C: web-speech/1.0 3258 200 IN-PROGRESS

S->C: media for 3258

S->C: web-speech/1.0 3259 200 IN-PROGRESS

S->C: media for 3259

S->C: more media for 3257

S->C: web-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE

S->C: more media for 3258

S->C: web-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE

S->C: more media for 3259

S->C: web-speech/1.0 SPEAK-COMPLETE 3259 COMPLETE

The service MAY choose to serialize its processing of certain requests (such as only rendering one SPEAK request at a time), but MUST still accept multiple active requests.

Multimodal Coordination with Bookmarks

In order to synchronize the speech content with other events in the web application, it is possible to mark relevant points in time using the SSML <mark> tag. When the speech is played back, a callback method is called for these markers, allowing the web application to present, e.g., visual displays synchronously.

(Example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2)


C->S: web-speech/1.0 SPEAK 3257
        Resource-ID:synthesizer
        Voice-gender:neutral
        Voice-Age:25
        Audio-codec:audio/flac
        Prosody-volume:medium
        Content-Type:application/ssml+xml

        <?xml version="1.0"?>
        <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
                xml:lang="en-US">
        Would you like to sit <mark name="window_seat"/> here at the window, or 
        rather <mark name="aisle_seat"/> here at the aisle?
        </speak>

S->C: web-speech/1.0 3257 200 IN-PROGRESS
        Resource-ID:synthesizer
        Stream-ID: 112233
        Speech-Marker:timestamp=2011-09-06T10:33:16.612Z

S->C: binary message: start of stream (stream-id = 112233)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+
        |1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
        |0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
        |      61              75              64              69       | a u d i
        |      6F              2F              66              6C       | o / f l
        |      61              63       +-------------------------------+ a c
        +-------------------------------+                                

S->C: more binary media packets...

S->C: web-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
        Resource-ID:synthesizer
        Stream-ID: 112233
        Speech-Marker:timestamp=2011-09-06T10:33:18.310Z;window_seat

S->C: more binary media packets...

S->C: web-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
        Resource-ID:synthesizer
        Stream-ID: 112233
        Speech-Marker:timestamp=2011-09-06T10:33:21.008Z;aisle_seat

S->C: more binary media packets...

S->C: binary audio packets...

S->C: binary audio packet: end of stream ( message type = 0x03 )
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |                   stream-id                   |
        |1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
        +---------------+-----------------------------------------------+

S->C: web-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
        Resource-ID:Synthesizer
        Completion-Cause:000 normal
        Speech-Marker:timestamp=2011-09-06T10:33:23.881Z

7. References

[EMMA]
EMMA: Extensible MultiModal Annotation Markup Language http://www.w3.org/TR/emma/ TODO: update link to EMMA 1.1
[MIME-RTP]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[MRCPv2]
MRCP version 2 http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24
[REQUIREMENTS]
Protocol Requirements http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0030/protocol-reqs-commented.html
[HTTP1.1]
Hypertext Transfer Protocol -- HTTP/1.1 http://www.w3.org/Protocols/rfc2616/rfc2616.html
[RFC1305]
Network Time Protocol (version 3) http://www.ietf.org/rfc/rfc1305.txt
[RFC3339]
Date and Time on the Internet: Timestamps http://www.ietf.org/rfc/rfc3339.txt
[RFC3555]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[RFC5646]
Tags for Identifying Languages http://tools.ietf.org/html/rfc5646
[SRGS]
Speech Recognition Grammar Specification Version 1.0 http://www.w3.org/TR/speech-grammar/
[SSML]
Speech Synthesis Markup Language (SSML) Version 1.0 http://www.w3.org/TR/speech-synthesis/
[TLS]
RFC 5246: The Transport Layer Security (TLS) Protocol Version 1.2 http://tools.ietf.org/html/rfc5246
[WS-API]
Web Sockets API, http://www.w3.org/TR/websockets/
[WS-PROTOCOL]
Web Sockets Protocol http://tools.ietf.org/pdf/draft-ietf-hybi-thewebsocketprotocol-09.pdf