HTML Speech XG
Proposed Protocol Approach

Draft Version 2, 24th June, 2011

This version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Latest version:
Posted to http://lists.w3.org/Archives/Public/public-xg-htmlspeech/
Previous version:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html
Editor:
Robert Brown, Microsoft
Contributors:
Milan Young, Nuance
Michael Johnston, AT&T
Patrick Ehlen, AT&T
And other contributions from HTML Speech XG participants, http://www.w3.org/2005/Incubator/htmlspeech/

Abstract

The basic approach is to use the WebSockets protocol [WS-PROTOCOL] as the transport for both audio and signaling, such that any interaction session with a service can be accomplished with a single WebSockets session.

Status of this Document

This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.

Contents

  1. Design Concept
  2. Session Establishment
  3. Signaling
    1. General Messages
    2. Recognition Messages
    3. Synthesis Messages
    4. Server Notifications
  4. Media
    1. Media Transmission
    2. Media Signaling for Recognition
    3. Media Signaling for Synthesis
  5. Design Exclusions
    1. Exclusion of RECORD
    2. Exclusion of Speaker Verification and other Non-Recognition Audio-in Scenarios
    3. Exclusion of User Enrollment
    4. Exclusion of DTMF
    5. Exclusion of Queueing
    6. Exclusion of SDP
    7. Exclusion of RTP
  6. References

Design Concept

TODO: Write up the design concept with some example sequence diagrams to illustrate how the protocol would work. This shouldn't just be a repeat of MRCP examples. It should also show some of the things that could happen in an HTML application that wouldn't happen in an IVR application. *Especially* synthesis. For example: synthesizing multiple prompts in parallel for playback in the UA when the app needs them, rather than synthesis occurring on a prompt queue; or recognition from both an audio stream and a gesture stream; or recognition from two streams with alternate encodings (one of which is decodable so humans can listen to it, while the other is specifically encoded for the recognizer); or recognition from multiple beams of an array mic, with multiple simultaneous users.

Session Establishment

The WebSockets session is established through the standard WebSockets HTTP handshake [WS-PROTOCOL], with the UA offering the "html-speech" subprotocol (possibly alongside vendor-specific alternatives) in the Sec-WebSocket-Protocol header, and the service confirming the subprotocol it has selected.

For example:


C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
      Host: examplespeechservice.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: OIUSDGY67SDGLjkSD&g2 (for example)
      Sec-WebSocket-Version: 9
      Sec-WebSocket-Protocol: html-speech, x-custom-speech

S->C: HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
      Sec-WebSocket-Protocol: html-speech

Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, responses or media (in the case of TTS).
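
For illustration only, the following sketch shows how a UA-side client (or a test harness) might establish such a session using the standard WebSockets API [WS-API]. The service URL is hypothetical, and the offered subprotocols match the handshake example above.

const ws = new WebSocket(
  "wss://examplespeechservice.com/speechservice123?customparam=foo&otherparam=bar",
  ["html-speech", "x-custom-speech"]      // offered subprotocols; the service selects one
);
ws.binaryType = "arraybuffer";            // media is carried in binary messages

ws.onopen = () => {
  if (ws.protocol !== "html-speech") {
    ws.close();                           // the service did not select the html-speech subprotocol
    return;
  }
  // The UA can now send requests and media; the service responds with
  // status messages, events, and (for TTS) media on the same connection.
};

ws.onmessage = (event) => {
  if (typeof event.data === "string") {
    // control message (status or event)
  } else {
    // binary audio packet (e.g. synthesized audio)
  }
};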

Signaling

The signaling design borrows heavily from [MRCPv2]. The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.

There are three classes of control messages:

  1. C->S requests from the UA to the service. The client requests a "method" from a particular remote speech resource. These methods are assumed to be defined as in MRCP2 except where noted.
  2. S->C general status notification messages from the service to the UA, marked as either PENDING, IN-PROGRESS or COMPLETE.
  3. S->C named events from the service to the UA, that are essentially special cases of 'IN-PROGRESS' request-state notifications.


control-message =   start-line ; i.e. use the typical MIME message format
                  *(header CRLF)
                    CRLF
                   [body]
start-line      =   request-line | status-line | event-line
header          =  <Standard MIME header format> ; actual headers depend on the type of message
body            =  *OCTET                        ; depends on the type of message

Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets html-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have an integer value between 0 and 65535. This upper value is less than that used in MRCP, and was chosen so that request-ids can be associated with audio streams using a 16-bit field in binary audio messages.

Do we really need to specify the target resource like MRCP does? Are there any cases where this is not already implicit?


request-line   = version SP message-length SP method-name SP request-id SP CRLF
version        = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
message-length = 1*DIGIT
method-name    = general-method | synth-method | reco-method | proprietary-method
request-id     = <Decimal integer between 0 and 65535>

TODO: Is message-length really necessary? Presumably its only value in MRCP is in framing, which we get automatically in WebSockets.
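
For illustration, a minimal TypeScript sketch of composing a request message per the grammar above, as a UA might. It assumes that message-length, if kept, counts the entire message in bytes as in MRCP (the TODO above notes it may be dropped entirely), and the composeRequest helper name is hypothetical.

const VERSION = "html-speech/1.0";
const CRLF = "\r\n";

function composeRequest(
  method: string,
  requestId: number,       // MUST be 0..65535 so it also fits the 16-bit binary audio field
  headers: Record<string, string>,
  body = ""
): string {
  if (requestId < 0 || requestId > 65535) {
    throw new RangeError("request-id must be an integer between 0 and 65535");
  }
  const headerBlock =
    Object.entries(headers).map(([name, value]) => `${name}: ${value}`).join(CRLF) +
    (Object.keys(headers).length ? CRLF : "");
  const rest = ` ${method} ${requestId}${CRLF}${headerBlock}${CRLF}${body}`;

  // message-length is assumed to count the whole message, including the start-line.
  // Since the field's own digits contribute to the total, iterate until stable.
  let length = 0;
  for (;;) {
    const candidate = `${VERSION} ${length}${rest}`;
    const actual = new TextEncoder().encode(candidate).length;
    if (actual === length) return candidate;
    length = actual;
  }
}

// e.g. ws.send(composeRequest("SET-PARAMS", 1, { "Confidence-Threshold": "0.9" }));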

Status messages are sent by the server, to indicate the state of a request.


status-line   =  version SP message-length SP request-id SP status-code SP request-state CRLF								
status-code   =  3DIGIT       ; Specific codes TBD, but probably similar to those used in MRCP

; All communication from the server is labeled with a request state.
request-state = "COMPLETE"    ; Processing of the request has completed.
              | "IN-PROGRESS" ; The request is being fulfilled.
              | "PENDING"     ; Processing of the request has not begun.

TODO: Determine status code values.

Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results.


event-line    =  version SP message-length SP event-name SP request-id SP request-state CRLF
event-name    =  synth-event | reco-event | proprietary-event

TODO: write something about the list of supported headers
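
For illustration, the sketch below shows how a client might take apart a server message, distinguishing status messages from named events by whether the third token of the start-line is a numeric request-id. Header handling is simplified, and status-code values are still TBD.

interface ServerMessage {
  version: string;
  messageLength: number;
  eventName?: string;          // present for event messages (e.g. RECOGNITION-COMPLETE)
  statusCode?: number;         // present for status messages
  requestId: number;
  requestState: "PENDING" | "IN-PROGRESS" | "COMPLETE";
  headers: Record<string, string>;
  body: string;
}

function parseServerMessage(raw: string): ServerMessage {
  const headerEnd = raw.indexOf("\r\n\r\n");
  const head = headerEnd >= 0 ? raw.slice(0, headerEnd) : raw;
  const body = headerEnd >= 0 ? raw.slice(headerEnd + 4) : "";
  const [startLine, ...headerLines] = head.split("\r\n");

  const headers: Record<string, string> = {};
  for (const line of headerLines) {
    const colon = line.indexOf(":");
    if (colon > 0) headers[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }

  // status-line: version message-length request-id status-code request-state
  // event-line:  version message-length event-name  request-id  request-state
  const [version, messageLength, third, fourth, requestState] = startLine.split(" ");
  const isStatus = /^\d+$/.test(third);

  return {
    version,
    messageLength: Number(messageLength),
    eventName: isStatus ? undefined : third,
    statusCode: isStatus ? Number(fourth) : undefined,
    requestId: Number(isStatus ? third : fourth),
    requestState: requestState as ServerMessage["requestState"],
    headers,
    body,
  };
}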

General Messages

The GET-PARAMS and SET-PARAMS requests are the same as their [MRCPv2] counterparts. They are used to discover and set the configuration parameters of a resource (recognizer or synthesizer).

The set of general headers is also the same as the [MRCPv2] Generic Message Headers, except channel-identifier (which relates to SDP, which is not used in the html-speech protocol).


general-method = "SET-PARAMS"
               | "GET-PARAMS"
               | "SESSION-QUERY"

general-header-from-mrcp =
                  accept
               |  accept-charset
               | "Active-Request-Id-List:" request-id *("," request-id)
               | "Proxy-Sync-Id:" 1*VCHAR
               |  content-type
               |  content-id
               | "Content-Base:" absoluteURI
               | "Content-Encoding:" *WSP content-coding *(*WSP "," *WSP content-coding *WSP )
               | "Content-Location:" ( absoluteURI / relativeURI )
               | "Content-Length:" 1*19DIGIT
               | "Fetch-Timeout:" 1*19DIGIT
               |  cache-control
               | "Logging-Tag:" 1*UTFCHAR
               |  set-cookie
               |  set-cookie2
               | "Vendor-Specific-Parameters:" [vendor-specific-av-pair *(";" vendor-specific-av-pair)]

vendor-specific-av-pair = 1*UTFCHAR "=" value

session-query-header =
                 "Recognizer-Media:" mime-media-type *("," mime-media-type) ; See [RFC3555]
               | "Recognizer-Lang:" lang-tag *("," lang-tag) ; See [RFC5646]
               | "Synthesizer-Media:" mime-media-type *("," mime-media-type) ; See [RFC3555]
               | "Synthesizer-Lang:" lang-tag *("," lang-tag) ; See [RFC5646]

TODO: Fill in session query headers.

TODO: Do we really need Content-Length? This is useful for framing otherwise open-ended messages in HTTP and MRCP, but it's redundant in WebSockets messages.

TODO: Are cookie headers appropriate? These make sense in a voice-browser world, where MRCP is central to the infrastructure. But in the visual browser world, the UA is very particular about cookies and the servers it's willing to reveal them to. We should have solid use cases for these.

The SESSION-QUERY request is introduced in the html-speech protocol to provide a way for the application/UA to determine whether the service supports the basic language and codec capabilities it needs. In most cases applications will know service capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine service capabilities, the UA sends a SESSION-QUERY request to the service, containing a set of capabilities, to which the service responds with the specific subset it actually supports. Unlike GET/SET-PARAMS, SESSION-QUERY is not directed at a particular resource.

For example:



C->S: html-speech/1.0 ... SESSION-QUERY 34132
      recognizer-media: audio/basic, audio/amr-wb,
                        audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                        audio/dsr-es202212; rate:8000; maxptime:40
      recognizer-lang: en-AU, en-GB, en-US, en
      synthesizer-media: audio/ogg, audio/flac, audio/basic
      synthesizer-lang: en-AU, en-GB

S->C: html-speech/1.0 ... 34132 200 COMPLETE
      recognizer-media: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      recognizer-lang: en-GB, en
      synthesizer-media: audio/flac, audio/basic
      synthesizer-lang: en-GB

TODO: Should investigate supporting "Generic Headers" at the session level (i.e. without specifying a 'recognizer' or 'synthesizer' resource). If so, then perhaps SESSION-QUERY is not required.

Recognition Messages

Recognition is accomplished with a set of messages and events that have the same meaning as their [MRCPv2] counterparts. The headers for these messages are the same as the MRCPv2 Recognizer Header Fields, except those related to enrollment and DTMF input (DTMF-*, Input-Type, Clear-DTMF-Buffer), and verification (Ver-Buffer-Utterance).


reco-method  = "DEFINE-GRAMMAR"
             | "RECOGNIZE" ; Similar to MRCP2 with TBD support for right and left context in message body,
                           ; which cannot be included as headers because of ASCII limitation. 
             | "START-INPUT-TIMERS"
             | "STOP"
             | "INTERPRET"
             | "START-AUDIO" ; No counterpart in MRCP. Used to initiate an media input stream.

; Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.
reco-event   = "START-OF-INPUT" ; Note that the timestamp on the event should be when speech was estimated to begin, 
                                ; NOT when the endpointer finally decided that speech began (M milliseconds later). 
             | "RECOGNITION-COMPLETE" Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
             | "INTERPRETATION-COMPLETE"
             | "END-OF-INPUT"   ; No counterpart in the MRCP2 standard, this event is the logical counterpart of START-OF-INPUT.
                                ; Note that the timestamp on the event should be at the point when speech was estimated to have ended, 
                                ; NOT when the endpointer finally decided that speech ended (M milliseconds later).
             | "INTERMEDIATE-RESULT" ; No counterpart in the MRCP2 standard. Continuous speech (aka dictation) often requires 
                                     ; feedback about what has been recognized thus far. Waiting for a monolithic RECOGNITION-COMPLETE event 
                                     ; at the end of the RECOGNITION transaction does not usually lead to user-friendly interfaces. 
                                     ; This INTERMEDIATE-RESULT event (not part of MRCP2) provides this "live" channel. As with RECOGNITION-COMPLETE, 
                                     ; contents are assumed to be EMMA unless an alternate Content-Type is provided. 

reco-header-from-mrcp = 
               "Confidence-Threshold:" FLOAT
             | "Sensitivity-Level:" FLOAT
             | "Speed-Vs-Accuracy:" FLOAT
             | "N-Best-List-Length:" 1*19DIGIT
             | "No-Input-Timeout:" 1*19DIGIT
             | "Recognition-Timeout:" 1*19DIGIT
             | "Waveform-URI:" ["<" uri ">;size=" 1*19DIGIT ";duration=" 1*19DIGIT]
             | "Media-Type:" media-type-value
             | "Input-Waveform-URI:" uri
             | "Completion-Cause:" 3DIGIT SP 1*VCHAR
             | "Completion-Reason:" quoted-string
             | "Recognizer-Context-Block:" [1*VCHAR]
             | "Start-Input-Timers:" BOOLEAN
             | "Speech-Complete-Timeout:" 1*19DIGIT
             | "Speech-Incomplete-Timeout:" 1*19DIGIT
             | "Failed-URI:" absoluteURI
             | "Failed-URI-Cause" ":" 1*UTFCHAR
             | "Save-Waveform:" BOOLEAN
             | "New-Audio-Channel:" BOOLEAN
             | "Speech-Language:" 1*VCHAR
             | "Recognition-Mode:" 1*ALPHA
             | "Cancel-If-Queue:" BOOLEAN
             | "Hotword-Min-Duration:" 1*19DIGIT
             | "Hotword-Max-Duration:" 1*19DIGIT
             | "Interpret-Text:" 1*VCHAR
             | "Early-No-Match:" BOOLEAN

reco-header-introduced-by-speech =
               "Source-Time:" 1*19DIGIT ; The UA local time at which the RECOGNIZE or START-AUDIO request was initiated.
             | "Audio-Codec:" mime-media-type ; See [RFC3555]
             | "Audio-Streams:" request-id *("," request-id) ; Used in RECOGNIZE to list the input streams, identified by their START-AUDIO request-ids.
             | "Partial:" BOOLEAN
             | "Partial-Interval-Hint:" 1*5DIGIT

TODO: Can we trim this even further? For example, it's not clear that we need Cancel-If-Queue. HTML apps won't have the same serialized dialog we see in IVR, so this may not be a meaningful header. Will the API have hotword functionality? If not, we don't need the hotword headers.

TODO: Does this imply further API requirements? For example, START-INPUT-TIMERS is there for a good reason, but AFAIK we haven't spoken about it. Similarly, Early-No-Match seems useful to developers. Is it?

TODO: Add the header(s) to indicate whether the recognition is single or continuous, and whether interim events should be returned.

One other thing we'll need to consider is how to add/remove grammars during continuous recognition. For example, in dictation it's not uncommon to have hot words that switch in and out of a command mode (i.e. enable/disable a command grammar). In open-mic multimodal apps, the app will listen continuously, but change the set of active grammars based on the user's other non-speech interactions with the app.

START-AUDIO
The START-AUDIO request is used to initiate audio input streams. For recognition, audio input streams aren't tightly coupled to any particular recognition request. In one use case, there may be multiple recognition requests in series on the same audio stream, which is equivalent to the typical IVR scenario. However, HTML apps will diverge greatly from this pattern. For example, in the most typical case, there will only be an audio stream when the app actually wants to do recognition, such as when the user clicks on a microphone graphic (as opposed to IVR apps, where there is always exactly one audio input stream that lasts for the duration of the application). Other more exotic scenarios will also become commonplace. For example, there will be recognition requests that consume multiple input streams, such as a games console or smart TV that feeds both audio and gesture data into the recognizer. Popular consumer entertainment devices that present this sort of UI already exist. The same sorts of living-room devices may also feed audio from multiple in-room users into the same recognition session. There will also be apps that use two overlapping recognition requests on the same input stream, such as applications that recognize both dictation input and UI-automation commands, as seen in popular desktop dictation products.
Audio-Codec
To start a new audio stream, the UA sends a START-AUDIO request, which it follows with audio packets that share the same request-ID. START-AUDIO MUST include the Audio-Codec header, which specifies the CODEC used in the audio stream.
Source-Time
START-AUDIO MUST also include the Source-Time header, which indicates the UA's local time at the start of the audio stream, so the service can accurately express events using client-local time, and (where relevant) so that the service can synchronize the processing of multiple input streams.
Partial
This header is required to support the continuous speech scenario on the 'recognizer' resource. When sent by the client in a RECOGNIZE or SET-PARAMS request, this header controls whether or not the client is interested in partial results from the service. In this context, 'partial' means mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes like an N-best list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on regular command-and-control recognition requests as well as dictation sessions, because at the API level there is no syntactic difference between the two: both are simply recognition requests over an SRGS grammar or set of URL(s). Partial results can also be useful in command-and-control scenarios, such as dictation enrollment and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, and in response to GET-PARAMS (see the sketch following this list).
Partial-Interval-Hint
A suggestion from the client to the service on how frequently partial results should be sent. The integer value is the desired interval in milliseconds.
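
For illustration, a sketch of the intended client-side handling: each partial INTERMEDIATE-RESULT replaces the current "live" guess, while RECOGNITION-COMPLETE commits text. The textFromEmma helper is hypothetical and deliberately naive; a real UA would walk the EMMA structure properly.

let committedText = "";   // text from the last non-partial (complete) result onwards
let liveText = "";        // best guess since then; replaced on every partial result

// Hypothetical helper: reduce an EMMA payload to a plain-text best guess.
function textFromEmma(emmaXml: string): string {
  const doc = new DOMParser().parseFromString(emmaXml, "application/xml");
  return doc.documentElement.textContent?.trim() ?? "";
}

function onRecognizerEvent(eventName: string, headers: Record<string, string>, body: string): void {
  const isPartial = (headers["Partial"] ?? "false").toLowerCase() === "true";
  if (eventName === "INTERMEDIATE-RESULT" && isPartial) {
    liveText = textFromEmma(body);          // e.g. "deer" -> "dear father" -> "dear father christmas"
  } else if (eventName === "RECOGNITION-COMPLETE") {
    committedText += textFromEmma(body);    // commit the full result
    liveText = "";
  }
  console.log(committedText + liveText);    // a real UA would update the page instead
}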

In order to associate a RECOGNIZE request with one (or more) media streams, the RECOGNIZE request MUST include the Audio-Streams header, which specifies the request-id(s) of the audio input streams, as well as the Source-Time header, to indicate the point in the input stream(s) from which the recognizer should start processing.

TODO: There's debate about which of the following are appropriate in a standard rather than as vendor-specific parameters.

These headers may be required to support the continuous speech scenario on the 'recognizer' resource:

User-ID: String, no default
Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, since the vocabularies are so large. A User-ID field would allow the recognizer to establish the user's identity if the webapp decided to supply this information. Otherwise the engine would operate with default language models.
Enrollment-Phrase: String, no default
Defines the phrase that the user has spoken so that the recognizer can refine its models. Although User-Gender and User-ID are two headers that will commonly be supplied in conjunction with the enrollment phrase, they can all be specified independently.
Punctuation-Alias: String, no default
It's common for punctuation characters to have alternate names. For example the '.' character is often called 'period' or 'dot' depending upon the context. This parameter is a comma-separated list of such aliases where each alias is denoted with /. For example: './period, ./dot, ./full stop, \,/comma, \n/newline'.

TODO: There's debate about which of the following are appropriate in a standard rather than as vendor-specific parameters.

The following set of new 'recognizer' resource headers revolve around post-recognition modifications of the utterance string. The selected modifications are merged together and presented in parallel with the raw utterance (see example below).

Return-Punctuation: Boolean
If true, the speech service would return a punctuated utterance in addition to the raw utterance. For example, if the user spoke 'dear abby', the result might be 'dear abby,'.
Gender-Number-Pronoun: String
Some languages require the recognizer to conjugate verbs differently depending upon the gender and "number" of the speaker. For example, in French, this parameter might be set to one of "je", "tu", "vous", etc.
Return-Formatting: Boolean
If true, the speech service would return a formatted utterance in addition to the raw utterance. For example, if the user spoke 'dear abbey', the result might be 'Dear Abbey'.
Filter-Profanity: Boolean, default true
If true, the speech server would remove suspected profanity from the utterance string. For example, if the user spoke 'my dog sally is a good bitch', the result might be 'my dog sally is a good b****'.

Assuming all of the above three formatting parameters were set to true, the user utterance of "he bought a damn nice five dollar watch in new york period", might result in the following RECOGNITION-COMPLETE payload:


<emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="12">
      <emma:arc from="1" to="2">he</emma:arc>
      <emma:arc from="2" to="3">bought</emma:arc>
      <emma:arc from="3" to="4">a</emma:arc>
      <emma:arc from="4" to="5">damn</emma:arc>
      <emma:arc from="5" to="6">nice</emma:arc>
      <emma:arc from="6" to="7">five</emma:arc>
      <emma:arc from="7" to="8">dollar</emma:arc>
      <emma:arc from="8" to="9">watch</emma:arc>
      <emma:arc from="9" to="10">in</emma:arc>
      <emma:arc from="10" to="11">New York</emma:arc>
      <emma:arc from="11" to="12">./period</emma:arc>
    </emma:lattice>
    <text>He bought a d*** nice $5.00 watch in New York.</text>
    <alignment>Content undefined (i.e. vendor specific)</alignment>
</emma:interpretation>
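
For illustration only, a client could pull both the raw lattice and the formatted text out of a payload shaped like the example above, assuming the full payload declares the EMMA namespace (abbreviated away here); the extractResult helper is hypothetical.

const EMMA_NS = "http://www.w3.org/2003/04/emma";

function extractResult(emmaXml: string): { raw: string; formatted: string } {
  const doc = new DOMParser().parseFromString(emmaXml, "application/xml");

  // Raw utterance: concatenate the lattice arcs in document order.
  const arcs = Array.from(doc.getElementsByTagNameNS(EMMA_NS, "arc"));
  const raw = arcs.map(arc => arc.textContent ?? "").join(" ");

  // Formatted utterance: the unqualified <text> element, falling back to the raw form.
  const formatted = doc.getElementsByTagName("text")[0]?.textContent ?? raw;

  return { raw, formatted };
}

// raw:       "he bought a damn nice five dollar watch in New York ./period"
// formatted: "He bought a d*** nice $5.00 watch in New York."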

Synthesis Messages

Synthesis is accomplished with a set of messages and events that have the same meaning as their [MRCPv2] counterparts. The headers for these messages are the same as the MRCPv2 Synthesizer Header Fields, except ...


synth-method = "SPEAK"
             | "STOP"
             | "PAUSE"
             | "RESUME"
             | "BARGE-IN-OCCURRED"
             | "CONTROL"
             | "DEFINE-LEXICON"

; Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the 'synthesizer' resource.
synth-event  = "SPEECH-MARKER"
             | "SPEAK-COMPLETE"

synth-header-from-mrcp = 
               "Jump-Size:" speech-length-value
             | "Kill-On-Barge-In:" BOOLEAN
             | "Speaker-Profile:" uri
             | "Completion-Cause:" 3DIGIT SP 1*VCHAR
             | "Completion-Reason:" quoted-string
             | "Voice-Gender:" ("male" | "female" | "neutral")
             | "Voice-Age:" 1*3DIGIT
             | "Voice-Variant:" 1*19DIGIT
             | "Voice-Name:" 1*UTFCHAR *(1*WSP 1*UTFCHAR)
             | "Prosody-" prosody-param-name ":" prosody-param-value
             | "Speech-Marker:timestamp=" 1*20DIGIT [";" 1*(UTFCHAR / %x20)]
             | "Speech-Language:" 1*VCHAR
             | "Fetch-Hint:" ("prefetch" / "safe")
             | "Audio-Fetch-Hint:" ("prefetch" / "safe" / "stream")
             | "Failed-URI" ":" absoluteURI
             | "Failed-URI-Cause" ":" 1*UTFCHAR
             | "Speak-Restart:" BOOLEAN
             | "Speak-Length:" positive-length-value
             | "Load-Lexicon:" BOOLEAN
             | "Lexicon-Search-Order:<" absoluteURI ">" *(" " "<" absoluteURI ">")

synth-header-introduced-by-html-speech =
               "Audio-Codec:" mime-media-type ; See [RFC3555]

The Audio-Codec header is introduced by HTML-Speech because an audio stream is created in response to a SPEAK request, and lasts for the duration of the request (unlike IVR scenarios, where media channels are negotiated prior to the MRCP session). The service needs to know what CODEC to use to encode the audio stream it produces. Hence this header, which may be included in a SPEAK request, or set beforehand with SET-PARAMS.

The clock for the Speech-Marker header is defined as starting at zero at the beginning of the output stream. Whenever the UA happens to play back the corresponding portion of audio from its buffer, it can fire the event to the application.
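
For illustration, a UA might queue SPEECH-MARKER events and re-dispatch them as its own playback position passes each timestamp. The draft does not specify the timestamp units, so the sketch compares values in whatever units the service used; the playback offset is assumed to be supplied by the UA's audio pipeline, and marker events are assumed to arrive in stream order.

interface PendingMark { offset: number; name: string; }   // offset in the stream's timestamp units
const pendingMarks: PendingMark[] = [];

// Called for each SPEECH-MARKER event, e.g. "Speech-Marker:timestamp=2059000;marker-1".
function onSpeechMarker(headerValue: string): void {
  const match = /timestamp=(\d+)(?:;(.*))?/.exec(headerValue);
  if (match) pendingMarks.push({ offset: Number(match[1]), name: match[2] ?? "" });
}

// Driven by the UA's playback of the buffered stream (e.g. polled, or tied to the audio clock).
function checkMarks(playbackOffset: number): void {
  while (pendingMarks.length > 0 && pendingMarks[0].offset <= playbackOffset) {
    const { name } = pendingMarks.shift()!;
    dispatchEvent(new CustomEvent("mark", { detail: name }));   // surface to the application
  }
}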

TODO: This design has some IVR artifacts that don't make sense in HTML. In IVR, the synthesizer essentially renders directly to the user's telephone, and is an active part of the user interface. Whereas in HTML, the UA will control playback from its buffer, independent of rendering. In light of this, the CONTROL method and the Jump-Size, Speak-Length, and Kill-On-Barge-In headers (and possibly others) don't really make sense.

TODO: Speaker-Profile will probably never be used, because HTML UAs will want to pass values inline, rather than store them in a separate URI-referencable resource. Should we remove it?

TODO: DEFINE-LEXICON and the Load-Lexicon header appear to be useful. Does it need to surface in the API, or is its presence in SSML enough? And if it is, why do we need the header? And also, why isn't there corresponding functionality for recognition?

Server Notifications

Within MRCPv2, the server may only send messages in response to a client-driven request. Client polling via GET-PARAMS is the only option for "pushing" a message from the server to the client. It's unclear whether server push through the HTML Speech protocol and API is required functionality. These messages could, for example, be accomplished outside the specification using a separate WebSocket connection. But if this is found to be convenient, then we may choose to define a server-driven notification mechanism as follows:

TODO: we need to clarify whether there's a clear scenario for this.


server-notification = version SP "NOTIFY" CRLF [body]

Note that such a notification lacks support for MRCP infrastructure like request-ids and headers. These were omitted because it's unclear how the client browser would make sense of the data. If webapps require support for request-ids, parameters, etc., they would probably be best encoded within the message [body].

TODO: Although some minimal set of headers may be useful, for example Content-Type or Vendor-Specific-Parameters.

If the [body] is detected as being XML or JSON, it would be nice if the client browser could automatically reflect the data as a DOM or ECMAScript object.
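
If such a mechanism were adopted, the client-side handling might be as simple as the sketch below, which reflects a JSON body as an ECMAScript object and otherwise hands the raw body to the webapp; the notification form follows the tentative grammar above.

function onServerNotification(raw: string): unknown {
  const lineEnd = raw.indexOf("\r\n");
  const startLine = lineEnd >= 0 ? raw.slice(0, lineEnd) : raw;
  if (!startLine.endsWith(" NOTIFY")) return undefined;       // not a notification

  const body = lineEnd >= 0 ? raw.slice(lineEnd + 2) : "";
  try {
    return JSON.parse(body);     // reflect a JSON body as an ECMAScript object
  } catch {
    return body;                 // otherwise hand the raw body to the webapp as-is
  }
}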

Media

Audio (or other media-like I/O) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session that is used for the control messages.

Media Transmission

There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. Most implementations will seek to minimize user-perceived latency by sending messages that encode between 20 and 80 milliseconds of sound. Since the overhead of a WebSockets frame is typically only 4 bytes, implementations should err on the side of sending smaller packets more frequently. While this is more critical for recognition, the principle also applies to synthesis, so that interim events such as marks may be closely synchronized with the corresponding part of the audio rendering.

The design does not permit the transmission of media as text messages. Because base-64 encoding of audio data in text incurs a 33% transmission overhead, and WebSockets provides native support for binary messages, there is no need for text-encoding of audio.

A sequence of audio messages represents an in-order contiguous stream of audio data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the decoder. Timing is calculated by decoded offset from the beginning of the stream.

Audio messages are not themselves aware of silence. Some codecs will efficiently encode silence within the encoded audio stream. There is also a binary Skip message (type 0x02, defined below) that a UA MAY use to indicate a period of silence explicitly.

A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.

A client MAY send audio to a recognition service slower than real time. While this generally causes undesirable user interface latency, it may be necessary due to practical limitations of the network. A recognition service MUST be prepared to receive slower-than-real-time audio.

Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.

Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, living-room devices with microphone arrays may send separate streams in order to capture the speech of multiple individuals within the room. Or, for example, some devices may send parallel streams with alternative encodings that may not be human-consumable (like standard codecs) but contain information that is of particular value to a recognition service. The protocol future-proofs for this scenario by incorporating a channel ID into each message, so that separate audio channels can be multiplexed onto the same session. Channel IDs are selected by the originator of the stream, and only need to be unique within the set of channels being transmitted by that originator.


audio-packet        =  binary-message-type
                       binary-request-id
                       binary-reserved
                       binary-data
binary-message-type =  OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-request-id   = 2OCTET ; Matches the request-id for a SPEAK or START-AUDIO request
binary-reserved     =  OCTET ; Just buffers out to the 32-bit boundary for now, but may be useful later.
binary-data         = *OCTET 

The binary-request-id field is used to correlate the audio stream with the corresponding SPEAK or START-AUDIO request. It is a 16-bit unsigned integer.

The binary-message-type field has these defined values:

0x01: Audio
The message is an audio packet, and contains encoded audio data.
0x02: Skip
The message indicates a period of silence, and contains the new timestamp on which all future audio packets in the stream are based. This message is optional. If a UA knows that it is using a CODEC that does not efficiently encode silence, and the UA has its own silence-detection logic, it MAY use this message to avoid transmitting unnecessary audio packets.
0x03: End-of-stream
The message indicates the end of the audio stream. Any future audio messages with the same request-id are invalid.
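
For illustration, a sketch of framing these packets on the client, assuming the 16-bit request-id is sent in network byte order (the draft does not state the byte order); the audioPacket helper name is hypothetical.

const MSG_AUDIO = 0x01;
const MSG_SKIP = 0x02;            // payload carries the new timestamp (see above)
const MSG_END_OF_STREAM = 0x03;

function audioPacket(messageType: number, requestId: number, payload: Uint8Array = new Uint8Array(0)): ArrayBuffer {
  const buffer = new ArrayBuffer(4 + payload.byteLength);
  const view = new DataView(buffer);
  view.setUint8(0, messageType);             // binary-message-type
  view.setUint16(1, requestId, false);       // binary-request-id, 16-bit, big-endian (assumed)
  view.setUint8(3, 0);                       // binary-reserved
  new Uint8Array(buffer, 4).set(payload);    // binary-data: encoded audio
  return buffer;
}

// e.g. sending ~20-80ms of encoded audio per message, then terminating the stream:
//   ws.send(audioPacket(MSG_AUDIO, 41201, encodedChunk));
//   ws.send(audioPacket(MSG_END_OF_STREAM, 41201));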

Media Signaling for Recognition

For example:




C->S: html-speech/1.0 ... START-AUDIO 41201
      Audio-Codec: audio/dsr-es202212; rate:8000; maxptime:40
      Source-Time: 12753248231 (source's local time at the start of the first packet)

C->S: binary audio packet #1 (request-id = 41201 = 1010000011110001)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

S->C: html-speech/1.0 ... 41201 200 IN-PROGRESS (i.e. the service is accepting the audio)

C->S: binary audio packets...

C->S: html-speech/1.0 ... RECOGNIZE 8322
      Confidence-Threshold:0.9
      Audio-Streams: 41201 (request-id of the input stream)
      Source-Time: 12753432234 (where in the input stream recognition should start)

S->C: html-speech/1.0 ... START-OF-INPUT 8322 IN-PROGRESS

C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...

S->C: html-speech/1.0 ... END-OF-INPUT 8322 IN-PROGRESS (i.e. the recognizer has detected that the user stopped talking)

S->C: html-speech/1.0 ... RECOGNITION-COMPLETE 8322 COMPLETE

C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1|0 0 0 0 0 0 0 0|
        +---------------------------------------------------------------+

S->C: html-speech/1.0 ... 41201 200 COMPLETE (i.e. the service has received the end of stream)

Media Signaling for Synthesis

For example:


C->S: html-speech/1.0 ... SPEAK 3257
        Voice-Gender:neutral
        Voice-Age:25
        Audio-Codec:audio/flac
        Prosody-volume:medium
        Content-Type:application/ssml+xml
        Content-Length:...

        <?xml version="1.0"?>
        <speak version="1.0">
        ...

S->C: html-speech/1.0 ... 3257 200 IN-PROGRESS
        Speech-Marker:timestamp=0

S->C: binary audio packet #1 (request-id = 3257 = 110010111001)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

S->C: binary audio packets...

S->C: html-speech/1.0 ... SPEECH-MARKER 3257 IN-PROGRESS
        Speech-Marker:timestamp=2059000;marker-1

S->C: binary audio packets...

S->C: html-speech/1.0 ... SPEAK-COMPLETE 3257 COMPLETE
        Completion-Cause:000 normal
        Speech-Marker:timestamp=5011000

S->C: binary audio packets...

S->C: binary audio packet: end of stream ( message type = 0x03 )
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           request-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+

Design Exclusions

People familiar with other speech protocols, such as MRCP, may expect more of that suite of functionality to be incorporated into this protocol proposal. However, HTML speech scenarios are in some cases quite different from the IVR scenarios that influenced MRCP's design. Furthermore, the architecture of the web, and the use of WebSockets, introduce different design options and constraints. This section contains various design options that were considered because of their association with MRCP, but were ultimately rejected as either unnecessary or out-of-scope.

Exclusion of RECORD

MRCP specifies an optional resource that can be used to perform audio recording. This is of particular use in telephony applications, where there are no alternatives, and where speech end-pointing is useful for ensuring that long periods of silence aren't recorded. However, given the other current and anticipated multimedia capabilities of a typical HTML user agent, the MRCP recorder resource appears to be of low value in HTML speech scenarios. The XG's conclusion is that it isn't a problem HTML Speech needs to solve. Applications that do want a speech-based recording function can use the recognizer resource with a grammar that recognizes any speech, for example by including <ruleref special="GARBAGE"> in the grammar, and retaining the audio.

Exclusion of Speaker Verification and other Non-Recognition Audio-in Scenarios

These scenarios can be handled using the RECOGNIZE method. DEFINE-GRAMMAR can be used to specify not just grammars but arbitrary models used to derive some kind of interpretation of the input. Since DEFINE-GRAMMAR will already be used to specify both SRGS grammars and SLMs, it could also be used to point to arbitrary models that perform other kinds of processing. Looking beyond EMMA to the JavaScript result API, the result of this processing should probably surface in the 'interpretation' field.

Exclusion of User Enrollment

User enrollment is out of scope for the HTML Speech XG. It could be achieved through proprietary extensions to the 'recognizer' resource.

Exclusion of DTMF

DTMF is a telephone dial pad feature, and is not an HTML or speech input method.

Exclusion of Queueing

Queueing is important when multiple synthesis transactions need to render to a single output stream in IVR. However, in HTML apps, there is no direct link between the synthesizer output and the UA's speaker. Multiple synthesis transactions can render at the same time (and often will), and are played back by the UA only when the app requires.

Exclusion of SDP

SDP is used in the SIP session negotiation prior to an MRCP session, in order to describe the MRCP session and associated audio transmission sessions, including IP address, ports and encoding formats. However, SDP is not used in the HTML Speech protocol, mainly because the problems it solves are either already addressed by mechanisms in the HTML Speech protocol, or are outside the design parameters of HTML Speech. Firstly, the initial WebSockets handshake both describes and establishes the session. Secondly, media formats are fully described by MIME media types (see [MIME-RTP]). Thirdly, SDP is concerned with many details that have either already been determined by the time the WebSocket has been established (e.g. IP address and port), or are of no relevance to HTML Speech (e.g. user roster).

Exclusion of RTP

MRCP relies on RTP to transport audio. Since HTML Speech audio is transmitted over a WebSockets connection, there is no need for an additional audio transport protocol. HTML Speech audio packets [REQUIREMENTS] only need a stream identifier, and have no use for the other RTP header fields (optional padding, an optional header extension, the optional designation of contributing sources, an optional marker bit, and a sequence number), which would add unneeded complexity for no additional benefit.

Future scenarios may imply a reason to use RTP out-of-band, and can extend the signaling protocol to support this. However, even then, they will need to perform RTP session establishment quite differently from the way it is done in MRCP, which essentially models a phone call. In HTML applications, speech audio streams will be transitory and potentially overlapping.

Although RTP would work, it would result in added complexity for no apparent benefit. Our requirements are much simpler than the problems RTP solves.

References

[MIME-RTP]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[MRCPv2]
MRCP version 2 http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24
[REQUIREMENTS]
Protocol Requirements http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0030/protocol-reqs-commented.html
[RFC3555]
MIME Type Registration of RTP Payload Formats http://www.ietf.org/rfc/rfc3555.txt
[RFC5646]
Tags for Identifying Languages http://tools.ietf.org/html/rfc5646
[WS-API]
Web Sockets API, http://www.w3.org/TR/websockets/
[WS-PROTOCOL]
Web Sockets Protocol http://tools.ietf.org/pdf/draft-ietf-hybi-thewebsocketprotocol-09.pdf