Author: Robert Brown, Microsoft
This isn't a formal proposal. Just shared as a framework for discussion. If we like this approach, we can flesh it out.
The basic approach is to use WebSockets [WS-PROTOCOL] as the transport for both audio and signaling, such that any interaction session with a service can be accomplished with a single WebSockets session.
The WebSockets session is established through the standard WebSockets HTTP handshake, with the Sec-WebSocket-Protocol header identifying the "html-speech" subprotocol.
For example:
GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
Host: examplespeechservice.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 7
Sec-WebSocket-Protocol: html-speech
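From the client's perspective, using the HTML WebSockets API [WS-API], this handshake could be initiated with something like the following sketch (the URL and subprotocol string simply mirror the example above; whether to use ws: or wss: is left open):

// Minimal client-side sketch: open an html-speech session over a single WebSockets connection.
const socket = new WebSocket(
  "ws://examplespeechservice.com/speechservice123?customparam=foo&otherparam=bar",
  "html-speech"                    // negotiated via Sec-WebSocket-Protocol
);
socket.binaryType = "arraybuffer"; // prefer binary audio messages where the implementation allows it

socket.onopen = () => {
  // both signaling and audio messages can now be sent on this one session
};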
Audio is packetized and transmitted as a series of short messages, between which signaling messages may be sent.
message = audio-packet | control-message
There is no strict constraint on the size and frequency of audio packet messages. Nor is there a requirement for all audio packets in a stream to encode the same duration of sound. Most implementations will seek to minimize user-perceived latency by sending packet messages that encode between 20 and 80 milliseconds of sound. Since the overhead of a WebSockets frame is typically trivial (a few bytes), implementations should err on the side of sending smaller packets more frequently. While this principle is important for recognition, it also applies to synthesis, so that interim events such as marks may be closely synchronized with the corresponding part of the audio rendering.
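As a rough illustration of the packet sizing described above (assuming, purely for the sake of the example, uncompressed 16 kHz 16-bit mono PCM), a 20 millisecond packet works out to 640 bytes, sent 50 times per second:

// Illustrative packet-size arithmetic; the codec parameters are assumptions, not part of the proposal.
const sampleRate = 16000;      // samples per second
const bytesPerSample = 2;      // 16-bit samples
const packetMs = 20;           // 20 ms of sound per audio message

const bytesPerPacket = (sampleRate * bytesPerSample * packetMs) / 1000; // 640 bytes
const packetsPerSecond = 1000 / packetMs;                               // 50 messages per second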
Audio messages may be expressed as either binary messages or text messages. Binary is the preferred format, in order to avoid the 33% transmission overhead of base-64 encoded text messages. However, some WebSockets implementations might not support binary messages, since these are not exposed in the HTML WebSockets API [WS-API], so text messages are specified as a better-than-nothing alternative.
The sequence of audio messages represents an in-order contiguous stream of audio data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the decoder. Timing is calculated by decoded offset from the beginning of the stream.
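For example, with a known sample rate the receiver can derive the time offset of any point in the stream purely from the number of samples decoded so far (again assuming uncompressed PCM for simplicity):

// Sketch: timing derived from decoded offset rather than per-packet timestamps.
function offsetMs(samplesDecoded: number, sampleRate: number): number {
  return (samplesDecoded / sampleRate) * 1000;
}
// e.g. 48000 samples decoded at 16 kHz puts the receiver 3000 ms into the stream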
The audio message format itself has no notion of silence. Some codecs encode silence efficiently within the audio stream, and there is also a control message (TBD) that can be used to indicate silence explicitly.
A synthesis service MAY send audio faster than real-time, and the client MUST be able to handle this. A client MAY send audio to a recognition service slower than real time. While this generally causes undesirable user interface latency, it may be necessary and unavoidable due to practical limitations of the network. Hence a recognition service MUST be prepared to receive slower-than-real-time audio.
Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.
The vast majority of applications, at least in the beginning, will have a single audio channel. However, advanced implementations of HTML Speech may also incorporate multiple channels of audio in a single transmission. For example, living-room devices with microphone arrays may send separate streams in order to capture the speech of multiple individuals within the room. Or, for example, some devices may send parallel streams with alternative encodings that may not be human-consumable but contain information that is of particular value to a recognition service, or contain other non-audio media (video, gesture streams, etc). The protocol future-proofs for this scenario by incorporating a channel ID into each message, so that separate audio channels can be multiplexed onto the same session. Channel IDs are selected by the originator of the stream, and only need to be unique within the set of channels being transmitted by that originator.
audio-packet = binary-audio-packet | text-audio-packet
binary-audio-packet = binary-header binary-data ; sent as a binary message
binary-header = binary-message-type binary-channel-id binary-reserved
binary-message-type = OCTET ; Reserved octet to distinguish potential future binary message types.
; For now, set it to 0x01, since audio fragment is the only type.
binary-channel-id = OCTET ; Typically there's only one channel, so this will be set to 0x01.
binary-reserved = 2OCTET ; Just buffers out to the 32-bit boundary for now, but may be useful later.
binary-data = *OCTET ; The raw binary output of the audio encoder.
text-audio-packet = "AUDIO-FRAGMENT" SP version CRLF ; simpler than the signal message format below
"channel" = 1*DIGIT CRLF
CRLF
*base64
version = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
base64 = ALPHA | DIGIT | "+" | "/" | "=" ; obvious
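For illustration, a binary audio packet matching the grammar above could be assembled on the client along these lines (the 4-octet header layout is as defined above; the encoder producing the payload is assumed):

// Sketch: wrap one buffer of encoder output in the 4-octet binary header defined above.
function buildBinaryAudioPacket(encodedAudio: Uint8Array, channelId = 0x01): ArrayBuffer {
  const packet = new Uint8Array(4 + encodedAudio.length);
  packet[0] = 0x01;      // binary-message-type: audio fragment (the only type defined so far)
  packet[1] = channelId; // binary-channel-id
  packet[2] = 0x00;      // binary-reserved
  packet[3] = 0x00;      // binary-reserved
  packet.set(encodedAudio, 4);
  return packet.buffer;  // socket.send(packet.buffer) transmits it as a binary message
}

The text fallback would instead send a single text message consisting of the AUDIO-FRAGMENT start-line, the channel header, a blank line, and the base64-encoded audio data.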
There are three classes of control messages: requests, responses, and events. The specific set of control messages is TBD. Some examples of potential messages are included below. [MRCP2] should serve as a reasonable starting point upon which at least a subset of the necessary messages may be based.
The pattern presented here can be fleshed out to satisfy the functional needs of the API, as well as provide reasonable future-proofing for advanced applications (much like the WebSockets protocol is future-proofed beyond what's in the WebSockets API currently).
control-message = start-line ; i.e. use the typical MIME message format
*(header CRLF)
CRLF
[body]
start-line = request-line | response-line | event-line
header = <MIME header format> ; actual headers depend on the type of message
body = *OCTET ; depends on the type of message
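Because control messages use the familiar start-line / headers / blank-line / body layout, serializing one is straightforward. A minimal sketch (the specific headers and bodies depend on the message type, as noted above):

// Sketch: serialize a control message as start-line, headers, blank line, then an optional body.
function buildControlMessage(startLine: string,
                             headers: Record<string, string>,
                             body = ""): string {
  const headerLines = Object.entries(headers)
    .map(([name, value]) => `${name}: ${value}\r\n`)
    .join("");
  return `${startLine}\r\n${headerLines}\r\n${body}`;
}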
Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets session. Any status or event messages related to a request use the same request-id.
request-line = method-name SP request-id SP version CRLF
method-name = reco-method | synth-method | audio-control-method
request-id = 1*DIGIT
reco-method = "SET-PARAMS" ; for example (these are cut/pasted from MRCP for illustrative purposes)
| "GET-PARAMS"
| "DEFINE-GRAMMAR"
| "RECOGNIZE"
| "GET-RESULT"
| "RECOGNITION-START-TIMERS"
| "STOP"
synth-method = "SET-PARAMS" ; for example (these are cut/pasted from MRCP for illustrative purposes)
| "GET-PARAMS"
| "SPEAK"
| "STOP"
| "PAUSE"
| "RESUME"
| "BARGE-IN-OCCURRED"
| "CONTROL"
audio-control-method = "START-AUDIO" ; indicates the codec, and perhaps other params, for an audio channel
| "JUMP-AUDIO" ; for example, to jump over a period of silence during SR, or shuttle control in TTS
Status messages are sent by the server in response to requests from the client. They report the request state in the same way that MRCP does.
response-line = version SP request-id SP status-code SP request-state CRLF
status-code = <similar to MRCP>
request-state = "COMPLETE" ; i.e. same as MRCP
| "IN-PROGRESS"
| "PENDING"
Event messages are sent by the server, in a similar manner to MRCP event messages.
event-line = event-name SP request-id SP request-state SP version CRLF
event-name = reco-event | synth-event | audio-control-event
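To illustrate how a client might distinguish the two server-to-client start-line shapes: a response line begins with the version token, while an event line ends with it. A rough sketch (not a proposed API):

// Sketch: classify a server-to-client start-line as a response or an event, per the grammar above.
function classifyStartLine(line: string): { kind: "response" | "event"; requestId: string } {
  const parts = line.trim().split(" ");
  if (parts[0].startsWith("html-speech/")) {
    // response-line = version SP request-id SP status-code SP request-state
    return { kind: "response", requestId: parts[1] };
  }
  // event-line = event-name SP request-id SP request-state SP version
  return { kind: "event", requestId: parts[1] };
}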
For both sending and receiving audio, the client chooses the media format (codec, sample rate, etc). The user agent either decides on its own, or takes into consideration a set of preferences provided by the application. An error will occur if the client selects a format that the service does not support. Similar situations can occur with the selection of unsuitable parameters for language/locale/gender/age, etc. This sort of error is completely avoidable in most cases because the client has prior knowledge of the service's capabilities. The UA knows exactly how to use its default service, and it may even have prior knowledge of well-known services. When an application selects a particular service, it does so knowingly, and so the developer will make a deliberate codec selection if necessary.
However, in some cases specific service capabilities will not be known ahead of time. Rather than specify some sort of to-and-fro negotiation mechanism (e.g. SIP's INVITE-OK-ACK sequence), the client should just use a GET-PARAMS request to determine available codecs, etc., prior to sending audio or making SR or TTS requests.
Because there's no media stream negotiation, and media is sent in-band, the beginning of a media stream is indicated by a "START-AUDIO" method, which would contain the necessary info for the receiver to be able to decode the stream (the codec and its parameters). Depending on where we want to do timestamp offset calculations, this message may also include a base timestamp.
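Tying the last two points together, a client that doesn't know the service's capabilities ahead of time could issue GET-PARAMS, choose a supported codec from the response, and then announce the stream with START-AUDIO before sending any audio packets. A rough sketch of that sequence, reusing the earlier helpers (all header names and values here are hypothetical, since the actual headers are TBD):

// Sketch of the codec-selection flow described above; header names and values are hypothetical.
socket.send(buildControlMessage("GET-PARAMS 1 html-speech/1.0", {}));

// ...after reading the GET-PARAMS response and choosing a codec the service supports...

socket.send(buildControlMessage("START-AUDIO 2 html-speech/1.0", {
  "Channel": "1",            // the channel the following audio packets belong to
  "Codec": "audio/basic",    // hypothetical codec identifier
  "Base-Timestamp": "0"      // if we decide START-AUDIO should carry one
}));

// audio packets for channel 1 may now be sent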