In addition to minor overall editing to aid readability, the following changes were incorporated in response to feedback on the third draft:
The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2]. However, since the use cases for HTML Speech applications are in some places considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not merely a transcription of MRCP: it shares some design concepts, while simplifying some details and adding others. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.
TODO: Add a sentence or two about the higher level motivation.
This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.
            Client
|-----------------------------|
| HTML Application            |                                  Server
|-----------------------------|                                 |--------------------------|
| HTML Speech API             |                                 | Synthesizer | Recognizer |
|-----------------------------|                                 |--------------------------|
| HTML-Speech Protocol Client |---html-speech/1.0 subprotocol---| HTML-Speech Server       |
|-----------------------------|                                 |--------------------------|
| WebSockets Client           |-------WebSockets protocol-------| WebSockets Server        |
|-----------------------------|                                 |--------------------------|
A Recognizer performs speech recognition, with the following characteristics:
Because continuous recognition plays an important role in HTML Speech scenarios, a Recognizer is a resource that essentially acts as a filter on its input streams. Its grammars/language models can be specified and changed, as needed by the application, and the recognizer adapts its processing accordingly. Single-shot recognition (e.g. a user on a web search page presses a button and utters a single web-search query) is a special case of this general pattern, where the application specifies its model once, and is only interested in one match event, after which it stops sending audio (if it hasn't already).
"Recognizers" are not strictly required to perform speech recognition, and may perform additional or alternative functions, such as speaker verification, emotion detection, or audio recording.
A Synthesizer generates audio streams from textual input. It essentially provides a media stream with additional events, which the client buffers and plays back as required by the application. A Synthesizer service has the following characteristics:
Because a Synthesizer resource only renders a stream, and is not responsible for playback of that stream to a user, it does NOT:
TODO: There were some clarifying questions around this in the spec review. Robert Brown to expand on this, perhaps with an example.
TODO: add a section on security. Include authentication, encryption, transitive authorization to fetch resources.
The WebSockets session is established through the standard WebSockets HTTP handshake, with these specifics:
RB: I removed the ability to pass standard parameters in the query string. We didn't seem to have solid agreement on this in the calls where we reviewed the 3rd draft. Are we okay with this? If we want or need to support this, we'll need to specify a subset and provide examples.
TODO: Clarify that service parameters may be specified in the query string, but may be overridden using messages in the html-speech/1.0 websockets protocol once the websockets session has been established.
For example:
C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
Host: examplespeechservice.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: OIUSDGY67SDGLjkSD&g2 (for example)
Sec-WebSocket-Version: 9
Sec-WebSocket-Protocol: html-speech/1.0, x-proprietary-speech
S->C: HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: html-speech/1.0
Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, responses or media.
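For illustration only, here is a minimal client-side sketch (TypeScript, assuming a browser-style WebSocket API; the URL and handler names are placeholders, not part of this specification) of establishing the session and separating signaling from media on the socket:

const ws = new WebSocket("wss://examplespeechservice.com/speechservice123",
                         ["html-speech/1.0"]);
ws.binaryType = "arraybuffer";

ws.onopen = () => {
  // The server is expected to have selected the html-speech/1.0 subprotocol.
  if (ws.protocol !== "html-speech/1.0") {
    ws.close();
    return;
  }
  // Control messages may now be sent as text frames, and media as binary frames.
};

ws.onmessage = (ev: MessageEvent) => {
  if (typeof ev.data === "string") {
    handleControlMessage(ev.data);              // status and event messages
  } else {
    handleMediaPacket(ev.data as ArrayBuffer);  // media packets
  }
};

// Placeholder handlers, assumed to be defined elsewhere.
declare function handleControlMessage(message: string): void;
declare function handleMediaPacket(packet: ArrayBuffer): void;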
A session may have a maximum of one synthesizer resource and one recognizer resource. If an application requires multiple resources of the same type (for example, two synthesizers from different vendors), it must use separate WebSockets sessions.
NOTE: In MRCP, session negotiation also involves negotiating unique channel IDs (e.g. 128397521@recognizer) for the various resource types the client will need (recognizer, synthesizer, etc). In html-speech/1.0 this is unnecessary, since the WebSockets connection itself provides a unique shared context between the client and server, and resources are referred to directly by type, without the need for channel-IDs.
There is no association of state between sessions. If a service wishes to provide a special association between separate sessions, it may do so behind the scenes (for example, re-using audio input from one session in another session without resending it, or causing service-side barge-in of TTS in one session by recognition in another session, would be service-specific extensions).
TODO: clarify that advanced scenarios involving multiple engines of the same resource type, or using the same input audio stream for consumption by different types of vendor-specific resources, are out of scope. These may be implemented behind the scenes by the service.
The signaling design borrows its basic pattern from [MRCPv2], where there are three classes of control messages:
control-message = start-line ; i.e. use the typical MIME message format
*(header CRLF)
CRLF
[body]
start-line = request-line | status-line | event-line
header = <Standard MIME header format> ; actual headers depend on the type of message
body = *OCTET ; depends on the type of message
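As a non-normative sketch, a control message in this format could be composed as follows (TypeScript; the helper name and argument shapes are illustrative, not part of the protocol):

function buildControlMessage(startLine: string,
                             headers: Record<string, string>,
                             body?: string): string {
  const CRLF = "\r\n";
  let message = startLine + CRLF;
  for (const [name, value] of Object.entries(headers)) {
    message += name + ":" + value + CRLF;    // one header per line
  }
  message += CRLF;                           // blank line ends the headers
  return body ? message + body : message;
}

// e.g. the SPEAK request from the example below:
const speak = buildControlMessage(
  "html-speech/1.0 SPEAK 3257",
  { "Resource-ID": "synthesizer",
    "Audio-codec": "audio/basic",
    "Content-Type": "text/plain" },
  "Hello world! I speak therefore I am.");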
The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.
For example:
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Audio-codec:audio/basic
Content-Type:text/plain
Hello world! I speak therefore I am.
S->C: html-speech/1.0 3257 200 IN-PROGRESS
S->C: media for 3257
C->S: html-speech/1.0 SPEAK 3258
Resource-ID:synthesizer
Audio-codec:audio/basic
Content-Type:text/plain
As for me, all I know is that I know nothing.
S->C: html-speech/1.0 3258 200 IN-PROGRESS
S->C: media for 3258
S->C: more media for 3257
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
S->C: more media for 3258
S->C: html-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE
The service MAY choose to serialize its processing of certain requests (such as only rendering one SPEAK request at a time), but MUST still accept multiple active requests.
generic-header = accept
               | accept-charset
               | content-base
               | logging-tag
               | resource-id
               | vendor-specific
               | content-type
               | content-encoding
resource-id = "Resource-ID:" ("recognizer" | "synthesizer" | vendor-resource)
vendor-resource = "x-" 1*UTFCHAR
accept = <same as [MRCPv2]>
accept-charset = <same as [MRCPv2]>
content-base = <same as [MRCPv2]>
content-type = <same as [MRCPv2]>
content-encoding = <same as [MRCPv2]>
logging-tag = <same as [MRCPv2]>
vendor-specific = <same as [MRCPv2]>
NOTE: This is mostly a strict subset of the [MRCPv2] generic headers, many of which have been omitted as either unnecessary or inappropriate for HTML speech client/server scenarios.
Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets html-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have an integer value between 0 and 10^10-1 (i.e. 1-10 decimal digits).
TODO: revise this number. 2^16-1 seems small. It may have security implications. Perhaps 2^24-1?
request-line = version SP method-name SP request-id CRLF
version = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
method-name = general-method | synth-method | reco-method | proprietary-method
request-id = 1*10DIGIT
NOTE: In MRCP, all messages include their message length, so that they can be framed in what is otherwise an open stream of data. In html-speech/1.0, framing is already provided by WebSockets, and message length is not needed, and therefore not included.
For example, to request the recognizer to interpret text as if it were spoken:
C->S: html-speech/1.0 INTERPRET 8322
Resource-ID: recognizer
Active-Grammars: <http://myserver/mygrammar.grxml>
Interpret-Text: Send a dozen yellow roses and some expensive chocolates to my mother
Status messages are sent by the server, to indicate the state of a request.
status-line = version SP request-id SP status-code SP request-state CRLF
status-code = 3DIGIT ; Specific codes TBD, but probably similar to those used in MRCP
; All communication from the server is labeled with a request state.
request-state = "COMPLETE" ; Processing of the request has completed.
| "IN-PROGRESS" ; The request is being fulfilled.
| "PENDING" ; Processing of the request has not begun.
Specific status code values would follow the general pattern used in [MRCPv2]:
TODO: Determine status code values.
Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results. They are essentially specialized status messages.
event-line = version SP event-name SP request-id SP request-state CRLF
event-name = synth-event | reco-event | proprietary-event
For example, an event indicating that the recognizer has detected the start of speech:
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
Resource-ID: recognizer
Source-time: 12753439912 (when speech was detected)
HTML Speech applications feature a wide variety of media transmission scenarios. The number of media streams at any given time is not fixed. A recognizer may accept one or more input streams, which may start and end at any time as microphones or other input devices are activated/deactivated by the application or the user. Recognizers do not require their data in real-time, and will generally prefer to wait for delayed packets in order to maintain accuracy, whereas a human listener would rather just tolerate the clicks and pops of missing packets so they can continue listening in real time. Applications may, and often will, request the synthesis of multiple SSML documents at the same time, which are buffered by the UA for playback at the application's discretion. The synthesizer needs to return rendered data to the client rapidly (generally faster than real time), and MAY render multiple requests in parallel if it has the capacity to do so.
Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, living-room devices with microphone arrays may send separate streams in order to capture the speech of multiple individuals within the room. Or, for example, some devices may send parallel streams with alternative encodings that may not be human-consumable (like standard codecs) but contain information that is of particular value to a recognition service.
In html-speech/1.0, audio (or other media) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session used for the control messages.
media-packet = binary-message-type
binary-stream-id
binary-data
binary-message-type = OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-stream-id = 3OCTET ; Unique identifier for the stream, 0..2^24-1
binary-data = *OCTET
TODO: reduce message-type to a few bits, and expand request-id.
The binary-stream-id field is used to identify the messages for a particular stream. It is a 24-bit unsigned integer. Its value for any given stream is assigned by the sender (client or server) in the first message of that stream, and must be unique to the sender within the WebSockets session.
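As an illustrative sketch (TypeScript), a media packet with this layout can be framed as a single binary WebSockets message. The 0x02 media-data type value and the big-endian stream-id byte order are assumptions inferred from the examples that follow, not normative statements, and the additional timestamp and media-type fields of a start-of-stream packet are omitted:

function frameMediaPacket(messageType: number,     // 0x01 start, 0x03 end of stream; 0x02 assumed for data
                          streamId: number,        // 0 .. 2^24-1
                          payload: Uint8Array): ArrayBuffer {
  if (streamId < 0 || streamId > 0xFFFFFF) {
    throw new RangeError("stream-id must fit in 24 bits");
  }
  const packet = new Uint8Array(4 + payload.length);
  packet[0] = messageType & 0xFF;                  // 1-octet binary-message-type
  packet[1] = (streamId >> 16) & 0xFF;             // 3-octet binary-stream-id
  packet[2] = (streamId >> 8) & 0xFF;              //   (byte order assumed big-endian)
  packet[3] = streamId & 0xFF;
  packet.set(payload, 4);                          // binary-data
  return packet.buffer;
}

// e.g. ws.send(frameMediaPacket(0x02, 112233, encodedAudioChunk));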
The binary-message-type field has these defined values:
TODO: note that at least muLaw/aLaw/PCM must be supported.
message type = 0x01; stream-id = 112233; media-type = audio/amr-wb
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 61 6D | o / a m
| 72 2D 77 62 | r - w b
+---------------------------------------------------------------+
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
| encoded audio data |
| ... |
| ... |
| ... |
+---------------------------------------------------------------+
TODO: Change "Audio" to "Media". Clarify that its encoding is specified by a mime content type in the request that initiated the stream (SPEAK or START-MEDIA-STREAM), and while it will usually be some form of audio encoding, it MAY be any content type, including text, pen strokes, touches/clicks, compass bearings, etc.
TODO: DELETE the Skip message type.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
TODO: add a paragraph about the interleaving of audio and signaling.
A sequence of media messages with the same stream-ID represents an in-order contiguous stream of data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the consumer (e.g. the recognizer's decoder, or the TTS playback buffer). Timing of coordinated events is calculated by decoded offset from the beginning of the stream.
Media streams are multiplexed with signaling messages. Multiple media streams can also be multiplexed on the same socket. The WebSockets stack de-multiplexes text and binary messages, thus separating signaling from media, while the stream-ID on each media message is used to de-multiplex the messages into separate media streams.
There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. However, implementations SHOULD seek to minimize interference with the flow of other messages on the same socket, by sending messages that encode between 20 and 80 milliseconds of media. Since a WebSockets frame header is typically only 4 bytes, overhead is minimal and implementations SHOULD err on the side of sending smaller packets more frequently.
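As a rough worked example (assuming the uncompressed audio/basic codec, i.e. 8-bit mu-law at 8000 samples per second), the suggested 20-80 millisecond range corresponds to quite small payloads:

// 8000 samples/s x 1 byte per sample = 8000 bytes per second of audio/basic.
const bytesPerSecond = 8000;
const payload20ms = bytesPerSecond * 0.020;   // 160 bytes
const payload80ms = bytesPerSecond * 0.080;   // 640 bytes
// Even with the ~4-byte WebSockets frame header plus the 4-byte media packet
// header, per-packet overhead remains a small fraction of the payload.
console.log(payload20ms, payload80ms);        // 160 640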
A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.
A recognition service MUST be prepared to receive slower-than-real-time audio due to practical throughput limitations of the network.
Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.
TODO: Write the rationale for why we mix media and signal in the same session. [Michael Johnston]
TODO: There is an open issue to do with transitive access control. The client sends a URI to the service, which the client can access, but the service cannot, because it is not authorized to do so. How does the client grant access to the resource to the service? There are two design contenders. The first is to use the cookie technique that MRCP uses. The second is to use a virtual tag, which we discussed briefly at the F2F - Michael Bodell owes a write-up. In the absence of that write-up, perhaps the default position should be to use cookies.
TODO: Specify which headers are sticky. URI request parameters aren't standardized.
The GET-PARAMS and SET-PARAMS requests are the same as their [MRCPv2] counterparts. They are used to discover and set the configuration parameters of a resource (recognizer or synthesizer). Like all messages, they must always include the Resource-ID header. SET-PARAMS and GET-PARAMS work with global parameter settings; individual requests may set different values that apply only to that request.
general-method = "SET-PARAMS"
| "GET-PARAMS"
header = capability-query-header
| interim-event-header
| reco-header
| synth-header
capability-query-header =
"Supported-Content:" mime-type *("," mime-type)
| "Supported-Languages:" lang-tag *("," lang-tag) ; See [RFC5646]
| "Builtin-Grammars:" "<" URI ">" *("," "<" URI ">")
interim-event-header =
"Interim-Events:" event-name *("," event-name)
event-name = 1*UTFCHAR
Additional headers are introduced in html-speech/1.0 to provide a way for the application/UA to determine whether a resource supports the basic capabilities it needs. In most cases applications will know the service's resource capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine resource capabilities, the UA sends a GET-PARAMS request to the resource, containing a set of capabilities, to which the resource responds with the specific subset it actually supports.
TODO: how do we check for other configuration settings? e.g. what grammars are available? e.g. supported grammar format (srgs-xml vs srgs-ebnf vs some-slm-format).
ISSUE: this could become unwieldy as more parameters are added. Is there a more generic approach?
For example:
C->S: html-speech/1.0 GET-PARAMS 34132
resource-id: recognizer
supported-content: audio/basic, audio/amr-wb,
audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
audio/dsr-es202212; rate:8000; maxptime:40,
application/x-ngram+xml
supported-languages: en-AU, en-GB, en-US, en (A variety of English dialects are desired)
builtin-grammars: <builtin:dictation?topic=websearch>,
<builtin:dictation?topic=message>,
<builtin:ordinals>,
<builtin:datetime>,
<builtin:cities?locale=USA>
S->C: html-speech/1.0 34132 200 COMPLETE
resource-id: recognizer
supported-content: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
supported-languages: en-GB, en (The recognizer supports UK English, but will work with any English)
builtin-grammars: <builtin:dictation?topic=websearch>, <builtin:dictation?topic=message>
C->S: html-speech/1.0 GET-PARAMS 48223
resource-id: synthesizer
supported-content: audio/ogg, audio/flac, audio/basic
supported-languages: en-AU, en-GB
S->C: html-speech/1.0 48223 200 COMPLETE
resource-id: synthesizer
supported-content: audio/flac, audio/basic
supported-languages: en-GB
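A client might check the returned subset against its own needs along these lines (TypeScript sketch; the interface and matching rules are illustrative, and parsing of the response headers is assumed to happen elsewhere):

interface Capabilities {
  supportedContent: string[];     // parsed from Supported-Content
  supportedLanguages: string[];   // parsed from Supported-Languages
  builtinGrammars: string[];      // parsed from Builtin-Grammars
}

function meetsRequirements(offered: Capabilities,
                           neededCodec: string,
                           neededLanguage: string): boolean {
  const codecOk = offered.supportedContent.includes(neededCodec);
  const langOk = offered.supportedLanguages.some(tag =>
    tag === neededLanguage || neededLanguage.startsWith(tag + "-"));
  return codecOk && langOk;
}

// e.g. with the recognizer response above (supported-languages: en-GB, en),
// meetsRequirements(caps, "audio/basic", "en-US") is true, because the bare
// "en" tag indicates the recognizer will work with any English dialect.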
Speech services may care to send optional vendor-specific interim events during the processing of a request. For example, some recognizers are capable of providing additional information as they process input audio, and some synthesizers are capable of firing progress events on word, phoneme, and viseme boundaries. These are exposed through the HTML Speech API as events that the webapp can listen for if it knows to do so. A service vendor MAY require a vendor-specific value to be set with SET-PARAMS before it starts to fire certain events.
interim-event = version SP "INTERIM-EVENT" SP request-id SP request-state CRLF
*(header CRLF)
CRLF
[body]
event-name-header = "Event-Name:" event-name
The Event-Name header is required and must contain a value that was previously subscribed to with the Interim-Events header.
The Request-ID and Content-Type headers are required, and any data conveyed by the event must be contained in the body.
Applications will generally want to select resources with certain capabilities, such as the ability to recognize certain languages, work well in specific acoustic conditions, work well with specific genders or ages, speak particular languages, speak with a particular style, age or gender, etc.
There are three ways in which resource selection can be achieved, each of which has relevance:
Any service may enable applications to encode resource requirements as query string parameters in the URI, or to use specific URIs associated with known resources. The specific URI format and parameter scheme is by necessity not standardized, and is defined by the implementer based on their architecture and service offerings. For example:
ws://example1.net:2233/webreco/de-de/cell-phone
ws://example2.net/?reco-lang=en-UK&reco-acoustic=10-foot-open-room&sample-rate=16kHz&channels=2
ws://example3.net/speech?reco-lang=es-es&tts-lang=es-es&tts-gender=female&tts-vendor=acmesynth
ws://example4.com/profile=af3e-239e-9a01-66c0
Request headers may also be used to select specific resource capabilities. Synthesizer parameters are set through SET-PARAMS or SPEAK, whereas recognizer parameters are set through SET-PARAMS or LISTEN. There is a small set of standard headers that can be used with each resource: the Speech-Language header may be used with both the recognizer and synthesizer, and the synthesizer may also accept a variety of voice selection parameters as headers. However, a resource does not need to support these headers where it does not have the ability to do so. If a particular header value is unsupported, the request should fail with a status of 407 "Unsupported Header Field Value". For example:
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: Recognizer
Speech-Language: fr-CA
C->S: html-speech/1.0 SET-PARAMS 8323
Resource-ID: Recognizer
Speech-Language: pt-BR
C->S: html-speech/1.0 SPEAK 8324
Resource-ID: Synthesizer
Speech-Language: ko-KR
Voice-Age: 35
Voice-Gender: female
C->S: html-speech/1.0 SET-PARAMS 8325
Resource-ID: Synthesizer
Speech-Language: sv-SE
Voice-Name: Kiana
The [SRGS] and [SSML] input documents for the recognizer and synthesizer will specify the language for the overall document, and MAY specify languages for specific subsections of the document. The resource consuming these documents SHOULD honor these language assignments when they occur. If a resource is unable to do so, it should fail with a 4xx status "Unsupported content language". (It should be noted that, at the time of writing, most currently available recognizer and synthesizer implementations are unable to support this capability.)
Generally speaking, unless a service is unusually adaptable, applications are better off using specific URLs that encode the abilities they need, so that the appropriate resources can be allocated during session initiation.
A recognizer resource is either in the "listening" state, or the "idle" state. Because continuous recognition scenarios often don't have dialog turns or other down-time, all functions are performed in series on the same input stream(s). The key distinction between the idle and listening states is the obvious one: when listening, the recognizer processes incoming media and produces results; whereas when idle, the recognizer SHOULD buffer audio but will not process it. For example: text dictation applications commonly have a variety of command grammars that are activated and deactivated to enable editing and correction modes; in open-microphone multimodal applications, the application will listen continuously, but change the set of active grammars based on the user's other non-speech interactions with the app. Grammars can be loaded, and rules activated or deactivated, while the recognizer is idle (but not while it is listening).
TODO: some turn based recognizers can't change state in a recognition. What happens then? Answer: state should only change while idle.
TODO: what happens to timers if we enter the listening state without having an input stream? Timers should be based on the start of the input stream. Answer: can't listen when there's no input.
Recognition is accomplished with a set of messages and events, to a certain extent inspired by those in [MRCPv2].
Idle State                Listening State
    |                            |
    |--\                         |
    |   DEFINE-GRAMMAR           |
    |<-/                         |
    |                            |
    |--\                         |
    |   SET-GRAMMARS             |
    |<-/                         |
    |                            |
    |--\                         |--\
    |   GET-GRAMMARS             |   GET-GRAMMARS
    |<-/                         |<-/
    |                            |
    |--\                         |
    |   INFO                     |
    |<-/                         |
    |                            |
    |---------LISTEN------------>|
    |                            |
    |                            |--\
    |                            |   INTERIM-EVENT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   START-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   START-INPUT-TIMERS
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   END-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   INFO
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   INTERMEDIATE-RESULT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   RECOGNITION-COMPLETE
    |                            |   (when mode = recognize-continuous)
    |                            |<-/
    |                            |
    |<---RECOGNITION-COMPLETE----|
    |(when mode = recognize-once)|
    |                            |
    |                            |
    |<--no media streams remain--|
    |                            |
    |                            |
    |<----------STOP-------------|
    |                            |
    |                            |
    |<---some 4xx/5xx errors-----|
    |                            |
    |--\                         |--\
    |   INTERPRET                |   INTERPRET
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |   INTERPRETATION-COMPLETE  |   INTERPRETATION-COMPLETE
    |<-/                         |<-/
    |                            |
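For illustration, a client could mirror this state machine with a small tracker like the following (TypeScript sketch; the names are illustrative, and the authoritative state is whatever the service reports in the Recognizer-State header):

type RecognizerState = "idle" | "listening";

function nextState(current: RecognizerState,
                   message: string,                // method or event name
                   listenMode?: "reco-once" | "reco-continuous"): RecognizerState {
  if (current === "idle" && message === "LISTEN") {
    return "listening";
  }
  if (current === "listening") {
    if (message === "STOP") return "idle";
    if (message === "RECOGNITION-COMPLETE" && listenMode === "reco-once") return "idle";
    // The recognizer also returns to idle when no input media streams remain,
    // or on certain 4xx/5xx errors (see the diagram above).
  }
  return current;   // everything else (DEFINE-GRAMMAR, INTERPRET, events, ...) is a self-loop
}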
TODO: add a C->S message for sending metadata to the recognizer.
reco-method = "LISTEN" ; Transitions Idle -> Listening
| "START-INPUT-TIMERS" ; Starts the timer for the various input timeout conditions
| "STOP" ; Transitions Listening -> Idle
| "DEFINE-GRAMMAR" ; Pre-loads & compiles a grammar, assigns a temporary URI for reference in other methods
| "SET-GRAMMARS" ; Activates and deactivates grammars and rules
| "GET-GRAMMARS" ; Returns the current grammar and rule state
| "CLEAR-GRAMMARS" ; Unloads all grammars, whether active or inactive
| "INTERPRET" ; Interprets input text as though it was spoken
| "INFO" ; Sends metadata to the recognizer
The LISTEN method transitions the recognizer from the idle state to the listening state. The recognizer then processes the media input streams against the set of active grammars. The request MUST include the Source-Time header, which is used by the Recognizer to determine the point in the input stream(s) that the recognizer should start processing from (which won't necessarily be the start of the stream). The request MUST also include the Listen-Mode header to indicate whether the recognizer should perform continuous recognition, a single recognition, or vendor-specific processing.
A LISTEN request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.
NOTE: LISTEN does NOT use the same grammar specification technique as the MRCP RECOGNIZE method. In html-speech/1.0 this would add unnecessary and redundant complexity, since all the necessary functionality is already present in other html-speech/1.0 methods.
When there are no input media streams, and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the listen request will fail (4xx). When in the listening state, and all input streams have ended, the recognizer automatically transitions to the idle state, and issues a RECOGNITION-COMPLETE event, with Completion-Cause set to 4xx (tbd).
TODO: should be "when" not "if".
TODO: Specify Completion-Cause value for no input stream.
A LISTEN request that is made while the recognizer is already listening results in a 402 error ("Method not valid in this state", since it is already listening).
This is identical to the [MRCPv2] method with the same name. It is useful, for example, when the application wants to enable voice barge-in during a prompt, but doesn't want to start the time-out clock until after the prompt has completed.
TODO: collapse START-MEDIA-STREAM.
The STOP method transitions the recognizer from the listening state to the idle state. No RECOGNITION-COMPLETE event is sent. The Source-Time header MUST be used, since the recognizer may still fire a RECOGNITION-COMPLETE event for any completion state it encounters prior to that time in the input stream.
A STOP request that is sent while the recognizer is idle results in a 402 response (method not valid in this state, since there is nothing to stop).
The DEFINE-GRAMMAR method is similar to its namesake in [MRCPv2]. DEFINE-GRAMMAR does not activate a grammar, it simply causes the recognizer to pre-load and compile it, and associates it with a temporary URI that can then be used to activate or deactivate the grammar or one of its rules. DEFINE-GRAMMAR is not required in order to use a grammar, since the recognizer can load grammars on demand as needed. However, it is useful when an application wants to ensure a large grammar is pre-loaded and ready for use prior to the recognizer entering the listening state. DEFINE-GRAMMAR can be used when the recognizer is in either the listening or idle state.
All recognizer services MUST support grammars in the SRGS XML format, and MAY support additional alternative grammar/language-model formats.
The SET-GRAMMARS method is used to activate and deactivate grammars and rules, using the Active-Grammars and Inactive-Grammars headers. The Source-Time header MUST be used, and activations/deactivations are considered to take place at precisely that time in the input stream(s).
SET-GRAMMARS may only be requested when the recognizer is in the idle state. It will fail (4xx) if requested in the listening state.
The recognizer MUST support grammars in the [SRGS] XML format, and may support grammars (or other forms of language model) in other formats.
ISSUE: Do we need an explicit method for this, or is SET-PARAMS enough? One option is to not allow them on set/get-params. Another is to say that if get/set-params does exactly the same thing, then there's no need for this method. If there's a default set of active grammars, then get-params might be required. Get-params may also be useful for defensive programming. Inline grammars don't have URIs. Suggestion is to add a GET-GRAMMARS, and disallow get/set-params.
The GET-GRAMMARS method is used to query the set of active grammars and rules. The recognizer should respond with a 200 COMPLETE status message, containing an Active-Grammars header that lists all of the currently active grammars and rules.
TODO: Should GET-GRAMMARS also return the list of inactive grammars/rules? It's not clear how that would be useful. Also, the list of inactive rules could be rather long and unwieldy.
In continuous recognition, a variety of grammars may be loaded over time, potentially resulting in unused grammars consuming memory resources in the recognizer. The CLEAR-GRAMMARS method unloads all grammars, whether active or inactive. Any URIs previously defined with DEFINE-GRAMMAR become invalid.
The INTERPRET method is similar to its namesake in [MRCPv2], and processes the input text according to the set of grammar rules that are active at the time it is received by the recognizer. It MUST include the Interpret-Text header. The use of INTERPRET is orthogonal to any audio processing the recognizer may be doing, and will not affect any audio processing. The recognizer can be in either the listening or idle state.
In multimodal applications, some recognizers will benefit from additional context. Clients can use the INFO request to send this context. The Content-Type header should specify the type of data, and the data itself is contained in the message body.
TODO: Note somewhere that vendors are free to support other language model file formats beyond SRGS.
Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.
reco-event = "START-OF-SPEECH" ; Start of speech has been detected
| "END-OF-SPEECH" ; End of speech has been detected
| "INTERIM-EVENT" ; See Interim Events above
| "INTERMEDIATE-RESULT" ; A partial hypothesis
| "RECOGNITION-COMPLETE" ; Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
| "INTERPRETATION-COMPLETE"
TODO: change to START/END-OF-SPEECH
END-OF-SPEECH is the logical counterpart to START-OF-SPEECH, and indicates that speech has ended. The event MUST include the Source-Time header, which corresponds to the point in the input stream where the recognizer estimates speech to have ended, NOT when the endpointer finally decided that speech ended (which will be a number of milliseconds later).
See Interim Events above.
Continuous speech (aka dictation) often requires feedback about what has been recognized thus far. Waiting for a RECOGNITION-COMPLETE event prevents this sort of user interface. INTERMEDIATE-RESULT provides this intermediate feedback. As with RECOGNITION-COMPLETE, contents are assumed to be EMMA unless an alternate Content-Type is provided.
This event is identical to the [MRCPv2] event with the same name.
This event is similar to the [MRCPv2] event with the same name, except that application/emma+xml (EMMA) is the default Content-Type. The Source-Time header must be included, to indicate the point in the input stream when the event occurred. When the Listen-Mode is reco-once, the recognizer will transition from the listening state to the idle state when this message is fired, and the Recognizer-State header in the event is set to "idle".
TODO: Describe how final results can be replaced in continuous recognition.
TODO: when no match is returned, is the EMMA no-match document required?
TODO: Insert some EMMA document examples.
Indicates that start of speech has been detected. The Source-Time header MUST correspond to the point in the input stream(s) where speech was estimated to begin, NOT when the endpointer finally decided that speech began (a number of milliseconds later).
The list of valid headers for the recognizer resource include a subset of the [MRCPv2] Recognizer Header Fields, where they make sense for HTML Speech requirements, as well as a handful of headers that are required for HTML Speech.
reco-header = ; Headers borrowed from MRCP
Confidence-Threshold
| Sensitivity-Level
| Speed-Vs-Accuracy
| N-Best-List-Length
| No-Input-Timeout
| Recognition-Timeout
| Waveform-URI
| Media-Type
| Input-Waveform-URI
| Completion-Cause
| Completion-Reason
| Recognizer-Context-Block
| Start-Input-Timers
| Speech-Complete-Timeout
| Speech-Incomplete-Timeout
| Failed-URI
| Failed-URI-Cause
| Save-Waveform
| Speech-Language
| Hotword-Min-Duration
| Hotword-Max-Duration
| Interpret-Text
; Headers added for html-speech/1.0
| audio-codec ; The audio codec used in an input media stream
| active-grammars ; Specifies a grammar or specific rule to activate.
| inactive-grammars ; Specifies a grammar or specific rule to deactivate.
| hotword ; Whether to listen in "hotword" mode (i.e. ignore out-of-grammar speech)
| listen-mode ; Whether to do continuous or one-shot recognition
| partial ; Whether to send partial results
| partial-interval ; Suggested interval between partial results, in milliseconds.
| recognizer-state ; Indicates whether the recognizer is listening or idle
| source-time ; The UA's local time at which the request was initiated
| user-id ; Unique identifier for the user, so that adaptation can be used to improve accuracy.
| Wave-Start-Time ; The start point of a recognition in the audio referred to by Waveform-URI.
| Wave-End-Time ; The end point of a recognition in the audio referred to by Waveform-URI.
hotword = "Hotword:" BOOLEAN
listen-mode = "Listen-Mode:" ("reco-once" | "reco-continuous" | vendor-listen-mode)
vendor-listen-mode = "x-" 1*UTFCHAR
recognizer-state = "Recognizer-State:" ("listening" | "idle")
source-time = "Source-Time:" 1*20DIGIT
audio-codec = "Audio-Codec:" mime-media-type ; see [RFC3555]
partial = "Partial:" BOOLEAN
partial-interval = "Partial-Interval:" 1*5DIGIT
active-grammars = "Active-Grammars:" "<" URI ["#" rule-name] [SP weight] ">" *("," "<" URI ["#" rule-name] [SP weight] ">")
rule-name = 1*UTFCHAR
weight = "0." 1*3DIGIT
inactive-grammars = "Inactive-Grammars:" "<" URI ["#" rule-name] ">" *("," "<" URI ["#" rule-name] ">")
user-id = "User-ID:" 1*UTFCHAR
wave-start-time = "Wave-Start-Time:" 1*DIGIT ["." 1*DIGIT]
wave-end-time = "Wave-End-Time:" 1*DIGIT ["." 1*DIGIT]
TODO: discuss how recognition from file would work.
Headers with the same names as their [MRCPv2] counterparts are considered to have the same specification. Other headers are described as follows:
The Audio-Codec header is used in the START-MEDIA-STREAM request, to specify the codec and parameters used to encode the input stream, using the MIME media type encoding scheme specified in [RFC3555].
The Active-Grammars header specifies a list of grammars, and optionally specific rules within those grammars. The header is used in SET-GRAMMARS or LISTEN to activate grammars/rules, and in GET-GRAMMARS to list the active grammars/rules. If no rule is specified for a grammar, the root rule is activated. This header may also specify the weight of the rule.
This header cannot be used in GET/SET-PARAMS
ISSUE: Grammar-Activate/Deactivate probably don't make sense in GET/SET-PARAMS. Is this an issue? Perhaps this would be better achieved in the message body? The same format could be used.
The Inactive-Grammars header specifies a list of grammars, and optionally specific rules within those grammars, to be deactivated. If no rule is specified, all rules in the grammar are deactivated, including the root rule. The Inactive-Grammars header MAY be used in both the SET-GRAMMARS and LISTEN methods.
This header cannot be used in GET/SET-PARAMS
The Hotword header is analogous to the [MRCPv2] Recognition-Mode header, however it has a different name and boolean type in html-speech/1.0 in order to avoid confusion with the Listen-Mode header. When true, the recognizer functions in "hotword" mode, which essentially means that out-of-grammar speech is ignored.
Listen-Mode is used in the LISTEN request to specify whether the recognizer should listen continuously, or return to the idle state after the first RECOGNITION-COMPLETE event. It MUST NOT be used in any other type of request other than LISTEN. When the recognizer is in the listening state, it should include Listen-Mode in all event and status messages it sends.
This header is required to support the continuous speech scenario on the recognizer resource. When sent by the client in a LISTEN or SET-PARAMS request, this header controls whether or not the client is interested in partial results from the service. In this context, the term 'partial' describes mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes such as an NBest list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on both regular command-and-control recognition requests and dictation sessions, because at the API level there is no syntactic difference between the recognition types: both are simply recognition requests over an SRGS grammar or set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, for example: open-microphone applications, dictation enrollment applications, and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on both INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, as well as in response to GET-PARAMS messages.
A suggestion from the client to the service on the frequency at which partial results should be sent. It is an integer value representing the desired interval, expressed in milliseconds. The recognizer does not need to honor the requested interval precisely, but SHOULD provide something close, if it is within the operating parameters of the implementation.
Indicates whether the recognizer is listening or idle. This MUST NOT be included by the client in any requests, and MUST be included by the recognizer in all status and event messages it sends.
Indicates the timestamp of a message using the client's local time. All requests sent from the client to the recognizer MUST include the Source-Time header, which must faithfully specify the client's local system time at the moment it sends the request. This enables the recognizer to correctly synchronize requests with the precise point in the input stream at which they were actually sent by the client. All event messages sent by the recognizer MUST include the Source-Time, calculated by the recognizer service based on the point in the input stream at which the event occurred, and expressed in the client's local clock time (since the recognizer knows what this was at the start of the input stream). By expressing all times in client-time, the user agent or application is able to correctly sequence events, and implement timing-sensitive scenarios, that involve other objects outside the knowledge of the recognizer service (for example, media playback objects or videogame states).
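A minimal sketch of how a client might apply this (TypeScript; treating Source-Time as a millisecond value of the client's local clock is an assumption, since the exact notation is still an open item below):

// Stamp an outgoing request with the client's local clock.
function sourceTimeHeader(): string {
  return "Source-Time: " + Date.now();        // assumed: milliseconds, client clock
}

// Because the recognizer expresses event Source-Time values in the client's
// clock, sequencing an event against a local object is a direct comparison.
function eventPrecedes(eventSourceTime: number, localTimestampMs: number): boolean {
  return eventSourceTime < localTimestampMs;
}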
TODO: What notation should be used? The Media Fragments Draft, "Temporal Dimensions" section has some potentially viable formats, such as the "wall clock" Zulu-time format.
Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, since vocabularies are so large. A User-ID field allows the recognizer to establish the user's identity if the webapp decides to supply this information.
ISSUE: There were some additional headers proposed few weeks ago: Return-Punctuation; Gender-Number-Pronoun; Return-Formatting; Filter-Profanity. There was pushback that these shouldn't be in a standard because they're very vendor-specific. Thus I haven't included them in this draft. Do we agree these are appropriate to omit? NOTE: we decided to omit them.
Some applications will wish to re-recognize an utterance using different grammars. For example, an application may accept a broad range of input, and use the first round of recognition simply to classify an utterance so that it can use a more focused grammar on the second round. Others will wish to record an utterance for future use. For example, an application that transcribes an utterance to text may store a recording so that untranscribed information (tone, emotion, etc.) is not lost. While these are not mainstream scenarios, they are both valid and inevitable, and may be achieved using the headers provided for recognition.
If the Save-Waveform header is set to true (with SET-PARAMS or LISTEN), then the recognizer will save the input audio. Consequent RECOGNITION-COMPLETE events sent by the recognizer will contain a URI in the Waveform-URI header which refers to the stored audio. In the case of continuous recognition, the Waveform-URI header refers to all of the audio captured so far. The application may fetch the audio from this URI, assuming it has appropriate credentials (the credential policy is determined by the service provider). The application may also use the URI as input to future LISTEN requests by passing the URI in the Input-Waveform-URI header.
When RECOGNITION-COMPLETE returns a Waveform-URI header, it also returns the time interval within the recorded waveform that the recognition result applies to, in the Wave-Start-Time and Wave-End-Time headers, which indicate the offsets in seconds from the start of the waveform. A client MAY also use the Source-Time header of other events such as START-OF-SPEECH and END-OF-SPEECH to calculate other intervals of interest. When using the Input-Waveform-URI header, the client may suffix the URI with an "interval" parameter to indicate that the recognizer should only decode that particular interval of the audio:
interval = "interval=" start "," end
start = seconds | "start"
end = seconds | "end"
seconds = 1*DIGIT ["." 1*DIGIT]
For example:
http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86
http://example.com/temp44235.wav?interval=0.65,end
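A client might construct such a URI as follows (TypeScript sketch; the helper is illustrative), for example to re-recognize just the span between the START-OF-SPEECH and END-OF-SPEECH times it observed:

function withInterval(waveformUri: string,
                      start: number | "start",
                      end: number | "end"): string {
  const separator = waveformUri.includes("?") ? "&" : "?";
  return waveformUri + separator + "interval=" + start + "," + end;
}

// e.g. withInterval("http://example.com/temp44235.wav", 0.65, "end")
//   -> "http://example.com/temp44235.wav?interval=0.65,end"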
TODO: does the Waveform-URI return a URI for each input stream, or are all input streams magically encoded into a single stream?
TODO: does the Input-Waveform-URI cause any existing input streams to be ignored?
Speech services MAY support pre-defined grammars that can be referenced through a 'builtin:' URI. For example: <builtin:dictation?context=email&lang=en_US>, <builtin:date>, or <builtin:search?context=web>. These can be used as top-level grammars in the Active-Grammars/Inactive-Grammars headers, or in rule references within other grammars. If a speech service does not support the referenced builtin, or does not support it in combination with the other active grammars, it should return a grammar compilation error.
The specific set of predefined grammars is to be defined later. However, there MUST be a certain small set of predefined grammars that a user agent's default speech recognizer MUST support. For non-default recognizers, support for predefined grammars is optional, and the set that is supported is also defined by the service provider (and may include proprietary grammars, e.g. builtin:x-acme-parts-catalog).
TODO: perhaps the specific set of grammars should be a MUST for the default built-in user agent, for a certain small set of grammars, but MAY for 3rd-party services. Can't solve this now - but note as an issue to be solved later.
TODO: Write some examples of one-shot and continuous recognition, EMMA documents, partial results, vendor-extensions, grammar/rule activation/deactivation, etc.
C->S: binary message: start of stream (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 61 6D | o / a m
| 72 2D 77 62 | r - w b
+---------------------------------------------------------------+
C->S: binary message: media packet (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
| encoded audio data |
| ... |
| ... |
| ... |
+---------------------------------------------------------------+
C->S: more binary media packets...
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: recognizer
Confidence-Threshold:0.9
Active-Grammars: <builtin:dictation?context=message>
Listen-Mode: reco-once
Source-time: 12753432234 (where in the input stream recognition should start)
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
C->S: more binary media packets...
C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE (because mode = reco-once, the request completes when reco completes)
C->S: binary message: start of stream (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 61 6D | o / a m
| 72 2D 77 62 | r - w b
+---------------------------------------------------------------+
C->S: binary message: media packet (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
| encoded audio data |
| ... |
| ... |
| ... |
+---------------------------------------------------------------+
C->S: more binary media packets...
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: recognizer
Confidence-Threshold:0.9
Active-Grammars: <builtin:dictation?context=message>
Listen-Mode: reco-continuous
Partial: TRUE
Source-time: 12753432234 (where in the input stream recognition should start)
C->S: more binary media packets...
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
Source-time: 12753439912 (when speech was detected)
C->S: more binary media packets...
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
C->S: more binary media packets...
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: more binary media packets...
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)
C->S: more binary media packets...
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE
Recognizer-State:idle
Completion-Cause: XXX (TBD)
Completion-Reason: No Input Streams
In HTML speech applications, the synthesizer service does not participate directly in the user interface. Rather, it simply provides rendered audio upon request, similar to any media server, plus interim events such as marks. The UA buffers the rendered audio, and the application may choose to play it to the user at some point completely unrelated to the synthesizer service. It is the synthesizer's role to render the audio stream in a timely manner, at least rapidly enough to support real-time feedback. The synthesizer MAY also render and transmit the stream faster than required for real-time playback, or render multiple streams in parallel, in order to reduce latency in the application. This is in stark contrast to IVR, where the synthesizer essentially renders directly to the user's telephone, and is an active part of the user interface.
The synthesizer MUST support [SSML] and plain text input. A synthesizer MAY also accept other input formats. In all cases, the client should use the Content-Type header to indicate the input format.
TODO: Mention SSML. Use content-type to differentiate between SSML and plain text.
synth-method = "SPEAK"
| "STOP"
| "DEFINE-LEXICON"
The set of synthesizer request methods is a subset of those defined in [MRCPv2].
The SPEAK method operates similarly to its [MRCPv2] namesake. The primary difference is that SPEAK results in a new audio stream being sent from the server to the client, using the same Request-ID. A SPEAK request MUST include the Audio-Codec header. When the rendering has completed, and the end-of-stream message has been sent, the synthesizer sends a SPEAK-COMPLETE event.
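As an illustrative sketch (TypeScript), a client might correlate the returned media with its SPEAK request via the Stream-ID header on the IN-PROGRESS status message (see the example later in this section), buffering the audio for playback at the application's discretion; the names here are illustrative:

const chunksByStream = new Map<number, Uint8Array[]>();    // stream-id -> audio chunks
const streamForRequest = new Map<number, number>();        // request-id -> stream-id

// Called for the "200 IN-PROGRESS" status of a SPEAK request.
function onSpeakInProgress(requestId: number, headers: Record<string, string>): void {
  const streamId = Number(headers["Stream-ID"]);
  if (!Number.isNaN(streamId)) {
    streamForRequest.set(requestId, streamId);
    chunksByStream.set(streamId, []);
  }
}

// Called for each binary media packet, after its 4-byte header has been parsed.
function onMediaPacket(streamId: number, payload: Uint8Array): void {
  chunksByStream.get(streamId)?.push(payload);   // buffered until the app plays it
}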
When the synthesizer receives a STOP request, it ceases rendering the requests specified in the Active-Request-Id-List header. If the Active-Request-Id-List header is missing, it ceases rendering all active SPEAK requests. For any SPEAK request that is ceased, the synthesizer sends an end-of-stream message and a SPEAK-COMPLETE event.
This is identical to its namesake in [MRCPv2].
Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the synthesizer resource.
synth-event = "INTERIM-EVENT" ; See Interim Events above
| "SPEECH-MARKER" ; An SSML mark has been rendered
| "SPEAK-COMPLETE"
See Interim Events above.
Similar to its namesake in [MRCPv2], except that the Speech-Marker header contains a relative timestamp indicating the elapsed time from the start of the stream.
Implementations should send the SPEECH-MARKER as closely as possible to the corresponding media packet so clients may play the media and fire events in real time if needed.
TODO: this should be sent adjacent to the audio packet at the same time point, so clients can play back in real time.
The same as its [MRCPv2] namesake.
The synthesis headers used in html-speech/1.0 are mostly a subset of those in [MRCPv2], with some minor modification and additions.
synth-header = ; headers borrowed from [MRCPv2]
active-request-id-list
| Completion-Cause
| Completion-Reason
| Voice-Gender
| Voice-Age
| Voice-Variant
| Voice-Name
| Prosody-parameter ; Actually a collection of prosody headers
| Speech-Marker
| Speech-Language
| Failed-URI
| Failed-URI-Cause
| Load-Lexicon
| Lexicon-Search-Order
; new headers for html-speech/1.0
| Audio-Codec
| Stream-ID
Audio-Codec = "Audio-Codec:" mime-media-type ; See [RFC3555]
Stream-ID = "Stream-ID:" 1*8DIGIT ; decimal representation of the 24-bit stream-ID
Similar to its namesake in [MRCPv2], except that the clock is defined as the local time at the service. By using the timestamp from the beginning of the stream, and the timestamp of this event, the UA can calculate when to raise the event to the application based on where it is in the playback of the rendered stream.
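A sketch of that calculation (TypeScript; the microsecond unit is an assumption based on the timestamps in the example below, not something this draft has pinned down):

// Offset of a mark from the start of the rendered stream.
function markerOffsetSeconds(streamStartTimestamp: number,
                             markerTimestamp: number): number {
  return (markerTimestamp - streamStartTimestamp) / 1_000_000;  // assumed microseconds
}

// Raise the mark to the application once playback has reached its offset.
function shouldFireMark(streamStartTimestamp: number,
                        markerTimestamp: number,
                        playbackPositionSeconds: number): boolean {
  return playbackPositionSeconds >= markerOffsetSeconds(streamStartTimestamp, markerTimestamp);
}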
TODO: insert more synthesis examples
TODO: synthesizing multiple prompts in parallel for playback in the UA when the app needs them
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Voice-gender:neutral
Voice-Age:25
Audio-codec:audio/flac
Prosody-volume:medium
Content-Type:application/ssml+xml
<?xml version="1.0"?>
<speak version="1.0">
...
S->C: html-speech/1.0 3257 200 IN-PROGRESS
Resource-ID:synthesizer
Stream-ID: 112233
Speech-Marker:timestamp=0
S->C: binary message: start of stream (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 66 6C | o / f l
| 61 63 +-------------------------------+ a c
+-------------------------------+
S->C: binary message: media packet (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
| encoded audio data |
| ... |
| ... |
| ... |
+---------------------------------------------------------------+
S->C: more binary media packets...
S->C: html-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
Resource-ID:synthesizer
Speech-Marker:timestamp=2059000;marker-1
S->C: more binary media packets...
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
Resource-ID:Synthesizer
Completion-Cause:000 normal
Speech-Marker:timestamp=5011000
S->C: binary audio packets...
S->C: binary audio packet: end of stream ( message type = 0x03 )
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+