This document is an informal rough draft that collates proposals, agreements, and open issues on the design of the necessary underlying protocol for the HTML Speech XG, for the purposes of review and discussion within the XG.
Multimodal interfaces enable users to interact with web applications using multiple different modalities. The HTML Speech protocol, and associated HTML Speech API, are designed to enable speech modalities as part of a common multimodal user experience combining spoken and graphical interaction across browsers. The specific goal of the HTML Speech protocol is to enable a web application to utilize the same network-based speech resources regardless of the browser used to render the application. The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2] where it makes sense to borrow from that prior art. However, since the use cases for HTML Speech applications are in many cases considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not a direct transcript of MRCP. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.
            Client
|-----------------------------|
|      HTML Application       |                                            Server
|-----------------------------|                                 |--------------------------|
|       HTML Speech API       |                                 | Synthesizer | Recognizer |
|-----------------------------|                                 |--------------------------|
| HTML-Speech Protocol Client |---html-speech/1.0 subprotocol---|    HTML-Speech Server    |
|-----------------------------|                                 |--------------------------|
|      WebSockets Client      |-------WebSockets protocol-------|    WebSockets Server     |
|-----------------------------|                                 |--------------------------|
Because continuous recognition plays an important role in HTML Speech scenarios, a Recognizer is a resource that essentially acts as a filter on its input streams. Its grammars/language models can be specified and changed, as needed by the application, and the recognizer adapts its processing accordingly. Single-shot recognition (e.g. a user on a web search page presses a button and utters a single web-search query) is a special case of this general pattern, where the application specifies its model once, and is only interested in one match event, after which it stops sending audio (if it hasn't already).
A Recognizer performs speech recognition, with the following characteristics:
"Recognizers" are not strictly required to perform speech recognition, and may perform additional or alternative functions, such as speaker verification, emotion detection, or audio recording.
A Synthesizer generates audio streams from textual input. It essentially produces a media stream with additional events, which the user agent buffers and plays back as required by the application. A Synthesizer service has the following characteristics:
In the HTML Speech protocol, the control signals and the media itself are transported over the same WebSocket connection. Earlier implementations utilized a simple HTTP connection for speech recognition and synthesis; use cases involving continuous recognition motivated the move to WebSockets. This simple design avoids all the normal media problems of session negotiation, packet delivery, port & IP address assignments, NAT-traversal, etc, since the underlying WebSocket already satisfies these requirements. A beneficial side-effect of this design is that by limiting the protocol to WebSockets over HTTP there should be fewer problems with firewalls compared to having a separate RTP connection or other channel for the media transport. This design is different from MRCP, which is oriented around telephony/IVR and all its impediments, rather than HTML and Web services, and is motivated by simplicity and the desire to keep the protocol within HTTP.
The WebSockets session is established through the standard WebSockets HTTP handshake, with these specifics:
For example:
C->S: GET /speechservice123?customparam=foo&otherparam=bar HTTP/1.1
Host: examplespeechservice.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: OIUSDGY67SDGLjkSD&g2 (for example)
Sec-WebSocket-Version: 9
Sec-WebSocket-Protocol: html-speech/1.0, x-proprietary-speech
S->C: HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: html-speech/1.0
Once the WebSockets session is established, the UA can begin sending requests and media to the service, which can respond with events, responses or media.
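As a non-normative illustration, the following TypeScript sketch shows how a user agent or test client might open such a session on a browser WebSocket stack (the service URI and query parameter are hypothetical):

// Sketch: open an html-speech/1.0 session. Not part of the protocol definition.
const socket = new WebSocket(
    "wss://examplespeechservice.com/speechservice123?customparam=foo",
    ["html-speech/1.0"]);           // requested sub-protocol(s)

socket.binaryType = "arraybuffer";  // media packets arrive as binary messages

socket.onopen = () => {
    // Verify that the server actually selected the html-speech/1.0 sub-protocol.
    if (socket.protocol !== "html-speech/1.0") {
        socket.close();
        return;
    }
    // The UA may now send requests and media, and receive events, responses and media.
};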
A session may have at most one synthesizer resource and one recognizer resource. If an application requires multiple resources of the same type (for example, two synthesizers from different vendors), it MUST use separate WebSocket sessions.
There is no association of state between sessions. If a service wishes to provide a special association between separate sessions, it may do so behind the scenes as a service-specific extension (for example, to re-use audio input from one session in another session without resending it, or to trigger service-side barge-in of TTS in one session based on recognition in another session).
The signaling design borrows its basic pattern from [MRCPv2], where there are three classes of control messages:
control-message = start-line ; i.e. use the typical MIME message format
*(header CRLF)
CRLF
[body]
start-line = request-line | status-line | event-line
header = <Standard MIME header format> ; case-insensitive. Actual headers depend on the type of message
body = *OCTET ; depends on the type of message
The interaction is full-duplex and asymmetrical: service activity is instigated by requests from the UA, which may be multiple and overlapping, and each request results in one or more messages from the service back to the UA.
For example:
C->S: html-speech/1.0 SPEAK 3257 ; request synthesis of string
Resource-ID:synthesizer
Audio-codec:audio/basic
Content-Type:text/plain
Hello world! I speak therefore I am.
S->C: html-speech/1.0 3257 200 IN-PROGRESS ; server confirms it will start synthesizing
S->C: media for 3257 ; receive synthesized media
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE ; done!
Request messages are sent from the client to the server, usually to request an action or modify a setting. Each request has its own request-id, which is unique within a given WebSockets html-speech session. Any status or event messages related to a request use the same request-id. All request-ids MUST have a non-negative integer value of 1-10 decimal digits.
request-line = version SP method-name SP request-id CRLF
version = "html-speech/" 1*DIGIT "." 1*DIGIT ; html-speech/1.0
method-name = general-method | synth-method | reco-method | proprietary-method
request-id = 1*10DIGIT
NOTE: In some other protocols, messages also include their message length, so that they can be framed in what is otherwise an open stream of data. In html-speech/1.0, framing is already provided by WebSockets, and message length is not needed, and therefore not included.
For example, to request the recognizer to interpret text as if it were spoken:
C->S: html-speech/1.0 INTERPRET 8322
Resource-ID: recognizer
Active-Grammars: <http://myserver/mygrammar.grxml>
Interpret-Text: Send a dozen yellow roses and some expensive chocolates to my mother
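As a non-normative sketch of how a client might serialize such a request into a single WebSockets text message, following the start-line/headers/body layout above (TypeScript; it reuses the socket from the earlier sketch):

// Sketch: compose a control message: start-line, headers, blank line, optional body.
function buildRequest(method: string, requestId: number,
                      headers: Record<string, string>, body?: string): string {
    const head = [
        `html-speech/1.0 ${method} ${requestId}`,
        ...Object.entries(headers).map(([name, value]) => `${name}: ${value}`),
    ];
    return head.join("\r\n") + "\r\n\r\n" + (body ?? "");
}

// Usage: the INTERPRET request shown above, sent as one text message.
socket.send(buildRequest("INTERPRET", 8322, {
    "Resource-ID": "recognizer",
    "Active-Grammars": "<http://myserver/mygrammar.grxml>",
    "Interpret-Text": "Send a dozen yellow roses and some expensive chocolates to my mother",
}));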
Status messages are sent from the server to the client, to indicate the state of a request.
status-line = version SP request-id SP status-code SP request-state CRLF
status-code = 3DIGIT ; Specific codes TBD, but probably similar to those used in MRCP
; All communication from the server is labeled with a request state.
request-state = "COMPLETE" ; Processing of the request has completed.
| "IN-PROGRESS" ; The request is being fulfilled.
| "PENDING" ; Processing of the request has not begun.
Specific status code values follow a pattern similar to [MRCPv2].
Event messages are sent by the server, to indicate specific data, such as synthesis marks, speech detection, and recognition results. They are essentially specialized status messages.
event-line = version SP event-name SP request-id SP request-state CRLF
event-name = synth-event | reco-event | proprietary-event
For example, an event indicating that the recognizer has detected the start of speech:
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
Resource-ID: recognizer
Source-time: 2011-09-06T21:47:31.981+01:30 (when speech was detected)
HTML Speech applications feature a wide variety of media transmission scenarios. The number of media streams at any given time is not fixed. For example:
Whereas a human listener will tolerate the clicks and pops of missing packets so they can continue listening in real time, recognizers do not require their data in real-time, and will generally prefer to wait for delayed packets in order to maintain accuracy.
Advanced implementations of HTML Speech may incorporate multiple channels of audio in a single transmission. For example, a living-room device with a microphone array may send separate streams capturing the speech of multiple individuals within the room. Or, for example, a device may send parallel streams with alternative encodings that may not be human-consumable but contain information that is of particular value to a recognition service.
In html-speech/1.0, audio (or other media) is packetized and transmitted as a series of WebSockets binary messages, on the same WebSockets session used for the control messages.
media-packet = binary-message-type
binary-stream-id
binary-data
binary-message-type = OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-stream-id = 3OCTET ; Unique identifier for the stream, 0..2^24-1
binary-data = *OCTET
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
+---------------+-----------------------------------------------+
|                              ...                              |
|                             Data                              |
|                              ...                              |
+---------------------------------------------------------------+
The binary-stream-id field is used to identify the messages for a particular stream. It is a 24-bit unsigned integer. Its value for any given stream is assigned by the sender (client or server) in the initial message of the stream, and must be unique to the sender within the WebSockets session.
A sequence of media messages with the same stream-ID represents an in-order contiguous stream of data. Because the messages are sent in-order and audio packets cannot be lost (WebSockets uses TCP), there is no need for sequence numbering or timestamps. The sender just packetizes audio from the encoder and sends it, while the receiver just un-packs the messages and feeds them to the consumer (e.g. the recognizer's decoder, or the TTS playback buffer). Timing of coordinated events is calculated by decoded offset from the beginning of the stream.
The WebSockets stack de-multiplexes text and binary messages, thus separating signaling from media, while the stream-ID on each media message is used to de-multiplex the messages into separate media streams.
The binary-message-type field has these defined values: 0x01 indicates the start of a stream (the message body carries the stream's NTP start timestamp followed by its media type); 0x02 indicates a packet of media data; and 0x03 indicates the end of the stream.
For example, a start-of-stream message with message type = 0x01; stream-id = 112233; media-type = audio/amr-wb, followed by a media data message (message type = 0x02) on the same stream:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
|                          61 75 64 69                          |   a u d i
|                          6F 2F 61 6D                          |   o / a m
|                          72 2D 77 62                          |   r - w b
+---------------------------------------------------------------+
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|                       encoded audio data                      |
|                              ...                              |
|                              ...                              |
|                              ...                              |
+---------------------------------------------------------------+
There is no strict constraint on the size and frequency of audio messages. Nor is there a requirement for all audio packets to encode the same duration of sound. However, implementations SHOULD seek to minimize interference with the flow of other messages on the same socket, by sending messages that encode between 20 and 80 milliseconds of media. Since a WebSockets frame header is typically only 4 bytes, overhead is minimal and implementations SHOULD err on the side of sending smaller packets more frequently.
A synthesis service MAY (and typically will) send audio faster than real-time, and the client MUST be able to handle this.
A recognition service MUST be prepared to receive slower-than-real-time audio due to practical throughput limitations of the network.
The design does not permit the transmission of binary media as base-64 text messages, since WebSockets already provides native support for binary messages. Base-64 encoding would incur an unnecessary 33% transmission overhead.
The end of a stream is signaled with an end-of-stream message (message type 0x03), which carries no payload:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
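As a non-normative sketch of the framing described above (TypeScript; the message-type values 0x01/0x02/0x03 are taken from the examples, and the octet order of the stream-id follows the example bit patterns, so confirm both against the service you are targeting):

// Sketch: frame a binary media message: 1-octet message type + 3-octet stream-id + data.
function buildMediaPacket(messageType: number, streamId: number,
                          payload: Uint8Array): Uint8Array {
    const packet = new Uint8Array(4 + payload.length);
    packet[0] = messageType & 0xff;        // 0x01 start, 0x02 media data, 0x03 end
    packet[1] = streamId & 0xff;           // 3-octet stream-id, least-significant
    packet[2] = (streamId >>> 8) & 0xff;   //   octet first, per the example above
    packet[3] = (streamId >>> 16) & 0xff;
    packet.set(payload, 4);
    return packet;
}

// Sketch: a start-of-stream body carries a 64-bit NTP timestamp followed by the
// media type as text; network byte order is assumed here for the timestamp.
function buildStartOfStream(streamId: number, mediaType: string): Uint8Array {
    const NTP_EPOCH_OFFSET = 2208988800;   // seconds between 1900-01-01 and 1970-01-01
    const now = Date.now() / 1000;
    const body = new Uint8Array(8 + mediaType.length);
    const view = new DataView(body.buffer);
    view.setUint32(0, Math.floor(now) + NTP_EPOCH_OFFSET);  // NTP seconds
    view.setUint32(4, Math.floor((now % 1) * 2 ** 32));     // NTP fraction
    body.set(new TextEncoder().encode(mediaType), 8);
    return buildMediaPacket(0x01, streamId, body);
}

// Usage: announce stream 112233, send one 20 ms chunk of audio/basic, then end it.
const encodedAudioChunk = new Uint8Array(160);   // placeholder for 20 ms of 8 kHz mu-law
socket.send(buildStartOfStream(112233, "audio/basic"));
socket.send(buildMediaPacket(0x02, 112233, encodedAudioChunk));
socket.send(buildMediaPacket(0x03, 112233, new Uint8Array(0)));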
Although most services will be strictly either recognition or synthesis services, some services may support both in the same session. While this is a more advanced scenario, the design does not introduce any constraints to prevent it. Indeed, both the client and server MAY send audio streams in the same session.
Both the signaling and media transmission aspects of the html-speech/1.0 protocol inherit a number of security features from the underlying WebSockets protocol [WS-PROTOCOL]:
Clients may authenticate servers using standard TLS, simply by using the WSS: URI scheme rather than the WS: scheme in the service URI. This is standard WebSockets functionality, in much the same way as HTTP specifies TLS by using the HTTPS: scheme.
Similarly, all traffic (media and signaling) is encrypted by TLS when using the WSS: URI scheme.
User authentication, when required by a server, will commonly be done using the standard [HTTP] challenge-response mechanism in the initial WebSocket bootstrap. A server may also choose to use TLS client authentication, and although this will probably be uncommon, WebSockets stacks should support it.
HTML speech network scenarios also have security boundaries outside of signaling and media:
A client may require a server to access resources from a third location. Such resources may include SRGS documents, SSML documents, audio files, etc. This may either be a result of the application referring to the resource by URI; or of an already loaded resource containing a URI reference to a separate resource. In these cases the server will need permission to access these resources. There are three ways in which this may be accomplished:
Through the use of certain headers during speech recognition, the client may request the server to retain a recording of the input media, and make this recording available at a URL for retrieval. The server that holds the recording MAY secure this recording by using standard HTTP security mechanisms: it MAY authenticate the client using standard HTTP challenge/response; it MAY use TLS to encrypt the recording when transmitting it back to the client; and it MAY use TLS to authenticate the client. The server that holds a recording MAY also discard the recording after a reasonable period, as determined by the server.
Timestamps are used in a variety of headers in the protocol. Binary messages use the 64-bit NTP timestamp format, as defined in [RFC 1305]. Text messages use the encoding format defined in [RFC 3339] "Date and Time on the Internet: Timestamps", and reproduced here:
date-time = full-date "T" full-time
; For example: 2011-09-06T10:33:16.612Z
; or: 2011-09-06T21:47:31.981+01:30
full-date = date-fullyear "-" date-month "-" date-mday
full-time = partial-time time-offset
date-fullyear = 4DIGIT
date-month = 2DIGIT ; 01-12
date-mday = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on
; month/year
partial-time = time-hour ":" time-minute ":" time-second
[time-secfrac]
time-hour = 2DIGIT ; 00-23
time-minute = 2DIGIT ; 00-59
time-second = 2DIGIT ; 00-58, 00-59, 00-60 based on leap second
; rules
time-secfrac = "." 1*DIGIT
time-numoffset = ("+" / "-") time-hour ":" time-minute
time-offset = "Z" / time-numoffset
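For illustration, a client might produce such a timestamp (e.g. for the Source-Time header) from its local clock as follows; a minimal TypeScript sketch:

// Sketch: format the client's local clock as an RFC 3339 timestamp with a
// numeric UTC offset, suitable for timestamp headers in text messages.
function rfc3339Now(): string {
    const now = new Date();
    const pad = (n: number, width = 2) => String(n).padStart(width, "0");
    const offsetMinutes = -now.getTimezoneOffset();        // minutes east of UTC
    const sign = offsetMinutes >= 0 ? "+" : "-";
    const abs = Math.abs(offsetMinutes);
    return `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}` +
           `T${pad(now.getHours())}:${pad(now.getMinutes())}:${pad(now.getSeconds())}` +
           `.${pad(now.getMilliseconds(), 3)}` +
           `${sign}${pad(Math.floor(abs / 60))}:${pad(abs % 60)}`;
}

// e.g. "2011-09-06T21:47:31.981+01:30"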
The GET-PARAMS and SET-PARAMS requests are the same as their [MRCPv2] counterparts. They are used to discover and set the default header values of a resource (recognizer or synthesizer). Like all messages, they must always include the Resource-ID header.
Setting a header with SET-PARAMS sets a default value for the header. This value is used by the resource in requests where the header is absent from the request, but valid for that type of request.
general-method = "SET-PARAMS"
| "GET-PARAMS"
header = capability-query-header
| interim-event-header
| reco-header
| synth-header
capability-query-header =
"Supported-Content:" mime-type *("," mime-type)
| "Supported-Languages:" lang-tag *("," lang-tag) ; See [RFC5646]
| "Builtin-Grammars:" "<" URI ">" *("," "<" URI ">")
interim-event-header =
"Interim-Events:" event-name *("," event-name)
event-name = 1*UTFCHAR
These headers may be used in any control message. All header names are case-insensitive.
generic-header =
accept
| accept-charset
| content-base
| logging-tag
| resource-id
| vendor-specific
| content-type
| content-encoding
resource-id = "Resource-ID:" ("recognizer" | "synthesizer" | vendor-resource)
vendor-resource = "x-" 1*UTFCHAR
accept = <indicates the content-types the sender will accept>
accept-charset = <indicates the character set the sender will accept>
content-base = <the base for relative URIs>
content-type = <the type of content contained in the message body>
content-encoding = <the encoding of message body content>
logging-tag = <a tag to be inserted into server logs>
vendor-specific = "Vendor-Specific-Parameters:" vendor-specific-av-pair
*[";" vendor-specific-av-pair] CRLF
vendor-specific-av-pair = vendor-av-pair-name "=" vendor-av-pair-value
The html-speech/1.0 protocol provides a way for the application/UA to determine whether a resource supports the basic capabilities it needs. In most cases applications will know a service's resource capabilities ahead of time. However, some applications may be more adaptable, or may wish to double-check at runtime. To determine resource capabilities, the UA sends a GET-PARAMS request to the resource, containing a set of capabilities, to which the resource responds with the specific subset it actually supports.
For example, discovering whether the recognizer supports the desired CODECs, grammar format, languages/dialects and built-in grammars:
C->S: html-speech/1.0 GET-PARAMS 34132
resource-id: recognizer
supported-content: audio/basic, audio/amr-wb,
audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
audio/dsr-es202212; rate:8000; maxptime:40,
application/x-ngram+xml
supported-languages: en-AU, en-GB, en-US, en (A variety of English dialects are desired)
builtin-grammars: <builtin:dictation?topic=websearch>,
<builtin:dictation?topic=message>,
<builtin:ordinals>,
<builtin:datetime>,
<builtin:cities?locale=USA>
S->C: html-speech/1.0 34132 200 COMPLETE
resource-id: recognizer
supported-content: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
supported-languages: en-GB, en (The recognizer supports UK English, but will work with any English)
builtin-grammars: <builtin:dictation?topic=websearch>, <builtin:dictation?topic=message>
For example, discovering whether the synthesizer supports the desired CODECs, languages/dialects, and content markup format:
C->S: html-speech/1.0 GET-PARAMS 48223
resource-id: synthesizer
supported-content: audio/ogg, audio/flac, audio/basic, application/ssml+xml
supported-languages: en-AU, en-GB
S->C: html-speech/1.0 48223 200 COMPLETE
resource-id: synthesizer
supported-content: audio/basic, application/ssml+xml
supported-languages: en-GB
Speech services may care to send optional vendor-specific interim events during the processing of a request. For example: some recognizers are capable of providing additional information as they process input audio; and some synthesizers are capable of firing progress events on word, phoneme, and viseme boundaries. These are exposed through the HTML Speech API as events that the webapp can listen for if it knows to do so. A service vendor MAY require a vendor-specific value to be set with SET-PARAMS before it starts to fire certain events.
interim-event = version SP "INTERIM-EVENT" SP request-id SP request-state CRLF
*(header CRLF)
CRLF
[body]
event-name-header = "Event-Name:" event-name
The Event-Name header is required and must contain a value that was previously subscribed to with the Interim-Events header.
The Request-ID and Content-Type headers are required, and any data conveyed by the event must be contained in the body.
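Where a vendor requires subscription before it will fire a given interim event, the client might subscribe with SET-PARAMS and the Interim-Events header; a non-normative sketch reusing the buildRequest() helper from earlier (the request-id and event name are illustrative):

// Sketch: subscribe the synthesizer to a vendor-specific interim event.
socket.send(buildRequest("SET-PARAMS", 3256, {
    "Resource-ID": "synthesizer",
    "Interim-Events": "x-viseme-event",
}));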
For example, a synthesis service might choose to communicate visemes through interim events:
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Audio-codec:audio/basic
Content-Type:text/plain
Hello world! I speak therefore I am.
S->C: html-speech/1.0 3257 200 IN-PROGRESS
S->C: media for 3257
S->C: html-speech/1.0 INTERIM-EVENT 3257 IN-PROGRESS
Resource-ID:synthesizer
Event-Name:x-viseme-event
Content-Type:application/x-viseme-list
"Hello"
0.500 H
0.850 A
1.050 L
1.125 OW
1.800 SILENCE
S->C: more media for 3257
S->C: html-speech/1.0 INTERIM-EVENT 3257 IN-PROGRESS
Resource-ID:synthesizer
Event-Name:x-viseme-event
Content-Type:application/x-viseme-list
"World"
2.200 W
2.350 ER
2.650 L
2.800 D
3.100 SILENCE
S->C: etc
Applications will generally want to select resources with certain capabilities, such as the ability to recognize certain languages, work well in specific acoustic conditions, work well with specific genders or ages, speak particular languages, speak with a particular style, age or gender, etc.
There are three ways in which resource selection can be achieved, each of which has relevance:
Selection through the service URI is the preferred mechanism.
Any service may enable applications to encode resource requirements as query string parameters in the URI, or by using specific URIs with known resources. The specific URI format and parameter scheme is by necessity not standardized and is defined by the implementer based on their architecture and service offerings.
For example, a German recognizer with a cell-phone acoustic environment model:
ws://example1.net:2233/webreco/de-de/cell-phone
A UK English recognizer for a two-beam living-room array microphone:
ws://example2.net/?reco-lang=en-UK&reco-acoustic=10-foot-open-room&sample-rate=16kHz&channels=2
Spanish recognizer and synthesizer, where the synthesizer uses a female voice provided by AcmeSynth:
ws://example3.net/speech?reco-lang=es-es&tts-lang=es-es&tts-gender=female&tts-vendor=AcmeSynth
A pre-defined profile specified by an ID string:
ws://example4.com/profile=af3e-239e-9a01-66c0
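A client might assemble such a URI programmatically before opening the session; a non-normative TypeScript sketch (the parameter names are illustrative, since the scheme is vendor-defined):

// Sketch: encode resource requirements as vendor-defined query parameters.
const serviceUri = new URL("ws://example2.net/");
serviceUri.searchParams.set("reco-lang", "en-UK");
serviceUri.searchParams.set("reco-acoustic", "10-foot-open-room");
serviceUri.searchParams.set("sample-rate", "16kHz");
serviceUri.searchParams.set("channels", "2");

const recoSocket = new WebSocket(serviceUri.toString(), ["html-speech/1.0"]);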
Request headers may also be used to select specific resource capabilities. Synthesizer parameters are set through SET-PARAMS or SPEAK, whereas recognition parameters are set through SET-PARAMS or LISTEN. There is a small set of standard headers that can be used with each resource: the Speech-Language header may be used with both the recognizer and synthesizer, and the synthesizer may also accept a variety of voice selection parameters as headers. A resource MAY honor these headers, but need not do so where it does not have the ability. If a particular header value is unsupported, the request should fail with a status of 409 "Unsupported Header Field Value".
For example, a client requires Canadian French recognition but it isn't available:
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: Recognizer
Speech-Language: fr-CA
S->C: html-speech/1.0 8322 409 COMPLETE ; 409, since fr-CA isn't supported.
resource-id: Recognizer
Set the default recognition language to Brazilian Portuguese:
C->S: html-speech/1.0 SET-PARAMS 8323
Resource-ID: Recognizer
Speech-Language: pt-BR
S->C: html-speech/1.0 8323 200 COMPLETE
resource-id: Recognizer
Speak with the voice of a Korean woman in her mid-thirties:
C->S: html-speech/1.0 SPEAK 8324
Resource-ID: Synthesizer
Speech-Language: ko-KR
Voice-Age: 35
Voice-Gender: female
Set the default voice to a Swedish voice named "Kiana":
C->S: html-speech/1.0 SET-PARAMS 8325
Resource-ID: Synthesizer
Speech-Language: sv-SE
Voice-Name: Kiana
This approach is very versatile in theory; however, at the time of writing, very few implementations are capable of this kind of versatility in practice.
The [SRGS] and [SSML] input documents for the recognizer and synthesizer will specify the language for the overall document, and MAY specify languages for specific subsections of the document. The resource consuming these documents SHOULD honor these language assignments when they occur. If a resource is unable to do so, it should error with a 481 status "Unsupported content language". (It should be noted that at the time of writing, most currently available recognizer and synthesizer implementations will be unable to support this capability.)
Generally speaking, given the current typical state of speech technology, unless a service is unusually adaptable, applications will be most successful using specific proprietary URLs that encode the abilities they need, so that the appropriate resources can be allocated during session initiation.
A recognizer resource is either in the "listening" state, or the "idle" state. Because continuous recognition scenarios often don't have dialog turns or other down-time, all functions are performed in series on the same input stream(s). The key distinction between the idle and listening states is the obvious one: when listening, the recognizer processes incoming media and produces results; whereas when idle, the recognizer SHOULD buffer audio but will not process it.
Recognition is accomplished with a set of messages and events, to a certain extent inspired by those in [MRCPv2].
Idle State                Listening State
    |                            |
    |--\                         |
    |   DEFINE-GRAMMAR           |
    |<-/                         |
    |                            |
    |--\                         |
    |   INFO                     |
    |<-/                         |
    |                            |
    |---------LISTEN------------>|
    |                            |
    |                            |--\
    |                            |   INTERIM-EVENT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   START-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   START-INPUT-TIMERS
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   END-OF-SPEECH
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   INFO
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   INTERMEDIATE-RESULT
    |                            |<-/
    |                            |
    |                            |--\
    |                            |   RECOGNITION-COMPLETE
    |                            |   (when mode = reco-continuous)
    |                            |<-/
    |                            |
    |<---RECOGNITION-COMPLETE----|
    |  (when mode = reco-once)   |
    |                            |
    |                            |
    |<--no media streams remain--|
    |                            |
    |                            |
    |<----------STOP-------------|
    |                            |
    |                            |
    |<---some 4xx/5xx errors-----|
    |                            |
    |--\                         |--\
    |   INTERPRET                |   INTERPRET
    |<-/                         |<-/
    |                            |
    |--\                         |--\
    |   INTERPRETATION-COMPLETE  |   INTERPRETATION-COMPLETE
    |<-/                         |<-/
    |                            |
reco-method = "LISTEN" ; Transitions Idle -> Listening
| "START-INPUT-TIMERS" ; Starts the timer for the various input timeout conditions
| "STOP" ; Transitions Listening -> Idle
| "DEFINE-GRAMMAR" ; Pre-loads & compiles a grammar, assigns a temporary URI for reference in other methods
| "CLEAR-GRAMMARS" ; Unloads all grammars, whether active or inactive
| "INTERPRET" ; Interprets input text as though it was spoken
| "INFO" ; Sends metadata to the recognizer
The LISTEN method transitions the recognizer from the idle state to the listening state. The recognizer then processes the media input streams against the set of active grammars. The request MUST include the Source-Time header, which is used by the Recognizer to determine the point in the input stream(s) that the recognizer should start processing from (which won't necessarily be the start of the stream). The request MUST also include the Listen-Mode header to indicate whether the recognizer should perform continuous recognition, a single recognition, or vendor-specific processing.
A LISTEN request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.
When there are no input media streams, and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the listen request will fail (480). When in the listening state, and all input streams have ended, the recognizer automatically transitions to the idle state, and issues a RECOGNITION-COMPLETE event, with Completion-Cause set to 080 ("no-input-stream").
A LISTEN request that is made while the recognizer is already listening results in a 402 error ("Method not valid in this state", since it is already listening).
This is used to indicate when the input timeout clock should start. For example, when the application wants to enable voice barge-in during a prompt, but doesn't want to start the time-out clock until after the prompt has completed, it will delay sending this request until it's finished playing the prompt.
The STOP method transitions the recognizer from the listening state to the idle state. No RECOGNITION-COMPLETE event is sent. The Source-Time header MUST be used, since the recognizer may still fire a RECOGNITION-COMPLETE event for any completion state it encounters prior to that time in the input stream.
A STOP request that is sent while the recognizer is idle results in a 402 response (method not valid in this state, since there is nothing to stop).
The DEFINE-GRAMMAR method does not activate a grammar. It simply causes the recognizer to pre-load and compile it, and associates it with a temporary URI that can then be used to activate or deactivate the grammar or one of its rules. DEFINE-GRAMMAR is not required in order to use a grammar, since the recognizer can load grammars on demand as needed. However, it is useful when an application wants to ensure a large grammar is pre-loaded and ready for use prior to the recognizer entering the listening state. DEFINE-GRAMMAR can be used when the recognizer is in either the listening or idle state.
All recognizer services MUST support grammars in the SRGS XML format, and MAY support additional alternative grammar/language-model formats.
The client SHOULD remember the temporary URIs, but if it loses track, it can always re-issue the DEFINE-GRAMMAR request, which MUST NOT result in a service error as long as the mapping is consistent with the original request. Once in place, the URI MUST be honored by the service for the duration of the session. If the service runs low on resources, it is free to unload the URI's payload, but must always continue to honor the URI even if it means reloading the grammar (performance notwithstanding).
Refer to [MRCPv2] for more details on this method.
In continuous recognition, a variety of grammars may be loaded over time, potentially resulting in unused grammars consuming memory resources in the recognizer. The CLEAR-GRAMMARS method unloads all grammars, whether active or inactive. Any URIs previously defined with DEFINE-GRAMMAR become invalid.
The INTERPRET method processes the input text according to the set of grammar rules that are active at the time it is received by the recognizer. It MUST include the Interpret-Text header. The use of INTERPRET is orthogonal to any audio processing the recognizer may be doing, and will not affect any audio processing. The recognizer can be in either the listening or idle state.
An INTERPRET request MAY also activate or deactivate grammars and rules using the Active-Grammars and Inactive-Grammars headers, but only if the recognizer is in the idle state. These grammars/rules are considered to be activated/deactivated from the point specified in the Source-Time header.
In multimodal applications, some recognizers will benefit from additional context. Clients can use the INFO request to send this context. The Content-Type header should specify the type of data, and the data itself is contained in the message body.
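A sketch of such an INFO request, reusing the buildRequest() and rfc3339Now() helpers from earlier (the JSON context payload, its media type, and the request-id are purely illustrative):

// Sketch: pass application context to the recognizer in an INFO request body.
const context = JSON.stringify({ screen: "flight-search", focusedField: "destination" });
socket.send(buildRequest("INFO", 8330, {
    "Resource-ID": "recognizer",
    "Source-Time": rfc3339Now(),
    "Content-Type": "application/json",
}, context));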
Recognition events are associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.
reco-event = "START-OF-SPEECH" ; Start of speech has been detected
| "END-OF-SPEECH" ; End of speech has been detected
| "INTERIM-EVENT" ; See Interim Events above
| "INTERMEDIATE-RESULT" ; A partial hypothesis
| "RECOGNITION-COMPLETE" ; Similar to MRCP2 except that application/emma+xml (EMMA) will be the default Content-Type.
| "INTERPRETATION-COMPLETE"
END-OF-SPEECH is the logical counterpart to START-OF-SPEECH, and indicates that speech has ended. The event MUST include the Source-Time header, which corresponds to the point in the input stream where the recognizer estimates speech to have ended, NOT when the endpointer finally decided that speech ended (which will be a number of milliseconds later).
See Interim Events above. For example, a recognition service may send interim events to indicate it's begun to recognize a phrase, or to indicate that noise or cross-talk on the input channel is degrading accuracy.
Continuous speech (aka dictation) often requires feedback about what has been recognized thus far. Waiting for a RECOGNITION-COMPLETE event prevents this sort of user interface. INTERMEDIATE-RESULT provides this intermediate feedback. As with RECOGNITION-COMPLETE, contents are assumed to be EMMA unless an alternate Content-Type is provided.
This event contains the result of an INTERPRET request.
This event is similar to the [MRCPv2] event of the same name, except that application/emma+xml (EMMA) is the default Content-Type. The Source-Time header must be included, to indicate the point in the input stream when the event occurred. When the Listen-Mode is reco-once, the recognizer transitions from the listening state to the idle state when this event is fired, and the Recognizer-State header in the event is set to "idle".
Where applicable, the body of the message SHOULD contain an EMMA document that is consistent with the Completion-Cause.
Indicates that start of speech has been detected. The Source-Time header MUST correspond to the point in the input stream(s) where speech was estimated to begin, NOT when the endpointer finally decided that speech began (a number of milliseconds later).
The list of valid headers for the recognizer resource include a subset of the [MRCPv2] Recognizer Header Fields, where they make sense for HTML Speech requirements, as well as a handful of headers that are required for HTML Speech.
reco-header = ; Headers borrowed from MRCP
Confidence-Threshold
| Sensitivity-Level
| Speed-Vs-Accuracy
| N-Best-List-Length
| No-Input-Timeout
| Recognition-Timeout
| Media-Type
| Input-Waveform-URI
| Completion-Cause
| Completion-Reason
| Recognizer-Context-Block
| Start-Input-Timers
| Speech-Complete-Timeout
| Speech-Incomplete-Timeout
| Failed-URI
| Failed-URI-Cause
| Save-Waveform
| Speech-Language
| Hotword-Min-Duration
| Hotword-Max-Duration
| Interpret-Text
| Vendor-Specific ; see Generic Headers
; Headers added for html-speech/1.0
| audio-codec ; The audio codec used in an input media stream
| active-grammars ; Specifies a grammar or specific rule to activate.
| inactive-grammars ; Specifies a grammar or specific rule to deactivate.
| hotword ; Whether to listen in "hotword" mode (i.e. ignore out-of-grammar speech)
| listen-mode ; Whether to do continuous or one-shot recognition
| partial ; Whether to send partial results
| partial-interval ; Suggested interval between partial results, in milliseconds.
| recognizer-state ; Indicates whether the recognizer is listening or idle
| source-time ; The UA local time at which the request was initiated
| user-id ; Unique identifier for the user, so that adaptation can be used to improve accuracy.
| Wave-Start-Time ; The start point of a recognition in the audio referred to by Waveform-URIs.
| Wave-End-Time ; The end point of a recognition in the audio referred to by Waveform-URIs.
| Waveform-URIs ; List of URIs to recorded input streams
hotword = "Hotword:" BOOLEAN
listen-mode = "Listen-Mode:" ("reco-once" | "reco-continuous" | vendor-listen-mode)
vendor-listen-mode = "x-" 1*UTFCHAR
recognizer-state = "Recognizer-State:" ("listening" | "idle")
source-time = "Source-Time:" 1*20DIGIT
audio-codec = "Audio-Codec:" mime-media-type ; see [RFC3555]
partial = "Partial:" BOOLEAN
partial-interval = "Partial-Interval:" 1*5DIGIT
active-grammars = "Active-Grammars:" "<" URI ["#" rule-name] [SP weight] ">" *("," "<" URI ["#" rule-name] [SP weight] ">")
rule-name = 1*UTFCHAR
weight = "0." 1*3DIGIT
inactive-grammars = "Inactive-Grammars:" "<" URI ["#" rule-name] ">" *("," "<" URI ["#" rule-name] ">")
user-id = "User-ID:" 1*UTFCHAR
wave-start-time = "Wave-Start-Time:" 1*DIGIT ["." 1*DIGIT]
wave-end-time = "Wave-End-Time:" 1*DIGIT ["." 1*DIGIT]
waveform-URIs = "Waveform-URIs:" "<" URI ">" *("," "<" URI ">")
TODO: discuss how recognition from file would work.
Headers with the same names as their [MRCPv2] counterparts are considered to have the same specification. Other headers are described as follows:
The Audio-Codec header is used in the START-MEDIA-STREAM request, to specify the codec and parameters used to encode the input stream, using the MIME media type encoding scheme specified in [RFC3555].
The Active-Grammars header specifies a list of grammars, and optionally specific rules within those grammars. The header is used in LISTEN to activate grammars/rules. If no rule is specified for a grammar, the root rule is activated. This header may also specify the weight of the rule.
The Inactive-Grammars header specifies a list of grammars, and optionally specific rules within those grammars, to be deactivated. If no rule is specified, all rules in the grammar are deactivated, including the root rule. The Inactive-Grammars header MAY be used in the LISTEN method.
The Hotword header is analogous to the [MRCPv2] Recognition-Mode header, however it has a different name and boolean type in html-speech/1.0 in order to avoid confusion with the Listen-Mode header. When true, the recognizer functions in "hotword" mode, which essentially means that out-of-grammar speech is ignored.
Listen-Mode is used in the LISTEN request to specify whether the recognizer should listen continuously, or return to the idle state after the first RECOGNITION-COMPLETE event. It MUST NOT be used in any request other than LISTEN. When the recognizer is in the listening state, it should include Listen-Mode in all event and status messages it sends.
This header is required to support the continuous speech scenario on the recognizer resource. When sent by the client in a LISTEN or SET-PARAMS request, this header controls whether or not the client is interested in partial results from the service. In this context, the term 'partial' describes mid-utterance results that provide a best guess at the user's speech thus far (e.g. "deer", "dear father", "dear father christmas"). These results should contain all recognized speech from the point of the last non-partial (i.e. complete) result, but it may be common for them to omit fully-qualified result attributes like an N-best list, timings, etc. The only guarantee is that the content must be EMMA. Note that this header is valid on both regular command-and-control recognition requests and dictation sessions, because at the API level there is no syntactic difference between the recognition types: both are simply recognition requests over an SRGS grammar or set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, for example: open-microphone applications, dictation enrollment applications, and lip-sync. When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, and in response to GET-PARAMS requests.
A suggestion from the client to the service on the frequency at which partial results should be sent. It is an integer value representing the desired interval in milliseconds. The recognizer does not need to precisely honor the requested interval, but SHOULD provide something close, if it is within the operating parameters of the implementation.
Indicates whether the recognizer is listening or idle. This MUST NOT be included by the client in any requests, and MUST be included by the recognizer in all status and event messages it sends.
Indicates the timestamp of a message using the client's local time. All requests sent from the client to the recognizer MUST include the Source-Time header, which must faithfully specify the client's local system time at the moment it sends the request. This enables the recognizer to correctly synchronize requests with the precise point in the input stream at which they were actually sent by the client. All event messages sent by the recognizer MUST include the Source-Time, calculated by the recognizer service based on the point in the input stream at which the event occurred, and expressed in the client's local clock time (since the recognizer knows what this was at the start of the input stream). By expressing all times in client-time, the user agent or application is able to correctly sequence events, and implement timing-sensitive scenarios, that involve other objects outside the knowledge of the recognizer service (for example, media playback objects or videogame states).
Recognition results are often more accurate if the recognizer can train itself to the user's speech over time. This is especially the case with dictation, as vocabularies are so large. A User-ID field allows the recognizer to establish the user's identity, if the webapp decides to supply this information.
Some applications will wish to re-recognize an utterance using different grammars. For example, an application may accept a broad range of input, and use the first round of recognition simply to classify an utterance so that it can use a more focused grammar on the second round. Others will wish to record an utterance for future use. For example, an application that transcribes an utterance to text may store a recording so that untranscribed information (tone, emotion, etc) is not lost. While these are not mainstream scenarios, they are both valid and inevitable, and may be achieved using the headers provided for recognition.
If the Save-Waveform header is set to true (with SET-PARAMS or LISTEN), then the recognizer will save the input audio. Consequent RECOGNITION-COMPLETE events sent by the recognizer will contain a URI in the Waveform-URI header which refers to the stored audio (multiple URIs when multiple input streams are present). In the case of continuous recognition, the Waveform-URI header refers to all of the audio captured so far. The application may fetch the audio from this URI, assuming it has appropriate credentials (the credential policy is determined by the service provider). The application may also use the URI as input to future LISTEN requests by passing the URI in the Input-Waveform-URI header.
When RECOGNITION-COMPLETE returns a Waveform-URI header, it also returns the time interval within the recorded waveform that the recognition result applies to, in the Wave-Start-Time and Wave-End-Time headers, which indicate the offsets in seconds from the start of the waveform. A client MAY also use the Source-Time header of other events such as START-OF-SPEECH and END-OF-SPEECH to calculate other intervals of interest. When using the Input-Waveform-URI header, the client may suffix the URI with an "interval" parameter to indicate that the recognizer should only decode that particular interval of the audio:
interval = "interval=" start "," end
start = seconds | "start"
end = seconds | "end"
seconds = 1*DIGIT ["." 1*DIGIT]
For example:
http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86
http://example.com/temp44235.wav?interval=0.65,end
When the Input-Waveform-URI header is used, all other input streams are ignored.
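Putting these headers together, a client might re-recognize part of a retained recording as follows; a non-normative sketch in which the waveform URI (as returned in a Waveform-URIs header), the grammar, and the request-id are hypothetical:

// Sketch: re-recognize seconds 0.3-2.86 of a previously retained recording
// against a more focused grammar.
const waveformUri = "http://example.com/retainedaudio/fe429ac870a";
socket.send(buildRequest("LISTEN", 8340, {
    "Resource-ID": "recognizer",
    "Listen-Mode": "reco-once",
    "Source-Time": rfc3339Now(),
    "Input-Waveform-URI": `${waveformUri}?interval=0.3,2.86`,
    "Active-Grammars": "<http://myserver/refined-grammar.grxml>",
}));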
Speech services MAY support pre-defined grammars that can be referenced through a 'builtin:' uri. For example:
builtin:dictation?context=email&lang=en_US
builtin:date
builtin:search?context=web
These can be used as top-level grammars in the Active-Grammars/Inactive-Grammars headers, or in rule references within other grammars. If a speech service does not support the referenced builtin, or does not support using the builtin in combination with the other active grammars, it should return a grammar compilation error.
The specific set of predefined grammars is to be defined later. However, there MUST be a certain small set of predefined grammars that a user agent's default speech recognizer MUST support. For non-default recognizers, support for predefined grammars is optional, and the set that is supported is also defined by the service provider and may include proprietary grammars (e.g. builtin:x-acme-parts-catalog).
Start streaming audio:
C->S: binary message: start of stream (stream-id = 112233)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
|                          61 75 64 69                          |   a u d i
|                          6F 2F 61 6D                          |   o / a m
|                          72 2D 77 62                          |   r - w b
+---------------------------------------------------------------+
C->S: binary message: media packet (stream-id = 112233)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|                       encoded audio data                      |
|                              ...                              |
|                              ...                              |
|                              ...                              |
+---------------------------------------------------------------+
C->S: more binary media packets...
Send the LISTEN request:
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: recognizer
Confidence-Threshold:0.9
Active-Grammars: <builtin:dictation?context=message>
Listen-Mode: reco-once
Source-time: 2011-09-06T21:47:31.981+01:30 (where in the input stream recognition should start)
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
C->S: more binary media packets...
C->S: binary audio packets...
C->S: binary audio packet in which the user stops talking
C->S: binary audio packets...
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE (because mode = reco-once, the request completes when recognition completes)
Resource-ID: recognizer
<emma:emma version="1.0"
...etc
C->S: binary message: start of stream (stream-id = 112233)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
|                          61 75 64 69                          |   a u d i
|                          6F 2F 61 6D                          |   o / a m
|                          72 2D 77 62                          |   r - w b
+---------------------------------------------------------------+
C->S: binary message: media packet (stream-id = 112233)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|0 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|                       encoded audio data                      |
|                              ...                              |
|                              ...                              |
|                              ...                              |
+---------------------------------------------------------------+
C->S: more binary media packets...
C->S: html-speech/1.0 LISTEN 8322
Resource-ID: recognizer
Confidence-Threshold:0.9
Active-Grammars: <builtin:dictation?context=message>
Listen-Mode: reco-continuous
Partial: TRUE
Source-time: 2011-09-06T21:47:31.981+01:30 (where in the input stream recognition should start)
C->S: more binary media packets...
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
Source-time: 2011-09-06T21:47:32.517+01:30 (when speech was detected)
C->S: more binary media packets...
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
C->S: more binary media packets...
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: more binary media packets...
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)
C->S: more binary media packets...
S->C: html-speech/1.0 START-OF-SPEECH 8322 IN-PROGRESS
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS
S->C: html-speech/1.0 INTERMEDIATE-RESULT 8322 IN-PROGRESS
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 IN-PROGRESS (because mode = reco-continuous, the request remains IN-PROGRESS)
S->C: html-speech/1.0 END-OF-SPEECH 8322 IN-PROGRESS (i.e. the recognizer has detected the user stopped talking)
C->S: binary audio packet: end of stream (i.e. since the recognizer has signaled end of input, the UA decides to terminate the stream)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type  |                   stream-id                   |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
S->C: html-speech/1.0 RECOGNITION-COMPLETE 8322 COMPLETE
Recognizer-State:idle
Completion-Cause: 080
Completion-Reason: No Input Streams
Example showing a 1-best result with XML semantics within emma:interpretation. The 'interpretation' is contained within the emma:interpretation element. The 'utterance' is the value of emma:tokens, and the 'confidence' is the value of emma:confidence.
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:grammar id="gram1"
grammar-type="application/srgs-xml" <!-- From EMMA 1.1 -->
ref="http://acme.com/flightquery.grxml"/>
<emma:interpretation id="int1"
emma:start="1087995961542"
emma:end="1087995963542"
emma:medium="acoustic"
emma:mode="voice"
emma:confidence="0.75"
emma:lang="en-US"
emma:grammar-ref="gram1"
emma:media-type="audio/x-wav; rate:8000;"
emma:signal="http://example.com/signals/145.wav"
emma:tokens="flights from boston to denver"
emma:process="http://example.com/my_asr.xml">
<origin>Boston</origin>
<destination>Denver</destination>
</emma:interpretation>
</emma:emma>
Similar example but with a JSON semantic payload rather than XML.
<emma:emma
version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:grammar id="gram2"
grammar-type="application/srgs-xml"
ref="http://acme.com/pizzaorder.grxml"/>
<emma:interpretation id="int1"
emma:start="1087995961542"
emma:end="1087995963542"
emma:confidence=".75"
emma:medium="acoustic"
emma:mode="voice"
emma:verbal="true"
emma:function="dialog"
emma:lang="en-US"
emma:grammar-ref="gram2"
emma:media-type="audio/x-wav; rate:8000;"
emma:signal="http://example.com/signals/367.wav"
emma:tokens="a medium coke and 3 large pizzas with pepperoni and mushrooms"
emma:process="http://example.com/my_asr.xml">
<emma:literal>
<![CDATA[
{
drink: {
liquid:"coke",
drinksize:"medium"},
pizza: {
number: "3",
pizzasize: "large",
topping: [ "pepperoni", "mushrooms" ]
}
}
]]>
</emma:literal>
</emma:interpretation>
</emma:emma>
In EMMA 1.1 there is an attribute to specify the type of interpretation payload: emma:semantic-rep="json".
Example showing multiple recognition results and their associated interpretations.
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:grammar id="gram1"
grammar-type="application/srgs-xml"
ref="http://acme.com/flightquery.grxml"/>
<emma:grammar id="gram2"
grammar-type="application/srgs-xml"
ref="http://acme.com/pizzaorder.grxml"/>
<emma:one-of id="r1"
emma:start="1087995961542"
emma:end="1087995963542"
emma:medium="acoustic"
emma:mode="voice"
emma:lang="en-US"
emma:media-type="audio/x-wav; rate:8000;"
emma:signal="http://example.com/signals/789.wav"
emma:process="http://example.com/my_asr.xml">
<emma:interpretation id="int1"
emma:confidence="0.75"
emma:tokens="flights from boston to denver"
emma:grammar-ref="gram1">
<origin>Boston</origin>
<destination>Denver</destination>
</emma:interpretation>
<emma:interpretation id="int2"
emma:confidence="0.68"
emma:tokens="flights from austin to denver"
emma:grammar-ref="gram1">
<origin>Austin</origin>
<destination>Denver</destination>
</emma:interpretation>
</emma:one-of>
</emma:emma>
In the case of a no-match the EMMA result returned MUST be annotated as emma:uninterpreted="true".
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:interpretation id="interp1"
emma:uninterpreted="true"
emma:medium="acoustic"
emma:mode="voice"
emma:process="http://example.com/my_asr.xml"/>
</emma:emma>
In the case of no input, the EMMA interpretation returned must be annotated as emma:no-input="true" and the <emma:interpretation> element must be empty.
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:interpretation id="int1"
emma:no-input="true"
emma:medium="acoustic"
emma:mode="voice"
emma:process="http://example.com/my_asr.xml"/>
</emma:emma>
Example showing a multimodal interpretation resulting from combination of speech input with a mouse event passed in through a control metadata message.
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:interpretation
emma:medium="acoustic tactile"
emma:mode="voice touch"
emma:lang="en-US"
emma:start="1087995963542"
emma:end="1087995964542"
emma:process="http://example.com/myintegrator.xml">
<emma:derived-from resource="voice1" composite="true"/>
<emma:derived-from resource="touch1" composite="true"/>
<command>
<action>zoom</action>
<location>
<point>42.1345 -37.128</point>
</location>
</command>
</emma:interpretation>
<emma:derivation>
<emma:interpretation id="voice1"
emma:medium="acoustic"
emma:mode="voice"
emma:lang="en-US"
emma:start="1087995963542"
emma:end="1087995964542"
emma:media-type="audio/x-wav; rate:8000;"
emma:tokens="zoom in here"
emma:signal="http://example.com/signals/456.wav"
emma:process="http://example.com/my_asr.xml">
<command>
<action>zoom</action>
<location/>
</command>
</emma:interpretation>
<emma:interpretation id="touch1"
emma:medium="tactile"
emma:mode="touch"
emma:start="1087995964000"
emma:end="1087995964000">
<point>42.1345 -37.128</point>
</emma:interpretation>
</emma:derivation>
</emma:emma>
As an example of a lattice of recognition hypotheses, in a travel application where the destination is either "Boston" or "Austin" and the origin is either "Portland" or "Oakland", the possibilities might be represented in a lattice as follows:
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2003/04/emma
http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
xmlns="http://www.example.com/example">
<emma:grammar id="gram1"
grammar-type="application/srgs-xml"
ref="http://acme.com/flightquery.grxml"/>
<emma:interpretation id="interp1"
emma:medium="acoustic"
emma:mode="voice"
emma:start="1087995961542"
emma:end="1087995963542"
emma:medium="acoustic"
emma:mode="voice"
emma:confidence="0.75"
emma:lang="en-US"
emma:grammar-ref="gram1"
emma:signal="http://example.com/signals/123.wav"
emma:media-type="audio/x-wav; rate:8000;"
emma:process="http://example.com/my_asr.xml">
<emma:lattice initial="1" final="8">
<emma:arc from="1" to="2">flights</emma:arc>
<emma:arc from="2" to="3">to</emma:arc>
<emma:arc from="3" to="4">boston</emma:arc>
<emma:arc from="3" to="4">austin</emma:arc>
<emma:arc from="4" to="5">from</emma:arc>
<emma:arc from="5" to="6">portland</emma:arc>
<emma:arc from="5" to="6">oakland</emma:arc>
<emma:arc from="6" to="7">today</emma:arc>
<emma:arc from="7" to="8">please</emma:arc>
<emma:arc from="6" to="8">tomorrow</emma:arc>
</emma:lattice>
</emma:interpretation>
</emma:emma>
In HTML speech applications, the synthesizer service does not participate directly in the user interface. Rather, it simply provides rendered audio upon request, similar to any media server, plus interim events such as marks. The UA buffers the rendered audio, and the application may choose to play it to the user at some point completely unrelated to the synthesizer service. It is the synthesizer's role to render the audio stream in a timely manner, at least rapidly enough to support real-time playback. The synthesizer MAY also render and transmit the stream faster than required for real-time playback, or render multiple streams in parallel, in order to reduce latency in the application. This is in stark contrast to the IVR model served by MRCP, where the synthesizer essentially renders directly to the user's telephone and is an active part of the user interface.
The synthesizer MUST support [SSML] and plain text input. A synthesizer MAY also accept other input formats. In all cases, the client should use the Content-Type header to indicate the input format.
synth-method = "SPEAK"
| "STOP"
| "DEFINE-LEXICON"
The set of synthesizer request methods is a subset of those defined in [MRCPv2].
The SPEAK method operates similarly to its [MRCPv2] namesake. The primary difference is that SPEAK results in a new audio stream being sent from the server to the client, using the same Request-ID. A SPEAK request MUST include the Audio-Codec header. When the rendering has completed, and the end-of-stream message has been sent, the synthesizer sends a SPEAK-COMPLETE event.
When the synthesizer receives a STOP request, it ceases rendering the requests specified in the Active-Request-Id-List header. If the Active-Request-Id-List header is missing, it ceases rendering all active SPEAK requests. For any SPEAK request that is ceased, the synthesizer sends an end-of-stream message and a SPEAK-COMPLETE event.
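For illustration, a STOP exchange might look like the following sketch (the request-ID 3260 is invented for this example, some headers such as Completion-Cause are omitted, and the relative ordering of the STOP response and the end-of-stream message is an assumption rather than a requirement):
C->S: html-speech/1.0 STOP 3260
Resource-ID:synthesizer
Active-Request-Id-List:3257
S->C: html-speech/1.0 3260 200 COMPLETE
Resource-ID:synthesizer
S->C: binary audio packet: end of stream for request 3257 ( message type = 0x03 )
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
Resource-ID:synthesizer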
This is used to load or unload a lexicon, and is identical to its namesake in [MRCPv2].
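As a non-normative sketch, a DEFINE-LEXICON request that loads a Pronunciation Lexicon (PLS) document might look as follows (the request-ID, the lexicon content, and the use of application/pls+xml as the content type are illustrative assumptions):
C->S: html-speech/1.0 DEFINE-LEXICON 3261
Resource-ID:synthesizer
Load-Lexicon:true
Content-Type:application/pls+xml
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>Peter</grapheme>
<phoneme>ˈpiːtɚ</phoneme>
</lexeme>
</lexicon>
S->C: html-speech/1.0 3261 200 COMPLETE
Resource-ID:synthesizer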
Synthesis events are associated with 'IN-PROGRESS' request-state notifications from the synthesizer resource.
synth-event = "INTERIM-EVENT" ; See Interim Events above
| "SPEECH-MARKER" ; An SSML mark has been rendered
| "SPEAK-COMPLETE"
See Interim Events above.
This event indicates that an SSML mark has been rendered. It uses the Speech-Marker header, which contains a timestamp indicating where in the stream the mark occurred, and the label associated with the mark.
Implementations should send the SPEECH-MARKER event as close as possible in time to the corresponding media packet, so that clients may play the media and fire events in real time if needed.
Indicates that rendering of the SPEAK request has completed.
The synthesis headers used in html-speech/1.0 are mostly a subset of those in [MRCPv2], with some minor modifications and additions.
synth-header = ; headers borrowed from [MRCPv2]
active-request-id-list
| Completion-Cause
| Completion-Reason
| Voice-Gender
| Voice-Age
| Voice-Variant
| Voice-Name
| Prosody-parameter ; Actually a collection of headers, see [MRCPv2]
| Speech-Marker
| Speech-Language
| Failed-URI
| Failed-URI-Cause
| Load-Lexicon
| Lexicon-Search-Order
| Vendor-Specific ; see Generic Headers
; new headers for html-speech/1.0
| Audio-Codec
| Stream-ID ; read-only
Speech-Marker = "Speech-Marker:" "timestamp" "=" date-time [";" 1*(UTFCHAR)]
; e.g. Speech-Marker:timestamp=2011-09-06T10:33:16.612Z;banana
Audio-Codec = "Audio-Codec:" mime-media-type ; See [RFC3555]
Stream-ID = "Stream-ID:" 1*8DIGIT ; decimal representation of 24-bit stream-ID
Because an audio stream is created in response to a SPEAK request, the audio codec and parameters must be specified in the SPEAK request, or in SET-PARAMS, using the Audio-Codec header. If the synthesizer is unable to encode with this codec, it terminates the request with a 409 (unsupported header field) COMPLETE status message.
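For example (an illustrative sketch; the codec name and request-ID are invented for this example), a SPEAK request that specifies a codec the service cannot produce might be rejected as follows:
C->S: html-speech/1.0 SPEAK 3262
Resource-ID:synthesizer
Audio-codec:audio/example-unsupported
Content-Type:text/plain
Hello, world.
S->C: html-speech/1.0 3262 409 COMPLETE
Resource-ID:synthesizer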
This header indicates when an SSML mark was rendered. It is similar to its namesake in [MRCPv2], except that the clock is defined as the local time at the service, and the timestamp format is as defined in this document. By using the timestamp from the beginning of the stream and the timestamp of this event, the UA can calculate when to raise the event to the application based on where it is in the playback of the rendered stream.
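As a worked example using the timestamps that appear later in this document: if the stream begins at timestamp 2011-09-06T10:33:16.612Z and a SPEECH-MARKER event carries timestamp=2011-09-06T10:33:18.310Z, the mark lies 18.310 - 16.612 = 1.698 seconds into the rendered audio, so a UA that starts playback of that stream at local time T would raise the mark event to the application at approximately T + 1.698 seconds.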
Specifies the ID of the stream that contains the rendered audio, so that the UA can associate audio streams it receives with particular SPEAK requests. This is a read-only parameter, returned in responses to the SPEAK request.
The most straightforward use case for TTS is the synthesis of one utterance at a time. This is essential for just-in-time rendering of speech, for example in dialogue systems or in-car navigation scenarios. Here, the web application sends a single SPEAK request to the speech service.
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Audio-codec:audio/flac
Speech-Language: de-DE
Content-Type:text/plain
Hallo, ich heiße Peter.
S->C: html-speech/1.0 3257 200 IN-PROGRESS
Resource-ID:synthesizer
Stream-ID: 112233
Speech-Marker:timestamp=2011-09-06T10:33:16.612Z
S->C: binary message: start of stream (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 66 6C | o / f l
| 61 63 +-------------------------------+ a c
+-------------------------------+
S->C: more binary media packets...
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
Resource-ID:Synthesizer
Completion-Cause:000 normal
Speech-Marker:timestamp=2011-09-06T10:33:26.922Z
S->C: binary audio packets...
S->C: binary audio packet: end of stream ( message type = 0x03 )
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
For richer markup of the text, it is possible to use the SSML format for sending an annotated request. For example, it is possible to propose an appropriate pronunciation or to indicate where to insert pauses. (SSML example adapted from http://www.w3.org/TR/speech-synthesis11/#edef_break)
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Voice-gender:neutral
Voice-Age:25
Audio-codec:audio/flac
Prosody-volume:medium
Content-Type:application/ssml+xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
Please make your choice. <break time="3s"/>
Click any of the buttons to indicate your preference.
</speak>
Remainder of example as above
Some use cases require relatively static speech output which is known at the time the web page is loaded. In these cases, all required speech output can be requested in parallel as multiple concurrent requests. Callback methods in the web API are responsible for relating each speech stream to the appropriate place in the web application.
At the protocol level, requesting multiple speech streams concurrently is realized as follows.
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Audio-codec:audio/basic
Speech-Language: es-ES
Content-Type:text/plain
Hola, me llamo Maria.
C->S: html-speech/1.0 SPEAK 3258
Resource-ID:synthesizer
Audio-codec:audio/basic
Speech-Language: en-GB
Content-Type:text/plain
Hi, I'm George.
C->S: html-speech/1.0 SPEAK 3259
Resource-ID:synthesizer
Audio-codec:audio/basic
Speech-Language: de-DE
Content-Type:text/plain
Hallo, ich heiße Peter.
S->C: html-speech/1.0 3257 200 IN-PROGRESS
S->C: media for 3257
S->C: html-speech/1.0 3258 200 IN-PROGRESS
S->C: media for 3258
S->C: html-speech/1.0 3259 200 IN-PROGRESS
S->C: media for 3259
S->C: more media for 3257
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
S->C: more media for 3258
S->C: html-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE
S->C: more media for 3259
S->C: html-speech/1.0 SPEAK-COMPLETE 3259 COMPLETE
The service MAY choose to serialize its processing of certain requests (such as only rendering one SPEAK request at a time), but MUST still accept multiple active requests.
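As an illustrative sketch (the exact interleaving and the timing of the acknowledgements are not constrained here), a service that renders only one SPEAK request at a time might instead produce:
S->C: html-speech/1.0 3257 200 IN-PROGRESS
S->C: all media for 3257
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
S->C: html-speech/1.0 3258 200 IN-PROGRESS
S->C: all media for 3258
S->C: html-speech/1.0 SPEAK-COMPLETE 3258 COMPLETE
S->C: html-speech/1.0 3259 200 IN-PROGRESS
S->C: all media for 3259
S->C: html-speech/1.0 SPEAK-COMPLETE 3259 COMPLETE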
In order to synchronize the speech content with other events in the web application, it is possible to mark relevant points in time using the SSML <mark> tag. When the speech is played back, a callback method is called for each of these markers, allowing the web application to present, for example, visual content in synchrony with the speech.
(Example adapted from http://www.w3.org/TR/speech-synthesis11/#S3.3.2)
C->S: html-speech/1.0 SPEAK 3257
Resource-ID:synthesizer
Voice-gender:neutral
Voice-Age:25
Audio-codec:audio/flac
Prosody-volume:medium
Content-Type:application/ssml+xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
Would you like to sit <mark name="window_seat"/> here at the window, or
rather <mark name="aisle_seat"/> here at the aisle?
</speak>
S->C: html-speech/1.0 3257 200 IN-PROGRESS
Resource-ID:synthesizer
Stream-ID: 112233
Speech-Marker:timestamp=2011-09-06T10:33:16.612Z
S->C: binary message: start of stream (stream-id = 112233)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 0 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+
|1 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0| } NTP Timestamp
|0 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 0 1| }
| 61 75 64 69 | a u d i
| 6F 2F 66 6C | o / f l
| 61 63 +-------------------------------+ a c
+-------------------------------+
S->C: more binary media packets...
S->C: html-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
Resource-ID:synthesizer
Stream-ID: 112233
Speech-Marker:timestamp=2011-09-06T10:33:18.310Z;window_seat
S->C: more binary media packets...
S->C: html-speech/1.0 SPEECH-MARKER 3257 IN-PROGRESS
Resource-ID:synthesizer
Stream-ID: 112233
Speech-Marker:timestamp=2011-09-06T10:33:21.008Z;aisle_seat
S->C: more binary media packets...
S->C: html-speech/1.0 SPEAK-COMPLETE 3257 COMPLETE
Resource-ID:Synthesizer
Completion-Cause:000 normal
Speech-Marker:timestamp=2011-09-06T10:33:23.881Z
S->C: binary audio packets...
S->C: binary audio packet: end of stream ( message type = 0x03 )
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| message type | stream-id |
|1 1 0 0 0 0 0 0|1 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0|
+---------------+-----------------------------------------------+