Authored by Nuance with contributions from Microsoft and AT&T
Generic Methods - Available on both the 'recognizer' and 'synthesizer' resources. We may also want to investigate supporting these methods at the session level (i.e. without a target resource).
Recognizer Methods - Available on the 'recognizer' resource.
Synthesis Methods - Available on the 'synthesizer' (aka TTS) resource.
Request States - By MRCP2 convention, all communication from the server is labeled with a request state; see the example following these definitions.
Recognition Events - Associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.
Synthesis Events - Associated with 'IN-PROGRESS' request-state notifications from the 'synthesizer' resource.
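For illustration only, and by analogy with MRCPv2 (the exact message syntax used by this protocol is not shown in this section), a recognizer event carries its request state at the end of the event line; the message-length and request-id values below are purely illustrative:

  MRCP/2.0 99 START-OF-INPUT 543257 IN-PROGRESS
  Channel-Identifier: 32AECB23433801@speechrecog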
It’s unclear whether server push through the HTML Speech protocol and API is required functionality. These messages could, for example, be accomplished outside the
specification using a separate WebSocket connection. But if this is found to be convenient, then we may choose to define a server-driven notification mechanism as follows:
server-notification = version SP NOTIFY CRLF [body]

Note that such a notification lacks support for MRCP infrastructure like request-ids and headers. These were omitted because I don't see how the client browser would make sense of the data. If webapps require support for request-ids, parameters, etc., they would probably be best encoded within the message [body].
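For example, a server notification carrying a JSON payload might look like the following; the version token and the body content here are purely hypothetical:

  HTML-SPEECH/1.0 NOTIFY
  {"event":"service-status","detail":"recognizer-overloaded"}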
If the [body] was detected as being XML or JSON, it would be nice if the client browser could automatically reflect the data as a DOM or ECMAScript object. But I don't know much about that sort of technology, so I would need someone else to comment.
When sent by the client in a RECOGNITION or SET-PARAMS request, this header controls whether the client is interested in “partial” results from the service. In this context, “partial” describes mid-utterance results that provide a best guess at the user’s speech so far (e.g. “deer”, “dear father”, “dear father christmas”). These results should contain all recognized speech since the last non-partial (i.e. complete) result, but they may commonly omit fully-qualified result attributes such as an N-best list, timings, etc. The only guarantee is that the content must be EMMA.
Note that this header is valid for both regular command-and-control recognition requests and dictation sessions. This is because, at the API level, there is no syntactic difference between the recognition types: both are simply recognition requests over an SRGS grammar or a set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, such as when doing dictation enrollment or lip-sync.
When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, as well as in response to a GET-PARAMS request.
A suggestion from the client to the service on the frequency at which partial results should be sent. The integer value represents the desired interval, expressed in milliseconds.
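As a sketch, using the hypothetical header names Partial-Results and Partial-Results-Interval (neither name is defined here, and the request and event lines are elided), a client could ask for partial results roughly every 300 milliseconds, and each intermediate result would then be flagged as partial:

  SET-PARAMS
  Partial-Results: true
  Partial-Results-Interval: 300

  INTERMEDIATE-RESULT (request-state IN-PROGRESS)
  Partial-Results: true
  Content-Type: application/emma+xml

  [EMMA body containing the best guess at the speech so far]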
Recognition results are often more accurate if the recognizer can adapt to the user’s speech over time. This is especially the case with dictation, where vocabularies are so large. A User-ID field would allow the recognizer to establish the user’s identity if the webapp decided to supply this information. Otherwise the engine would operate with its default language models.
Defines the phrase that the user has spoken so that the recognizer can refine its models. Although User-Gender and User-ID are two headers that will commonly be supplied in conjunction with the enrollment phrase, they can all be specified independently.
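For illustration, assuming a hypothetical Enrollment-Phrase header name (only the User-ID and User-Gender names appear above), an enrollment request might carry headers such as:

  User-ID: 8f3a2c71
  User-Gender: female
  Enrollment-Phrase: dear father christmas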
It’s common for punctuation characters to have alternate names. For example, the ‘.’ character is often called ‘period’ or ‘dot’ depending upon the context. This parameter is a comma-separated list of such aliases, where each alias is denoted with the punctuation character followed by ‘/’ and its spoken name (e.g. “./period”).
If true, the speech service would return a punctuated utterance in addition to the raw utterance. For example, if the user spoke “dear abby”, the result might be ‘dear abby,’.
Some languages require the recognizer to conjugate verbs differently depending upon the gender and "number" of the speaker. For example, in French, this parameter might be set to one of "je", "tu", "vous", etc.
If true, the speech service would return a formatted utterance in addition to the raw utterance. For example, if the user spoke “dear abbey”, the result might be ‘Dear Abbey’.
If true, the speech service would mask suspected profanity in the utterance string. For example, if the user spoke “my dog sally is a good bitch”, the result might be “my dog sally is a good b****”.
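As a sketch with purely hypothetical header names (none of these parameter names are defined above), the formatting behaviour described here might be configured like this:

  SET-PARAMS
  Punctuation-Aliases: ./period, ./dot, !/exclamation point
  Return-Punctuated-Utterance: true
  Return-Formatted-Utterance: true
  Filter-Profanity: true

A recognition result reflecting these settings might then look like the following EMMA fragment: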
<emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice"
                     xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:lattice initial="1" final="12">
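<!-- Raw word lattice: one arc per recognized token, including the unmasked word and the spoken punctuation alias -->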
<emma:arc from="1" to="2">he</emma:arc>
<emma:arc from="2" to="3">bought</emma:arc>
<emma:arc from="3" to="4">a</emma:arc>
<emma:arc from="4" to="5">damn</emma:arc>
<emma:arc from="5" to="6">nice</emma:arc>
<emma:arc from="6" to="7">five</emma:arc>
<emma:arc from="7" to="8">dollar</emma:arc>
<emma:arc from="8" to="9">watch</emma:arc>
<emma:arc from="9" to="10">in</emma:arc>
<emma:arc from="10" to="11">New York</emma:arc>
<emma:arc from="11" to="12">./period</emma:arc>
</emma:lattice>
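<!-- Formatted transcript: capitalization, number formatting ("five dollar" becomes "$5.00"), the "./period" alias, and profanity masking applied -->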
<text>He bought a d*** nice $5.00 watch in New York.</text>
<alignment>Content undefined (i.e. vendor specific)</alignment>
</emma:interpretation>