HTML Speech XG
Proposal for the "Control" portion of the Remote Speech Services Protocol

Version 2, 22nd June, 2011

Authored by Nuance with contributions from Microsoft and AT&T




This proposal is meant to be read in the context of the protocol discussion in the W3C HTML Speech group. Its focus is the control portion of the specification, and it assumes that session definition/initiation and transport are defined elsewhere.

Client Requests

Following MRCP2 convention, the client requests a "method" on a particular remote speech resource. These methods are assumed to be defined as in MRCP2 except where noted.

Generic Methods - Available on both the 'recognizer' and 'synthesizer' resources. We may also want to investigate supporting these methods at the session level (i.e. without a target resource).

Recognizer Methods - Available on the 'recognizer' resource.

Synthesis Methods - Available on the 'synthesizer' (aka TTS) resource.



Server Responses

The server responses to the client requests will be as defined by MRCP2 except where noted.

Request States - By MRCP2 convention, all communication from the server is labeled with a request state.

Recognition Events - Associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.

Synthesis Events - Associated with 'IN-PROGRESS' request-state notifications from the 'synthesizer' resource.
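
As a rough illustration of how a client might act on the request state, the sketch below parses an MRCP2-style response start line and reports whether further events should be expected for that request. The start-line layout (version, message length, request-id, status code, request state) and the three state names follow MRCP2. The function names and the sample line are assumptions made for illustration only, and this proposal may end up framing messages differently.

```python
from enum import Enum
from typing import NamedTuple


class RequestState(Enum):
    """The three request states defined by MRCP2."""
    PENDING = "PENDING"
    IN_PROGRESS = "IN-PROGRESS"
    COMPLETE = "COMPLETE"


class ResponseLine(NamedTuple):
    version: str
    length: int
    request_id: int
    status_code: int
    state: RequestState


def parse_response_line(line: str) -> ResponseLine:
    """Parse 'version SP length SP request-id SP status-code SP request-state'."""
    version, length, request_id, status, state = line.strip().split(" ")
    return ResponseLine(version, int(length), int(request_id),
                        int(status), RequestState(state))


def expects_more_events(resp: ResponseLine) -> bool:
    """A request that is not yet COMPLETE may still produce events."""
    return resp.state is not RequestState.COMPLETE


# Hypothetical sample line; the numeric values are illustrative.
resp = parse_response_line("MRCP/2.0 79 543257 200 IN-PROGRESS")
print(resp.state.value, expects_more_events(resp))
```

A client would typically keep its per-request bookkeeping alive until it sees COMPLETE for the matching request-id.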



Server Notifications

Within MRCP2, the server may only send messages in response to a client-driven request. Client polling via GET-PARAMS is therefore the only way to approximate “pushing” a message from the server to the client.

It’s unclear whether server push through the HTML Speech protocol and API is required functionality. Such messages could, for example, be delivered outside the specification using a separate WebSocket connection. But if server push within the protocol is found to be convenient, then we may choose to define a server-driven notification mechanism as follows:

server-notification = version SP NOTIFY CRLF [body]

Note that such a notification lacks support for MRCP infrastructure like request-ids and headers. These were omitted because I don’t see how the client browser would make sense of the data. If webapps require support for request-ids, parameters, etc., they would probably be best encoded within the message [body].

If the [body] were detected as being XML or JSON, it would be nice if the client browser could automatically expose the data as a DOM or ECMAScript object. But I don’t know much about that sort of technology, so I would need someone else to comment.
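
To make the notification grammar concrete, here is a minimal sketch of how a receiver might split a server-notification into its start line and optional body, then try to decode the body as JSON or XML. It is only an illustration under assumptions: the "MRCP/2.0" version token, the function name parse_notification, and the sample JSON payload are hypothetical, since the proposal defines neither the version string nor any body format.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Tuple


def parse_notification(message: str) -> Tuple[str, Any]:
    """Split 'version SP NOTIFY CRLF [body]' and decode the body if possible.

    Returns (kind, value), where kind is 'none', 'json', 'xml' or 'text'.
    """
    start_line, sep, body = message.partition("\r\n")
    if not sep:
        raise ValueError("missing CRLF after start line")
    parts = start_line.split(" ")
    if len(parts) != 2 or not parts[0] or parts[1] != "NOTIFY":
        raise ValueError("not a server notification: %r" % start_line)
    if not body:
        return ("none", None)  # the body is optional in the grammar
    stripped = body.strip()
    if stripped[:1] in ("{", "["):
        try:
            return ("json", json.loads(stripped))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            pass
    if stripped.startswith("<"):
        try:
            return ("xml", ET.fromstring(stripped))
        except ET.ParseError:
            pass
    return ("text", body)  # fall back to the raw body, undecoded


# Hypothetical notification; the version token and payload are assumptions.
kind, value = parse_notification('MRCP/2.0 NOTIFY\r\n{"status": "ready"}')
print(kind, value)
```

Because the notification carries no headers, the sketch has to sniff the content type from the body itself; a Content-Type style header would make this detection unnecessary, at the cost of a more complex message.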



Headers

Following MRCP2 convention, headers will be used to communicate parameters and other standard formatted information from client to server and vice versa. The set of useful parameters will be as defined in MRCP2 except as follows:
New headers will be required to support the continuous speech scenario on the 'recognizer' resource:
The following set of new 'recognizer' resource headers revolves around post-recognition modifications of the utterance string. The selected modifications are merged together and presented in parallel with the raw utterance (see example below).
Assuming all of the above three formatting parameters were set to true, the user utterance “he bought a damn nice five dollar watch in new york period” might result in the following RECOGNITION-COMPLETE payload:

<emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="12">
      <emma:arc from="1" to="2">he</emma:arc>
      <emma:arc from="2" to="3">bought</emma:arc>
      <emma:arc from="3" to="4">a</emma:arc>
      <emma:arc from="4" to="5">damn</emma:arc>
      <emma:arc from="5" to="6">nice</emma:arc>
      <emma:arc from="6" to="7">five</emma:arc>
      <emma:arc from="7" to="8">dollar</emma:arc>
      <emma:arc from="8" to="9">watch</emma:arc>
      <emma:arc from="9" to="10">in</emma:arc>
      <emma:arc from="10" to="11">New York</emma:arc>
      <emma:arc from="11" to="12">./period</emma:arc>
    </emma:lattice>
    <text>He bought a d*** nice $5.00 watch in New York.</text>
    <alignment>Content undefined (i.e. vendor specific)</alignment>
</emma:interpretation>
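
As an illustration of how a client might consume such a payload, the sketch below recovers the raw token sequence from the lattice and the formatted string from the <text> element. It makes several assumptions: the fragment is wrapped in an <emma:emma> root that binds the emma prefix to the EMMA 1.0 namespace (the fragment above omits the declaration), the lattice is a single linear path, and the function name read_interpretation is hypothetical. The vendor-specific <alignment> element is left out.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace

# The payload from the example, wrapped in an assumed <emma:emma> root.
PAYLOAD = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="12">
      <emma:arc from="1" to="2">he</emma:arc>
      <emma:arc from="2" to="3">bought</emma:arc>
      <emma:arc from="3" to="4">a</emma:arc>
      <emma:arc from="4" to="5">damn</emma:arc>
      <emma:arc from="5" to="6">nice</emma:arc>
      <emma:arc from="6" to="7">five</emma:arc>
      <emma:arc from="7" to="8">dollar</emma:arc>
      <emma:arc from="8" to="9">watch</emma:arc>
      <emma:arc from="9" to="10">in</emma:arc>
      <emma:arc from="10" to="11">New York</emma:arc>
      <emma:arc from="11" to="12">./period</emma:arc>
    </emma:lattice>
    <text>He bought a d*** nice $5.00 watch in New York.</text>
  </emma:interpretation>
</emma:emma>"""


def read_interpretation(xml_text):
    """Return (raw_tokens, formatted_text) for the first interpretation."""
    ns = {"emma": EMMA_NS}
    root = ET.fromstring(xml_text)
    interp = root.find("emma:interpretation", ns)
    arcs = interp.findall("emma:lattice/emma:arc", ns)
    # Order arcs by start node; valid only for a linear, single-path lattice.
    arcs.sort(key=lambda arc: int(arc.get("from")))
    raw_tokens = [arc.text for arc in arcs]
    # <text> carries no namespace prefix in the example, so none is used here.
    formatted = interp.findtext("text")
    return raw_tokens, formatted


raw_tokens, formatted = read_interpretation(PAYLOAD)
print(" ".join(raw_tokens))
print(formatted)
```

Keeping the raw lattice alongside the formatted string lets a web application choose between the two, for example showing the formatted text to the user while matching commands against the raw tokens.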


References

[MRCP2]
MRCP 2.0
[EMMA]
EMMA 1.0