Authored by Nuance with contributions from Microsoft and AT&T
Generic Methods - Available on both the 'recognizer' and 'synthesizer' resources. We may also want to investigate supporting these methods at the session level (i.e. without a target resource).
Recognizer Methods - Available on the 'recognizer' resource.
Synthesis Methods - Available on the 'synthesizer' (aka TTS) resource.
Request States - By MRCP2 convention, all communication from the server is labeled with a request state; see the example following these definitions.
Recognition Events - Associated with 'IN-PROGRESS' request-state notifications from the 'recognizer' resource.
Synthesis Events - Associated with 'IN-PROGRESS' request-state notifications from the 'synthesizer' resource.
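For illustration only, and by analogy with MRCPv2 (the exact message syntax used by this protocol is not shown in this section), a recognizer event carries its request state at the end of the event line; the message-length and request-id values below are purely illustrative:

  MRCP/2.0 99 START-OF-INPUT 543257 IN-PROGRESS
  Channel-Identifier: 32AECB23433801@speechrecog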
It’s unclear whether server push through the HTML Speech protocol and API is required functionality. These messages could, for example, be accomplished outside the
specification using a separate WebSocket connection. But if this is found to be convenient, then we may choose to define a server-driven notification mechanism as follows:
server-notification = version SP NOTIFY CRLF [body]

Note that such a notification lacks support for MRCP infrastructure like request-ids and headers. These were omitted because I don't see how the client browser would make sense of the data. If webapps require support for request-ids, parameters, etc., they would probably be best encoded within the message [body].
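For example, a server notification carrying a JSON payload might look like the following; the version token and the body content here are purely hypothetical:

  HTML-SPEECH/1.0 NOTIFY
  {"event":"service-status","detail":"recognizer-overloaded"}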
If the [body] was detected as being XML or JSON, it would be nice if the client browser could automatically reflect the data as a DOM or ECMAScript object. But I don't know much about that sort of technology, so I would need someone else to comment.
When sent by the client in a RECOGNITION or SET-PARAMS request, this header controls whether the client is interested in “partial” results from the service. In this context, “partial” describes mid-utterance results that provide a best guess at the user’s speech so far (e.g. “deer”, “dear father”, “dear father christmas”). These results should contain all recognized speech since the last non-partial (i.e. complete) result, but they may commonly omit fully-qualified result attributes such as an N-best list, timings, etc. The only guarantee is that the content must be EMMA.
Note that this header is valid for both regular command-and-control recognition requests and dictation sessions. This is because, at the API level, there is no syntactic difference between the recognition types: both are simply recognition requests over an SRGS grammar or a set of URL(s). Additionally, partial results can be useful in command-and-control scenarios, such as when doing dictation enrollment or lip-sync.
When sent by the server, this header indicates whether the message contents represent a full or partial result. It is valid for a server to send this header on INTERMEDIATE-RESULT and RECOGNITION-COMPLETE events, as well as in response to a GET-PARAMS request.
A suggestion from the client to the service on the frequency at which partial results should be sent. The integer value represents the desired interval, expressed in milliseconds.
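As a sketch, using the hypothetical header names Partial-Results and Partial-Results-Interval (neither name is defined here, and the request and event lines are elided), a client could ask for partial results roughly every 300 milliseconds, and each intermediate result would then be flagged as partial:

  SET-PARAMS
  Partial-Results: true
  Partial-Results-Interval: 300

  INTERMEDIATE-RESULT (request-state IN-PROGRESS)
  Partial-Results: true
  Content-Type: application/emma+xml

  [EMMA body containing the best guess at the speech so far]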
Recognition results are often more accurate if the recognizer can adapt to the user’s speech over time. This is especially the case with dictation, where vocabularies are so large. A User-ID field would allow the recognizer to establish the user’s identity if the webapp decided to supply this information. Otherwise the engine would operate with its default language models.
Defines the phrase that the user has spoken so that the recognizer can refine its models. Although User-Gender and User-ID are two headers that will commonly be supplied in conjunction with the enrollment phrase, they can all be specified independently.
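For illustration, assuming a hypothetical Enrollment-Phrase header name (only the User-ID and User-Gender names appear above), an enrollment request might carry headers such as:

  User-ID: 8f3a2c71
  User-Gender: female
  Enrollment-Phrase: dear father christmas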
It’s common for punctuation characters to have alternate names. For example, the ‘.’ character is often called ‘period’ or ‘dot’ depending upon the context. This parameter is a comma-separated list of such aliases, where each alias is denoted with the punctuation character followed by ‘/’ and its spoken name (e.g. “./period”).
If true, the speech service would return a punctuated utterance in addition to the raw utterance. For example, if the user spoke “dear abby”, the result might be ‘dear abby,’.
Some languages require the recognizer to conjugate verbs differently depending upon the gender and "number" of the speaker. For example, in French, this parameter might be set to one of "je", "tu", "vous", etc.
If true, the speech service would return a formatted utterance in addition to the raw utterance. For example, if the user spoke “dear abbey”, the result might be ‘Dear Abbey’.
If true, the speech service would mask suspected profanity in the utterance string. For example, if the user spoke “my dog sally is a good bitch”, the result might be “my dog sally is a good b****”.
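As a sketch with purely hypothetical header names (none of these parameter names are defined above), the formatting behaviour described here might be configured like this:

  SET-PARAMS
  Punctuation-Aliases: ./period, ./dot, !/exclamation point
  Return-Punctuated-Utterance: true
  Return-Formatted-Utterance: true
  Filter-Profanity: true

A recognition result reflecting these settings might then look like the following EMMA fragment: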
<emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice"
                     xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:lattice initial="1" final="12">
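<!-- Raw word lattice: one arc per recognized token, including the unmasked word and the spoken punctuation alias -->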
<emma:arc from="1" to="2">he</emma:arc>
<emma:arc from="2" to="3">bought</emma:arc>
<emma:arc from="3" to="4">a</emma:arc>
<emma:arc from="4" to="5">damn</emma:arc>
<emma:arc from="5" to="6">nice</emma:arc>
<emma:arc from="6" to="7">five</emma:arc>
<emma:arc from="7" to="8">dollar</emma:arc>
<emma:arc from="8" to="9">watch</emma:arc>
<emma:arc from="9" to="10">in</emma:arc>
<emma:arc from="10" to="11">New York</emma:arc>
<emma:arc from="11" to="12">./period</emma:arc>
</emma:lattice>
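<!-- Formatted transcript: capitalization, number formatting ("five dollar" becomes "$5.00"), the "./period" alias, and profanity masking applied -->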
<text>He bought a d*** nice $5.00 watch in New York.</text>
<alignment>Content undefined (i.e. vendor specific)</alignment>
</emma:interpretation>