- From: Young, Milan <Milan.Young@nuance.com>
- Date: Fri, 10 Jun 2011 11:46:09 -0700
- To: Robert Brown <Robert.Brown@microsoft.com>, "Satish Sampath (Google)" <satish@google.com>, <gshires@google.com>, "Marc Schroeder (DFKI)" <marc.schroeder@dfki.de>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "JOHNSTON, MICHAEL J (MICHAEL J)" <johnston@research.att.com>
- CC: "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
- Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0B8C91CC@SUN-EXCH01.nuance.com>
Robert's draft referenced a few placeholder control methods and headers
that were "inspired from MRCP". This is a start at making these
sections more concrete.
One notable omission is handling of continuous recognition results and
corrections. I will follow up on this section early next week.
---------------------------
Client Requests
For the contents of 'recognition-method', I suggest we use the following
as defined by MRCP v2:
SET-PARAMS
GET-PARAMS
DEFINE-GRAMMAR
RECOGNIZE
START-INPUT-TIMERS
STOP
INTERPRET
... and for 'synthesizer-method':
SET-PARAMS
GET-PARAMS
SPEAK
STOP
PAUSE
RESUME
BARGE-IN-OCCURRED
CONTROL
DEFINE-LEXICON
I suggest we also add a recorder resource (this probably needs
discussion in the API group). Although there are other ways to pass
recorded audio from client to server, doing it within the protocol has
some nice advantages:
* Consistent use of server-based endpointing and channel
adaptation.
* Shares headers with the other control messages (e.g.
timeouts, cookies, and Channel-Identifier).
* Recorded audio travels over the same network path as the other
speech traffic.
'recorder-method' would be defined as per MRCP v2 using the following
methods:
RECORD
STOP
START-INPUT-TIMERS
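As a minimal sketch, the three method lists above could be captured as per-resource tables, so that either side can reject a method sent to the wrong resource. The resource keys ('recognizer', 'synthesizer', 'recorder') and the helper `is_valid_method` are hypothetical. Method names follow MRCP v2, where the recognizer's timer method is START-INPUT-TIMERS.

```python
# Hypothetical per-resource method tables using MRCP v2 method names.
# The resource keys are illustrative only; they are not defined by any spec.
METHODS = {
    "recognizer": frozenset({
        "SET-PARAMS", "GET-PARAMS", "DEFINE-GRAMMAR", "RECOGNIZE",
        "START-INPUT-TIMERS", "STOP", "INTERPRET",
    }),
    "synthesizer": frozenset({
        "SET-PARAMS", "GET-PARAMS", "SPEAK", "STOP", "PAUSE", "RESUME",
        "BARGE-IN-OCCURRED", "CONTROL", "DEFINE-LEXICON",
    }),
    "recorder": frozenset({"RECORD", "STOP", "START-INPUT-TIMERS"}),
}


def is_valid_method(resource: str, method: str) -> bool:
    """Return True if `method` is allowed on `resource` (case-sensitive)."""
    return method in METHODS.get(resource, frozenset())


if __name__ == "__main__":
    print(is_valid_method("synthesizer", "SPEAK"))  # True
    print(is_valid_method("recorder", "SPEAK"))     # False
```

Note that STOP is shared by all three resources, so a method name alone does not identify the resource; the Channel-Identifier header has to be consulted as well.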
Server Responses
The server's request-state values should be exactly as defined by MRCP v2:
COMPLETE
IN-PROGRESS
PENDING
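In MRCP v2, PENDING means the request is queued behind others on the resource, IN-PROGRESS means it is being processed and further messages will follow, and COMPLETE means no further messages will be sent for that request-id. A minimal client-side tracker for that lifecycle might look like the following; the class name and the "states only move forward" rule are assumptions of this sketch.

```python
from enum import Enum


class RequestState(Enum):
    """MRCP v2 request states, as proposed above for html-speech."""
    PENDING = "PENDING"          # queued behind other requests on the resource
    IN_PROGRESS = "IN-PROGRESS"  # being processed; more messages will follow
    COMPLETE = "COMPLETE"        # done; no further messages for this request-id


# Assumed ordering: a request may stay in a state or move forward, never back.
_ORDER = {RequestState.PENDING: 0, RequestState.IN_PROGRESS: 1, RequestState.COMPLETE: 2}


class RequestTracker:
    """Hypothetical tracker of request state, keyed by request-id."""

    def __init__(self) -> None:
        self._states = {}

    def update(self, request_id: int, token: str) -> RequestState:
        new = RequestState(token)  # raises ValueError for an unknown token
        old = self._states.get(request_id)
        if old is RequestState.COMPLETE:
            raise ValueError(f"request {request_id} is already complete")
        if old is not None and _ORDER[new] < _ORDER[old]:
            raise ValueError(f"request {request_id}: {old.value} -> {new.value} goes backwards")
        self._states[request_id] = new
        return new

    def is_done(self, request_id: int) -> bool:
        return self._states.get(request_id) is RequestState.COMPLETE
```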
For 'recognizer-event', I suggest we use the following as defined by
MRCP v2:
START-OF-INPUT
RECOGNITION-COMPLETE
INTERPRETATION-COMPLETE
... and for 'synthesizer-event':
SPEECH-MARKER
SPEAK-COMPLETE
... and for 'recorder-event':
START-OF-INPUT
RECORD-COMPLETE
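Because START-OF-INPUT belongs to both the recognizer and the recorder, events are best keyed by (resource, event-name). The sketch below also records the request-state each event carries under MRCP v2 usage: START-OF-INPUT and SPEECH-MARKER arrive while the request is IN-PROGRESS, and the *-COMPLETE events close it. The table layout and `check_event` helper are hypothetical.

```python
# Hypothetical table: (resource, event-name) -> request-state the event carries,
# following MRCP v2 usage of these events.
EVENTS = {
    ("recognizer", "START-OF-INPUT"): "IN-PROGRESS",
    ("recognizer", "RECOGNITION-COMPLETE"): "COMPLETE",
    ("recognizer", "INTERPRETATION-COMPLETE"): "COMPLETE",
    ("synthesizer", "SPEECH-MARKER"): "IN-PROGRESS",
    ("synthesizer", "SPEAK-COMPLETE"): "COMPLETE",
    ("recorder", "START-OF-INPUT"): "IN-PROGRESS",
    ("recorder", "RECORD-COMPLETE"): "COMPLETE",
}


def check_event(resource: str, event: str, request_state: str) -> bool:
    """Validate an incoming event; return True if it ends the request.

    Raises ValueError for an event unknown to the resource or an
    unexpected request-state.
    """
    expected = EVENTS.get((resource, event))
    if expected is None:
        raise ValueError(f"{event} is not a {resource} event")
    if request_state != expected:
        raise ValueError(f"{event} should carry {expected}, got {request_state}")
    return expected == "COMPLETE"
```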
Headers
I suggest that we use all the headers defined by MRCP v2 except those
that are specific to verification. Specifically, this means:
* Generic (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-6.2).
* Synthesizer (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-8.4)
* Recognizer (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-9.4)
* Recorder (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-10.4)
Appropriate use of these headers is as defined by the MRCP v2 spec, in
the context of the specific method or response referenced by this
specification.
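Reusing the MRCP v2 headers means reusing their "Name: value" line syntax. A minimal parsing sketch follows, assuming (as in MRCP v2) that header names are case-insensitive. `parse_headers` is a hypothetical helper and ignores line folding and repeated headers.

```python
def parse_headers(lines: list) -> dict:
    """Parse 'Name: value' header lines into a dict keyed by lower-cased name.

    Header names are treated as case-insensitive. In this sketch a repeated
    header simply overwrites the earlier value.
    """
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()
    return headers


if __name__ == "__main__":
    h = parse_headers(["Channel-Identifier: 817@notification", "CONTENT-LENGTH: 36"])
    print(h["content-length"])  # 36
```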
Server Notifications
Within MRCP v2, the server may only send messages in response to a
client-driven request. Client polling via GET-PARAMS is therefore the
only option for "pushing" a message from the server to the client.
It's unclear whether server push through the HTML Speech protocol and
API is required functionality. These messages could, for example, be
delivered outside the specification using a separate WebSocket
connection. On the other hand, frameworks like MMI hinge on the ability
for the server to proactively send state updates to the client.
If this is found to be necessary, then we may choose to add a
'notification-event' to our list of 'event-names'. This new event would
use a status code of '200' and a request state of 'NOTIFY'. The value of
the 'Channel-Identifier' header would use a new resource type called
'notification'. For example:
html-speech/1.0 92 323340 200 NOTIFY
Channel-Identifier: 817@notification
Content-Length: 36
Content-Type: text/xml
<?xml version="1.0"?>
<foo>bar</foo>
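To check that the example is self-consistent, here is a minimal parser sketch for it. It assumes MRCP v2 framing (CRLF line endings, a blank line before the body, Content-Length counting body bytes) and a start-line field order following the MRCP v2 response line: version, message-length, request-id, status-code, request-state. The message-length field is not validated here. `Message`, `parse_message`, and `channel_identifier` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Message:
    version: str
    message_length: int  # second start-line field; not validated in this sketch
    request_id: int
    status_code: int
    request_state: str
    headers: dict        # lower-cased header name -> value
    body: bytes


def parse_message(raw: bytes) -> Message:
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("missing blank line after headers")
    start, *header_lines = head.decode("utf-8").split("\r\n")
    version, length, request_id, status, state = start.split(" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    n = int(headers.get("content-length", "0"))
    if len(rest) < n:
        raise ValueError("body shorter than Content-Length")
    return Message(version, int(length), int(request_id), int(status),
                   state, headers, rest[:n])


def channel_identifier(msg: Message) -> tuple:
    """Split 'Channel-Identifier: <id>@<resource-type>' into (id, resource-type)."""
    channel, _, resource = msg.headers["channel-identifier"].partition("@")
    return channel, resource


if __name__ == "__main__":
    raw = (b"html-speech/1.0 92 323340 200 NOTIFY\r\n"
           b"Channel-Identifier: 817@notification\r\n"
           b"Content-Length: 36\r\n"
           b"Content-Type: text/xml\r\n"
           b"\r\n"
           b'<?xml version="1.0"?>\n<foo>bar</foo>')
    msg = parse_message(raw)
    print(msg.request_state, channel_identifier(msg))  # NOTIFY ('817', 'notification')
```

With a single LF between the two body lines, the body is exactly 36 bytes, matching the Content-Length in the example.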
A couple of notes:
* If the [body] is detected as XML or JSON, it would be nice if the
client browser could automatically reflect the data as a DOM or
ECMAScript object. But I don't know much about that sort of technology,
so someone else would need to comment.
* The client would request notifications using the SDP-like setup
protocol that Robert is working on. Something like
'a=resource:notification'.
* The client browser would not interpret any headers in the
notification other than those required to parse the message (i.e.
'Content-Length', 'Content-Type', and 'Content-Encoding').
* The request-id, Channel-Identifier, and other headers would be
bundled up along with the body and handed to the webapp. It would be up
to the application to decide the meaning of such headers in the context
of the notification.
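The notes above suggest a simple division of labour: the browser uses only Content-Length, Content-Type, and Content-Encoding, optionally turns XML or JSON bodies into objects, and passes everything else to the webapp untouched. A minimal sketch follows, with Python's ElementTree and json standing in for a DOM and an ECMAScript object. `reflect_notification`, the shape of the returned bundle, and support for a "gzip" Content-Encoding are all assumptions.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Headers the browser itself interprets; everything else passes through.
PARSING_HEADERS = {"content-length", "content-type", "content-encoding"}


def reflect_notification(request_id: int, headers: dict, body: bytes) -> dict:
    """Bundle a notification for the webapp, decoding XML/JSON bodies.

    `headers` is assumed to be keyed by lower-cased header name.
    """
    encoding = headers.get("content-encoding", "identity").lower()
    if encoding == "gzip":  # assumed encoding token for this sketch
        body = gzip.decompress(body)
    elif encoding != "identity":
        raise ValueError(f"unsupported Content-Encoding: {encoding}")

    media_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if media_type in ("text/xml", "application/xml") or media_type.endswith("+xml"):
        data = ET.fromstring(body)  # stands in for a DOM document
    elif media_type == "application/json" or media_type.endswith("+json"):
        data = json.loads(body)     # stands in for an ECMAScript object
    else:
        data = body                 # opaque bytes; the application decides

    return {
        "request_id": request_id,
        # Channel-Identifier and any other headers are left for the app to interpret.
        "headers": {k: v for k, v in headers.items() if k not in PARSING_HEADERS},
        "data": data,
    }
```

Leaving unknown headers uninterpreted keeps the browser's role purely syntactic, so applications can give headers such as Channel-Identifier whatever meaning suits the notification.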
Received on Friday, 10 June 2011 18:48:06 UTC