- From: Robert Brown <Robert.Brown@microsoft.com>
- Date: Wed, 15 Jun 2011 23:09:51 +0000
- To: "Milan Young (Nuance)" <Milan.Young@nuance.com>, "Satish Sampath (Google)" <satish@google.com>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, "Michael Johnston (AT&T)" <johnston@research.att.com>, "Marc Schroeder (DFKI)" <marc.schroeder@dfki.de>, "Glen Shires (gshires@google.com)" <gshires@google.com>
- CC: HTML Speech XG <public-xg-htmlspeech@w3.org>, Michael Bodell <mbodell@microsoft.com>
- Message-ID: <113BCF28740AF44989BE7D3F84AE18DD1B137159@TK5EX14MBXC118.redmond.corp.microsoft.>
I took an action item last week to flesh out the media transport and media control messages, and whatever other session establishment stuff we need.
Let me know what you think of this...
To quote MRCPv2-24 as a starting point:
   ... MRCPv2 is not a
   "stand-alone" protocol - it relies on other protocols, such as
   Session Initiation Protocol (SIP) to rendezvous MRCPv2 clients and
   servers and manage sessions between them, and the Session Description
   Protocol (SDP) to describe, discover and exchange capabilities.  It
   also depends on SIP and SDP to establish the media sessions and
   associated parameters between the media source or sink and the media
   server.  Once this is done, the MRCPv2 protocol exchange operates
   over the control session established above, allowing the client to
   control the media processing resources on the speech resource server.
So we need:
A.      Client & Server rendezvous
B.      Description, discovery and interchange of capabilities
C.      Media session establishment
****
(A) Client & Server rendezvous
Rendezvous is easy, and already described in the draft I sent out a couple of weeks back http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html.  The client uses the standard WebSocket HTTP bootstrap, with the header "Sec-WebSocket-Protocol: html-speech".  The server accepts the request by sending "101 Switching Protocols" and from there on out it's a WebSockets session using the protocol we're defining here.
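As a rough illustration of that bootstrap (not part of the draft), the client side could look something like the sketch below. It assumes the handshake as later published in RFC 6455 (Sec-WebSocket-Key/Accept, version 13), which may differ from the WebSocket draft current at the time of this mail. The host and path are hypothetical placeholders.

```python
import base64
import hashlib
import os

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def build_handshake(host, path, key=None):
    """Return (request_text, key) for a WebSocket upgrade asking for html-speech."""
    key = key or base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",  # assumption: RFC 6455 handshake
        "Sec-WebSocket-Protocol: html-speech",
    ]
    return "\r\n".join(lines) + "\r\n\r\n", key


def expected_accept(key):
    """Compute the Sec-WebSocket-Accept value the server must echo for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def server_accepted(response, key):
    """True if the response is a 101 that accepts the html-speech subprotocol."""
    head = response.split("\r\n\r\n", 1)[0].split("\r\n")
    if not head[0].startswith("HTTP/1.1 101"):
        return False
    headers = {}
    for line in head[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return (headers.get("sec-websocket-accept") == expected_accept(key)
            and headers.get("sec-websocket-protocol") == "html-speech")
```

Once server_accepted() returns True, everything that follows on the connection is the html-speech protocol itself.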
****
(B) Description, discovery and interchange of capabilities
Description, discovery and interchange of capabilities are also easy.  I proposed in the draft that we use GET-PARAMS to do this.  This needs further elaboration:
1.       GET-PARAMS is an existing capabilities discovery mechanism.   Furthermore:
a.       Since in the common case applications will already know the service's capabilities, discovery won't be needed in most sessions.  So it's perfectly fine for this to be done after session establishment, if the client cares to do it at all.  GET-PARAMS seems fine for that.
b.      I don't believe there's any use case where the service needs to discover the capabilities of the client.  Besides, it's probably undesirable from a privacy point of view.  So GET-PARAMS, which is C->S, is fine.
2.       We should add a few headers for things that clients would find useful to know, for example (not necessarily exactly these):
a.       "recognize-media:" lists the MIME types for the media encodings the recognizer will accept.
b.      "recognize-lang:" lists the IETF language tags the recognizer will recognize.
c.       "speak-media:" lists the MIME types for the media encodings the synthesizer can produce.
d.      "speak-lang:" lists the languages the synthesizer can speak.
e.      The recognize-* headers return blank if there's no recognition service.  Similarly with the speak-* headers if there's no synthesizer.
3.       The client can just tear down the session if it discovers that the service doesn't have the necessary capabilities.
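To make the client-side check concrete, here is a minimal sketch of reading those headers out of a GET-PARAMS response. The status line and the space-delimited value format are assumptions made for illustration; the mail doesn't define either. Function names are hypothetical.

```python
def parse_capabilities(message):
    """Map each header name to its list of values (empty list if blank)."""
    caps = {}
    for line in message.splitlines()[1:]:  # skip the (assumed) status line
        name, sep, value = line.partition(":")
        if sep:
            # Assumption: values are space-delimited; compared case-insensitively.
            caps[name.strip().lower()] = [v.lower() for v in value.split()]
    return caps


def can_recognize(caps, media_type, lang):
    """True if the service accepts this encoding and recognizes this language."""
    return (media_type.lower() in caps.get("recognize-media", [])
            and lang.lower() in caps.get("recognize-lang", []))


# Hypothetical response from a service with a recognizer but no synthesizer.
SAMPLE_RESPONSE = (
    "html-speech/1.0 200 COMPLETE\r\n"
    "recognize-media: audio/speex audio/basic\r\n"
    "recognize-lang: en-US de-DE\r\n"
    "speak-media:\r\n"
    "speak-lang:\r\n"
)
```

A client that gets False back from can_recognize() for the encoding and language it needs would simply close the WebSocket.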
****
(C) Media session establishment
Media session establishment is also fairly straightforward, but not as obvious.  With MRCP, media is out-of-band, and is negotiated along with the MRCP session negotiation in the SIP INVITE-OK-ACK sequence.  However, with our protocol, everything happens in-band in the WebSockets session.  So I think that leaves us with these options, of which I propose we adopt #3:
1.       We *could* require all media to be negotiated as part of the WebSockets session establishment (i.e. the HTTP bootstrap).  For example, we could put something in the body of the initial HTTP GET request.  While I don't think this is strictly illegal, it's certainly irregular.
2.       We *could* specify a set of messages that the client and service interchange at the very beginning of the session prior to doing any recognition or synthesis, which establish the media channels for the duration of the session.  That would be fine if we were modeling a telephone.  It would work, but it seems heavy-handed to me.
3.       The most effective approach, IMHO, is for the client or server to just create an inline media channel as needed.  For example, the client could create the audio input session when it does its first RECOGNIZE request.  Ditto, the service could create an audio output session when it starts processing its SPEAK request.
So, how would this work?
In the draft I sent a couple of weeks ago, I suggested a couple of control methods:
audio-control-method  =  "START-AUDIO" ; indicates the codec, and perhaps other params, for an audio channel
                      |  "JUMP-AUDIO"  ; for example, to jump over a period of silence during SR,
                                       ; or shuttle control in TTS
(Maybe these should be "*-MEDIA" instead of "*-AUDIO"?)
The START-AUDIO method MUST be sent prior to the first packet in an audio stream.  It essentially describes the encoding being used, and the channel number of the stream.  It also includes a "source-time" header that indicates the local client time for the first byte of the audio stream on that channel.  This can be used as a reference point for other methods and events.  For example, a DEFINE-GRAMMAR would specify the time at which the grammar becomes active, and the recognizer should be able to figure out where in the audio stream this is, which is important for continuous recognition scenarios.  (The timing info is also useful for synchronizing multiple input streams.)
For example:
C->S: html-speech/1.0 START-AUDIO
      audio-channel: 1
      audio-codec: audio/speex
      source-time: 12753248231 (source's local time at the start of the first packet)
C->S: binary audio packet #1
       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  message type |           channel no          |   reserved    |
      |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+
      |                       encoded audio data                      |
      |                              ...                              |
      |                              ...                              |
      |                              ...                              |
      +---------------------------------------------------------------+
C->S: binary audio packet #2
       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  message type |           channel no          |   reserved    |
      |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+
      |                       encoded audio data                      |
      |                              ...                              |
      |                              ...                              |
      |                              ...                              |
      +---------------------------------------------------------------+
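A minimal sketch of producing and parsing these messages follows. Several things here are assumptions, since the diagram doesn't pin them down: the bit rows are read least-significant bit first (so message type = 1 and channel = 1), the 16-bit channel number is little-endian, and the text message uses CRLF line endings as in MRCP. The function names are hypothetical.

```python
import struct

# Assumption: "1 0 0 0 0 0 0 0" in the diagram means the value 1 (LSB drawn first).
AUDIO_MESSAGE_TYPE = 0x01

# 8-bit message type, 16-bit channel number, 8-bit reserved field.
# Assumption: little-endian byte order for the channel number.
HEADER = struct.Struct("<BHB")


def start_audio(channel, codec, source_time):
    """Build the START-AUDIO text message that must precede the first packet."""
    return (
        "html-speech/1.0 START-AUDIO\r\n"
        f"audio-channel: {channel}\r\n"
        f"audio-codec: {codec}\r\n"
        f"source-time: {source_time}\r\n"
        "\r\n"
    )


def pack_audio(channel, encoded):
    """Prefix a chunk of encoded audio with the 4-byte packet header."""
    return HEADER.pack(AUDIO_MESSAGE_TYPE, channel, 0) + encoded


def unpack_audio(packet):
    """Return (channel, encoded_audio) from a binary audio packet."""
    msg_type, channel, _reserved = HEADER.unpack_from(packet)
    if msg_type != AUDIO_MESSAGE_TYPE:
        raise ValueError(f"not an audio packet: type {msg_type}")
    return channel, packet[HEADER.size:]
```

The sender would emit start_audio(1, "audio/speex", 12753248231) as a text message, then one pack_audio(1, chunk) binary message per encoded chunk.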
For the RECOGNIZE method, we would add a header called "audio-in-channels:" (or something like that) which lists the space-delimited media channels used for input (typically only one channel, but potentially more).  It could have a default value of "1".  The corresponding stream(s) can be started either before or after the RECOGNIZE message is sent.  RECOGNIZE could even be called many times during an ongoing stream.
Similarly, for the SPEAK method, we would add another header called "audio-out-channel:" where the client specifies the channel it wants the output on.  It would have a default value of "1".
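A receiver could resolve these headers and their defaults along these lines. This is only a sketch: it assumes the request headers have already been parsed into a dict keyed by lower-case header name, and the function names are hypothetical.

```python
def input_channels(headers):
    """Channels a RECOGNIZE request listens on; defaults to channel 1."""
    return [int(tok) for tok in headers.get("audio-in-channels", "1").split()]


def output_channel(headers):
    """Channel the client wants synthesized audio on; defaults to channel 1."""
    return int(headers.get("audio-out-channel", "1"))
```

For instance, a request carrying "audio-in-channels: 1 3" would resolve to channels 1 and 3, while a request with no such header falls back to channel 1.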
****
What about SDP???
SDP is not used in the HTML Speech protocol. It's just not needed. The initial WebSockets handshake both describes and establishes the session.  Further details are determined in-session using the simple mechanisms described above.  Furthermore, SDP is concerned with lots of lower level things that are either completely irrelevant or already determined by virtue of the fact that we're running on WebSockets.  Perhaps if WebSockets used SDP as part of session establishment, there would be a case to leverage it.  But that isn't the case, and as things actually stand we basically have no problems left for SDP to solve.
Received on Wednesday, 15 June 2011 23:10:22 UTC