RE: Session initiation & media negotiation

Thus far, I've been assuming that GET-PARAMS and SET-PARAMS are targeted
at a particular resource (i.e., synthesizer or recognizer).  For example:

   C->S:  MRCP/2.0 ... SET-PARAMS 543256
          Channel-Identifier:32AECB23433802@speechsynth
          Voice-gender:female
          Voice-variant:3

 

Targeting this at the service may need further thought.

 

Thanks

 

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Wednesday, June 15, 2011 4:10 PM
To: Young, Milan; Satish Sampath (Google); Glen Shires
(gshires@google.com); Patrick Ehlen (AT&T); Dan Burnett (Voxeo); Michael
Johnston (AT&T); Marc Schroeder (DFKI); Glen Shires (gshires@google.com)
Cc: HTML Speech XG; Michael Bodell
Subject: Session initiation & media negotiation

 

I took an action item last week to flesh out the media transport and
media control messages, and whatever other session establishment stuff
we need.

 

Let me know what you think of this...

 

To quote MRCPv2-24 as a starting point:

 

   ... MRCPv2 is not a
   "stand-alone" protocol - it relies on other protocols, such as
   Session Initiation Protocol (SIP) to rendezvous MRCPv2 clients and
   servers and manage sessions between them, and the Session Description
   Protocol (SDP) to describe, discover and exchange capabilities.  It
   also depends on SIP and SDP to establish the media sessions and
   associated parameters between the media source or sink and the media
   server.  Once this is done, the MRCPv2 protocol exchange operates
   over the control session established above, allowing the client to
   control the media processing resources on the speech resource server.

 

So we need:

A.      Client & Server rendezvous

B.      Description, discovery and interchange of capabilities

C.      Media session establishment

 

****

(A) Client & Server rendezvous

 

Rendezvous is easy, and already described in the draft I sent out a
couple of weeks back
(http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html).
The client uses the standard WebSocket HTTP bootstrap, with the header
"Sec-WebSocket-Protocol: html-speech".  The server accepts the request
by sending "101 Switching Protocols", and from there on out it's a
WebSockets session using the protocol we're defining here.
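
For concreteness, the bootstrap would look something like the sketch
below (the resource path, host, and the key/accept/version header values
are just illustrative placeholders, not part of the proposal):

C->S: GET /speechservice HTTP/1.1
      Host: example.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
      Sec-WebSocket-Version: 13
      Sec-WebSocket-Protocol: html-speech

S->C: HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
      Sec-WebSocket-Protocol: html-speech

After that exchange, every frame on the connection carries the
html-speech protocol.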

 

****

(B) Description, discovery and interchange of capabilities

 

Description, discovery and interchange of capabilities are also easy.  I
proposed in the draft that we use GET-PARAMS to do this.  This needs
further elaboration:

1.       GET-PARAMS is an existing capabilities discovery mechanism.
Furthermore:

a.       Since the common case will be applications having
foreknowledge of service capabilities, discovery of capabilities won't
be needed in most sessions.  So it's perfectly fine for this to be done
after session establishment, if the client cares to do it at all.  So
GET-PARAMS seems fine.

b.      I don't believe there's any use case where the service needs to
discover the capabilities of the client.  Besides, it's probably
undesirable from a privacy point of view.  So GET-PARAMS, which is
C->S, is fine.

2.       We should add a few headers for things that clients would find
useful to know, for example (though not necessarily these exact names;
see the sample exchange sketched after this list):

a.       "recognize-media:" lists the mime types for the media encodings
the recognizer will accept.  

b.      "recognize-lang:" lists the IETF language tags the recognizer
will recognize. 

c.       "speak-media:" lists the mime types for the media encodings
the synthesizer can produce.

d.      "speak-lang:" lists the languages the synthesizer can speak.

e.      The recognize-* headers return blank if there's no recognition
service.  Similarly with the speak-* headers if there's no synthesizer.

3.       The client can just tear down the session if it discovers that
the service doesn't have the necessary capabilities.
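
To make that concrete, a capabilities query might look something like
the following sketch (the request-id, the response status line, and the
specific mime types and language tags are all placeholders; as with
MRCP's GET-PARAMS, the client lists the headers it cares about with
empty values, and the response fills in the current values):

C->S: html-speech/1.0 GET-PARAMS 10001
      recognize-media:
      recognize-lang:
      speak-media:
      speak-lang:

S->C: html-speech/1.0 10001 200 COMPLETE
      recognize-media: audio/x-speex, audio/amr, audio/basic
      recognize-lang: en-US, en-GB, fr-FR, de-DE
      speak-media: audio/x-speex, audio/basic
      speak-lang: en-US, fr-FR

A client that needed, say, de-DE synthesis would see that it isn't
offered and could simply tear down the session, per point 3 above.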

 

****

(C) Media session establishment

 

Media session establishment is also fairly straightforward, but not as
obvious.  With MRCP, media is out-of-band, and is negotiated along with
the MRCP session negotiation in the SIP INVITE-OK-ACK sequence.
However, with our protocol, everything happens in-band in the WebSockets
session.  So I think that leaves us with these options, of which I
propose we adopt #3:

1.       We *could* require all media to be negotiated as part of the
WebSockets session establishment (i.e. the HTTP bootstrap).  For
example, we could put something in the body of the initial HTTP GET
request.  While I don't think this is strictly illegal, it's certainly
irregular.

2.       We *could* specify a set of messages that the client and service
interchange at the very beginning of the session prior to doing any
recognition or synthesis, which establish the media channels for the
duration of the session.  That would be fine if we were modeling a
telephone.  It would work, but it seems heavy-handed to me.

3.       The most effective approach, IMHO, is for the client or server
to just create an inline media channel as needed.  For example, the
client could create the audio input session when it does its first
RECOGNIZE request.  Ditto, the service could create an audio output
session when it starts processing its SPEAK request.

 

So, how would this work?

 

In the draft I sent a couple of weeks ago, I suggested a couple of
control methods:

 

audio-control-method  =  "START-AUDIO" ; indicates the codec, and perhaps
                                       ; other params, for an audio channel
                      |  "JUMP-AUDIO"  ; for example, to jump over a period
                                       ; of silence during SR,
                                       ; or shuttle control in TTS

(Maybe these should be "*-MEDIA" instead of "*-AUDIO"?)

 

The START-AUDIO method MUST be sent prior to the first packet in an
audio stream.  It essentially describes the encoding being used, and the
channel number of the stream.  It also includes a "source-time" header
that indicates the local client time for the first byte of the audio
stream on that channel.  This can be used as a reference point for other
methods and events.  For example, a DEFINE-GRAMMAR would specify the
time at which the grammar becomes active, and the recognizer should be
able to figure out where in the audio stream this is, which is important
for continuous recognition scenarios.  (The timing info is also useful
for synchronizing multiple input streams.)

 

For example:

 

C->S: html-speech/1.0 START-AUDIO
      audio-channel: 1
      audio-codec: audio/speex
      source-time: 12753248231 (source's local time at the start of the first packet)

 

C->S: binary audio packet #1

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  message type |           channel no          |   reserved    |
      |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+
      |                       encoded audio data                      |
      |                              ...                              |
      |                              ...                              |
      |                              ...                              |
      +---------------------------------------------------------------+

 

C->S: binary audio packet #2

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  message type |           channel no          |   reserved    |
      |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+
      |                       encoded audio data                      |
      |                              ...                              |
      |                              ...                              |
      |                              ...                              |
      +---------------------------------------------------------------+
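
To illustrate the timing point above: a DEFINE-GRAMMAR issued while the
stream is running could pin the grammar's activation to the stream's
clock.  The sketch below is purely hypothetical (in particular the
"active-at:" header and the body details aren't proposed anywhere yet);
its value is on the same clock as the channel's source-time, so the
recognizer can locate the exact point in the audio where the grammar
becomes active:

C->S: html-speech/1.0 DEFINE-GRAMMAR
      active-at: 12753252231 (hypothetical; same clock as source-time above)
      Content-Type: application/srgs+xml
      Content-ID: <pizza-order@example.com>

      <grammar ...> ... </grammar>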

 

For the RECOGNIZE method, we would add a header called
"audio-in-channels:" (or something like that) which lists the
space-delimited media channels used for input (typically only one
channel, but potentially more).  It could have a default value of "1".
The corresponding stream(s) can be started either before or after the
RECOGNIZE message is sent.  RECOGNIZE could even be called many times
during an ongoing stream. 
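
For instance, a recognition against the audio already flowing on
channel 1 might look like the sketch below (the request-id, the
text/uri-list body convention, and the grammar URI are illustrative
placeholders borrowed from MRCP):

C->S: html-speech/1.0 RECOGNIZE 8001
      audio-in-channels: 1
      Content-Type: text/uri-list

      http://example.com/grammars/pizza-order.grxml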

 

Similarly, for the SYNTHESIZE method, we would add another header called
"audio-out-channel:" where the client specifies the channel it wants the
output on.  It would have a default value of "1".  
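
And correspondingly for synthesis (again just a sketch; everything other
than the audio-out-channel header is illustrative):

C->S: html-speech/1.0 SYNTHESIZE 8002
      audio-out-channel: 1
      Content-Type: application/ssml+xml

      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
             xml:lang="en-US">Your pizza is on its way.</speak>

The service would then presumably send its own START-AUDIO for that
channel in the S->C direction and stream the rendered audio back as
binary packets, mirroring the client-to-server case above.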

 

****

What about SDP???

 

SDP is not used in the HTML Speech protocol. It's just not needed. The
initial WebSockets handshake both describes and establishes the session.
Further details are determined in-session using the simple mechanisms
described above.  Furthermore, SDP is concerned with lots of lower level
things that are either completely irrelevant or already determined by
virtue of the fact that we're running on WebSockets.  Perhaps if
WebSockets used SDP as part of session establishment, there would be a
case to leverage it.  But that isn't the case, and as things actually
stand we basically have no problems left for SDP to solve.

 

Received on Thursday, 16 June 2011 16:04:39 UTC