Re: Session initiation & media negotiation from Marc Schroeder on 2011-06-16 (public-xg-htmlspeech@w3.org from June 2011)

From: Marc Schroeder <marc.schroeder@dfki.de>
Date: Thu, 16 Jun 2011 11:17:45 +0200
To: Robert Brown <Robert.Brown@microsoft.com>
CC: "Milan Young (Nuance)" <Milan.Young@nuance.com>, "Satish Sampath (Google)" <satish@google.com>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, "Michael Johnston (AT&T)" <johnston@research.att.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>, Michael Bodell <mbodell@microsoft.com>
Message-ID: <4DF9CA39.2010805@dfki.de>
Robert, all,

just a short comment on audio.

In the case of TTS, I don't quite see how the server could send the 
local time of the first audio sample, given the fact that it cannot 
predict when the client will start playing it. So your comment in this 
respect seems to apply to ASR only.

About the description of the audio format: I wonder if the following is 
sufficient:

 > audio-codec: audio/speex

and
 > |  message type |           channel no          |   reserved    |
 > |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|

Some codecs might need a more detailed description of the audio format, 
e.g. raw PCM: sample rate, bits per sample, signed/unsigned... you get 
the point. I guess we want to allow for the possibility of a codec not 
including this information as part of the binary audio data, so the 
header should include an optional slot such as "audio-format:". Its 
content would probably have to be codec-specific.
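
To make the suggestion concrete, here is a minimal sketch of a START-AUDIO message that carries an optional "audio-format:" header. Only the header name comes from the text above. The key=value syntax inside it, the codec label "audio/x-raw-pcm", the CRLF framing and the helper name build_start_audio are assumptions made for illustration and are not taken from any draft.

```python
from typing import Dict, Optional


def build_start_audio(channel: int, codec: str, source_time: int,
                      audio_format: Optional[Dict[str, str]] = None) -> str:
    """Serialize a START-AUDIO message (hypothetical wire syntax)."""
    lines = [
        "html-speech/1.0 START-AUDIO",
        f"audio-channel: {channel}",
        f"audio-codec: {codec}",
        f"source-time: {source_time}",
    ]
    if audio_format:
        # Assumed syntax: semicolon-separated key=value pairs whose meaning
        # is specific to the codec named in audio-codec.
        params = ";".join(f"{k}={v}" for k, v in audio_format.items())
        lines.append(f"audio-format: {params}")
    # Assumed framing: CRLF line endings, blank line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


# Speex describes itself in-band, so the optional header is omitted.
speex_msg = build_start_audio(1, "audio/speex", 12753248231)

# Raw PCM carries no self-description, so the header spells it out.
pcm_msg = build_start_audio(
    2, "audio/x-raw-pcm", 12753248231,
    {"rate": "16000", "bits": "16", "signed": "true"},
)
```

The same codec-specific parameters could be reused in discovery responses and in TTS requests, which is why a single header syntax seems preferable to one per message type.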

Similarly, during the discovery phase, it would then be insufficient 
to merely provide the MIME type of the audio accepted or produced. 
And of course a TTS request should be able to ask for, e.g., a 
specific sampling rate.

Best,
Marc



On 16.06.11 01:09, Robert Brown wrote:
> I took an action item last week to flesh out the media transport and
> media control messages, and whatever other session establishment stuff
> we need.
>
> Let me know what you think of this…
>
> To quote MRCPv2-24 as a starting point:
>
> ... MRCPv2 is not a
> "stand-alone" protocol - it relies on other protocols, such as
> Session Initiation Protocol (SIP) to rendezvous MRCPv2 clients and
> servers and manage sessions between them, and the Session Description
> Protocol (SDP) to describe, discover and exchange capabilities. It
> also depends on SIP and SDP to establish the media sessions and
> associated parameters between the media source or sink and the media
> server. Once this is done, the MRCPv2 protocol exchange operates
> over the control session established above, allowing the client to
> control the media processing resources on the speech resource server.
>
> So we need:
>
> A. Client & Server rendezvous
>
> B. Description, discovery and interchange of capabilities
>
> C. Media session establishment
>
> *(A) Client & Server rendezvous*
>
> Rendezvous is easy, and already described in the draft I sent out a
> couple of weeks back
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html.
> The client uses the standard WebSocket HTTP bootstrap, with the header
> “Sec-WebSocket-Protocol: html-speech”. The server accepts the request by
> sending “101 Switching Protocols” and from there on out it’s a
> WebSockets session using the protocol we’re defining here.
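
A minimal sketch of the rendezvous step described above. The handshake was still an IETF draft at the time of this thread, so the sketch assumes the form later published as RFC 6455 (Sec-WebSocket-Key, Sec-WebSocket-Accept, version 13). The host name and path are hypothetical.

```python
import base64
import hashlib
import os

# GUID fixed by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def new_key() -> str:
    """Random 16-byte nonce, base64-encoded, for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def build_handshake(host: str, path: str, key: str) -> str:
    """Client side of the HTTP bootstrap, asking for the html-speech subprotocol."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "Sec-WebSocket-Protocol: html-speech\r\n"
        "\r\n"
    )


def expected_accept(key: str) -> str:
    """Value the server must echo in Sec-WebSocket-Accept for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def server_accepted(response: str, key: str) -> bool:
    """True if the response is a 101 that selects html-speech for this key."""
    head = response.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    status = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return (
        len(status) >= 2
        and status[1] == "101"
        and headers.get("sec-websocket-protocol") == "html-speech"
        and headers.get("sec-websocket-accept") == expected_accept(key)
    )


# Hypothetical service address, for illustration only.
request = build_handshake("speech.example.org", "/speech", new_key())
```

Everything after a successful check would be frames of the protocol under discussion, so no further session signalling is needed at the HTTP level.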
>
> *(B) Description, discovery and interchange of capabilities*
>
> Description, discovery and interchange of capabilities are also easy. I
> proposed in the draft that we use GET-PARAMS to do this. This needs
> further elaboration:
>
> 1. GET-PARAMS is an existing capabilities discovery mechanism. Furthermore:
>
>    a. Since the common case will be applications having foreknowledge of
>    service capabilities, discovery of capabilities won’t be needed in
>    most sessions. So it’s perfectly fine for this to be done after session
>    establishment, if the client cares to do it at all. So GET-PARAMS seems
>    fine.
>
>    b. I don’t believe there’s any use case where the service needs to
>    discover the capabilities of the client. Besides, it’s probably
>    undesirable from a privacy point of view. So GET-PARAMS, which is C->S,
>    is fine.
>
> 2. We should add a few headers for things that clients would find useful
> to know, for example (but not necessarily exactly these):
>
>    a. “recognize-media:” lists the MIME types for the media encodings the
>    recognizer will accept.
>
>    b. “recognize-lang:” lists the IETF language tags the recognizer will
>    recognize.
>
>    c. “speak-media:” lists the MIME types for the media encodings the
>    synthesizer can produce.
>
>    d. “speak-lang:” lists the languages the synthesizer can speak.
>
>    e. The recognize-* headers return blank if there’s no recognition
>    service. Similarly with the speak-* headers if there’s no synthesizer.
>
> 3. The client can just tear down the session if it discovers that the
> service doesn’t have the necessary capabilities.
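
The discovery logic above could be sketched on the client side as follows. The four header names are the ones proposed in the list. The response start line, the space-separated value syntax and the function names are assumptions for illustration only.

```python
from typing import Dict, List


def parse_headers(message: str) -> Dict[str, List[str]]:
    """Parse 'name: v1 v2 ...' header lines into lists of values."""
    headers: Dict[str, List[str]] = {}
    for line in message.splitlines()[1:]:  # skip the start line
        if not line.strip():
            break  # blank line ends the header block
        name, _, value = line.partition(":")
        # A blank value yields an empty list, i.e. "no such service".
        headers[name.strip().lower()] = value.split()
    return headers


def can_recognize(caps: Dict[str, List[str]], media: str, lang: str) -> bool:
    return (media in caps.get("recognize-media", [])
            and lang in caps.get("recognize-lang", []))


def can_speak(caps: Dict[str, List[str]], media: str, lang: str) -> bool:
    return (media in caps.get("speak-media", [])
            and lang in caps.get("speak-lang", []))


# Hypothetical GET-PARAMS response from a recognition-only service.
response = (
    "html-speech/1.0 200 COMPLETE\r\n"
    "recognize-media: audio/speex audio/basic\r\n"
    "recognize-lang: en-US de-DE\r\n"
    "speak-media:\r\n"
    "speak-lang:\r\n"
    "\r\n"
)

caps = parse_headers(response)
if not can_recognize(caps, "audio/speex", "de-DE"):
    # Per item 3, the client would simply close the WebSocket here.
    raise SystemExit("service lacks the required capabilities")
```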
>
> *(C) Media session establishment*
>
> Media session establishment is also fairly straightforward, but not as
> obvious. With MRCP, media is out-of-band, and is negotiated along with
> the MRCP session negotiation in the SIP INVITE-OK-ACK sequence. However,
> with our protocol, everything happens in-band in the WebSockets session.
> So I think that leaves us with these options, of which I propose we
> adopt #3:
>
> 1. We **could** require all media to be negotiated as part of the
> WebSockets session establishment (i.e. the HTTP bootstrap). For example,
> we could put something in the body of the initial HTTP GET request.
> While I don’t think this is strictly illegal, it’s certainly irregular.
>
> 2. We **could** specify a set of messages that the client and service
> interchange at the very beginning of the session prior to doing any
> recognition or synthesis, which establish the media channels for the
> duration of the session. That would be fine if we were modeling a
> telephone. It would work, but it seems heavy-handed to me.
>
> 3. The most effective approach, IMHO, is for the client or server to just
> create an inline media channel as needed. For example, the client could
> create the audio input session when it does its first RECOGNIZE request.
> Ditto, the service could create an audio output session when it starts
> processing its SPEAK request.
>
> So, how would this work?
>
> In the draft I sent a couple of weeks ago, I suggested a couple of
> control methods:
>
> audio-control-method = "START-AUDIO" ; indicates the codec, and perhaps
>                                      ; other params, for an audio channel
>                      | "JUMP-AUDIO"  ; for example, to jump over a period
>                                      ; of silence during SR, or shuttle
>                                      ; control in TTS
>
> (Maybe these should be “*-MEDIA” instead of “*-AUDIO”?)
>
> The START-AUDIO method MUST be sent prior to the first packet in an
> audio stream. It essentially describes the encoding being used, and the
> channel number of the stream. It also includes a “source-time” header
> that indicates the local client time for the first byte of the audio
> stream on that channel. This can be used as reference point for other
> methods and events. For example, a DEFINE-GRAMMAR would specify the time
> at which the grammar becomes active, and the recognizer should be able
> to figure out where in the audio stream this is, which is important for
> continuous recognition scenarios. (The timing info is also useful for
> synchronizing multiple input streams.)
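
As a worked example of using source-time as a reference point: a recognizer could map the time at which a grammar becomes active to a position in the audio stream roughly as below. The unit of source-time is not stated in the proposal; this sketch assumes milliseconds and a known, constant sample rate, and the function name is hypothetical.

```python
def sample_offset(event_time_ms: int, source_time_ms: int,
                  sample_rate_hz: int) -> int:
    """Index of the audio sample that corresponds to event_time_ms.

    source_time_ms is the START-AUDIO source-time of the stream, i.e. the
    client's local time for the first sample (assumed to be milliseconds).
    """
    elapsed_ms = event_time_ms - source_time_ms
    if elapsed_ms < 0:
        raise ValueError("event precedes the start of the stream")
    return elapsed_ms * sample_rate_hz // 1000


# Stream started at source-time 12753248231; the grammar becomes active
# 2.5 seconds later, on a 16 kHz stream.
offset = sample_offset(12753250731, 12753248231, 16000)
print(offset)  # 40000
```

The same arithmetic, applied per channel, is what would let several input streams be aligned against one another.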
>
> For example:
>
> C->S: html-speech/1.0 START-AUDIO
>       audio-channel: 1
>       audio-codec: audio/speex
>       source-time: 12753248231 (source’s local time at the start of the first packet)
>
> C->S: binary audio packet #1
>
>  0                   1                   2                   3
>  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |  message type |           channel no          |   reserved    |
> |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
> +---------------+-------------------------------+---------------+
> |                       encoded audio data                      |
> |                              ...                              |
> |                              ...                              |
> |                              ...                              |
> +---------------------------------------------------------------+
>
> C->S: binary audio packet #2
>
>  0                   1                   2                   3
>  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> |  message type |           channel no          |   reserved    |
> |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
> +---------------+-------------------------------+---------------+
> |                       encoded audio data                      |
> |                              ...                              |
> |                              ...                              |
> |                              ...                              |
> +---------------------------------------------------------------+
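
A minimal sketch of packing and unpacking the 4-byte packet header drawn above. The bit order is an interpretation, not something the proposal states: since the example uses "audio-channel: 1" and the channel field is drawn as a 1 followed by zeros, the sketch assumes each field is printed least-significant bit first, which gives message type 1, channel 1 and a little-endian channel number. The function names are hypothetical.

```python
import struct
from typing import Tuple

# Assumed layout: 8-bit message type, 16-bit channel number (little-endian,
# see the note on bit order above), 8 reserved bits; 4 bytes, no padding.
HEADER = struct.Struct("<BHB")
AUDIO_MESSAGE_TYPE = 1  # assumed value of the "message type" field


def pack_audio_packet(channel: int, payload: bytes) -> bytes:
    """Prefix encoded audio data with the binary packet header."""
    return HEADER.pack(AUDIO_MESSAGE_TYPE, channel, 0) + payload


def unpack_audio_packet(packet: bytes) -> Tuple[int, bytes]:
    """Return (channel, encoded audio data) for one binary audio packet."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than header")
    message_type, channel, _reserved = HEADER.unpack_from(packet)
    if message_type != AUDIO_MESSAGE_TYPE:
        raise ValueError(f"not an audio packet: type {message_type}")
    return channel, packet[HEADER.size:]


packet = pack_audio_packet(1, b"\x00\x11\x22")
channel, audio = unpack_audio_packet(packet)
```

Because the header carries only the channel number, the codec and any format details have to come from the START-AUDIO message that opened that channel.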
>
> For the RECOGNIZE method, we would add a header called
> “audio-in-channels:” (or something like that) which lists the
> space-delimited media channels used for input (typically only one
> channel, but potentially more). It could have a default value of “1”.
> The corresponding stream(s) can be started either before or after the
> RECOGNIZE message is sent. RECOGNIZE could even be called many times
> during an ongoing stream.
>
> Similarly, for the SYNTHESIZE method, we would add another header called
> “audio-out-channel:” where the client specifies the channel it wants the
> output on. It would have a default value of “1”.
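
The two channel headers and their defaults could be read on the receiving side roughly like this. The header names and the default of "1" come from the text above; the dictionary representation of parsed headers and the function names are assumptions.

```python
from typing import Dict, List


def input_channels(headers: Dict[str, str]) -> List[int]:
    """Channels a RECOGNIZE request listens on (space-delimited, default 1)."""
    value = headers.get("audio-in-channels", "1")
    return [int(token) for token in value.split()]


def output_channel(headers: Dict[str, str]) -> int:
    """Channel on which the client wants synthesized audio (default 1)."""
    return int(headers.get("audio-out-channel", "1"))


# A RECOGNIZE request that fuses two microphone streams.
print(input_channels({"audio-in-channels": "1 3"}))  # [1, 3]
# A request that omits the header falls back to channel 1.
print(input_channels({}))                             # [1]
```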
>
> *What about SDP???*
>
> SDP is not used in the HTML Speech protocol. It’s just not needed. The
> initial WebSockets handshake both describes and establishes the session.
> Further details are determined in-session using the simple mechanisms
> described above. Furthermore, SDP is concerned with lots of lower level
> things that are either completely irrelevant or already determined by
> virtue of the fact that we’re running on WebSockets. Perhaps if
> WebSockets used SDP as part of session establishment, there would be a
> case to leverage it. But that isn’t the case, and as things actually
> stand we basically have no problems left for SDP to solve.
>

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
Received on Thursday, 16 June 2011 09:18:16 UTC