Re: Session initiation & media negotiation

Could we use values in audio-codec: (or whatever we call it) that look
like the media-type values in EMMA, which include the codec and other
info such as the sample rate?


Here is an example where the media type for the ETSI ES 202 212 audio codec for Distributed Speech Recognition (DSR) is applied to the emma:interpretation element. The example also specifies an optional sampling rate of 8 kHz and maxptime of 40 milliseconds.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="intp1"
        emma:signal="http://example.com/signals/signal.dsr"
        emma:media-type="audio/dsr-es202212; rate:8000; maxptime:40"

        emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03152003</date>
  </emma:interpretation>
</emma:emma>
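
If values like that were adopted for audio-codec:, a client or service
would need to split them into the base media type and its parameters.
Below is a minimal TypeScript sketch of such a split; the helper name is
made up, and it simply assumes the semicolon-separated parameter syntax
shown above (accepting either the "name:value" form from the EMMA example
or the standard MIME "name=value" form).

// Hypothetical helper: split an EMMA-style media-type value such as
// "audio/dsr-es202212; rate:8000; maxptime:40" into type and parameters.
interface MediaTypeInfo {
  mimeType: string;               // e.g. "audio/dsr-es202212"
  params: Record<string, string>; // e.g. { rate: "8000", maxptime: "40" }
}

function parseMediaType(value: string): MediaTypeInfo {
  const [mimeType, ...rest] = value.split(";").map(s => s.trim());
  const params: Record<string, string> = {};
  for (const part of rest) {
    // The EMMA example writes "name:value"; plain MIME parameters use "name=value".
    const sep = part.includes(":") ? ":" : "=";
    const [name, val] = part.split(sep).map(s => s.trim());
    if (name) params[name] = val ?? "";
  }
  return { mimeType, params };
}

// parseMediaType("audio/dsr-es202212; rate:8000; maxptime:40")
//   => { mimeType: "audio/dsr-es202212", params: { rate: "8000", maxptime: "40" } }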





Robert, all,

just a short comment on audio.

In the case of TTS, I don't quite see how the server could send the
local time of the first audio sample, given the fact that it cannot
predict when the client will start playing it. So your comment in this
respect seems to apply to ASR only.

About the description of the audio format: I wonder if the following is
sufficient:

 > audio-codec: audio/speex

and
 > | message type | channel no | reserved |
 > |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|

Some codecs might need a more detailed description of the audio format,
e.g. raw PCM: sample rate, bits per sample, signed/unsigned... you get
the point. I guess we want to allow for the possibility of a codec not
including this information as part of the binary audio data, so the
header should include an optional slot such as "audio-format:". Its
content would probably have to be codec-specific.
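
To make that concrete, here is a rough TypeScript sketch of a START-AUDIO
message for raw PCM with such a slot. The "audio-format:" header and its
parameter syntax are purely hypothetical (audio/L16 is the registered MIME
type for raw 16-bit PCM); only START-AUDIO, audio-channel, audio-codec and
source-time come from Robert's draft.

// Sketch only: compose a START-AUDIO message that carries codec-specific
// format details in a hypothetical optional "audio-format:" header.
function startAudioMessage(channel: number, sourceTime: number): string {
  return [
    "html-speech/1.0 START-AUDIO",
    `audio-channel: ${channel}`,
    "audio-codec: audio/L16", // raw 16-bit PCM; rate etc. must come from somewhere
    "audio-format: rate=16000; channels=1; signed=true; endian=big", // hypothetical slot
    `source-time: ${sourceTime}`,
  ].join("\r\n") + "\r\n\r\n";
}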

The same logic applies during the discovery phase: it would not be
sufficient to merely provide the MIME type of the audio accepted or
produced. And then of course the TTS request should be able to ask
for, e.g., a specific sampling rate.

Best,
Marc



On 16.06.11 01:09, Robert Brown wrote:
> I took an action item last week to flesh out the media transport and
> media control messages, and whatever other session establishment stuff
> we need.
>
> Let me know what you think of this…
>
> To quote MRCPv2-24 as a starting point:
>
>    ... MRCPv2 is not a
>    "stand-alone" protocol - it relies on other protocols, such as
>    Session Initiation Protocol (SIP) to rendezvous MRCPv2 clients and
>    servers and manage sessions between them, and the Session Description
>    Protocol (SDP) to describe, discover and exchange capabilities.  It
>    also depends on SIP and SDP to establish the media sessions and
>    associated parameters between the media source or sink and the media
>    server.  Once this is done, the MRCPv2 protocol exchange operates
>    over the control session established above, allowing the client to
>    control the media processing resources on the speech resource server.
>
> So we need:
>
> A. Client & Server rendezvous
> B. Description, discovery and interchange of capabilities
> C. Media session establishment
>
> ****
>
> *(A) Client & Server rendezvous*
>
> Rendezvous is easy, and already described in the draft I sent out a
> couple of weeks back
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html.
> The client uses the standard WebSocket HTTP bootstrap, with the header
> “Sec-WebSocket-Protocol: html-speech”. The server accepts the request by
> sending “101 Switching Protocols” and from there on out it’s a
> WebSockets session using the protocol we’re defining here.
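
(For reference, from a page script this bootstrap would just be the standard
WebSocket constructor with a subprotocol argument; a sketch in TypeScript,
with a made-up service URL:)

// Open an html-speech session; the browser adds
// "Sec-WebSocket-Protocol: html-speech" to the HTTP bootstrap request.
const socket = new WebSocket("wss://speech.example.com/session", "html-speech");

socket.onopen = () => {
  // "101 Switching Protocols" was received and the subprotocol accepted;
  // from here on it's the html-speech exchange being defined in this thread.
  console.log("negotiated subprotocol:", socket.protocol); // "html-speech"
};

socket.onclose = (e) => console.log("session closed", e.code);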
>
> ****
>
> *(B) Description, discovery and interchange of capabilities*
>
> Description, discovery and interchange of capabilities are also easy. I
> proposed in the draft that we use GET-PARAMS to do this. This needs
> further elaboration:
>
> 1. GET-PARAMS is an existing capabilities discovery mechanism. Furthermore:
>
> a. Since the common case will be applications having fore-knowledge of
> service capabilities, discovery etc. of capabilities won’t be needed in
> most sessions. So it’s perfectly fine for this to be done after session
> establishment, if the client cares to do it at all. So GET-PARAMS seems
> fine.
>
> b. I don’t believe there’s any use case where the service needs to
> discover the capabilities of the client. Besides, it’s probably
> undesirable from a privacy point of view. So GET-PARAMS, which is C->S,
> is fine.
>
> 2. We should add a few headers for things that clients would find useful
> to know, for example (but not specifically) the following (a client-side
> sketch of the exchange follows this list):
>
> a. “recognize-media:” lists the mime types for the media encodings the
> recognizer will accept.
>
> b. “recognize-lang:” lists the IETF language tags the recognizer will
> recognize.
>
> c. “speak-media:” lists the mime types for the media encodings the
> synthesizer can produce.
>
> d. “speak-lang:” lists the languages the synthesizer can speak.
>
> e. The recognize-* headers return blank if there’s no recognition
> service. Similarly with the speak-* headers if there’s no synthesizer.
>
> 3. The client can just tear down the session if it discovers that the
> service doesn’t have the necessary capabilities.
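
(A TypeScript sketch of that discovery from the client side, as promised
above. The header names are just the proposals in point 2, the parsing
assumes a flat "name: value" header block with CRLF line breaks, and the
real message framing is still to be defined:)

// Hypothetical client-side capability check over the html-speech session.
function requestCapabilities(socket: WebSocket): void {
  // As in MRCP GET-PARAMS, listing header names with empty values asks
  // the service to return them filled in.
  socket.send("html-speech/1.0 GET-PARAMS\r\n" +
              "recognize-media:\r\n" +
              "recognize-lang:\r\n" +
              "speak-media:\r\n" +
              "speak-lang:\r\n\r\n");
}

function parseHeaders(response: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const line of response.split("\r\n")) {
    const i = line.indexOf(":");
    if (i > 0) headers[line.slice(0, i).trim().toLowerCase()] = line.slice(i + 1).trim();
  }
  return headers;
}

// If parseHeaders(reply)["recognize-media"] comes back empty, there is no
// recognizer and the client can simply tear down the session (point 3).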
>
> ****
>
> *(C) Media session establishment*
>
> Media session establishment is also fairly straightforward, but not as
> obvious. With MRCP, media is out-of-band, and is negotiated along with
> the MRCP session negotiation in the SIP INVITE-OK-ACK sequence. However,
> with our protocol, everything happens in-band in the WebSockets session.
> So I think that leaves us with these options, of which I propose we
> adopt #3:
>
> 1. We **could** require all media to be negotiated as part of the
> WebSockets session establishment (i.e. the HTTP bootstrap). For example,
> we could put something in the body of the initial HTTP GET request.
> While I don’t think this is strictly illegal, it’s certainly irregular.
>
> 2. We **could** specify a set of messages that the client and service
> interchange at the very beginning of the session prior to doing any
> recognition or synthesis, which establish the media channels for the
> duration of the session. That would be fine if we were modeling a
> telephone. It would work, but it seems heavy-handed to me.
>
> 3. The most effective approach, IMHO, is for the client or server to just
> create an inline media channel as needed. For example, the client could
> create the audio input session when it does its first RECOGNIZE request.
> Ditto, the service could create an audio output session when it starts
> processing its SPEAK request.
>
> So, how would this work?
>
> In the draft I sent a couple of weeks ago, I suggested a couple of
> control methods:
>
> audio-control-method  =  "START-AUDIO" ; indicates the codec, and perhaps other params, for an audio channel
>                       |  "JUMP-AUDIO"  ; for example, to jump over a period of silence during SR,
>                                        ; or shuttle control in TTS
>
> (Maybe these should be “*-MEDIA” instead of “*-AUDIO”?)
>
> The START-AUDIO method MUST be sent prior to the first packet in an
> audio stream. It essentially describes the encoding being used, and the
> channel number of the stream. It also includes a “source-time” header
> that indicates the local client time for the first byte of the audio
> stream on that channel. This can be used as a reference point for other
> methods and events. For example, a DEFINE-GRAMMAR would specify the time
> at which the grammar becomes active, and the recognizer should be able
> to figure out where in the audio stream this is, which is important for
> continuous recognition scenarios. (The timing info is also useful for
> synchronizing multiple input streams.)
>
> For example:
>
> C->S: html-speech/1.0 START-AUDIO
>       audio-channel: 1
>       audio-codec: audio/speex
>       source-time: 12753248231 (source’s local time at the start of the first packet)
>
> C->S: binary audio packet #1
>
>   0                   1                   2                   3
>   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  |  message type |           channel no          |   reserved    |
>  |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
>  +---------------+-------------------------------+---------------+
>  |                       encoded audio data                      |
>  |                              ...                              |
>  |                              ...                              |
>  |                              ...                              |
>  +---------------------------------------------------------------+
>
> C->S: binary audio packet #2
>
>   0                   1                   2                   3
>   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  |  message type |           channel no          |   reserved    |
>  |1 0 0 0 0 0 0 0|1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|0 0 0 0 0 0 0 0|
>  +---------------+-------------------------------+---------------+
>  |                       encoded audio data                      |
>  |                              ...                              |
>  |                              ...                              |
>  |                              ...                              |
>  +---------------------------------------------------------------+
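
(A sketch of how a client might assemble such a packet, assuming the 4-byte
header in the diagram: one byte of message type, a 16-bit channel number,
one reserved byte, then the encoded audio. Network byte order for the
channel field is my assumption, not something the draft specifies:)

// Build one binary audio packet per the header sketched above:
// message type (1 byte), channel no (2 bytes), reserved (1 byte), payload.
const AUDIO_MESSAGE_TYPE = 0x80; // the "1 0 0 0 0 0 0 0" row in the diagram

function audioPacket(channel: number, encodedAudio: Uint8Array): ArrayBuffer {
  const buf = new ArrayBuffer(4 + encodedAudio.length);
  const view = new DataView(buf);
  view.setUint8(0, AUDIO_MESSAGE_TYPE);
  view.setUint16(1, channel, false); // assumed big-endian channel number
  view.setUint8(3, 0);               // reserved
  new Uint8Array(buf, 4).set(encodedAudio);
  return buf;
}

// socket.send(audioPacket(1, encodedFrame)); // one WebSocket binary frame per packet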
>
> For the RECOGNIZE method, we would add a header called
> “audio-in-channels:” (or something like that) which lists the
> space-delimited media channels used for input (typically only one
> channel, but potentially more). It could have a default value of “1”.
> The corresponding stream(s) can be started either before or after the
> RECOGNIZE message is sent. RECOGNIZE could even be called many times
> during an ongoing stream.
>
> Similarly, for the SYNTHESIZE method, we would add another header called
> “audio-out-channel:” where the client specifies the channel it wants the
> output on. It would have a default value of “1”.
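
(And, continuing the sketch, a minimal RECOGNIZE request tied to such a
stream; only the method name and the proposed audio-in-channels header come
from the text above, and the rest of the message framing is omitted:)

// Hypothetical RECOGNIZE request referencing one or more audio input channels.
function recognizeMessage(channels: number[] = [1]): string {
  return "html-speech/1.0 RECOGNIZE\r\n" +
         `audio-in-channels: ${channels.join(" ")}\r\n\r\n`; // space-delimited, default "1"
}

// socket.send(recognizeMessage()); // may be sent before or after START-AUDIO on channel 1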
>
> ****
>
> *What about SDP???*
>
> SDP is not used in the HTML Speech protocol. It’s just not needed. The
> initial WebSockets handshake both describes and establishes the session.
> Further details are determined in-session using the simple mechanisms
> described above. Furthermore, SDP is concerned with lots of lower level
> things that are either completely irrelevant or already determined by
> virtue of the fact that we’re running on WebSockets. Perhaps if
> WebSockets used SDP as part of session establishment, there would be a
> case to leverage it. But that isn’t the case, and as things actually
> stand we basically have no problems left for SDP to solve.
>

--
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder@dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313




On Jun 16, 2011, at 12:03 PM, Young, Milan wrote:

Thus far, I’ve been assuming that GET-PARAMS and SET-PARAMS are targeted at a particular resource (i.e. synthesizer or recognizer).  For example:
   C->S:  MRCP/2.0 ... SET-PARAMS 543256
          Channel-Identifier:32AECB23433802@speechsynth
          Voice-gender:female
          Voice-variant:3

Targeting this at the service may need further thought.

Thanks

