RE: Session initiation & media negotiation

Thanks Milan,

Apparently I should have proof-read this mail before I sent it :)

I mostly agree with the feedback, but have a couple of comments where I'd like to leave it unchanged.

Inline...

From: Young, Milan [mailto:Milan.Young@nuance.com]
Sent: Tuesday, June 21, 2011 1:00 PM
To: Robert Brown; Satish Sampath (Google); gshires@google.com; Patrick Ehlen (AT&T); Dan Burnett (Voxeo); Michael Johnston (AT&T); Marc Schroeder (DFKI); gshires@google.com; Fergus Henderson (google)
Cc: HTML Speech XG; Michael Bodell
Subject: RE: Session initiation & media negotiation

Inline...

________________________________
From: Robert Brown [mailto:Robert.Brown@microsoft.com]
Sent: Monday, June 20, 2011 4:28 PM
To: Robert Brown; Young, Milan; Satish Sampath (Google); Glen Shires (gshires@google.com); Patrick Ehlen (AT&T); Dan Burnett (Voxeo); Michael Johnston (AT&T); Marc Schroeder (DFKI); Glen Shires (gshires@google.com); 'Fergus Henderson (google)'
Cc: HTML Speech XG; Michael Bodell
Subject: RE: Session initiation & media negotiation

Thanks everyone for the very constructive feedback.

Discovering capabilities

The proposal I sent has two problems:

1. GET-PARAMS is directed at a specific resource, whereas we probably need something that's directed at the service itself. (Thanks Milan).

2. Returning the full list of server capabilities may be unnecessarily verbose for some services, and may also be insufficient when the server supports resampling. (Thanks Fergus).

Rather than try to mutate GET-PARAMS, I propose that we introduce a new SESSION-QUERY message, which the client can send to ask whether the service supports specific capabilities it nominates.

For example, here the client asks for some capabilities, and the server returns the supported subset:

C->S: html-speech/1.0 ... SESSION-QUERY 34132
      recognize-media: audio/basic, audio/amr-wb,
                       audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                       audio/dsr-es202212; rate:8000; maxptime:40
      recognize-lang: en-AU, en-GB, en-US, en
      speak-media: audio/ogg, audio/flac, audio/basic
      speak-lang: en-AU, en-GB

S->C: html-speech/1.0 ... 34132 200 COMPLETE
      recognize-media: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      recognize-lang: en-GB, en
      speak-media: audio/flac, audio/basic
      speak-lang: en-GB
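
For illustration, here's a rough Python sketch of the kind of matching the server would do to produce that response. The SUPPORTED table and query_subset helper are made-up names, and real matching would also need to compare media-type parameters, not just literal strings:

   # Server side of SESSION-QUERY: return the subset of the client's
   # nominated capabilities that this service supports.
   SUPPORTED = {
       "recognize-lang": {"en-GB", "en"},
       "speak-lang": {"en-GB"},
   }

   def query_subset(header, requested):
       # Preserve the client's ordering; drop anything unsupported.
       return [cap for cap in requested if cap in SUPPORTED.get(header, ())]

   # e.g. the recognize-lang line from the exchange above:
   assert query_subset("recognize-lang", ["en-AU", "en-GB", "en-US", "en"]) == ["en-GB", "en"]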

What do you think?
[Milan] I thought there was a privacy issue around requiring the client to request specific service capabilities.  That's why we originally preferred a model where the server sends everything it can do, and the client chooses from that list.  So perhaps we should support a '*' syntax or something in addition to what you've spec'd above.

[Robert] I think we're okay here.  The privacy concern was around the UA revealing the user's personal language settings when querying the capabilities of the default services.   See these minutes: http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes201105.html#fallback.
<burn> tentative text: The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to give incorrect information to the web app or decline to answer.

[Milan] I'd also prefer your 'speak' and 'recognize' keywords to be consistent with the resource names in the control protocol.  MRCP has these as 'recognizer' and 'synthesizer'.  Perhaps 'synthesizer' is not the best name given that it also supports recorded audio, but obviously the MRCP folks thought it was a decent fit.  So unless we have a strong reason to diverge, I suggest we stick with convention.

[Robert] Agreed.

Media signaling

People identified a few problems with the proposal I sent:

1. Media formats were insufficiently specified.  RFC 3555, "MIME Type Registration of RTP Payload Formats" (http://www.ietf.org/rfc/rfc3555.txt), should do the trick.  (Thanks Marc, Michael J, & Bjorn).

2. The timestamp in START-AUDIO is unnecessary for TTS.  (Thanks Marc).  My original thought here was to include that timestamp at the start of the TTS stream, so that it would have a base time from which to express the timestamps of interim events, such as mark events.  But I think Marc's right: it's redundant.  Furthermore, I think the entire START-AUDIO message is redundant for TTS, since all audio will be sent in response to a SPEAK request.

I also have these other concerns:

3. There's no way for the service to reject an input stream it can't handle (e.g. the service doesn't do SR, or the encoding format is unsupported).  To be consistent, the service should return 200 PENDING to indicate it's receiving (or ready to receive) an input stream, and 200 COMPLETE when it reaches the end of an input stream; or alternatively a 4xx COMPLETE to indicate it's unable to process the stream for whatever reason.

4. I'm unconvinced that the audio-channel header is needed.  MRCP has a request-id for each request, and I suspect this should suffice as a channel ID for the audio stream.  The only drawback is that MRCP's request ID is defined as a 10-digit decimal number (1*10DIGIT), which would need 34 bits to encode.  That seems arbitrarily large; the ID could easily fit into 16 bits and still be massively overprovisioned.

5. The JUMP-AUDIO method I proposed seems quite out of place, since it's essentially a special media encoding ("silence happened here") sent as a text message.  It would be more consistent to send this as a type of media packet.

So... here's a rewrite of the proposal:

Audio Packet Format

In the audio packet format:

1. Rename "binary-channel-id" to a 16-bit "binary-session-id".

2. Define three message types:

a. 0x01 = audio packet.  Data contains encoded audio data.

b. 0x02 = skip forward.  Data contains the new timestamp for the stream, where the intervening time is assumed to be silence.

c. 0x03 = end-of-stream.  No data.

audio-packet          =  binary-message-type
                         binary-session-id
                         binary-reserved
                         binary-data
binary-message-type   =  OCTET ; Values > 0x03 are reserved. 0x00 is undefined.
binary-session-id     = 2OCTET ; Matches the request-id for a SPEAK or START-AUDIO request
binary-reserved       =  OCTET ; Just buffers out to the 32-bit boundary for now, but may be useful later.
binary-data           = *OCTET ; Payload; its interpretation depends on the message type.
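
For illustration, here's a rough Python sketch of packing and parsing these packets. The network (big-endian) byte order for the 16-bit session-id is my assumption - the format above doesn't pin it down yet:

   import struct

   # Message types from the list above.
   AUDIO = 0x01          # data = encoded audio
   SKIP_FORWARD = 0x02   # data = new timestamp; intervening time is silence
   END_OF_STREAM = 0x03  # no data

   def pack_audio_packet(message_type, session_id, data=b""):
       # 4-octet header: message type, 16-bit session-id, reserved octet.
       if not 0x01 <= message_type <= 0x03:
           raise ValueError("0x00 is undefined; values > 0x03 are reserved")
       return struct.pack("!BHB", message_type, session_id, 0x00) + data

   def parse_header(packet):
       # Returns (message-type, session-id, reserved) from the first 4 octets.
       return struct.unpack("!BHB", packet[:4])

   # e.g. the end-of-stream packet for request 3257:
   eos = pack_audio_packet(END_OF_STREAM, 3257)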


MRCP Request ID

Redefine the Request ID as a decimal number between 0 and 65535 (i.e. 2^16 values), which should be more than sufficient, by several orders of magnitude.
[Milan] I'm usually wary of comments like that, especially coming from Microsoft :).  But in this case I suspect you are OK.
[Robert] I'm shocked and offended :P


TTS Audio

For TTS, we'd add a mandatory "audio-codec:" header to the SPEAK request.  All audio packets generated from that request will use that CODEC, and will encode the SPEAK request's session-id in their binary header.
[Milan] In the MRCP world, this would be a request-id not a session-id.  I understand the desire to be consistent with SR, but we need to be careful about diverging in terminology if we are going to reference another spec.

[Robert] Agreed.  This is a typo I made throughout the doc.


The client uses the MRCP "speech-marker:" header to calculate the timing of the TTS events.  The clock is defined as starting at zero at the beginning of the output stream.  (FWIW, this isn't too inconsistent with MRCP, which defines this header as being synched with the RTP timestamp.  We don't have RTP, but RTP in turn says that it's okay to use zero - see http://www.ietf.org/rfc/rfc3550.txt, section 6.4.1.)
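
To make the arithmetic concrete, here's a rough sketch of how the client might turn a speech-marker timestamp into a playback offset. The microsecond unit is my assumption, inferred from the sample values below (2059000 would be about 2.06 seconds); the actual unit is something we'd need to pin down:

   # Assumes speech-marker timestamps count microseconds from the start
   # of the output stream (an inference from the sample below, not spec'd).
   def marker_offset_seconds(timestamp):
       return timestamp / 1000000.0

   # marker-1 in the sample below would fire 2.059s into playback:
   assert marker_offset_seconds(2059000) == 2.059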

Plagiarizing and modifying a sample from MRCPv2-24:

   C->S: html-speech/1.0 ... SPEAK 3257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Audio-codec:audio/flac
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
            <speak version="1.0"
            ...
            ...
   S->C: html-speech/1.0 ... 3257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=0

   C->S: binary audio packet #1 (session-id = 3257 = 110010111001)
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           session-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

   C->S: binary audio packet #2
         ...
         ...

   S->C: html-speech/1.0 ... SPEECH-MARKER 3257 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=2059000;marker-1

   C->S: binary audio packet #3
         ...
         ...

   C->S: binary audio packet #4
         ...
         ...

   S->C: html-speech/1.0 ... SPEAK-COMPLETE 3257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=5011000

   C->S: binary audio packet #5
         ...
         ...

   C->S: binary audio packet #6: end of stream ( message type = 0x03 )
         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           session-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
[Milan] I suspect these binary packets should be S->C.

[Robert] Yep.


SR Audio

For SR, audio input streams are independent of any particular recognition request (e.g. there may be multiple recognition requests in series on the same audio stream, or there may be recognition requests that consume multiple input streams).  So we stick with the idea of a START-AUDIO request.  START-AUDIO has a session-id, which identifies the audio stream and is included in all subsequent audio packets.  It also describes the CODEC of the audio (audio-codec header), and contains the client's timestamp for the start of the audio stream (source-time), so that the service can accurately express events in client-local time (and, where relevant, synchronize the processing of multiple input streams).
[Milan] Again, the 'session-id' name bothers me.  Maybe we can discuss this on the call.
[Robert] Nah, I'll just fix it.
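
As an aside, source-time is what makes the timestamp arithmetic work. Here's a hypothetical sketch of the service-side mapping, using the clock values from the example below; the tick unit is simply whatever the client's clock uses:

   # The service maps an offset within the audio stream back to the
   # client's clock by adding it to START-AUDIO's source-time.
   # (Hypothetical helper, not part of the proposal.)
   def to_client_time(source_time, stream_offset):
       return source_time + stream_offset

   # Consistent with the example below, where RECOGNIZE's source-time
   # falls 184003 ticks after the start of stream 41021:
   assert to_client_time(12753248231, 184003) == 12753432234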

The service responds to START-AUDIO with a standard MRCP-style response: 200 COMPLETE to indicate that it is accepting the input stream, or 4xx COMPLETE to indicate that it's rejecting the stream.  The client does not need to wait for this response before sending any audio.  But if it gets a 4xx response and has already sent some audio, it shouldn't send any more.
[Milan] Why not respond with an IN-PROGRESS, and withhold the COMPLETE until the end of stream has been reached?
[Robert] Agreed.  In fact, this is what I put in the example below...  apparently I should have proof-read this mail before I sent it.


In the RECOGNIZE request:

1. We'll add a mandatory "audio-sessions:" header to the RECOGNIZE request, which contains a comma-separated list of audio stream session IDs.

[Milan] If the multiple streams are coming from a microphone array or something, shouldn't that be mixed on the device?  And if they truly are independent sources (e.g. a conference call), then shouldn't that be multiple RECOGNIZE requests?
[Robert] The scenario I have in mind is a multimodal device that sends touch or gesture streams, which wouldn't be part of the audio codec.



2. We'll add a "source-time:" header to indicate the point in the input stream that the recognizer should start recognizing from.

For example:

   C->S: html-speech/1.0 ... START-AUDIO 41021
         Audio-codec: audio/dsr-es202212; rate:8000; maxptime:40
         source-time: 12753248231 (source's local time at the start of the first packet)

   C->S: binary audio packet #1
         ...
         ...

   S->C: html-speech/1.0 ... 41021 200 IN-PROGRESS (i.e. the service is accepting the audio)

   C->S: binary audio packet #2
         ...
         ...

   C->S: html-speech/1.0 ... RECOGNIZE 8322
         Channel-Identifier:32AECB23433801@speechrecog
         Confidence-Threshold:0.9
         Audio-sessions: 41021 (request-id of the input stream)
         Source-time: 12753432234 (where in the input stream recognition should start)

   S->C: html-speech/1.0 ... START-OF-INPUT 8322 IN-PROGRESS
         ...
         ...

   S->C: html-speech/1.0 ... RECOGNITION-COMPLETE 8322 COMPLETE
         ...
         ...

   C->S: binary audio packet #N: end of stream
         ...
         ...

   S->C: html-speech/1.0 ... 41021 200 COMPLETE (i.e. the service has received the end of stream)
         ...
         ...


[Milan] I'll soon be proposing an END-OF-INPUT, but I think that should fit just fine.

[Robert] I'm anxious to see it.

Received on Tuesday, 21 June 2011 21:07:22 UTC