- From: Robert Brown <Robert.Brown@microsoft.com>
- Date: Mon, 20 Jun 2011 23:27:47 +0000
- To: Robert Brown <Robert.Brown@microsoft.com>, "Milan Young (Nuance)" <Milan.Young@nuance.com>, "Satish Sampath (Google)" <satish@google.com>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, "Michael Johnston (AT&T)" <johnston@research.att.com>, "Marc Schroeder (DFKI)" <marc.schroeder@dfki.de>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "'Fergus Henderson (google)'" <fergus@google.com>
- CC: HTML Speech XG <public-xg-htmlspeech@w3.org>, Michael Bodell <mbodell@microsoft.com>
- Message-ID: <113BCF28740AF44989BE7D3F84AE18DD1B13A034@TK5EX14MBXC118.redmond.corp.microsoft.>
Thanks everyone for the very constructive feedback.

Discovering capabilities

The proposal I sent has two problems:

1. GET-PARAMS is directed at a specific resource, whereas we probably need something that's directed at the service itself. (Thanks Milan.)
2. Returning the full list of server capabilities may be unnecessarily verbose for some services, and may also be insufficient when the server supports resampling. (Thanks Fergus.)

Rather than try to mutate GET-PARAMS, I propose that we introduce a new SESSION-QUERY message the client can send to query whether the service supports specific capabilities nominated by the client. For example, here the client asks for some capabilities, and the server returns the supported subset:

C->S: html-speech/1.0 ... SESSION-QUERY 34132
      recognize-media: audio/basic, audio/amr-wb, audio/x-wav;channels=2;formattag=pcm;samplespersec=44100, audio/dsr-es202212; rate:8000; maxptime:40
      recognize-lang: en-AU, en-GB, en-US, en
      speak-media: audio/ogg, audio/flac, audio/basic
      speak-lang: en-AU, en-GB

S->C: html-speech/1.0 ... 34132 200 COMPLETE
      recognize-media: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40
      recognize-lang: en-GB, en
      speak-media: audio/flac, audio/basic
      speak-lang: en-GB

What do you think?

Media signaling

People identified a few problems with the proposal I sent:

1. Media formats were insufficiently specified. http://www.ietf.org/rfc/rfc3555.txt "MIME Type Registration of RTP Payload Formats" should do the trick. (Thanks Marc, Michael J, & Bjorn.)
2. The timestamp in START-AUDIO is unnecessary for TTS. (Thanks Marc.) My original thought here was to include that timestamp at the start of the TTS stream, so that it would have a base time from which to express the timestamps of interim events, such as mark events. But I think Marc's right and it's redundant. Furthermore, I think the entire START-AUDIO message is redundant for TTS, since all audio will be sent in response to a SPEAK request.
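For illustration, the "return the supported subset" behaviour behind the SESSION-QUERY exchange earlier in this message could be sketched as below. This is a hypothetical Python sketch, not part of the proposal; the header names come from the example, but the `SUPPORTED` table and function names are mine:

```python
# A sketch of a service answering SESSION-QUERY: for each header the
# client nominates, return only the values the service supports,
# preserving the client's ordering. SUPPORTED is a made-up example table.
SUPPORTED = {
    "recognize-media": {"audio/basic", "audio/dsr-es202212; rate:8000; maxptime:40"},
    "recognize-lang": {"en-GB", "en"},
    "speak-media": {"audio/flac", "audio/basic"},
    "speak-lang": {"en-GB"},
}

def answer_session_query(query: dict) -> dict:
    """Intersect the client's nominated capabilities with the service's."""
    return {
        header: [v for v in values if v in SUPPORTED.get(header, ())]
        for header, values in query.items()
    }

query = {
    "recognize-lang": ["en-AU", "en-GB", "en-US", "en"],
    "speak-media": ["audio/ogg", "audio/flac", "audio/basic"],
}
print(answer_session_query(query))
# {'recognize-lang': ['en-GB', 'en'], 'speak-media': ['audio/flac', 'audio/basic']}
```

A header the service has never heard of simply comes back with an empty value list, which is one plausible way for the client to learn the capability is absent.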
I also have these other concerns:

3. There's no way for the service to reject an input stream if it is unsupported (e.g. the service doesn't do SR, or the encoding format is not supported). To be consistent, the service should return 200 PENDING to indicate it's receiving (or ready to receive) an input stream, and 200 COMPLETE when it reaches the end of an input stream; or alternatively a 4xx COMPLETE to indicate it's unable to process the stream for whatever reason.
4. I'm unconvinced that the audio-channel header is needed. MRCP has a request-id for each request, and I suspect this should suffice as a channel ID for the audio stream. The only drawback is that MRCP's request ID is defined as a 10-digit decimal number (1*10DIGIT), which would need 34 bits to encode. That seems arbitrarily large; a 16-bit ID would still be massively overprovisioned.
5. The JUMP-AUDIO method I proposed seems quite out of place, since it's essentially a special media encoding ("silence happened here") sent as a text message. It would be more consistent to send this as a type of media packet.

So... here's a rewrite of the proposal:

Audio Packet Format

In the audio packet format:

1. Rename "binary-channel-id" to a 16-bit "binary-session-id".
2. Define three message types:
   a. 0x01 = audio packet. Data contains encoded audio data.
   b. 0x02 = skip forward. Data contains the new timestamp for the stream, where intervening time is assumed to be silence.
   c. 0x03 = end-of-stream. No data.

audio-packet        = binary-message-type binary-session-id binary-reserved binary-data
binary-message-type = OCTET  ; Values > 0x03 are reserved. 0x00 is undefined.
binary-session-id   = 2OCTET ; Matches the request-id for a SPEAK or START-AUDIO request
binary-reserved     = OCTET  ; Just pads out to the 32-bit boundary for now, but may be useful later.
binary-data         = *OCTET

MRCP Request ID

Redefine the Request ID as a decimal number between 0 and 65535 (i.e. 16 bits), which should be more than sufficient, by several orders of magnitude.

TTS Audio

For TTS, we'd add a mandatory "audio-codec:" header to the SPEAK request. All audio packets generated from that request will use that codec, and will encode the SPEAK request's session-id in their binary header. The client uses the MRCP "speech-marker:" header to calculate the timing of the TTS events. The clock is defined as starting at zero at the beginning of the output stream. (FWIW, this isn't too inconsistent with MRCP, which defines this header as being synched with the RTP timestamp, which we don't have; but RTP in turn says that it's okay to use zero - see http://www.ietf.org/rfc/rfc3550.txt section 6.4.1.)

Plagiarizing and modifying a sample from MRCPv2-24:

C->S: html-speech/1.0 ... SPEAK 3257
      Channel-Identifier:32AECB23433802@speechsynth
      Voice-gender:neutral
      Voice-Age:25
      Audio-codec:audio/flac
      Prosody-volume:medium
      Content-Type:application/ssml+xml
      Content-Length:...

      <?xml version="1.0"?>
      <speak version="1.0" ...
      ...

S->C: html-speech/1.0 ... 3257 200 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=0

C->S: binary audio packet #1 (session-id = 3257 = 110010111001)

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | message type  |           session-id          |   reserved    |
      |1 0 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+
      |                      encoded audio data                       |
      |                              ...                              |
      |                              ...                              |
      |                              ...                              |
      +---------------------------------------------------------------+

C->S: binary audio packet #2
      ...

S->C: html-speech/1.0 ... SPEECH-MARKER 3257 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=2059000;marker-1

C->S: binary audio packet #3
      ...

C->S: binary audio packet #4
      ...

S->C: html-speech/1.0 ...
SPEAK-COMPLETE 3257 COMPLETE
      Channel-Identifier:32AECB23433802@speechsynth
      Completion-Cause:000 normal
      Speech-Marker:timestamp=5011000

C->S: binary audio packet #5
      ...

C->S: binary audio packet #6: end of stream (message type = 0x03)

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | message type  |           session-id          |   reserved    |
      |1 1 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
      +---------------+-------------------------------+---------------+

SR Audio

For SR, audio input streams are independent of any particular recognition request (e.g. there may be multiple recognition requests in series on the same audio stream, or there may be recognition requests that consume multiple input streams). So we stick with the idea of a START-AUDIO request. START-AUDIO has a session-ID which identifies the audio stream and is included in all subsequent audio packets. It also describes the codec of the audio (audio-codec header), and contains the client's timestamp for the start of the audio stream (source-time), so the service can accurately express events using client-local time (and so that the service can synchronize the processing of multiple input streams, where relevant).

The service responds to START-AUDIO with a standard MRCP-style response: 200 IN-PROGRESS to indicate that it is accepting the input stream, or 4xx COMPLETE to indicate it's rejecting the stream. The client does not need to wait for this response before sending any audio. But if it gets a 4xx response and has already sent some audio, it shouldn't send any more.

In the RECOGNIZE request:

1. We'll add a mandatory "audio-sessions:" header, which contains a comma-separated list of audio stream session IDs.
2. We'll add a "source-time:" header to indicate the point in the input stream that the recognizer should start recognizing from.

For example:

C->S: html-speech/1.0 ...
START-AUDIO 41021
      Audio-codec: audio/dsr-es202212; rate:8000; maxptime:40
      Source-time: 12753248231  (source's local time at the start of the first packet)

C->S: binary audio packet #1
      ...

S->C: html-speech/1.0 ... 41021 200 IN-PROGRESS  (i.e. the service is accepting the audio)

C->S: binary audio packet #2
      ...

C->S: html-speech/1.0 ... RECOGNIZE 8322
      Channel-Identifier:32AECB23433801@speechrecog
      Confidence-Threshold:0.9
      Audio-sessions: 41021  (request-id of the input stream)
      Source-time: 12753432234  (where in the input stream recognition should start)

S->C: MRCP/2.0 ... START-OF-INPUT 8322 IN-PROGRESS
      ...

S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 8322 COMPLETE
      ...

C->S: binary audio packet #N: end of stream
      ...

S->C: html-speech/1.0 ... 41021 200 COMPLETE  (i.e. the service has received the end of stream)
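To make the 4-byte binary packet header concrete, here is a small Python sketch of packing and parsing it per the layout proposed above (message type, 16-bit session-id, reserved pad byte, then data). Note this is illustrative only: the proposal does not yet pin down byte order on the wire, so network byte order for the session-id is my assumption, and all function names are mine:

```python
import struct

# Message types from the proposal.
AUDIO = 0x01          # data contains encoded audio
SKIP_FORWARD = 0x02   # data contains the new timestamp for the stream
END_OF_STREAM = 0x03  # no data

# 1 octet type, 2 octets session-id, 1 octet reserved.
# "!" (network byte order) is an assumption, not something the proposal specifies.
HEADER = struct.Struct("!BHB")

def pack_packet(msg_type: int, session_id: int, data: bytes = b"") -> bytes:
    """Build one binary audio packet: 4-byte header followed by data."""
    if not AUDIO <= msg_type <= END_OF_STREAM:
        raise ValueError("values > 0x03 are reserved; 0x00 is undefined")
    if not 0 <= session_id <= 0xFFFF:
        raise ValueError("session-id must fit in 16 bits")
    return HEADER.pack(msg_type, session_id, 0x00) + data

def unpack_packet(packet: bytes):
    """Split a packet back into (message type, session-id, data)."""
    msg_type, session_id, _reserved = HEADER.unpack_from(packet)
    return msg_type, session_id, packet[HEADER.size:]

# End-of-stream packet for the SPEAK request 3257 in the sample exchange:
# header only, no data, total length 4 octets.
eos = pack_packet(END_OF_STREAM, 3257)
assert len(eos) == 4
assert unpack_packet(eos) == (END_OF_STREAM, 3257, b"")
```

The 16-bit bound enforced here is exactly why the redefined 0-65535 request ID is convenient: the MRCP 1*10DIGIT form (up to 9999999999) needs 34 bits and would not fit this header field.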
Received on Monday, 20 June 2011 23:28:34 UTC