RE: Session initiation & media negotiation

Inline...

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Monday, June 20, 2011 4:28 PM
To: Robert Brown; Young, Milan; Satish Sampath (Google); Glen Shires
(gshires@google.com); Patrick Ehlen (AT&T); Dan Burnett (Voxeo); Michael
Johnston (AT&T); Marc Schroeder (DFKI); Glen Shires
(gshires@google.com); 'Fergus Henderson (google)'
Cc: HTML Speech XG; Michael Bodell
Subject: RE: Session initiation & media negotiation

 

Thanks everyone for the very constructive feedback.

 

Discovering capabilities

 

The proposal I sent has two problems:

1.       GET-PARAMS is directed at a specific resource, whereas we
probably need something that's directed at the service itself. (Thanks
Milan).

2.       Returning the full list of server capabilities may be
unnecessarily verbose for some services, and may also be insufficient
when the server supports resampling. (Thanks Fergus).

 

Rather than try to mutate GET-PARAMS, I propose that we introduce a new
SESSION-QUERY message the client can send to query whether the service
supports specific capabilities nominated by the client.

 

For example, here the client asks for some capabilities, and the server
returns the supported subset:

 

C->S: html-speech/1.0 ... SESSION-QUERY 34132

      recognize-media: audio/basic, audio/amr-wb,
                       audio/x-wav;channels=2;formattag=pcm;samplespersec=44100,
                       audio/dsr-es202212; rate:8000; maxptime:40

      recognize-lang: en-AU, en-GB, en-US, en

      speak-media: audio/ogg, audio/flac, audio/basic

      speak-lang: en-AU, en-GB

 

S->C: html-speech/1.0 ... 34132 200 COMPLETE

      recognize-media: audio/basic, audio/dsr-es202212; rate:8000; maxptime:40

      recognize-lang: en-GB, en

      speak-media: audio/flac, audio/basic

      speak-lang: en-GB
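Server-side, computing that response amounts to intersecting the queried values with what the service supports. A minimal Python sketch (function and variable names are mine; the naive exact-string matching here ignores media-type parameters such as rate and maxptime, which a real service would need to parse and compare):

```python
# Hypothetical sketch of a service answering SESSION-QUERY: for each
# queried header, return only the values the service supports.
# Naive exact-string matching; a real service would compare media-type
# parameters (rate, maxptime, channels, ...) rather than whole strings.

def query_response(requested, supported):
    response = {}
    for header, values in requested.items():
        matches = [v for v in values if v in supported.get(header, set())]
        if matches:
            response[header] = matches
    return response

supported = {
    "recognize-lang": {"en-GB", "en"},
    "speak-media": {"audio/flac", "audio/basic"},
    "speak-lang": {"en-GB"},
}
requested = {
    "recognize-lang": ["en-AU", "en-GB", "en-US", "en"],
    "speak-media": ["audio/ogg", "audio/flac", "audio/basic"],
    "speak-lang": ["en-AU", "en-GB"],
}
print(query_response(requested, supported))
# → {'recognize-lang': ['en-GB', 'en'], 'speak-media': ['audio/flac', 'audio/basic'], 'speak-lang': ['en-GB']}
```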

 

What do you think?

[Milan] I thought there was a privacy issue around requiring the client
to request specific service capabilities.  That's why we were originally
preferring a model where the server should send everything it can do,
and the client could choose from that list.  So perhaps we should
support a '*' syntax or something in addition to what you have spec'd
above.

 

I'd also prefer your 'speak' and 'recognize' keywords to be consistent
with the resource names in the control protocol.  MRCP has these as
'recognizer' and 'synthesizer'.  Perhaps synthesizer is not the best
name given the fact that it also supports recorded audio, but obviously
the MRCP folks thought it was a decent fit.  So unless we have strong
reason to diverge, I suggest we stick with convention.


Media signaling

 

People identified a few problems with the proposal I sent:

1.       Media formats were insufficiently specified.
http://www.ietf.org/rfc/rfc3555.txt "MIME Type Registration of RTP
Payload Formats" should do the trick.  (Thanks Marc, Michael J, &
Bjorn).

2.       The timestamp in START-AUDIO is unnecessary for TTS.  (Thanks
Marc.)  My original thought here was to include that timestamp at the
start of the TTS stream, so that it would have a base time from which to
express the timestamps of interim events, such as mark events.  But I
think Marc's right and it's redundant.  Furthermore, I think the entire
START-AUDIO message is redundant for TTS, since all audio will be sent
in response to a SPEAK request.

I also have these other concerns:

3.       There's no way for the service to reject an input stream if it
is unsupported (e.g. the service doesn't do SR, or the encoding format
is not supported).  To be consistent, the service should return 200
PENDING to indicate it's receiving (or ready to receive) an input
stream, and 200 COMPLETE when it reaches the end of an input stream; or
alternatively a 4xx COMPLETE to indicate it's unable to process the
stream for whatever reason.

4.       I'm unconvinced that the audio-channel header is needed.  MRCP
has a request-id for each request, and I suspect this should suffice as
a channel ID for the audio stream.  The only drawback is that MRCP's
request ID is defined as a 10-digit decimal number (1*10DIGIT), which
would need 34 bits to encode.  That seems arbitrarily large; a request
ID could easily fit into 16 bits and still be massively overprovisioned.

5.       The JUMP-AUDIO method I proposed seems quite out of place,
since it's essentially a special media encoding ("silence happened
here"), sent as a text message.  It would be more consistent to send
this as a type of media packet.
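As a quick sanity check on the bit arithmetic in point 4 (a sketch; the numbers are just the bounds stated above):

```python
# 1*10DIGIT allows request IDs up to 9999999999, which needs 34 bits to
# encode; a 16-bit field still gives 65536 distinct IDs per session.
max_mrcp_id = 10**10 - 1
print(max_mrcp_id.bit_length())  # 34
print((65535).bit_length())      # 16
print(max_mrcp_id // 65535)      # the 10-digit space is ~150,000x larger
```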

 

So... here's a rewrite of the proposal:

 

Audio Packet Format

 

In the audio packet format:

1.       Rename "binary-channel-id" to a 16-bit "binary-session-id".

2.       Define three message types:

a.       0x01 = audio packet. Data contains encoded audio data.

b.      0x02 = skip forward.  Data contains the new timestamp for the
stream, where intervening time is assumed to be silence.

c.       0x03 = end-of-stream.  No data.

 

audio-packet          = binary-message-type
                        binary-session-id
                        binary-reserved
                        binary-data

binary-message-type   = OCTET  ; Values > 0x03 are reserved. 0x00 is undefined.

binary-session-id     = 2OCTET ; Matches the request-id for a SPEAK or START-AUDIO request

binary-reserved       = OCTET  ; Just buffers out to the 32-bit boundary for now, but may be useful later.

binary-data           = *OCTET
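A sketch of packing and unpacking that 4-octet header in Python (names are mine; it assumes the 16-bit session-id goes on the wire in network byte order, which the ABNF above doesn't actually pin down):

```python
import struct

# Message types from the list above.
AUDIO, SKIP_FORWARD, END_OF_STREAM = 0x01, 0x02, 0x03

def pack_packet(msg_type, session_id, data=b""):
    # !BHB = message-type octet, 16-bit session-id (network order),
    # reserved octet (always zero for now)
    return struct.pack("!BHB", msg_type, session_id, 0) + data

def unpack_packet(packet):
    msg_type, session_id, _reserved = struct.unpack("!BHB", packet[:4])
    return msg_type, session_id, packet[4:]

pkt = pack_packet(AUDIO, 3257, b"\x01\x02")
print(pkt.hex())           # 010cb9000102  (3257 = 0x0CB9)
print(unpack_packet(pkt))  # (1, 3257, b'\x01\x02')
```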

 

 

MRCP Request ID

 

Redefine the Request ID as a decimal number between 0 and 65535 (i.e.
2^16 - 1), which should be more than sufficient, by several orders of
magnitude.

[Milan] I'm usually wary of comments like that, especially coming from
Microsoft :-).  But in this case I suspect you are OK.

 

 

TTS Audio

 

For TTS, we'd add a mandatory "audio-codec:" header to the SPEAK
request.  All audio packets generated from that request will use that
CODEC, and will encode the SPEAK request's session-id in their binary
header.

[Milan] In the MRCP world, this would be a request-id not a session-id.
I understand the desire to be consistent with SR, but we need to be
careful about diverging in terminology if we are going to reference
another spec.

 

 

The client uses the MRCP "speech-marker:" header to calculate the timing
of the TTS events.  The clock is defined as starting at zero at the
beginning of the output stream.  (FWIW, this isn't too inconsistent with
MRCP, which defines this header as being synched with the RTP timestamp,
which we don't have; but RTP in turn says that it's okay to use zero -
see http://www.ietf.org/rfc/rfc3550.txt section 6.4.1.)
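So a client can recover wall-clock timing for marks by offsetting from the moment playback of the stream began. A sketch, assuming (my assumption, not settled above) that the timestamp counts samples at a known codec rate, as RTP would:

```python
# Hypothetical: convert a speech-marker timestamp (zero-based at the
# start of the output stream) into client wall-clock time.  The unit of
# the timestamp is an assumption here; RTP ties it to the codec's
# sampling clock, e.g. 8000 Hz.
SAMPLE_RATE = 8000  # assumed codec rate, samples per second

def marker_wallclock(playback_start_secs, marker_timestamp):
    """playback_start_secs: client time when the first output sample played."""
    return playback_start_secs + marker_timestamp / SAMPLE_RATE

# A marker at timestamp 16000 fires 2 s after playback began:
print(marker_wallclock(100.0, 16000))  # 102.0
```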

 

Plagiarizing and modifying a sample from MRCPv2-24:

 

   C->S: html-speech/1.0 ... SPEAK 3257

         Channel-Identifier:32AECB23433802@speechsynth

         Voice-gender:neutral

         Voice-Age:25

         Audio-codec:audio/flac

         Prosody-volume:medium

         Content-Type:application/ssml+xml

         Content-Length:...

 

         <?xml version="1.0"?>

            <speak version="1.0"

            ...

            ...

   S->C: html-speech/1.0 ... 3257 200 IN-PROGRESS

         Channel-Identifier:32AECB23433802@speechsynth

         Speech-Marker:timestamp=0

 

   C->S: binary audio packet #1 (session-id = 3257 = 110010111001)

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           session-id          |   reserved    |
        |1 0 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+
        |                       encoded audio data                      |
        |                              ...                              |
        |                              ...                              |
        |                              ...                              |
        +---------------------------------------------------------------+

 

   C->S: binary audio packet #2

         ...

         ...

 

   S->C: html-speech/1.0 ... SPEECH-MARKER 3257 IN-PROGRESS

         Channel-Identifier:32AECB23433802@speechsynth

         Speech-Marker:timestamp=2059000;marker-1

 

   C->S: binary audio packet #3

         ...

         ...

 

   C->S: binary audio packet #4

         ...

         ...

 

   S->C: html-speech/1.0 ... SPEAK-COMPLETE 3257 COMPLETE

         Channel-Identifier:32AECB23433802@speechsynth

         Completion-Cause:000 normal

         Speech-Marker:timestamp=5011000

 

   C->S: binary audio packet #5

         ...

         ...

 

   C->S: binary audio packet #6: end of stream ( message type = 0x03 )

         0                   1                   2                   3
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  message type |           session-id          |   reserved    |
        |1 1 0 0 0 0 0 0|1 0 0 1 1 1 0 1 0 0 1 1 0 0 0 0|0 0 0 0 0 0 0 0|
        +---------------+-------------------------------+---------------+

[Milan] I suspect these binary packets should be S->C.

 

 

SR Audio

 

For SR, audio input streams are independent of any particular
recognition request (e.g. there may be multiple recognition requests in
series on the same audio stream, or there may be recognition requests
that consume multiple input streams).  So we stick with the idea of a
START-AUDIO request.  START-AUDIO has a session-ID which identifies the
audio stream, and is included in all subsequent audio packets.  It also
describes the CODEC of the audio (audio-codec header), and contains the
client's timestamp for the start of the audio stream (source-time), so
the service can accurately express events in client-local time (and
so that the service can synchronize the processing of multiple input
streams, where relevant).

[Milan] Again, the 'session-id' name bothers me.  Maybe we can discuss
this on the call.

 

The service responds to START-AUDIO with a standard MRCP-style response,
which will be 200 COMPLETE to indicate that it is accepting the input
stream, or 4xx COMPLETE to indicate it's rejecting the stream.  The
client does not need to wait for this response before sending any audio.
But if it gets a 4xx response, and has already sent some audio, it
shouldn't send any more.
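That client-side rule is simple enough to state in code (a hypothetical sketch; the class and method names are mine):

```python
# Track whether the client may keep sending audio for a stream: audio
# may flow before the START-AUDIO response arrives, but a 4xx response
# means the stream was rejected and no further packets should be sent.
class OutboundAudioStream:
    def __init__(self, session_id):
        self.session_id = session_id
        self.rejected = False

    def on_start_audio_response(self, status_code):
        if 400 <= status_code < 500:
            self.rejected = True

    def may_send(self):
        return not self.rejected
```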

[Milan] Why not respond with an IN-PROGRESS, and withhold the COMPLETE
until the end of stream has been reached?

 

 

In the RECOGNIZE request:

1.       We'll add a mandatory "audio-sessions:" header to the RECOGNIZE
request, which contains a comma-separated list of audio stream session
IDs.

[Milan] If the multiple streams are coming from a microphone array or
something, shouldn't that be mixed on the device?  And if they truly are
independent sources (eg conference call), then shouldn't that be
multiple RECOGNIZE requests?

 

2.       We'll add a "source-time:" header to indicate the point in the
input stream that the recognizer should start recognizing from.

 

For example:

 

   C->S: html-speech/1.0 ... START-AUDIO 41021

         Audio-codec: audio/dsr-es202212; rate:8000; maxptime:40

         source-time: 12753248231 (source's local time at the start of the first packet)

 

   C->S: binary audio packet #1

         ...

         ...

 

   S->C: html-speech/1.0 ... 41021 200 IN-PROGRESS (i.e. the service is accepting the audio)

 

   C->S: binary audio packet #2

         ...

         ...

 

   C->S: html-speech/1.0 ... RECOGNIZE 8322

         Channel-Identifier:32AECB23433801@speechrecog

         Confidence-Threshold:0.9

         Audio-sessions: 41021 (request-id of the input stream)

         Source-time: 12753432234 (where in the input stream recognition should start)

 

   S->C: html-speech/1.0 ... START-OF-INPUT 8322 IN-PROGRESS

         ...

         ...

 

   S->C: html-speech/1.0 ... RECOGNITION-COMPLETE 8322 COMPLETE

         ...

         ...

 

   C->S: binary audio packet #N: end of stream

         ...

         ...

 

   S->C: html-speech/1.0 ... 41021 200 COMPLETE (i.e. the service has received the end of stream)

         ...

         ...

 

 

[Milan] I'll soon be proposing an END-OF-INPUT, but I think that should
fit just fine.

 
Received on Tuesday, 21 June 2011 20:00:53 UTC