RE: Control portion of SS protocol from Young, Milan on 2011-06-15 (public-xg-htmlspeech@w3.org from June 2011)

From: Young, Milan <Milan.Young@nuance.com>
Date: Wed, 15 Jun 2011 12:38:21 -0700
To: Robert Brown <Robert.Brown@microsoft.com>, "Satish Sampath (Google)" <satish@google.com>, <gshires@google.com>, "Marc Schroeder (DFKI)" <marc.schroeder@dfki.de>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "JOHNSTON, MICHAEL J (MICHAEL J)" <johnston@research.att.com>
CC: "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0B9B1965@SUN-EXCH01.nuance.com>
Yes, something like <builtin:record> should work as well.  There might
be some nuances around start-of-speech endpointing, but I suppose we
could always add parameters.  If that's how you want to handle it at the
protocol level, I don't have any objections.

 

But I do think recording is a relatively common task in many speech
applications.  So I was hoping that we could get the attention of the
API folks to see if this could be made into a first-class operation.

 

Thanks

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Wednesday, June 15, 2011 12:06 PM
To: Young, Milan; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: RE: Control portion of SS protocol

 

The endpointing argument is strong.  But why not just use the RECOGNIZE
method with the save-waveform header, and a grammar with <ruleref
special="GARBAGE">?
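
For illustration only, such a request might be assembled roughly as in
the Python sketch below.  Save-Waveform is the MRCP v2 header; the
start-line layout (borrowed from MRCP v2, with an 'html-speech/1.0'
version token), the channel name, and the rule name 'anything' are
assumptions rather than agreed protocol.

```python
# SRGS grammar whose only rule matches any speech via the special GARBAGE rule.
GARBAGE_GRAMMAR = (
    '<?xml version="1.0"?>\n'
    '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"\n'
    '         xml:lang="en-US" root="anything">\n'
    '  <rule id="anything"><ruleref special="GARBAGE"/></rule>\n'
    '</grammar>\n'
).encode("utf-8")


def build_recognize(request_id, channel):
    """Build a RECOGNIZE request that keeps the audio (assumed wire layout)."""
    headers = [
        ("Channel-Identifier", channel),
        ("Save-Waveform", "true"),
        ("Content-Type", "application/srgs+xml"),
        ("Content-Length", str(len(GARBAGE_GRAMMAR))),
    ]
    head = "".join(f"{name}: {value}\r\n" for name, value in headers)
    tail = f" RECOGNIZE {request_id}\r\n{head}\r\n".encode("ascii") + GARBAGE_GRAMMAR
    prefix = b"html-speech/1.0 "
    # The message-length field counts its own digits, so iterate to a fixed point.
    length = len(prefix) + len(tail)
    while len(prefix) + len(str(length)) + len(tail) != length:
        length = len(prefix) + len(str(length)) + len(tail)
    return prefix + str(length).encode("ascii") + tail


# Hypothetical channel identifier, in the style of the NOTIFY example.
message = build_recognize(1, "817@recognizer")
```

Whether the service returns the waveform inline or by reference is left
open here, as it is in the thread.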

 

From: Young, Milan [mailto:Milan.Young@nuance.com] 
Sent: Wednesday, June 15, 2011 11:11 AM
To: Robert Brown; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: RE: Control portion of SS protocol

 

MMI is a scenario where the service (aka Interaction Manager) might want
to send an event outside the context of a request.  For example, the
service might report that the user has just entered text data, rotated
the phone, uploaded a photo, etc.  As I pointed out earlier, these
notifications do not NEED to come through the protocol channel, but it
may be a convenient transport.  My thought was that as long as we were
opening up a NOTIFY scheme, why limit the information to the context of
an ongoing request?

 

Regarding RECORD, perhaps we could start the discussion by commenting on
the validity of my reasons in the last mail:

*         Consistent use of server-based endpointing and channel
adaptation.

*         Shares the headers with the other control messages (eg
timeouts, cookies, and channel-identifier).

*         Same network paths

 

Thanks

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Wednesday, June 15, 2011 10:28 AM
To: Young, Milan; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: RE: Control portion of SS protocol

 

Actually, I really was thinking that NOTIFY would only be in response to
something in progress.  But only because I couldn't think of use cases
to the contrary.  Are there any?  For subscription, if NOTIFY could be
sent at any time, then SET-PARAMS makes sense, whereas if it can only be
sent in response to an in-progress request, then a header in that
request makes sense.

 

Sorry, I meant to comment on RECORD in my first reply.  Is there a
strong case for it?  At some stage in the next couple of years the
community will presumably converge on a reasonable microphone API that
will enable recording.

 

 

From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan
Sent: Tuesday, June 14, 2011 6:01 PM
To: Robert Brown; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: RE: Control portion of SS protocol

 

Glad to hear that we are converging.  Follow-up comments:

 

*         Regarding cookies, I thought we might use the MRCP headers to
at least transport information about the URL the page is executing
within.  Perhaps I've misunderstood, but giving that information to the
SS doesn't seem like a security breach.  Of course if Michael can figure
out a way to push all the cookies, then that's even better.

*         Regarding NOTIFY, my intention was that the server could send
this event at any time while the session is live.  It wouldn't need to
wait for a client request to be "in-progress".  Maybe you already
understood that, but your use of "in-progress" made me unsure.

*         I was thinking that it would be convenient to select the set
of NOTIFYs at runtime (eg SET-PARAMS) rather than always at session
startup.  In my proposal, the "a=resource:notify" was only an instruction
that the webapp was capable of dealing with the general concept of a
NOTIFY rather than a particular class.  But I suppose that if we can
agree that the browser never filters NOTIFYs we can have it both ways.

*         Curious to know your thoughts on RECORD.

 

Thanks

 

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Tuesday, June 14, 2011 4:33 PM
To: Young, Milan; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: RE: Control portion of SS protocol

 

Thanks Milan, this is a nice tight list.

 

A couple of minor tweaks to make the method list consistent with
MRCP2-24:   (I assume 24 is the latest version?)

-          GET/SET-PARAMS are now listed as generic methods.

-          RECOGNITION-START-TIMERS has been re-named START-INPUT-TIMERS

 

I agree on the response & event list.  In addition, reco results would
default to EMMA rather than NLSML.

 

I generally agree on using the same list of headers.  When you said
"except verification" I assume you mean those unique headers listed
under the speaker verification feature?  The other thing I think we
should remove is the cookie headers.  I recall we had a discussion on
cookies at the F2F, and a number of us felt that it was inappropriate to
give the service transitive use of the UA's cookies, and brainstormed an
alternative mechanism.  Michael Bodell volunteered to make a proposal.

 

I like the NOTIFY event.  Services could send it while processing any
in-progress request.  We may want to introduce a mechanism for clients
to only subscribe to certain events.  For example, all the Microsoft TTS
engines can produce viseme events (e.g.
http://dict.bing.com.cn/#%3Ahome, and click on the orange TV icon), but
most apps wouldn't want to receive them.  This may be as simple as
introducing a "subscribe" header that lists the custom events you want
to receive.
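
A minimal sketch of that idea in Python, assuming a hypothetical
'Subscribe' header carrying a comma-separated list of event names
(neither the header syntax nor the event names below are defined
anywhere yet):

```python
def parse_subscribe(value):
    """Return the set of event names listed in a hypothetical Subscribe header."""
    return {name.strip().lower() for name in value.split(",") if name.strip()}


def should_deliver(event_name, subscribed):
    """The service sends a custom event only if the client asked for it."""
    return event_name.lower() in subscribed


# Hypothetical header value from a client that wants viseme events only.
subscribed = parse_subscribe("viseme")
wanted = [e for e in ("viseme", "word-boundary") if should_deliver(e, subscribed)]
```

With an empty or absent header the set is empty, so no custom events
would be delivered by default.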

 

 

 

 

From: Young, Milan [mailto:Milan.Young@nuance.com] 
Sent: Friday, June 10, 2011 11:46 AM
To: Robert Brown; Satish Sampath (Google); gshires@google.com; Marc
Schroeder (DFKI); Patrick Ehlen (AT&T); JOHNSTON, MICHAEL J (MICHAEL J)
Cc: Dan Burnett (Voxeo); HTML Speech XG
Subject: Control portion of SS protocol

 

Robert's draft referenced a few placeholder control methods and headers
that were "inspired from MRCP".  This is a start at making these
sections more concrete.

 

One notable omission is handling of continuous recognition results and
corrections.  I will follow up on this section early next week.

 

 

---------------------------

 

 

Client Requests

For the contents of 'recognition-method', I suggest we use the following
as defined by MRCP v2:

SET-PARAMS

GET-PARAMS

DEFINE-GRAMMAR

RECOGNIZE

RECOGNITION-START-TIMERS

STOP

INTERPRET

 

... and for 'synthesizer-method':

            SET-PARAMS

            GET-PARAMS

            SPEAK

            STOP

            PAUSE

            RESUME

            BARGE-IN-OCCURRED

            CONTROL

            DEFINE-LEXICON

 

I suggest we also add a recorder resource (this probably needs
discussion in the API group).  Although there are other ways to pass
recorded audio from client to server, doing it within the protocol has
some nice advantages:

*         Consistent use of server-based endpointing and channel
adaptation.

*         Shares the headers with the other control messages (eg
timeouts, cookies, and channel-identifier).

*         Same network paths

 

'recorder-method' would be defined as per MRCP v2 using the following
methods:

            RECORD

            STOP

            START-INPUT-TIMERS
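
The three method lists above amount to a lookup table.  A small Python
sketch, assuming resource names of 'recognizer', 'synthesizer' and
'recorder' (those names are placeholders; the method names are copied
from the lists above as written):

```python
# Methods per resource, exactly as listed in this message.
METHODS = {
    "recognizer": {
        "SET-PARAMS", "GET-PARAMS", "DEFINE-GRAMMAR", "RECOGNIZE",
        "RECOGNITION-START-TIMERS", "STOP", "INTERPRET",
    },
    "synthesizer": {
        "SET-PARAMS", "GET-PARAMS", "SPEAK", "STOP", "PAUSE", "RESUME",
        "BARGE-IN-OCCURRED", "CONTROL", "DEFINE-LEXICON",
    },
    "recorder": {"RECORD", "STOP", "START-INPUT-TIMERS"},
}


def is_allowed(resource, method):
    """True if the method is defined for the given (placeholder) resource name."""
    return method in METHODS.get(resource, set())
```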

 

 

 

Server Responses

Server request state should be exactly as defined by MRCP v2:

            COMPLETE

            IN-PROGRESS

            PENDING

 

 

For 'recognizer-event', I suggest we use the following as defined by
MRCP:

            START-OF-INPUT

            RECOGNITION-COMPLETE

            INTERPRETATION-COMPLETE

            

... and for 'synthesizer-event'

            SPEECH-MARKER

            SPEAK-COMPLETE

 

...and for 'recorder-event'

            START-OF-INPUT

            RECORD-COMPLETE

 

 

 

Headers

I suggest that we use all the headers defined by MRCP v2 except those
that are specific to verification.  Specifically, this means:

  * Generic (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-6.2).

  * Synthesizer (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-8.4)

  * Recognizer (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-9.4)

  * Recorder (see
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-24#section-10.4)

 

Appropriate use of these headers is as defined by the MRCP v2 spec in the
context of a specific method or response referenced by this
specification.

 

 

 

Server Notifications

Within MRCP v2, the server may only send a message in response to a
client-driven request.  Client polling via GET-PARAMS is the only option
for "pushing" a message from the server to the client.

 

It's unclear whether server push through the HTML Speech protocol and
API is required functionality.  These messages could, for example, be
accomplished outside the specification using a separate WebSocket
connection.  On the other hand, frameworks like MMI hinge on the ability
for the server to proactively send state updates to the client.

 

If this is found to be convenient, then we may choose to extend our list
of 'event-names' with a 'notification-event'.  This new event would use
a status code of '200', and a request state of 'NOTIFY'.  The value of
the 'Channel-Identifier' header would use a new resource type called
'notification'.   For example:

 

html-speech/1.0 92 323340 200 NOTIFY
Channel-Identifier: 817@notification
Content-Length: 36
Content-Type: text/xml

<?xml version="1.0"?>
<foo>bar</foo>
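
To make the framing concrete, here is a rough Python sketch of how a
client might split such a message into its parts.  It assumes MRCP v2
style CRLF line endings for the start line and headers; the Notification
class and parse_notification function are hypothetical names.  The
message-length value 92 is copied from the example and is not checked.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    version: str
    message_length: int
    request_id: str
    status_code: int
    request_state: str
    headers: dict
    body: bytes


def parse_notification(raw):
    """Split one message into start-line fields, headers, and body."""
    # The first blank line separates the head from the body.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    version, length, request_id, status, state = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Content-Length bounds the body; anything after it is not part of this message.
    body = rest[: int(headers.get("content-length", "0"))]
    return Notification(version, int(length), request_id, int(status), state, headers, body)


# The example above, with a single LF inside the 36-byte body.
body = b'<?xml version="1.0"?>\n<foo>bar</foo>'
head = [
    b"html-speech/1.0 92 323340 200 NOTIFY",
    b"Channel-Identifier: 817@notification",
    b"Content-Length: 36",
    b"Content-Type: text/xml",
]
msg = parse_notification(b"\r\n".join(head) + b"\r\n\r\n" + body)
```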

 

A couple notes:

  * If the [body] was detected as being XML or JSON, it would be nice if
the client browser could automatically reflect the data as a DOM or
ECMAScript object.  But I don't know much about that sort of technology,
so would need someone else to comment.

  * The client would request notifications using the SDP-like setup
protocol that Robert is working on.  Something like
'a=resource:notification'.

  * The client browser would not interpret any headers in the
notification other than those required to parse the message (ie
'Content-Length', 'Content-Type', and 'Content-Encoding').

  * The request-id, Channel-Identifier, and other headers would be
bundled up along with the body and handed to the webapp.  It would be up
to the application to decide the meaning of such headers in the context
of the notification.
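
The last two notes could be sketched in Python as follows.  This is only
an illustration: the event shape handed to the webapp is hypothetical,
header names are assumed to be lower-cased already, Content-Encoding is
assumed to be absent, and an ElementTree element stands in for a DOM.

```python
import json
import xml.etree.ElementTree as ET


def to_webapp_event(request_id, headers, body):
    """Bundle a notification for the webapp without interpreting its headers."""
    # Content-Type is consulted only to decide how to reflect the body.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type in ("text/xml", "application/xml"):
        data = ET.fromstring(body)  # stand-in for a DOM
    elif content_type == "application/json":
        data = json.loads(body.decode("utf-8"))
    else:
        data = body  # unknown type: hand over the raw bytes
    # Everything else is passed through untouched for the application to interpret.
    return {"request_id": request_id, "headers": dict(headers), "data": data}


event = to_webapp_event(
    "323340",
    {"channel-identifier": "817@notification", "content-type": "text/xml"},
    b'<?xml version="1.0"?>\n<foo>bar</foo>',
)
```
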
Received on Wednesday, 15 June 2011 19:39:37 UTC