RE: Notes from today's protocol call from Young, Milan on 2011-06-23 (public-xg-htmlspeech@w3.org from June 2011)

From: Young, Milan <Milan.Young@nuance.com>
Date: Wed, 22 Jun 2011 22:17:30 -0700
To: Robert Brown <Robert.Brown@microsoft.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0BB88CC0@SUN-EXCH01.nuance.com>
For my part, I've updated the control portion of the protocol to cover
the continuous speech scenario.  Also made a few modifications based on
recent discussions and updated the document to match Robert's HTML
format.  Please see the attached.

 

There are a couple areas of the protocol that are still TBD in my mind,
but rather than let the perfect become the enemy of the good, I figured
I'd open this up for discussion.

 

 

 

________________________________

From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Wednesday, June 22, 2011 5:24 PM
To: HTML Speech XG
Subject: RE: Notes from today's protocol call

 

I haven't finished my work item (Redraft to incorporate everything we've
discussed so far) but it's in progress.

 

Here are some things I've noticed so far:

 

1.       Is message-length in the request line really necessary?
Presumably its only value in MRCP is to provide message framing in what
is otherwise just an open-ended character stream, which we get
automatically in WebSockets. Ditto for the Content-Length header.

2.       It's not clear that we need Cancel-If-Queue for recognition.
HTML apps won't have the same serialized dialog we see in IVR, so this
may not be a meaningful header.

3.       Will the API have hotword functionality? If not, do we need the
hotword headers?

4.       Does the reco portion of the protocol imply API requirements
that haven't been discussed yet? For example, START-INPUT-TIMERS is
there for a good reason, but AFAIK the XG hasn't spoken about the
scenario. Similarly, Early-No-Match seems useful. Is it?

5.       The TTS design has some IVR artifacts that don't make sense in
HTML. In IVR, the synthesizer essentially renders directly to the user's
telephone, and is an active part of the user interface. Whereas in HTML,
the synthesizer is just a provider of audio to the UA.  The UA buffers
the audio and controls playback independently of rendering. In light of
this, the CONTROL method, jump-size header, Speak-Length header,
kill-on-barge-in header (and possibly others) don't really make sense.

6.       The TTS Speaker-Profile header will probably never be used,
because HTML UAs will want to pass values inline, rather than store them
in a separate URI-referenceable resource. Should we remove it?

7.       DEFINE-LEXICON and the Load-Lexicon header appear to be useful.
Do they need to surface in the API, or is the lexicon support already in
SSML enough? If it is enough, why do we need the header? And why isn't
there corresponding functionality for recognition?
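
To illustrate point 1, here is a minimal Python sketch, not taken from
the draft. It assumes the MRCPv2 start line layout (version,
message-length, method, request-id) for the stream case, and a
hypothetical start line with the length simply dropped for the
WebSocket case. The function names are made up for this example.

```python
from typing import Dict, List, Tuple

CRLF = "\r\n"


def split_stream(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Frame MRCPv2-style messages on a raw byte stream.

    Relies on the message-length field (second token of the start line),
    which counts every byte of the message including the start line.
    Returns the complete messages found plus any unconsumed remainder.
    """
    messages: List[bytes] = []
    while True:
        line_end = buffer.find(b"\r\n")
        if line_end < 0:
            break  # start line not fully received yet
        length = int(buffer[:line_end].split(b" ")[1])
        if len(buffer) < length:
            break  # message not fully received yet
        messages.append(buffer[:length])
        buffer = buffer[length:]
    return messages, buffer


def parse_frame(frame: str) -> Tuple[str, str, Dict[str, str], str]:
    """Parse one message delivered as a single WebSocket text frame.

    The transport already marks where the message ends, so this assumed
    start line carries no message-length, and the body is simply
    everything after the blank line (no Content-Length needed).
    """
    head, _, body = frame.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    _version, method, request_id = lines[0].split(" ")
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, request_id, headers, body
```

With a raw stream the receiver cannot tell where a message ends without
the length field; with one message per WebSocket frame, that bookkeeping
goes away.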

 

 

 

From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Thursday, June 16, 2011 10:06 AM
To: HTML Speech XG
Subject: Notes from today's protocol call

 

Attendees:

-          Robert Brown

-          Milan Young

-          Patrick Ehlen

-          Michael Johnston

 

Topic: control portion of protocol based on MRCP subset (this thread:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)

 

-          Agreed that this subset is appropriate for ASR & TTS.

-          Unclear whether recording should be included.  Agreed to
escalate this to the XG.  IF there's agreement that recording scenarios
are common and valuable, we'll include that portion in the protocol.
OTHERWISE we'll omit it, since it's still possible through more
convoluted means.

o        We discussed this in the main call with the XG.  General
agreement was that recording isn't something we need to solve, and that
it should be possible as a side-effect of recognition (i.e. <ruleref
special="GARBAGE"> and retain the audio).

-          While services are free to implement a subset of the protocol
(e.g. only SR or only TTS portions), clients will need to implement the
full set.
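
As a rough illustration of the recording-via-recognition idea noted
above, here is a minimal Python sketch that builds an SRGS grammar whose
only rule is the special GARBAGE rule, so any utterance is accepted. The
helper names and the rule id are made up for this example. How the
service retains the audio is not shown; MRCP's Save-Waveform header
would be one candidate, but whether it belongs in our subset is an
assumption.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

SRGS_NS = "http://www.w3.org/2001/06/grammar"


def garbage_grammar(rule_id: str = "recording", lang: str = "en-US") -> str:
    """Return an SRGS XML grammar that matches arbitrary speech."""
    return (
        f'<grammar xmlns="{SRGS_NS}" version="1.0" mode="voice" '
        f'xml:lang="{lang}" root="{rule_id}">\n'
        f'  <rule id="{rule_id}" scope="public">\n'
        f'    <ruleref special="GARBAGE"/>\n'
        f'  </rule>\n'
        f'</grammar>\n'
    )


def special_rulerefs(grammar_xml: str) -> List[Optional[str]]:
    """List the 'special' attribute of every ruleref in a grammar."""
    root = ET.fromstring(grammar_xml)
    return [ref.get("special") for ref in root.iter(f"{{{SRGS_NS}}}ruleref")]
```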

 

Topic: RTP

-          Agreed that it is unneeded, for the reasons stated here:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html

-          The basic design approach provides an extensibility mechanism,
so that if new scenarios emerge in the future that require RTP or
another protocol, they can be accommodated. (See
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html)

 

Topic: SDP

-          Agreed, like RTP, that it is unneeded given the context we
already have as a byproduct of our design approach.  (See also the last
paragraph here:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

 

Topic: session initiation & media negotiation
(http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

-          People haven't had a chance to review this.  Will discuss
more over the coming week.

-          GET-PARAMS is resource specific, so works a little
differently to what's written here.  (Robert will need to re-think and
make another proposal)

 

Next steps:

-          Continuous speech proposal (Milan)

-          Redraft to incorporate everything we've discussed so far
(Robert)

-          Examine whether the recognition portion of the protocol can
handle extended scenarios, like verification, etc. (Michael)
Attachments

text/html attachment: control-protocol.html
Received on Thursday, 23 June 2011 05:19:07 UTC