RE: Notes from today's protocol call

I haven't finished my work item (Redraft to incorporate everything we've discussed so far) but it's in progress.

Here are some things I've noticed so far:


1.       Is message-length in the request line really necessary? Presumably its only value in MRCP is to provide message framing in what is otherwise just an open-ended character stream, which we get automatically in WebSockets. Ditto for the Content-Length header.

2.       It's not clear that we need Cancel-If-Queue for recognition. HTML apps won't have the same serialized dialog we see in IVR, so this may not be a meaningful header.

3.       Will the API have hotword functionality? If not, do we need the hotword headers?

4.       Does the reco portion of the protocol imply API requirements that haven't been discussed yet? For example, START-INPUT-TIMERS is there for a good reason, but AFAIK the XG hasn't discussed the scenario. Similarly, Early-No-Match seems useful. Is it?

5.       The TTS design has some IVR artifacts that don't make sense in HTML. In IVR, the synthesizer essentially renders directly to the user's telephone, and is an active part of the user interface. Whereas in HTML, the synthesizer is just a provider of audio to the UA. The UA buffers the audio and controls playback independently of rendering. In light of this, the CONTROL method, jump-size header, Speak-Length header, kill-on-barge-in header (and possibly others) don't really make sense.

6.       The TTS Speaker-Profile header will probably never be used, because HTML UAs will want to pass values inline, rather than store them in a separate URI-referenceable resource. Should we remove it?

7.       DEFINE-LEXICON and the Load-Lexicon header appear to be useful. Do they need to surface in the API, or is their presence in SSML enough? And if SSML is enough, why do we need the header? And why isn't there corresponding functionality for recognition?
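To illustrate point 1: because WebSockets delivers each message as a complete frame, a receiver can parse an MRCP-style message without any length field. Here's a minimal sketch (hypothetical message syntax, not the draft's actual wire format) of splitting one text frame into start-line, headers, and body purely from the frame boundary:

```python
# Hypothetical sketch: parsing an MRCP-style message carried in a single
# WebSocket text frame. The frame boundary tells us where the message
# ends, so neither a message-length in the start-line nor a
# Content-Length header is needed for framing.

def parse_message(frame: str):
    """Split one WebSocket frame into (start_line, headers, body)."""
    # The blank line separates the head (start-line + headers) from the body.
    head, _, body = frame.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body

# Example frame with no length fields at all:
frame = (
    "MRCP/2.0 RECOGNIZE 1\r\n"
    "Content-Type: application/srgs+xml\r\n"
    "\r\n"
    "<grammar>...</grammar>"
)
start, hdrs, body = parse_message(frame)
```

The body runs to the end of the frame, so the parser never needs to know its length in advance.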


From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Thursday, June 16, 2011 10:06 AM
To: HTML Speech XG
Subject: Notes from today's protocol call

Attendees:

-          Robert Brown

-          Milan Young

-          Patrick Ehlen

-          Michael Johnston

Topic: control portion of protocol based on MRCP subset (this thread http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)


-          Agreed that this subset is appropriate for ASR & TTS.

-          Unclear whether recording should be included.  Agreed to escalate this to the XG.  If there's agreement that recording scenarios are common and valuable, we'll include that portion in the protocol.  Otherwise we'll omit it, since it's still possible through more convoluted means.

o   We discussed this in the main call with the XG.  General agreement was that recording isn't something we need to solve, and that it should be possible as a side-effect of recognition (i.e. <ruleref special="GARBAGE"> and retain the audio).

-          While services are free to implement a subset of the protocol (e.g. only SR or only TTS portions), clients will need to implement the full set.
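As a sketch of the recording-as-a-side-effect idea above: an SRGS grammar whose root rule is just GARBAGE would match arbitrary speech, so recognition completes regardless of what's said and the retained audio serves as the "recording". (This assumes the service exposes the captured audio somehow, e.g. via a waveform URI; that mechanism wasn't settled on the call.)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" root="record">
  <rule id="record" scope="public">
    <!-- GARBAGE matches any speech, so recognition succeeds
         no matter what the user says; the retained audio is
         the de facto recording -->
    <ruleref special="GARBAGE"/>
  </rule>
</grammar>
```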

Topic: RTP

-          Agreed that it is unneeded, for the reasons stated http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html

-          The basic design approach provides an extensibility mechanism, so if new scenarios emerge in the future that require RTP or another protocol, they can be accommodated.  (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html)

Topic: SDP

-          Agreed, like RTP, that it is unneeded given the context we already have as a byproduct of our design approach.  (See also the last paragraph here: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

Topic: session initiation & media negotiation (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

-          People haven't had a chance to review this.  Will discuss more over the coming week.

-          GET-PARAMS is resource-specific, so it works a little differently from what's written here.  (Robert will need to re-think and make another proposal.)

Next steps:

-          Continuous speech proposal (Milan)

-          Redraft to incorporate everything we've discussed so far (Robert)

-          Examine whether the recognition portion of the protocol can handle extended scenarios, like verification, etc. (Michael)

Received on Thursday, 23 June 2011 00:24:11 UTC