Using the RECOGNIZE method for a broader set of use cases

Following up on the issue of allowing a broader set of use cases
to be handled by the emerging control protocol (tasks other than
straight-up speech recognition, e.g. verification, prosody recognition,
and emotion recognition, all of which can be handled by shipping audio
to a speech resource and getting an EMMA result back).

The action item was to look into handling these using the recognizer
resource and the RECOGNIZE method.  I don't see any immediate problems,
assuming we are happy with using DEFINE-GRAMMAR to specify not just
grammars but arbitrary models used to derive some kind of
interpretation of the input.  We will already need DEFINE-GRAMMAR to
specify both SRGS grammars and SLMs, so it could also be used to point
to arbitrary models that conduct other kinds of processing.  Thinking
beyond EMMA to the JS result API, the result of this processing should
probably show up in the 'interpretation' field (see the sketch below).
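
To make the API side concrete, here is a minimal sketch of what
consuming such a result might look like in script. Apart from the
'interpretation' field, the names below are hypothetical illustrations,
not agreed API:

    // Minimal sketch (TypeScript). Only 'interpretation' comes from the
    // discussion above; the other names are assumptions for illustration.
    interface RecognitionResult {
      utterance?: string;       // transcript, present for plain ASR tasks
      interpretation: unknown;  // task-specific payload from the model
      emma?: string;            // raw EMMA document returned by the service
    }

    function handleResult(result: RecognitionResult): void {
      // For an emotion-recognition model loaded via DEFINE-GRAMMAR, the
      // interpretation might carry e.g. { emotion: "frustrated", score: 0.8 }.
      console.log(result.interpretation);
    }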

Michael
________________________________________
From: public-xg-htmlspeech-request@w3.org [public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan [Milan.Young@nuance.com]
Sent: Thursday, June 23, 2011 1:17 AM
To: Robert Brown; HTML Speech XG
Subject: RE: Notes from today's protocol call

For my part, I’ve updated the control portion of the protocol to cover the continuous speech scenario.  I’ve also made a few modifications based on recent discussions and updated the document to match Robert’s HTML format.  Please see the attached.

There are a couple areas of the protocol that are still TBD in my mind, but rather than let the perfect become the enemy of the good, I figured I’d open this up for discussion.



________________________________
From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Wednesday, June 22, 2011 5:24 PM
To: HTML Speech XG
Subject: RE: Notes from today's protocol call

I haven’t finished my work item (Redraft to incorporate everything we’ve discussed so far), but it’s in progress.

Here are some things I’ve noticed so far:


1.       Is message-length in the request line really necessary? Presumably its only value in MRCP is to provide message framing in what is otherwise just an open-ended character stream, which we get automatically in WebSockets (see the first sketch after this list). Ditto for the Content-Length header.

2.       It's not clear that we need Cancel-If-Queue for recognition. HTML apps won't have the same serialized dialog we see in IVR, so this may not be a meaningful header.

3.       Will the API have hotword functionality? If not, do we need the hotword headers?

4.       Does the reco portion of the protocol imply API requirements that haven’t been discussed yet? For example, START-INPUT-TIMERS is there for a good reason, but AFAIK the XG hasn’t discussed that scenario. Similarly, Early-No-Match seems useful. Is it?

5.       The TTS design has some IVR artifacts that don't make sense in HTML. In IVR, the synthesizer essentially renders directly to the user's telephone and is an active part of the user interface. In HTML, by contrast, the synthesizer is just a provider of audio to the UA.  The UA buffers the audio and controls playback independently of rendering (see the second sketch after this list). In light of this, the CONTROL method and the jump-size, Speak-Length, and kill-on-barge-in headers (and possibly others) don't really make sense.

6.       The TTS Speaker-Profile header will probably never be used, because HTML UAs will want to pass values inline, rather than store them in a separate URI-referenceable resource. Should we remove it?

7.       DEFINE-LEXICON and the Load-Lexicon header appear to be useful. Does the lexicon need to surface in the API, or is its presence in SSML enough? And if SSML is enough, why do we need the header? And why isn't there corresponding functionality for recognition?
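
On point 1, here is a rough sketch of why framing fields look redundant over WebSockets: every onmessage delivery is already a complete, delimited message, so a client can parse the start-line and headers without ever consulting a length field. The endpoint URL and message syntax are assumptions for illustration, not the draft's exact grammar:

    // Minimal sketch (TypeScript): WebSockets deliver whole messages,
    // so no message-length / Content-Length is needed for framing.
    const ws = new WebSocket("wss://speech.example.com/session"); // hypothetical endpoint

    ws.onmessage = (event: MessageEvent) => {
      const message = String(event.data);  // one complete protocol message
      const [startLine, ...headerLines] = message.split("\r\n");
      // The WebSocket frame itself established the message boundary;
      // the start-line and headers can be parsed directly.
      console.log("received:", startLine);
    };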
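And on point 5, a sketch of the HTML-side model that makes those headers redundant: the UA (or the page) buffers the synthesized audio and performs jumps and barge-in locally. Exposing playback through an <audio>-style object is an assumption for illustration, not anything the XG has agreed:

    // Minimal sketch (TypeScript): playback control is local to the UA,
    // so the protocol's CONTROL method and related headers go unused.
    const playback = new Audio("tts-output.wav"); // buffered TTS audio; hypothetical URL

    playback.play();            // start rendering from the local buffer
    playback.currentTime += 5;  // "jump" ahead 5s; no jump-size header needed

    function onBargeIn(): void {
      playback.pause();         // handle barge-in locally; no kill-on-barge-in header
    }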



From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Thursday, June 16, 2011 10:06 AM
To: HTML Speech XG
Subject: Notes from today's protocol call

Attendees:

-          Robert Brown

-          Milan Young

-          Patrick Ehlen

-          Michael Johnston

Topic: control portion of protocol based on MRCP subset (this thread: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)


-          Agreed that this subset is appropriate for ASR & TTS.

-          Unclear whether recording should be included.  Agreed to escalate this to the XG.  If there’s agreement that recording scenarios are common and valuable, we’ll include that portion in the protocol; otherwise we’ll omit it, since it’s still possible through more convoluted means.

o        We discussed this in the main call with the XG.  General agreement was that recording isn’t something we need to solve, and that it should be possible as a side-effect of recognition (i.e. <ruleref special="GARBAGE"/> and retain the audio).

-          While services are free to implement a subset of the protocol (e.g. only SR or only TTS portions), clients will need to implement the full set.

Topic: RTP

-          Agreed that it is unneeded, for the reasons stated in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html

-          The basic design approach provides an extensibility mechanism, so if new scenarios emerged in the future that required RTP or another protocol, they could be accommodated.  (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html)

Topic: SDP

-          Agreed that, like RTP, it is unneeded given the context we already have as a byproduct of our design approach.  (See also the last paragraph here: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

Topic: session initiation & media negotiation (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

-          People haven’t had a chance to review this.  Will discuss more over the coming week.

-          GET-PARAMS is resource-specific, so it works a little differently from what’s written here.  (Robert will need to re-think this and make another proposal.)

Next steps:

-          Continuous speech proposal (Milan)

-          Redraft to incorporate everything we’ve discussed so far (Robert)

-          Examine whether the recognition portion of the protocol can handle extended scenarios, like verification, etc. (Michael)

Received on Thursday, 23 June 2011 14:20:55 UTC