- From: Patrick Ehlen <pehlen@attinteractive.com>
- Date: Thu, 23 Jun 2011 08:07:47 -0700
- To: Michael Johnston <johnston@research.att.com>, "Young, Milan" <Milan.Young@nuance.com>, Robert Brown <Robert.Brown@microsoft.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
If we are to use RECOGNIZE for calling these other types of resources, maybe DEFINE-GRAMMAR should be generalized to SET-MODEL or SPECIFY-MODEL? DEFINE-GRAMMAR sounds too SRGS-specific to me.

On 6/23/11 7:20 AM, "Michael Johnston" <johnston@research.att.com> wrote:

> Following up on the issue of allowing a broader set of use cases to be
> handled using the emerging control protocol (tasks other than
> straight-up speech recognition, e.g. verification, prosody recognition,
> and emotion recognition, all of which can be handled by shipping audio
> to a speech resource and getting an EMMA result back).
>
> The action item was to look into handling these using the recognizer
> resource and the RECOGNIZE method. I don't see any immediate problems,
> assuming we are happy with using DEFINE-GRAMMAR to specify not just
> grammars but arbitrary models used to derive some kind of
> interpretation of the input. We already need DEFINE-GRAMMAR to specify
> both SRGS grammars and SLMs, so it could also be used to point to
> arbitrary models that perform other kinds of processing. Thinking
> beyond EMMA to the JS result API, the result of this processing should
> probably show up in the 'interpretation' field.
>
> Michael
> ________________________________________
> From: public-xg-htmlspeech-request@w3.org
> [public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan
> [Milan.Young@nuance.com]
> Sent: Thursday, June 23, 2011 1:17 AM
> To: Robert Brown; HTML Speech XG
> Subject: RE: Notes from today's protocol call
>
> For my part, I've updated the control portion of the protocol to cover
> the continuous speech scenario. I also made a few modifications based
> on recent discussions and updated the document to match Robert's HTML
> format. Please see the attached.
>
> There are a couple of areas of the protocol that are still TBD in my
> mind, but rather than let the perfect become the enemy of the good, I
> figured I'd open this up for discussion.
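For concreteness, an exchange along the lines Michael describes (an arbitrary model set up via a generalized DEFINE-GRAMMAR/SET-MODEL, with the result surfacing as an EMMA interpretation) might look something like this. The method names, header usage, and URIs below are purely illustrative, not agreed syntax:

```
C->S: SET-MODEL 1001
      Content-Type: text/uri-list

      https://example.com/models/emotion-classifier

C->S: RECOGNIZE 1002
      (audio streamed to the resource separately)

S->C: RECOGNITION-COMPLETE 1002 COMPLETE
      Content-Type: application/emma+xml

      <emma:emma version="1.0"
                 xmlns:emma="http://www.w3.org/2003/04/emma">
        <emma:interpretation id="int1" emma:confidence="0.82">
          <emotion>frustrated</emotion>
        </emma:interpretation>
      </emma:emma>
```

In the JS result API, the `<emotion>` payload would then appear in the 'interpretation' field, as suggested above.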
> ________________________________
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
> Sent: Wednesday, June 22, 2011 5:24 PM
> To: HTML Speech XG
> Subject: RE: Notes from today's protocol call
>
> I haven't finished my work item (redraft to incorporate everything
> we've discussed so far), but it's in progress.
>
> Here are some things I've noticed so far:
>
> 1. Is message-length in the request line really necessary? Presumably
> its only value in MRCP is to provide message framing in what is
> otherwise just an open-ended character stream, which we get
> automatically in WebSockets. Ditto for the Content-Length header.
>
> 2. It's not clear that we need Cancel-If-Queue for recognition. HTML
> apps won't have the same serialized dialog we see in IVR, so this may
> not be a meaningful header.
>
> 3. Will the API have hotword functionality? If not, do we need the
> hotword headers?
>
> 4. Does the reco portion of the protocol imply API requirements that
> haven't been discussed yet? For example, START-INPUT-TIMERS is there
> for a good reason, but AFAIK the XG hasn't spoken about that scenario.
> Similarly, Early-No-Match seems useful. Is it?
>
> 5. The TTS design has some IVR artifacts that don't make sense in
> HTML. In IVR, the synthesizer essentially renders directly to the
> user's telephone and is an active part of the user interface. In HTML,
> by contrast, the synthesizer is just a provider of audio to the UA.
> The UA buffers the audio and controls playback independent of
> rendering. In light of this, the CONTROL method, the Jump-Size header,
> the Speak-Length header, the Kill-On-Barge-In header (and possibly
> others) don't really make sense.
>
> 6. The TTS Speaker-Profile header will probably never be used, because
> HTML UAs will want to pass values inline rather than store them in a
> separate URI-referenceable resource. Should we remove it?
>
> 7.
DEFINE-LEXICON and the Load-Lexicon header appear to be useful. Does
> this functionality need to surface in the API, or is its presence in
> SSML enough? And if SSML is enough, why do we need the header? Also,
> why isn't there corresponding functionality for recognition?
>
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
> Sent: Thursday, June 16, 2011 10:06 AM
> To: HTML Speech XG
> Subject: Notes from today's protocol call
>
> Attendees:
> - Robert Brown
> - Milan Young
> - Patrick Ehlen
> - Michael Johnston
>
> Topic: control portion of protocol based on an MRCP subset (this
> thread:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)
>
> - Agreed that this subset is appropriate for ASR & TTS.
>
> - Unclear whether recording should be included. Agreed to escalate
> this to the XG. If there's agreement that recording scenarios are
> common and valuable, we'll include that portion in the protocol.
> Otherwise we'll omit it, since it's still possible through more
> convoluted means.
>
> o We discussed this in the main call with the XG. General agreement
> was that recording isn't something we need to solve, and that it
> should be possible as a side effect of recognition (i.e. <ruleref
> special="GARBAGE"> and retain the audio).
>
> - While services are free to implement a subset of the protocol (e.g.
> only the SR or only the TTS portions), clients will need to implement
> the full set.
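The recording-as-a-side-effect idea from the notes (match everything via the special GARBAGE rule and retain the audio) could be expressed with an SRGS grammar along these lines. This is a sketch of one possible "record-only" grammar, not anything the group has specified:

```
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical "record-only" grammar: matches arbitrary speech via
     the special GARBAGE rule, so the service performs no meaningful
     recognition but can retain the captured audio. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="record">
  <rule id="record" scope="public">
    <ruleref special="GARBAGE"/>
  </rule>
</grammar>
```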
> Topic: RTP
>
> - Agreed that it is unneeded, for the reasons stated here:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html
>
> - The basic design approach provides an extensibility mechanism, so
> that if new scenarios emerged in the future that required RTP or
> another protocol, they could be accommodated:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html
>
> Topic: SDP
>
> - Agreed, as with RTP, that it is unneeded given the context we
> already have as a byproduct of our design approach. (See also the last
> paragraph here:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)
>
> Topic: session initiation & media negotiation
> (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)
>
> - People haven't had a chance to review this. Will discuss more over
> the coming week.
>
> - GET-PARAMS is resource-specific, so it works a little differently
> from what's written here. (Robert will need to re-think this and make
> another proposal.)
>
> Next steps:
>
> - Continuous speech proposal (Milan)
>
> - Redraft to incorporate everything we've discussed so far (Robert)
>
> - Examine whether the recognition portion of the protocol can handle
> extended scenarios, like verification, etc. (Michael)
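Robert's first point above, that WebSocket frames already delimit messages, can be illustrated with a small parser sketch: if each control message arrives in its own text frame, the receiver can recover the start line, headers, and body with no message-length field or Content-Length header at all. The start-line syntax below is illustrative only, not agreed protocol syntax:

```python
def parse_control_message(frame: str):
    """Split one WebSocket text frame into (start_line, headers, body).

    Framing is supplied by the WebSocket layer itself, so unlike MRCP
    there is no need for a message-length field or Content-Length
    header to locate the end of the message.
    """
    head, _, body = frame.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body

# Example: a hypothetical RECOGNIZE request carried in a single frame.
frame = ("RECOGNIZE 8322\r\n"
         "Content-Type: application/srgs+xml\r\n"
         "\r\n"
         "<grammar/>")
start_line, headers, body = parse_control_message(frame)
print(start_line)   # RECOGNIZE 8322
print(body)         # <grammar/>
```

The same argument applies to responses and events: each is one frame, so the receiver never has to scan ahead for a length prefix.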
Received on Thursday, 23 June 2011 15:08:20 UTC