Re: Notes from today's protocol call from JOHNSTON, MICHAEL J (MICHAEL J) on 2011-06-29 (public-xg-htmlspeech@w3.org from June 2011)

From: JOHNSTON, MICHAEL J (MICHAEL J) <johnston@research.att.com>
Date: Wed, 29 Jun 2011 19:23:22 -0400
To: "Young, Milan" <Milan.Young@nuance.com>
CC: Robert Brown <Robert.Brown@microsoft.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
Message-ID: <C29281C0-B240-4D65-B113-A0C9B5F86FDC@research.att.com>
On Jun 27, 2011, at 1:07 PM, Young, Milan wrote:

> Hello Michael,
> 
> Regarding lattice: I'd like to keep the lattice example in place to
> reinforce the notion that it's an option.  But putting alternate
> representations side-by-side seems like a good idea because I agree
> lattice results are less common.
> 

Glad to hear you like emma:lattice.  I have no problem having it there,
but we really shouldn't suggest that this is the 'normal' form for the
speech result; it really is a special case. Also, we should make sure
that any examples given are true EMMA documents that will validate, not
fragments, i.e. we need the root element <emma:emma>, appropriate
namespace declarations, etc.
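
As a rough illustration of the difference between a full document and a
fragment, here is a minimal Python sketch. It only checks that the input
is namespace-well-formed and rooted at <emma:emma>; it does not validate
against the EMMA schema. The function name summarize_emma and the SAMPLE
constant are made up for this example.

```python
import xml.etree.ElementTree as ET

# EMMA 1.0 namespace, as declared in the examples in this thread.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# A complete document: root element plus namespace declarations.
SAMPLE = """\
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:tokens="he bought a damn nice five dollar watch in new york"
      emma:confidence="0.9">
    <emma:literal>He bought a d*** nice $5.00 watch in New York.</emma:literal>
  </emma:interpretation>
</emma:emma>
"""


def summarize_emma(doc):
    """Return tokens, confidence and literal text of the first interpretation.

    Raises ET.ParseError for input that is not namespace-well-formed
    (e.g. a fragment using the emma: prefix without declaring it) and
    ValueError if the root element is not emma:emma.
    """
    root = ET.fromstring(doc)
    if root.tag != f"{{{EMMA_NS}}}emma":
        raise ValueError("root element is not emma:emma")

    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        raise ValueError("no emma:interpretation child found")

    confidence = interp.get(f"{{{EMMA_NS}}}confidence")
    literal = interp.find(f"{{{EMMA_NS}}}literal")
    return {
        "tokens": interp.get(f"{{{EMMA_NS}}}tokens"),
        "confidence": float(confidence) if confidence is not None else None,
        "literal": literal.text if literal is not None else None,
    }


if __name__ == "__main__":
    print(summarize_emma(SAMPLE))
```

A bare <emma:interpretation> fragment with no xmlns:emma declaration is
rejected by the parser before the root check is even reached, which is
the practical reason for wanting complete documents in the examples.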

> Regarding alignment: I was disappointed that EMMA didn't provide more
> structure in this area.  But I suppose that our group isn't a good fit
> for solving that problem.  In any case, I'm open to your recommendations
> on how to proceed.
> 

We'd certainly welcome feedback and proposals on this to the EMMA
subgroup, and if it came soon enough and received support this could
feed into the current work on EMMA 1.1.


> Regarding SET-PARAMS vs. header: Perhaps there is some confusion here.
> SET-PARAMS is an MRCP method that sets values based on their header
> definitions.  In other words, if I wanted to set User-ID, I would issue
> a SET-PARAMS method call and supply the User-ID header with the value I
> desired.  More commonly, I could also simply include the User-ID header
> on just those methods where I thought it was needed.
> 

Got it, in either case the parameters are always header definitions.
As long as we have a way to transmit application- and vendor-specific
header definitions along with those that are pre-ordained, no problem.
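
To make the two options concrete, here is a small Python sketch of the
same header travelling either on a SET-PARAMS request or directly on the
method that needs it. The start-line layout ("METHOD request-id") is a
simplification assumed for this example and is not the draft's actual
request line; X-Example-Vendor-Param is a made-up vendor header, and
build_request / parse_headers are hypothetical helper names.

```python
CRLF = "\r\n"


def build_request(method, request_id, headers):
    """Serialize a request as a start line, header lines, and a blank line.

    The start line here is an assumed, simplified layout for illustration.
    """
    start_line = f"{method} {request_id}"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    return CRLF.join([start_line, *header_lines]) + CRLF + CRLF


def parse_headers(message):
    """Return the header fields of a message built by build_request."""
    head = message.split(CRLF + CRLF, 1)[0]
    lines = head.split(CRLF)[1:]  # skip the start line
    return dict(line.split(": ", 1) for line in lines)


if __name__ == "__main__":
    # Option 1: set the value once for the session via SET-PARAMS.
    set_params = build_request("SET-PARAMS", 1, {"User-ID": "user-42"})

    # Option 2: attach the header only to the request that needs it,
    # alongside a hypothetical vendor-specific header.
    recognize = build_request(
        "RECOGNIZE",
        2,
        {"User-ID": "user-42", "X-Example-Vendor-Param": "fast"},
    )

    print(set_params)
    print(parse_headers(recognize))
```

Either way the receiver sees the same header definition, so standardized
and vendor-specific parameters can share one parsing path.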


> Thanks
> 
> 
> -----Original Message-----
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of JOHNSTON,
> MICHAEL J (MICHAEL J)
> Sent: Thursday, June 23, 2011 7:21 AM
> To: Young, Milan; Robert Brown; HTML Speech XG
> Subject: RE: Notes from today's protocol call
> 
> Looks good Milan, few questions:
> 
> 1. The EMMA example:
> 
> Is there a particular reason you are using <emma:lattice> in the
> example? 
> We haven't done anything to prevent lattices being returned, presumably
> one sets a parameter (SET-PARAMS) requesting that the EMMA output
> be a lattice.  I don't think it would be the default to return though.
> For the example at
> the end of control-protocol.html maybe something more like the
> following.
> 
> <emma:emma version="1.0"
>   	xmlns:emma="http://www.w3.org/2003/04/emma"
>   	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   	xsi:schemaLocation="http://www.w3.org/2003/04/emma
>   	 http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>   	xmlns="http://www.example.com/example">
> <emma:interpretation 
> 	id="interp1" 
> 	emma:medium="acoustic" 
> 	emma:mode="voice"
> 	emma:tokens="he bought a damn nice five dollar watch in new york"
>       emma:confidence="0.9">
>       <text>He bought a d*** nice $5.00 watch in New York.</text>
>       <alignment>Content undefined (i.e. vendor specific)</alignment>
> </emma:interpretation>
> </emma:emma>
> 
> The raw string is in emma:tokens and the formatted version
> is inside emma:interpretation. Note that there is nothing standard
> here about <text> and <alignment> they are arbitrary elements
> in the application specific markup. You could also just put the 
> text interpretation (here some formatted text) in <emma:literal>:
> 
> <emma:emma version="1.0"
>   	xmlns:emma="http://www.w3.org/2003/04/emma"
>   	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   	xsi:schemaLocation="http://www.w3.org/2003/04/emma
>   	 http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>   	xmlns="http://www.example.com/example">
> <emma:interpretation 
> 	id="interp1" 
> 	emma:medium="acoustic" 
> 	emma:mode="voice"
> 	emma:tokens="he bought a damn nice five dollar watch in new york"
>       emma:confidence="0.9">
>       <emma:literal>He bought a d*** nice $5.00 watch in New York.</emma:literal>
> </emma:interpretation>
> </emma:emma>
> 
> 
> 
> 2. Headers:
> 
> Agree it makes sense to have a way to specify the User-ID for user
> specific or adapted models.  Should this be a new header though,
> 'User-ID', or a parameter that is set by SET-PARAMS? More generally we
> need the protocol to support transmission of arbitrary other parameters
> that might be used to improve or enhance recognition performance beyond
> User ID.  There may be some information types that we want to break out
> and have standardized parameters or headers for; there should then also
> be a more general mechanism for passing arbitrary information to the
> resource to be used in recognition or post processing.
> 
> best
> Michael
> 
> 
> 
> ________________________________________
> From: public-xg-htmlspeech-request@w3.org
> [public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan
> [Milan.Young@nuance.com]
> Sent: Thursday, June 23, 2011 1:17 AM
> To: Robert Brown; HTML Speech XG
> Subject: RE: Notes from today's protocol call
> 
> For my part, I've updated the control portion of the protocol to cover
> the continuous speech scenario.  Also made a few modifications based on
> recent discussions and updated the document to match Robert's HTML
> format.  Please see the attached.
> 
> There are a couple areas of the protocol that are still TBD in my mind,
> but rather than let the perfect become the enemy of the good, I figured
> I'd open this up for discussion.
> 
> 
> 
> ________________________________
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
> Sent: Wednesday, June 22, 2011 5:24 PM
> To: HTML Speech XG
> Subject: RE: Notes from today's protocol call
> 
> I haven't finished my work item (Redraft to incorporate everything we've
> discussed so far) but it's in progress.
> 
> Here are some things I've noticed so far:
> 
> 
> 1.       Is message-length in the request line really necessary?
> Presumably its only value in MRCP is to provide message framing in what
> is otherwise just an open-ended character stream, which we get
> automatically in WebSockets. Ditto for the Content-Length header.
> 
> 2.       It's not clear that we need Cancel-If-Queue for recognition.
> HTML apps won't have the same serialized dialog we see in IVR, so this
> may not be a meaningful header.
> 
> 3.       Will the API have hotword functionality? If not, do we need the
> hotword headers?
> 
> 4.       Does the reco portion of the protocol imply API requirements
> that haven't been discussed yet? For example, START-INPUT-TIMERS is
> there for a good reason, but AFAIK the XG hasn't spoken about the
> scenario. Similarly, Early-No-Match seems useful. Is it?
> 
> 5.       The TTS design has some IVR artifacts that don't make sense in
> HTML. In IVR, the synthesizer essentially renders directly to the user's
> telephone, and is an active part of the user interface. Whereas in HTML,
> the synthesizer is just a provider of audio to the UA.  The UA buffers
> the audio and controls playback independently of rendering. In light of
> this, the CONTROL method, jump-size header, Speak-Length header,
> kill-on-barge-in header (and possibly others) don't really make sense.
> 
> 6.       The TTS Speaker-Profile header will probably never be used,
> because HTML UAs will want to pass values inline, rather than store them
> in a separate URI-referenceable resource. Should we remove it?
> 
> 7.       DEFINE-LEXICON and the Load-Lexicon header appear to be useful.
> Does it need to surface in the API, or is its presence in SSML enough?
> And if it is, why do we need the header? And also, why isn't there
> corresponding functionality for recognition?
> 
> 
> 
> 
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
> Sent: Thursday, June 16, 2011 10:06 AM
> To: HTML Speech XG
> Subject: Notes from today's protocol call
> 
> Attendees:
> 
> -          Robert Brown
> 
> -          Milan Young
> 
> -          Patrick Ehlen
> 
> -          Michael Johnston
> 
> Topic: control portion of protocol based on MRCP subset (this thread
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)
> 
> 
> -          Agreed that this subset is appropriate for ASR & TTS.
> 
> -          Unclear whether recording should be included.  Agreed to
> escalate this to the XG.  IF there's agreement that recording scenarios
> are common and valuable, we'll include that portion in the protocol.
> OTHERWISE we'll omit it, since it's still possible through more
> convoluted means.
> 
> o        We discussed this in the main call with the XG.  General
> agreement was that recording isn't something we need to solve, and that
> it should be possible as a side-effect of recognition (i.e. <ruleref
> special="GARBAGE"> and retain the audio).
> 
> -          While services are free to implement a subset of the protocol
> (e.g. only SR or only TTS portions), clients will need to implement the
> full set.
> 
> Topic: RTP
> 
> -          Agreed that it is unneeded, for the reasons stated
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html
> 
> -          The basic design approach provides an extensibility mechanism
> so that if new scenarios emerged in the future that required RTP or
> another protocol,
> (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html)
> 
> Topic: SDP
> 
> -          Agreed, like RTP, that it is unneeded given the context we
> already have as a byproduct of our design approach.  (See also the last
> paragraph here:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)
> 
> Topic: session initiation & media negotiation
> (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)
> 
> -          People haven't had a chance to review this.  Will discuss
> more over the coming week.
> 
> -          GET-PARAMS is resource specific, so works a little
> differently to what's written here.  (Robert will need to re-think and
> make another proposal)
> 
> Next steps:
> 
> -          Continuous speech proposal (Milan)
> 
> -          Redraft to incorporate everything we've discussed so far
> (Robert)
> 
> -          Examine whether the recognition portion of the protocol can
> handle extended scenarios, like verification, etc. (Michael)
> 
> 
>
Received on Wednesday, 29 June 2011 23:24:03 UTC