RE: Notes from today's protocol call from JOHNSTON, MICHAEL J (MICHAEL J) on 2011-06-23 (public-xg-htmlspeech@w3.org from June 2011)

From: JOHNSTON, MICHAEL J (MICHAEL J) <johnston@research.att.com>
Date: Thu, 23 Jun 2011 10:20:48 -0400
To: "Young, Milan" <Milan.Young@nuance.com>, Robert Brown <Robert.Brown@microsoft.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>
Message-ID: <DE13570BD8A23F4FA2139E596105E0409AEC1F047B@njfpsrvexg1.research.att.com>
Looks good Milan, few questions:

1. The EMMA example:

Is there a particular reason you are using <emma:lattice> in the example?
We haven't done anything to prevent lattices being returned; presumably
one sets a parameter (SET-PARAMS) requesting that the EMMA output
be a lattice. I don't think a lattice would be returned by default, though. For the example at
the end of control-protocol.html, maybe something more like the following:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:tokens="he bought a damn nice five dollar watch in new york"
      emma:confidence="0.9">
    <text>He bought a d*** nice $5.00 watch in New York.</text>
    <alignment>Content undefined (i.e. vendor specific)</alignment>
  </emma:interpretation>
</emma:emma>

The raw string is in emma:tokens and the formatted version
is inside emma:interpretation. Note that there is nothing standard
here about <text> and <alignment>; they are arbitrary elements
in the application-specific markup. You could also just put the
text interpretation (here some formatted text) in <emma:literal>:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation
      id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:tokens="he bought a damn nice five dollar watch in new york"
      emma:confidence="0.9">
    <emma:literal>He bought a d*** nice $5.00 watch in New York.</emma:literal>
  </emma:interpretation>
</emma:emma>
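
For what it's worth, a client could pull the raw tokens and the formatted text out of a result like this with very little code. The following is only a rough sketch using Python's standard xml.etree.ElementTree; the function name read_interpretations and the trimmed-down EMMA_DOC sample (schema location omitted) are made up for illustration, and it assumes the <emma:literal> form of the result.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Sample result modeled on the <emma:literal> example (schema location omitted).
EMMA_DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:tokens="he bought a damn nice five dollar watch in new york"
      emma:confidence="0.9">
    <emma:literal>He bought a d*** nice $5.00 watch in New York.</emma:literal>
  </emma:interpretation>
</emma:emma>"""


def read_interpretations(emma_xml):
    """Return (raw tokens, confidence, formatted text) for each interpretation."""
    root = ET.fromstring(emma_xml)
    results = []
    for interp in root.iter("{%s}interpretation" % EMMA_NS):
        # Namespaced attributes are keyed as "{namespace}name" in ElementTree.
        tokens = interp.get("{%s}tokens" % EMMA_NS)
        confidence = float(interp.get("{%s}confidence" % EMMA_NS, "nan"))
        literal = interp.find("{%s}literal" % EMMA_NS)
        text = literal.text if literal is not None else None
        results.append((tokens, confidence, text))
    return results


for tokens, confidence, text in read_interpretations(EMMA_DOC):
    print(confidence, tokens, "->", text)
```

With application-specific markup such as <text>, the client would instead have to know the vendor's element names, which is one argument for <emma:literal> in a simple example.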

2. Headers:

Agreed, it makes sense to have a way to specify the User-ID for
user-specific or adapted models. Should this be a new header ('User-ID'), though, or
a parameter that is set by SET-PARAMS? More generally, we need the protocol to
support transmission of arbitrary other parameters, beyond User ID, that might be
used to improve or enhance recognition performance.
There may be some information types that we want to break out and have standardized
parameters or headers for, but there should also be a more
general mechanism for passing arbitrary information to the resource
to be used in recognition or post-processing.
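
To make the "standardized headers plus a general mechanism" idea a bit more concrete, here is a rough sketch. It assumes a catch-all header modeled loosely on MRCP's Vendor-Specific-Parameters (name=value pairs separated by semicolons); whether our protocol keeps that header is an open question. The function name, the STANDARD_HEADERS set, and the com.example.* parameter names are all hypothetical.

```python
# Hypothetical: the set of parameters the group might choose to standardize.
STANDARD_HEADERS = {"User-ID"}


def build_param_headers(params):
    """Split parameters into standardized headers and one generic catch-all header."""
    lines = []
    extras = []
    for name, value in params.items():
        if any(ch in str(value) for ch in ";\r\n"):
            # Keep the sketch simple: refuse values that would break the framing.
            raise ValueError("unsupported character in value for %s" % name)
        if name in STANDARD_HEADERS:
            lines.append("%s: %s" % (name, value))
        else:
            extras.append("%s=%s" % (name, value))
    if extras:
        # Assumed format, loosely following MRCP's Vendor-Specific-Parameters.
        lines.append("Vendor-Specific-Parameters: " + ";".join(extras))
    return lines


headers = build_param_headers({
    "User-ID": "user-1234",
    "com.example.device": "handset",
    "com.example.locale": "en-US",
})
print("\r\n".join(headers))
```

The same split would work if User-ID ended up as a SET-PARAMS parameter rather than a per-request header; only the method carrying these lines would change.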

best
Michael



________________________________________
From: public-xg-htmlspeech-request@w3.org [public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan [Milan.Young@nuance.com]
Sent: Thursday, June 23, 2011 1:17 AM
To: Robert Brown; HTML Speech XG
Subject: RE: Notes from today's protocol call

For my part, I’ve updated the control portion of the protocol to cover the continuous speech scenario.  Also made a few modifications based on recent discussions and updated the document to match Robert’s HTML format.  Please see the attached.

There are a couple areas of the protocol that are still TBD in my mind, but rather than let the perfect become the enemy of the good, I figured I’d open this up for discussion.



________________________________
From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Wednesday, June 22, 2011 5:24 PM
To: HTML Speech XG
Subject: RE: Notes from today's protocol call

I haven’t finished my work item (Redraft to incorporate everything we’ve discussed so far) but it’s in progress.

Here are some things I’ve noticed so far:


1.       Is message-length in the request line really necessary? Presumably its only value in MRCP is to provide message framing in what is otherwise just an open-ended character stream, which we get automatically in WebSockets. Ditto for the Content-Length header.

2.       It's not clear that we need Cancel-If-Queue for recognition. HTML apps won't have the same serialized dialog we see in IVR, so this may not be a meaningful header.

3.       Will the API have hotword functionality? If not, do we need the hotword headers?

4.       Does the reco portion of the protocol imply API requirements that haven’t been discussed yet? For example, START-INPUT-TIMERS is there for a good reason, but AFAIK the XG hasn’t spoken about the scenario. Similarly, Early-No-Match seems useful. Is it?

5.       The TTS design has some IVR artifacts that don't make sense in HTML. In IVR, the synthesizer essentially renders directly to the user's telephone, and is an active part of the user interface, whereas in HTML the synthesizer is just a provider of audio to the UA. The UA buffers the audio and controls playback independently of rendering. In light of this, the CONTROL method, jump-size header, Speak-Length header, kill-on-barge-in header (and possibly others) don't really make sense.

6.       The TTS Speaker-Profile header will probably never be used, because HTML UAs will want to pass values inline, rather than store them in a separate URI-referenceable resource. Should we remove it?

7.       DEFINE-LEXICON and the Load-Lexicon header appear to be useful. Does it need to surface in the API, or is its presence in SSML enough? And if it is, why do we need the header? And also, why isn't there corresponding functionality for recognition?
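
As an illustration of point 1 only: if each protocol message arrives as exactly one WebSocket message, the receiver already knows where it ends, so neither message-length nor Content-Length is needed to frame it. The sketch below assumes the message keeps an MRCP-like shape (start line, headers, blank line, body); the function name split_message, the start-line layout, and the sample values are made up for illustration.

```python
def split_message(payload):
    """Split one complete WebSocket text message into start line, headers, body."""
    # The blank line separates the header section from the body; the body simply
    # runs to the end of the WebSocket message, so no length field is consulted.
    head, _, body = payload.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hypothetical message: method name and request id only, no message-length.
message = (
    "RECOGNIZE 543257\r\n"
    "Content-Type: application/srgs+xml\r\n"
    "\r\n"
    "<grammar/>"
)
print(split_message(message))
```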



From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Thursday, June 16, 2011 10:06 AM
To: HTML Speech XG
Subject: Notes from today's protocol call

Attendees:

-          Robert Brown

-          Milan Young

-          Patrick Ehlen

-          Michael Johnston

Topic: control portion of protocol based on MRCP subset (this thread http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0035.html)


-          Agreed that this subset is appropriate for ASR & TTS.

-          Unclear whether recording should be included.  Agreed to escalate this to the XG.  IF there’s agreement that recording scenarios are common and valuable, we’ll include that portion in the protocol.  OTHERWISE we’ll omit it, since it’s still possible through more convoluted means.

o        We discussed this in the main call with the XG.  General agreement was that recording isn’t something we need to solve, and that it should be possible as a side-effect of recognition (i.e. <ruleref special="GARBAGE"> and retain the audio).

-          While services are free to implement a subset of the protocol (e.g. only SR or only TTS portions), clients will need to implement the full set.

Topic: RTP

-          Agreed that it is unneeded, for the reasons stated in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0029.html

-          The basic design approach provides an extensibility mechanism, so that new scenarios that emerge in the future and require RTP or another protocol can still be accommodated (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html)

Topic: SDP

-          Agreed, like RTP, that it is unneeded given the context we already have as a byproduct of our design approach.  (See also the last paragraph here: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

Topic: session initiation & media negotiation (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0042.html)

-          People haven’t had a chance to review this.  Will discuss more over the coming week.

-          GET-PARAMS is resource specific, so works a little differently to what’s written here.  (Robert will need to re-think and make another proposal)

Next steps:

-          Continuous speech proposal (Milan)

-          Redraft to incorporate everything we’ve discussed so far (Robert)

-          Examine whether the recognition portion of the protocol can handle extended scenarios, like verification, etc. (Michael)
Received on Thursday, 23 June 2011 14:21:44 UTC