RE: SpeechInputResult merged proposal

From: Young, Milan <Milan.Young@nuance.com>
Date: Thu, 29 Sep 2011 10:17:09 -0700
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0CFB5896@SUN-EXCH01.nuance.com>
To: Michael Bodell <mbodell@microsoft.com>, <public-xg-htmlspeech@w3.org>

This proposal sounds good to me.  But the "Testing this example" parsing
is potentially misleading because the recognizer always returns exactly
one word per result.  While it's true that a recognizer *could* do it
this way, I find it unlikely.  Perhaps the example could be improved as
follows:

"Testing this example, and launching this missile. <cough>"

// Recognizer modifies word
Event1:  "text", 1                        ["text"]
Event2:  "test", 1                        ["test"]

// Recognizer combines words
Event3:  "test sting", 1                  ["test sting"]
Event4:  "testing this", 1                ["testing this"]

// Recognizer moves word from one phrase to another
Event5:  "testing this example", 1        ["testing this example"]
Event6:  "testing this example ant", 1    ["testing this example ant"]
Event7:  "testing this example", 1        ["testing this example"]
Event8:  "and launching", 2               ["testing this example",
                                           "and launching"]

// Recognizer removes invalid phrase
Event9:  "and launching this missile", 2  ["testing this example",
                                           "and launching this missile"]
Event10: "confirm", 3                     ["testing this example",
                                           "and launching this missile",
                                           "confirm"]
Event11: NULL, 3                          ["testing this example",
                                           "and launching this missile"]

// Recognizer marks select candidates as final
Event12: FINALIZE, 1-2

 * I wasn't sure how to represent the finalized elements.  Should we put
this info in a parallel array, or update the elements from strings to
{string, bool} tuples?
 * Are arrays indexed from zero or one?
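
To make the first question concrete, here is a rough TypeScript sketch of
the two shapes.  Every name in it (ParallelResults, FlaggedResult, isFinal,
the finalize helpers) is hypothetical and not part of any proposal, and it
assumes one-based, inclusive indexes as in the trace above.

```typescript
// Option A (hypothetical): a parallel array of flags, one per result slot.
interface ParallelResults {
  utterances: string[];
  isFinal: boolean[];
}

// Option B (hypothetical): each element carries its own flag.
interface FlaggedResult {
  utterance: string;
  isFinal: boolean;
}

// True when the 0-based array position falls inside the 1-based range.
function inRange(position: number, first: number, last: number): boolean {
  return position + 1 >= first && position + 1 <= last;
}

// Mark slots first..last as final in the parallel-array shape.
function finalizeParallel(r: ParallelResults, first: number, last: number): ParallelResults {
  return {
    utterances: r.utterances.slice(),
    isFinal: r.isFinal.map((flag, i) => flag || inRange(i, first, last)),
  };
}

// Mark slots first..last as final in the tuple shape.
function finalizeFlagged(r: FlaggedResult[], first: number, last: number): FlaggedResult[] {
  return r.map((entry, i) => ({
    utterance: entry.utterance,
    isFinal: entry.isFinal || inRange(i, first, last),
  }));
}

// State after Event11, then Event12 (FINALIZE, 1-2) applied to both shapes.
const phrases = ["testing this example", "and launching this missile"];
const parallelAfter = finalizeParallel({ utterances: phrases, isFinal: [false, false] }, 1, 2);
const flaggedAfter = finalizeFlagged(
  phrases.map((utterance) => ({ utterance, isFinal: false })),
  1,
  2
);
console.log(parallelAfter.isFinal, flaggedAfter);
```

One trade-off this suggests: with tuples, the text and its flag cannot
drift apart when a NULL result removes a slot, while the parallel array
leaves results[] as plain strings.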


-----Original Message-----
From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Michael Bodell
Sent: Thursday, September 29, 2011 2:21 AM
To: public-xg-htmlspeech@w3.org
Subject: SpeechInputResult merged proposal

So we have two existing proposals for SpeechInputResult:

Bjorn's mail of:

Satish's mail of:

I like Bjorn's proposal in that it incorporates the items we talked
about at the F2F including:

- EMMA XML representation
- a triple of utterance, confidence, and interpretation
- nbest list

But it doesn't incorporate continuous recognition where you could get
multiple recognition results.

Satish's proposal handles preliminary, stable, and alternative results
by having an array of them and a span of them.  That gets the continuous
part right, but I fear it breaks some of the things we want in the
simple non-continuous case (as well as in the continuous case), like the
EMMA XML, the interpretation, and simplicity.

What about something that combines both ideas, building on Bjorn's
proposal but adding the array idea from Satish's to handle the
continuous case?  Something like:

interface SpeechInputResultEvent : Event {
    readonly attribute SpeechInputResult result;
    readonly attribute short resultIndex;
    readonly attribute SpeechInputResult[] results;
    readonly attribute DOMString sessionId;
};

interface SpeechInputResult {
    readonly attribute Document resultEMMAXML;
    readonly attribute DOMString resultEMMAText;
    readonly attribute unsigned long length;
    getter SpeechInputAlternative item(in unsigned long index);
};

// Item in N-best list
interface SpeechInputAlternative {
    readonly attribute DOMString utterance;
    readonly attribute float confidence;
    readonly attribute any interpretation;
};

It is possible that the results array and/or sessionId belong as
readonly attributes on the SpeechInputRequest interface instead of on
each SpeechInputResultEvent, but I figure it is easiest here.

If all you are doing is non-continuous recognition, you never need to
look at anything except result, which contains the structure Bjorn
proposed.  I think this is a big simplicity win as the easy case really
is easy.
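
As a rough illustration of that easy case, the TypeScript sketch below
reads only the n-best list of a single result.  The Alternative and
Result types are hypothetical, trimmed mirrors of the IDL above;
makeResult and the sample values are made up; and it assumes item() is
zero-based and that the list is not necessarily sorted by confidence.

```typescript
// Hypothetical mirrors of the proposed IDL, trimmed to what is used here.
interface Alternative {
  utterance: string;
  confidence: number;
  interpretation: unknown;
}

interface Result {
  length: number;
  item(index: number): Alternative;
}

// Return the highest-confidence alternative, or null for an empty list.
function bestAlternative(result: Result): Alternative | null {
  let best: Alternative | null = null;
  for (let i = 0; i < result.length; i++) {
    const candidate = result.item(i);
    if (best === null || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best;
}

// Stand-in for what a recognizer might hand back.
function makeResult(alternatives: Alternative[]): Result {
  return {
    length: alternatives.length,
    item: (index: number): Alternative => {
      const alternative = alternatives[index];
      if (alternative === undefined) {
        throw new RangeError("index out of range");
      }
      return alternative;
    },
  };
}

const sampleResult = makeResult([
  { utterance: "resting this example", confidence: 0.4, interpretation: null },
  { utterance: "testing this example", confidence: 0.9, interpretation: null },
]);
const bestGuess = bestAlternative(sampleResult);
console.log(bestGuess ? bestGuess.utterance : "(no result)");
```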

If you are doing continuous recognition you get an array of results that
builds up over time.  Each time recognition occurs you'll get at least
one new SpeechInputResultEvent, and it will carry a complete
SpeechInputResult structure at some index of the results array (each
index gets its own result event, possibly several if we are correcting
incorrect and/or preliminary results).  The index that this event is
filling is given by resultIndex.  Because the index is explicit, the
recognizer can correct earlier results, so you may get events with
indexes 1, 2, 3, 2, 3, 4, 5, 6, 5, 7, 8 when the recognizer is doing
continuous recognition and correcting earlier frames/results as it gets
later ones.  Or, when the recognizer keeps correcting the same result,
you might go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary
recognition results and corrects them soon thereafter.  Sending a NULL
result with an index removes that index from the array.

If we really wanted to we could add a readonly hint/flag that indicates
whether a result is final.  But I'm not sure there is any value in
forbidding a recognition system from correcting an earlier result in the
array if new processing indicates an earlier one would be more correct.

Taking Satish's example of processing the "testing this example" string,
and ignoring the details of the EMMA, confidence, interpretation, and
sessionId, you'd get the following (utterance, index, results[]) tuples:

Event1: "text", 1, ["text"]

Event2: "test", 1, ["test"]

Event3: "sting", 2, ["test", "sting"]

Event4: "testing", 1, ["testing", "sting"]

Event5: "this", 2, ["testing", "this"]

Event6: "ex", 3, ["testing", "this", "ex"]

Event7: "apple", 4, ["testing", "this", "ex", "apple"]

Event8: "example", 3, ["testing", "this", "example", "apple"]

Event9:  NULL, 4, ["testing", "this", "example"]
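
The nine events above can be replayed with a small TypeScript sketch of
the update rule described earlier.  ResultUpdate and applyUpdate are
hypothetical names, the index is assumed to be one-based as in the
tuples, and a NULL result is assumed to delete its slot and shift any
later slots down (the text does not say what happens when a middle slot
is removed).

```typescript
// Hypothetical, simplified event: only the utterance and the index.
// A null utterance stands in for the NULL result.
interface ResultUpdate {
  utterance: string | null;
  resultIndex: number; // assumed one-based
}

// Apply one update to the accumulated results and return the new array.
function applyUpdate(current: readonly string[], update: ResultUpdate): string[] {
  const next = current.slice();
  const position = update.resultIndex - 1; // convert to zero-based
  if (update.utterance === null) {
    // Assumption: NULL deletes the slot rather than leaving a hole.
    next.splice(position, 1);
  } else {
    // Fill a new slot or overwrite an earlier (corrected) one.
    next[position] = update.utterance;
  }
  return next;
}

// Event1 through Event9 from the list above.
const trace: ResultUpdate[] = [
  { utterance: "text", resultIndex: 1 },
  { utterance: "test", resultIndex: 1 },
  { utterance: "sting", resultIndex: 2 },
  { utterance: "testing", resultIndex: 1 },
  { utterance: "this", resultIndex: 2 },
  { utterance: "ex", resultIndex: 3 },
  { utterance: "apple", resultIndex: 4 },
  { utterance: "example", resultIndex: 3 },
  { utterance: null, resultIndex: 4 },
];

let replayed: string[] = [];
for (const update of trace) {
  replayed = applyUpdate(replayed, update);
}
console.log(replayed);
```

After the last update the array holds "testing", "this", "example",
matching the results[] column for Event9.
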
Received on Thursday, 29 September 2011 17:18:45 GMT
