RE: SpeechInputResult merged proposal

Seems like the root problem here is that neither of our examples
contained alternate hypotheses.  But as long as we are willing to
live with the simplifying assumption that the alternate hypothesis at
index N does not impact the hypotheses at index N-1 or N+1, it should be
straightforward.  Are you OK with this?
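To make sure we're talking about the same structure, here's a rough
sketch of what I have in mind (illustrative only - the field names are
made up, not taken from either IDL):

// Under the assumption, each segment's alternates stand alone;
// swapping in an alternate at index 1 never touches index 0 or 2.
var results = [
  { stable: "testing", alternates: ["test thing"] },
  { stable: "this",    alternates: ["these"] },
  { stable: "example", alternates: ["examples"] }
];
results[1].stable = results[1].alternates[0];  // purely local change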

 

Thanks

 

 

From: Satish S [mailto:satish@google.com] 
Sent: Thursday, October 13, 2011 8:58 AM
To: Young, Milan
Cc: Michael Bodell; public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

 

User utters: "1 2 3 pictures of the moon"

 

Recognized text shown by the web app: "123 pictures of the moon"

Clicking on any character in the first word shows a drop down with "one
two three" and "oh one two three" as 2 alternates for that word/phrase.

Clicking on any character in the second word shows a drop down with
"picture" as an alternate for that word.

Clicking on any character in the third word shows a drop down with "off"
as an alternate.

Clicking on the last word shows a drop down with "move", "mood" and
"mall" as alternates.

 

These are all alternates returned by the server for the final hypotheses
at each recognition segment.
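For concreteness, here's roughly how those alternates might be
flattened into the Alternative/AlternativeSpan shape from my earlier
IDL (character offsets below assume the displayed text "123 pictures of
the moon" and are purely illustrative; "items" is simplified to a plain
list of strings rather than a full SpeechInputResult):

// Sketch only - offsets and the string-list form of "items" are
// illustrative, not part of the proposal.
var alternatives = [
  { start: 0,  spans: [{ length: 3, items: ["one two three", "oh one two three"] }] },
  { start: 4,  spans: [{ length: 8, items: ["picture"] }] },
  { start: 13, spans: [{ length: 2, items: ["off"] }] },
  { start: 20, spans: [{ length: 4, items: ["move", "mood", "mall"] }] }
];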

 

You can see this in action on an Android phone with the Voice IME. Tap
on any text field and click the microphone button in the keyboard to
trigger Voice IME.


Cheers
Satish



On Thu, Oct 13, 2011 at 4:38 PM, Young, Milan <Milan.Young@nuance.com>
wrote:

Could you provide an enhanced example that demonstrates your use case?

 

Thanks

 

 

From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Satish S
Sent: Thursday, October 13, 2011 7:49 AM


To: Michael Bodell
Cc: public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

 

I also read through Milan's example and it is indeed more realistic.
However it doesn't address the alternate spans of recognition
results that I was interested in. To quote what I wrote earlier:

 

"

What I think the earlier proposal clearly addresses is how to handle
alternates - not just the n-best list but alternate spans of recognition
results. This was highlighted in item (5) of the example section in
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
Such alternates can overlap any part of the result stream, not just
at word boundaries of the top result. This seems lacking in the new
proposal.

"

 

The use case I'm looking at is dictation and correction of
mis-recognized phrases. In the original and my updated proposal, it is
possible for a web app to collect and show a set of alternate phrases at
various points in the text stream. This allows users to point at a
particular phrase in the document and select a different recognized
phrase, without having to type it manually or re-dictate it.
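As a rough sketch of that flow (illustrative only - speechRequest,
onresult, showDropdown and replaceTextRange are hypothetical page-level
names, not part of the proposal):

speechRequest.onresult = function (event) {
  event.alternatives.forEach(function (alt) {
    alt.spans.forEach(function (span) {
      // When the user clicks inside [alt.start, alt.start + span.length),
      // offer the alternate hypotheses recognized for that range.
      var choices = span.items.hypotheses.map(function (h) {
        return h.utterance;
      });
      showDropdown(alt.start, span.length, choices, function (chosen) {
        replaceTextRange(alt.start, span.length, chosen);
      });
    });
  });
};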

 

Cheers
Satish

On Thu, Oct 13, 2011 at 12:09 PM, Satish S <satish@google.com> wrote:

Thanks Michael.

 

I still don't understand what exactly was broken in the original
proposal or my updated proposal. Could you give specific examples?


Cheers
Satish

 

On Thu, Oct 13, 2011 at 11:52 AM, Michael Bodell <mbodell@microsoft.com>
wrote:

Sorry for the slow response, I was trying to get the updated proposal
out first since it is easiest to refer to a working proposal and since
the update was needed for today's call.

 

I think the concern you raise was discussed on the call when we all
circled around and talked through the various examples and layouts.  We
definitely wanted to have recognition results that are easy in the
non-continuous case and are consistent in the continuous case (i.e.,
don't break the easy case and don't do something wildly different in the
continuous case), and ideally that would work even if continuous=false
but interim=true.  Note it isn't just the EMMA that we want in the
alternatives/hypotheses, but also the interpretations to go with the
utterances and the confidences.  Also note that while a number of our
examples used word boundaries for simplicity of discussion, the current
version of the web API does not need the results that come back to be on
word boundaries.  They could be broken on words or sentences or phrases
or paragraphs or whatever (that is up to the recognition service, the
grammars in use, and the actual utterances more than anything else) - we
were just doing single words because it was easier to write up.  Milan
had a more complete and realistic example that he mailed out Sept 29th.

 

Hopefully that context, combined with the write-up of the current API,
will satisfy your requirements.  We can certainly discuss as part of
this week's call.

 

From: Satish S [mailto:satish@google.com] 
Sent: Tuesday, October 11, 2011 7:16 AM
To: Michael Bodell
Cc: public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

 

Hi all,

 

Any thoughts on my questions below and the proposal? If some of these
get resolved over mail we could make better use of this week's call.


Cheers
Satish

On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:

Sorry I had to miss last week's call. I read through the call notes but
I am unsure where the earlier proposal breaks things in the simple
non-continuous case. In the simple one-shot reco case, the script would
just read the "event.stable" array of hypotheses.

 

As for EMMA fields, that was an oversight on my part. They should be
present wherever there was an array of hypotheses. So I'd replace the
"readonly attribute Hypothesis[]" in the IDL I sent with an interface
that contains this array along with emmaXML & emmaText attributes.

 

What I think the earlier proposal clearly addresses is how to handle
alternates - not just the n-best list but alternate spans of recognition
results. This was highlighted in item (5) of the example section in
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
Such alternates can overlap any part of the result stream, not just
at word boundaries of the top result. This seems lacking in the new
proposal.

 

Below is an updated IDL that includes the EMMA fields. Please let us
know if there are use cases this doesn't address.

 
interface SpeechInputEvent {
  readonly attribute SpeechInputResult prelim;
  readonly attribute SpeechInputResult stable;
  readonly attribute Alternative[] alternatives;
};
interface SpeechInputResult {
  readonly attribute Hypothesis[] hypotheses;
  readonly attribute Document emmaXML;
  readonly attribute DOMString emmaText;
};
interface Hypothesis {
  readonly attribute DOMString utterance;
  readonly attribute float confidence;  // Range 0.0 - 1.0
};

And if the app cares about alternates and correction, then:

 
interface Alternative {
  readonly attribute unsigned long start;
  readonly attribute AlternativeSpan[] spans;
};

interface AlternativeSpan {
  readonly attribute unsigned long length;
  readonly attribute float confidence;
  readonly attribute SpeechInputResult items;
};
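
To illustrate the simple one-shot case (sketch only - "request", the
onresult handler and showText are assumptions, not something this IDL
defines):

request.onresult = function (event) {
  // Non-continuous case: just read the stable result.
  var best = event.stable.hypotheses[0];
  showText(best.utterance);          // showText is a hypothetical page helper
  var emma = event.stable.emmaXML;   // EMMA is available alongside if needed
};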

Cheers

Satish

 

On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com>
wrote:

So we have two existing proposals for SpeechInputResult:

Bjorn's mail of:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html

Satish's mail of:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html

I like Bjorn's proposal in that it incorporates the items we talked
about at the F2F including:

- EMMA XML representation
- a triple of utterance, confidence, and interpretation
- nbest list

But it doesn't incorporate continuous recognition where you could get
multiple recognition results.

Satish's proposal deals with prelim, stable, and alternatives by having
an array of them and spans over them, which gets the continuous part
right, but which I fear breaks some of the things we want in the simple
non-continuous case (as well as in the continuous case), like the EMMA
XML, the interpretation, and simplicity.

What about something that tries to combine both ideas, building off
Bjorn's proposal but adding the arrays idea from Satish's to handle the
continuous case?  Something like:

interface SpeechInputResultEvent : Event {
  readonly attribute SpeechInputResult result;
  readonly attribute short resultIndex;
  readonly attribute SpeechInputResult[] results;
  readonly attribute DOMString sessionId;
};

interface SpeechInputResult {
  readonly attribute Document resultEMMAXML;
  readonly attribute DOMString resultEMMAText;
  readonly attribute unsigned long length;
  getter SpeechInputAlternative item(in unsigned long index);
};
// Item in N-best list
interface SpeechInputAlternative {
  readonly attribute DOMString utterance;
  readonly attribute float confidence;
  readonly attribute any interpretation;
};

It is possible that the results array and/or sessionId belong as
readonly attributes on the SpeechInputRequest interface instead of on
each SpeechInputResultEvent, but I figure it is easiest here.

If all you are doing is non-continuous recognition, you never need to
look at anything except result, which contains the structure Bjorn
proposed.  I think this is a big simplicity win, as the easy case really
is easy.
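
For example (sketch only; "request" and the onresult hookup are
assumptions, not part of this proposal):

request.onresult = function (event) {
  // Best entry in the n-best list; handleReco is a hypothetical app function.
  var top = event.result.item(0);
  handleReco(top.utterance, top.confidence, top.interpretation);
};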

If you are doing continuous recognition, you get an array of results
that builds up over time.  Each time recognition occurs you'll get at
least one new SpeechInputResultEvent and it will have a complete
SpeechInputResult structure at some index of the results array (each
index gets its own result event, possibly multiple if we are correcting
incorrect and/or preliminary results).  The index that this event is
filling is given by resultIndex.  By having an explicit index there, the
recognizer can correct earlier results, so you may get events with
indexes 1, 2, 3, 2, 3, 4, 5, 6, 5, 7, 8 in the case that the recognizer
is performing continuous recognition and correcting earlier
frames/results as it gets later ones.  Or, in the case that the
recognizer is correcting the same one, you might go 1, 2, 2, 2, 3, 3, 4,
4, 4, 5 as it gives preliminary recognition results and corrects them
soon thereafter.  Sending a NULL result with an index removes that index
from the array.
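
A rough sketch of what handling that looks like on the page (again,
"request", the onresult hookup and the render helper are assumptions,
not part of the proposal):

request.onresult = function (event) {
  // The event carries the full, already-corrected per-segment array,
  // so the page can simply re-render from it each time.
  var text = [];
  for (var i = 0; i < event.results.length; i++) {
    text.push(event.results[i].item(0).utterance);  // top of each n-best list
  }
  render(text.join(" "));
  // event.resultIndex identifies the segment this event updated, which is
  // useful if the page wants to patch the display incrementally instead.
};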

If we really wanted to we could add a readonly hint/flag that indicates
whether a result is final or not.  But I'm not sure there is any value
in forbidding a recognition system from correcting an earlier result in
the array if new processing indicates an earlier one would be more
correct.

Taking Satish's example of processing the "testing this example"
string, and ignoring the details of the EMMA and confidence and
interpretation and sessionId, you'd get the following (utterance, index,
results[]) tuples:

Event1: "text", 1, ["text"]

Event2: "test", 1, ["test"]

Event3: "sting", 2, ["test", "sting"]

Event4: "testing", 1, ["testing", "sting"]

Event5: "this", 2, ["testing", "this"]

Event6: "ex", 3, ["testing", "this", "ex"]

Event7: "apple", 4, ["testing", "this", "ex", "apple"]

Event8: "example", 3, ["testing", "this", "example", "apple"]

Event9:  NULL, 4, ["testing", "this", "example"]

Received on Thursday, 13 October 2011 16:09:38 UTC