Re: SpeechInputResult merged proposal from Satish S on 2011-10-13 (public-xg-htmlspeech@w3.org from October 2011)

From: Satish S <satish@google.com>
Date: Thu, 13 Oct 2011 15:49:06 +0100
To: Michael Bodell <mbodell@microsoft.com>
Cc: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <CAHZf7Rnhi9PUSp2xQ=MkGfRr+t8u3NL2UJdLMCvE75HMSKu+ew@mail.gmail.com>
I also read through Milan's example and it is indeed a more realistic
example. However, it doesn't address the alternate spans of recognition
results that I was interested in. To quote what I wrote earlier:

"
What I think the earlier proposal clearly addresses are how to handle
alternates - not just n-best list but alternate spans of recognition
results. This was highlighted in item (5) of the example section in
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
Such alternates can overlap any part of the result stream, not just at word
boundaries of the top result. This seems lacking in the new proposal.
"

The use case I'm looking at is dictation and correction of mis-recognized
phrases. In the original and my updated proposal, it is possible for a web
app to collect and show a set of alternate phrases at various points in the
text stream. This allows users to point at a particular phrase in the
document and select a different recognition phrase, without having to type
it manually or re-dictate it.
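
A minimal sketch of that correction flow, written in TypeScript against the Alternative and AlternativeSpan shapes from the IDL quoted further down. It is only an illustration under stated assumptions: the IDL does not say what `start` and `length` count, so the sketch assumes character offsets into the displayed text, and the helper names `alternatesAt` and `applyChoice` are hypothetical, not part of any proposal.

```typescript
// Shapes mirror the quoted IDL; emmaXML/emmaText are left out for brevity.
interface Hypothesis { utterance: string; confidence: number; }
interface SpeechInputResult { hypotheses: Hypothesis[]; }
interface AlternativeSpan { length: number; confidence: number; items: SpeechInputResult; }
interface Alternative { start: number; spans: AlternativeSpan[]; }

// One selectable replacement for a range of the displayed text.
interface Choice { start: number; length: number; text: string; confidence: number; }

// ASSUMPTION: start/length are character offsets into the displayed text.
// Hypothetical helper: collect every alternate whose range covers `position`,
// most confident first.
function alternatesAt(alternatives: Alternative[], position: number): Choice[] {
  const choices: Choice[] = [];
  for (const alt of alternatives) {
    for (const span of alt.spans) {
      const covers = position >= alt.start && position < alt.start + span.length;
      if (!covers) continue;
      for (const h of span.items.hypotheses) {
        choices.push({
          start: alt.start,
          length: span.length,
          text: h.utterance,
          confidence: h.confidence,
        });
      }
    }
  }
  return choices.sort((a, b) => b.confidence - a.confidence);
}

// Hypothetical helper: replace the covered range with the chosen phrase.
function applyChoice(text: string, choice: Choice): string {
  return text.slice(0, choice.start) + choice.text + text.slice(choice.start + choice.length);
}

// Made-up data: the user points at offset 3, inside "recognize speech".
const shown = "recognize speech today";
const sampleAlternatives: Alternative[] = [
  {
    start: 0,
    spans: [
      {
        length: 16,
        confidence: 0.4,
        items: { hypotheses: [{ utterance: "wreck a nice beach", confidence: 0.4 }] },
      },
    ],
  },
];
const options = alternatesAt(sampleAlternatives, 3);
console.log(applyChoice(shown, options[0])); // prints "wreck a nice beach today"
```

Note that the alternate here covers two words of the top result, which is the case I don't see how to express in the new proposal.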

Cheers
Satish


On Thu, Oct 13, 2011 at 12:09 PM, Satish S <satish@google.com> wrote:

> Thanks Michael.
>
> I still don't understand what exactly was broken in the original proposal
> or my updated proposal. Could you give specific examples?
>
> Cheers
> Satish
>
>
>
> On Thu, Oct 13, 2011 at 11:52 AM, Michael Bodell <mbodell@microsoft.com> wrote:
>
>> Sorry for the slow response, I was trying to get the updated proposal
>> out first since it is easiest to refer to a working proposal and since the
>> update was needed for today’s call.
>>
>> I think the concern you raise was discussed on the call when we all
>> circled around and talked through the various examples and layouts.  We
>> definitely wanted to have recognition results that are easy in the
>> non-continuous case and are consistent in the continuous case (i.e., don’t
>> break the easy case and don’t do something wildly different in the
>> continuous case) and ideally that would work even if continuous=false but
>> interim=true.  Note it isn’t just the EMMA that we want in the
>> alternatives/hypotheses, but also the interpretations to go with the
>> utterances and the confidences.  Also note that while a number of our
>> examples used word boundaries for simplicity of discussion, the current
>> version of the web API does not need the results that come back to be on
>> word boundaries.  They could be broken on words or sentences or phrases or
>> paragraphs or whatever (that depends more on the recognition service, the
>> grammars in use, and the actual utterances than on anything else); we were
>> just doing single words because it was easier to write up.  Milan had a more
>> complete and realistic example that he mailed out Sept 29th.
>>
>> Hopefully that context combined with the write up of the current API will
>> satisfy your requirements.  We can certainly discuss as part of this week’s
>> call.
>>
>> *From:* Satish S [mailto:satish@google.com]
>> *Sent:* Tuesday, October 11, 2011 7:16 AM
>> *To:* Michael Bodell
>> *Cc:* public-xg-htmlspeech@w3.org
>> *Subject:* Re: SpeechInputResult merged proposal
>>
>> Hi all,
>>
>> Any thoughts on my questions below and the proposal? If some of these get
>> resolved over mail we could make better use of this week's call.
>>
>>
>> Cheers
>> Satish
>>
>> On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:
>>
>> Sorry I had to miss last week's call. I read through the call notes but I
>> am unsure where the earlier proposal breaks things in the simple
>> non-continuous case. In the simple one-shot recognition case, the script
>> would just read the "event.stable" array of hypotheses.
>>
>> As for EMMA fields, that was an oversight on my part. They should be
>> present wherever there is an array of hypotheses. So I'd replace the
>> "readonly attribute Hypothesis[]" in the IDL I sent with an interface that
>> contains this array plus emmaXML and emmaText attributes.
>>
>> What I think the earlier proposal clearly addresses are how to handle
>> alternates - not just n-best list but alternate spans of recognition
>> results. This was highlighted in item (5) of the example section in
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
>> Such alternates can overlap any part of the result stream, not just at word
>> boundaries of the top result. This seems lacking in the new proposal.
>>
>> Below is an updated IDL that includes the EMMA fields. Please let us know
>> if there are use cases this doesn't address.
>>
>> interface SpeechInputEvent {
>>   readonly attribute SpeechInputResult prelim;
>>   readonly attribute SpeechInputResult stable;
>>   readonly attribute Alternative[] alternatives;
>> }
>>
>> interface SpeechInputResult {
>>   readonly attribute Hypothesis[] hypotheses;
>>   readonly attribute Document emmaXML;
>>   readonly attribute DOMString emmaText;
>> }
>>
>> interface Hypothesis {
>>   readonly attribute DOMString utterance;
>>   readonly attribute float confidence;  // Range 0.0 - 1.0
>> }
>>
>> And if the app cares about alternates and correction, then:
>>
>> interface Alternative {
>>   readonly attribute int start;
>>   readonly attribute AlternativeSpan[] spans;
>> }
>>
>> interface AlternativeSpan {
>>   readonly attribute int length;
>>   readonly attribute float confidence;
>>   readonly attribute SpeechInputResult items;
>> }
>>
>> Cheers
>> Satish
>>
>> On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com> wrote:
>>
>> So we have two existing proposals for SpeechInputResult:
>>
>> Bjorn's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html
>>
>> Satish's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html
>>
>> I like Bjorn's proposal in that it incorporates the items we talked about
>> at the F2F including:
>>
>> - EMMA XML representation
>> - a triple of utterance, confidence, and interpretation
>> - nbest list
>>
>> But it doesn't incorporate continuous recognition where you could get
>> multiple recognition results.
>>
>> Satish's proposal deals with the different prelim, stable, and
>> alternatives by having an array of them and a span of them which gets the
>> continuous part right, but which I fear breaks some of the things we want in
>> the simple non-continuous case (as well as in the continuous case) like the
>> EMMA XML, the interpretation, and simplicity.
>>
>> What about something that tries to combine both ideas building off Bjorn's
>> proposal but adding the arrays idea from Satish's to handle the continuous
>> case.  Something like:
>>
>> interface SpeechInputResultEvent : Event {
>>   readonly attribute SpeechInputResult result;
>>   readonly attribute short resultIndex;
>>   readonly attribute SpeechInputResult[] results;
>>   readonly attribute DOMString sessionId;
>> };
>>
>> interface SpeechInputResult {
>>   readonly attribute Document resultEMMAXML;
>>   readonly attribute DOMString resultEMMAText;
>>   readonly attribute unsigned long length;
>>   getter SpeechInputAlternative item(in unsigned long index);
>> };
>> // Item in N-best list
>> interface SpeechInputAlternative {
>>   readonly attribute DOMString utterance;
>>   readonly attribute float confidence;
>>   readonly attribute any interpretation;
>> };
>>
>> It is possible that the results array and/or sessionId belongs as a
>> readonly attribute on the SpeechInputRequest interface instead of on each
>> SpeechInputResultEvent, but I figure it is easiest here.
>>
>> If all you are doing is non-continuous recognition you never need look at
>> anything except the result which contains the structure Bjorn proposed.  I
>> think this is a big simplicity win as the easy case really is easy.
>>
>> If you are doing continuous recognition you get an array of results that
>> builds up over time.  Each time recognition occurs you'll get at least
>> one new SpeechInputResultEvent returned and it will have a complete
>> SpeechInputResult structure at some index of the results array (each index
>> gets its own result event, possibly multiple if we are correcting incorrect
>> and/or preliminary results).  The index that this event is filling is given
>> by the resultIndex.  By having an explicit index there the recognizer can
>> correct earlier results, so you may get events with indexes 1, 2, 3, 2, 3,
>> 4, 5, 6, 5, 7, 8 in the case that the recognizer is doing continuous
>> recognition and correcting earlier frames/results as it gets later ones.
>> Or, in the case where the recognizer is correcting the same one, you might
>> go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary recognition results
>> and corrects them soon thereafter.  Sending a NULL result with an index
>> can remove that index from the array.
>>
>> If we really wanted to we could add a readonly hint/flag that indicates if
>> a result is final or not.  But I'm not sure there is any value in forbidding
>> a recognition system from correcting an earlier result in the array if new
>> processing indicates an earlier one would be more correct.
>>
>> Taking Satish's example of processing the "testing this example"
>> string and ignoring the details of the EMMA and confidence and
>> interpretation and sessionId you'd get the following (utterance, index,
>> results[]) tuples:
>>
>> Event1: "text", 1, ["text"]
>>
>> Event2: "test", 1, ["test"]
>>
>> Event3: "sting", 2, ["test", "sting"]
>>
>> Event4: "testing", 1, ["testing", "sting"]
>>
>> Event5: "this", 2, ["testing", "this"]
>>
>> Event6: "ex", 3, ["testing", "this", "ex"]
>>
>> Event7: "apple", 4, ["testing", "this", "ex", "apple"]
>>
>> Event8: "example", 3, ["testing", "this", "example", "apple"]
>>
>> Event9: NULL, 4, ["testing", "this", "example"]
>>
>
>
Received on Thursday, 13 October 2011 14:49:45 UTC