Re: SpeechInputResult merged proposal from Satish S on 2011-10-11 (public-xg-htmlspeech@w3.org from October 2011)

From: Satish S <satish@google.com>
Date: Tue, 11 Oct 2011 15:16:07 +0100
To: Michael Bodell <mbodell@microsoft.com>
Cc: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <CAHZf7RmQgGJB2mmPOLQtyGiB7P+oqJ=DgsYwEisuNR28tWU16A@mail.gmail.com>

Hi all,

Any thoughts on my questions below and the proposal? If some of these get
resolved over mail we could make better use of this week's call.

Cheers
Satish


On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:

> Sorry I had to miss last week's call. I read through the call notes but I
> am unsure where the earlier proposal breaks things in the simple
> non-continuous case. In the simple one-shot reco case, the script would just
> read the "event.stable" array of hypotheses.
>
> As for EMMA fields, that was an oversight on my part. They should be
> present wherever there is an array of hypotheses. So I'd replace the
> "readonly attribute Hypothesis[]" in the IDL I sent with an interface that
> contains this array plus emmaXML and emmaText attributes.
>
> What I think the earlier proposal clearly addresses is how to handle
> alternates: not just an n-best list but alternate spans of recognition
> results. This was highlighted in item (5) of the example section in
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
> Such alternates can overlap any part of the result stream, not just at word
> boundaries of the top result. This seems lacking in the new proposal.
>
> Below is an updated IDL that includes the EMMA fields. Please let us know
> if there are use cases this doesn't address.
>
> interface SpeechInputEvent {
>   readonly attribute SpeechInputResult prelim;
>   readonly attribute SpeechInputResult stable;
>   readonly attribute Alternative[] alternatives;
> }
>
> interface SpeechInputResult {
>   readonly attribute Hypothesis[] hypotheses;
>   readonly attribute Document emmaXML;
>   readonly attribute DOMString emmaText;
> }
>
> interface Hypothesis {
>   readonly attribute DOMString utterance;
>   readonly attribute float confidence;  // Range 0.0 - 1.0
> }
>
> And if the app cares about alternates and correction, then:
>
> interface Alternative {
>   readonly attribute long start;
>   readonly attribute AlternativeSpan[] spans;
> }
>
> interface AlternativeSpan {
>   readonly attribute long length;
>   readonly attribute float confidence;
>   readonly attribute SpeechInputResult items;
> }
>
> Cheers
> Satish
>
>
>
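For the simple one-shot case described above, a script would only need to read event.stable. Below is a minimal TypeScript sketch, assuming the interface shapes from the IDL above. The helper name bestStableUtterance is hypothetical, emmaXML is typed as unknown so the sketch runs outside a browser, and picking the highest-confidence hypothesis is an assumption, since the proposal does not say whether the hypotheses array is ordered.

```typescript
// Shapes mirroring the proposed IDL (emmaXML would be a Document in a browser).
interface Hypothesis {
  utterance: string;
  confidence: number; // Range 0.0 - 1.0
}

interface SpeechInputResult {
  hypotheses: Hypothesis[];
  emmaXML: unknown;
  emmaText: string;
}

interface SpeechInputEvent {
  prelim: SpeechInputResult;
  stable: SpeechInputResult;
}

// Hypothetical helper: returns the highest-confidence stable utterance,
// or null when the recognizer produced no stable hypotheses.
function bestStableUtterance(event: Pick<SpeechInputEvent, "stable">): string | null {
  let best: Hypothesis | null = null;
  for (const h of event.stable.hypotheses) {
    if (best === null || h.confidence > best.confidence) {
      best = h;
    }
  }
  return best === null ? null : best.utterance;
}

// Made-up sample data for illustration only.
const sampleEvent: Pick<SpeechInputEvent, "stable"> = {
  stable: {
    hypotheses: [
      { utterance: "testing this example", confidence: 0.9 },
      { utterance: "text in this example", confidence: 0.4 },
    ],
    emmaXML: null,
    emmaText: "",
  },
};

console.log(bestStableUtterance(sampleEvent));
```

An application that does not care about alternates or corrections would never need to touch the alternatives array in this reading of the proposal.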
> On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com> wrote:
>
>> So we have two existing proposals for SpeechInputResult:
>>
>> Bjorn's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html
>>
>> Satish's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html
>>
>> I like Bjorn's proposal in that it incorporates the items we talked about
>> at the F2F including:
>>
>> - EMMA XML representation
>> - a triple of utterance, confidence, and interpretation
>> - nbest list
>>
>> But it doesn't incorporate continuous recognition, where you could get
>> multiple recognition results.
>>
>> Satish's proposal deals with the different prelim, stable, and
>> alternatives results by having an array of them and a span of them. That
>> gets the continuous part right, but I fear it breaks some of the things we
>> want in the simple non-continuous case (as well as in the continuous
>> case), like the EMMA XML, the interpretation, and simplicity.
>>
>> What about something that tries to combine both ideas, building off Bjorn's
>> proposal but adding the arrays idea from Satish's to handle the continuous
>> case?  Something like:
>>
>> interface SpeechInputResultEvent : Event {
>>   readonly attribute SpeechInputResult result;
>>   readonly attribute short resultIndex;
>>   readonly attribute SpeechInputResult[] results;
>>   readonly attribute DOMString sessionId;
>> };
>>
>> interface SpeechInputResult {
>>   readonly attribute Document resultEMMAXML;
>>   readonly attribute DOMString resultEMMAText;
>>   readonly attribute unsigned long length;
>>   getter SpeechInputAlternative item(in unsigned long index);
>> };
>> // Item in N-best list
>> interface SpeechInputAlternative {
>>   readonly attribute DOMString utterance;
>>   readonly attribute float confidence;
>>   readonly attribute any interpretation;
>> };
>>
>> It is possible that the results array and/or sessionId belong as
>> readonly attributes on the SpeechInputRequest interface instead of on each
>> SpeechInputResultEvent, but I figure it is easiest here.
>>
>> If all you are doing is non-continuous recognition you never need look at
>> anything except the result which contains the structure Bjorn proposed.  I
>> think this is a big simplicity win as the easy case really is easy.
>>
>> If you are doing continuous recognition you get an array of results that
>> builds up over time.  Each time recognition occurs you'll get at least
>> one new SpeechInputResultEvent, and it will have a complete
>> SpeechInputResult structure at some index of the results array (each index
>> gets its own result event, possibly several if we are correcting incorrect
>> and/or preliminary results).  The index that this event is filling is given
>> by the resultIndex.  By having an explicit index there the recognizer can
>> correct earlier results, so you may get events with indexes 1, 2, 3, 2, 3,
>> 4, 5, 6, 5, 7, 8 in the case that the recognizer is doing continuous
>> recognition and correcting earlier frames/results as it gets later ones.
>> Or, in the case where the recognizer keeps correcting the same result, you
>> might go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary recognition
>> results and corrects them soon thereafter.  If you send a NULL result with
>> an index, that can remove that index from the array.
>>
>> If we really wanted to we could add a readonly hint/flag that indicates if
>> a result is final or not.  But I'm not sure there is any value in forbidding
>> a recognition system from correcting an earlier result in the array if new
>> processing indicates an earlier one would be more correct.
>>
>> Taking Satish's example of processing the "testing this example"
>> string and ignoring the details of the EMMA and confidence and
>> interpretation and sessionId you'd get the following (utterance, index,
>> results[]) tuples:
>>
>> Event1: "text", 1, ["text"]
>>
>> Event2: "test", 1, ["test"]
>>
>> Event3: "sting", 2, ["test", "sting"]
>>
>> Event4: "testing", 1, ["testing", "sting"]
>>
>> Event5: "this", 2, ["testing", "this"]
>>
>> Event6: "ex", 3, ["testing", "this", "ex"]
>>
>> Event7: "apple", 4, ["testing", "this", "ex", "apple"]
>>
>> Event8: "example", 3, ["testing", "this", "example", "apple"]
>>
>> Event9:  NULL, 4, ["testing", "this", "example"]
>>
>
>
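To make the index bookkeeping in the quoted event sequence concrete, here is a minimal TypeScript sketch that replays Event1 through Event9 against a results array. The names ResultEvent and applyEvent are hypothetical and not part of either proposal, and plain utterance strings stand in for full SpeechInputResult objects. Indexes are 1-based as in the example, and a NULL utterance is treated as removing that entry, which is one reading of "can remove that index from the array".

```typescript
// Hypothetical stand-in for a SpeechInputResultEvent: only the utterance
// and the 1-based resultIndex from the quoted example are modelled.
type ResultEvent = { utterance: string | null; resultIndex: number };

// Applies one event to a copy of the results array and returns the copy.
// Assumes indexes arrive without gaps, as they do in the example.
function applyEvent(results: string[], ev: ResultEvent): string[] {
  const next = results.slice();
  const i = ev.resultIndex - 1; // convert to a 0-based array position
  if (ev.utterance === null) {
    // A NULL result removes the entry at that index, if it exists.
    if (i >= 0 && i < next.length) {
      next.splice(i, 1);
    }
  } else {
    // A non-null result fills a new slot or corrects an existing one.
    next[i] = ev.utterance;
  }
  return next;
}

// The nine events from the quoted message.
const events: ResultEvent[] = [
  { utterance: "text", resultIndex: 1 },
  { utterance: "test", resultIndex: 1 },
  { utterance: "sting", resultIndex: 2 },
  { utterance: "testing", resultIndex: 1 },
  { utterance: "this", resultIndex: 2 },
  { utterance: "ex", resultIndex: 3 },
  { utterance: "apple", resultIndex: 4 },
  { utterance: "example", resultIndex: 3 },
  { utterance: null, resultIndex: 4 },
];

let results: string[] = [];
const history: string[][] = [];
for (const ev of events) {
  results = applyEvent(results, ev);
  history.push(results);
  console.log(ev.utterance, ev.resultIndex, JSON.stringify(results));
}
```

Under these assumptions the array after each event matches the results[] column in the quoted tuples, ending with "testing", "this", "example" after Event9.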
Received on Tuesday, 11 October 2011 14:16:48 UTC