
Re: SpeechInputResult merged proposal

From: Satish S <satish@google.com>
Date: Thu, 13 Oct 2011 17:13:44 +0100
Message-ID: <CAHZf7Rm_wLZjvtO3ofs-YwyCCLJForTiwZdjZnnyMxf-coJSUg@mail.gmail.com>
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Michael Bodell <mbodell@microsoft.com>, public-xg-htmlspeech@w3.org
My original and updated proposals have API support for alternates. The use
case I mentioned in the previous mail gives the user side of the scenario,
and it should be possible to translate this to a protocol-level example.

Cheers
Satish


On Thu, Oct 13, 2011 at 5:08 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> Seems like the root problem here is that neither of our examples contained
> alternate hypotheses.  But as long as we are willing to live with the
> simplifying assumption that the alternate hypothesis at index N does not
> impact the hypotheses at index N-1 or N+1, it should be straightforward.
> Are you OK with this?
>
> Thanks
>
> From: Satish S [mailto:satish@google.com]
> Sent: Thursday, October 13, 2011 8:58 AM
> To: Young, Milan
> Cc: Michael Bodell; public-xg-htmlspeech@w3.org
>
> Subject: Re: SpeechInputResult merged proposal
>
>
> User utters: "1 2 3 pictures of the moon"
>
> Recognized text shown by the web app: "123 pictures of the moon"
>
> Clicking on any character in the first word shows a drop-down with "one two
> three" and "oh one two three" as two alternates for that word/phrase.
> Clicking on any character in the second word shows a drop-down with
> "picture" as an alternate for that word.
> Clicking on any character in the third word shows a drop-down with "off" as
> an alternate.
> Clicking on the last word shows a drop-down with "move", "mood" and "mall"
> as alternates.
>
> These are all alternates returned by the server for the final hypotheses at
> each recognition segment.
>
> You can see this in action on an Android phone with the Voice IME. Tap on
> any text field and click the microphone button in the keyboard to trigger
> the Voice IME.
>
>
> Cheers
> Satish
>
>
> On Thu, Oct 13, 2011 at 4:38 PM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> Could you provide an enhanced example that demonstrates your use case?
>
> Thanks
>
> From: public-xg-htmlspeech-request@w3.org [mailto:
> public-xg-htmlspeech-request@w3.org] On Behalf Of Satish S
> Sent: Thursday, October 13, 2011 7:49 AM
>
> To: Michael Bodell
> Cc: public-xg-htmlspeech@w3.org
> Subject: Re: SpeechInputResult merged proposal
>
>
> I also read through Milan's example and it is indeed a more realistic
> example. However it doesn't address the alternate spans of recognition
> results that I was interested in. To quote what I wrote earlier:
>
> "What I think the earlier proposal clearly addresses is how to handle
> alternates - not just the n-best list but alternate spans of recognition
> results. This was highlighted in item (5) of the example section in
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
> Such alternates can overlap any part of the result stream, not just at word
> boundaries of the top result. This seems lacking in the new proposal."
>
>
> The use case I'm looking at is dictation and correction of mis-recognized
> phrases. In the original and my updated proposal, it is possible for a web
> app to collect & show a set of alternate phrases at various points in the
> text stream. This allows users to point at a particular phrase in the
> document and select a different recognition phrase, without having to type
> it manually or re-dictate it.
>
> Cheers
> Satish
>
> On Thu, Oct 13, 2011 at 12:09 PM, Satish S <satish@google.com> wrote:
>
> Thanks Michael.
>
>
> I still don't understand what exactly was broken in the original proposal
> or my updated proposal. Could you give specific examples?
>
>
> Cheers
> Satish
>
>
> On Thu, Oct 13, 2011 at 11:52 AM, Michael Bodell <mbodell@microsoft.com>
> wrote:
>
> Sorry for the slow response. I was trying to get the updated proposal out
> first since it is easiest to refer to a working proposal and since the
> update was needed for today's call.
>
>
> I think the concern you raise was discussed on the call when we all circled
> around and talked through the various examples and layouts.  We definitely
> wanted to have recognition results that are easy in the non-continuous case
> and are consistent in the continuous case (i.e., don't break the easy case
> and don't do something wildly different in the continuous case), and ideally
> that would work even if continuous=false but interim=true.  Note it isn't
> just the EMMA that we want in the alternatives/hypothesis, but also the
> interpretations to go with the utterances and the confidences.  Also note
> that while a number of our examples used word boundaries for simplicity of
> discussion, the current version of the web API does not need the results
> that come back to be on word boundaries.  They could be broken on words or
> sentences or phrases or paragraphs or whatever (that is up to the
> recognition service and the grammars in use and the actual utterances more
> than anything else) - we were just doing single words because it was easier
> to write up.  Milan had a more complete and realistic example that he
> mailed out Sept 29th.
>
>
> Hopefully that context combined with the write-up of the current API will
> satisfy your requirements.  We can certainly discuss as part of this week's
> call.
>
>
> From: Satish S [mailto:satish@google.com]
> Sent: Tuesday, October 11, 2011 7:16 AM
> To: Michael Bodell
> Cc: public-xg-htmlspeech@w3.org
> Subject: Re: SpeechInputResult merged proposal
>
>
> Hi all,
>
>
> Any thoughts on my questions below and the proposal? If some of these get
> resolved over mail we could make better use of this week's call.
>
>
> Cheers
> Satish
>
> On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:
>
> Sorry I had to miss last week's call. I read through the call notes but I
> am unsure where the earlier proposal breaks things in the simple
> non-continuous case. In the simple one-shot reco case, the script would
> just read the "event.stable" array of hypotheses.
>
>
> As for EMMA fields, that was an oversight on my part. They should be
> present wherever there was an array of hypotheses. So I'd replace the
> "readonly attribute Hypothesis[]" in the IDL I sent with an interface that
> contains this array plus the emmaXML & emmaText attributes.
>
>
> What I think the earlier proposal clearly addresses is how to handle
> alternates - not just the n-best list but alternate spans of recognition
> results. This was highlighted in item (5) of the example section in
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
> Such alternates can overlap any part of the result stream, not just at word
> boundaries of the top result. This seems lacking in the new proposal.
>
>
> Below is an updated IDL that includes the EMMA fields. Please let us know
> if there are use cases this doesn't address.
>
>
> interface SpeechInputEvent {
>   readonly attribute SpeechInputResult prelim;
>   readonly attribute SpeechInputResult stable;
>   readonly attribute Alternative[] alternatives;
> }
>
> interface SpeechInputResult {
>   readonly attribute Hypothesis[] hypotheses;
>   readonly attribute Document emmaXML;
>   readonly attribute DOMString emmaText;
> }
>
> interface Hypothesis {
>   readonly attribute DOMString utterance;
>   readonly attribute float confidence;  // Range 0.0 - 1.0
> }
>
> And if the app cares about alternates and correction, then:
>
> interface Alternative {
>   readonly attribute int start;
>   readonly attribute AlternativeSpan[] spans;
> }
>
> interface AlternativeSpan {
>   readonly attribute int length;
>   readonly attribute float confidence;
>   readonly attribute SpeechInputResult items;
> }
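[Editor's note: to make the dictation-correction use case concrete, here is a rough JavaScript sketch, not part of the proposal, of how a web app might use the Alternative/AlternativeSpan shapes above to build a correction drop-down. It assumes `start` and `length` are character offsets into the top recognition result, and it flattens `items` to plain strings; both are simplifications the IDL leaves open.]

```javascript
// Plain objects standing in for the IDL interfaces, using the
// "123 pictures of the moon" example from earlier in the thread.
const alternatives = [
  { start: 0, spans: [
      { length: 3, confidence: 0.8, items: ["one two three"] },
      { length: 3, confidence: 0.6, items: ["oh one two three"] } ] },
  { start: 4, spans: [
      { length: 8, confidence: 0.7, items: ["picture"] } ] },
  { start: 20, spans: [
      { length: 4, confidence: 0.5, items: ["move", "mood", "mall"] } ] },
];

// When the user clicks at character `offset`, collect every alternate
// phrase whose span covers that offset (assumed [start, start + length)).
function alternatesAt(alternatives, offset) {
  const hits = [];
  for (const alt of alternatives) {
    for (const span of alt.spans) {
      if (offset >= alt.start && offset < alt.start + span.length) {
        hits.push(...span.items);
      }
    }
  }
  return hits;
}

// Clicking inside "123" offers both phrase-level alternates; clicking
// inside "moon" offers "move", "mood" and "mall".
```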
>
> Cheers
> Satish
>
>
> On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com>
> wrote:
>
> So we have two existing proposals for SpeechInputResult:
>
> Bjorn's mail of:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html
>
> Satish's mail of:
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html
>
> I like Bjorn's proposal in that it incorporates the items we talked about
> at the F2F including:
>
> - EMMA XML representation
> - a triple of utterance, confidence, and interpretation
> - nbest list
>
> But it doesn't incorporate continuous recognition where you could get
> multiple recognition results.
>
> Satish's proposal deals with the different prelim, stable, and alternatives
> by having an array of them and a span of them which gets the continuous part
> right, but which I fear breaks some of the things we want in the simple
> non-continuous case (as well as in the continuous case) like the EMMA XML,
> the interpretation, and simplicity.
>
> What about something that tries to combine both ideas, building off Bjorn's
> proposal but adding the arrays idea from Satish's to handle the continuous
> case?  Something like:
>
> interface SpeechInputResultEvent : Event {
>   readonly attribute SpeechInputResult result;
>   readonly attribute short resultIndex;
>   readonly attribute SpeechInputResult[] results;
>   readonly attribute DOMString sessionId;
> };
>
> interface SpeechInputResult {
>   readonly attribute Document resultEMMAXML;
>   readonly attribute DOMString resultEMMAText;
>   readonly attribute unsigned long length;
>   getter SpeechInputAlternative item(in unsigned long index);
> };
>
> // Item in N-best list
> interface SpeechInputAlternative {
>   readonly attribute DOMString utterance;
>   readonly attribute float confidence;
>   readonly attribute any interpretation;
> };
>
> It is possible that the results array and/or sessionId belongs as a
> readonly attribute on the SpeechInputRequest interface instead of on each
> SpeechInputResultEvent, but I figure it is easiest here.
>
> If all you are doing is non-continuous recognition you never need to look
> at anything except the result, which contains the structure Bjorn proposed.
> I think this is a big simplicity win as the easy case really is easy.
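[Editor's note: a minimal JavaScript sketch of that easy one-shot case. The event wiring and the `fakeEvent` stand-in are hypothetical; only the result/item/utterance shapes come from the merged IDL above.]

```javascript
// Handler for the non-continuous case: only event.result is consulted.
function onResult(event) {
  // item(0) is the top entry of the n-best list.
  const best = event.result.item(0);
  console.log(best.utterance, best.confidence, best.interpretation);
}

// A stand-in object shaped like SpeechInputResultEvent, for illustration:
const fakeEvent = {
  result: {
    length: 2,
    item(i) {
      const nbest = [
        { utterance: "testing this example", confidence: 0.9, interpretation: null },
        { utterance: "testing this apple", confidence: 0.4, interpretation: null },
      ];
      return nbest[i];
    },
  },
};
onResult(fakeEvent);
```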
>
> If you are doing continuous recognition you get an array of results that
> builds up over time.  Each time recognition occurs you'll get at least
> one new SpeechInputResultEvent returned and it will have a complete
> SpeechInputResult structure at some index of the results array (each index
> gets its own result event, possibly multiple if we are correcting incorrect
> and/or preliminary results).  The index that this event is filling is given
> by the resultIndex.  By having an explicit index there, the recognizer can
> correct earlier results, so you may get events with indexes 1, 2, 3, 2, 3,
> 4, 5, 6, 5, 7, 8 in the case where the recognizer is performing continuous
> recognition and correcting earlier frames/results as it gets later ones.
> Or, in the case where the recognizer is correcting the same index, you
> might go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary recognition
> results and corrects them soon thereafter.  Sending a NULL result with an
> index removes that index from the array.
>
> If we really wanted to we could add a readonly hint/flag that indicates if
> a result is final or not.  But I'm not sure there is any value in
> forbidding a recognition system from correcting an earlier result in the
> array if new processing indicates an earlier one would be more correct.
>
> Taking Satish's example of processing the "testing this example" string,
> and ignoring the details of the EMMA and confidence and interpretation and
> sessionId, you'd get the following (utterance, index, results[]) tuples:
>
> Event1: "text", 1, ["text"]
>
> Event2: "test", 1, ["test"]
>
> Event3: "sting", 2, ["test", "sting"]
>
> Event4: "testing", 1, ["testing", "sting"]
>
> Event5: "this", 2, ["testing", "this"]
>
> Event6: "ex", 3, ["testing", "this", "ex"]
>
> Event7: "apple", 4, ["testing", "this", "ex", "apple"]
>
> Event8: "example", 3, ["testing", "this", "example", "apple"]
>
> Event9: NULL, 4, ["testing", "this", "example"]
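[Editor's note: the client-side bookkeeping implied by this walkthrough can be sketched in a few lines of JavaScript. The function name and event encoding are ours, not part of the proposal; it assumes resultIndex is 1-based, a repeated index overwrites, and a NULL utterance deletes that index.]

```javascript
// Apply one (utterance, resultIndex) event to the accumulated results array.
function applyEvent(results, utterance, resultIndex) {
  const i = resultIndex - 1;     // events use 1-based indexes
  if (utterance === null) {
    results.splice(i, 1);        // NULL removes that index from the array
  } else {
    results[i] = utterance;      // fill a new index or correct an old one
  }
  return results;
}

// Replaying Event1-Event9 from the example above:
const events = [
  ["text", 1], ["test", 1], ["sting", 2], ["testing", 1],
  ["this", 2], ["ex", 3], ["apple", 4], ["example", 3], [null, 4],
];
const results = [];
for (const [utt, idx] of events) applyEvent(results, utt, idx);
// results is now ["testing", "this", "example"], matching Event9.
```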
Received on Thursday, 13 October 2011 16:14:11 GMT
