- From: Satish S <satish@google.com>
- Date: Thu, 13 Oct 2011 17:13:44 +0100
- To: "Young, Milan" <Milan.Young@nuance.com>
- Cc: Michael Bodell <mbodell@microsoft.com>, public-xg-htmlspeech@w3.org
- Message-ID: <CAHZf7Rm_wLZjvtO3ofs-YwyCCLJForTiwZdjZnnyMxf-coJSUg@mail.gmail.com>
My original and updated proposals have API support for alternates. The use case I mentioned in the previous mail gives the user side of the scenario, and it should be possible to translate this to a protocol-level example.

Cheers
Satish

On Thu, Oct 13, 2011 at 5:08 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Seems like the root problem here is that neither of our examples contained alternate hypotheses. But as long as we are willing to live with the simplifying assumption that the alternate hypothesis at index N does not impact the hypotheses at index N-1 or N+1, it should be straightforward. Are you OK with this?

Thanks

From: Satish S [mailto:satish@google.com]
Sent: Thursday, October 13, 2011 8:58 AM
To: Young, Milan
Cc: Michael Bodell; public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

User utters: "1 2 3 pictures of the moon"

Recognized text shown by the web app: "123 pictures of the moon"

Clicking on any character in the first word shows a drop-down with "one two three" and "oh one two three" as two alternates for that word/phrase.

Clicking on any character in the second word shows a drop-down with "picture" as an alternate for that word.

Clicking on any character in the third word shows a drop-down with "off" as an alternate.

Clicking on the last word shows a drop-down with "move", "mood" and "mall" as alternates.

These are all alternates returned by the server for the final hypotheses at each recognition segment.

You can see this in action on an Android phone with the Voice IME. Tap on any text field and click the microphone button in the keyboard to trigger the Voice IME.

Cheers
Satish

On Thu, Oct 13, 2011 at 4:38 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Could you provide an enhanced example that demonstrates your use case?

Thanks

From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Satish S
Sent: Thursday, October 13, 2011 7:49 AM
To: Michael Bodell
Cc: public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

I also read through Milan's example and it is indeed a more realistic example. However, it doesn't address the alternate spans of recognition results that I was interested in. To quote what I wrote earlier:

"What I think the earlier proposal clearly addresses is how to handle alternates - not just the n-best list but alternate spans of recognition results. This was highlighted in item (5) of the example section in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html. Such alternates can overlap any part of the result stream, not just at word boundaries of the top result. This seems lacking in the new proposal."

The use case I'm looking at is dictation and correction of mis-recognized phrases. In the original proposal and my updated proposal, it is possible for a web app to collect and show a set of alternate phrases at various points in the text stream. This allows users to point at a particular phrase in the document and select a different recognition phrase, without having to type it manually or re-dictate it.

Cheers
Satish
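As a concrete illustration of the correction flow described above, here is a minimal TypeScript sketch mapping click offsets to span alternates. The SpanAlternate shape, the alternatesAt helper, and the hard-coded data are assumptions invented for this sketch; they do not come from either proposal.

// Hypothetical shape for a span of alternates; not part of either proposal.
interface SpanAlternate {
  start: number;    // character offset into the recognized text
  length: number;   // number of characters the alternates would replace
  texts: string[];  // alternate phrases for that span
}

// Data corresponding to the "123 pictures of the moon" example above.
const alternates: SpanAlternate[] = [
  { start: 0,  length: 3, texts: ["one two three", "oh one two three"] },
  { start: 4,  length: 8, texts: ["picture"] },
  { start: 13, length: 2, texts: ["off"] },
  { start: 20, length: 4, texts: ["move", "mood", "mall"] },
];

// On a click at character offset `pos`, return the alternates covering it.
function alternatesAt(pos: number): string[] {
  const span = alternates.find(a => pos >= a.start && pos < a.start + a.length);
  return span ? span.texts : [];
}

console.log(alternatesAt(1));  // ["one two three", "oh one two three"]
console.log(alternatesAt(21)); // ["move", "mood", "mall"]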
On Thu, Oct 13, 2011 at 12:09 PM, Satish S <satish@google.com> wrote:

Thanks Michael.

I still don't understand what exactly was broken in the original proposal or my updated proposal. Could you give specific examples?

Cheers
Satish

On Thu, Oct 13, 2011 at 11:52 AM, Michael Bodell <mbodell@microsoft.com> wrote:

Sorry for the slow response; I was trying to get the updated proposal out first, since it is easiest to refer to a working proposal and since the update was needed for today's call.

I think the concern you raise was discussed on the call when we all circled around and talked through the various examples and layouts. We definitely wanted recognition results that are easy in the non-continuous case and consistent in the continuous case (i.e., don't break the easy case and don't do something wildly different in the continuous case), and ideally that would work even if continuous=false but interim=true. Note that it isn't just the EMMA that we want in the alternatives/hypotheses, but also the interpretations to go with the utterances and the confidences. Also note that while a number of our examples used word boundaries for simplicity of discussion, the current version of the web API does not require the results that come back to be on word boundaries. They could be broken on words or sentences or phrases or paragraphs or whatever (that's up to the recognition service, the grammars in use, and the actual utterances more than anything else) - we were just doing single words because it was easier to write up. Milan had a more complete and realistic example that he mailed out September 29th.

Hopefully that context, combined with the write-up of the current API, will satisfy your requirements. We can certainly discuss as part of this week's call.

From: Satish S [mailto:satish@google.com]
Sent: Tuesday, October 11, 2011 7:16 AM
To: Michael Bodell
Cc: public-xg-htmlspeech@w3.org
Subject: Re: SpeechInputResult merged proposal

Hi all,

Any thoughts on my questions below and the proposal? If some of these get resolved over mail, we could make better use of this week's call.

Cheers
Satish

On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:

Sorry I had to miss last week's call. I read through the call notes, but I am unsure where the earlier proposal breaks things in the simple non-continuous case. In the simple one-shot reco case, the script would just read the "event.stable" array of hypotheses.
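For instance, a one-shot handler under that earlier proposal might look like the following TypeScript sketch. The interface shapes follow the updated IDL quoted just below, while the `speech` object and its `onresult` hook are assumptions for illustration only.

// Minimal types following the updated IDL below.
interface Hypothesis { utterance: string; confidence: number; }
interface SpeechInputResult { hypotheses: Hypothesis[]; }
interface SpeechInputEvent { stable: SpeechInputResult; }

// Hypothetical speech object wiring; not defined by the proposal itself.
declare const speech: { onresult: (event: SpeechInputEvent) => void };

speech.onresult = (event: SpeechInputEvent) => {
  // One-shot case: read the stable hypotheses and take the top one.
  const best = event.stable.hypotheses[0];
  if (best) {
    console.log(best.utterance, best.confidence);
  }
};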
As for EMMA fields, that was an oversight on my part. They should be present wherever there was an array of hypotheses. So I'd replace the "readonly attribute Hypothesis[]" in the IDL I sent with an interface that contains this array plus emmaXML & emmaText attributes.

What I think the earlier proposal clearly addresses is how to handle alternates - not just the n-best list but alternate spans of recognition results. This was highlighted in item (5) of the example section in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html. Such alternates can overlap any part of the result stream, not just at word boundaries of the top result. This seems lacking in the new proposal.

Below is an updated IDL that includes the EMMA fields. Please let us know if there are use cases this doesn't address.

interface SpeechInputEvent {
  readonly attribute SpeechInputResult prelim;
  readonly attribute SpeechInputResult stable;
  readonly attribute Alternative[] alternatives;
};

interface SpeechInputResult {
  readonly attribute Hypothesis[] hypotheses;
  readonly attribute Document emmaXML;
  readonly attribute DOMString emmaText;
};

interface Hypothesis {
  readonly attribute DOMString utterance;
  readonly attribute float confidence; // Range 0.0 - 1.0
};

And if the app cares about alternates and correction, then:

interface Alternative {
  readonly attribute int start;
  readonly attribute AlternativeSpan[] spans;
};

interface AlternativeSpan {
  readonly attribute int length;
  readonly attribute float confidence;
  readonly attribute SpeechInputResult items;
};

Cheers
Satish

On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com> wrote:

So we have two existing proposals for SpeechInputResult:

Bjorn's mail of:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html

Satish's mail of:
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html

I like Bjorn's proposal in that it incorporates the items we talked about at the F2F, including:

- EMMA XML representation
- a triple of utterance, confidence, and interpretation
- n-best list

But it doesn't incorporate continuous recognition, where you could get multiple recognition results.

Satish's proposal deals with the different prelim, stable, and alternatives by having an array of them and a span of them, which gets the continuous part right, but which I fear breaks some of the things we want in the simple non-continuous case (as well as in the continuous case) like the EMMA XML, the interpretation, and simplicity.

What about something that tries to combine both ideas, building off Bjorn's proposal but adding the arrays idea from Satish's to handle the continuous case? Something like:

interface SpeechInputResultEvent : Event {
  readonly attribute SpeechInputResult result;
  readonly attribute short resultIndex;
  readonly attribute SpeechInputResult[] results;
  readonly attribute DOMString sessionId;
};

interface SpeechInputResult {
  readonly attribute Document resultEMMAXML;
  readonly attribute DOMString resultEMMAText;
  readonly attribute unsigned long length;
  getter SpeechInputAlternative item(in unsigned long index);
};

// Item in N-best list
interface SpeechInputAlternative {
  readonly attribute DOMString utterance;
  readonly attribute float confidence;
  readonly attribute any interpretation;
};

It is possible that the results array and/or sessionId belongs as a readonly attribute on the SpeechInputRequest interface instead of on each SpeechInputResultEvent, but I figure it is easiest here.

If all you are doing is non-continuous recognition, you never need look at anything except the result, which contains the structure Bjorn proposed. I think this is a big simplicity win, as the easy case really is easy.
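To make that concrete, here is a minimal TypeScript sketch of the non-continuous case under this merged proposal; the `request` object and its `onresult` hook are assumed for illustration.

// Minimal types following the merged IDL above.
interface SpeechInputAlternative { utterance: string; confidence: number; interpretation: any; }
interface SpeechInputResult { length: number; item(index: number): SpeechInputAlternative; }
interface SpeechInputResultEvent { result: SpeechInputResult; resultIndex: number; }

// Hypothetical request object wiring; assumed for this sketch.
declare const request: { onresult: (event: SpeechInputResultEvent) => void };

request.onresult = (event: SpeechInputResultEvent) => {
  // Non-continuous case: only `result` matters; walk the n-best list.
  for (let i = 0; i < event.result.length; i++) {
    const alt = event.result.item(i);
    console.log(alt.utterance, alt.confidence, alt.interpretation);
  }
};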
If you are doing continuous recognition, you get an array of results that builds up over time. Each time recognition occurs, you'll get at least one new SpeechInputResultEvent returned, and it will have a complete SpeechInputResult structure at some index of the results array (each index gets its own result event, possibly multiple if we are correcting incorrect and/or preliminary results). The index that this event is filling is given by the resultIndex. By having an explicit index there, the recognizer can correct earlier results, so you may get events with indexes 1, 2, 3, 2, 3, 4, 5, 6, 5, 7, 8 in the case that the recognizer is performing a continuous recognition and correcting earlier frames/results as it gets later ones. Or, in the case where the recognizer is correcting the same one, you might go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary recognition results and corrects them soon thereafter. If you send a NULL result with an index, that can remove that index from the array.

If we really wanted to, we could add a readonly hint/flag that indicates whether a result is final or not. But I'm not sure there is any value in forbidding a recognition system from correcting an earlier result in the array if new processing indicates an earlier one would be more correct.

Taking Satish's example of processing the "testing this example" string, and ignoring the details of the EMMA and confidence and interpretation and sessionId, you'd get the following (utterance, index, results[]) tuples:

Event1: "text", 1, ["text"]
Event2: "test", 1, ["test"]
Event3: "sting", 2, ["test", "sting"]
Event4: "testing", 1, ["testing", "sting"]
Event5: "this", 2, ["testing", "this"]
Event6: "ex", 3, ["testing", "this", "ex"]
Event7: "apple", 4, ["testing", "this", "ex", "apple"]
Event8: "example", 3, ["testing", "this", "example", "apple"]
Event9: NULL, 4, ["testing", "this", "example"]
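A client-side consumer of that event sequence could rebuild the transcript as in the following TypeScript sketch. It treats the resultIndex values above as 1-based, and the type declarations and `request` object are minimal assumptions for this sketch rather than part of the proposal.

// Minimal self-contained types; shapes follow the merged IDL earlier in the
// thread, with `result` nullable to model the NULL-removal case.
interface Alternative { utterance: string; }
interface Result { item(index: number): Alternative; }
interface ContinuousResultEvent { result: Result | null; resultIndex: number; }

// Hypothetical request object; assumed for illustration.
declare const request: { onresult: (event: ContinuousResultEvent) => void };

const results: Result[] = [];

request.onresult = (event: ContinuousResultEvent) => {
  const i = event.resultIndex - 1;  // the example's indexes read as 1-based
  if (event.result === null) {
    results.splice(i, 1);           // a NULL result removes that index
  } else {
    results[i] = event.result;      // fill a new index or correct an earlier one
  }
  // Join the top hypothesis of each segment; after Event9 this logs
  // "testing this example".
  console.log(results.map(r => r.item(0).utterance).join(" "));
};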
Received on Thursday, 13 October 2011 16:14:11 UTC