- From: Satish S <satish@google.com>
- Date: Thu, 13 Oct 2011 15:49:06 +0100
- To: Michael Bodell <mbodell@microsoft.com>
- Cc: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
- Message-ID: <CAHZf7Rnhi9PUSp2xQ=MkGfRr+t8u3NL2UJdLMCvE75HMSKu+ew@mail.gmail.com>
I also read through Milan's example and it is indeed a more realistic
example. However it doesn't address the alternate spans of recognition
results that I was interested in. To quote what I wrote earlier:

"What I think the earlier proposal clearly addresses is how to handle
alternates - not just the n-best list but alternate spans of recognition
results. This was highlighted in item (5) of the example section in
http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
Such alternates can overlap any part of the result stream, not just at word
boundaries of the top result. This seems lacking in the new proposal."

The use case I'm looking at is dictation and correction of mis-recognized
phrases. In the original and my updated proposal, it is possible for a web
app to collect & show a set of alternate phrases at various points in the
text stream. This allows users to point at a particular phrase in the
document and select a different recognition phrase, without having to type
it manually or re-dictate it.

Cheers
Satish

On Thu, Oct 13, 2011 at 12:09 PM, Satish S <satish@google.com> wrote:

> Thanks Michael.
>
> I still don't understand what exactly was broken in the original proposal
> or my updated proposal. Could you give specific examples?
>
> Cheers
> Satish
>
> On Thu, Oct 13, 2011 at 11:52 AM, Michael Bodell <mbodell@microsoft.com> wrote:
>
>> Sorry for the slow response; I was trying to get the updated proposal
>> out first, since it is easiest to refer to a working proposal and since
>> the update was needed for today's call.
>>
>> I think the concern you raise was discussed on the call when we all
>> circled around and talked through the various examples and layouts.
>> We definitely wanted to have recognition results that are easy in the
>> non-continuous case and consistent in the continuous case (i.e., don't
>> break the easy case and don't do something wildly different in the
>> continuous case), and ideally that would work even if continuous=false
>> but interim=true. Note it isn't just the EMMA that we want in the
>> alternatives/hypotheses, but also the interpretations to go with the
>> utterances and the confidences. Also note that while a number of our
>> examples used word boundaries for simplicity of discussion, the current
>> version of the web API does not require the results that come back to be
>> on word boundaries. They could be broken on words or sentences or
>> phrases or paragraphs or whatever (that is up to the recognition
>> service, the grammars in use and the actual utterances more than
>> anything else) - we were just using single words because they were
>> easier to write up. Milan had a more complete and realistic example
>> that he mailed out Sept 29th.
>>
>> Hopefully that context, combined with the write-up of the current API,
>> will satisfy your requirements. We can certainly discuss as part of
>> this week's call.
>>
>> *From:* Satish S [mailto:satish@google.com]
>> *Sent:* Tuesday, October 11, 2011 7:16 AM
>> *To:* Michael Bodell
>> *Cc:* public-xg-htmlspeech@w3.org
>> *Subject:* Re: SpeechInputResult merged proposal
>>
>> Hi all,
>>
>> Any thoughts on my questions below and the proposal? If some of these
>> get resolved over mail we could make better use of this week's call.
>>
>> Cheers
>> Satish
>>
>> On Wed, Oct 5, 2011 at 9:59 PM, Satish S <satish@google.com> wrote:
>>
>> Sorry I had to miss last week's call. I read through the call notes but
>> I am unsure where the earlier proposal breaks things in the simple
>> non-continuous case.
>> In the simple one-shot reco case, the script would just read the
>> "event.stable" array of hypotheses.
>>
>> As for EMMA fields, that was an oversight on my part. They should be
>> present wherever there was an array of hypotheses. So I'd replace the
>> "readonly attribute Hypothesis[]" in the IDL I sent with an interface
>> that contains this array plus emmaXML & emmaText attributes.
>>
>> What I think the earlier proposal clearly addresses is how to handle
>> alternates - not just the n-best list but alternate spans of recognition
>> results. This was highlighted in item (5) of the example section in
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html.
>> Such alternates can overlap any part of the result stream, not just at
>> word boundaries of the top result. This seems lacking in the new
>> proposal.
>>
>> Below is an updated IDL that includes the EMMA fields. Please let us
>> know if there are use cases this doesn't address.
>>
>> interface SpeechInputEvent {
>>   readonly attribute SpeechInputResult prelim;
>>   readonly attribute SpeechInputResult stable;
>>   readonly attribute Alternative[] alternatives;
>> }
>>
>> interface SpeechInputResult {
>>   readonly attribute Hypothesis[] hypotheses;
>>   readonly attribute Document emmaXML;
>>   readonly attribute DOMString emmaText;
>> }
>>
>> interface Hypothesis {
>>   readonly attribute DOMString utterance;
>>   readonly attribute float confidence; // Range 0.0 - 1.0
>> }
>>
>> And if the app cares about alternates and correction, then:
>>
>> interface Alternative {
>>   readonly attribute int start;
>>   readonly attribute AlternativeSpan[] spans;
>> }
>>
>> interface AlternativeSpan {
>>   readonly attribute int length;
>>   readonly attribute float confidence;
>>   readonly attribute SpeechInputResult items;
>> }
>>
>> Cheers
>> Satish
>>
>> On Thu, Sep 29, 2011 at 10:20 AM, Michael Bodell <mbodell@microsoft.com> wrote:
>>
>> So we have two existing proposals for SpeechInputResult:
>>
>> Bjorn's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html
>>
>> Satish's mail of:
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html
>>
>> I like Bjorn's proposal in that it incorporates the items we talked
>> about at the F2F, including:
>>
>> - EMMA XML representation
>> - a triple of utterance, confidence, and interpretation
>> - n-best list
>>
>> But it doesn't incorporate continuous recognition, where you could get
>> multiple recognition results.
>>
>> Satish's proposal deals with the different prelim, stable, and
>> alternatives by having an array of them and a span of them, which gets
>> the continuous part right, but which I fear breaks some of the things we
>> want in the simple non-continuous case (as well as in the continuous
>> case) like the EMMA XML, the interpretation, and simplicity.
>>
>> What about something that tries to combine both ideas, building off
>> Bjorn's proposal but adding the arrays idea from Satish's to handle the
>> continuous case.
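To make the correction use case behind Satish's Alternative/AlternativeSpan
IDL above concrete, here is a hypothetical TypeScript sketch (not part of
either proposal): plain objects stand in for the IDL interfaces, and for
brevity each span carries its hypotheses array directly rather than a full
SpeechInputResult.

```typescript
// Hypothetical, simplified stand-ins for the proposed IDL interfaces.
interface Hypothesis { utterance: string; confidence: number; }
interface AlternativeSpan { length: number; confidence: number; hypotheses: Hypothesis[]; }
interface Alternative { start: number; spans: AlternativeSpan[]; }

// Replace the `span.length` words starting at `alt.start` with the top
// hypothesis of the chosen alternate span - the "point at a phrase and
// pick a different recognition" correction flow described above.
function applyCorrection(words: string[], alt: Alternative, spanIndex: number): string[] {
  const span = alt.spans[spanIndex];
  const replacement = span.hypotheses[0].utterance.split(" ");
  return [...words.slice(0, alt.start), ...replacement, ...words.slice(alt.start + span.length)];
}

// Example: an alternate covering the two-word span starting at word 1.
const alt: Alternative = {
  start: 1,
  spans: [{
    length: 2,
    confidence: 0.4,
    hypotheses: [{ utterance: "wreck a nice beach", confidence: 0.4 }],
  }],
};
console.log(applyCorrection(["please", "recognize", "speech"], alt, 0).join(" "));
// "please wreck a nice beach"
```

Because spans carry their own start and length, such alternates can overlap
any part of the result stream rather than lining up with the top result's
word boundaries.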
>> Something like:
>>
>> interface SpeechInputResultEvent : Event {
>>   readonly attribute SpeechInputResult result;
>>   readonly attribute short resultIndex;
>>   readonly attribute SpeechInputResult[] results;
>>   readonly attribute DOMString sessionId;
>> };
>>
>> interface SpeechInputResult {
>>   readonly attribute Document resultEMMAXML;
>>   readonly attribute DOMString resultEMMAText;
>>   readonly attribute unsigned long length;
>>   getter SpeechInputAlternative item(in unsigned long index);
>> };
>>
>> // Item in N-best list
>> interface SpeechInputAlternative {
>>   readonly attribute DOMString utterance;
>>   readonly attribute float confidence;
>>   readonly attribute any interpretation;
>> };
>>
>> It is possible that the results array and/or sessionId belong as
>> readonly attributes on the SpeechInputRequest interface instead of on
>> each SpeechInputResultEvent, but I figure it is easiest here.
>>
>> If all you are doing is non-continuous recognition, you never need look
>> at anything except the result, which contains the structure Bjorn
>> proposed. I think this is a big simplicity win, as the easy case really
>> is easy.
>>
>> If you are doing continuous recognition, you get an array of results
>> that builds up over time. Each time recognition occurs you'll get at
>> least one new SpeechInputResultEvent, and it will have a complete
>> SpeechInputResult structure at some index of the results array (each
>> index gets its own result event, possibly multiple if we are correcting
>> incorrect and/or preliminary results). The index that this event is
>> filling is given by resultIndex. By having an explicit index there, the
>> recognizer can correct earlier results, so you may get events with
>> indexes 1, 2, 3, 2, 3, 4, 5, 6, 5, 7, 8 in the case that the recognizer
>> is performing a continuous recognition and correcting earlier
>> frames/results as it gets later ones.
>> Or, in the case that the recognizer is correcting the same result, you
>> might go 1, 2, 2, 2, 3, 3, 4, 4, 4, 5 as it gives preliminary
>> recognition results and corrects them soon thereafter. Sending a NULL
>> result with an index removes that index from the array.
>>
>> If we really wanted to, we could add a readonly hint/flag that
>> indicates whether a result is final or not. But I'm not sure there is
>> any value in forbidding a recognition system from correcting an earlier
>> result in the array if new processing indicates an earlier one would be
>> more correct.
>>
>> Taking Satish's example of processing the "testing this example"
>> string, and ignoring the details of the EMMA and confidence and
>> interpretation and sessionId, you'd get the following (utterance,
>> index, results[]) tuples:
>>
>> Event1: "text", 1, ["text"]
>> Event2: "test", 1, ["test"]
>> Event3: "sting", 2, ["test", "sting"]
>> Event4: "testing", 1, ["testing", "sting"]
>> Event5: "this", 2, ["testing", "this"]
>> Event6: "ex", 3, ["testing", "this", "ex"]
>> Event7: "apple", 4, ["testing", "this", "ex", "apple"]
>> Event8: "example", 3, ["testing", "this", "example", "apple"]
>> Event9: NULL, 4, ["testing", "this", "example"]
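The event replay above can be checked with a short hypothetical TypeScript
sketch of the client-side bookkeeping the resultIndex scheme implies:
indices are 1-based as in the tuples, a new utterance fills or corrects its
slot, and a NULL utterance removes that index from the array.

```typescript
// Hypothetical stand-in for the (utterance, resultIndex) pair carried by
// each SpeechInputResultEvent; null models the NULL result.
type ResultEvent = { utterance: string | null; resultIndex: number };

// Apply one event to the accumulated results array (1-based indexing,
// matching the email's walkthrough).
function applyEvent(results: string[], ev: ResultEvent): string[] {
  const next = results.slice();
  if (ev.utterance === null) {
    next.splice(ev.resultIndex - 1, 1); // NULL removes that index
  } else {
    next[ev.resultIndex - 1] = ev.utterance; // fill or correct the slot
  }
  return next;
}

// Replaying the nine events from the "testing this example" walkthrough:
const events: ResultEvent[] = [
  { utterance: "text", resultIndex: 1 },
  { utterance: "test", resultIndex: 1 },
  { utterance: "sting", resultIndex: 2 },
  { utterance: "testing", resultIndex: 1 },
  { utterance: "this", resultIndex: 2 },
  { utterance: "ex", resultIndex: 3 },
  { utterance: "apple", resultIndex: 4 },
  { utterance: "example", resultIndex: 3 },
  { utterance: null, resultIndex: 4 },
];
let results: string[] = [];
for (const ev of events) results = applyEvent(results, ev);
console.log(results); // ["testing", "this", "example"]
```

Replaying the intermediate steps reproduces each results[] column in the
tuples above, ending with the corrected "testing this example".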
Received on Thursday, 13 October 2011 14:49:45 UTC