- From: Glen Shires <gshires@google.com>
- Date: Thu, 14 Jun 2012 14:28:01 -0700
- To: "Young, Milan" <Milan.Young@nuance.com>
- Cc: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <CAEE5bcg9qLjQn9M0+WOnswBDZH7oaD=Khon+YDh7Krskkpw-aw@mail.gmail.com>
To clarify: with Proposal C, Group 3 developers do NOT have to translate any scores and they can use "direct import".

On Thu, Jun 14, 2012 at 2:20 PM, Glen Shires <gshires@google.com> wrote:
> Yes, good suggestion. Looking at how the proposals affect these three
> groups of web developers is a great way to evaluate them.
>
> Here's how my proposal affects these three groups:
>
> Analysis:
>
> Group 1: No advantage or disadvantage.
>
> Group 2: This is a perfect solution. These developers only care about
> setting the input confidenceThreshold. Since 0.0 is the default
> confidenceThreshold, if they want to get reasonable rejection behavior
> for nomatch, they can simply do:
>
> recognizer.confidenceThreshold = 0.5
>
> Since the confidenceThreshold is skewed, they get reasonable nomatch
> behavior with 0.5, which enables them to write recognizer-independent
> code (at least to some extent). If they're still getting too many
> results, they can simply increment or decrement this value, again as
> recognizer-independent code.
>
> Conversely, if the confidenceThreshold were not skewed and instead used
> native recognizer confidence values, there would be no way for these
> developers to write recognizer-independent code and get reasonable
> rejection behavior for nomatch.
>
> Since Group 2 developers don't process or analyze the output confidence
> values (in results or emma), they don't care whether these output
> values are skewed or not.
>
> Summary: This proposal offers big benefits for Group 2. Conversely, not
> having skewing would be a major hindrance for Group 2 developers
> because they couldn't reliably use nomatch behavior.
>
> Group 3: This is a perfect solution. These developers want to use
> native recognizer confidence values for both input (setting the
> threshold) and output (processing the results in
> SpeechRecognitionAlternative.confidence or
> SpeechRecognitionResult.emma).
> They don't want any skewing that can complicate things, and this
> solution allows them to use only native values everywhere; they never
> have to worry about skewing. The only thing they have to do is cut and
> paste a simple JavaScript function (which I presume most recognizer
> vendors would gladly post on their website) into their code. For
> example, they could simply cut and paste the following:
>
> function SetNativeConfidenceThreshold(conf) {
>   if (conf < 0.7)
>     recognizer.confidenceThreshold = conf / 1.4;
>   else
>     recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
> }
>
> Now, all the Group 3 developer has to do to set the confidence
> threshold using a native confidence value is:
>
> SetNativeConfidenceThreshold(value);
>
> Copying and pasting a short function is a trivial amount of effort,
> particularly when compared to all the effort that Group 3 is, by
> definition, putting into reviewing, tuning, processing and tweaking
> confidence values. That is, this proposal has trivial impact on the
> effort required for Group 3 developers.
>
> Summary: This proposal has virtually no impact on Group 3 developers.
> In contrast, a proposal that skews the results and emma confidence
> values would have a major, negative impact on Group 3 developers.
>
> Now, to compare how all three proposals affect these groups, let's
> label them:
>
> Proposal A: No skewing. Use native recognizer confidence values for
> both input (confidenceThreshold) and output (results and emma).
>
> Proposal B: Skew both input (confidenceThreshold) and output (results
> and emma) in the same manner.
>
> Proposal C: (my proposal) Skew only input (confidenceThreshold). Use
> native recognizer confidence values for output (results and emma).
>
> Group 1: All proposals are fine; they provide no advantage or
> disadvantage.
>
> Group 2: Proposal A is problematic. Proposals B and C both provide a
> huge advantage.
>
> Group 3: Proposal B is problematic.
> Proposals A and C both provide a huge advantage.
>
> The intersection of these is Proposal C - it provides huge advantages
> and is not problematic for any of the 3 groups of developers.
>
> Glen
>
> On Thu, Jun 14, 2012 at 12:23 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>> Let's try this another way. The most obvious/simplest solution is to
>> report results on the same scale as the threshold. Can we agree on
>> that?
>>
>> Assuming yes, then we should only entertain alternate/complicated
>> suggestions if there is a clear and significant advantage. Let's break
>> down this analysis to the three target audiences we've used before:
>>
>> 1) Developers who just want the default behavior.
>>
>> 2) Developers who think confidence is a neat feature, but do not run
>> offline experiments or have any preference for a speech engine. This
>> class will probably either use incremental adjustments to the
>> threshold or pick round numbers like ".5" as arbitrary thresholds.
>> They are aware that confidence thresholds do not mean the same thing
>> to different engines, but they do know: i) by default they get all
>> results, and ii) if they want to limit the number of results they
>> should use larger thresholds.
>>
>> 3) Power developers who either run offline experiments or have a port
>> of the application on some other modality (e.g. IVR). These developers
>> leave nothing to chance, and have a custom confidence score for each
>> application state. If they do support multiple engines, each engine
>> will have a distinct set of thresholds.
>>
>> Analysis:
>>
>> Group 1: No advantage or disadvantage.
>>
>> Group 2: There could be some advantage here, but as of yet I do not
>> see it. Please make your case.
>>
>> Group 3: Your solution is a disadvantage because they must translate
>> the scores on a per-recognizer basis.
>> These developers would prefer to use the much simpler solution of a
>> direct import.
>>
>> From: Glen Shires [mailto:gshires@google.com]
>> Sent: Thursday, June 14, 2012 12:19 PM
>> To: Young, Milan
>> Cc: Satish S; public-speech-api@w3.org
>> Subject: Re: Confidence property
>>
>> Perhaps a more intuitive name for that wrapper function would be
>> SetNativeConfidenceThreshold.
>>
>> Also, I realize the logic was wrong as it used a different scale. In
>> this wrapper function, both the input (conf) and the output
>> (recognizer.confidenceThreshold) use a 0.0 - 1.0 scale. For example,
>> the following works when a recognizer.confidenceThreshold of 0.5 is
>> skewed to a native value of 0.7.
>>
>> function SetNativeConfidenceThreshold(conf) {
>>   if (conf < 0.7)
>>     recognizer.confidenceThreshold = conf / 1.4;
>>   else
>>     recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
>> }
>>
>> On Thu, Jun 14, 2012 at 11:43 AM, Glen Shires <gshires@google.com> wrote:
>>
>> Yes, the confidenceThreshold is on a 0.0 - 1.0 scale.
>> Yes, the confidence reported in the results is on a 0.0 - 1.0 scale.
>> Yes, the confidence reported in the EMMA is on a 0.0 - 1.0 scale.
>>
>> What I am saying is that:
>>
>> - The recognizer may skew the confidenceThreshold such that 0.5 maps
>> to something reasonable for nomatch.
>>
>> - The recognizer is not required to skew the reported results or EMMA
>> results.
>> (The recognizer may skew them, or it may not.)
>>
>> Simply put: the input must be skewed for 0.5; the output is not
>> required to be skewed in a similar manner.
>>
>> I've added additional comments inline below...
>>
>> On Thu, Jun 14, 2012 at 11:04 AM, Young, Milan <Milan.Young@nuance.com> wrote:
>>
>> I requested "If the threshold is set to X, all alternative.confidence
>> values will be >= X." I'd like to address your listed disadvantages:
>>
>> [Glen] It would require remapping all the results.
>>
>> [Milan] Every modern recognizer that I know of is capable of reporting
>> results on a [0-1] scale. That's really the only relevant requirement
>> to this part of the request. Which alternate scale are you suggesting?
>>
>> [Glen] Scale remains 0.0 - 1.0.
>>
>> [Glen] It would require re-writing EMMA with the new results.
>>
>> [Milan] EMMA is natively on a [0-1] scale.
>>
>> [Glen] Yes, scale remains 0.0 - 1.0.
>>
>> [Glen] Nearly all developers who do process these results will simply
>> be comparing relative values; skewing the output could mask the
>> differences between alternatives.
>>
>> [Milan] A significant portion of developers and a *majority* of
>> consumers will be using absolute thresholds derived from offline
>> tuning experiments. Let's address the "skew" part of your statement as
>> part of the first question/response.
>>
>> [Glen] I believe my proposal is particularly advantageous for these
>> developers and customers. Most likely their offline tuning experiments
>> will be using backend logs, and these backend logs use the
>> recognizer's native 0.0 - 1.0 confidence scale (not a skewed scale).
>> In fact, some customers may have multiple applications/implementations
>> (not just those using a browser with our Speech JavaScript API) and/or
>> may have prior experience with other applications; certainly these
>> tuning experiments or logs would be using the recognizer's native
>> 0.0 - 1.0 confidence scale (not a skewed scale). So the advantage
>> these developers and customers have is that all the tuning data and
>> logs they have ever gathered, over years and multiple applications,
>> all use, and continue to use, the same native scale. When they write
>> JavaScript code to process results with the Speech JavaScript API,
>> they continue to use the same native scale.
>>
>> The only thing that these developers and customers must do to use
>> these results directly in their JavaScript code is to set
>> confidenceThreshold through a simple JavaScript wrapper function. For
>> example, that wrapper function might look like the following.
>> Recognizer vendors may wish to document a suggested wrapper function
>> like this, so that their developers and customers can tune
>> applications without any additional effort or skewing concerns.
>>
>> function SetConfidenceAbsolute(conf) {
>>   var c = conf - 0.7;
>>   if (c > 0)
>>     recognizer.confidenceAdjustment = c / 0.3;
>>   else
>>     recognizer.confidenceAdjustment = c / 0.7;
>> }
>>
>> Thanks,
>> Glen
>>
>> Thanks
>>
>> From: Glen Shires [mailto:gshires@google.com]
>> Sent: Wednesday, June 13, 2012 11:09 PM
>> To: Young, Milan
>> Cc: Satish S; public-speech-api@w3.org
>> Subject: Re: Confidence property
>>
>> Milan,
>> Great, I believe we are almost fully in agreement. Here are the key
>> points that I think should be in the specification. Most of this is
>> your wording.
>> {The portions in curly braces are things I agree with, but that I
>> don't think need to be specified in the spec.}
>>
>> - Engines must support a default confidenceThreshold of 0.0 on a range
>> of [0.0-1.0].
>>
>> - A 0.0 confidenceThreshold means that engines should aggressively
>> retain all speech candidates, limited only by the length of the nbest
>> list. How this is defined in practice, however, is still vendor
>> specific. Some engines may throw nomatch; other engines may never
>> throw nomatch with a 0.0 confidenceThreshold.
>>
>> {I would think that some engines might want to still generate nomatch
>> events on select classes of noise input even with a threshold of 0.0.}
>>
>> - When the confidenceThreshold is set to 0.5, nomatch should be thrown
>> when there are no speech candidates found with good/reasonable
>> confidence.
>>
>> {The developer can have a reasonable expectation that nomatch will be
>> thrown if there is no likely match, and a reasonable expectation that
>> nomatch will not be thrown if there is a likely match.
>> In other words, if nomatch is thrown, it's likely that the results (if
>> any) are garbage, and if nomatch is not thrown, it's likely that the
>> results are useful.}
>>
>> - Engines are free to meet the above requirements through internal
>> skewing.
>>
>> {Adjustments to this threshold could be made in either absolute terms
>> (eg recognizer.confidence = .72) or relative terms (eg
>> recognizer.confidence += .2).}
>>
>> {The confidence property can be read; the UA keeps track of the value
>> and sends it to the recognizer along with the recognition request.}
>>
>> 1) The reported confidence property on the
>> SpeechRecognitionAlternatives must report on a [0.0-1.0] scale.
>>
>> 2) If the UA is generating EMMA because the engine does not supply
>> EMMA, and if the confidence is included in EMMA, then it must be
>> identical to the alternative.confidence property(s). {If instead the
>> EMMA is generated by the engine, the UA should pass the EMMA through
>> verbatim... it's the engine's job to ensure that these two match, not
>> the UA's.}
>>
>> confidenceThreshold is monotonically increasing such that larger
>> values will return an equal or fewer number of results than lower
>> values.
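[Editor's note: the monotonicity point above is the kind of assertion a test harness can check directly. The sketch below is illustrative only; the candidate list and the simple filter are hypothetical stand-ins for an engine's internal behavior, not part of the proposed API.]

```javascript
// Illustration of the monotonicity assertion: against a fixed candidate
// set, a larger threshold never yields more results than a smaller one.
const candidates = [0.92, 0.81, 0.55, 0.40, 0.12]; // mock confidences

function resultsAtThreshold(threshold) {
  return candidates.filter((c) => c >= threshold);
}

// Sweep thresholds upward and verify the result count never increases.
let previousCount = Infinity;
for (let t = 0; t <= 1.0; t += 0.1) {
  const count = resultsAtThreshold(t).length;
  if (count > previousCount) throw new Error("monotonicity violated");
  previousCount = count;
}
```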
>> The only significant way in which I disagree with your description is
>> that I don't believe there is a significant benefit for developers in
>> specifying the following; in fact, I believe this can be detrimental
>> in some cases:
>>
>> 3) If the threshold is set to X, all alternative.confidence values
>> will be >= X.
>>
>> Doing so would have these disadvantages:
>>
>> - It would require remapping all the results.
>>
>> - It would require re-writing EMMA with the new results.
>>
>> - Nearly all developers who do process these results will simply be
>> comparing relative values; skewing the output could mask the
>> differences between alternatives.
>>
>> Based on all of the above, here's the specific wording I propose for
>> the spec:
>>
>> attribute float confidenceThreshold;
>>
>> - confidenceThreshold attribute - This attribute defines a threshold
>> for rejecting recognition results based on the estimated confidence
>> score that they are correct. The value of confidenceThreshold ranges
>> from 0.0 (least confidence) to 1.0 (most confidence), with 0.0 as the
>> default value. A 0.0 confidenceThreshold will aggressively return many
>> results, limited only by the length of the maxNBest parameter. It is
>> implementation-dependent whether onnomatch is ever fired when the
>> confidenceThreshold is 0.0.
>>
>> confidenceThreshold is monotonically increasing such that larger
>> values will return an equal or fewer number of results than lower
>> values. Also, with larger values of confidenceThreshold, onnomatch is
>> more likely, or just as likely, to be fired than with lower values.
>> Unlike maxNBest, there is no defined mapping between the value of the
>> threshold and how many results will be returned.
>>
>> If the confidenceThreshold is set to 0.5, the recognizer should
>> provide a good balance between firing onnomatch when it is unlikely
>> that any of the return values are correct and firing onresult instead
>> when it is likely that at least one return value is valid. The precise
>> behavior is implementation dependent, but it should provide a
>> reasonable mechanism that enables the developer to accept or reject
>> responses based on whether onnomatch fires.
>>
>> It is implementation dependent how confidenceThreshold is mapped, and
>> its relation (if any) to the confidence values returned in results.
>>
>> Glen Shires
>>
>> On Wed, Jun 6, 2012 at 2:54 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>> Inline…
>>
>> From: Glen Shires [mailto:gshires@google.com]
>> Sent: Wednesday, June 06, 2012 1:40 PM
>> To: Young, Milan
>> Cc: Satish S; public-speech-api@w3.org
>> Subject: Re: Confidence property
>>
>> Milan,
>> It seems we are converging on a solution; however, before I respond to
>> your proposal, I'd like some clarifications:
>>
>> [Milan] I expected you would like this proposal. It's my favorite of
>> the bunch so far as well.
>>
>> 1.
>> You wrote: "Engines must support a default confidence of 0.5"
>> and then: "The default threshold should be 0 which means accept all
>> candidates"
>>
>> So I presume you're proposing that your first sentence reads: "Engines
>> must support a default confidence of 0.0 on a range of [0.0-1.0]"
>>
>> If so, does this mean that there is no possibility of an onnomatch
>> event if the developer never sets confidence?
>> (Or that onnomatch only occurs if there are no possible results at
>> all, such as for complete silence?)
>>
>> [Milan] Silence should result in a timeout event of some sort
>> (commonly noinput). I mentioned that on this thread earlier, but
>> somehow it fell off the dashboard. I'll start a new thread.
>>
>> But to answer the main question, the "0.0" threshold means that
>> engines should aggressively retain all speech candidates, limited only
>> by the length of the nbest list. How this is defined in practice,
>> however, is still vendor specific. I would think that some engines
>> might want to still generate nomatch events on select classes of noise
>> input even with a threshold of "0.0".
>>
>> The only assertable point we could make here is that if nomatch events
>> are generated on a threshold of "0.0", then they must not contain an
>> interpretation property. This is in contrast to regular nomatch
>> events, which can contain an interpretation.
>>
>> 2.
>> I agree that defining that all engines must support the same
>> confidence value is very beneficial. It also means that the UA can
>> keep track of the setting (without a round trip to the recognizer),
>> which means that relative adjustments can be made using float values
>> (rather than strings). So do you agree with the following: in either
>> absolute terms (eg recognizer.confidence = .72) or relative terms (eg
>> recognizer.confidence += .2)?
>>
>> [Milan] Good catch.
>> Let's stay with floats and have the UA maintain the value.
>>
>> 3.
>> While I agree that all engines must support the same confidence value
>> (as an input to the recognizer), and that "engines are free to meet
>> the above requirement through internally skewing", I don't agree that
>> it is necessary, or even beneficial, to (as an output from the
>> recognizer) "ensure that all results are reported on the external
>> scale", because (a) nearly all developers who do process these results
>> will simply be comparing relative values, (b) skewing the output could
>> mask the differences between alternatives, and (c) it's extra overhead
>> to substitute all the output values.
>>
>> [Milan] Internally, all recognition engines that I know of must skew
>> in order to achieve a 0-1 range. The native scales are going to be a
>> function of grammar size, type (rule or statistical), and acoustic
>> modeling. If you ask around with the Google speech team, they are
>> probably going to tell you the same.
>>
>> But let's put aside that detail for now and focus on the observable
>> (assertable) upshots of my request:
>>
>> 1) The reported confidence property on the
>> SpeechRecognitionAlternatives must report on a 0-1 scale.
>>
>> 2) If confidence is included in EMMA, it must be identical to the
>> alternative.confidence property(s).
>>
>> 3) If the threshold is set to X, all alternative.confidence values
>> will be >= X.
>>
>> Can we agree on that?
>>
>> Thanks,
>> Glen Shires
>>
>> On Tue, Jun 5, 2012 at 11:41 AM, Young, Milan <Milan.Young@nuance.com> wrote:
>> One minor adjustment to the proposal below. The default threshold
>> should be 0, which means accept all candidates. This will provide a
>> better out-of-the-box experience across the largest range of grammars.
>> Power users who are concerned with performance/latency can adjust as
>> needed.
>>
>> Thanks
>>
>> From: Young, Milan
>> Sent: Tuesday, June 05, 2012 11:00 AM
>> To: 'Glen Shires'
>> Cc: Satish S; public-speech-api@w3.org
>> Subject: RE: Confidence property
>>
>> Glen,
>>
>> I suggest the needs of all groups would be best served by the
>> following new hybrid proposal:
>>
>> · Engines must support a default confidence of 0.5 on a range of
>> [0.0-1.0].
>>
>> · Engines are free to meet the above requirement through internally
>> skewing, but they must ensure that all results are reported on the
>> external scale. For example, if the developer sets a threshold of 0.8,
>> then no result should be returned with a score of less than 0.8.
>>
>> · Adjustments to this threshold could be made in either absolute terms
>> (eg recognizer.confidence = .72) or relative terms (eg
>> recognizer.confidence = "+.2"). The UA enforces syntax.
>>
>> · Relative adjustments that index out of bounds are silently
>> truncated.
>>
>> · The confidence property can be read, but applications that care
>> about latency could avoid the hit by keeping track of the value
>> themselves with a local shadow.
>>
>> Thoughts?
>>
>> From: Glen Shires [mailto:gshires@google.com]
>> Sent: Monday, June 04, 2012 7:23 PM
>> To: Young, Milan
>> Cc: Satish S; public-speech-api@w3.org
>> Subject: Re: Confidence property
>>
>> Milan,
>> I think we agree that different web developers have different needs:
>>
>> 1: Some web developers don't want to adjust confidence at all (they
>> just use the default value).
>>
>> 2: Some web developers want to adjust confidence in a
>> recognizer-independent manner (realizing performance will vary between
>> recognizers).
>>
>> 3: Some web developers want to
>> fine-tune confidence in a recognizer-specific manner (optimizing using
>> engine logs and tuning tools). If none of these specific recognizers
>> are available, their app will either not function, or function but
>> perform no confidence adjustments.
>>
>> 2.5: Some developers are a mix of 2 and 3: they want to fine-tune
>> confidence in a recognizer-specific manner for certain recognizers,
>> and for all other recognizers (such as when the recognizers of choice
>> are not available) they want to adjust confidence in a
>> recognizer-independent manner.
>>
>> I believe it's our job, in defining and in implementing the spec, to
>> make things work as well as possible for all 4 types of developers. I
>> believe the confidenceThresholdAdjustment proposal [1] accomplishes
>> this:
>>
>> 1: This first group doesn't use confidence.
>>
>> 2: For this second group, it enables adjusting confidence in the most
>> recognizer-independent manner that we know of.
>>
>> 3: For this third group, it allows precise, recognizer-specific
>> setting of confidence (so absolute confidence values obtained from
>> engine logs and tuning tools can be used directly) with just a trivial
>> bit more effort.
>>
>> 2.5: This group gains all the benefits of both 2 and 3.
>>
>> Our various proposals vary in two ways:
>>
>> - Whether the confidence is specified as an absolute value or a
>> relative value.
>>
>> - Whether there is any mapping to inflate/deflate ranges.
>>
>> Specifying the attribute as an absolute value and making it readable
>> entails major complications:
>>
>> - If a new recognizer is selected, its default threshold needs to be
>> retrieved, an operation that may have latency.
>> If the developer then reads the confidenceThreshold attribute, the
>> read can't stall until the threshold is retrieved (because it is
>> illegal for JavaScript to stall). Fixing this would require defining
>> an asynchronous event to indicate that the confidenceThreshold value
>> is now available to be read. All very messy for both the web developer
>> and the UA implementer.
>>
>> - The semantics are unclear and recognizer-dependent. If the developer
>> sets confidenceThreshold = 0.4, then selects a new recognizer (or
>> perhaps a new task or grammar), does the confidenceThreshold change?
>> When, and if so, how does the developer know to what value - does it
>> get reset to the recognizer's default? If not, what does 0.4 now mean
>> in this new context?
>>
>> In contrast, using a relative value has these advantages:
>>
>> - It avoids all issues of latency and asynchrony. The UA does not have
>> to inquire about the recognizer's default threshold value from the
>> [potentially remote] recognizer before the UA returns the value when
>> this JavaScript attribute is read.
>> Instead, the UA maintains the value of this attribute, and simply
>> sends it to the recognizer along with the recognition request.
>>
>> - It avoids all issues of threshold values changing due to changes in
>> the selected recognizer or task or grammar.
>>
>> Most importantly, from the point of view of web developers (group 2
>> and group 3), the advantages of using a relative value include:
>>
>> - Semantics are clear and simple.
>>
>> - The attribute is directly readable at any time, with no latency.
>>
>> - Changing the selected recognizer or task or grammar has no
>> unexpected effect: the relative value does not change.
>>
>> In addition, web developers in group 2 get the following benefits:
>>
>> - Developers can easily adjust the threshold for certain tasks. For
>> example, to confirm a transaction, the developer may increase the
>> threshold to be more stringent than the recognizer's default, e.g.
>> confidenceThresholdAdjustment = 0.3
>>
>> - Developers can adjust the threshold based on prior usage. For
>> example, if not getting enough (or any) results, they may bump down
>> the confidence to be more lenient, e.g.
>> confidenceThresholdAdjustment -= 0.1
>>
>> - (As Milan wrote, "I suggest the recognizer internally truncate on
>> the range" to saturate at the min/max values.)
>>
>> The only downside of this is that developers in group 3 (who are by
>> definition writing recognizer-specific code) must maintain an offset
>> for each recognizer they are specifically optimizing for.
>> For example, if the default confidence value is 0.7 for the recognizer
>> they're writing for, they simply write:
>>
>> recognizer.confidenceAdjustment = confidence - 0.7;
>>
>> or alternatively maintain a global that changes when they switch
>> recognizers:
>>
>> recognizer.confidenceAdjustment = confidence -
>>     defaultConfidenceOfCurrentRecognizer;
>>
>> or alternatively, create a JavaScript function:
>>
>> function SetConfidenceAbsolute(conf) {
>>   recognizer.confidenceAdjustment = conf - 0.7;
>> }
>>
>> The point being, there's a lot of very simple ways to handle this, all
>> very trivial, particularly when compared to the extensive effort
>> they're already investing to fine-tune confidence values for each
>> recognizer using engine logs or tuning tools. Further, the group 2.5
>> developers get the advantages of all of the above.
>>
>> For all these reasons, I believe that defining this as a relative
>> value is clearly preferable over an absolute value.
>>
>> The remaining question is whether there should also be some mapping,
>> or just a purely linear scale. I believe a trivial mapping is
>> preferable because it is very beneficial for group 2 and group 2.5
>> developers (because it provides a greater level of
>> recognizer-independent adjustment), and adds trivial overhead for
>> group 3 developers. For example, here's one method that allows group 3
>> developers to directly use absolute confidence values from engine logs
>> or tuning tools:
>>
>> function SetConfidenceAbsolute(conf) {
>>   var c = conf - 0.7;
>>   if (c > 0)
>>     recognizer.confidenceAdjustment = c / 0.3;
>>   else
>>     recognizer.confidenceAdjustment = c / 0.7;
>> }
>>
>> Here I'm assuming that 0.7 is the current recognizer's default confidence
This function linearly maps the values above 0.7 to between 0.0 and >> 1.0 and the values below 0.7 to between -1.0 and 0.0. Conversely, the >> un-mapping that the engine would have to do would be equally trivial:**** >> >> **** >> >> function MapConfidence(c) {**** >> >> if (c > 0)**** >> >> return c * 0.3 + 0.7;**** >> >> else**** >> >> return c * 0.7 + 0.7;**** >> >> }**** >> >> **** >> >> /Glen Shires**** >> >> **** >> >> [1] >> http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0000.html** >> ** >> >> **** >> >> **** >> >> **** >> >> On Mon, Jun 4, 2012 at 12:09 PM, Young, Milan <Milan.Young@nuance.com> >> wrote:**** >> >> Comments inline…**** >> >> **** >> >> *From:* Glen Shires [mailto:gshires@google.com] >> *Sent:* Friday, June 01, 2012 6:46 PM**** >> >> >> *To:* Young, Milan >> *Cc:* Satish S; public-speech-api@w3.org >> *Subject:* Re: Confidence property**** >> >> **** >> >> Milan,**** >> >> Can you please clarify your proposal:**** >> >> **** >> >> - Does it pass a string or a float to the recognizer?**** >> >> [Milan] String.**** >> >> **** >> >> - Can the developer inquire (read) the current confidence value? Is the >> value returned relative (with plus/minus prefix) or absolute? A string or a >> float?**** >> >> [Milan] Yes, the property could be read and it would return the absolute >> value. We would just document that if the recognizer is remote, this would >> trigger a trip to the server. Developers would choose whether the cost is >> worth the reward.**** >> >> **** >> >> - If the developer sets recognizer.confidence = “+.1”, then later sets >> recognizer.confidence = “+.2”, would the result be summed "+.3" or >> overwritten "+.2" ?**** >> >> [Milan] I figured they would be cumulative, but could be swayed.**** >> >> **** >> >> The main question is what to do with an out of bounds event (eg default >> value is 0.5 and developer sets +0.6). 
>> I suggest the recognizer internally truncate on the [0.0-1.0] range
>> (essentially a scaling operation similar to your proposal). The
>> important thing is that higher thresholds must always generate a
>> number of results <= that of lower thresholds.
>>
>> - Is there a defined range for the increments? (For example, is "+0.5"
>> valid? Is "+1.0" valid? Is "+10.0" valid?)
>>
>> [Milan] The UA would enforce syntax and limit the range to [-1.0,1.0].
>>
>> - It seems that what you are defining is an offset from a
>> recognizer-dependent default value, which seems very similar to the
>> confidenceThresholdAdjustment I propose. What are the advantages of
>> your proposal over the syntax I proposed?
>>
>> [Milan] Yes, the functionality from a developer perspective is
>> essentially the same. The advantages of my proposal:
>>
>> · Minimizes work on the engine side with the implementation of a
>> scaling system.
>>
>> · Confidence scores in the result have a direct correspondence to the
>> values pushed through the UA.
>>
>> · Tuning tools can continue to use the actual threshold instead of
>> having to special-case applications developed for HTML Speech.
>>
>> I disagree with your contention that the confidenceThresholdAdjustment
>> that I proposed "is just as recognizer-dependent as the much simpler
>> mechanism of just setting the value". Because the range is defined, a
>> confidenceThresholdAdjustment = 0.3 indicates, in a
>> recognizer-independent manner, that the confidence is substantially
>> greater than the recognizer's default, but still far from the maximum
>> possible setting.
In contrast, the meaning of recognizer.confidence = “+.3” may vary greatly; for example, the recognizer's default may be 0.2 (meaning the new setting is still nowhere near maximum confidence) or it may be 0.7 (meaning the new setting is the maximum confidence).

[Milan] All true, but at the end of the day, calling it an “adjustment” instead of a “threshold” doesn't add any testable assertions.

I agree that confidenceThresholdAdjustment is not perfect, but it's the most recognizer-independent solution I have seen to date, and I believe that the majority of web developers will be able to use it to accomplish the majority of tasks without resorting to any recognizer-dependent programming.

[Milan] I think this is the fundamental disconnect between us. A developer who sets an adjustment of 0.3 on recognizer A must not assume that behavior will be the same on recognizer B. If they want to support multiple engines, they must test on each engine and tune accordingly. Otherwise they risk undefined/incorrect behavior.

I also agree that for the subset of developers who want to fine-tune their application for specific recognizers by using engine logs and training tools, this introduces an abstraction. However, for this subset of developers, either of two simple solutions can be used: (a) the recognition vendor could provide the engine-specific mapping so that the developer can easily convert the values, or (b) the vendor could provide a recognizer-specific custom setting that overrides confidenceThresholdAdjustment.

[Milan] These work-arounds would be worth the cost if we were defining a truly recognizer-independent solution.
But since we are not, I view the proposal as a pointless exercise in semantic juggling.

I believe it's crucial that we define all attributes in the spec in a recognizer-independent manner, or at least recognizer-independent enough that most developers don't have to resort to recognizer-dependent programming. If there are attributes that cannot be defined in a recognizer-independent manner, then I believe such inherently recognizer-specific settings should be just that: recognizer-specific custom settings.

[Milan] I could point to hundreds of examples in W3C and IETF specifications where expected behavior is not 100% clear, and I assure you these ambiguities were not the product of careless editing. There is good reason and precedent behind the industry definition of confidence. Please don't throw the baby out with the bathwater.

Thanks,
Glen Shires

[Milan] Thank you too for keeping this discussion active.

On Fri, Jun 1, 2012 at 5:20 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Glen, it's clear that you put a lot of thought into trying to come up with a compromise. I appreciate the effort.

My contention, however, is that this new mechanism for manipulating confidence is just as recognizer-dependent as the much simpler mechanism of just setting the value. All you have done is precisely define a new term using existing terminology that has no precise definition. An “adjustment” of 0.3 doesn't have any more grounded or recognizer-independent meaning than a “threshold” of 0.3.

Furthermore, you've introduced yet another parameter to jiggle, and this will cause all sorts of headaches during the tuning phase.
That's because the engine, logged results, and training tools will all be based on absolute confidence thresholds, and the user will need to figure out how to map those absolute thresholds onto the relative scale. And they still need to perform this exercise independently for each engine.

One of the things I do like about your proposal is that it circumvents the need to read the confidence threshold before setting it in incremental mode. But this could just as easily be accomplished with syntax such as recognizer.confidence = “+.1”. If I added such a plus/minus prefix to my previous proposal, would you be satisfied?

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Friday, June 01, 2012 9:01 AM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

I propose the following definition:

attribute float confidenceThresholdAdjustment;

- confidenceThresholdAdjustment attribute - This attribute defines a relative threshold for rejecting recognition results based on the estimated confidence score that they are correct. The value of confidenceThresholdAdjustment ranges from -1.0 (least confidence) to 1.0 (most confidence), with 0.0 mapping to the default confidence threshold as defined by the recognizer. confidenceThresholdAdjustment is monotonically increasing, such that larger values will return an equal or fewer number of results than lower values. (Note that the confidence scores reported within the SpeechRecognitionResult and within the EMMA results use a 0.0 - 1.0 scale, and the correspondence between these scores and confidenceThresholdAdjustment may vary across UAs, recognition engines, and even from task to task.)
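Typical use of the attribute as defined above might look like the following sketch (the recognizer object and the nudgeAdjustment helper are illustrative assumptions, not part of the proposal):

```javascript
// Sketch of using confidenceThresholdAdjustment as defined above.
// The recognizer object here is a stand-in for illustration only.
var recognizer = { confidenceThresholdAdjustment: 0.0 };  // 0.0 = engine default

// Be more stringent for a high-stakes task (e.g. confirming a transaction):
recognizer.confidenceThresholdAdjustment = 0.3;

// Loosen the threshold after too many nomatch results, clamping so the
// value stays within the defined [-1.0, 1.0] range:
function nudgeAdjustment(r, delta) {
  r.confidenceThresholdAdjustment = Math.min(1.0,
      Math.max(-1.0, r.confidenceThresholdAdjustment + delta));
}
nudgeAdjustment(recognizer, -0.1);  // 0.3 -> approximately 0.2
```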
Unlike maxNBest, there is no defined mapping between the value of the threshold and how many results will be returned.

This definition has these advantages:

For web developers, it provides flexibility and simplicity in a recognizer-independent manner. It covers the vast majority of the ways in which developers use confidence values:

- Developers can easily adjust the threshold for certain tasks. For example, to confirm a transaction, the developer may increase the threshold to be more stringent than the recognizer's default, e.g. confidenceThresholdAdjustment = 0.3

- Developers can adjust the threshold based on prior usage. For example, if not getting enough (or any) results, he may bump down the confidence to be more lenient, e.g. confidenceThresholdAdjustment -= 0.1 (Developers should ensure they don't underflow/overflow the -1.0 to 1.0 range.)

- Developers can perform their own processing of the results by comparing confidence scores in the normal manner. (The confidence scores in the results use the recognizer's native scale, so they are not mapped or skewed, and so relative comparisons are not affected by "inflated" or "deflated" ranges.)

It provides clear semantics that are recognizer-independent:

- It avoids all latency and asynchrony issues. The UA does not have to inquire about the recognizer's default threshold value from the [potentially remote] recognizer before the UA returns the value when this JavaScript attribute is read.
Instead, the UA maintains the value of this attribute and simply sends it to the recognizer along with the recognition request.

- It avoids all issues of threshold values changing due to changes in the selected recognizer, task, or grammar.

- It allows recognition engines the freedom to define any mapping that is appropriate, and to use any internal default threshold value they choose (which may vary from engine to engine and/or from task to task).

The one drawback is that the confidenceThresholdAdjustment mapping may "require significant skewing of the range" and "squeeze" and "inflate". However, I see this as a minimal disadvantage, particularly when weighed against all the advantages above.

Earlier in this thread we looked at four different options [1]. This solution is a variation of option 1 in that list. All the other options in that list have significant drawbacks:

Option 2) Let speech recognizers define the default. This has these disadvantages:

- If a new recognizer is selected, its default threshold needs to be retrieved, an operation that may have latency. If the developer then reads the confidenceThreshold attribute, the read can't stall until the threshold is retrieved. Fixing this requires defining an asynchronous event to indicate that the confidenceThreshold value is now available to be read. All very messy for both the web developer and the UA implementer.

- The semantics are unclear and recognizer-dependent. If the developer sets confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a new task or grammar), does the confidenceThreshold change? When? And if so, how does the developer know to what value: does it get reset to the recognizer's default?
If not, what does 0.4 now mean in this new context?

Option 3) Make it write-only (not readable). This has these disadvantages:

- A developer must write recognizer-dependent code. Since he can't read the value, he can't increment/decrement it, so he must blindly set it. He must know what setting confidenceThreshold = 0.4 means for the current recognizer.

Thus I propose the solution above, with its many advantages and only a minor drawback.

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0051.html

On Wed, May 23, 2012 at 3:56 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> The benefit of minimizing deaf periods is therefore again recognizer specific

Most (all?) of the recognition engines which can be embedded within an HTML browser currently operate over a network. In fact, if you study the use cases, you'd find that the majority of those transactions are over a 3G network, which is notoriously latent.

It's possible that this may begin to change over the next few years, but it's surely not going to be within the lifetime of our 1.0 spec (at least I hope we can come to agreement before then :)).
Thus the problem can hardly be called engine-specific.

Yes, the semantics are unclear, but that wouldn't be any different from the quasi-standard which would undoubtedly emerge in the absence of a specification.

From: Satish S [mailto:satish@google.com]
Sent: Wednesday, May 23, 2012 6:28 AM
To: Young, Milan
Cc: public-speech-api@w3.org
Subject: Re: Confidence property

Hi Milan,

Summarizing the previous discussion, we have:

> Pros: 1) Aids efficient application design, 2) minimizes deaf periods, 3) avoids a proliferation of semi-standard custom parameters.
> Cons: 1) Semantics of the value are not precisely defined, and 2) novice users may not understand how confidence differs from maxnbest.
> My responses to the cons are: 1) Precedent from the speech industry, and 2) thousands of VoiceXML developers do understand the difference and will balk at an API that does not accommodate their needs.

This was well debated in the earlier thread, and it is clear that confidence threshold semantics are tied to the recognizer (not portable). The benefit of minimizing deaf periods is therefore again recognizer-specific and not portable. This is a well-suited use case for custom parameters, and I'd suggest we start with that.

> Thousands of VoiceXML developers do understand the difference and will balk at an API that does not accommodate their needs.

I hope we aren't trying to replicate VoiceXML in the browser. If it is indeed a must-have feature for web developers, we'll be receiving requests for it from them very soon, so it would be easy to discuss and add it in the future.
Received on Thursday, 14 June 2012 21:29:17 UTC