- From: Young, Milan <Milan.Young@nuance.com>
- Date: Fri, 15 Jun 2012 01:04:04 +0000
- To: Glen Shires <gshires@google.com>
- CC: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <B236B24082A4094A85003E8FFB8DDC3C1A473537@SOM-EXCH04.nuance.com>
You argue that there exists some recognizer that is NOT capable of giving a meaningful native interpretation to thresholds like '0.5'. I will accept that. You further suggest that these same recognizer(s) have some magic ability to transform these thresholds to something that IS meaningful. I will accept that too. Let's call that magic transformation webToInternal() and its inverse internalToWeb().

Without requiring this engine to expose internalToWeb(), a developer could set a threshold like "0.5" and get back a score like "0.1". If you were a developer, would that make sense to you? What practical use would you even have for such a number? It may as well be a Chinese character.

Wouldn't it be a lot more useful to developers, and consistent with mainstream engines, to simply require support for internalToWeb()? I'm sure folks that are capable of building something as complicated as a recognizer can solve a math equation. I'll even offer to include my phone number in the spec so that they can call me for help :).

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Thursday, June 14, 2012 4:39 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

Wow, that's wonderful news! If I'm interpreting this correctly, what you're saying is that, for at least all the recognizer implementations that you know and care about, these "recognizers are going to do their best to make thresholds like '0.5' as meaningful as possible" so that a threshold of 0.5 provides good balance for firing onnomatch.

That's great! That means that none of these recognizers need to do any mapping or skewing, and there's no need for Group 3 developers to do any "JS function mapping nonsense" for these recognizers. The wording in my "Proposal C" is such that no mapping or skewing is necessary for such recognizers. For these recognition vendors, and for developers that write applications that use these recognition vendors, life is simple.
Now let's suppose there are also a few recognizers out there that some developers care about that do not natively map a 0.5 threshold to provide a good balance. The wording in my "Proposal C" is such that mapping or skewing is necessary for such recognizers, and I believe this is a very good thing, in that it offers substantial benefits to Group 2 developers and no significant disadvantages to Group 1 or Group 3 developers. For these recognition vendors, and for developers that write applications that use these recognition vendors, life isn't quite so simple, but it's a whole lot better compared to any of the alternatives.

So can we agree on Proposal C?

* For all those recognizers that you know and care about, which provide a good balance for 0.5, there's no mapping or skewing, and all 3 groups of web developers are happy.
* For the set of recognizers that others may care about, which may not provide a good balance for 0.5, this is also a great solution that benefits all 3 groups of web developers.

Thanks
Glen

On Thu, Jun 14, 2012 at 3:26 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Let's assume:

* All recognizers are going to do their best to make thresholds like '0.5' as meaningful as possible. That's their job.
* There is some function called webToInternal() that maps [0-1] thresholds to an internal domain. That function may be extremely complex or extremely simple, but it exists in some form or another.

The fundamental issue between us is whether engines are also required to support an internalToWeb() function. Let's break this down:

Group 1 - Neutral
Group 2 - Some advantage. It's awkward to set a threshold of '0.5' and get back a result that has a score of '0.1' or '0.9'. Having this feature allows them to perform casual tuning.
Group 3 - Clear advantage over this JS function mapping nonsense.
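For readers following the webToInternal()/internalToWeb() argument, here is a minimal sketch of what such a pair might look like. It assumes a hypothetical engine whose balanced native score is 0.7 (the figure used in examples elsewhere in this thread) and a piecewise-linear mapping; NATIVE_PIVOT is an illustrative name, and a real engine could use any monotonic mapping.

```javascript
// Hypothetical: the native score at which this engine gives balanced
// nomatch behavior. 0.7 is only the example value used in this thread.
const NATIVE_PIVOT = 0.7;

// Map a web-facing threshold in [0, 1] onto the native scale so that
// a web value of 0.5 lands on NATIVE_PIVOT.
function webToInternal(web) {
  return web <= 0.5
    ? web * (NATIVE_PIVOT / 0.5)
    : NATIVE_PIVOT + (web - 0.5) * ((1 - NATIVE_PIVOT) / 0.5);
}

// Inverse mapping: report a native score on the web-facing scale.
function internalToWeb(internal) {
  return internal <= NATIVE_PIVOT
    ? internal * (0.5 / NATIVE_PIVOT)
    : 0.5 + (internal - NATIVE_PIVOT) * (0.5 / (1 - NATIVE_PIVOT));
}
```

With both directions available, a score reported through internalToWeb() can be compared directly against the threshold the developer set, which is the property being asked for here.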
So the real question boils down to: Are the advantages to groups 2 and 3 enough to warrant requesting recognizers to support the internalToWeb() function? On this point I'll admit that I have a bias, because *every* enterprise grade recognizer already has this feature.

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Thursday, June 14, 2012 2:28 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

To clarify: with Proposal C, Group 3 developers do NOT have to translate any scores and they can use "direct import".

On Thu, Jun 14, 2012 at 2:20 PM, Glen Shires <gshires@google.com> wrote:

Yes, good suggestion. Looking at how the proposals affect these three groups of web developers is a great way to evaluate them. Here's how my proposal affects these three groups:

Analysis:

Group 1: No advantage or disadvantage.

Group 2: This is a perfect solution. These developers only care about setting the input confidenceThreshold. Since 0.0 is the default confidenceThreshold, if they want to get reasonable rejection behavior for nomatch, they can simply do:

    recognizer.confidenceThreshold = 0.5

Since the confidenceThreshold is skewed, they get reasonable nomatch behavior with 0.5, which enables them to write recognizer independent code (at least to some extent). If they're still getting too many results, they can simply increment or decrement this value, again as recognizer independent code. Conversely, if the confidenceThreshold were not skewed and instead used native recognizer confidence values, there would be no way for these developers to write recognizer independent code and get reasonable rejection behavior for nomatch. Group 2 developers don't process or analyze the output confidence values (in results or emma), so they don't care whether these output values are skewed or not. Summary: This proposal offers big benefits for Group 2.
Conversely, not having skewing would be a major hindrance for Group 2 developers because they couldn't reliably use nomatch behavior.

Group 3: This is a perfect solution. These developers want to use native recognizer confidence values for both input (setting the threshold) and output (processing the results in SpeechRecognitionAlternative.confidence or SpeechRecognitionResult.emma). They don't want any skewing that can complicate things, and this solution allows them to only use native values everywhere; they never have to worry about skewing. The only thing they have to do is cut and paste a simple JavaScript function (which I presume most recognizer vendors would gladly post on their website) into their code. For example, they could simply cut and paste the following:

    function SetNativeConfidenceThreshold(conf) {
      // conf is a native score; native 0.7 corresponds to 0.5 on the skewed scale
      if (conf < 0.7)
        recognizer.confidenceThreshold = conf / 1.4;
      else
        recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
    }

Now, all the Group 3 developer has to do to set the confidence threshold using a native confidence value is:

    SetNativeConfidenceThreshold(value);

Copying-and-pasting a short function is a trivial amount of effort, particularly when compared to all the effort that Group 3 is doing by definition to review, tune, process and tweak confidence values. That is, this proposal has trivial impact on the effort required for Group 3 developers. Summary: This proposal has virtually no impact on Group 3 developers. In contrast, a proposal that skews the results and emma confidence value would have a major, negative impact on Group 3 developers.

Now, to compare how all three proposals affect these groups, let's label them:

Proposal A: No skewing. Use native recognizer confidence values for both input (confidenceThreshold) and output (results and emma)
Proposal B: Skew both input (confidenceThreshold) and output (results and emma) in the same manner.
Proposal C: (my proposal) Skew only input (confidenceThreshold).
Use native recognizer confidence values for output (results and emma).

Group 1: All proposals are fine, they provide no advantage or disadvantage.
Group 2: Proposal A is problematic. Proposal B and C both provide a huge advantage.
Group 3: Proposal B is problematic. Proposal A and C both provide a huge advantage.

The intersection of these is Proposal C: it provides huge advantages and is not problematic for any of the 3 groups of developers.

Glen

On Thu, Jun 14, 2012 at 12:23 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Let's try this another way. The most obvious/simplest solution is to report results on the same scale as the threshold. Can we agree on that? Assuming yes, then we should only entertain alternate/complicated suggestions if there is a clear and significant advantage. Let's break down this analysis to the three target audiences we've used before:

1) Developers who just want the default behavior.
2) Developers who think confidence is a neat feature, but they do not run offline experiments or have any preference for a speech engine. This class will probably either use incremental adjustments to the threshold or pick round numbers like ".5" as arbitrary thresholds. They are aware that confidence thresholds do not mean the same thing to different engines, but they do know: i) By default they get all results, and ii) If they want to limit the number of results they should use larger thresholds.
3) Power developers that either run offline experiments or have a port of the application on some other modality (e.g. IVR). These developers leave nothing to chance, and have a custom confidence score for each application state. If they do support multiple engines, each engine will have a distinct set of thresholds.

Analysis:

Group 1: No advantage or disadvantage.
Group 2: There could be some advantage here, but as of yet I do not see it. Please make your case.
Group 3: Your solution is a disadvantage because they must translate the scores on a per recognizer basis. These developers would prefer to use the much simpler solution of a direct import.

From: Glen Shires [mailto:gshires@google.com]
Sent: Thursday, June 14, 2012 12:19 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

Perhaps a more intuitive name for that wrapper function would be SetNativeConfidenceThreshold. Also, I realize the logic was wrong as it used a different scale. In this wrapper function, both the input (conf) and the output (recognizer.confidenceThreshold) use a 0.0 - 1.0 scale. For example, the following works for when recognizer.confidenceThreshold of 0.5 is skewed to a native value of 0.7.

    function SetNativeConfidenceThreshold(conf) {
      // native [0.0, 0.7) maps to [0.0, 0.5); native [0.7, 1.0] maps to [0.5, 1.0]
      if (conf < 0.7)
        recognizer.confidenceThreshold = conf / 1.4;
      else
        recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
    }

On Thu, Jun 14, 2012 at 11:43 AM, Glen Shires <gshires@google.com> wrote:

Yes, the confidenceThreshold is on a 0.0 - 1.0 scale.
Yes, the confidence reported in the results is on a 0.0 - 1.0 scale.
Yes, the confidence reported in the EMMA is on a 0.0 - 1.0 scale.

What I am saying is that:

- The recognizer may skew the confidenceThreshold such that 0.5 maps to something reasonable for nomatch.
- The recognizer is not required to skew the reported results or EMMA results. (The recognizer may skew them, or it may not.)

Simply put: the input must be skewed for 0.5; the output is not required to be skewed in a similar manner.

I've added additional comments inline below...

On Thu, Jun 14, 2012 at 11:04 AM, Young, Milan <Milan.Young@nuance.com> wrote:

I requested "If the threshold is set to X, all alternative.confidence values will be >= X."
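The requested property is a testable assertion. A minimal sketch, assuming each result alternative is a plain object with a numeric confidence field (mirroring SpeechRecognitionAlternative.confidence); the function name and the sample values are hypothetical:

```javascript
// Returns true when every alternative's confidence is at or above the
// threshold that was set, i.e. results are on the same scale as the input.
function allAtOrAboveThreshold(alternatives, threshold) {
  return alternatives.every((alt) => alt.confidence >= threshold);
}

// Made-up sample data for illustration only.
const sampleAlternatives = [
  { transcript: "call home", confidence: 0.82 },
  { transcript: "call Rome", confidence: 0.61 },
];

const consistentAtHalf = allAtOrAboveThreshold(sampleAlternatives, 0.5);
const consistentAtPointSeven = allAtOrAboveThreshold(sampleAlternatives, 0.7);
```

Under Proposal C this check is not guaranteed to pass, because only the input threshold is skewed; that is the point of disagreement in the exchange that follows.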
I'd like to address your listed disadvantages: [Glen] It would require remapping all the results [Milan] Every modern recognizer that I know of is capable of reporting results on a [0-1] scale. That's really the only relevant requirement to this part of the request. Which alternate scale are you suggesting? [Glen] Scale remains 0.0 - 1.0. [Glen] It would require re-writing EMMA with the new results. [Milan] EMMA is natively on a [0-1] scale. [Glen] Yes, scale remains 0.0 - 1.0 [Glen] Nearly all developers who do process these results will simply be comparing relative values, skewing the output could mask the differences between alternatives. [Milan] A significant portion of developers and a *majority* of consumers will be using absolute thresholds derived from offline tuning experiments. Let's address the "skew" part of your statement as part of the first question/response. [Glen] I believe my proposal is particularly advantageous for these developers and customers. Most likely their offline tuning experiments will be using backend logs and these backend logs use the recognizer's native 0.0 - 1.0 confidence scale (not a skewed scale). In fact, some customers may have multiple applications/implementations (not just those using a browser with our Speech Javascript API) and/or may have prior experience with other applications, certainly these tuning experiments or logs would be using the recognizer's native 0.0 - 1.0 confidence scale (not a skewed scale). So the advantage these developers and customers have is that all the tuning data and logs they have ever gathered over years and multiple applications, all use, and continue to use, the same native scale. When they write Javascript code to process results with the Speech Javascript API, they continue to use the same native scale. 
The only thing that these developers and customers must do to use these results directly in their Javascript code is to set confidenceThreshold through a simple Javascript wrapper function. For example, that wrapper function might look like the following. Recognizer vendors may wish to document a suggested wrapper function like this, so that their developers and customers can tune applications without any additional effort or skewing concerns.

    function SetConfidenceAbsolute(conf) {
      var c = conf - 0.7;
      if (c > 0)
        recognizer.confidenceAdjustment = c / 0.3;
      else
        recognizer.confidenceAdjustment = c / 0.7;
    }

Thanks,
Glen

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Wednesday, June 13, 2012 11:09 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

Milan,

Great, I believe we are almost fully in agreement. Here are the key points that I think should be in the specification. Most of this is your wording. {The portions in curly braces are things I agree with, but that I don't think need to be specified in the spec.}

- Engines must support a default confidenceThreshold of 0.0 on a range of [0.0-1.0].
- A 0.0 confidenceThreshold means that engines should aggressively retain all speech candidates limited only by the length of the nbest list. How this is defined in practice, however, is still vendor specific. Some engines may throw nomatch, other engines may never throw nomatch with a 0.0 confidenceThreshold. {I would think that some engines might want to still generate nomatch events on select classes of noise input even with a threshold of 0.0.}
- When the confidenceThreshold is set to 0.5, nomatch should be thrown when there are no speech candidates found with good/reasonable confidence.
{The developer can have a reasonable expectation that nomatch will be thrown if there is no likely match, and have reasonable expectation that nomatch will not be thrown if there is a likely match. In other words, if nomatch is thrown, it's likely that any results (if any) are garbage, and if nomatch is not thrown, it's likely that the results are useful.}
- Engines are free to meet the above requirements through internal skewing.

{Adjustments to this threshold could be made in either absolute terms (eg recognizer.confidence = .72) or relative terms (eg recognizer.confidence += .2).}
{The confidence property can be read, the UA keeps track of the value and sends it to the recognizer along with the recognition request.}

1) The reported confidence property on the SpeechRecognitionAlternatives must report on a [0.0-1.0] scale
2) If the UA is generating EMMA because the engine does not supply EMMA, and if the confidence is included in EMMA, then it must be identical to the alternative.confidence property(s). {If instead, the EMMA is generated by the engine, the UA should pass the EMMA through verbatim...it's the engine's job to ensure that these two match, not the UA's.}

confidenceThreshold is monotonically increasing such that larger values will return an equal or fewer number of results than lower values.

The only significant way in which I disagree with your description is that I don't believe there is a significant benefit for developers in specifying the following; in fact I believe this can be detrimental in some cases:

3) If the threshold is set to X, all alternative.confidence values will be >= X.

Doing so would have these disadvantages:

- It would require remapping all the results
- It would require re-writing EMMA with the new results
- Nearly all developers who do process these results will simply be comparing relative values, skewing the output could mask the differences between alternatives.
Based on all of the above, here's the specific wording I propose for the spec:

attribute float confidenceThreshold;

- confidenceThreshold attribute - This attribute defines a threshold for rejecting recognition results based on the estimated confidence score that they are correct. The value of confidenceThreshold ranges from 0.0 (least confidence) to 1.0 (most confidence), with 0.0 as the default value. A 0.0 confidenceThreshold will aggressively return many results limited only by the length of the maxNBest parameter. It is implementation-dependent whether onnomatch is ever fired when the confidenceThreshold is 0.0.

confidenceThreshold is monotonically increasing such that larger values will return an equal or fewer number of results than lower values. Also, with larger values of confidenceThreshold, onnomatch is more likely, or just as likely, to be fired than with lower values. Unlike maxNBest, there is no defined mapping between the value of the threshold and how many results will be returned.

If the confidenceThreshold is set to 0.5, the recognizer should provide a good balance between firing onnomatch when it is unlikely that any of the return values are correct and firing onresult instead when it is likely that at least one return value is valid. The precise behavior is implementation dependent, but it should provide a reasonable mechanism that enables the developer to accept or reject responses based on whether onnomatch fires.

It is implementation dependent how confidenceThreshold is mapped, and its relation (if any) to the confidence values returned in results.

Glen Shires

On Wed, Jun 6, 2012 at 2:54 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Inline...
From: Glen Shires [mailto:gshires@google.com]
Sent: Wednesday, June 06, 2012 1:40 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

Milan,

It seems we are converging on a solution, however before I respond to your proposal, I'd like some clarifications:

[Milan] I expected you would like this proposal. It's my favorite of the bunch so far as well.

1. You wrote: "Engines must support a default confidence of 0.5" and then: "The default threshold should be 0 which means accept all candidates". So I presume you're proposing that your first sentence reads: "Engines must support a default confidence of 0.0 on a range of [0.0-1.0]". If so, does this mean that there is no possibility of an onnomatch event if the developer never sets confidence? (Or that onnomatch only occurs if there are no possible results at all, such as for complete silence?)

[Milan] Silence should result in a timeout event of some sort (commonly noinput). I mentioned that on this thread earlier, but somehow it fell off the dashboard. I'll start a new thread. But to answer the main question, the "0.0" threshold means that engines should aggressively retain all speech candidates limited only by the length of the nbest list. How this is defined in practice, however, is still vendor specific. I would think that some engines might want to still generate nomatch events on select classes of noise input even with a threshold of "0.0". The only assertable point we could make here is that if nomatch events are generated on a threshold of "0.0", then they must not contain an interpretation property. This is in contrast to regular nomatch events which can contain an interpretation.

2. I agree, defining that all engines must support the same confidence value is very beneficial.
It also means that the UA can keep track of the setting (without a round trip to the recognizer), which means that relative adjustments can be made using float values (rather than strings). So do you agree with the following: in either absolute terms (eg recognizer.confidence = .72) or relative terms (eg recognizer.confidence += .2) [Milan] Good catch. Let's stay with floats and have the UA maintain the value. 3. While I agree that all engines must support the same confidence value (as an input to the recognizer), and that "engines are free to meet the above requirement through internally skewing", I don't agree that it is necessary, or even beneficial, to (as an output from the recognizer) "ensure that all results are reported on the external scale", because (a) nearly all developers who do process these results will simply be comparing relative values, (b) skewing the output could mask the differences between alternatives, (c) it's extra overhead to substitute all the output values. [Milan] Internally, all recognition engines that I know of must skew in order to achieve a 0-1 range. The native scales are going to be a function of grammar size, type (rule or statistical), and acoustic modeling. If you ask around with the Google speech team, they are probably going to tell you the same. But let's put aside that detail for now and focus on the observable (assertable) upshots of my request: 1) The reported confidence property on the SpeechRecognitionAlternatives must report on a 0-1 scale 2) If confidence is included in EMMA, it must be identical to the alternative.confidence property(s). 3) If the threshold is set to X, all alternative.confidence values will be >= X. Can we agree on that? Thanks, Glen Shires On Tue, Jun 5, 2012 at 11:41 AM, Young, Milan <Milan.Young@nuance.com<mailto:Milan.Young@nuance.com>> wrote: One minor adjustment to the proposal below. The default threshold should be 0 which means accept all candidates. 
This will provide a better out of the box experience across the largest range of grammars. Power users who are concerned with performance/latency can adjust as needed. Thanks From: Young, Milan Sent: Tuesday, June 05, 2012 11:00 AM To: 'Glen Shires' Cc: Satish S; public-speech-api@w3.org<mailto:public-speech-api@w3.org> Subject: RE: Confidence property Glen, I suggest the needs of all groups would be best served by the following new hybrid proposal: * Engines must support a default confidence of 0.5 on a range of [0.0-1.0]. * Engines are free to meet the above requirement through internally skewing, but they must ensure that all results are reported on the external scale. For example, if the developer sets a threshold of 0.8, then no result should be returned with a score of less than 0.8. * Adjustments to this threshold could be made in either absolute terms (eg recognizer.confidence = .72) or relative terms (eg recognizer.confidence = "+.2"). The UA enforces syntax. * Relative adjustments that index out of bounds are silently truncated. * The confidence property can be read, but applications that care about latency could avoid the hit by keeping track of the value themselves with a local shadow. Thoughts? From: Glen Shires [mailto:gshires@google.com<mailto:gshires@google.com>] Sent: Monday, June 04, 2012 7:23 PM To: Young, Milan Cc: Satish S; public-speech-api@w3.org<mailto:public-speech-api@w3.org> Subject: Re: Confidence property Milan, I think we agree that different web developers have different needs: 1: Some web developers don't want to adjust confidence at all (they just use the default value). 2: Some web developers want to adjust confidence in a recognizer-independent manner (realizing performance will vary between recognizers). 3: Some web developers want to fine-tune confidence in a recognizer-specific manner (optimizing using engine logs and tuning tools). 
If none of these specific recognizers are available, their app will either not function, or function but perform no confidence adjustments.
2.5: Some developers are a mix of 2 and 3: they want to fine-tune confidence in a recognizer-specific manner for certain recognizers, and for all other recognizers (such as when the recognizers of choice are not available) they want to adjust confidence in a recognizer-independent manner.

I believe it's our job, in defining and in implementing the spec, to make things work as well as possible for all 4 types of developers. I believe the confidenceThresholdAdjustment proposal [1] accomplishes this:

1: This first group doesn't use confidence.
2: For this second group, it enables adjusting confidence in the most recognizer-independent manner that we know of.
3: For this third group, it allows precise, recognizer-specific setting of confidence (so absolute confidence values obtained from engine logs and tuning tools can be used directly) with just a trivial bit more effort.
2.5: This group gains all the benefits of both 2 and 3.

Our various proposals vary in two ways:

- Whether the confidence is specified as an absolute value or a relative value.
- Whether there is any mapping to inflate/deflate ranges.

Specifying the attribute as an absolute value and making it readable entails major complications:

- If a new recognizer is selected, its default threshold needs to be retrieved, an operation that may have latency. If the developer then reads the confidenceThreshold attribute, the read can't stall until the threshold is read (because it is illegal for JavaScript to stall). Fixing this would require defining an asynchronous event to indicate that the confidenceThreshold value is now available to be read. All very messy for both the web developer and the UA implementer.
- The semantics are unclear and recognizer-dependent.
If the developer sets the confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a new task or grammar), does the confidenceThreshold change? When, and if so, how does the developer know to what value - does it get reset to the recognizer's default? If not, what does 0.4 now mean in this new context?

In contrast, using a relative value has these advantages:

- It avoids all latency and asynchrony issues. The UA does not have to inquire the recognizer's default threshold value from the [potentially remote] recognizer before the UA returns the value when this JavaScript attribute is read. Instead, the UA maintains the value of this attribute, and simply sends it to the recognizer along with the recognition request.
- It avoids all issues of threshold values changing due to changes in the selected recognizer or task or grammar.

Most importantly, from the point of view of web developers (group 2 and group 3), the advantages of using a relative value include:

- Semantics are clear and simple.
- The attribute is directly readable at any time, with no latency.
- Changing the selected recognizer or task or grammar has no unexpected effect: the relative value does not change.

In addition, web developers in group 2 get the following benefits:

- Developers can easily adjust the threshold for certain tasks. For example, to confirm a transaction, the developer may increase the threshold to be more stringent than the recognizer's default, e.g. confidenceThresholdAdjustment = 0.3
- Developers can adjust the threshold based on prior usage. For example, if not getting enough (or any) results, they may bump down the confidence to be more lenient, e.g: confidenceThreshold -= 0.1
- (As Milan wrote "I suggest the recognizer internally truncate on the range" to saturate at the min/max values.)
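The saturating behavior mentioned in the last bullet can be sketched as below. This is only an illustration under the assumption that the effective threshold lives on a [0.0, 1.0] scale and that out-of-range results are silently truncated; applyAdjustment is a hypothetical helper, not part of any proposed API.

```javascript
// Apply a relative adjustment to a threshold and saturate at the ends of
// the [0.0, 1.0] range instead of raising an error.
function applyAdjustment(threshold, delta) {
  return Math.min(1.0, Math.max(0.0, threshold + delta));
}

// Example: a default of 0.5 with an adjustment of +0.6 saturates at the maximum.
const saturated = applyAdjustment(0.5, 0.6);
```

Because the result never leaves the valid range, repeated relative adjustments stay well defined even when a developer overshoots.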
The only downside for this is that developers in group 3 (who are by definition writing recognizer-specific code) must maintain an offset for each recognizer they are specifically optimizing for. For example, if the default confidence value is 0.7 for the recognizer they're writing for, they simply write: recognizer.confidenceAdjustment = confidence - 0.7; or alternatively maintain a global that changes when they switch recognizers: recognizer.confidenceAdjustment = confidence - defaultConfidenceOfCurrentRecognizer; or alternatively, create a JavaScript function: function SetConfidenceAbsolute(conf) { recognizer.confidenceAdjustment = conf - 0.7; } The point being, there's a lot of very simple ways to handle this, all very trivial, particularly when compared to the extensive effort they're already investing to fine-tune confidence values for each recognizer using engine logs or tuning tools. Further, the group 2.5 developers get the advantages of all of the above. For all these reasons, I believe that defining this as a relative value is clearly preferable over an absolute value. The remaining question is whether there should also be some mapping, or just a purely linear scale. I believe a trivial mapping is preferable because it is very beneficial for group 2 and group 2.5 developers (because it provides a greater level of recognizer-independent adjustment), and adds trivial overhead for group 3 developers. For example, here's one method that allows group 3 developers to directly use absolute confidence values from engine logs or tuning tools: function SetConfidenceAbsolute(conf) { var c = conf - 0.7; if (c > 0) recognizer.confidenceAdjustment = c / 0.3; else recognizer.confidenceAdjustment = c / 0.7; } Here I'm assuming that 0.7 is the current recognizer's default confidence value. This function linearly maps the values above 0.7 to between 0.0 and 1.0 and the values below 0.7 to between -1.0 and 0.0. 
Conversely, the un-mapping that the engine would have to do would be equally trivial:

    function MapConfidence(c) {
      if (c > 0)
        return c * 0.3 + 0.7;
      else
        return c * 0.7 + 0.7;
    }

/Glen Shires

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0000.html

On Mon, Jun 4, 2012 at 12:09 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Comments inline...

From: Glen Shires [mailto:gshires@google.com]
Sent: Friday, June 01, 2012 6:46 PM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

Milan,

Can you please clarify your proposal:

- Does it pass a string or a float to the recognizer?
[Milan] String.

- Can the developer inquire (read) the current confidence value? Is the value returned relative (with plus/minus prefix) or absolute? A string or a float?
[Milan] Yes, the property could be read and it would return the absolute value. We would just document that if the recognizer is remote, this would trigger a trip to the server. Developers would choose whether the cost is worth the reward.

- If the developer sets recognizer.confidence = "+.1", then later sets recognizer.confidence = "+.2", would the result be summed "+.3" or overwritten "+.2" ?
[Milan] I figured they would be cumulative, but could be swayed. The main question is what to do with an out of bounds event (eg default value is 0.5 and developer sets +0.6). I suggest the recognizer internally truncate on the [0.0-1.0] range (essentially a scaling operation similar to your proposal). The important thing is that higher thresholds must always generate <= number of results than lower thresholds.

- Is there a defined range for the increments? (Example, is "+0.5" valid? is "+1.0" valid? is "+10.0" valid?)
[Milan] The UA would enforce syntax and limit the range to [-1.0,1.0].
- It seems that what you are defining is an offset from a recognizer-dependent default value, which seems very similar to the confidenceThresholdAdjustment I propose. What are the advantages of your proposal over the syntax I proposed?
[Milan] Yes, the functionality from a developer perspective is essentially the same. The advantages of my proposal:
* Minimize work on the engine side with the implementation of a scaling system.
* Confidence scores in the result have a direct correspondence to the values pushed through the UA.
* Tuning tools can continue to use the actual threshold instead of having to special case applications developed for HTML Speech.

I disagree with your contention that the confidenceThresholdAdjustment that I proposed "is just as recognizer-dependent as the much simpler mechanism of just setting the value". Because the range is defined, a confidenceThresholdAdjustment = 0.3 indicates, in a recognizer-independent manner, that the confidence is substantially greater than the recognizer's default, but still far from the maximum possible setting. In contrast, the meaning of recognizer.confidence = "+.3" may vary greatly; for example, the recognizer's default may be 0.2 (meaning the new setting is still nowhere near maximum confidence) or it may be 0.7 (meaning the new setting is the maximum confidence).
[Milan] All true, but at the end of the day calling it an "adjustment" instead of a "threshold" doesn't add any testable assertions.

I agree that confidenceThresholdAdjustment is not perfect, but it's the most recognizer-independent solution I have seen to date, and I believe that the majority of web developers will be able to use it to accomplish the majority of tasks without resorting to any recognition-dependent programming.
[Milan] I think this is the fundamental disconnect between us. A developer who sets an adjustment of 0.3 on recognizer A must not assume that behavior will be the same on recognizer B.
If they want to support multiple engines they must test on each engine and tune accordingly. Otherwise they risk undefined/incorrect behavior.

I also agree that for the subset of developers that want to fine-tune their application for specific recognizers by using engine logs and training tools, this introduces an abstraction. However, for this subset of developers, either of two simple solutions can be used: (a) the recognition vendor could provide the engine-specific mapping so that the developer can easily convert the values, or (b) the vendor could provide a recognizer-specific custom setting that overrides confidenceThresholdAdjustment.
[Milan] These workarounds would be worth the cost if we were defining a truly recognizer-independent solution. But since we are not, I view the proposal as a pointless exercise in semantic juggling.

I believe it's crucial that we define all attributes in the spec in a recognizer-independent manner, or at least recognizer-independent enough that most developers don't have to resort to recognizer-dependent programming. If there are attributes that cannot be defined in a recognizer-independent manner, then I believe such inherently recognizer-specific settings should be just that, recognizer-specific custom settings.
[Milan] I could point to 100s of examples in W3C and IETF specifications where expected behavior is not 100% clear and I assure you these ambiguities were not the product of careless editing. There is good reason and precedent behind the industry definition of confidence. Please don't throw the baby out with the bathwater.

Thanks,
Glen Shires
[Milan] Thank you too for keeping this discussion active.

On Fri, Jun 1, 2012 at 5:20 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Glen, it's clear that you put a lot of thought into trying to come up with a compromise. I appreciate the effort.
My contention, however, is that this new mechanism for manipulating confidence is just as recognizer dependent as the much simpler mechanism of just setting the value. All you have done is precisely define a new term using existing terminology that has no precise definition. An "adjustment" of 0.3 doesn't have any more of a grounded or recognizer-independent meaning than a "threshold" of 0.3.

Furthermore, you've introduced yet another parameter to jiggle, and this will cause all sorts of headaches during the tuning phase. That's because the engine, logged results, and training tools will all be based on absolute confidence thresholds, and the user will need to figure out how to map those absolute thresholds onto the relative scale. And they still need to perform this exercise independently for each engine.

One of the things I do like about your proposal is that it circumvents the need to read the confidence threshold before setting it in incremental mode. But this could just as easily be accomplished with syntax such as recognizer.confidence = "+.1". If I added such a plus/minus prefix to my previous proposal would you be satisfied?

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Friday, June 01, 2012 9:01 AM
To: Young, Milan
Cc: Satish S; public-speech-api@w3.org
Subject: Re: Confidence property

I propose the following definition:

    attribute float confidenceThresholdAdjustment;

- confidenceThresholdAdjustment attribute - This attribute defines a relative threshold for rejecting recognition results based on the estimated confidence score that they are correct. The value of confidenceThresholdAdjustment ranges from -1.0 (least confidence) to 1.0 (most confidence), with 0.0 mapping to the default confidence threshold as defined by the recognizer. confidenceThresholdAdjustment is monotonically increasing such that larger values will return an equal or fewer number of results than lower values.
(Note that the confidence scores reported within the SpeechRecognitionResult and within the EMMA results use a 0.0 - 1.0 scale, and the correspondence between these scores and confidenceThresholdAdjustment may vary across UAs, recognition engines, and even task to task.) Unlike maxNBest, there is no defined mapping between the value of the threshold and how many results will be returned.

This definition has these advantages:

For web developers, it provides flexibility and simplicity in a recognizer-independent manner. It covers the vast majority of the ways in which developers use confidence values:

- Developers can easily adjust the threshold for certain tasks. For example, to confirm a transaction, the developer may increase the threshold to be more stringent than the recognizer's default, e.g. confidenceThresholdAdjustment = 0.3

- Developers can adjust the threshold based on prior usage. For example, if not getting enough (or any) results, they may bump down the confidence to be more lenient, e.g. confidenceThresholdAdjustment -= 0.1 (Developers should ensure they don't underflow/overflow the -1.0 - 1.0 scale.)

- Developers can perform their own processing of the results by comparing confidence scores in the normal manner. (The confidence scores in the results use the recognizer's native scale, so they are not mapped or skewed and so relative comparisons are not affected by "inflated" or "deflated" ranges.)

It provides clear semantics that are recognizer-independent:

- It avoids all latency and asynchrony issues. The UA does not have to retrieve the recognizer's default threshold value from the [potentially remote] recognizer before the UA returns the value when this JavaScript attribute is read. Instead, the UA maintains the value of this attribute, and simply sends it to the recognizer along with the recognition request.

- It avoids all issues of threshold values changing due to changes in the selected recognizer or task or grammar.
- It allows recognition engines the freedom to define any mapping that is appropriate, and use any internal default threshold value they choose (which may vary from engine to engine and/or from task to task).

The one drawback is that the confidenceThresholdAdjustment mapping may "require significant skewing of the range" and "squeeze" and "inflate". However, I see this as a minimal disadvantage, particularly when weighed against all the advantages above.

Earlier in this thread we looked at four different options [1]. This solution is a variation of option 1 in that list. All the other options in that list have significant drawbacks:

Option 2) Let speech recognizers define the default: has these disadvantages:

- If a new recognizer is selected, its default threshold needs to be retrieved, an operation that may have latency. If the developer then reads the confidenceThreshold attribute, the read can't stall until the threshold is read. Fixing this requires defining an asynchronous event to indicate that the confidenceThreshold value is now available to be read. All very messy for both the web developer and the UA implementer.

- The semantics are unclear and recognizer-dependent. If the developer sets confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a new task or grammar), does the confidenceThreshold change? If so, when, and how does the developer know to what value? Does it get reset to the recognizer's default? If not, what does 0.4 now mean in this new context?

Option 3) Make it write-only (not readable): has these disadvantages:

- A developer must write recognizer-dependent code. Since he can't read the value, he can't increment/decrement it, so he must blindly set it. He must know what setting confidenceThreshold = 0.4 means for the current recognizer.

Thus I propose the solution above, with its many advantages and only a minor drawback.
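The division of labour described in this proposal, where the UA keeps the attribute locally and the engine applies its own mapping, can be sketched as follows. This is only an illustration under stated assumptions: createRecognizerStub, clampAdjustment, bumpAdjustment, buildRequest and engineThreshold are hypothetical names, clamping is just one way to avoid underflowing/overflowing the -1.0 - 1.0 scale, and the linear engine-side mapping is the trivial one discussed elsewhere in this thread rather than anything the proposal requires.

```javascript
// Hypothetical UA-side object: the adjustment is stored locally, so reading it
// never needs a round trip to a (potentially remote) recognizer.
function createRecognizerStub() {
  return { confidenceThresholdAdjustment: 0.0 };
}

// Keeps a value within the -1.0 - 1.0 scale.
function clampAdjustment(value) {
  return Math.max(-1.0, Math.min(1.0, value));
}

// Incremental change, e.g. delta = -0.1 to be more lenient.
function bumpAdjustment(recognizer, delta) {
  recognizer.confidenceThresholdAdjustment =
    clampAdjustment(recognizer.confidenceThresholdAdjustment + delta);
  return recognizer.confidenceThresholdAdjustment;
}

// The adjustment simply travels with the recognition request.
function buildRequest(recognizer) {
  return {
    confidenceThresholdAdjustment: recognizer.confidenceThresholdAdjustment
  };
}

// Assumed engine-side mapping: linear on each side of the engine's own default.
function engineThreshold(adjustment, engineDefault) {
  if (adjustment > 0) {
    return adjustment * (1 - engineDefault) + engineDefault;
  }
  return adjustment * engineDefault + engineDefault;
}

var recognizer = createRecognizerStub();
bumpAdjustment(recognizer, 0.3);  // more stringent, e.g. to confirm a transaction
bumpAdjustment(recognizer, 5.0);  // out of range, so clamped to the top of the scale
var request = buildRequest(recognizer);
console.log(request.confidenceThresholdAdjustment);
console.log(engineThreshold(0.3, 0.7), engineThreshold(0.3, 0.2));
```

The last line illustrates the point debated in this thread: the same adjustment of 0.3 resolves to a different absolute threshold on an engine whose default is 0.7 than on one whose default is 0.2.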
[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0051.html

On Wed, May 23, 2012 at 3:56 PM, Young, Milan <Milan.Young@nuance.com> wrote:

>> The benefit of minimizing deaf periods is therefore again recognizer specific

Most (all?) of the recognition engines which can be embedded within an HTML browser currently operate over a network. In fact if you study the use cases, you'd find that the majority of those transactions are over a 3G network, which has notoriously high latency. It's possible that this may begin to change over the next few years, but it's surely not going to be in the lifetime of our 1.0 spec (at least I hope we can come to agreement before then :)). Thus the problem can hardly be called engine specific.

Yes, the semantics are unclear, but that wouldn't be any different than a quasi-standard which would undoubtedly emerge in the absence of a specification.

From: Satish S [mailto:satish@google.com]
Sent: Wednesday, May 23, 2012 6:28 AM
To: Young, Milan
Cc: public-speech-api@w3.org
Subject: Re: Confidence property

Hi Milan,

Summarizing previous discussion, we have: Pros: 1) Aids efficient application design, 2) minimizes deaf periods, 3) avoids a proliferation of semi-standard custom parameters. Cons: 1) Semantics of the value are not precisely defined, and 2) Novice users may not understand how confidence differs from maxnbest. My responses to the cons are: 1) Precedent from the speech industry, and 2) Thousands of VoiceXML developers do understand the difference and will balk at an API that does not accommodate their needs.

This was well debated in the earlier thread and it is clear that confidence threshold semantics are tied to the recognizer (not portable). The benefit of minimizing deaf periods is therefore again recognizer specific and not portable. This is a well-suited use case for custom parameters and I'd suggest we start with that.
Thousands of VoiceXML developers do understand the difference and will balk at an API that does not accommodate their needs. I hope we aren't trying to replicate VoiceXML in the browser. If it is indeed a must have feature for web developers we'll be receiving requests for it from them very soon, so it would be easy to discuss and add it in future.
Received on Friday, 15 June 2012 01:05:03 UTC