Re: Confidence property

If I may step back a bit, I'd like to understand why confidence, maxNBest, and nomatch appear in the specification at all.

There are two reasons that are often given for confidence in a recognition API.  Both concern resource utilization.  The first is that a confidence threshold allows the recognizer to prune its search, disregarding options which are unlikely to produce results with high scores.  The second is that a confidence threshold may limit the size of the recognition result and thereby the processing on the client side of the API.  This latter case is better addressed by maxNBest or similar attributes.  I should point out that confidence has an additional purpose in dialog specifications such as VoiceXML or Nuance DialogModules, which is to determine when 'no match' events are generated; this purpose is inappropriate for a recognition API.

It is worth noting that not all recognizers produce N-best lists or N-best confidences.  A recognizer may produce a lattice result in which the confidence values are associated with individual nodes or arcs.  While it is possible to build an N-best list from a lattice, there is no standard way to produce a confidence score for each N-best entry.  Applications built on such systems often focus on certain hot spots in the response rather than on the details of the entire utterance.

It is also worth noting that confidence scores depend on the similarity of entries in the grammar.  An example grammar might have a command 'help' and the Beatles' album 'Help!'.  The recognizer might be very confident that one of the two was spoken but have very little confidence as to which it actually was.  In contrast, the recognizer would likely have high confidence that 'help' was spoken as opposed to 'let me talk to an agent'.  No single confidence threshold can capture this complexity.

And it is worth mentioning that confidence scores are even more difficult to assemble when multiple resources are used in combination for a single recognition, e.g. a recognition engine together with a speaker verification system.


I would like to consider three changes.

(1) Drop nomatch.  This is not a dialog API.  When necessary, the developer can examine the confidence of the result and proceed accordingly.
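
For instance (a rough sketch only; the constructor name and result-access shape are assumed, and the helper functions are hypothetical application code):

    var recognizer = new SpeechRecognition();
    recognizer.onresult = function (event) {
      var best = event.result[0];          // top alternative (assumed shape)
      if (best.confidence < 0.5) {         // application-chosen cutoff
        askUserToRepeat();                 // hypothetical application function
      } else {
        acceptUtterance(best.transcript);  // hypothetical application function
      }
    };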

(2) Drop maxNBest and confidence.
 
(3) Add a processing directive with limited options.  Values might indicate that an application wants only the very best result(s), a short list of potential candidates with nearly equivalent scores, or a longer list of hypotheses.  Such categories are descriptive and also testable, as each subsequent value grows in average result size and decreases in average confidence.
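
To make (3) concrete, one possible shape would be a single enumerated attribute (the name and values below are purely illustrative, not proposed syntax):

    // Hypothetical processing directive; each value increases the average
    // result size and decreases the average confidence of the entries.
    //   "best"       -- only the very best result(s)
    //   "shortlist"  -- a few candidates with nearly equivalent scores
    //   "hypotheses" -- a longer list of hypotheses
    recognizer.resultDetail = "shortlist";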

It is the responsibility of each speech recognition service to understand the ROC curves of their engines and the variation for typical grammars and to then pick appropriate default values.  This greatly simplifies the task for application authors and for specification drafters.  Applications employing grammars which are tuned for specific recognition engines would then be free to use vendor-specific parameters to specify more exact criteria.

-=- Jerry





On Jun 19, 2012, at 7:31 PM, Glen Shires wrote:

> Let's assume the top three results have confidence levels of 0.50, 0.49, 0.48
> If I understand your proposed wording, then in this example, if confidenceThreshold is 0.50, then onresult should return one result: 0.50.
> However, if confidenceThreshold is 0.51, then onnomatch should return all three results: 0.50, 0.49, 0.48.
> 
> Let's assume in another example, the top three results have confidence levels of 0.60, 0.59, 0.58.
> With the confidenceThreshold of 0.50, onresult should return all three results: 0.60, 0.59, 0.58.
> 
> In this case, since the three results are so close, the web-developer may ask the user to disambiguate.
> However, in the first example above, when only one result 0.50 is returned, the developer cannot ask for disambiguation, even though the confidence is lower.
> 
> 
> We need to decide what is most helpful to expose to the developer: providing finer control over when onresult vs onnomatch fires, or providing finer control over bandwidth. I believe finer control over onresult / onnomatch is what most developers seek.
> 
> 
> On Tue, Jun 19, 2012 at 4:08 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> There are many parallels between maxAlternatives and confidenceThreshold, and this is one of them.  With maxAlternatives, we provide a hard cut-off on the size of the list even if the confidence scores of the last item and first item on the overflow are the same.  I don’t see any reason to deviate from that policy for confidence.
> 
>  
> 
> At some point, the developer needs to draw the line between junk and legitimate speech candidates.  Confidence threshold is that line.  If the threshold is .5, then .49 will not do.  For if .49 were OK, then why not .48?  It too is only .01 from the “line”.  Then why not .47, and so forth?  In the end you’ll just end up with a second “hard” threshold like .4, which is a totally arbitrary number mostly based on the fact that humans have 10 fingers.
> 
>  
> 
> I suggest that such a developer should simply set the threshold to .4 or wherever that “hard” line resides.  If they are not comfortable making any arbitrary decisions, they should just leave the confidence at 0.0 and generate their nomatch events manually.
> 
>  
> 
> Make sense?
> 
>  
> 
>  
> 
> From: Glen Shires [mailto:gshires@google.com] 
> Sent: Tuesday, June 19, 2012 3:19 PM
> 
> 
> To: Young, Milan
> Cc: public-speech-api@w3.org
> Subject: Re: Confidence property
> 
>  
> 
> Milan,
> 
> I like the idea of using SHOULD and I think we are really close to complete agreement here. The one bit that's unclear is that it seems that confidenceThreshold is being used to do two things that, although related, are not identical:
> 
>  
> 
> - To define the threshold at which onresult vs onnomatch events fire.
> 
> - To limit the number of results returned (e.g. to save bandwidth)
> 
>  
> 
> It seems to me that from a web developer's perspective, the most important thing is to define the onresult / onnomatch threshold in as recognizer-independent a way as possible.
> 
>  
> 
> In the event of onnomatch, I agree with this statement (I made "scores" plural)
> 
>  
> 
> Conversely, if results are returned in a nomatch event, the confidence scores SHOULD be less than the confidenceThreshold.
> 
>  
> 
> In the event of onresult, it's a bit trickier. For example, suppose only one candidate had a confidence greater than confidenceThreshold, and other candidates had confidence scores just slightly less than the threshold. I believe the web developer would want each of these returned, because this might be a case where the developer would want the user to disambiguate. Thus, I think the converse should be something like:
> 
>  
> 
> if results are returned in an onresult event, the maximum confidence score SHOULD be greater than or equal to the confidenceThreshold
> 
>  
> 
> /Glen
> 
>  
> 
> On Mon, Jun 18, 2012 at 12:45 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> 
> Glen, from the feedback we have received thus far (yours included), it’s clear that behavior A is universally preferred over B from a developer and spec authoring perspective.  In other words, aside from the impact of preventing select engines from participation, there are only advantages to A and no disadvantages.  That said, I agree that if the spec is to be successful, we’ll need to accommodate recognizers that cannot support A.
> 
>  
> 
> This class of dichotomy between desirable behavior and what’s implementable is typically handled with SHOULD requirements.  In short, implementations SHOULD ensure their confidence threshold is on the same scale as the returned results, but failing to do this would not represent a point of non-compliance.  This would result in language similar to the following (changes in green, deletions in red):
> 
>  
> 
>  
> 
> - confidenceThreshold attribute - This attribute defines a threshold for rejecting recognition results based on the estimated confidence score that they are correct.  The value of confidenceThreshold ranges from 0.0 (least confidence) to 1.0 (most confidence), with 0.0 as the default value. A 0.0 confidenceThreshold will aggressively return many speech results limited only by the length of the maxNBest parameter.
> 
>  
> 
> confidenceThreshold is monotonically increasing such that larger values will return an equal or smaller number of results than lower values. Also, with larger values of confidenceThreshold, onnomatch is at least as likely to be fired as with lower values.
> 
>  
> 
> A threshold of 0.0 suggests that implementations should be as aggressive as possible in returning speech result candidates.  In such cases, the number of results will only be limited by the length of the maxNBest parameter.  If a nomatch event occurs when the confidenceThreshold is set to 0.0, the ‘result’ property of the event SHOULD be null; otherwise the result would have been a candidate and SHOULD have been returned.  It is implementation-dependent whether onnomatch is ever fired when the confidenceThreshold is 0.0.
> 
>  
> 
> Unlike maxNBest, there is no defined mapping between the value of the threshold and how many results will be returned.
> 
>  
> 
> If the confidenceThreshold is set to 0.5, the recognition should provide a good balance between firing onnomatch when it is unlikely that any of the return values are correct and firing onresult instead when it is likely that at least one return value is valid. The precise behavior is implementation dependent, but it should provide a reasonable mechanism that enables the developer to accept or reject responses based on whether onnomatch fires.
> 
>  
> 
> It is implementation dependent how nomatchThreshold is mapped, and its relation (if any) to the confidence values returned in results.
> 
>  
> 
> Implementations SHOULD only return result candidates that have a confidence value greater than or equal to the confidenceThreshold.  Conversely, if results are returned in a nomatch event, the confidence score SHOULD be less than the confidenceThreshold.
> 
>  
> 
>  
> 
>  
> 
> Note that section 5.1.3, which defines nomatch, needs to have its language similarly adjusted.  Right now it implies a MUST for behavior A, and that should be relaxed to a SHOULD.
> 
>  
> 
>  
> 
> Thanks
> 
>  
> 
>  
> 
>  
> 
> From: Glen Shires [mailto:gshires@google.com] 
> Sent: Saturday, June 16, 2012 11:55 PM
> 
> 
> To: Young, Milan
> Cc: public-speech-api@w3.org
> Subject: Re: Confidence property
> 
>  
> 
> Milan,
> 
> We have converged and agree on many crucial aspects of this, including:
> 
>  
> 
> - Using 0.0 - 1.0 scale for outputs from the recognizer as reported in SpeechRecognitionAlternative.confidence and in EMMA. The scale is monotonically increasing with 0.0 representing least confidence and 1.0 representing most confidence.
> 
>  
> 
> - Using 0.0 - 1.0 scale for input to the recognizer as set by the threshold. The scale is monotonically increasing such that larger values will return an equal or fewer number of results than lower values.  Also, with larger values of threshold, onnomatch is more likely, or just as likely, to be fired than with lower values. 
> 
>  
> 
> - The default threshold is 0.0.
> 
>  
> 
> - A threshold of 0.5 provides a good balance between firing onnomatch when it is unlikely that any of the return values are correct and firing onresult instead when it is likely that at least one return value is valid. 
> 
>  
> 
>  
> 
> The one area in which we have not yet converged is whether "all nomatch events that contain confidence scores are guaranteed to be < threshold", or stated another way, whether there is a guarantee of direct one-to-one "correlation between thresholds and scores in the results".
> 
>  
> 
> - You propose that any recognizer that does not guarantee this cannot be used with this Speech Javascript API.
> 
>  
> 
> - I propose that recognizers that guarantee this do so. I also propose that we support recognizers that do not guarantee this, and that those recognizers must still meet all the criteria above that we have agreed upon. The effect this has on developers is almost negligible (it only affects a small group of developers, and the primary effect it has on them is the requirement to cut-and-paste a snippet of JavaScript into their code). The effect this has on our Speech Javascript API is that more recognizers can be supported.
> 
>  
> 
>  
> 
> In response to the 3 questions you pose:
> 
> Question 1)
> 
>  For recognizers that support it, A is preferable, and my proposal supports this. For recognizers that don't support A, B is acceptable.
> 
>  
> 
>  I believe the number of developers for whom this distinction matters is quite small. Group 1 and Group 2 developers don't care. Group 3 developers developing recognizer-dependent code are likely using a recognizer that supports A (so they're unaffected by this distinction). The distinction only affects Group 3 developers writing in a recognizer-independent manner, quite a small group. With my proposal, this group can solve this by cutting-and-pasting a short JavaScript snippet, a SetConfidenceThreshold() function, into their code.
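> 
> As a rough illustration of such a snippet (the skew below is made up; a real mapping would be recognizer-specific and come from the vendor's documentation or tuning data):
> 
>     // Maps a 0.0 - 1.0 "web scale" threshold onto the recognizer's native
>     // scale before setting the proposed threshold attribute.
>     function SetConfidenceThreshold(recognizer, webValue) {
>       var nativeValue = 0.2 + 0.6 * webValue;   // hypothetical linear skew
>       recognizer.nomatchThreshold = nativeValue;
>     }
> 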
> 
>  
> 
> Question 2)
> 
>  We agree that 0.5 is a good starting point. Such tweaking implies finding a better recognizer-dependent value for a particular implementation. Since it's recognizer-dependent, developers can use either method A or method B as appropriate. (The spec shouldn't dictate this; instead the developer should decide.)
> 
>  
> 
> Question 3)
> 
>  This question combines two separate issues:
> 
>  a) We have agreed that a threshold of 0.5 MUST provide meaningful results.
> 
>  b) My proposal allows recognizers that support it to use a direct "correlation between thresholds and scores in the results" but it doesn't restrict the use of recognizers that don't support this. This enables developers to write recognizer-independent code for all types of recognizers, and provides substantial benefits for Group 1, 2 and 3 developers. I describe this in detail here [1]
> 
>  
> 
>  
> 
> I encourage everyone to voice their opinions on questions that you and any other CG member posts, but I do not think it's appropriate to call for a vote, and particularly without any prior discussion. Also, simple A/B questions imply that the answer is A or B, when in fact it may be that some developers prefer A, some prefer B, some use both or neither, and there may be an option C that's even more preferable. There are many interdependent factors, so I think it's much better for the group to holistically evaluate each of the specific proposals that we have, and the benefits they provide to the various groups of web developers.
> 
> Here again is my specific proposal.  I invite everyone to evaluate and comment on it. (We could name the attribute confidenceThreshold instead of nomatchThreshold.)
> 
>  
> 
> attribute float nomatchThreshold;
> 
>  
> 
> - nomatchThreshold attribute - This attribute defines a threshold for rejecting recognition results based on the estimated confidence score that they are correct.  The value of nomatchThreshold ranges from 0.0 (least confidence) to 1.0 (most confidence), with 0.0 as the default value. A 0.0 nomatchThreshold will aggressively return many speech results limited only by the length of the maxNBest parameter.
> 
> nomatchThreshold is monotonically increasing such that larger values will return an equal or fewer number of results than lower values. Also, with larger values of nomatchThreshold, onnomatch is more likely, or just as likely, to be fired than with lower values.  It is implementation-dependent whether onnomatch is ever fired when the nomatchThreshold is 0.0.  Unlike maxNBest, there is no defined mapping between the value of the threshold and how many results will be returned.
> 
> If the nomatchThreshold is set to 0.5, the recognition should provide a good balance between firing onnomatch when it is unlikely that any of the return values are correct and firing onresult instead when it is likely that at least one return value is valid. The precise behavior is implementation dependent, but it should provide a reasonable mechanism that enables the developer to accept or reject responses based on whether onnomatch fires.
> 
> It is implementation dependent how nomatchThreshold is mapped, and its relation (if any) to the confidence values returned in results.
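> 
> For illustration, typical use of the attribute as proposed might look like this (the handler bodies are hypothetical application code):
> 
>     recognizer.nomatchThreshold = 0.5;   // the "balanced" setting described above
>     recognizer.onresult = function (event) {
>       acceptBest(event);                 // at least one alternative is likely valid
>     };
>     recognizer.onnomatch = function (event) {
>       reprompt();                        // none of the alternatives are likely correct
>     };
> 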
> 
>  
> 
> Thanks
> 
> Glen
> 
>  
> 
> [1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0100.html
> 
>  
> 
>  
> 
> On Fri, Jun 15, 2012 at 1:05 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> 
> Glen and I don’t appear to be converging on a solution and I believe it’s time to turn to the community for help.  Rather than plunge into the details of the proposals, I’d like to invite all interested parties to start by voting A/B on this short questionnaire:
> 
>  
> 
>  
> 
> Question-1) You are a web developer and set a confidence threshold to .75.  Would you prefer:
> 
> A)      Results will be returned only if the confidence is >= to .75.  All nomatch events that contain confidence scores are guaranteed to be < .75.
> 
> B)      Results of various confidence are returned (i.e. no direct correlation to the specified threshold).  Nomatch events also lack correlation (e.g. a score of .9 could occur).
> 
>  
> 
>  
> 
> Question-2) You are a web developer in class 2 (intelligent, motivated, but lacks a speech science background).  You are currently using a confidence value of .5 on a mobile application, but too many results are being returned, which is causing latency.  You want to improve the performance of your system by limiting the number of results.  You start by looking at the list of results and try to find an inflection point between reasonable and unreasonable values, perhaps running a few informal trials with a live microphone.  You now need to choose the new threshold.  Which methodology seems easier?
> 
> A)      Specify a threshold just below the inflection point.
> 
> B)      Add .1 to the threshold, run all your trials again looking to see if unreasonable values were returned, add another .1 to the threshold, repeat.
> 
>  
> 
>  
> 
> Question-3) You are part of the team authoring a new specification for an HTML/Speech marriage (think hard :-)).  It’s come time to write the text for how confidence thresholds affect results.  Which design seems like the best way to promote a uniform experience across UAs and engines:
> 
> A)      Require engines to report results on the same scale as the developer-specified threshold.  If the engine knows that 0.5, for example, does not provide meaningful results for a particular dialog type, they should either fix that problem or risk users/developers going elsewhere.
> 
> B)      Specify that there is only a casual correlation between thresholds and scores in the results.  Some engines might provide a consistent scale, some engines may use various skews and choose not to map back onto the threshold scale.
> 
>  
> 
> Thanks
> 
>  
> 
>  
> 
> From: Glen Shires [mailto:gshires@google.com] 
> Sent: Friday, June 15, 2012 12:00 PM
> To: Young, Milan
> Cc: public-speech-api@w3.org
> Subject: Re: Confidence property
> 
>  
> 
> It may be that we have a misunderstanding in how we both define "native confidence values".  I have been using that term, and continue to use that term to indicate a 0.0 - 1.0 scale that has not had any skew applied to make 0.5 reasonable.  I have not been using that term to refer to any internal recognizer scale that is other than 0.0 - 1.0.
> 
>  
> 
> Comments inline below...
> 
>  
> 
> On Thu, Jun 14, 2012 at 6:04 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> 
> You argue that there exists some recognizer that is NOT capable of giving a meaningful native interpretation to thresholds like ‘0.5’.  I will accept that.
> 
> [Glen] Thank you 
> 
>  
> 
> You further suggest that these same recognizer(s) have some magic ability to transform these thresholds to something that IS meaningful.  I will accept that too.  Let’s call that magic transformation webToInternal() and its inverse internalToWeb().
> 
>  [Glen] OK
> 
>  
> 
> Without requiring this engine to expose internalToWeb() a developer could set a threshold like “0.5” and get back score like “0.1”.  If you were a developer, would that make sense to you?
> 
> [Glen] Yes
> 
>  
> 
>   What practical use would you even have for such a number? 
> 
> [Glen] I believe most Group 2 web developers don't care to look at confidence values:
> 
>  
> 
>  - Some will simply set nomatchThreshold = 0.5 and control their application based on whether onresult or onnomatch fires.
> 
>  
> 
>  - Some more sophisticated Group 2 developers will set nomatchThreshold = 0.5 and may increment it up or down based on whether onresult or onnomatch is firing too often or too rarely.
> 
>  
> 
>  - Only the most sophisticated Group 2 developers will look at the confidence values returned in the results or in emma. Since they are processing them in a recognition-dependent manner, they must only compare relative values. For example, if they find that the second alternative has a confidence value relatively near the first, the app may ask the user to disambiguate.  Using the example you give, if the top result is 0.1 and the second result is 0.085, the app could ask the user to disambiguate.
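> 
> A sketch of that last case (the 0.85 ratio, the result-access shape, and the helper functions are arbitrary illustrations):
> 
>     recognizer.onresult = function (event) {
>       var alts = event.result;                        // assumed list of alternatives
>       if (alts.length > 1 &&
>           alts[1].confidence > 0.85 * alts[0].confidence) {
>         askUserToDisambiguate(alts[0], alts[1]);      // relative scores are close
>       } else {
>         accept(alts[0]);
>       }
>     };
> 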
> 
>  
> 
> For Group 3 developers that do process these values, getting back the 0.1 result is invaluable, because it matches the native levels in their tuning tools, logs and other applications.
> 
>  
> 
> So yes, this has very practical uses and benefits for Group 2 and Group 3 developers. 
> 
>  
> 
> It may as well be a Chinese character.
> 
> [Glen] Fortunately, it is a float, and can easily be compared against other float values. 
> 
>  
> 
> Wouldn’t it be a lot more useful to developers and consistent with mainstream engines to simply require support for internalToWeb()?  I’m sure folks that are capable of building something as complicated as a recognizer can solve a math equation.  I’ll even offer to include my phone number in the spec so that they can call me for help :-).
> 
> [Glen] No. This would be very problematic for Group 3 developers that use these recognizers. Their tuning tools, their logs, their other applications all may be based on native confidence values, and this complicates their implementation, as you have pointed out. Instead, Group 3 developers would much prefer to only use native values, which they can do because the native values are returned in the results and in emma. Yes, they do have to copy-and-paste a short JavaScript function for this, but that's trivial.  For Group 2 and Group 1 developers, there's no difference whether these recognizers support internalToWeb().
> 
>  
> 
> Thanks
> 
> Thank you
> 
>  
> 
>  
> 
>  
> 
> 

Received on Wednesday, 20 June 2012 05:25:37 UTC