Re: Confidence property from Glen Shires on 2012-06-17 (public-speech-api@w3.org from June 2012)

From: Glen Shires <gshires@google.com>
Date: Sat, 16 Jun 2012 23:55:20 -0700
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <CAEE5bci+45NXxOojzp0LgctEtdKk0R=rex_pE5NY5BF5AJwedA@mail.gmail.com>
Milan,
We have converged and agree on many crucial aspects of this, including:

- Using a 0.0 - 1.0 scale for outputs from the recognizer as reported
in SpeechRecognitionAlternative.confidence and in EMMA. The scale is
monotonically increasing with 0.0 representing least confidence and 1.0
representing most confidence.

- Using a 0.0 - 1.0 scale for input to the recognizer as set by the
threshold. The scale is monotonically increasing such that larger values
will return the same number of results as lower values, or fewer.  Also,
with larger values of threshold, onnomatch is at least as likely to be
fired as with lower values.

- The default threshold is 0.0.

- A threshold of 0.5 provides a good balance between firing onnomatch when
it is unlikely that any of the return values are correct and firing
onresult instead when it is likely that at least one return value is valid.


The one area in which we have not yet converged is whether "all nomatch
events that contain confidence scores are guaranteed to be < threshold", or
stated another way, whether there is a guarantee of direct one-to-one
"correlation between thresholds and scores in the results".

- You propose that any recognizer that does not guarantee this cannot be
used with this Speech JavaScript API.

- I propose that recognizers that can guarantee this do so. I
also propose that we support recognizers that do not guarantee this, and
that those recognizers must still meet all the criteria above that we have
agreed upon. The effect this has on developers is almost negligible (it
only affects a small group of developers, and the primary effect it has on
them is the requirement to cut-and-paste a snippet of JavaScript into their
code). The effect this has on our Speech JavaScript API is that more
recognizers can be supported.


In response to the 3 questions you pose:
Question 1)
 For recognizers that support it, A is preferable, and my proposal supports
this. For recognizers that don't support A, B is acceptable.

 I believe the number of developers for whom this distinction matters is
quite small. Group 1 and Group 2 developers don't care. Group 3 developers
developing recognizer-dependent code are likely using a recognizer that
supports A (so they're unaffected by this distinction). The distinction
only affects Group 3 developers writing in a recognizer-independent manner,
quite a small group. With my proposal, this group can solve this by
cutting-and-pasting a short JavaScript snippet, the
SetConfidenceThreshold() function, into their code.
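
As an illustration only (this is not the SetConfidenceThreshold() function
from [1], whose text is not reproduced here), a script-side filter of this
kind might look like the sketch below. The name applyConfidenceThreshold and
the plain { transcript, confidence } objects are assumptions made for the
example, and the threshold is compared on whatever scale the recognizer
reports, which may be recognizer dependent.

```javascript
// Hypothetical helper, for illustration; not the function referenced in [1].
// Applies a threshold in script to alternatives shaped like
// SpeechRecognitionAlternative ({ transcript, confidence }).
function applyConfidenceThreshold(alternatives, threshold) {
  if (!(threshold >= 0.0 && threshold <= 1.0)) {
    throw new RangeError('threshold must be between 0.0 and 1.0');
  }
  // Keep only alternatives whose reported confidence meets the threshold.
  const accepted = alternatives.filter((alt) => alt.confidence >= threshold);
  return {
    // Treat "nothing survived the filter" like a nomatch outcome.
    outcome: accepted.length > 0 ? 'result' : 'nomatch',
    alternatives: accepted,
  };
}

// Made-up sample data standing in for recognizer output.
const sample = [
  { transcript: 'call home', confidence: 0.82 },
  { transcript: 'call Rome', confidence: 0.41 },
];
const filtered = applyConfidenceThreshold(sample, 0.75);
console.log(filtered.outcome, filtered.alternatives.length); // prints "result 1"
```

With a filter like this, every alternative the application sees is at or
above the chosen value, regardless of how the recognizer maps its own
threshold internally.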

Question 2)
 We agree that 0.5 is a good starting point. Such tweaking implies
finding a better recognizer-dependent value for a particular
implementation. Since it's recognizer dependent, developers can use either
method A or method B as appropriate. (The spec shouldn't dictate this;
instead the developer should decide.)

Question 3)
 This question combines two separate issues:
 a) We have agreed that a threshold of 0.5 MUST provide meaningful results.
 b) My proposal allows recognizers that support it to use a direct
"correlation between thresholds and scores in the results" but it doesn't
restrict the use of recognizers that don't support this. This enables
developers to write recognizer-independent code for all types of
recognizers, and provides substantial benefits for Group 1, 2 and 3
developers. I describe this in detail here [1].


I encourage everyone to voice their opinions on questions that you and any
other CG member post, but I do not think it's appropriate to call for a
vote, particularly without any prior discussion. Also, simple A/B
questions imply that the answer is A or B, when in fact it may be that some
developers prefer A, some prefer B, some use both or neither, and there may
be an option C that's even more preferable. There are many interdependent
factors, so I think it's much better for the group to holistically evaluate
each of the specific proposals that we have, and the benefits they provide
to the various groups of web developers.

Here again is my specific proposal.  I invite everyone to evaluate and
comment on it. (We could rename the attribute confidenceThreshold instead
of nomatchThreshold).

attribute float nomatchThreshold;

- nomatchThreshold attribute - This attribute defines a threshold for
rejecting recognition results based on the estimated confidence score that
they are correct.  The value of nomatchThreshold ranges from 0.0 (least
confidence) to 1.0 (most confidence), with 0.0 as the default value. A
nomatchThreshold of 0.0 will aggressively return many speech results,
limited only by the maxNBest parameter.

nomatchThreshold is monotonically increasing such that larger values will
return the same number of results as lower values, or fewer. Also, with
larger values of nomatchThreshold, onnomatch is at least as likely to be
fired as with lower values.  It is implementation dependent whether
onnomatch is ever fired when the nomatchThreshold is 0.0.  Unlike
maxNBest, there is no defined mapping between the value of the threshold
and how many results will be returned.

If the nomatchThreshold is set to 0.5, the recognition should provide a
good balance between firing onnomatch when it is unlikely that any of the
return values are correct and firing onresult instead when it is likely
that at least one return value is valid. The precise behavior is
implementation dependent, but it should provide a reasonable mechanism that
enables the developer to accept or reject responses based on whether
onnomatch fires.

It is implementation dependent how nomatchThreshold is mapped, and what
relation (if any) it has to the confidence values returned in results.
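
To make the intended usage concrete, here is a minimal sketch of an
application that sets nomatchThreshold to 0.5 and branches only on which
event fires. It assumes the attribute exists as proposed above; the helper
name configureRecognition and the stand-in object are hypothetical and are
there only so the sketch runs without a speech engine.

```javascript
// Sketch only: nomatchThreshold is the attribute proposed in this message.
// configureRecognition is a hypothetical helper name.
function configureRecognition(recognition, onAccept, onReject) {
  recognition.nomatchThreshold = 0.5; // the agreed starting point
  // Accept or reject purely on which event fires, without reading scores.
  recognition.onresult = (event) => onAccept(event);
  recognition.onnomatch = (event) => onReject(event);
  return recognition;
}

// Stand-in for a recognition object; a real one would come from the UA.
const fakeRecognition = {};
const log = [];
configureRecognition(
  fakeRecognition,
  () => log.push('accepted'),
  () => log.push('rejected')
);

// Simulate the engine firing each event once.
fakeRecognition.onnomatch({});
fakeRecognition.onresult({});
console.log(fakeRecognition.nomatchThreshold, log);
```

A developer who finds onnomatch firing too often or too rarely could adjust
the 0.5 value up or down without ever inspecting confidence values.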

Thanks
Glen

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0100.html


On Fri, Jun 15, 2012 at 1:05 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> Glen and I don’t appear to be converging on a solution and I believe
> it’s time to turn to the community for help.  Rather than plunge into the
> details of the proposals, I’d like to invite all interested parties to
> start by voting A/B on this short questionnaire:
>
> Question-1) You are a web developer and set a confidence threshold to
> .75.  Would you prefer:
>
> A) Results will be returned only if the confidence is >= .75.  All
> nomatch events that contain confidence scores are guaranteed to be < .75.
>
> B) Results of various confidence are returned (i.e. no direct
> correlation to the specified threshold).  Nomatch events also lack
> correlation (e.g. a score of .9 could occur).
>
> Question-2) You are a web developer in class 2 (intelligent, motivated,
> but lacking a speech science background).  You are currently using a
> confidence value of .5 on a mobile application, but too many results are
> being returned, which is causing latency.  You want to improve the
> performance of your system by limiting the number of results.  You start
> by looking at the list of results and try to find an inflection point
> between reasonable and unreasonable values, perhaps running a few informal
> trials with a live microphone.  You now need to choose the new threshold.
> Which methodology seems easier?
>
> A) Specify a threshold just below the inflection point.
>
> B) Add .1 to the threshold, run all your trials again looking to
> see if unreasonable values were returned, add another .1 to the threshold,
> repeat.
>
> Question-3) You are part of the team authoring a new specification for
> an HTML/Speech marriage (think hard :-) ).  It’s come time to write the text
> for how confidence thresholds affect results.  Which design seems like the
> best way to promote a uniform experience across UAs and engines:
>
> A) Require engines to report results on the same scale as the
> developer-specified threshold.  If the engine knows that 0.5, for example,
> does not provide meaningful results for a particular dialog type, they
> should either fix that problem or risk users/developers going elsewhere.
>
> B) Specify that there is only a casual correlation between
> thresholds and scores in the results.  Some engines might provide a
> consistent scale, some engines may use various skews and choose not to map
> back onto the threshold scale.
>
> Thanks
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Friday, June 15, 2012 12:00 PM
> To: Young, Milan
> Cc: public-speech-api@w3.org
> Subject: Re: Confidence property
>
> It may be that we have a misunderstanding in how we both define "native
> confidence values".  I have been using that term, and continue to use that
> term, to indicate a 0.0 - 1.0 scale that has not had any skew applied to
> make 0.5 reasonable.  I have not been using that term to refer to any
> internal recognizer scale that is other than 0.0 - 1.0.
>
> Comments inline below...
>
> On Thu, Jun 14, 2012 at 6:04 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>
> You argue that there exists some recognizer that is NOT capable of giving
> a meaningful native interpretation to thresholds like ‘0.5’.  I will accept
> that.
>
> [Glen] Thank you
>
> You further suggest that these same recognizer(s) have some magic ability
> to transform these thresholds to something that IS meaningful.  I will
> accept that too.  Let’s call that magic transformation webToInternal() and
> its inverse internalToWeb().
>
> [Glen] OK
>
> Without requiring this engine to expose internalToWeb() a developer could
> set a threshold like “0.5” and get back a score like “0.1”.  If you were a
> developer, would that make sense to you?
>
> [Glen] Yes
>
> What practical use would you even have for such a number?
>
> [Glen] I believe most Group 2 web developers don't care to look at
> confidence values:
>
>  - Some will simply set nomatchThreshold = 0.5 and control their
> application based on whether onresult or onnomatch fires.
>
>  - Some more sophisticated Group 2 developers will set nomatchThreshold =
> 0.5 and may adjust it up or down based on whether onresult or onnomatch is
> firing too often or too rarely.
>
>  - Only the most sophisticated Group 2 developers will look at the
> confidence values returned in the results or in EMMA. Since they are
> processing them in a recognition-dependent manner, they must only compare
> relative values. For example, if they find that the second alternative has
> a confidence value relatively near the first, the app may ask the user to
> disambiguate.  Using the example you give, if the top result is 0.1 and the
> second result is 0.085, the app could ask the user to disambiguate.
>
> For Group 3 developers that do process these values, getting back the 0.1
> result is invaluable, because it matches the native levels in their tuning
> tools, logs and other applications.
>
> So yes, this has very practical uses and benefits for Group 2 and Group 3
> developers.
>
> It may as well be a Chinese character.
>
> [Glen] Fortunately, it is a float, and can easily be compared against
> other float values.
>
> Wouldn’t it be a lot more useful to developers and consistent with
> mainstream engines to simply require support for internalToWeb()?  I’m sure
> folks that are capable of building something as complicated as a recognizer
> can solve a math equation.  I’ll even offer to include my phone number in
> the spec so that they can call me for help :-).
>
> [Glen] No. This would be very problematic for Group 3 developers that
> use these recognizers. Their tuning tools, their logs, their other
> applications all may be based on native confidence values, and this
> complicates their implementation, as you have pointed out. Instead, Group 3
> developers would much prefer to only use native values, which they can do
> because the native values are returned in the results and in EMMA. Yes,
> they do have to copy-and-paste a short JavaScript function for this, but
> that's trivial.  For Group 2 and Group 1 developers, there's no difference
> whether these recognizers support internalToWeb().
>
> Thanks
>
> Thank you
>
Received on Sunday, 17 June 2012 06:56:33 UTC