RE: Confidence property from Jim Barnett on 2012-06-16 (public-speech-api@w3.org from June 2012)

From: Jim Barnett <Jim.Barnett@genesyslab.com>
Date: Sat, 16 Jun 2012 06:12:58 -0700
To: "Young, Milan" <Milan.Young@nuance.com>, "Glen Shires" <gshires@google.com>
Cc: <public-speech-api@w3.org>
Message-ID: <E17CAD772E76C742B645BD4DC602CD8106581BDE@NAHALD.us.int.genesyslab.com>
Question 1.  A - I have never seen a recognizer or IVR system that
didn't work this way.

 

Question 2.  A - that's  how I've always seen it done.

 

Question 3.  A - if I understand this correctly the issue is whether
recognizers should be expected to report their scores on the 0-1 scale.
Every recognizer I've ever worked with has done that  (thinking hard, I
may have met one guy once who didn't do this, but he was a crank and his
company disappeared a long time ago.)

 

It seems to me that there is a very clear industry standard on all these
issues - and not just because Nuance has bought the industry...

 

-          Jim

 

 

From: Young, Milan [mailto:Milan.Young@nuance.com] 
Sent: Friday, June 15, 2012 4:06 PM
To: Glen Shires
Cc: public-speech-api@w3.org
Subject: RE: Confidence property

 

Glen and I don't appear to be converging on a solution and I believe
it's time to turn to the community for help.  Rather than plunge into
the details of the proposals, I'd like to invite all interested parties
to start by voting A/B on this short questionnaire:

 

 

Question-1) You are a web developer and set a confidence threshold to
.75.  Would you prefer:

A)     Results will be returned only if the confidence is >= .75.
All nomatch events that contain confidence scores are guaranteed to be <
.75.

B)      Results of various confidence are returned (i.e. no direct
correlation to the specified threshold).  Nomatch events also lack
correlation (e.g. a score of .9 could occur).
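
For illustration, option A can be sketched in a few lines of JavaScript. This is only a sketch of the semantics described in the question: the function name classifyUnderOptionA is hypothetical and does not appear in any draft of the specification.

```javascript
// Hypothetical helper, not part of any specification: classifies a
// recognition outcome under option A of Question-1.
function classifyUnderOptionA(confidence, threshold) {
  // Option A: a result is returned only when confidence >= threshold;
  // anything below the threshold is reported as a nomatch.
  return confidence >= threshold ? "result" : "nomatch";
}
```

Under option B no such function could be written from the threshold alone, since scores and thresholds would not be guaranteed to share a scale.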

 

 

Question-2) You are a web developer in class 2 (intelligent, motivated,
but lacks speech science background).  You are currently using a
confidence value of .5 on a mobile application, but too many results are
being returned which is causing latency.  You want to improve the
performance of your system by limiting the number of results.  You
start by looking at the list of results and try to find an inflection
point between reasonable and unreasonable values.  Perhaps running a few
informal trials with a live microphone.  You now need to choose the new
threshold.  Which methodology seems easier?

A)     Specify a threshold just below the inflection point.

B)      Add .1 to the threshold, run all your trials again looking to
see if unreasonable values were returned, add another .1 to the
threshold, repeat.

 

 

Question-3) You are part of the team authoring a new specification for an
HTML/Speech marriage (think hard :-) ).  It's come time to write the text
for how confidence thresholds affect results.  Which design seems like
the best way to promote a uniform experience across UAs and engines:

A)     Require engines to report results on the same scale as the
developer-specified threshold.  If the engine knows that 0.5, for
example, does not provide meaningful results for a particular dialog
type, they should either fix that problem or risk users/developers going
elsewhere.

B)      Specify that there is only a casual correlation between
thresholds and scores in the results.  Some engines might provide a
consistent scale, some engines may use various skews and choose not to
map back onto the threshold scale.

 

Thanks

 

 

From: Glen Shires [mailto:gshires@google.com] 
Sent: Friday, June 15, 2012 12:00 PM
To: Young, Milan
Cc: public-speech-api@w3.org
Subject: Re: Confidence property

 

It may be that we have a misunderstanding in how we both define "native
confidence values".  I have been using that term, and continue to use
that term to indicate a 0.0 - 1.0 scale that has not had any skew
applied to make 0.5 reasonable.  I have not been using that term to
refer to any internal recognizer scale that is other than 0.0 - 1.0.

 

Comments inline below...

 

On Thu, Jun 14, 2012 at 6:04 PM, Young, Milan <Milan.Young@nuance.com>
wrote:

You argue that there exists some recognizer that is NOT capable of
giving a meaningful native interpretation to thresholds like '0.5'.  I
will accept that.

[Glen] Thank you 

	 

	You further suggest that these same recognizer(s) have some
magic ability to transform these thresholds to something that IS
meaningful.  I will accept that too.  Let's call that magic
transformation webToInternal() and its inverse internalToWeb().

 [Glen] OK

	 

	Without requiring this engine to expose internalToWeb() a
developer could set a threshold like "0.5" and get back a score like
"0.1".  If you were a developer, would that make sense to you?

[Glen] Yes

 

	  What practical use would you even have for such a number? 

[Glen] I believe most Group 2 web developers don't care to look at
confidence values:

 

 - Some will simply set nomatchThreshold = 0.5 and control their
application based on whether onresult or onnomatch fires.

 

 - Some more sophisticated Group 2 developers will set nomatchThreshold =
0.5 and may increment it up or down based on if onresult or onnomatch is
firing too often or rarely.

 

 - Only the most sophisticated Group 2 developers will look at the
confidence values returned in the results or in emma. Since they are
processing them in a recognition-dependent manner, they must only
compare relative values. For example, if they find that the second
alternative has a confidence value relatively near the first, the app
may ask the user to disambiguate.  Using the example you give, if the
top result is 0.1 and the second result is 0.085, the app could ask the
user to disambiguate.
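
The relative comparison described in the last bullet might look roughly like the sketch below. The function name needsDisambiguation and the 0.8 ratio are assumptions chosen for illustration; only the 0.1 and 0.085 scores come from the example above.

```javascript
// Hypothetical helper: decides whether to ask the user to disambiguate,
// using only the ratio between the top two confidence values.
// The 0.8 default ratio is an assumption chosen for illustration.
function needsDisambiguation(confidences, minRatio = 0.8) {
  if (confidences.length < 2) return false;
  // Sort a copy in descending order so the input array is not mutated.
  const [first, second] = [...confidences].sort((a, b) => b - a);
  if (first <= 0) return false;
  // Compare relative values only; the absolute scale does not matter.
  return second / first >= minRatio;
}

needsDisambiguation([0.1, 0.085]); // second is about 85% of first
needsDisambiguation([0.9, 0.3]);   // second is about 33% of first
```

Because only the ratio is used, the same check behaves identically whether the engine reports native or rescaled scores, provided the rescaling is roughly proportional over the range in question.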

 

For Group 3 developers that do process these values, getting back the
0.1 result is invaluable, because it matches the native levels in their
tuning tools, logs and other applications.

 

So yes, this has very practical uses and benefits for Group 2 and Group
3 developers. 

 

	It may as well be a Chinese character.

[Glen] Fortunately, it is a float, and can easily be compared against
other float values. 

	 

	Wouldn't it be a lot more useful to developers and consistent
with mainstream engines to simply require support for internalToWeb()?
I'm sure folks that are capable of building something as complicated as
a recognizer can solve a math equation.  I'll even offer to include my
phone number in the spec so that they can call me for help :-).

[Glen] No. This would be very problematic for Group 3 developers that
use these recognizers. Their tuning tools, their logs, their other
applications all may be based on native confidence values, and this
complicates their implementation, as you have pointed out. Instead,
Group 3 developers would much prefer to only use native values, which
they can do because the native values are returned in the results and in
emma. Yes, they do have to copy-and-paste a short JavaScript function
for this, but that's trivial.  For Group 2 and Group 1 developers,
there's no difference whether these recognizers support internalToWeb().
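
As a rough illustration of the webToInternal() and internalToWeb() pair discussed in this thread, here is a minimal sketch assuming a piecewise-linear skew. The thread does not define either function; the NATIVE_PIVOT value of 0.1 and the shape of the mapping are assumptions, picked so that a web threshold of 0.5 corresponds to the native score of 0.1 used in the example above.

```javascript
// Hypothetical pivot: the native score this imaginary engine treats as
// "reasonable". Real engines would choose their own mapping.
const NATIVE_PIVOT = 0.1;

// Maps a web-facing threshold (0.0 - 1.0, where 0.5 is "reasonable")
// onto the native scale with a piecewise-linear skew.
function webToInternal(web) {
  return web <= 0.5
    ? (web / 0.5) * NATIVE_PIVOT
    : NATIVE_PIVOT + ((web - 0.5) / 0.5) * (1 - NATIVE_PIVOT);
}

// Inverse mapping: native score back onto the web-facing scale.
function internalToWeb(native) {
  return native <= NATIVE_PIVOT
    ? (native / NATIVE_PIVOT) * 0.5
    : 0.5 + ((native - NATIVE_PIVOT) / (1 - NATIVE_PIVOT)) * 0.5;
}
```

With this mapping, a developer who prefers native values could apply webToInternal() once to the threshold, while an engine that reports on the web scale would apply internalToWeb() to each score before returning it.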

 

	Thanks

Thank you
Received on Saturday, 16 June 2012 13:13:51 UTC