Re: Confidence Scoring Weirdness from Robert Stewart on 2007-11-01 (www-voice@w3.org from October to December 2007)

From: Robert Stewart <robert@wombatnation.com>
Date: Thu, 01 Nov 2007 09:20:20 -0700
To: Shane Smith <safarishane@gmail.com>
CC: www-voice@w3.org
Message-ID: <4729FCC4.50802@wombatnation.com>

Shane,

The confidence scores should not be thought of as percentages. As you
point out, viewing the scores as percentages when setting maxnbest > 1
is problematic. The ASR won't be scoring all possible matches unless
your ASR supports maxnbest = infinity (just kidding). Actually, I guess
it could make them sum to 100% for each possible semantic interpretation
plus a no match. But, I don't think you would really want confidence
scores that vary for a recognition based on the number of possible
interpretations, e.g., if you had a dynamically generated grammar that
varied greatly in size based on the result of the previous prompt. Even
if the percentages had to sum to 100 only for the nbest results, I still
don't think you would want the scores varying based on the current
maxnbest setting.

Instead, you should just view them literally as scores that can be used
in an absolute sense (e.g., is it above a minimum threshold) and a relative
sense (e.g., is the score for this interpretation sufficiently higher than
the next one that I don't need to disambiguate with the caller). The
confidence scores are calculated independently of each other, leaving
you to decide their relevance in comparison to each other. This is
actually a good thing.
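
As a rough illustration of the absolute and relative checks described
above, here is a minimal JavaScript sketch. The function name, the
minConfidence and minMargin parameters, and the three return values are
hypothetical. The n-best array is assumed to have the shape of
application.lastresult$ (sorted best first, each entry carrying a
confidence value).

```javascript
// Hypothetical helper, not part of any platform API.
// nbest: array of { utterance, confidence }, sorted best first.
// minConfidence and minMargin are assumed, engine-specific tuning values.
function classifyResult(nbest, minConfidence, minMargin) {
  // Absolute test: is the top score above the minimum threshold?
  if (nbest.length === 0 || nbest[0].confidence < minConfidence) {
    return "reject";
  }
  // Relative test: is the runner-up too close to the top score?
  if (nbest.length > 1 &&
      nbest[0].confidence - nbest[1].confidence < minMargin) {
    return "disambiguate";
  }
  return "accept";
}
```

With three interpretations all scoring 0.75, a threshold of 0.7 and a
margin of 0.15, this returns "disambiguate" instead of silently accepting
the top result.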

Also, you are quite right that the confidence scores can vary
dramatically between ASR engines. In my experience with Nuance 8.5, a
confidence score of 0.6 is often a pretty good match. By good, I mean
likely correct. On Nuance/ScanSoft OSR 3.x, a 0.6 is often a very poor
match. To make things even more complicated, at some point in OSR's
past, a significant change was made to the confidence scores that are
generated for the same utterance, so a 0.9 with the old version of OSR
might often be a poor match while a 0.9 on the newer version of OSR
might be a good match. See the OSR documentation for details.

So, what's a voice (web) app developer to do? First, you have to decide
which VXML (HTML) browsers you are going to support. Just as web sites
often have browser-specific HTML and CSS code, speech apps often must
have browser-specific code, whether manually generated or
handled by your development tool/runtime. We have the additional problem
of needing to take into account which ASR and TTS engines are sitting
behind the browser. Spend a little time with different TTS engines and
you will also discover that the same rate and volume settings can have
significantly different results on different engines.

If you know you want to support more than one ASR, then I recommend you
set up some application-wide confidence-related properties, e.g., min
confidence, passive versus active confirmation required, disamb
confidence band, etc. Then, you need to set those defaults based on
which ASR is being used. One crude way to do this is to use the user
agent string in the first HTTP request your app receives. The problem,
of course, is you really need to know the ASR version. But, if you
control where your app is running, you can create a mapping table
that maps user agents to ASR identifiers.
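
The mapping idea above might look something like the following
server-side JavaScript sketch. Every user agent substring, ASR
identifier, property name and number in it is invented for illustration;
real values would have to come from tuning against each engine.

```javascript
// Hypothetical table: user agent substring -> ASR identifier.
var ASR_BY_USER_AGENT = [
  { pattern: "ExampleBrowser/1", asr: "engine-a" },
  { pattern: "ExampleBrowser/2", asr: "engine-b" }
];

// Hypothetical per-ASR defaults for the confidence-related properties.
var DEFAULTS_BY_ASR = {
  "engine-a": { minConfidence: 0.45, confirmBelow: 0.6, disambBand: 0.15 },
  "engine-b": { minConfidence: 0.7, confirmBelow: 0.85, disambBand: 0.1 }
};

// Used when the user agent is not in the table.
var FALLBACK_DEFAULTS = { minConfidence: 0.5, confirmBelow: 0.7, disambBand: 0.15 };

// Look up the defaults from the user agent string of the first HTTP request.
function confidenceDefaultsFor(userAgent) {
  for (var i = 0; i < ASR_BY_USER_AGENT.length; i++) {
    if (userAgent.indexOf(ASR_BY_USER_AGENT[i].pattern) !== -1) {
      return DEFAULTS_BY_ASR[ASR_BY_USER_AGENT[i].asr];
    }
  }
  return FALLBACK_DEFAULTS;
}
```

Keeping the user agent table separate from the defaults table means a
browser upgrade that swaps the ASR behind it only requires changing one
row.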

In your example with maxnbest = 3 and each interpretation having a score
of 0.75, I would code the app so that it disambiguated the results with
the caller. Depending on the wording of the prompt and the
interpretations, I might use phonetic disambiguation like "I found a few
matches for that. Say 1 for ...." As I hinted at above, I recommend also
using a confidence band. For example, you might decide that on one ASR
when scores are within 0.15 of each other, the likelihood that the
lower confidence interpretation is correct is significant. For another
ASR, you might require that they be within 0.1 before you disambiguate.
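
A confidence band of this kind could be sketched in JavaScript roughly
as follows. Both function names are hypothetical, the band value is
whatever you have chosen for the ASR in use, and the n-best entries are
assumed to be sorted best first with utterance and confidence fields.

```javascript
// Hypothetical helper: keep every interpretation whose score is within
// `band` of the top score (the top result itself is always included).
function candidatesWithinBand(nbest, band) {
  var close = [];
  for (var i = 0; i < nbest.length; i++) {
    if (nbest[0].confidence - nbest[i].confidence <= band) {
      close.push(nbest[i]);
    }
  }
  return close;
}

// Hypothetical helper: build a numbered disambiguation prompt.
function disambiguationPrompt(candidates) {
  var text = "I found a few matches for that.";
  for (var i = 0; i < candidates.length; i++) {
    text += " Say " + (i + 1) + " for " + candidates[i].utterance + ".";
  }
  return text;
}
```

The app would only play the prompt when more than one candidate falls
inside the band; a single candidate means the top result stands on its
own.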

Finally, you will want to be able to override your application-wide
settings for individual prompts. When tuning your app, you may discover
that in one particular prompt utterances that are clear to you when you
listen to them are commonly receiving confidence scores just below the
min confidence level that works well elsewhere in your app. By the way,
a quick and dirty tuning trick is to temporarily set maxnbest > 1, set
the min confidence level you send to the ASR lower than you would
normally use, and then code up a recognition result filter in your app
that logs all the results, but keeps only the highest result, assuming
it is greater than your real min confidence level. Then your app behaves
the same as before for prompts with maxnbest=1, but you can easily see
all the near matches. This can help you determine if you should adjust
the min confidence level, support nbest results or rewrite your grammar
to effectively reduce confidence scores for mismatches.
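
The tuning filter described above could look roughly like this in
JavaScript. The function name and the log callback are hypothetical, and
the sketch assumes the n-best list arrives sorted best first with
utterance and confidence fields. The lowered level sent to the ASR (for
example via the VoiceXML confidencelevel property) is set elsewhere;
realMinConfidence is the level the app actually wants to enforce.

```javascript
// Hypothetical tuning filter: log every n-best entry, then keep only the
// top result if it meets the real minimum confidence level.
function filterResults(nbest, realMinConfidence, log) {
  for (var i = 0; i < nbest.length; i++) {
    log(i + ": " + nbest[i].utterance + " (" + nbest[i].confidence + ")");
  }
  if (nbest.length > 0 && nbest[0].confidence >= realMinConfidence) {
    return [nbest[0]];
  }
  return []; // the caller treats an empty list as a no match
}
```

Because the function returns at most one entry, dialog code written for
maxnbest=1 sees the same results as before, while the log captures the
near matches for later analysis.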

Hope this is helpful,
Robert Stewart
Voxify

Shane Smith wrote:
>
> I'm working with a platform that handles confidence scoring a bit
> differently than I'm used to. 
>
> From their guide:
> "You may find that the above filtering algorithm is not fully
> satisfying for your specific application. If so, you may want your
> system to look at your confidence scores, but also look at the
> confidence score distance between the first result and the second
> result of your N-best list. Indeed, if two results roughly have the
> same confidence scores, the first one may not be the right one."
>
> The vxml2.0 spec definitely leaves room for interpretation on how
> individual platforms can determine confidence scoring of utterances. 
> But after speaking with the engineers of this engine, I've found it
> wouldn't be uncommon to expect an n-best list with multiple scores
> above your confidence threshold.  In fact, you could conceivably get
> back an n-best list with multiple scores all over 90%!  I understand
> the wiggle room allowed for platforms in the spec, but this goes
> against the spirit of the spec.  Many examples in the spec show the
> use of the confidence score to determine whether or not to reprompt or
> confirm the caller's input.
>
>            <if cond="application.lastresult$.confidence &lt; 0.7">
>               <goto nextitem="confirmlinkdialog"/>
>            <else/>
>               <goto next="./main_menu.html"/>
>            </if>
>
> That code (from the spec) gives an example of confirmation when the
> top utterance confidence score is below 70%.  Now imagine what would
> happen if you have an n-best list 3 items long, all with 75%
> confidence.  The application wouldn't confirm, even though you can't
> be 'confident' of the entry.  (you are in fact only 33% sure the
> caller said what you think they said) This also means that an
> application you develop for one engine would indeed behave very
> differently on this engine (and vice versa).  While one expects
> different degrees of accuracy amongst the different ASR vendors, this
> actually causes change in functionality of the application itself. 
> (I'd have to write an algorithm in javascript to score based on the
> delta between different entries on the n-best list)
>
> Does anyone have any insight (or potentially an algorithm) to work
> around this platform inconsistency?
>
> Thanks,
> Shane Smith
>
>
Received on Thursday, 1 November 2007 16:20:37 UTC