Re: Confidence Scoring Weirdness from Shane Smith on 2007-11-01 (www-voice@w3.org from October to December 2007)

From: Shane Smith <safarishane@gmail.com>
Date: Thu, 1 Nov 2007 10:11:46 -0700
To: "Robert Stewart" <robert@wombatnation.com>
Cc: www-voice@w3.org
Message-ID: <8fc15e140711011011r2454fdtcc6103058d688d04@mail.gmail.com>
Good information Robert, but let me clarify...

The ASR in question doesn't change scores depending on maxnbest, as you
suggest.  In fact, maxnbest can be ignored for the purposes of my question.

Assume maxnbest of 1.  Raw confidence scores for all three plausible
interpretations come back between .90 and .95, but only the first one
(because of maxnbest) is returned to the application.  Now you have one
result with a confidence score of .95.  From an application perspective, I
would look at that result and assume that the ASR engine had easily
identified, from all possible utterances, what the caller actually said,
with 95% confidence.

But this couldn't be further from the truth.  The ASR engine is actually
returning raw scoring, in no way weighted based on other possible
utterances.  While it's 95% sure the caller said one thing, it's also 92.5%
sure the caller said something else, and 90% sure of a third possible
utterance.  The way the confidence score is described in the spec, it is a
score of how 'sure' the ASR engine is that it got it right.  If there are
really three plausible possibilities, then how could the ASR engine know
with 95% confidence which one is right?

In other words, the confidence score is determined completely independently
of other possibilities.
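
To put a number on that, here is a minimal sketch that rescales the raw
scores so they sum to 1 across the plausible interpretations.  Dividing by
the sum is just my own simple assumption for illustration, not something any
engine or the spec prescribes, and relativeConfidence and the utterance
labels are made-up names.  The input mirrors the shape of
application.lastresult$ (an array of entries with utterance and confidence).

```javascript
// Hypothetical helper: rescale raw n-best scores so they sum to 1.
// nbest is an array of { utterance, confidence } objects, mirroring the
// shape of application.lastresult$ in VoiceXML.
function relativeConfidence(nbest) {
  var total = 0;
  for (var i = 0; i < nbest.length; i++) {
    total += nbest[i].confidence;
  }
  var out = [];
  for (var j = 0; j < nbest.length; j++) {
    out.push({
      utterance: nbest[j].utterance,
      raw: nbest[j].confidence,
      // Share of the total raw score; 0 if every score is 0.
      relative: total > 0 ? nbest[j].confidence / total : 0
    });
  }
  return out;
}

// The three raw scores from the example above (labels are placeholders).
var scored = relativeConfidence([
  { utterance: "first", confidence: 0.95 },
  { utterance: "second", confidence: 0.925 },
  { utterance: "third", confidence: 0.90 }
]);
// scored[0].relative is 0.95 / 2.775, roughly 0.34
```

Under that assumption the "95%" result carries only about a third of the
total weight, which is much closer to how sure the application can really be.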

Let's say for a moment the grammar I'm using that is returning those 3 high
confidence possibilities has 100 possible matching utterances.  Now, pretend
I just added 900 more utterances to that grammar.  With 1000 possible
matches, the ASR engine should have more trouble disambiguating on the
correct utterance, no?  I should expect my confidence scores to drop,
right?  Wrong.  I'll still get the same .90, .925, and .95 I got previously,
because each utterance is scored independently.

I strongly believe this is not a good use of confidence scores.  And no,
this isn't a fly-by-night ASR engine company... it's one of the major
players.  I do not argue that raw scoring can't be useful.  It can be,
especially when tuning an application.  But if the ASR engine won't weight
scores based on all reasonable possibilities, developers get behavior that
is completely unexpected given what the spec says confidence scoring should
be and the examples on the W3C site of how confidence scores should be
used.
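
For what it's worth, here is a rough sketch of the extra check an
application would need on top of the spec-style threshold test: confirm when
the top score is too low, or when the runner-up is within a band of it.  The
function name and parameters are my own placeholders, and the 0.7 and 0.15
values in the comments are only illustrative (0.7 is the threshold in the
spec example, 0.15 is the band Robert mentions for one ASR).

```javascript
// nbest: array of { utterance, confidence }, sorted best first, in the
// shape of application.lastresult$.
// minConfidence: absolute threshold (for example 0.7).
// band: gap below which the runner-up counts as a rival (for example 0.15).
// Returns true if the app should confirm or disambiguate with the caller.
function needsConfirmation(nbest, minConfidence, band) {
  if (nbest.length === 0) {
    return true; // nothing recognized
  }
  var top = nbest[0].confidence;
  if (top < minConfidence) {
    return true; // the spec-style absolute check
  }
  // Relative check: a runner-up close to the top score means ambiguity.
  return nbest.length > 1 && (top - nbest[1].confidence) < band;
}
```

Note that this only helps when maxnbest is at least 2.  With maxnbest of 1
the runner-up never reaches the application, so the relative check has
nothing to compare against.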

Regards,
Shane Smith



On 11/1/07, Robert Stewart <robert@wombatnation.com> wrote:
>
> Shane,
>
> The confidence scores should not be thought of as percentages. As you
> point out, viewing the scores as percentages when setting maxnbest > 1
> is problematic. The ASR won't be scoring all possible matches unless
> your ASR supports maxnbest = infinity (just kidding). Actually, I guess
> it could make them sum to 100% for each possible semantic interpretation
> plus a no match. But, I don't think you would really want confidence
> scores that vary for a recognition based on the number of possible
> interpretations, e.g., if you had a dynamically generated grammar that
> varied greatly in size based on the result of the previous prompt. Even
> if the percentages had to sum to 100 only for the nbest results, I still
> don't think you would want the scores varying based on the current
> maxnbest setting.
>
> Instead, you should just view them literally as scores that can be used
> in an absolute (e.g., is it above a minimum threshold) and relative
> sense (e.g., is the score for this interpretation enough greater than
> another that I don't need to disambiguate with the caller). The
> confidence scores are calculated independently of each other, leaving
> you to decide their relevance in comparison to each other. This is
> actually a good thing.
>
> Also, you are quite right that the confidence scores can vary
> dramatically between ASR engines. In my experience with Nuance 8.5, a
> confidence score of 0.6 is often a pretty good match. By good, I mean
> likely correct. On Nuance/ScanSoft OSR 3.x, a 0.6 is often a very poor
> match. To make things even more complicated, at some point in OSR's
> past, a significant change was made to the confidence scores that are
> generated for the same utterance, so a 0.9 with the old version of OSR
> might often be a poor match while a 0.9 on the newer version of OSR
> might be a good match. See the OSR documentation for details.
>
> So, what's a voice (web) app developer to do? First, you have to decide
> which VXML (HTML) browsers you are going to support. Just like web sites
> often have HTML and CSS code that is browser specific, often so must
> speech apps have browser specific code, whether manually generated or
> handled by your development tool/runtime. We have the additional problem
> of needing to take into account which ASR and TTS engines are sitting
> behind the browser. Spend a little time with different TTS engines and
> you will also discover that the same rate and volume settings can have
> significantly different results on different engines.
>
> If you know you want to support more than one ASR, then I recommend you
> set up some application wide confidence-related properties, e.g., min
> confidence, passive versus active confirmation required, disamb
> confidence band, etc. Then, you need to set those defaults based on
> which ASR is being used. One crude way to do this is to use the user
> agent string in the first HTTP request your app receives. The problem,
> of course, is you really need to know the ASR version. But if you are
> controlling where your app is running, you can create a mapping table
> that maps user agents to ASR identifiers.
>
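
A rough sketch of such a mapping table, as I understand the suggestion.
Every user-agent string, ASR identifier, and threshold below is a made-up
placeholder, not a vendor value, and defaultsForUserAgent is a hypothetical
name:

```javascript
// All keys and numbers below are hypothetical placeholders.
var ASR_DEFAULTS = {
  "engine-a": { minConfidence: 0.5, band: 0.15 },
  "engine-b": { minConfidence: 0.8, band: 0.10 }
};

// Maps a substring of the User-Agent header to an ASR identifier.
var USER_AGENT_TO_ASR = [
  { match: "ExampleBrowser/1", asr: "engine-a" },
  { match: "OtherBrowser/2", asr: "engine-b" }
];

// Used when the user agent is not in the table.
var FALLBACK = { minConfidence: 0.7, band: 0.15 };

function defaultsForUserAgent(userAgent) {
  for (var i = 0; i < USER_AGENT_TO_ASR.length; i++) {
    if (userAgent.indexOf(USER_AGENT_TO_ASR[i].match) !== -1) {
      return ASR_DEFAULTS[USER_AGENT_TO_ASR[i].asr];
    }
  }
  return FALLBACK;
}
```
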
> In your example with maxnbest = 3 and each interpretation having a score
> of 0.75, I would code the app so that it disambiguated the results with
> the caller. Depending on the wording of the prompt and the
> interpretations, I might use phonetic disambiguation like "I found a few
> matches for that. Say 1 for ...." As I hinted at above, I recommend also
> using a confidence band. For example, you might decide that on one ASR
> when scores are within 0.15 of each other, the likelihood that the
> lower confidence interpretation is correct is significant. For another
> ASR, you might require that they be within 0.1 before you disambiguate.
>
> Finally, you will want to be able to override your application wide
> setting for individual prompts. When tuning your app, you may discover
> that in one particular prompt utterances that are clear to you when you
> listen to them are commonly receiving confidence scores just below the
> min confidence level that works well elsewhere in your app. By the way,
> a quick and dirty tuning trick is to temporarily set maxnbest > 1, set
> the min confidence level you send to the ASR lower than you would
> normally use, and then code up a recognition result filter in your app
> that logs all the results, but keeps only the highest result, assuming
> it is greater than your real min confidence level. Then your app behaves
> the same as before for prompts with maxnbest=1, but you can easily see
> all the near matches. This can help you determine if you should adjust
> the min confidence level, support nbest results or rewrite your grammar
> to effectively reduce confidence scores for mismatches.
>
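
A minimal sketch of that result filter, assuming the n-best list arrives
sorted best first in the shape of application.lastresult$.  The function
name, the log callback, and the choice to treat a score equal to the
threshold as passing are my assumptions:

```javascript
// nbest: array of { utterance, confidence }, sorted best first, returned
// while the ASR is given a deliberately low confidence level.
// realMinConfidence: the threshold the app normally uses.
// log: function called once per result (hypothetical hook).
// Returns a list holding only the top result if it clears the real
// threshold, otherwise an empty list (to be treated as a no-match).
function tuningFilter(nbest, realMinConfidence, log) {
  for (var i = 0; i < nbest.length; i++) {
    log(nbest[i].utterance + " " + nbest[i].confidence);
  }
  if (nbest.length > 0 && nbest[0].confidence >= realMinConfidence) {
    return [nbest[0]];
  }
  return [];
}
```
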
> Hope this is helpful,
> Robert Stewart
> Voxify
>
> Shane Smith wrote:
> >
> > I'm working with a platform that handles confidence scoring a bit
> > differently than I'm used to.
> >
> > From their guide:
> > "You may find that the above filtering algorithm is not fully
> > satisfying for your specific application. If so, you may want your
> > system to look at your confidence scores, but also look at the
> > confidence score distance between the first result and the second
> > result of your N-best list. Indeed, if two results roughly have the
> > same confidence scores, the first one may not be the right one."
> >
> > The vxml2.0 spec definitely leaves room for interpretation on how
> > individual platforms can determine confidence scoring of utterances.
> > But after speaking with the engineers of this engine, I've found it
> > wouldn't be uncommon to expect an n-best list with multiple scores
> > above your confidence threshold.  In fact, you could conceivably get
> > back an n-best list with multiple scores all over 90%!  I understand
> > the wiggle room allowed for platforms in the spec, but this goes
> > against the spirit of the spec.  Many examples in the spec show the
> > use of the confidence score to determine whether or not to reprompt or
> > confirm the caller's input.
> >
> >            <if cond="application.lastresult$.confidence &lt; 0.7">
> >               <goto nextitem="confirmlinkdialog"/>
> >            <else/>
> >               <goto next="./main_menu.html"/>
> >            </if>
> >
> > That code (from the spec) gives an example of confirmation when the
> > top utterance confidence score is below 70%.  Now imagine what would
> > happen if you have an n-best list 3 items long, all with 75%
> > confidence.  The application wouldn't confirm, even though you can't
> > be 'confident' of the entry.  (you are in fact only 33% sure the
> > caller said what you think they said) This also means that an
> > application you develop for one engine, would indeed behave very
> > differently on this engine (and vice versa).  While one expects
> > different degrees of accuracy amongst the different ASR vendors, this
> > actually causes change in functionality of the application itself.
> > (I'd have to write an algorithm in javascript to score based on the
> > delta between different entries on the n-best list)
> >
> > Does anyone have any insight (or potentially an algorithm) to work
> > around this platform inconsistency?
> >
> > Thanks,
> > Shane Smith
> >
> >
>
>
Received on Thursday, 1 November 2007 17:11:55 UTC