
RE: Confidence Scoring Weirdness

From: Andrew Hunt <andrew.hunt@holly.com.au>
Date: Thu, 1 Nov 2007 23:02:53 -0700
Message-ID: <14F99AAB26DBE04E87B64C2525879482046317CA@ehost005-2.exch005intermedia.net>
To: "Robert Stewart" <robert@wombatnation.com>, "Shane Smith" <safarishane@gmail.com>
Cc: <www-voice@w3.org>


It's accurate that confidence scores vary according to the speech
recognizer used.  So a score of 0.5 does not necessarily mean the same
degree of confidence for different vendors, nor does it necessarily
mean a 50% chance of being right.

I agree that this is a consideration for the portability of VoiceXML
applications.  Robert's suggestions for providing application-level
configuration are good.
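
As a rough sketch of the application-level configuration Robert describes
below (per-ASR defaults chosen from the user agent, with per-prompt
overrides), something like the following could work.  Every name,
user-agent pattern and threshold value here is a hypothetical placeholder;
real values would have to come from tuning against each recognizer.

```javascript
// Hypothetical per-ASR defaults; the numbers are placeholders, not vendor guidance.
const ASR_DEFAULTS = {
  "asr-a": { minConfidence: 0.5, disambiguationBand: 0.15 },
  "asr-b": { minConfidence: 0.7, disambiguationBand: 0.1 },
};

// Hypothetical mapping from user-agent substrings to ASR identifiers.
const USER_AGENT_TO_ASR = [
  { pattern: "ExampleBrowserA", asr: "asr-a" },
  { pattern: "ExampleBrowserB", asr: "asr-b" },
];

// Used when the user agent is not recognized (an assumption of this sketch).
const FALLBACK_ASR = "asr-a";

function asrForUserAgent(userAgent) {
  const hit = USER_AGENT_TO_ASR.find((m) => userAgent.includes(m.pattern));
  return hit ? hit.asr : FALLBACK_ASR;
}

// Per-prompt overrides win over the application-wide defaults for the ASR.
function confidenceSettings(userAgent, promptOverrides = {}) {
  return { ...ASR_DEFAULTS[asrForUserAgent(userAgent)], ...promptOverrides };
}
```

The user agent identifies the browser rather than the ASR version, so the
mapping table is only as good as your knowledge of what is deployed behind
each browser.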

It's years since the VB Working Group addressed the topic of providing
guidelines on confidence scores to increase portability.  Perhaps they
could revisit the subject?

By way of explanation, different ASR vendors view confidence scores as
having quite different meanings.  One approach is that the score
reflects the "acoustic match" -- that is, how closely the input matches
the phonetic patterns of the recognizer.  Another approach tries to
determine how likely the guess is to be correct, taking into account the
acoustic match and other factors such as the probability of the word (in
some cases this model tries to predict % correctness, but there are
intrinsic limits to this prediction).  There are many other variations,
and many bright people have spent good time debating the relative
merits; it is probably still an area of active research/investment.

Take the following contrived example:

<rule id="homophones">
  <one-of>
    <item weight="100"> bare </item>
    <item weight="1"> bear </item>
  </one-of>
</rule>

Since "bare" and "bear" are spoken the same, the acoustic match will
usually be identical, so a recognizer that scores on acoustic match alone
will give them the same confidence score.  For a recognizer that
incorporates the probability into the score, "bare" is 100 times more
likely (at least that's what the grammar writer says) so it will probably
get a higher confidence score.  (Lesson: homophone ambiguity needs
attention.)

For acoustic score recognizers a common approach in apps is to compare
the top two scores - the higher the score AND the wider the difference
the less reason to confirm with the caller.
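
A minimal sketch of that top-two comparison, assuming the n-best list is
available as an array of objects with a confidence property.  The function
name, the result shape and both thresholds are hypothetical; suitable
threshold values depend on the recognizer.

```javascript
// Decide what to do with an n-best list of { utterance, confidence } results.
// minConfidence and minMargin are hypothetical, recognizer-specific tuning values.
function decideAction(nbest, minConfidence, minMargin) {
  if (nbest.length === 0) return "nomatch";
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...nbest].sort((a, b) => b.confidence - a.confidence);
  const top = sorted[0];
  if (top.confidence < minConfidence) return "reject";
  // With a single result there is no runner-up to compare against.
  if (sorted.length === 1) return "accept";
  const margin = top.confidence - sorted[1].confidence;
  return margin >= minMargin ? "accept" : "confirm";
}
```

This only helps when maxnbest is greater than 1; with maxnbest of 1 the
runner-up is never returned and the margin cannot be checked.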

For recognizers that aim to create a probability score a factor to
consider is grammar sizes -- the more choices there are the more
competition there is and so *sometimes* scores are lower.

These are generalisations and you will see variation by vendor.  So have
a chat with your supplier(s) and see what they suggest.

I hope that folks with deeper understanding of the topic can clarify and
correct what's above.



-----Original Message-----
From: www-voice-request@w3.org [mailto:www-voice-request@w3.org] On
Behalf Of Robert Stewart
Sent: Friday, 2 November 2007 6:06 AM
To: Shane Smith
Cc: www-voice@w3.org
Subject: Re: Confidence Scoring Weirdness

Sorry if I wasn't clear in my explanation, but I didn't mean to imply
the ASR in question changed scores based on maxnbest. Actually, I meant
to imply the opposite. I'm not aware of any ASRs that change scores based
on maxnbest.

Looking back through the VXML 2.0 spec section that covers the confidence
level for the name field element, I found the following definition:

The confidence level for the name field and may range from 0.0-1.0. A
value of 0.0 indicates minimum confidence, and a value of 1.0 indicates
maximum confidence.

A platform may use the utterance confidence (the value of
application.lastresult$.confidence) as the value of name$.confidence.
This distinction between field and utterance level confidence is
platform-dependent.

More specific interpretation of a confidence value is platform-dependent
since its computation is likely to differ between platforms.

The definition for application.lastresult$[i].confidence has similar
wording.

I'm with you on the value of understanding the confidence of the top
recognition result versus other possible matches. It's just that I have
found that it is a lot more flexible for me to work with the raw scores to
determine how I want to handle the recognition results.

I searched the VXML 2.0 spec for the word "percentage", but didn't find it
anywhere. Is there somewhere else on the w3c site that implies that
confidence scores can be interpreted as percentages? I haven't looked at
the examples in a long time, so I could easily be missing something.

One area where I would have a problem with the confidence score
representing the likelihood of an interpretation being the correct one is
the case where I have a large grammar where a lot of entries are similar.
The ASR might find several entries that are very good matches. Let's say
I have four very good matches and a few not so good matches. The likelihood
that the highest confidence match is the correct one then might be not
much more than 25%. I would then have had to set my min confidence level
very low to get this result. I definitely would not want to no match on
the recognition, since the reason for the low likelihood of the top result
being the correct one is not an unclear utterance but very
similar grammar entries.
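
To make the arithmetic concrete, here is a minimal sketch that assumes,
purely for illustration, a recognizer that rescales raw scores so they sum
to 1 across the candidates.  The function name and the scores are made up,
and the few weaker matches are left out; including them would only lower
the top share further.

```javascript
// Rescale raw scores so they sum to 1 (one hypothetical way to get
// probability-like values; not how any particular ASR is known to work).
function relativeShares(rawScores) {
  const total = rawScores.reduce((sum, s) => sum + s, 0);
  if (total === 0) return rawScores.map(() => 0);
  return rawScores.map((s) => s / total);
}

// Four very similar grammar entries, all matching the utterance equally well.
const shares = relativeShares([0.75, 0.75, 0.75, 0.75]);
console.log(shares[0]); // 0.25 with these inputs
```

A min confidence level of, say, 0.5 would reject this recognition even
though the utterance itself was clear.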


> Good information Robert, but let me clarify...
> The ASR in question doesn't change scores depending on maxnbest, as you
> suggest.  In fact, maxnbest can be ignored for the purposes of my
> question.
> Assume maxnbest of 1.  Raw confidence scores for all three plausible
> interpretations come back between .90 and .95, but only the first one
> (because of maxnbest) is returned to the application.  Now you have one
> result with a confidence score of .95.  From an application perspective, I
> look at that result and assume that the ASR engine had easily determined
> from all possible utterances what the caller actually matched, with a
> high confidence score.
> But this couldn't be further from the truth.  The ASR engine is
> returning raw scoring, in no way weighted based on other possible
> utterances. While it's 95% sure it said one thing, it's also 92.5% sure it
> said something else, and 90% sure of a third possible utterance.  The way
> the confidence score is described in the spec, it is a score of how sure
> the ASR engine is it got it right.  If there are really three
> possibilities, then how could the ASR engine know with a 95% confidence
> which one is right?
> In other words, the confidence score is determined completely
> independently of other possibilities.
> Let's say for a moment the grammar I'm using that is returning those 3
> high confidence possibilities has 100 possible matching utterances.  Now,
> pretend I just added 900 more utterances to that grammar.  With 1000 possible
> matches, the ASR engine should have more trouble disambiguating on the
> correct utterance, no?  I should expect my confidence scores to drop,
> right?  Wrong.  I'll still get the same 90, 92.5, and 95's I got
> previously, because each utterance is scored independently.
> I strongly believe this is not a good use of confidence scores.  And
> this isn't a fly by night ASR engine company... it's one of the major
> players.  I do not argue that raw scoring can't be useful.  It can be,
> especially when tuning an application.  But if the ASR engine won't weight
> scores based on all reasonable possibilities, developers will find themselves
> getting behavior completely unexpected based on what the spec says
> confidence scoring should be and examples on the w3 site on how confidence
> scoring should be used.
> Regards,
> Shane Smith
> On 11/1/07, Robert Stewart <robert@wombatnation.com> wrote:
>> Shane,
>> The confidence scores should not be thought of as percentages. As you
>> point out, viewing the scores as percentages when setting maxnbest > 1
>> is problematic. The ASR won't be scoring all possible matches unless
>> your ASR supports maxnbest = infinity (just kidding). Actually, I suppose
>> it could make them sum to 100% for each possible semantic interpretation
>> plus a no match. But, I don't think you would really want confidence
>> scores that vary for a recognition based on the number of possible
>> interpretations, e.g., if you had a dynamically generated grammar that
>> varied greatly in size based on the result of the previous prompt. Even
>> if the percentages had to sum to 100 only for the nbest results, I
>> don't think you would want the scores varying based on the current
>> maxnbest setting.
>> Instead, you should just view them literally as scores that can be used
>> in an absolute (e.g., is it above a minimum threshold) and relative
>> sense (e.g., is the score for this interpretation enough greater than
>> another that I don't need to disambiguate with the caller). The
>> confidence scores are calculated independently of each other, leaving
>> you to decide their relevance in comparison to each other. This is
>> actually a good thing.
>> Also, you are quite right that the confidence scores can vary
>> dramatically between ASR engines. In my experience with Nuance 8.5, a
>> confidence score of 0.6 is often a pretty good match. By good, I mean
>> likely correct. On Nuance/ScanSoft OSR 3.x, a 0.6 is often a very poor
>> match. To make things even more complicated, at some point in OSR's
>> past, a significant change was made to the confidence scores that are
>> generated for the same utterance, so a 0.9 with the old version of OSR
>> might often be a poor match while a 0.9 on the newer version of OSR
>> might be a good match. See the OSR documentation for details.
>> So, what's a voice (web) app developer to do? First, you have to decide
>> which VXML (HTML) browsers you are going to support. Just like web sites
>> often have HTML and CSS code that is browser specific, often so must
>> speech apps have browser specific code, whether manually generated or
>> handled by your development tool/runtime. We have the additional complication
>> of needing to take into account which ASR and TTS engines are sitting
>> behind the browser. Spend a little time with different TTS engines and
>> you will also discover that the same rate and volume settings can produce
>> significantly different results on different engines.
>> If you know you want to support more than one ASR, then I recommend you
>> set up some application wide confidence-related properties, e.g., min
>> confidence, passive versus active confirmation required, disamb
>> confidence band, etc. Then, you need to set those defaults based on
>> which ASR is being used. One crude way to do this is to use the user
>> agent string in the first HTTP request your app receives. The problem,
>> of course, is you really need to know the ASR version. But, if you
>> are controlling where your app is running you can create a mapping
>> that maps user agents to ASR identifiers.
>> In your example with maxnbest = 3 and each interpretation having a confidence
>> of 0.75, I would code the app so that it disambiguated the results with
>> the caller. Depending on the wording of the prompt and the
>> interpretations, I might use phonetic disambiguation like "I found a few
>> matches for that. Say 1 for ...." As I hinted at above, I recommend
>> using a confidence band. For example, you might decide that on one ASR,
>> when scores are within 0.15 of each other, the likelihood that the
>> lower confidence interpretation is correct is significant. For another
>> ASR, you might require that they be within 0.1 before you disambiguate.
>> Finally, you will want to be able to override your application wide
>> setting for individual prompts. When tuning your app, you may find
>> that in one particular prompt utterances that are clear to you when you
>> listen to them are commonly receiving confidence scores just below the
>> min confidence level that works well elsewhere in your app. By the way,
>> a quick and dirty tuning trick is to temporarily set maxnbest > 1, set
>> the min confidence level you send to the ASR lower than you would
>> normally use, and then code up a recognition result filter in your app
>> that logs all the results, but keeps only the highest result, provided
>> it is greater than your real min confidence level. Then your app behaves
>> the same as before for prompts with maxnbest=1, but you can easily see
>> all the near matches. This can help you determine if you should adjust
>> the min confidence level, support nbest results or rewrite your grammar
>> to effectively reduce confidence scores for mismatches.
>> Hope this is helpful,
>> Robert Stewart
>> Voxify
>> Shane Smith wrote:
>> >
>> > I'm working with a platform that handles confidence scoring a bit
>> > differently than I'm used to.
>> >
>> > From their guide:
>> > "You may find that the above filtering algorithm is not fully
>> > satisfying for your specific application. If so, you may want your
>> > system to look at your confidence scores, but also look at the
>> > confidence score distance between the first result and the second
>> > result of your N-best list. Indeed, if two results roughly have the
>> > same confidence scores, the first one may not be the right one."
>> >
>> > The vxml2.0 spec definitely leaves room for interpretation on how
>> > individual platforms can determine confidence scoring of utterances.
>> > But after speaking with the engineers of this engine, I've found it
>> > wouldn't be uncommon to expect an n-best list with multiple scores
>> > above your confidence threshold.  In fact, you could conceivably get
>> > back an n-best list with multiple scores all over 90%!  I understand
>> > the wiggle room allowed for platforms in the spec, but this goes
>> > against the spirit of the spec.  Many examples in the spec show the
>> > use of the confidence score to determine whether or not to reprompt or
>> > confirm the caller's input.
>> >
>> >            <if cond="application.lastresult$.confidence &lt; 0.7">
>> >               <goto nextitem="confirmlinkdialog"/>
>> >            <else/>
>> >               <goto next="./main_menu.html"/>
>> >            </if>
>> >
>> > That code (from the spec) gives an example of confirmation when the
>> > top utterance confidence score is below 70%.  Now imagine what would
>> > happen if you have an n-best list 3 items long, all with 75%
>> > confidence.  The application wouldn't confirm, even though you can't
>> > be 'confident' of the entry.  (you are in fact only 33% sure the
>> > caller said what you think they said) This also means that an
>> > application you develop for one engine, would indeed behave very
>> > differently on this engine (and vice versa).  While one expects
>> > different degrees of accuracy amongst the different ASR vendors, this
>> > actually causes change in functionality of the application itself.
>> > (I'd have to write an algorithm in javascript to score based on the
>> > delta between different entries on the n-best list)
>> >
>> > Does anyone have any insight (or potentially an algorithm) to work
>> > around this platform inconsistency?
>> >
>> > Thanks,
>> > Shane Smith
>> >
>> >
Received on Friday, 2 November 2007 06:02:25 UTC
