Re: Confidence Scoring Weirdness from Robert Stewart on 2007-11-01 (www-voice@w3.org from October to December 2007)

From: Robert Stewart <robert@wombatnation.com>
Date: Thu, 1 Nov 2007 12:06:18 -0700 (PDT)
To: "Shane Smith" <safarishane@gmail.com>
Cc: www-voice@w3.org
Message-ID: <21419.64.2.18.241.1193943978.squirrel@webmail.wombatnation.com>
Sorry if I wasn't clear in my explanation, but I didn't mean to imply that
the ASR in question changed scores based on maxnbest. Actually, I meant to
imply the opposite. I'm not aware of any ASRs that change scores based on
maxnbest.

Looking back through the VXML 2.0 spec section that covers the confidence
level for the name field element, I found the following definition:

*********
The confidence level for the name field and may range from 0.0-1.0. A
value of 0.0 indicates minimum confidence, and a value of 1.0 indicates
maximum confidence.

A platform may use the utterance confidence (the value of
application.lastresult$.confidence) as the value of name$.confidence. This
distinction between field and utterance level confidence is
platform-dependent.

More specific interpretation of a confidence value is platform-dependent
since its computation is likely to differ between platforms.
*********

The definition for application.lastresult$[i].confidence has similar wording.

I'm with you on the value of understanding the confidence of the top
recognition result versus other possible matches. It's just that I have
found that it is a lot more flexible for me to work with the raw scores to
determine how I want to handle the recognition results.
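
As a minimal sketch of what working with the raw scores can look like,
the function below combines a minimum threshold with the confidence band
described in the earlier message quoted below. The name decideAction, its
parameters, and the returned action strings are hypothetical. The input is
assumed to be an array shaped like application.lastresult$, already sorted
from highest to lowest confidence. It is written with var and plain loops
on the assumption that it may need to run in an older ECMAScript engine.

```javascript
// Hypothetical helper: classify an n-best list using raw confidence scores.
// `results` mimics application.lastresult$: objects with a `confidence`
// property, assumed to be sorted from highest to lowest confidence.
function decideAction(results, minConfidence, band) {
  var candidates = [];
  var i;
  // Absolute check: the top result must clear the minimum threshold.
  if (results.length === 0 || results[0].confidence < minConfidence) {
    return { action: "nomatch", candidates: candidates };
  }
  // Relative check: keep every result that clears the threshold and
  // scores within `band` of the top result.
  for (i = 0; i < results.length; i++) {
    if (results[i].confidence >= minConfidence &&
        results[0].confidence - results[i].confidence <= band) {
      candidates.push(results[i]);
    }
  }
  return {
    action: candidates.length > 1 ? "disambiguate" : "accept",
    candidates: candidates
  };
}
```

Both minConfidence and band would need to be tuned per ASR, since the same
raw score can mean quite different things on different engines.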

I searched the VXML 2.0 spec for the word "percentage", but didn't find it
anywhere. Is there somewhere else on the w3c site that implies that
confidence scores can be interpreted as percentages? I haven't looked at
the examples in a long time, so I could easily be missing something.

One area where I would have a problem with the confidence score
representing the likelihood of an interpretation being the correct one is
the case where I have a large grammar in which a lot of entries are similar.
The ASR might find several entries that are very good matches. Let's say I
have four very good matches and a few not so good matches. The likelihood
that the highest confidence match is the correct one might then be not
much more than 25%. I would then have to set my min confidence level
very low to accept this result. I definitely would not want a no match on
the recognition, since the low likelihood of the top utterance being the
correct one is due not to an unclear utterance but to very similar
grammar entries.
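
To make the arithmetic concrete, here is a rough sketch under a made-up
assumption: that a "likelihood" is obtained by rescaling the raw scores so
they sum to 1. The function normalizedShares, the score values, and the 0.5
threshold are all hypothetical, and no particular ASR is claimed to work
this way.

```javascript
// Hypothetical illustration: rescale raw scores so they sum to 1.
// This is an assumed normalization, not how any particular ASR works.
function normalizedShares(scores) {
  var total = 0;
  var shares = [];
  var i;
  for (i = 0; i < scores.length; i++) {
    total += scores[i];
  }
  for (i = 0; i < scores.length; i++) {
    shares.push(total > 0 ? scores[i] / total : 0);
  }
  return shares;
}

// Four near-identical grammar entries plus two weak matches (made-up values).
var raw = [0.93, 0.92, 0.91, 0.90, 0.20, 0.15];
var shares = normalizedShares(raw);
var minConfidence = 0.5; // assumed application threshold

// The raw top score clears the threshold; its normalized share does not,
// even though the utterance itself was presumably clear.
var rawPasses = raw[0] >= minConfidence;
var sharePasses = shares[0] >= minConfidence;
```

With these values the top entry's share comes out at about a quarter or
less, so a threshold tuned for raw scores would reject a perfectly clear
utterance.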

Robert

> Good information Robert, but let me clarify...
>
> The ASR in question doesn't change scores depending on maxnbest, as you
> suggest.  In fact, maxnbest can be ignored for the purposes of my
> question.
>
> Assume maxnbest of 1.  Raw confidence scores for all three plausible
> interpretations come back between .90 and .95, but only the first one
> (because of maxnbest) is returned to the application.  Now you have one
> result with a confidence score of .95.  From an application perspective,
> I would look at that result and assume that the ASR engine had easily
> identified from all possible utterances what the caller actually matched,
> with a 95% confidence score.
>
> But this couldn't be further from the truth.  The ASR engine is actually
> returning raw scoring, in no way weighted based on other possible
> utterances. While it's 95% sure it said one thing, it's also 92.5% sure it
> said something else, and 90% sure of a third possible utterance.  The way
> the confidence score is described in the spec, it is a score of how 'sure'
> the ASR engine is it got it right.  If there are really three plausible
> possibilities, then how could the ASR engine know with a 95% confidence
> which one is right?
>
> In other words, the confidence score is determined completely
> independently of other possibilities.
>
> Let's say for a moment the grammar I'm using that is returning those 3
> high confidence possibilities has 100 possible matching utterances.  Now,
> pretend I just added 900 more utterances to that grammar.  With 1000
> possible matches, the ASR engine should have more trouble disambiguating
> on the correct utterance, no?  I should expect my confidence scores to
> drop, right?  Wrong.  I'll still get the same 90, 92.5, and 95's I got
> previously, because each utterance is scored independently.
>
> I strongly believe this is not a good use of confidence scores.  And no,
> this isn't a fly by night ASR engine company... it's one of the major
> players.  I do not argue that raw scoring can't be useful.  It can be,
> especially when tuning an application.  But if the ASR engine won't weight
> scores based on all reasonable possibilities, developers themselves are
> getting behavior completely unexpected based on what the spec says
> confidence scoring should be and examples on the w3 site on how confidence
> scoring should be used.
>
> Regards,
> Shane Smith
>
>
>
> On 11/1/07, Robert Stewart <robert@wombatnation.com> wrote:
>>
>> Shane,
>>
>> The confidence scores should not be thought of as percentages. As you
>> point out, viewing the scores as percentages when setting maxnbest > 1
>> is problematic. The ASR won't be scoring all possible matches unless
>> your ASR supports maxnbest = infinity (just kidding). Actually, I guess
>> it could make them sum to 100% for each possible semantic interpretation
>> plus a no match. But, I don't think you would really want confidence
>> scores that vary for a recognition based on the number of possible
>> interpretations, e.g., if you had a dynamically generated grammar that
>> varied greatly in size based on the result of the previous prompt. Even
>> if the percentages had to sum to 100 only for the nbest results, I still
>> don't think you would want the scores varying based on the current
>> maxnbest setting.
>>
>> Instead, you should just view them literally as scores that can be used
>> in an absolute (e.g., is it above a minimum threshold) and relative
>> sense (e.g., is the score for this interpretation enough greater than
>> another that I don't need to disambiguate with the caller). The
>> confidence scores are calculated independently of each other, leaving
>> you to decide their relevance in comparison to each other. This is
>> actually a good thing.
>>
>> Also, you are quite right that the confidence scores can vary
>> dramatically between ASR engines. In my experience with Nuance 8.5, a
>> confidence score of 0.6 is often a pretty good match. By good, I mean
>> likely correct. On Nuance/ScanSoft OSR 3.x, a 0.6 is often a very poor
>> match. To make things even more complicated, at some point in OSR's
>> past, a significant change was made to the confidence scores that are
>> generated for the same utterance, so a 0.9 with the old version of OSR
>> might often be a poor match while a 0.9 on the newer version of OSR
>> might be a good match. See the OSR documentation for details.
>>
>> So, what's a voice (web) app developer to do? First, you have to decide
>> which VXML (HTML) browsers you are going to support. Just as web sites
>> often have HTML and CSS code that is browser specific, speech apps often
>> need browser specific code, whether manually generated or
>> handled by your development tool/runtime. We have the additional problem
>> of needing to take into account which ASR and TTS engines are sitting
>> behind the browser. Spend a little time with different TTS engines and
>> you will also discover that the same rate and volume settings can have
>> significantly different results on different engines.
>>
>> If you know you want to support more than one ASR, then I recommend you
>> set up some application wide confidence-related properties, e.g., min
>> confidence, passive versus active confirmation required, disamb
>> confidence band, etc. Then, you need to set those defaults based on
>> which ASR is being used. One crude way to do this is to use the user
>> agent string in the first HTTP request your app receives. The problem,
>> of course, is you really need to know the ASR version. But, if you
>> are controlling where your app is running you can create a mapping table
>> that maps user agents to ASR identifiers.
>>
>> In your example with maxnbest = 3 and each interpretation having a score
>> of 0.75, I would code the app so that it disambiguated the results with
>> the caller. Depending on the wording of the prompt and the
>> interpretations, I might use phonetic disambiguation like "I found a few
>> matches for that. Say 1 for ...." As I hinted at above, I recommend also
>> using a confidence band. For example, you might decide that on one ASR
>> when scores are within 0.15 of each other, the likelihood that the
>> lower confidence interpretation is correct is significant. For another
>> ASR, you might require that they be within 0.1 before you disambiguate.
>>
>> Finally, you will want to be able to override your application wide
>> setting for individual prompts. When tuning your app, you may discover
>> that in one particular prompt utterances that are clear to you when you
>> listen to them are commonly receiving confidence scores just below the
>> min confidence level that works well elsewhere in your app. By the way,
>> a quick and dirty tuning trick is to temporarily set maxnbest > 1, set
>> the min confidence level you send to the ASR lower than you would
>> normally use, and then code up a recognition result filter in your app
>> that logs all the results, but keeps only the highest result, assuming
>> it is greater than your real min confidence level. Then your app behaves
>> the same as before for prompts with maxnbest=1, but you can easily see
>> all the near matches. This can help you determine if you should adjust
>> the min confidence level, support nbest results or rewrite your grammar
>> to effectively reduce confidence scores for mismatches.
>>
>> Hope this is helpful,
>> Robert Stewart
>> Voxify
>>
>> Shane Smith wrote:
>> >
>> > I'm working with a platform that handles confidence scoring a bit
>> > differently than I'm used to.
>> >
>> > From their guide:
>> > "You may find that the above filtering algorithm is not fully
>> > satisfying for your specific application. If so, you may want your
>> > system to look at your confidence scores, but also look at the
>> > confidence score distance between the first result and the second
>> > result of your N-best list. Indeed, if two results roughly have the
>> > same confidence scores, the first one may not be the right one."
>> >
>> > The vxml2.0 spec definitely leaves room for interpretation on how
>> > individual platforms can determine confidence scoring of utterances.
>> > But after speaking with the engineers of this engine, I've found it
>> > wouldn't be uncommon to expect an n-best list with multiple scores
>> > above your confidence threshold.  In fact, you could conceivably get
>> > back an n-best list with multiple scores all over 90%!  I understand
>> > the wiggle room allowed for platforms in the spec, but this goes
>> > against the spirit of the spec.  Many examples in the spec show the
>> > use of the confidence score to determine whether or not to reprompt or
>> > confirm the caller's input.
>> >
>> >            <if cond="application.lastresult$.confidence &lt; 0.7">
>> >               <goto nextitem="confirmlinkdialog"/>
>> >            <else/>
>> >               <goto next="./main_menu.html"/>
>> >            </if>
>> >
>> > That code (from the spec) gives an example of confirmation when the
>> > top utterance confidence score is below 70%.  Now imagine what would
>> > happen if you have an n-best list 3 items long, all with 75%
>> > confidence.  The application wouldn't confirm, even though you can't
>> > be 'confident' of the entry.  (you are in fact only 33% sure the
>> > caller said what you think they said) This also means that an
>> > application you develop for one engine, would indeed behave very
>> > differently on this engine (and vice versa).  While one expects
>> > different degrees of accuracy amongst the different ASR vendors, this
>> > actually causes change in functionality of the application itself.
>> > (I'd have to write an algorithm in javascript to score based on the
>> > delta between different entries on the n-best list)
>> >
>> > Does anyone have any insight (or potentially an algorithm) to work
>> > around this platform inconsistency?
>> >
>> > Thanks,
>> > Shane Smith
>> >
>> >
>>
>>
>
Received on Thursday, 1 November 2007 19:06:39 UTC