
Confidence Scoring Weirdness

From: Shane Smith <safarishane@gmail.com>
Date: Wed, 31 Oct 2007 13:45:33 -0700
Message-ID: <8fc15e140710311345q1c947520x2a1fd67ac4bdf706@mail.gmail.com>
To: www-voice@w3.org
I'm working with a platform that handles confidence scoring a bit
differently than I'm used to.

From their guide:
"You may find that the above filtering algorithm is not fully satisfying for
your specific application. If so, you may want your system to look at your
confidence scores, but also look at the confidence score distance between
the first result and the second result of your N-best list. Indeed, if two
results roughly have the same confidence scores, the first one may not be
the right one."
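To make sure I'm reading the guide correctly, here's a rough sketch of the
distance-based filtering it describes: accept the top result only if its
confidence clears a threshold AND it sits far enough ahead of the runner-up.
The function name, result shape, and thresholds are all my own invention, not
anything from their guide.

```javascript
// Accept nbest[0] only if it clears minConfidence and is at least minDelta
// ahead of the second result. Purely illustrative names and thresholds.
function acceptTopResult(nbest, minConfidence, minDelta) {
  if (nbest.length === 0) return false;
  var top = nbest[0].confidence;
  if (top < minConfidence) return false;   // the usual threshold check
  if (nbest.length === 1) return true;     // no competitor to compare against
  var delta = top - nbest[1].confidence;   // distance to the second result
  return delta >= minDelta;                // too close means ambiguous
}

// Three results tied at 0.75 would be rejected with minDelta = 0.1:
var nbest = [
  { utterance: "main menu", confidence: 0.75 },
  { utterance: "main men",  confidence: 0.75 },
  { utterance: "maine",     confidence: 0.75 }
];
// acceptTopResult(nbest, 0.7, 0.1) → false
```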

The VoiceXML 2.0 spec definitely leaves room for interpretation in how
individual platforms determine confidence scores for utterances.  But after
speaking with the engineers of this engine, I've found it wouldn't be
uncommon to get back an n-best list with multiple scores above your
confidence threshold.  In fact, you could conceivably get back an n-best list
with multiple scores all over 90%!  I understand the wiggle room the spec
allows platforms, but this goes against the spirit of the spec.  Many
examples in the spec show the use of the confidence score to determine
whether or not to reprompt or confirm the caller's input.

           <if cond="application.lastresult$.confidence &lt; 0.7">
              <goto nextitem="confirmlinkdialog"/>
           <else/>
              <goto next="./main_menu.html"/>
           </if>

That code (from the spec) gives an example of confirmation when the top
utterance's confidence score is below 70%.  Now imagine what would happen if
you have an n-best list three items long, all with 75% confidence.  The
application wouldn't confirm, even though you can't be 'confident' of the
entry (you are in fact only 33% sure the caller said what you think they
said).  This also means that an application you develop for one engine would
behave very differently on this engine (and vice versa).  While one expects
different degrees of accuracy among the different ASR vendors, this actually
changes the functionality of the application itself.  (I'd have to write an
algorithm in JavaScript to score based on the delta between entries on the
n-best list.)
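One direction I've been toying with is renormalizing the scores so that
near-ties drag the effective confidence down, which captures the "only 33%
sure" intuition above.  A minimal sketch, assuming the results array has the
shape the spec gives application.lastresult$ (the function name and the idea
of dividing by the sum are mine, not anything from the spec):

```javascript
// Treat the n-best confidences as relative weights and renormalize, so three
// results tied at 0.75 yield an effective confidence of only ~0.33.
// `results` mimics the shape of application.lastresult$; illustrative only.
function effectiveConfidence(results) {
  var sum = 0;
  for (var i = 0; i < results.length; i++) {
    sum += results[i].confidence;
  }
  return sum > 0 ? results[0].confidence / sum : 0;
}

var tied = [
  { confidence: 0.75 }, { confidence: 0.75 }, { confidence: 0.75 }
];
// effectiveConfidence(tied) → 0.333..., well under the 0.7 threshold
```

The renormalized value could then stand in for
application.lastresult$.confidence in the spec's &lt;if&gt; check, so the
tied case above would correctly fall through to confirmation.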

Does anyone have any insight (or potentially an algorithm) to work around
this platform inconsistency?

Thanks,
Shane Smith
Received on Wednesday, 31 October 2007 20:45:41 GMT
