- From: Shane Smith <safarishane@gmail.com>
- Date: Wed, 31 Oct 2007 13:45:33 -0700
- To: www-voice@w3.org
- Message-ID: <8fc15e140710311345q1c947520x2a1fd67ac4bdf706@mail.gmail.com>
I'm working with a platform that handles confidence scoring a bit differently than I'm used to. From their guide:

"You may find that the above filtering algorithm is not fully satisfying for your specific application. If so, you may want your system to look at your confidence scores, but also look at the confidence score distance between the first result and the second result of your N-best list. Indeed, if two results roughly have the same confidence scores, the first one may not be the right one."

The VoiceXML 2.0 spec definitely leaves room for interpretation in how individual platforms determine confidence scores for utterances. But after speaking with the engineers of this engine, I've found it wouldn't be uncommon to get back an N-best list with multiple scores above your confidence threshold. In fact, you could conceivably get back an N-best list with multiple scores all over 90%!

I understand the wiggle room the spec allows platforms, but this goes against the spirit of the spec. Many examples in the spec use the confidence score to decide whether or not to reprompt or confirm the caller's input:

  <if cond="application.lastresult$.confidence &lt; 0.7">
    <goto nextitem="confirmlinkdialog"/>
  <else/>
    <goto next="./main_menu.html"/>
  </if>

That code (from the spec) confirms when the confidence score of the top utterance is below 70%. Now imagine what would happen if you have an N-best list three items long, all with 75% confidence. The application wouldn't confirm, even though you can't really be 'confident' of the entry. (You are in fact only 33% sure the caller said what you think they said.)

This also means that an application you develop for one engine would behave very differently on this engine (and vice versa). While one expects different degrees of accuracy amongst the different ASR vendors, this actually changes the functionality of the application itself. (I'd have to write an algorithm in JavaScript to score based on the delta between the different entries of the N-best list.)

Does anyone have any insight (or potentially an algorithm) to work around this platform inconsistency?

Thanks,
Shane Smith
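P.S. A rough sketch of the kind of delta-based check I have in mind, written as ECMAScript that could live in a <script> element. The 0.7 confidence floor comes from the spec example; the 0.15 minimum delta is just a placeholder that would need tuning per application, and the whole thing assumes the platform actually returns more than one N-best entry (maxnbest > 1).

  // Returns true if the result should be confirmed: either the top
  // confidence is below the floor, or the runner-up is too close to it.
  function needsConfirmation(results, minConfidence, minDelta) {
    var top = results[0];
    if (top.confidence < minConfidence) {
      return true;                      // weak absolute confidence
    }
    if (results.length > 1) {
      var delta = top.confidence - results[1].confidence;
      if (delta < minDelta) {
        return true;                    // ambiguous: scores too close
      }
    }
    return false;                       // confident and well separated
  }

Used in place of the spec's condition, something like:

  <if cond="needsConfirmation(application.lastresult$, 0.7, 0.15)">
    <goto nextitem="confirmlinkdialog"/>
  <else/>
    <goto next="./main_menu.html"/>
  </if>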
Received on Wednesday, 31 October 2007 20:45:41 UTC