- From: Robert Stewart <robert@wombatnation.com>
- Date: Thu, 01 Nov 2007 09:20:20 -0700
- To: Shane Smith <safarishane@gmail.com>
- CC: www-voice@w3.org
Shane,

The confidence scores should not be thought of as percentages. As you point out, viewing the scores as percentages when setting maxnbest > 1 is problematic. The ASR won't be scoring all possible matches unless your ASR supports maxnbest = infinity (just kidding). Actually, I guess it could make them sum to 100% for each possible semantic interpretation plus a no-match. But I don't think you would really want confidence scores that vary for a recognition based on the number of possible interpretations, e.g., if you had a dynamically generated grammar that varied greatly in size based on the result of the previous prompt. Even if the percentages had to sum to 100 only for the nbest results, I still don't think you would want the scores varying based on the current maxnbest setting.

Instead, you should view them literally as scores that can be used in an absolute sense (e.g., is it above a minimum threshold?) and a relative sense (e.g., is the score for this interpretation enough greater than another's that I don't need to disambiguate with the caller?). The confidence scores are calculated independently of each other, leaving you to decide their relevance in comparison to each other. This is actually a good thing.

Also, you are quite right that the confidence scores can vary dramatically between ASR engines. In my experience with Nuance 8.5, a confidence score of 0.6 is often a pretty good match. By good, I mean likely correct. On Nuance/ScanSoft OSR 3.x, a 0.6 is often a very poor match. To make things even more complicated, at some point in OSR's past a significant change was made to the confidence scores generated for the same utterance, so a 0.9 with the old version of OSR might often be a poor match, while a 0.9 on the newer version might be a good match. See the OSR documentation for details.

So, what's a voice (web) app developer to do? First, you have to decide which VXML (HTML) browsers you are going to support. Just as web sites often have browser-specific HTML and CSS code, speech apps often must have browser-specific code, whether manually generated or handled by your development tool/runtime. We have the additional problem of needing to take into account which ASR and TTS engines are sitting behind the browser. Spend a little time with different TTS engines and you will also discover that the same rate and volume settings can have significantly different results on different engines.

If you know you want to support more than one ASR, then I recommend you set up some application-wide confidence-related properties, e.g., min confidence, passive versus active confirmation required, disambiguation confidence band, etc. Then you need to set those defaults based on which ASR is being used. One crude way to do this is to use the user agent string in the first HTTP request your app receives. The problem, of course, is that you really need to know the ASR version. But if you control where your app is running, you can create a mapping table that maps user agents to ASR identifiers.

In your example with maxnbest = 3 and each interpretation having a score of 0.75, I would code the app so that it disambiguated the results with the caller. Depending on the wording of the prompt and the interpretations, I might use phonetic disambiguation like "I found a few matches for that. Say 1 for ...." As I hinted at above, I recommend also using a confidence band.
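To make that concrete, here is a rough ECMAScript sketch of the kind of thing I have in mind. The engine keys, user-agent substrings, and threshold values are all made up for illustration; you would replace them with whatever your platforms actually report and whatever values your tuning suggests. It assumes n-best results shaped like application.lastresult$ in VoiceXML 2.0, i.e., an array whose elements each have a confidence property.

    // Illustration only: engine identifiers, user-agent substrings, and
    // numeric values are placeholders, not recommendations.
    var asrDefaults = {
        "nuance85": { minConfidence: 0.45, disambBand: 0.15 },
        "osr3":     { minConfidence: 0.60, disambBand: 0.10 },
        "unknown":  { minConfidence: 0.50, disambBand: 0.10 }
    };

    // Crude mapping from the browser's user agent string to an ASR
    // identifier.  In practice you maintain this table by hand for the
    // platforms you control, since the user agent rarely names the ASR
    // version directly.
    function asrIdForUserAgent(userAgent) {
        if (userAgent.indexOf("Nuance") != -1) return "nuance85";
        if (userAgent.indexOf("OSR") != -1)    return "osr3";
        return "unknown";
    }

    // Decide what to do with an n-best list such as application.lastresult$:
    // reject (reprompt), disambiguate with the caller, or accept the top result.
    function decideNBest(nbest, settings) {
        if (nbest.length == 0 || nbest[0].confidence < settings.minConfidence) {
            return "reject";
        }
        if (nbest.length > 1 &&
            (nbest[0].confidence - nbest[1].confidence) < settings.disambBand) {
            return "disambiguate";
        }
        return "accept";
    }

With three interpretations all at 0.75, the gap between the first and second result is 0, so this falls into the disambiguate case no matter which band value you pick.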
For example, you might decide that on one ASR, when scores are within 0.15 of each other, the likelihood that the lower-confidence interpretation is correct is significant. For another ASR, you might require that they be within 0.1 before you disambiguate.

Finally, you will want to be able to override your application-wide settings for individual prompts. When tuning your app, you may discover that in one particular prompt, utterances that are clear to you when you listen to them are commonly receiving confidence scores just below the min confidence level that works well elsewhere in your app.

By the way, a quick and dirty tuning trick is to temporarily set maxnbest > 1, set the min confidence level you send to the ASR lower than you would normally use, and then code up a recognition result filter in your app that logs all the results but keeps only the highest result, assuming it is greater than your real min confidence level. Then your app behaves the same as before for prompts with maxnbest = 1, but you can easily see all the near matches. This can help you determine whether you should adjust the min confidence level, support nbest results, or rewrite your grammar to effectively reduce confidence scores for mismatches. (A rough sketch of such a filter appears at the end of this message, after your quoted note.)

Hope this is helpful,

Robert Stewart
Voxify

Shane Smith wrote:
>
> I'm working with a platform that handles confidence scoring a bit
> differently than I'm used to.
>
> From their guide:
> "You may find that the above filtering algorithm is not fully
> satisfying for your specific application. If so, you may want your
> system to look at your confidence scores, but also look at the
> confidence score distance between the first result and the second
> result of your N-best list. Indeed, if two results roughly have the
> same confidence scores, the first one may not be the right one."
>
> The VXML 2.0 spec definitely leaves room for interpretation on how
> individual platforms can determine confidence scoring of utterances.
> But after speaking with the engineers of this engine, I've found it
> wouldn't be uncommon to expect an n-best list with multiple scores
> above your confidence threshold. In fact, you could conceivably get
> back an n-best list with multiple scores all over 90%! I understand
> the wiggle room allowed for platforms in the spec, but this goes
> against the spirit of the spec. Many examples in the spec show the
> use of the confidence score to determine whether or not to reprompt or
> confirm the caller's input.
>
> <if cond="application.lastresult$.confidence &lt; 0.7">
>   <goto nextitem="confirmlinkdialog"/>
> <else/>
>   <goto next="./main_menu.html"/>
> </if>
>
> That code (from the spec) gives an example of confirmation when the
> top utterance confidence score is below 70%. Now imagine what would
> happen if you have an n-best list 3 items long, all with 75%
> confidence. The application wouldn't confirm, even though you can't
> be 'confident' of the entry. (You are in fact only 33% sure the
> caller said what you think they said.) This also means that an
> application you develop for one engine would indeed behave very
> differently on this engine (and vice versa). While one expects
> different degrees of accuracy amongst the different ASR vendors, this
> actually causes a change in the functionality of the application itself.
> (I'd have to write an algorithm in javascript to score based on the
> delta between different entries on the n-best list.)
>
> Does anyone have any insight (or potentially an algorithm) to work
> around this platform inconsistency?
>
> Thanks,
> Shane Smith
>
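Here is the rough sketch of the recognition result filter I mentioned above, again in ECMAScript. The logging call and the threshold value are placeholders standing in for whatever logging facility and tuned settings your app actually has; it assumes the same array-of-results shape as application.lastresult$.

    // Sketch only: logNearMatch and REAL_MIN_CONFIDENCE are placeholders
    // for your app's own logging facility and tuned threshold.
    var REAL_MIN_CONFIDENCE = 0.50;

    function filterRecognitionResults(nbest) {
        // Log every result the ASR returned so the near matches show up
        // in the tuning logs.
        for (var i = 0; i < nbest.length; i++) {
            logNearMatch(i, nbest[i].utterance, nbest[i].confidence);
        }
        // Behave as if maxnbest were 1 and the real min confidence were in
        // force: keep only the top result, and only if it clears the
        // threshold; otherwise treat the recognition as a nomatch.
        if (nbest.length > 0 && nbest[0].confidence >= REAL_MIN_CONFIDENCE) {
            return [nbest[0]];
        }
        return [];
    }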
Received on Thursday, 1 November 2007 16:20:37 UTC