
Re: Confidence property

From: Glen Shires <gshires@google.com>
Date: Thu, 14 Jun 2012 14:28:01 -0700
Message-ID: <CAEE5bcg9qLjQn9M0+WOnswBDZH7oaD=Khon+YDh7Krskkpw-aw@mail.gmail.com>
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
To clarify: with Proposal C, Group 3 developers do NOT have to translate
any scores and they can use "direct import".



On Thu, Jun 14, 2012 at 2:20 PM, Glen Shires <gshires@google.com> wrote:

> Yes, good suggestion. Looking at how the proposals affect these three
> groups of web developers is a great way to evaluate them.
>
> Here's how my proposal affects these three groups:
>
> Analysis:
>
> Group 1: No advantage or disadvantage.
>
>
>
> Group 2: This is a perfect solution. These developers only care about
> setting the input confidenceThreshold. Since 0.0 is the
> default confidenceThreshold, if they want to get reasonable rejection
> behavior for nomatch, they can simply do:
>
>    recognizer.confidenceThreshold = 0.5
>
> Since the confidenceThreshold is skewed, they get reasonable nomatch
> behavior with 0.5, which enables them to write recognizer-independent
> code (at least to some extent). If they're still getting too many results,
> they can simply increment or decrement this value, again in
> recognizer-independent code.
>
> Conversely, if the confidenceThreshold were not skewed and instead used
> native recognizer confidence values, there would be no way for these
> developers to write recognizer-independent code and get reasonable
> rejection behavior for nomatch.
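To make the Group 2 pattern concrete, here is a minimal sketch. The `recognizer` object is a stand-in for the API under discussion, and the two adjustment helpers are purely illustrative names, not part of any proposal:

```javascript
// Stand-in for the SpeechRecognition-style object discussed in this thread.
var recognizer = { confidenceThreshold: 0.0 };

// Group 2 style: start from the skewed midpoint, then nudge the
// threshold up or down in recognizer-independent steps.
function startWithReasonableRejection() {
  recognizer.confidenceThreshold = 0.5;
}

function onTooManyGarbageResults() {
  // Too many low-quality results accepted: tighten by one step.
  recognizer.confidenceThreshold =
      Math.min(1.0, recognizer.confidenceThreshold + 0.1);
}

function onTooManyNomatches() {
  // Good utterances being rejected: loosen by one step.
  recognizer.confidenceThreshold =
      Math.max(0.0, recognizer.confidenceThreshold - 0.1);
}

startWithReasonableRejection();
onTooManyGarbageResults();  // threshold moves up by 0.1
onTooManyNomatches();       // and back down by 0.1
```

None of this code depends on which engine is behind `recognizer`, which is the point of the skewed scale.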
>
> Since Group 2 developers don't process or analyze the output confidence
> values (in results or emma), they don't care whether these output values
> are skewed or not.
>
> Summary: This proposal offers big benefits for Group 2.  Conversely, not
> having skewing would be a major hindrance for Group 2 developers because
> they couldn't reliably use nomatch behavior.
>
>
>
> Group 3: This is a perfect solution. These developers want to use native
> recognizer confidence values for both input (setting the threshold) and
> output (processing the results in SpeechRecognitionAlternative.confidence
> or SpeechRecognitionResult.emma). They don't want any skewing that can
> complicate things, and this solution allows them to use only native values
> everywhere; they never have to worry about skewing.  The only thing they
> have to do is cut and paste a simple JavaScript function (which I presume
> most recognizer vendors would gladly post on their websites) into their
> code. For example, they could simply cut and paste the following:
>
>   function SetNativeConfidenceThreshold(conf) {
>     if (conf < 0.7)
>       recognizer.confidenceThreshold = conf / 1.4;
>     else
>       recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
>   }
>
> Now, all the Group 3 developer has to do to set the confidence threshold
> using a native confidence value is:
>
>   SetNativeConfidenceThreshold(value);
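For illustration, the mapping can be spot-checked with a stub recognizer. This is a sketch: 0.7 is the example native default used in this thread, not a real engine value:

```javascript
var recognizer = { confidenceThreshold: 0.0 };  // stub for illustration

// Same piecewise-linear mapping as above: native 0.0 -> 0.0,
// native 0.7 -> 0.5 (the skewed midpoint), native 1.0 -> 1.0.
function SetNativeConfidenceThreshold(conf) {
  if (conf < 0.7)
    recognizer.confidenceThreshold = conf / 1.4;
  else
    recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
}

SetNativeConfidenceThreshold(0.7);
// recognizer.confidenceThreshold is now 0.5
```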
>
> Copying-and-pasting a short function is a trivial amount of effort,
> particularly when compared to all the effort that Group 3 is doing by
> definition to review, tune, process and tweak confidence values.  That is,
> this proposal has trivial impact on the effort required for Group 3
> developers.
>
> Summary: This proposal has virtually no impact on Group 3 developers. In
> contrast, a proposal that skews the results and emma confidence value would
> have a major, negative impact on Group 3 developers.
>
>
>
>
> Now, to compare how all three proposals affect these groups, let's label
> them:
>
> Proposal A: No skewing. Use native recognizer confidence values for both
> input (confidenceThreshold) and output (results and emma)
>
> Proposal B: Skew both input (confidenceThreshold) and output (results and
> emma) in the same manner.
>
> Proposal C: (my proposal) Skew only input (confidenceThreshold).
> Use native recognizer confidence values for output (results and emma).
>
>
> Group 1: All proposals are fine, they provide no advantage or disadvantage.
>
> Group 2: Proposal A is problematic. Proposal B and C both provide a huge
> advantage.
>
> Group 3: Proposal B is problematic. Proposal A and C both provide a huge
> advantage.
>
>
> The intersection of these is Proposal C - provides huge advantages and is
> not problematic for any of the 3 groups of developers.
>
> Glen
>
>
>
>
> On Thu, Jun 14, 2012 at 12:23 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>
>>  Let’s try this another way.  The most obvious/simplest solution is to
>> report results on the same scale as the threshold.  Can we agree on that?
>>
>> Assuming yes, then we should only entertain alternate/complicated
>> suggestions if there is a clear and significant advantage.  Let’s break
>> down this analysis to the three target audiences we’ve used before:
>>
>> 1) Developers who just want the default behavior.
>>
>> 2) Developers who think confidence is a neat feature, but they
>> do not run offline experiments or have any preference for a speech engine.
>> This class will probably either use incremental adjustments to the
>> threshold or pick round numbers like “.5” as arbitrary thresholds.  They
>> are aware that confidence thresholds do not mean the same thing to
>> different engines, but they do know: i) By default they get all results,
>> and ii) If they want to limit the number of results they should use larger
>> thresholds.
>>
>> 3) Power developers that either run offline experiments or have
>> a port of the application on some other modality (e.g. IVR).  These
>> developers leave nothing to chance, and have a custom confidence score for
>> each application state.  If they do support multiple engines, each engine
>> will have a distinct set of thresholds.
>>
>> Analysis:
>>
>> Group 1: No advantage or disadvantage.
>>
>> Group 2: There could be some advantage here, but as of yet I do not see
>> it.  Please make your case.
>>
>> Group 3: Your solution is a disadvantage because they must translate the
>> scores on a per-recognizer basis.  These developers would prefer to use the
>> much simpler solution of a direct import.
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Thursday, June 14, 2012 12:19 PM
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> Perhaps a more intuitive name for that wrapper function would
>> be SetNativeConfidenceThreshold.
>>
>> Also, I realize the logic was wrong as it used a different scale. In this
>> wrapper function, both the input (conf) and the output
>> (recognizer.confidenceThreshold) use a 0.0 - 1.0 scale. For example, the
>> following works for when a recognizer.confidenceThreshold of 0.5 is skewed to
>> a native value of 0.7.
>>
>> function SetNativeConfidenceThreshold(conf) {
>>   if (conf < 0.7)
>>     recognizer.confidenceThreshold = conf / 1.4;
>>   else
>>     recognizer.confidenceThreshold = 0.5 + ((conf - 0.7) / 0.6);
>> }
>>
>> ** **
>>
>> On Thu, Jun 14, 2012 at 11:43 AM, Glen Shires <gshires@google.com> wrote:
>> ****
>>
>> Yes, the confidenceThreshold is on a 0.0 - 1.0 scale.****
>>
>> Yes, the confidence reported in the results are on a 0.0 - 1.0 scale.****
>>
>> Yes, the confidence reported in the EMMA are on a 0.0 - 1.0 scale.****
>>
>> ** **
>>
>> What I am saying is that:****
>>
>> - The recognizer may skew the confidenceThreshold such that 0.5 maps to
>> something reasonable for nomatch.****
>>
>> - The recognizer is not required to skew to the reported results or EMMA
>> results. (The recognizer may skew them, or it may not.)****
>>
>> ** **
>>
>> Simply put: the input must be skewed for 0.5, the output is not required
>> to be skewed in a similar manner.****
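As a sketch of what such engine-side input skewing might look like (purely hypothetical; the 0.7 pivot is just the example value used in this thread, and `exposedToNativeThreshold` is an invented name), the recognizer could map the exposed 0.0 - 1.0 threshold onto its native scale with a piecewise-linear function:

```javascript
// Hypothetical engine-side mapping: the exposed threshold 0.5 is
// pivoted onto a native confidence of 0.7; both scales stay in [0, 1].
function exposedToNativeThreshold(exposed) {
  if (exposed < 0.5)
    return exposed * 1.4;               // [0, 0.5) -> [0, 0.7)
  else
    return 0.7 + (exposed - 0.5) * 0.6; // [0.5, 1] -> [0.7, 1]
}
```

The reported result confidences need not pass through this function at all, which is exactly the asymmetry being proposed.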
>>
>> ** **
>>
>> I've added additional comments inline below...****
>>
>> On Thu, Jun 14, 2012 at 11:04 AM, Young, Milan <Milan.Young@nuance.com>
>> wrote:****
>>
>> I requested “If the threshold is set to X, all alternative.confidence
>> values will be >= X.”  I’d like to address your listed disadvantages:
>>
>> [Glen] It would require remapping all the results
>> [Milan] Every modern recognizer that I know of is capable of reporting
>> results on a [0-1] scale.  That’s really the only relevant requirement to
>> this part of the request.  Which alternate scale are you suggesting?
>> [Glen] Scale remains 0.0 - 1.0.
>>
>> [Glen] It would require re-writing EMMA with the new results.
>> [Milan] EMMA is natively on a [0-1] scale.
>> [Glen] Yes, scale remains 0.0 - 1.0.
>>
>>
>> [Glen] Nearly all developers who do process these results will simply be
>> comparing relative values; skewing the output could mask the differences
>> between alternatives.
>> [Milan] A significant portion of developers and a *majority* of
>> consumers will be using absolute thresholds derived from offline tuning
>> experiments.  Let’s address the “skew” part of your statement as part of
>> the first question/response.
>>
>> [Glen] I believe my proposal is particularly advantageous for these
>> developers and customers. Most likely their offline tuning experiments will
>> be using backend logs, and these backend logs use the recognizer's native
>> 0.0 - 1.0 confidence scale (not a skewed scale). In fact, some customers
>> may have multiple applications/implementations (not just those using a
>> browser with our Speech JavaScript API) and/or may have prior experience
>> with other applications; certainly these tuning experiments or logs would
>> be using the recognizer's native 0.0 - 1.0 confidence scale (not a skewed
>> scale). So the advantage these developers and customers have is that all
>> the tuning data and logs they have ever gathered over years and multiple
>> applications all use, and continue to use, the same native scale. When
>> they write JavaScript code to process results with the Speech JavaScript
>> API, they continue to use the same native scale.
>>
>> The only thing that these developers and customers must do
>> to use these results directly in their JavaScript code is to set
>> confidenceThreshold through a simple JavaScript wrapper function. For
>> example, that wrapper function might look like the following. Recognizer
>> vendors may wish to document a suggested wrapper function like this, so
>> that their developers and customers can tune applications without any
>> additional effort or skewing concerns.
>>
>> function SetConfidenceAbsolute(conf) {
>>   var c = conf - 0.7;
>>   if (c > 0)
>>     recognizer.confidenceAdjustment = c / 0.3;
>>   else
>>     recognizer.confidenceAdjustment = c / 0.7;
>> }
>>
>> Thanks,
>> Glen
>>
>> Thanks
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, June 13, 2012 11:09 PM
>>
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>>
>> Milan,
>>
>> Great, I believe we are almost fully in agreement. Here are the key points
>> that I think should be in the specification. Most of this is your wording.
>>  {The portions in curly braces are things I agree with, but that I don't
>> think need to be specified in the spec.}
>>
>> - Engines must support a default confidenceThreshold of 0.0 on a range of
>> [0.0-1.0].
>>
>> - A 0.0 confidenceThreshold means that engines should aggressively retain
>> all speech candidates limited only by the length of the nbest list.  How
>> this is defined in practice, however, is still vendor specific.  Some
>> engines may throw nomatch, other engines may never throw nomatch with a 0.0
>> confidenceThreshold.
>>    {I would think that some engines might want to still generate nomatch
>> events on select classes of noise input even with a threshold of 0.0.}
>>
>>
>> - When the confidenceThreshold is set to 0.5, nomatch should be thrown
>> when there are no speech candidates found with good/reasonable confidence.
>>    {The developer can have a reasonable expectation that nomatch will be
>> thrown if there is no likely match, and a reasonable expectation
>> that nomatch will not be thrown if there is a likely match.  In other
>> words, if nomatch is thrown, it's likely that the results (if any) are
>> garbage, and if nomatch is not thrown, it's likely that the results are
>> useful.}
>>
>> - Engines are free to meet the above requirements through internal
>> skewing.
>>
>> { Adjustments to this threshold could be made in either absolute terms
>> (eg recognizer.confidence = .72) or relative terms (eg
>> recognizer.confidence += .2). }
>>
>> { The confidence property can be read; the UA keeps track of the value
>> and sends it to the recognizer along with the recognition request.}
>>
>>   1) The reported confidence property on the
>> SpeechRecognitionAlternatives must report on a [0.0-1.0] scale
>>
>>   2) If the UA is generating EMMA because the engine does not supply
>> EMMA, and if the confidence is included in EMMA, then it must be identical
>> to the alternative.confidence property(s).  {If instead, the EMMA is
>> generated by the engine, the UA should pass the EMMA through
>> verbatim...it's the engine's job to ensure that these two match, not the
>> UA's.}
>>
>> confidenceThreshold is monotonically increasing such that larger values
>> will return an equal or fewer number of results than lower values.
>>
>> The only significant way in which I disagree with your description is
>> that I don't believe there is a significant benefit for developers in
>> specifying the following; in fact, I believe this can be detrimental in some
>> cases:
>>
>>   3) If the threshold is set to X, all alternative.confidence values will
>> be >= X.
>>
>> Doing so would have these disadvantages:
>>
>> - It would require remapping all the results
>> - It would require re-writing EMMA with the new results
>> - Nearly all developers who do process these results will simply be
>> comparing relative values; skewing the output could mask the differences
>> between alternatives.
>>
>> Based on all of the above, here's the specific wording I propose for the
>> spec:
>>
>> attribute float confidenceThreshold;
>>
>> - confidenceThreshold attribute - This attribute defines a threshold for
>> rejecting recognition results based on the estimated confidence score that
>> they are correct.  The value of confidenceThreshold ranges from 0.0 (least
>> confidence) to 1.0 (most confidence), with 0.0 as the default value. A 0.0
>> confidenceThreshold will aggressively return many results limited only by
>> the length of the maxNBest parameter.  It is implementation-dependent
>> whether onnomatch is ever fired when the confidenceThreshold is 0.0.
>>
>> confidenceThreshold is monotonically increasing such that larger values
>> will return an equal or fewer number of results than lower values. Also,
>> with larger values of confidenceThreshold, onnomatch is more likely, or
>> just as likely, to be fired than with lower values. Unlike maxNBest, there
>> is no defined mapping between the value of the threshold and how many
>> results will be returned.
>>
>> If the confidenceThreshold is set to 0.5, the recognizer should provide a
>> good balance between firing onnomatch when it is unlikely that any of the
>> return values are correct and firing onresult instead when it is likely
>> that at least one return value is valid. The precise behavior is
>> implementation dependent, but it should provide a reasonable mechanism that
>> enables the developer to accept or reject responses based on whether
>> onnomatch fires.
>>
>> It is implementation dependent how confidenceThreshold is mapped, and its
>> relation (if any) to the confidence values returned in results.
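Under this wording, a Group 2 developer's code might look like the following sketch. The recognizer object is stubbed here, since the real one would be UA-supplied; the handler bodies and the simulated events are illustrative only:

```javascript
// Minimal stub standing in for a UA-supplied recognizer object.
var recognizer = {
  confidenceThreshold: 0.0,  // proposed spec default: accept everything
  onresult: null,
  onnomatch: null
};

// Opt in to reasonable, recognizer-independent rejection behavior.
recognizer.confidenceThreshold = 0.5;

var accepted = [];
recognizer.onresult = function (event) {
  // Likely at least one useful alternative; keep the top one.
  accepted.push(event.results[0]);
};
recognizer.onnomatch = function () {
  // Likely garbage: record a rejection (a real app would reprompt).
  accepted.push(null);
};

// Simulate one confident recognition and one rejection.
recognizer.onresult({ results: [{ transcript: "yes", confidence: 0.92 }] });
recognizer.onnomatch();
```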
>>
>> Glen Shires
>>
>> On Wed, Jun 6, 2012 at 2:54 PM, Young, Milan <Milan.Young@nuance.com>
>> wrote:
>>
>> Inline…
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, June 06, 2012 1:40 PM
>>
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> Milan,
>>
>> It seems we are converging on a solution; however, before I respond to
>> your proposal, I'd like some clarifications:
>> [Milan] I expected you would like this proposal.  It’s my favorite of the
>> bunch so far as well.
>>
>> 1.
>> You wrote: "Engines must support a default confidence of 0.5"
>> and then: "The default threshold should be 0 which means accept all
>> candidates"
>> So I presume you're proposing that your first sentence reads: "Engines
>> must support a default confidence of 0.0 on a range of [0.0-1.0]"
>> If so, does this mean that there is no possibility of an onnomatch
>> event if the developer never sets confidence? (Or that onnomatch only
>> occurs if there are no possible results at all, such as for complete
>> silence?)
>> [Milan] Silence should result in a timeout event of some sort (commonly
>> noinput).  I mentioned that on this thread earlier, but somehow it fell off
>> the dashboard.  I’ll start a new thread.
>>
>> But to answer the main question, the “0.0” threshold means that engines
>> should aggressively retain all speech candidates limited only by the length
>> of the nbest list.  How this is defined in practice, however, is still
>> vendor specific.  I would think that some engines might want to still
>> generate nomatch events on select classes of noise input even with a
>> threshold of “0.0”.
>>
>> The only assertable point we could make here is that if nomatch events
>> are generated on a threshold of “0.0”, then they must not contain an
>> interpretation property.  This is in contrast to regular nomatch events
>> which can contain an interpretation.
>>
>> 2.
>> I agree, defining that all engines must support the same
>> confidence value is very beneficial. It also means that the UA can keep
>> track of the setting (without a round trip to the recognizer), which means
>> that relative adjustments can be made using float values (rather than
>> strings).  So do you agree with the following: in either absolute terms (eg
>> recognizer.confidence = .72) or relative terms (eg recognizer.confidence +=
>> .2)
>> [Milan] Good catch.  Let’s stay with floats and have the UA maintain the
>> value.
>>
>> 3.
>> While I agree that all engines must support the same confidence value (as
>> an input to the recognizer), and that "engines are free to meet the above
>> requirement through internal skewing", I don't agree that it is
>> necessary, or even beneficial, to (as an output from the recognizer)
>> "ensure that all results are reported on the external scale", because (a)
>> nearly all developers who do process these results will simply be comparing
>> relative values, (b) skewing the output could mask the differences between
>> alternatives, (c) it's extra overhead to substitute all the output values.
>> [Milan] Internally, all recognition engines that I know of must skew in
>> order to achieve a 0-1 range.  The native scales are going to be a function
>> of grammar size, type (rule or statistical), and acoustic modeling.  If you
>> ask around with the Google speech team, they are probably going to tell you
>> the same.
>>
>> But let’s put aside that detail for now and focus on the observable
>> (assertable) upshots of my request:
>>   1) The reported confidence property on the
>> SpeechRecognitionAlternatives must report on a 0-1 scale
>>   2) If confidence is included in EMMA, it must be identical to the
>> alternative.confidence property(s).
>>   3) If the threshold is set to X, all alternative.confidence values will
>> be >= X.
>>
>> Can we agree on that?
>>
>> Thanks,
>> Glen Shires
>>
>> On Tue, Jun 5, 2012 at 11:41 AM, Young, Milan <Milan.Young@nuance.com>
>> wrote:
>>
>> One minor adjustment to the proposal below.  The default threshold should
>> be 0, which means accept all candidates.  This will provide a better out of
>> the box experience across the largest range of grammars.  Power users who
>> are concerned with performance/latency can adjust as needed.
>>
>> Thanks
>>
>> *From:* Young, Milan
>> *Sent:* Tuesday, June 05, 2012 11:00 AM
>> *To:* 'Glen Shires'
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* RE: Confidence property
>>
>> Glen,
>>
>> I suggest the needs of all groups would be best served by the following
>> new hybrid proposal:
>>
>> ·       Engines must support a default confidence of 0.5 on a range of
>> [0.0-1.0].
>> ·       Engines are free to meet the above requirement through
>> internal skewing, but they must ensure that all results are reported on
>> the external scale.  For example, if the developer sets a threshold of 0.8,
>> then no result should be returned with a score of less than 0.8.
>> ·       Adjustments to this threshold could be made in either absolute
>> terms (eg recognizer.confidence = .72) or relative terms (eg
>> recognizer.confidence = “+.2”).  The UA enforces syntax.
>> ·       Relative adjustments that index out of bounds are silently
>> truncated.
>> ·       The confidence property can be read, but applications that care
>> about latency could avoid the hit by keeping track of the value themselves
>> with a local shadow.
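The last bullet can be sketched as follows. This is a hypothetical wrapper: `recognizer.confidence` is the property being proposed, and the shadow variable simply avoids a read that might round-trip to a remote engine:

```javascript
var recognizer = { confidence: 0.5 };  // stub; reads may be remote/slow

// Keep a local shadow so latency-sensitive code never reads the property.
var shadowConfidence = 0.5;

function adjustConfidence(delta) {
  // Silently truncate onto the [0.0, 1.0] range, per the proposal.
  shadowConfidence = Math.min(1.0, Math.max(0.0, shadowConfidence + delta));
  recognizer.confidence = shadowConfidence;  // write-only from our side
}

adjustConfidence(0.2);  // shadow tracks the new value without reading back
```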
>>
>> Thoughts?
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Monday, June 04, 2012 7:23 PM
>>
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> Milan,
>>
>> I think we agree that different web developers have different needs:
>>
>> 1: Some web developers don't want to adjust confidence at all (they just
>> use the default value).
>>
>> 2: Some web developers want to adjust confidence in a
>> recognizer-independent manner (realizing performance will vary between
>> recognizers).
>>
>> 3: Some web developers want to fine-tune confidence in a
>> recognizer-specific manner (optimizing using engine logs and tuning tools).
>>  If none of these specific recognizers are available, their app will either
>> not function, or function but perform no confidence adjustments.
>>
>> 2.5: Some developers are a mix of 2 and 3: they want to fine-tune
>> confidence in a recognizer-specific manner for certain recognizers, and
>> for all other recognizers (such as when the recognizers of choice are not
>> available) they want to adjust confidence in a recognizer-independent
>> manner.
>>
>> I believe it's our job, in defining and in implementing the spec, to make
>> things work as well as possible for all 4 types of developers.  I believe
>> the confidenceThresholdAdjustment proposal [1] accomplishes this:
>>
>> 1: This first group doesn't use confidence.
>>
>> 2: For this second group, it enables adjusting confidence in the most
>> recognizer-independent manner that we know of.
>>
>> 3: For this third group, it allows precise, recognizer-specific setting
>> of confidence (so absolute confidence values obtained from engine logs and
>> tuning tools can be used directly) with just a trivial bit more effort.
>>
>> 2.5: This group gains all the benefits of both 2 and 3.
>>
>> Our various proposals vary in two ways:
>>
>> - Whether the confidence is specified as an absolute value or a relative
>> value.
>> - Whether there is any mapping to inflate/deflate ranges.
>>
>> Specifying the attribute as an absolute value and making it readable
>> entails major complications:
>>
>> - If a new recognizer is selected, its default threshold needs to be
>> retrieved, an operation that may have latency. If the developer then reads
>> the confidenceThreshold attribute, the read can't stall until the threshold
>> is read (because it is illegal for JavaScript to stall). Fixing this would
>> require defining an asynchronous event to indicate that the
>> confidenceThreshold value is now available to be read. All very messy for
>> both the web developer and the UA implementer.
>>
>> - The semantics are unclear and recognizer-dependent. If the developer
>> set the confidenceThreshold = 0.4, then selects a new recognizer (or
>> perhaps a new task or grammar), does the confidenceThreshold change? When,
>> and if so, how does the developer know to what value - does it get reset to
>> the recognizer's default? If not, what does 0.4 now mean in this new
>> context?
>>
>> In contrast, using a relative value has these advantages:
>>
>> - It avoids all latency and asynchrony issues. The UA does not
>> have to inquire the recognizer's default threshold value from the
>> [potentially remote] recognizer before the UA returns the value when
>> this JavaScript attribute is read. Instead, the UA maintains the value of
>> this attribute, and simply sends it to the recognizer along with the
>> recognition request.
>>
>> - It avoids all issues of threshold values changing due to changes in the
>> selected recognizer or task or grammar.
>>
>> Most importantly, from the point of view of web developers (group 2 and
>> group 3), the advantages of using a relative value include:
>>
>> - Semantics are clear and simple.
>>
>> - The attribute is directly readable at any time, with no latency.
>>
>> - Changing the selected recognizer or task or grammar has no unexpected
>> effect: the relative value does not change.
>>
>> In addition, web developers in group 2 get the following benefits:
>>
>> - Developers can easily adjust the threshold for certain tasks. For
>> example, to confirm a transaction, the developer may increase the threshold
>> to be more stringent than the recognizer's default, e.g.
>> confidenceThresholdAdjustment = 0.3
>>
>> - Developers can adjust the threshold based on prior usage. For example,
>> if not getting enough (or any) results, they may bump down the confidence to
>> be more lenient, e.g.: confidenceThresholdAdjustment -= 0.1
>>
>> - (As Milan wrote, "I suggest the recognizer internally truncate on the
>> range" to saturate at the min/max values.)
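That saturation can be sketched as follows (hypothetical helper name; confidenceThresholdAdjustment is the relative attribute from proposal [1], defined here on a [-1.0, 1.0] range):

```javascript
var recognizer = { confidenceThresholdAdjustment: 0.0 };  // stub

// Apply a relative bump, saturating at the ends of the defined range.
function bumpAdjustment(delta) {
  var next = recognizer.confidenceThresholdAdjustment + delta;
  recognizer.confidenceThresholdAdjustment =
      Math.min(1.0, Math.max(-1.0, next));
}

bumpAdjustment(0.3);   // more stringent, e.g. confirming a transaction
bumpAdjustment(-0.1);  // loosen slightly after too many rejections
```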
>>
>> The only downside of this is that developers in group 3 (who are by
>> definition writing recognizer-specific code) must maintain an offset for
>> each recognizer they are specifically optimizing for.  For example, if the
>> default confidence value is 0.7 for the recognizer they're writing for,
>> they simply write:
>>
>>     recognizer.confidenceAdjustment = confidence - 0.7;
>>
>> or alternatively maintain a global that changes when they switch
>> recognizers:
>>
>>     recognizer.confidenceAdjustment = confidence -
>> defaultConfidenceOfCurrentRecognizer;
>>
>> or alternatively, create a JavaScript function:
>>
>>     function SetConfidenceAbsolute(conf) {
>>       recognizer.confidenceAdjustment = conf - 0.7;
>>     }
>>
>>
>> The point being, there are a lot of very simple ways to handle this, all
>> very trivial, particularly when compared to the extensive effort they're
>> already investing to fine-tune confidence values for each recognizer using
>> engine logs or tuning tools.  Further, the group 2.5 developers get the
>> advantages of all of the above.
>>
>> For all these reasons, I believe that defining this as a relative value
>> is clearly preferable over an absolute value.
>>
>> The remaining question is whether there should also be some mapping, or
>> just a purely linear scale.  I believe a trivial mapping is preferable
>> because it is very beneficial for group 2 and group 2.5 developers (because
>> it provides a greater level of recognizer-independent adjustment), and adds
>> trivial overhead for group 3 developers.  For example, here's one method
>> that allows group 3 developers to directly use absolute confidence values
>> from engine logs or tuning tools:
>>
>> function SetConfidenceAbsolute(conf) {
>>   var c = conf - 0.7;
>>   if (c > 0)
>>     recognizer.confidenceAdjustment = c / 0.3;
>>   else
>>     recognizer.confidenceAdjustment = c / 0.7;
>> }
>>
>> Here I'm assuming that 0.7 is the current recognizer's default confidence
>> value. This function linearly maps the values above 0.7 to between 0.0 and
>> 1.0 and the values below 0.7 to between -1.0 and 0.0. Conversely, the
>> un-mapping that the engine would have to do would be equally trivial:
>>
>> function MapConfidence(c) {
>>   if (c > 0)
>>     return c * 0.3 + 0.7;
>>   else
>>     return c * 0.7 + 0.7;
>> }
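As a sanity check (a sketch; 0.7 is again the assumed default), mapping an absolute confidence through SetConfidenceAbsolute and then un-mapping the resulting adjustment with MapConfidence gives back the original value:

```javascript
var recognizer = { confidenceAdjustment: 0.0 };  // stub for illustration

// Absolute confidence -> relative adjustment (developer side).
function SetConfidenceAbsolute(conf) {
  var c = conf - 0.7;
  if (c > 0)
    recognizer.confidenceAdjustment = c / 0.3;
  else
    recognizer.confidenceAdjustment = c / 0.7;
}

// Relative adjustment -> absolute confidence (engine side).
function MapConfidence(c) {
  if (c > 0)
    return c * 0.3 + 0.7;
  else
    return c * 0.7 + 0.7;
}

// Round trip: absolute -> relative adjustment -> absolute.
SetConfidenceAbsolute(0.9);
var roundTripped = MapConfidence(recognizer.confidenceAdjustment);
// roundTripped is 0.9 (up to floating-point error)
```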
>>
>> /Glen Shires
>>
>> [1]
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0000.html
>>
>> On Mon, Jun 4, 2012 at 12:09 PM, Young, Milan <Milan.Young@nuance.com>
>> wrote:
>>
>> Comments inline…
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Friday, June 01, 2012 6:46 PM
>>
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> Milan,
>>
>> Can you please clarify your proposal:
>>
>> - Does it pass a string or a float to the recognizer?
>> [Milan] String.
>>
>> - Can the developer inquire (read) the current confidence value? Is the
>> value returned relative (with plus/minus prefix) or absolute? A string or a
>> float?
>> [Milan] Yes, the property could be read and it would return the absolute
>> value.  We would just document that if the recognizer is remote, this would
>> trigger a trip to the server.  Developers would choose whether the cost is
>> worth the reward.
>>
>> - If the developer sets recognizer.confidence = “+.1”, then later sets
>> recognizer.confidence = “+.2”, would the result be summed "+.3" or
>> overwritten "+.2" ?
>> [Milan] I figured they would be cumulative, but could be swayed.
>>
>>  ****
>>
>> The main question is what to do with an out of bounds event (eg default
>> value is 0.5 and developer sets +0.6).  I suggest the recognizer internally
>> truncate on the [0.0-1.0] range (essentially a scaling operation similar to
>> your proposal).  The important thing is that higher thresholds must always
>> generate >= number of results than lower thresholds.****
>>
>>  ****
>>
>> - Is there a defined range for the increments? (Example, is "+0.5" valid?
>> is "+1.0" valid? is "+10.0" valid?)****
>>
>> [Milan] The UA would enforce syntax and limit the range to [-1.0,1.0].***
>> *
>>
>> - It seems that what you are defining is an offset from a
>> recognizer-dependent default value, which seems very similar to
>> the confidenceThresholdAdjustment I propose.  What are the advantages of
>> your proposal over the syntax I proposed?
>>
>> [Milan] Yes, the functionality from a developer perspective is
>> essentially the same.  The advantages of my proposal:
>>
>> - It minimizes work on the engine side by avoiding the implementation
>> of a scaling system.
>>
>> - Confidence scores in the result have a direct correspondence to the
>> values pushed through the UA.
>>
>> - Tuning tools can continue to use the actual threshold instead of
>> having to special-case applications developed for HTML Speech.
>>
>> I disagree with your contention that the confidenceThresholdAdjustment
>> I proposed "is just as recognizer-dependent as the much simpler mechanism
>> of just setting the value".  Because the range is defined, a
>> confidenceThresholdAdjustment of 0.3 indicates, in a
>> recognizer-independent manner, that the confidence is substantially
>> greater than the recognizer's default but still far from the maximum
>> possible setting.  In contrast, the meaning of recognizer.confidence =
>> “+.3” may vary greatly; for example, the recognizer's default may be 0.2
>> (meaning the new setting is still nowhere near maximum confidence) or it
>> may be 0.7 (meaning the new setting is the maximum confidence).
>>
>> [Milan] All true, but at the end of the day calling it an “adjustment”
>> instead of a “threshold” doesn’t add any testable assertions.
>>
>> I agree that confidenceThresholdAdjustment is not perfect, but it's the
>> most recognizer-independent solution I have seen to date, and I believe
>> that the majority of web developers will be able to use it to accomplish
>> the majority of tasks without resorting to any recognizer-dependent
>> programming.
>>
>> [Milan] I think this is the fundamental disconnect between us.  A
>> developer who sets an adjustment of 0.3 on recognizer A must not assume
>> that behavior will be the same on recognizer B.  If they want to support
>> multiple engines, they must test on each engine and tune accordingly.
>> Otherwise they risk undefined/incorrect behavior.
>>
>> I also agree that for the subset of developers who want to fine-tune
>> their application for specific recognizers by using engine logs and
>> training tools, this introduces an abstraction. However, for this subset
>> of developers, either of two simple solutions can be used: (a) the
>> recognition vendor could provide the engine-specific mapping so that the
>> developer can easily convert the values, or (b) the vendor could provide
>> a recognizer-specific custom setting that overrides
>> confidenceThresholdAdjustment.
>>
>> [Milan] These work-arounds would be worth the cost if we were defining
>> a truly recognizer-independent solution.  But since we are not, I view
>> the proposal as a pointless exercise in semantic juggling.
>>
>> I believe it's crucial that we define all attributes in the spec in a
>> recognizer-independent manner, or at least recognizer-independent enough
>> that most developers don't have to resort to recognizer-dependent
>> programming.  If there are attributes that cannot be defined in a
>> recognizer-independent manner, then I believe such inherently
>> recognizer-specific settings should be just that:
>> recognizer-specific custom settings.
>>
>> [Milan] I could point to hundreds of examples in W3C and IETF
>> specifications where expected behavior is not 100% clear, and I assure
>> you these ambiguities were not the product of careless editing.  There is
>> good reason and precedent behind the industry definition of confidence.
>> Please don’t throw the baby out with the bathwater.
>>
>> Thanks,
>>
>> Glen Shires
>>
>> [Milan] Thank you too for keeping this discussion active.
>>
>> On Fri, Jun 1, 2012 at 5:20 PM, Young, Milan <Milan.Young@nuance.com>
>> wrote:
>>
>> Glen, it’s clear that you put a lot of thought into trying to come up
>> with a compromise.  I appreciate the effort.
>>
>> My contention, however, is that this new mechanism for manipulating
>> confidence is just as recognizer-dependent as the much simpler mechanism
>> of just setting the value.  All you have done is precisely define a new
>> term using existing terminology that has no precise definition.  An
>> “adjustment” of 0.3 doesn’t have any more of a grounded or
>> recognizer-independent meaning than a “threshold” of 0.3.
>>
>> Furthermore, you’ve introduced yet another parameter to jiggle, and
>> this will cause all sorts of headaches during the tuning phase.  That’s
>> because the engine, logged results, and training tools will all be based
>> on absolute confidence thresholds, and the user will need to figure out
>> how to map those absolute thresholds onto the relative scale.  And they
>> still need to perform this exercise independently for each engine.
>>
>> One of the things I do like about your proposal is that it circumvents
>> the need to read the confidence threshold before setting it in
>> incremental mode.  But this could just as easily be accomplished with
>> syntax such as recognizer.confidence = “+.1”.  If I added such a
>> plus/minus prefix to my previous proposal, would you be satisfied?
>>
>> Thanks
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Friday, June 01, 2012 9:01 AM
>> *To:* Young, Milan
>> *Cc:* Satish S; public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> I propose the following definition:
>>
>> attribute float confidenceThresholdAdjustment;
>>
>> - confidenceThresholdAdjustment attribute - This attribute defines a
>> relative threshold for rejecting recognition results based on the
>> estimated confidence score that they are correct.  The value of
>> confidenceThresholdAdjustment ranges from -1.0 (least confidence) to 1.0
>> (most confidence), with 0.0 mapping to the default confidence threshold
>> as defined by the recognizer. confidenceThresholdAdjustment is
>> monotonically increasing such that larger values will return an equal or
>> fewer number of results than lower values.  (Note that the confidence
>> scores reported within the SpeechRecognitionResult and within the EMMA
>> results use a 0.0 - 1.0 scale, and the correspondence between these
>> scores and confidenceThresholdAdjustment may vary across UAs, recognition
>> engines, and even from task to task.) Unlike maxNBest, there is no
>> defined mapping between the value of the threshold and how many results
>> will be returned.
>>
>> This definition has these advantages:
>>
>> For web developers, it provides flexibility and simplicity in a
>> recognizer-independent manner. It covers the vast majority of the ways
>> in which developers use confidence values:
>>
>> - Developers can easily adjust the threshold for certain tasks. For
>> example, to confirm a transaction, the developer may increase the
>> threshold to be more stringent than the recognizer's default, e.g.
>> confidenceThresholdAdjustment = 0.3
>>
>> - Developers can adjust the threshold based on prior usage. For
>> example, if not getting enough (or any) results, they may bump down the
>> adjustment to be more lenient, e.g. confidenceThresholdAdjustment -= 0.1
>> (Developers should ensure they don't underflow/overflow the -1.0 to 1.0
>> scale.)
>>
>> - Developers can perform their own processing of the results by
>> comparing confidence scores in the normal manner.  (The confidence
>> scores in the results use the recognizer's native scale, so they are not
>> mapped or skewed, and so relative comparisons are not affected by
>> "inflated" or "deflated" ranges.)
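The underflow/overflow caution in the second bullet above can be handled with a small helper. The recognizer object and attribute name follow the proposal in this message; the helper name is illustrative:

```javascript
// Sketch of a developer-side helper for nudging the proposed
// confidenceThresholdAdjustment attribute without drifting
// outside its defined -1.0 to 1.0 range; per the note above,
// clamping is the developer's responsibility.
function nudgeAdjustment(recognizer, delta) {
  var next = recognizer.confidenceThresholdAdjustment + delta;
  recognizer.confidenceThresholdAdjustment =
      Math.min(1.0, Math.max(-1.0, next));
  return recognizer.confidenceThresholdAdjustment;
}
```

This keeps the "bump it down if results are too sparse" pattern recognizer-independent: the code never needs to know the engine's native threshold scale.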
>>
>> It provides clear semantics that are recognizer-independent:
>>
>> - It avoids all issues of latency and asynchrony. The UA does not have
>> to inquire about the recognizer's default threshold value from the
>> [potentially remote] recognizer before the UA returns the value when
>> this JavaScript attribute is read. Instead, the UA maintains the value
>> of this attribute, and simply sends it to the recognizer along with the
>> recognition request.
>>
>> - It avoids all issues of threshold values changing due to changes in
>> the selected recognizer, task, or grammar.
>>
>> - It allows recognition engines the freedom to define any mapping that
>> is appropriate, and to use any internal default threshold value they
>> choose (which may vary from engine to engine and/or from task to task).
>>
>> The one drawback is that the confidenceThresholdAdjustment mapping may
>> "require significant skewing of the range" and "squeeze" and "inflate".
>> However, I see this as a minimal disadvantage, particularly when weighed
>> against all the advantages above.
>>
>> Earlier in this thread we looked at four different options [1]. This
>> solution is a variation of option 1 in that list. All the other options
>> in that list have significant drawbacks:
>>
>> Option 2) Let speech recognizers define the default. This has these
>> disadvantages:
>>
>> - If a new recognizer is selected, its default threshold needs to be
>> retrieved, an operation that may have latency. If the developer then
>> reads the confidenceThreshold attribute, the read can't stall until the
>> threshold is retrieved. Fixing this requires defining an asynchronous
>> event to indicate that the confidenceThreshold value is now available to
>> be read. All very messy for both the web developer and the UA
>> implementer.
>>
>> - The semantics are unclear and recognizer-dependent. If the developer
>> sets confidenceThreshold = 0.4, then selects a new recognizer (or
>> perhaps a new task or grammar), does the confidenceThreshold change? If
>> so, when, and how does the developer know to what value - does it get
>> reset to the recognizer's default? If not, what does 0.4 now mean in
>> this new context?
>>
>> Option 3) Make it write-only (not readable). This has these
>> disadvantages:
>>
>> - Developers must write recognizer-dependent code. Since they can't
>> read the value, they can't increment/decrement it, so they must blindly
>> set it. They must know what confidenceThreshold = 0.4 means for the
>> current recognizer.
>>
>> Thus I propose the solution above, with its many advantages and only a
>> minor drawback.
>>
>> [1]
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0051.html
>>
>> On Wed, May 23, 2012 at 3:56 PM, Young, Milan <Milan.Young@nuance.com>
>> wrote:
>>
>> >> The benefit of minimizing deaf periods is therefore again recognizer
>> >> specific
>>
>> Most (all?) of the recognition engines that can be embedded within an
>> HTML browser currently operate over a network.  In fact, if you study
>> the use cases, you’d find that the majority of those transactions are
>> over a 3G network, which is notoriously latent.
>>
>> It’s possible that this may begin to change over the next few years,
>> but it’s surely not going to happen within the lifetime of our 1.0 spec
>> (at least I hope we can come to agreement before then :)).  Thus the
>> problem can hardly be called engine-specific.
>>
>> Yes, the semantics are unclear, but that wouldn’t be any different from
>> a quasi-standard that would undoubtedly emerge in the absence of a
>> specification.
>>
>> *From:* Satish S [mailto:satish@google.com]
>> *Sent:* Wednesday, May 23, 2012 6:28 AM
>> *To:* Young, Milan
>> *Cc:* public-speech-api@w3.org
>> *Subject:* Re: Confidence property
>>
>> Hi Milan,
>>
>> Summarizing previous discussion, we have:
>>
>>   Pros: 1) Aids efficient application design, 2) minimizes deaf
>> periods, 3) avoids a proliferation of semi-standard custom parameters.
>>
>>   Cons: 1) Semantics of the value are not precisely defined, and 2)
>> novice users may not understand how confidence differs from maxNBest.
>>
>> My responses to the cons are: 1) precedent from the speech industry,
>> and 2) thousands of VoiceXML developers do understand the difference and
>> will balk at an API that does not accommodate their needs.
>>
>> This was well debated in the earlier thread, and it is clear that
>> confidence threshold semantics are tied to the recognizer (not
>> portable). The benefit of minimizing deaf periods is therefore again
>> recognizer specific and not portable. This is a well-suited use case for
>> custom parameters, and I'd suggest we start with that.
>>
>> Thousands of VoiceXML developers do understand the difference and will
>> balk at an API that does not accommodate their needs.
>>
>> I hope we aren't trying to replicate VoiceXML in the browser. If it is
>> indeed a must-have feature for web developers, we'll be receiving
>> requests for it from them very soon, so it would be easy to discuss and
>> add it in the future.
Received on Thursday, 14 June 2012 21:29:17 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Thursday, 14 June 2012 21:29:17 GMT