Re: Confidence property from Glen Shires on 2012-06-05 (public-speech-api@w3.org from June 2012)

From: Glen Shires <gshires@google.com>
Date: Mon, 4 Jun 2012 19:23:17 -0700
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <CAEE5bciygq+in3E9EaUKNTyn74EMXCnVM9yafKqJ0+e8pjCiWA@mail.gmail.com>
Milan,
I think we agree that different web developers have different needs:

1: Some web developers don't want to adjust confidence at all (they just
use the default value).

2: Some web developers want to adjust confidence in a
recognizer-independent manner (realizing performance will vary between
recognizers).

3: Some web developers want to fine-tune confidence in a
recognizer-specific manner (optimizing using engine logs and tuning tools).
If none of these specific recognizers are available, their app will either
not function, or function but perform no confidence adjustments.

2.5: Some developers are a mix of 2 and 3: they want to fine-tune
confidence in a recognizer-specific manner for certain recognizers, and
for all other recognizers (such as when the recognizers of choice are not
available) they want to adjust confidence in a recognizer-independent
manner.


I believe it's our job, in defining and in implementing the spec, to make
things work as well as possible for all 4 types of developers.  I believe
the confidenceThresholdAdjustment proposal [1] accomplishes this:

1: This first group doesn't adjust confidence, so it is unaffected.

2: For this second group, it enables adjusting confidence in the most
recognizer-independent manner that we know of.

3: For this third group, it allows precise, recognizer-specific setting of
confidence (so absolute confidence values obtained from engine logs and
tuning tools can be used directly) with just a trivial bit more effort.

2.5: This group gains all the benefits of both 2 and 3.


Our various proposals differ in two ways:

- Whether the confidence is specified as an absolute value or a relative
value.
- Whether there is any mapping to inflate/deflate ranges.

Specifying the attribute as an absolute value and making it readable
entails major complications:

- If a new recognizer is selected, its default threshold needs to be
retrieved, an operation that may have latency. If the developer then reads
the confidenceThreshold attribute, the read can't stall until the threshold
is read (because it is illegal for JavaScript to stall). Fixing this would
require defining an asynchronous event to indicate that the
confidenceThreshold value is now available to be read. All very messy for
both the web developer and the UA implementer.

- The semantics are unclear and recognizer-dependent. If the developer sets
confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a
new task or grammar), does the confidenceThreshold change? If so, when, and
how does the developer know the new value: does it get reset to the
recognizer's default? If not, what does 0.4 now mean in this new context?

In contrast, using a relative value has these advantages:

- It avoids all latency and asynchrony issues. The UA does not
have to query the [potentially remote] recognizer for its default
threshold value before the UA returns the value when
this JavaScript attribute is read. Instead, the UA maintains the value of
this attribute, and simply sends it to the recognizer along with the
recognition request.


- It avoids all issues of threshold values changing due to changes in the
selected recognizer or task or grammar.
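
To illustrate the first point, here is a minimal sketch of a UA-side model
in which the relative value is held locally and only travels with the
recognition request. SketchRecognizer, start and the request shape are
hypothetical names for illustration, not part of any proposal:

```javascript
// Hypothetical UA-side model; none of these names come from the spec.
function SketchRecognizer(sendRequest) {
  // Held locally by the UA, so reads never wait on a remote recognizer.
  this.confidenceThresholdAdjustment = 0.0;
  this.sendRequest = sendRequest;
}

SketchRecognizer.prototype.start = function () {
  // The relative value simply rides along with the request; the
  // recognizer applies it to whatever its own default happens to be.
  this.sendRequest({
    confidenceThresholdAdjustment: this.confidenceThresholdAdjustment
  });
};

// Usage: a stand-in transport that just records what was sent.
var sent = [];
var rec = new SketchRecognizer(function (req) { sent.push(req); });
rec.confidenceThresholdAdjustment = 0.3;          // set synchronously
var readBack = rec.confidenceThresholdAdjustment; // read synchronously
rec.start();
```

Nothing in the read path depends on which recognizer is selected, which is
the property being argued for.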

Most importantly, from the point of view of web developers (group 2 and
group 3), the advantages of using a relative value include:

- Semantics are clear and simple.

- The attribute is directly readable at any time, with no latency.

- Changing the selected recognizer or task or grammar has no unexpected
effect: the relative value does not change.

In addition, web developers in group 2 get the following benefits:

- Developers can easily adjust the threshold for certain tasks. For
example, to confirm a transaction, the developer may increase the threshold
to be more stringent than the recognizer's default, e.g.
confidenceThresholdAdjustment = 0.3

- Developers can adjust the threshold based on prior usage. For example, if
not getting enough (or any) results, he may bump down the confidence to be
more lenient, e.g. confidenceThresholdAdjustment -= 0.1

- Out-of-range values would saturate at the min/max values (as Milan wrote,
"I suggest the recognizer internally truncate on the range").
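
A minimal sketch of the incremental adjustment with saturation follows. The
helper name bumpAdjustment and the plain recognizer object are hypothetical
stand-ins; the clamp could equally live in the UA or in the recognizer:

```javascript
// Hypothetical helper: nudge a relative threshold and saturate at the
// ends of the -1.0 .. 1.0 range discussed in this thread.
function bumpAdjustment(current, delta) {
  var next = current + delta;
  return Math.max(-1.0, Math.min(1.0, next));
}

// Stand-in for a recognizer object; not a real API.
var recognizer = { confidenceThresholdAdjustment: 0.0 };

// Too few results came back: be slightly more lenient.
recognizer.confidenceThresholdAdjustment =
    bumpAdjustment(recognizer.confidenceThresholdAdjustment, -0.1);
```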

The only downside of this is that developers in group 3 (who are by
definition writing recognizer-specific code) must maintain an offset for
each recognizer they are specifically optimizing for.  For example, if the
default confidence value is 0.7 for the recognizer they're writing for,
they simply write:

    recognizer.confidenceAdjustment = confidence - 0.7;

or alternatively maintain a global that changes when they switch
recognizers:

    recognizer.confidenceAdjustment =
        confidence - defaultConfidenceOfCurrentRecognizer;

or alternatively, create a JavaScript function:

    function SetConfidenceAbsolute(conf) {
      recognizer.confidenceAdjustment = conf - 0.7;
    }

The point being, there are many very simple ways to handle this, all
trivial, particularly when compared to the extensive effort they're already
investing to fine-tune confidence values for each recognizer using engine
logs or tuning tools.  Further, the group 2.5 developers get the advantages
of all of the above.
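
For the group 2.5 case, a sketch along these lines might do. It uses the
purely linear offset shown above; the engine names, the values in
knownDefaults, and the idea that the page knows which engine is in use are
all assumptions made for illustration:

```javascript
// Hypothetical table of default thresholds, filled in by the developer
// from engine logs or tuning tools. Names and values are made up.
var knownDefaults = {
  engineA: 0.7,
  engineB: 0.5
};

function has(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns the relative value to assign for the given engine.
// tunedAbsolute: per-engine absolute thresholds chosen by the developer.
// genericAdjustment: recognizer-independent fallback (group 2 behavior).
function adjustmentFor(engineName, tunedAbsolute, genericAdjustment) {
  if (has(knownDefaults, engineName) && has(tunedAbsolute, engineName)) {
    return tunedAbsolute[engineName] - knownDefaults[engineName];
  }
  return genericAdjustment;
}
```

For a tuned engine the result is the recognizer-specific offset; for any
other engine the caller falls back to the recognizer-independent value.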


For all these reasons, I believe that defining this as a relative value is
clearly preferable over an absolute value.


The remaining question is whether there should also be some mapping, or
just a purely linear scale.  I believe a trivial mapping is preferable
because it is very beneficial for group 2 and group 2.5 developers (because
it provides a greater level of recognizer-independent adjustment), and adds
trivial overhead for group 3 developers.  For example, here's one method
that allows group 3 developers to directly use absolute confidence values
from engine logs or tuning tools:

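    // Assumes 0.7 is the current recognizer's default confidence value.
    function SetConfidenceAbsolute(conf) {
      var c = conf - 0.7;
      if (c > 0)
        recognizer.confidenceAdjustment = c / 0.3;  // (0.7, 1.0] -> (0.0, 1.0]
      else
        recognizer.confidenceAdjustment = c / 0.7;  // [0.0, 0.7] -> [-1.0, 0.0]
    }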

Here I'm assuming that 0.7 is the current recognizer's default confidence
value. This function linearly maps the values above 0.7 to between 0.0 and
1.0 and the values below 0.7 to between -1.0 and 0.0. Conversely, the
un-mapping that the engine would have to do would be equally trivial:

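    // Inverse of the mapping above: relative value back to absolute.
    function MapConfidence(c) {
      if (c > 0)
        return c * 0.3 + 0.7;  // (0.0, 1.0] -> (0.7, 1.0]
      else
        return c * 0.7 + 0.7;  // [-1.0, 0.0] -> [0.0, 0.7]
    }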
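
As a sanity check, here is a sketch of the same piecewise-linear mapping
with the default threshold passed in rather than hard-coded, plus a round
trip. The names toAdjustment and toAbsolute are hypothetical, and the sketch
assumes the default lies strictly between 0 and 1:

```javascript
// Absolute confidence (0.0 .. 1.0) -> relative adjustment (-1.0 .. 1.0).
// def is the recognizer's default threshold; assumes 0 < def < 1.
function toAdjustment(conf, def) {
  var c = conf - def;
  return c > 0 ? c / (1.0 - def) : c / def;
}

// Relative adjustment (-1.0 .. 1.0) -> absolute confidence (0.0 .. 1.0).
function toAbsolute(adj, def) {
  return adj > 0 ? adj * (1.0 - def) + def : adj * def + def;
}

// Round trip a few absolute values through the relative scale.
var samples = [0.0, 0.35, 0.7, 0.85, 1.0];
var maxError = 0;
for (var i = 0; i < samples.length; i++) {
  var back = toAbsolute(toAdjustment(samples[i], 0.7), 0.7);
  maxError = Math.max(maxError, Math.abs(back - samples[i]));
}
console.log("largest round-trip error:", maxError);
```

The default itself maps to 0.0, and the two ends of the absolute range map
to -1.0 and 1.0, whatever the default is.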

/Glen Shires

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0000.html



On Mon, Jun 4, 2012 at 12:09 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> Comments inline…
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Friday, June 01, 2012 6:46 PM
> *To:* Young, Milan
> *Cc:* Satish S; public-speech-api@w3.org
> *Subject:* Re: Confidence property
>
> Milan,
>
> Can you please clarify your proposal:
>
> - Does it pass a string or a float to the recognizer?
>
> [Milan] String.
>
> - Can the developer inquire (read) the current confidence value? Is the
> value returned relative (with plus/minus prefix) or absolute? A string or a
> float?
>
> [Milan] Yes, the property could be read and it would return the absolute
> value.  We would just document that if the recognizer is remote, this would
> trigger a trip to the server.  Developers would choose whether the cost is
> worth the reward.
>
> - If the developer sets recognizer.confidence = "+.1", then later sets
> recognizer.confidence = "+.2", would the result be summed "+.3" or
> overwritten "+.2"?
>
> [Milan] I figured they would be cumulative, but could be swayed.
>
> The main question is what to do with an out-of-bounds event (e.g. default
> value is 0.5 and developer sets +0.6).  I suggest the recognizer internally
> truncate on the [0.0-1.0] range (essentially a scaling operation similar to
> your proposal).  The important thing is that lower thresholds must always
> generate >= the number of results that higher thresholds do.
>
> - Is there a defined range for the increments? (Example, is "+0.5" valid?
> is "+1.0" valid? is "+10.0" valid?)
>
> [Milan] The UA would enforce syntax and limit the range to [-1.0,1.0].
>
> - It seems that what you are defining is an offset from a
> recognizer-dependent default value, which seems very similar to
> the confidenceThresholdAdjustment I propose.  What are the advantages of
> your proposal over the syntax I proposed?
>
> [Milan] Yes, the functionality from a developer perspective is essentially
> the same.  The advantages of my proposal:
>
>   - Minimize work on the engine side with the implementation of a
>     scaling system.
>
>   - Confidence scores in the result have a direct correspondence
>     to the values pushed through the UA.
>
>   - Tuning tools can continue to use the actual threshold instead
>     of having to special case applications developed for HTML Speech.
>
> I disagree with your contention that the confidenceThresholdAdjustment I
> proposed "is just as recognizer-dependent as the much simpler mechanism of
> just setting the value".  Because the range is defined, a
> confidenceThresholdAdjustment = 0.3 indicates, in a recognizer-independent
> manner, that the confidence is substantially greater than the recognizer's
> default, but still far from the maximum possible setting.  In contrast, the
> meaning of recognizer.confidence = "+.3" may vary greatly, for example, the
> recognizer's default may be 0.2 (meaning the new setting is still nowhere
> near maximum confidence) or it may be 0.7 (meaning the new setting is the
> maximum confidence).
>
> [Milan] All true, but at the end of the day calling it an “adjustment”
> instead of a “threshold” doesn’t add any testable assertions.
>
> I agree that confidenceThresholdAdjustment is not perfect, but it's the
> most recognizer-independent solution I have seen to date, and I believe
> that the majority of web developers will be able to use it to accomplish
> the majority of tasks without resorting to any recognizer-dependent
> programming.
>
> [Milan] I think this is the fundamental disconnect between us.  A
> developer who sets an adjustment of 0.3 on recognizer A must not assume
> that behavior will be the same on recognizer B.  If they want to support
> multiple engines they must test on each engine and tune accordingly.
> Otherwise they risk undefined/incorrect behavior.
>
> I also agree that for the subset of developers that want to fine-tune
> their application for specific recognizers by using engine logs and
> training tools, this introduces an abstraction. However, for this subset of
> developers, either of two simple solutions can be used: (a) the recognition
> vendor could provide the engine-specific mapping so that the developer can
> easily convert the values, or (b) the vendor could provide a
> recognizer-specific custom setting that
> overrides confidenceThresholdAdjustment.
>
> [Milan] These workarounds would be worth the cost if we were defining a
> truly recognizer-independent solution.  But since we are not, I view the
> proposal as a pointless exercise in semantic juggling.
>
> I believe it's crucial that we define all attributes in the spec in a
> recognizer-independent manner, or at least recognizer-independent enough
> that most developers don't have to resort to recognizer-dependent
> programming.  If there are attributes that cannot be defined in a
> recognizer-independent manner, then I believe such inherently
> recognizer-specific settings should be just that,
> recognizer-specific custom settings.
>
> [Milan] I could point to 100s of examples in W3C and IETF specifications
> where expected behavior is not 100% clear and I assure you these
> ambiguities were not the product of careless editing.  There is good reason
> and precedent behind the industry definition of confidence.  Please don’t
> throw the baby out with the bathwater.
>
> Thanks,
>
> Glen Shires
>
> [Milan] Thank you too for keeping this discussion active.
>
> On Fri, Jun 1, 2012 at 5:20 PM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> Glen, it’s clear that you put a lot of thought into trying to come up with
> a compromise.  I appreciate the effort.
>
> My contention, however, is that this new mechanism for manipulating
> confidence is just as recognizer dependent as the much simpler mechanism of
> just setting the value.  All you have done is precisely define a new term
> using existing terminology that has no precise definition.  An “adjustment”
> of 0.3 doesn’t have any more of a grounded or recognizer-independent
> meaning than a “threshold” of 0.3.
>
> Furthermore, you’ve introduced yet another parameter to jiggle, and this
> will cause all sorts of headaches during the tuning phase.  That’s because
> the engine, logged results, and training tools will all be based on
> absolute confidence thresholds, and the user will need to figure out how to
> map those absolute thresholds onto the relative scale.  And they still need
> to perform this exercise independently for each engine.
>
> One of the things I do like about your proposal is that it circumvents the
> need to read the confidence threshold before setting it in incremental
> mode.  But this could just as easily be accomplished with syntax such as
> recognizer.confidence = "+.1".  If I added such a plus/minus prefix to my
> previous proposal would you be satisfied?
>
> Thanks
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Friday, June 01, 2012 9:01 AM
> *To:* Young, Milan
> *Cc:* Satish S; public-speech-api@w3.org
> *Subject:* Re: Confidence property
>
> I propose the following definition:
>
> attribute float confidenceThresholdAdjustment;
>
> - confidenceThresholdAdjustment attribute - This attribute defines a
> relative threshold for rejecting recognition results based on the estimated
> confidence score that they are correct.  The value
> of confidenceThresholdAdjustment ranges from -1.0 (least confidence) to 1.0
> (most confidence), with 0.0 mapping to the default confidence threshold as
> defined by the recognizer. confidenceThresholdAdjustment is monotonically
> increasing such that larger values will return an equal or smaller number
> of results than lower values.  (Note that the confidence scores reported
> within the SpeechRecognitionResult and within the EMMA results use a 0.0 -
> 1.0 scale, and the correspondence between these scores
> and confidenceThresholdAdjustment may vary across UAs, recognition engines,
> and even task to task.) Unlike maxNBest, there is no defined mapping
> between the value of the threshold and how many results will be returned.
>
> This definition has these advantages:
>
> For web developers, it provides flexibility and simplicity in a
> recognizer-independent manner. It covers the vast majority of the ways in
> which developers use confidence values:
>
> - Developers can easily adjust the threshold for certain tasks. For
> example, to confirm a transaction, the developer may increase the threshold
> to be more stringent than the recognizer's default, e.g.
> confidenceThresholdAdjustment = 0.3
>
> - Developers can adjust the threshold based on prior usage. For example, if
> not getting enough (or any) results, he may bump down the confidence to be
> more lenient, e.g. confidenceThresholdAdjustment -= 0.1 (Developers should
> ensure they don't underflow/overflow the -1.0 - 1.0 scale.)
>
> - Developers can perform their own processing of the results by comparing
> confidence scores in the normal manner.  (The confidence scores in the
> results use the recognizer's native scale, so they are not mapped or skewed
> and so relative comparisons are not affected by "inflated" or "deflated"
> ranges.)
>
> It provides clear semantics that are recognizer-independent:
>
> - It avoids all latency and asynchrony issues. The UA does not
> have to query the [potentially remote] recognizer for its default
> threshold value before the UA returns the value when
> this JavaScript attribute is read. Instead, the UA maintains the value of
> this attribute, and simply sends it to the recognizer along with the
> recognition request.
>
> - It avoids all issues of threshold values changing due to changes in the
> selected recognizer or task or grammar.
>
> - It allows recognition engines the freedom to define any mapping that is
> appropriate, and use any internal default threshold value they choose
> (which may vary from engine to engine and/or from task to task).
>
> The one drawback is that the confidenceThresholdAdjustment mapping
> may "require significant skewing of the range" and "squeeze" and "inflate".
> However, I see this as a minimal disadvantage, particularly when weighed
> against all the advantages above.
>
> Earlier in this thread we looked at four different options [1]. This
> solution is a variation of option 1 in that list. All the other options in
> that list have significant drawbacks:
>
> Option 2) Let speech recognizers define the default: has these
> disadvantages:
>
> - If a new recognizer is selected, its default threshold needs to be
> retrieved, an operation that may have latency. If the developer then reads
> the confidenceThreshold attribute, the read can't stall until the threshold
> is read. Fixing this requires defining an asynchronous event to indicate
> that the confidenceThreshold value is now available to be read. All very
> messy for both the web developer and the UA implementer.
>
> - The semantics are unclear and recognizer-dependent. If the developer sets
> confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a
> new task or grammar), does the confidenceThreshold change? If so, when, and
> how does the developer know the new value: does it get reset to the
> recognizer's default? If not, what does 0.4 now mean in this new context?
>
> Option 3) Make it write-only (not readable): has these disadvantages:
>
> - A developer must write recognizer-dependent code. Since he can't read
> the value, he can't increment/decrement it, so he must blindly set it. He
> must know what setting confidenceThreshold = 0.4 means for the current
> recognizer.
>
> Thus I propose the solution above, with its many advantages and only a
> minor drawback.
>
> [1]
> http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0051.html
>
> On Wed, May 23, 2012 at 3:56 PM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> >> The benefit of minimizing deaf periods is therefore again recognizer
> >> specific
>
> Most (all?) of the recognition engines which can be embedded within an
> HTML browser currently operate over a network.  In fact if you study the
> use cases, you’d find that the majority of those transactions are over a 3G
> network, which notoriously has high latency.
>
> It’s possible that this may begin to change over the next few years, but
> it’s surely not going to be in the lifetime of our 1.0 spec (at least I
> hope we can come to agreement before then :-) ).  Thus the problem can
> hardly be called engine specific.
>
> Yes, the semantics are unclear, but that wouldn’t be any different than a
> quasi-standard which would undoubtedly emerge in the absence of a
> specification.
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Wednesday, May 23, 2012 6:28 AM
> *To:* Young, Milan
> *Cc:* public-speech-api@w3.org
> *Subject:* Re: Confidence property
>
> Hi Milan,
>
>  Summarizing previous discussion, we have:
>
>   Pros:  1) Aids efficient application design, 2) minimizes deaf periods,
> 3) avoids a proliferation of semi-standard custom parameters.
>
>   Cons: 1) Semantics of the value are not precisely defined, and 2) Novice
> users may not understand how confidence differs from maxnbest.
>
> My responses to the cons are: 1) Precedent from the speech industry, and
> 2) Thousands of VoiceXML developers do understand the difference and will
> balk at an API that does not accommodate their needs.
>
> This was well debated in the earlier thread and it is clear that
> confidence threshold semantics are tied to the recognizer (not portable).
> The benefit of minimizing deaf periods is therefore again recognizer
> specific and not portable. This is a well suited use case for custom
> parameters and I'd suggest we start with that.
>
> Thousands of VoiceXML developers do understand the difference and will
> balk at an API that does not accommodate their needs.
>
> I hope we aren't trying to replicate VoiceXML in the browser. If it is
> indeed a must have feature for web developers we'll be receiving requests
> for it from them very soon, so it would be easy to discuss and add it in
> future.
>
Received on Tuesday, 5 June 2012 02:24:29 UTC