Re: Confidence property from Glen Shires on 2012-06-01 (public-speech-api@w3.org from June 2012)

From: Glen Shires <gshires@google.com>
Date: Fri, 1 Jun 2012 09:01:17 -0700
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <CAEE5bcjcHEkoPYOL2+=kYvRaKqtXbQfsTH_B4w-wrbJ75ok-9w@mail.gmail.com>
I propose the following definition:

attribute float confidenceThresholdAdjustment;

- confidenceThresholdAdjustment attribute - This attribute defines a
relative threshold for rejecting recognition results based on their
estimated confidence score, i.e. the recognizer's estimate that they are
correct.  The value of confidenceThresholdAdjustment ranges from -1.0
(least confidence) to 1.0 (most confidence), with 0.0 mapping to the
default confidence threshold as defined by the recognizer. The effective
threshold increases monotonically with confidenceThresholdAdjustment, so a
larger value returns the same number of results or fewer than a smaller
value.  (Note that the confidence scores reported within the
SpeechRecognitionResult and within the EMMA results use a 0.0 to 1.0
scale, and the correspondence between these scores
and confidenceThresholdAdjustment may vary across UAs, recognition engines,
and even task to task.) Unlike maxNBest, there is no defined mapping
between the value of the threshold and how many results will be returned.
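
As an illustration only, here is one way a recognition engine might map
the adjustment onto its native 0.0 to 1.0 threshold. The piecewise-linear
mapping, the function names, and the sample scores are all assumptions
made for this sketch; the definition above deliberately leaves the mapping
to each recognizer and only requires that it be monotonic, with 0.0
mapping to the default.

```javascript
// Hypothetical engine-side mapping (an assumption, not part of the proposal):
//   -1.0 -> native threshold 0.0
//    0.0 -> the engine's own default threshold
//    1.0 -> native threshold 1.0
function effectiveThreshold(adjustment, defaultThreshold) {
  if (!(adjustment >= -1.0 && adjustment <= 1.0)) {
    throw new RangeError("adjustment must be within [-1.0, 1.0]");
  }
  if (adjustment >= 0) {
    // Interpolate between the default and the top of the native scale.
    return defaultThreshold + adjustment * (1.0 - defaultThreshold);
  }
  // Interpolate between the bottom of the native scale and the default.
  return defaultThreshold * (1.0 + adjustment);
}

// Reject every result whose native confidence falls below the threshold.
function filterResults(results, adjustment, defaultThreshold) {
  const threshold = effectiveThreshold(adjustment, defaultThreshold);
  return results.filter((r) => r.confidence >= threshold);
}

// Made-up results and a made-up engine default of 0.5.
const sample = [
  { transcript: "pay fifty dollars", confidence: 0.82 },
  { transcript: "pay fifteen dollars", confidence: 0.55 },
  { transcript: "play fifty dollars", confidence: 0.3 },
];
const lenient = filterResults(sample, -0.5, 0.5); // native threshold 0.25
const normal = filterResults(sample, 0.0, 0.5); // native threshold 0.5
const strict = filterResults(sample, 0.5, 0.5); // native threshold 0.75
console.log(lenient.length, normal.length, strict.length); // prints: 3 2 1
```

Because the mapping is monotonic, raising the adjustment can only keep the
result count the same or reduce it, which is the one guarantee the
definition gives.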



This definition has these advantages:

For web developers, it provides flexibility and simplicity in a
recognizer-independent manner. It covers the vast majority of the ways in
which developers use confidence values:

- Developers can easily adjust the threshold for certain tasks. For
example, to confirm a transaction, the developer may increase the threshold
to be more stringent than the recognizer's default, e.g.
confidenceThresholdAdjustment = 0.3

- Developers can adjust the threshold based on prior usage. For example, if
not getting enough (or any) results, the developer may lower the adjustment
to be more lenient, e.g. confidenceThresholdAdjustment -= 0.1 (Developers
should ensure they don't underflow/overflow the -1.0 to 1.0 scale.)

- Developers can perform their own processing of the results by comparing
confidence scores in the normal manner.  (The confidence scores in the
results use the recognizer's native scale, so they are not mapped or skewed
and so relative comparisons are not affected by "inflated" or "deflated"
ranges.)
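
A minimal sketch of the relative adjustments described above, with the
underflow/overflow guard included. The helper name adjustThreshold and the
plain stand-in object are hypothetical; only the
confidenceThresholdAdjustment attribute comes from the proposal.

```javascript
// Hypothetical helper: apply a relative change and keep the value
// inside the proposed -1.0 to 1.0 range.
function adjustThreshold(recognition, delta) {
  const next = recognition.confidenceThresholdAdjustment + delta;
  recognition.confidenceThresholdAdjustment = Math.min(1.0, Math.max(-1.0, next));
  return recognition.confidenceThresholdAdjustment;
}

// Stand-in for a recognition object exposing the proposed attribute.
const recognition = { confidenceThresholdAdjustment: 0.0 };

// Be more stringent than the recognizer's default to confirm a transaction.
recognition.confidenceThresholdAdjustment = 0.3;

// Too few results came back, so relax the threshold step by step.
// The clamp stops the value at the bottom of the scale.
for (let i = 0; i < 20; i++) {
  adjustThreshold(recognition, -0.1);
}
console.log(recognition.confidenceThresholdAdjustment); // prints: -1
```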

It provides clear semantics that are recognizer-independent:

- It avoids all latency and asynchrony issues. The UA does not have to
query the [potentially remote] recognizer for its default threshold value
before the UA returns the value when this JavaScript attribute is read.
Instead, the UA maintains the value of this attribute, and simply sends it
to the recognizer along with the recognition request.

- It avoids all issues of threshold values changing due to changes in the
selected recognizer or task or grammar.

- It allows recognition engines the freedom to define any mapping that is
appropriate, and use any internal default threshold value they choose
(which may vary from engine to engine and/or from task to task).
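
The first point above can be sketched as follows: the UA keeps the
attribute value locally, so a read never waits on the recognizer, and the
value simply travels with each recognition request. The class name, the
request shape, and the sendToRecognizer callback are all invented for this
sketch and are not part of any specification.

```javascript
// Hypothetical UA-side object; names and request format are assumptions.
class SketchSpeechRecognition {
  constructor(sendToRecognizer) {
    // Stand-in for the transport to a (possibly remote) recognizer.
    this._sendToRecognizer = sendToRecognizer;
    // 0.0 means "use whatever default threshold the recognizer has".
    this._adjustment = 0.0;
  }

  // Reading is synchronous: the UA owns the value, so no round trip
  // to the recognizer is needed.
  get confidenceThresholdAdjustment() {
    return this._adjustment;
  }

  set confidenceThresholdAdjustment(value) {
    this._adjustment = Number(value);
  }

  // The adjustment is sent along with the recognition request itself.
  start() {
    this._sendToRecognizer({
      type: "recognize",
      confidenceThresholdAdjustment: this._adjustment,
    });
  }
}

const sent = [];
const rec = new SketchSpeechRecognition((request) => sent.push(request));

const initial = rec.confidenceThresholdAdjustment; // 0, nothing sent yet
rec.confidenceThresholdAdjustment = 0.3;
rec.start();
console.log(initial, sent.length, sent[0].confidenceThresholdAdjustment);
// prints: 0 1 0.3
```

Since the stored value is relative to whatever default the engine uses,
switching recognizers, tasks, or grammars leaves it unchanged and still
meaningful.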

The one drawback is that the confidenceThresholdAdjustment mapping
may "require significant skewing of the range" and "squeeze" and "inflate".
However, I see this as a minimal disadvantage, particularly when weighed
against all the advantages above.



Earlier in this thread we looked at four different options [1]. This
solution is a variation of option 1 in that list. All the other options in
that list have significant drawbacks:

Option 2) Let speech recognizers define the default: has these
disadvantages:

- If a new recognizer is selected, its default threshold needs to be
retrieved, an operation that may have latency. If the developer then reads
the confidenceThreshold attribute, the read can't block until the threshold
has been retrieved. Fixing this requires defining an asynchronous event to
indicate that the confidenceThreshold value is now available to be read.
All very messy for both the web developer and the UA implementer.

- The semantics are unclear and recognizer-dependent. If the developer sets
confidenceThreshold = 0.4, then selects a new recognizer (or perhaps a
new task or grammar), does the confidenceThreshold change? If so, when, and
how does the developer know to what value? Does it get reset to the
recognizer's default? If not, what does 0.4 now mean in this new context?

Option 3) Make it write-only (not readable): has these disadvantages:

- A developer must write recognizer-dependent code. Since he can't read the
value, he can't increment/decrement it, so he must blindly set it. He must
know what setting confidenceThreshold = 0.4 means for the current
recognizer.


Thus I propose the solution above, with its many advantages and only a
minor drawback.

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0051.html


On Wed, May 23, 2012 at 3:56 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> >> The benefit of minimizing deaf periods is therefore again recognizer
> >> specific
>
> Most (all?) of the recognition engines which can be embedded within an
> HTML browser currently operate over a network.  In fact if you study the
> use cases, you’d find that the majority of those transactions are over a 3G
> network, which has notoriously high latency.
>
> It’s possible that this may begin to change over the next few years, but
> it’s surely not going to be in the lifetime of our 1.0 spec (at least I
> hope we can come to agreement before then :)).  Thus the problem can
> hardly be called engine specific.
>
> Yes, the semantics are unclear, but that wouldn’t be any different than a
> quasi-standard which would undoubtedly emerge in the absence of a
> specification.
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Wednesday, May 23, 2012 6:28 AM
> *To:* Young, Milan
> *Cc:* public-speech-api@w3.org
> *Subject:* Re: Confidence property
>
> Hi Milan,
>
>  Summarizing previous discussion, we have:
>
>   Pros:  1) Aids efficient application design, 2) minimizes deaf periods,
> 3) avoids a proliferation of semi-standard custom parameters.
>
>   Cons: 1) Semantics of the value are not precisely defined, and 2) Novice
> users may not understand how confidence differs from maxNBest.
>
> My responses to the cons are: 1) Precedent from the speech industry, and
> 2) Thousands of VoiceXML developers do understand the difference and will
> balk at an API that does not accommodate their needs.
>
> This was well debated in the earlier thread and it is clear that
> confidence threshold semantics are tied to the recognizer (not portable).
> The benefit of minimizing deaf periods is therefore again recognizer
> specific and not portable. This is a well-suited use case for custom
> parameters and I'd suggest we start with that.
>
> Thousands of VoiceXML developers do understand the difference and will
> balk at an API that does not accommodate their needs.
>
> I hope we aren't trying to replicate VoiceXML in the browser. If it is
> indeed a must-have feature for web developers we'll be receiving requests
> for it from them very soon, so it would be easy to discuss and add it in
> future.
>
Received on Friday, 1 June 2012 16:02:34 UTC