RE: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted") from Young, Milan on 2012-04-27 (public-speech-api@w3.org from April 2012)

From: Young, Milan <Milan.Young@nuance.com>
Date: Fri, 27 Apr 2012 20:20:39 +0000
To: Deborah Dahl <dahl@conversational-technologies.com>, 'Jerry Carter' <jerry@jerrycarter.org>, 'Glen Shires' <gshires@google.com>
CC: "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <B236B24082A4094A85003E8FFB8DDC3C1A4570AA@SOM-EXCH04.nuance.com>
In line with Deborah's comments below, I vote #2:

-          You can't count latency or inoperability as a disadvantage if all the alternatives have the same issue.

-          The speech industry has used approach #2 since its inception.  It would be an exercise in hubris for this community to suppose they could do better.

I would further suggest that Jerry's assumption that "the audience is primarily individuals with limited experience building speech applications" is unfounded.  There is strong representation in this community from Enterprise.


From: Deborah Dahl [mailto:dahl@conversational-technologies.com]
Sent: Friday, April 27, 2012 1:14 PM
To: 'Jerry Carter'; 'Glen Shires'
Cc: public-speech-api@w3.org
Subject: RE: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

I'm not sure what it means in practice to not define a confidenceThreshold (option 4). Doesn't it just mean that recognizer behavior is implementation-specific, and isn't that equivalent to option (2)? Isn't (4) subject to the same problems when changing recognizers as (2)?

From: Jerry Carter [mailto:jerry@jerrycarter.org]
Sent: Friday, April 27, 2012 3:03 PM
To: Glen Shires
Cc: public-speech-api@w3.org
Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

I'm inclined to agree with you on #4.

Assuming that the audience is primarily individuals with limited experience building speech applications, simplicity is admirable.  My experience has been that for the vast majority of cases, the recognition scores are bimodal (i.e. very high or very low).  Intermediate values are of limited utility outside of application development teams within recognition vendors and experienced speech applications teams.  This does not mean that recognition thresholds are useless, because they aren't.  A talented and experienced speech scientist can optimize settings to tailor the 'false acceptance' / 'false rejection' rates to satisfy business objectives, but in most cases, the cost of doing so is not justified.  Providing an outlet such as a custom parameter seems appropriate for a first version.

-=- Jerry


On Apr 27, 2012, at 2:53 PM, Glen Shires wrote:

If I may summarize, we have 4 proposals for confidenceThreshold attribute, each with drawbacks:

1. Arbitrarily define a value to be the default (be it 0.0 or 0.5 or whatever), and let speech recognizers map this to their own confidence values. Problem: mapping may "require significant skewing of the range" and "squeeze" and "inflate"..."This would confuse developers who believe that a .1 adjustment means the same thing across dialog states as long as they use the same engine." [1]

2. Let speech recognizers define the default. Problem: "If the developer switches to a new recognizer, the default confidenceThreshold may change. If the developer then reads the confidenceThreshold (for example, to increment it by 0.05), then presumably the browser needs to get the default confidence value from the speech recognizer. For a remote recognizer, this round-trip takes time, and the browser cannot stall the javascript processing." [2]

3. Make it write-only (not readable). Problem: "Incrementally bumping up confidence (eg recognizer.confidence += 5) in response to a series of misrecognitions is a common technique." [3]

4. Don't define a confidenceThreshold attribute in the first version of the specification. Rely on setCustomParameter instead.

Given the complexities of defining this properly, the differences between recognizer implementations, and because it takes a very savvy web-developer to know how to adjust this properly (for many developers, relying on the default is often best), I suggest #4 for the first version of this specification.

/Glen Shires

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0049.html
[2] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0042.html
[3] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0041.html


On Thu, Apr 26, 2012 at 9:50 AM, Young, Milan <Milan.Young@nuance.com> wrote:
I'm glad to see we are coming together on this issue.  But I don't yet understand how your [-1.0-1.0] mapping solves the problem of variations across recognition engines any more than the usual [0-1.0] does.

In both cases, the developer is guaranteed that higher numbers will produce <= results, and lower numbers will produce >= results.  But there is still no guarantee that running the same number (whether it be the default or otherwise) will provide the same number or quality of results across UAs, speech engines, time, etc.

The only thing that you have done is suggest a default value of 0, which could just as easily be represented with 0.5.  And on this point I'll take the stance that it would be better for the speech engine to decide what that value should be.  Sure, the speech engine could always remap that number back to 0.5, but in many cases this will require significant skewing of the range.  For example, let's say that the engine selected .9 on [0-1.0] as a good value for the task.  It now needs to map that to 0.5, which means that the real scores of 0-.899 are squeezed, and the .91-1.0 are inflated.  This would confuse developers who believe that a .1 adjustment means the same thing across dialog states as long as they use the same engine.
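
To make the skew concrete, here is a minimal sketch. It assumes the engine remaps piecewise-linearly so that its preferred native threshold lands on 0.5; the function name remapToPublic and the linear shape are illustrative assumptions, not something any engine is known to do.

```javascript
// Hypothetical piecewise-linear remap: the engine's preferred threshold
// (nativeDefault, strictly between 0 and 1 on its native scale) is pinned
// to 0.5 on the public scale.
function remapToPublic(nativeScore, nativeDefault) {
  if (nativeScore <= nativeDefault) {
    // [0, nativeDefault] is squeezed into [0, 0.5]
    return 0.5 * (nativeScore / nativeDefault);
  }
  // (nativeDefault, 1] is stretched over (0.5, 1]
  return 0.5 + 0.5 * ((nativeScore - nativeDefault) / (1 - nativeDefault));
}

// The same 0.05 native step covers very different public distances
// on either side of a 0.9 default.
const stepBelow = remapToPublic(0.9, 0.9) - remapToPublic(0.85, 0.9);
const stepAbove = remapToPublic(0.95, 0.9) - remapToPublic(0.9, 0.9);
```

With a native default of 0.9, a step just above the default moves roughly nine times as far on the public scale as the same step just below it, which is the inconsistency described above.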

Put another way, if this community group could really solve the problem of unifying confidence scores, then I'm all for it.  But we haven't done that, and as such I view this suggestion as rocking the boat without any gain.

Thanks


From: Glen Shires [mailto:gshires@google.com]
Sent: Wednesday, April 25, 2012 1:20 PM
To: Young, Milan
Cc: Hans Wennborg; Satish S; public-speech-api@w3.org
Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")


I heartily agree that it is typically best to filter low confidence matches in the speech recognizer (reducing computation and bandwidth). Using JS to process the confidence values returned in the results does not imply that pruning is not done in the recognizer. The recognizer still uses a confidenceThreshold. If a savvy JS developer chooses to set the confidenceThreshold, he should do so such that the recognizer prunes (but doesn't over prune) the data returned, so JS can sufficiently process the returned confidence values.

Implementation of a readable confidenceThreshold with ABSOLUTE values corresponding to the recognizer's default value is problematic when supporting multiple speech recognizers. Therefore I propose the following definition for a readable/writeable confidenceThreshold attribute that uses RELATIVE values...



confidenceThreshold attribute

This attribute represents a relative degree of confidence the recognition system needs in order to return a recognition match instead of a nomatch. The confidence-threshold is a monotonically increasing value between -1.0 (least confidence needed) and 1.0 (most confidence needed) with 0.0 as the default.



In this way, 0.0 is the default for all recognizers, and each recognizer is free to define how to map the threshold into whatever confidence values it returns with the results. In other words:

- For a confidenceThreshold of 0.0, one recognizer may return results with confidence values no lower than, for example, 0.72 whereas another might return confidence values no lower than, for example, 0.31.

- For a confidenceThreshold of, for example, -0.2, each recognizer will return more (or at least no fewer) results.

- For a confidenceThreshold of, for example, 0.2, each recognizer will return fewer (or at least no more) results.

I believe this is a good step towards consistent behavior across UAs and speech engines.

(Note that I intentionally defined confidenceThreshold as a value between -1.0 and 1.0 instead of between 0.0 and 1.0 for clarity. This is to emphasize that these threshold values are RELATIVE and do not have any ABSOLUTE correspondence to the confidence values returned.)
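
One way a recognizer might honor this definition is sketched below. The helper name toNativeCutoff and the linear interpolation are assumptions made for illustration; the proposal leaves the actual mapping to each recognizer.

```javascript
// Hypothetical engine-side mapping from the relative threshold in [-1, 1]
// to the engine's own absolute cutoff on [0, 1].
function toNativeCutoff(relativeThreshold, nativeDefault) {
  if (relativeThreshold < -1 || relativeThreshold > 1) {
    throw new RangeError("confidenceThreshold must be within [-1.0, 1.0]");
  }
  if (relativeThreshold >= 0) {
    // 0.0 -> nativeDefault, 1.0 -> 1.0
    return nativeDefault + relativeThreshold * (1 - nativeDefault);
  }
  // -1.0 -> 0.0, 0.0 -> nativeDefault
  return nativeDefault * (1 + relativeThreshold);
}

// Two engines using the example defaults from the text.
const engineA = toNativeCutoff(0.0, 0.72);
const engineB = toNativeCutoff(0.0, 0.31);
```

Under this sketch both engines keep their own cutoff at 0.0, and any increase in the relative threshold never lowers the native cutoff, so the "fewer (or at least no more) results" guarantee holds.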

/Glen Shires


On Wed, Apr 25, 2012 at 11:08 AM, Young, Milan <Milan.Young@nuance.com> wrote:
You have ignored my two points about why it is often best to filter low confidence matches on the server (ie performance and clipping).  Just because Deborah points out that there are additional use cases for filtering on the client does not invalidate my claim.

Yes, we should try to deliver consistent behavior across UAs, speech engines, and even dialog states.  But let's not throw the baby out with the bathwater if we can't nail it down in a v1.


From: Glen Shires [mailto:gshires@google.com]
Sent: Wednesday, April 25, 2012 10:43 AM
To: Young, Milan
Cc: Hans Wennborg; Satish S; public-speech-api@w3.org
Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

I think (hope) that most web developers won't have to worry about confidence values because the default set by the speech recognizer should be sufficient.

However, a JS API developer savvy enough to understand how/when to properly set a confidenceThreshold, is also savvy enough to intelligently process the confidence values returned in the results. As Deborah mentioned [1], "For example, if the top two alternatives in the nbest have very similar confidences...".  Typically, processing the confidence result values is a much better strategy than trying to tune the confidenceThreshold.

Only extremely savvy JS API developers will understand how to properly tune the confidenceThreshold so that it prunes (but doesn't over prune) the data returned.  I believe these developers can best adjust the confidenceThreshold by processing the confidence result values returned by prior recognitions (as opposed to simply bumping the default value by 0.05).
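
The near-tie case Deborah mentions could be handled in script roughly as follows. This is only a sketch: the shape of the alternatives (plain objects with transcript and confidence fields, sorted best-first), the helper name isAmbiguous, and the 0.1 gap are all assumptions.

```javascript
// Hypothetical check on an n-best list: each alternative is a plain object
// of the form { transcript, confidence }, sorted best-first.
function isAmbiguous(alternatives, minGap = 0.1) {
  if (alternatives.length < 2) return false;
  // A small gap between the top two scores suggests the engine
  // could not clearly separate them.
  return alternatives[0].confidence - alternatives[1].confidence < minGap;
}

const nbest = [
  { transcript: "call Anne", confidence: 0.82 },
  { transcript: "call Dan", confidence: 0.80 },
];
const needsConfirmation = isAmbiguous(nbest);
```

An application could use the flag to ask the user to confirm rather than acting on the top result, without touching the recognizer's threshold at all.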


Also, from an implementation standpoint, there's a major issue with making confidenceThreshold readable. If the developer switches to a new recognizer, the default confidenceThreshold may change. If the developer then reads the confidenceThreshold (for example, to increment it by 0.05), then presumably the browser needs to get the default confidence value from the speech recognizer. For a remote recognizer, this round-trip takes time, and the browser cannot stall the javascript processing.

/Glen Shires

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0031.html

On Wed, Apr 25, 2012 at 9:47 AM, Young, Milan <Milan.Young@nuance.com> wrote:
The speech community has lived for 20 years with the fact that confidence values are not portable across engines.  I understand that we are courting a new class of developers with this HTML-based initiative, but I want to be careful not to dumb it down to the point where we impact the mainstream speech industry.

Incrementally bumping up confidence (eg recognizer.confidence += 5) in response to a series of misrecognitions is a common technique.  I also find it generally ugly that confidence is special cased with a function instead of a property.  (Is it a JS limitation that you cannot mark a property as write only?)
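
A minimal sketch of that bump-on-misrecognition technique, assuming a readable and writable threshold on a 0.0 to 1.0 scale. The property name confidenceThreshold and the helper name are illustrative, not settled API.

```javascript
// Raise the threshold by a fixed step after a misrecognition, capped at 1.0.
// `recognizer` is any object with a numeric, readable and writable
// `confidenceThreshold` property on [0, 1] (an assumption for this sketch).
function bumpAfterMisrecognition(recognizer, step = 0.05) {
  recognizer.confidenceThreshold = Math.min(
    1.0,
    recognizer.confidenceThreshold + step
  );
  return recognizer.confidenceThreshold;
}
```

Note that the helper has to read the current value before writing it, which is why a write-only setter would not support this pattern.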

I would rather say something like "Recognition engines generally do a good job of choosing the right confidence value for a recognition task.  If you do choose to read this property, know that its value is not portable to other recognition tasks, other speech engines, or other user agents."

Thanks

From: Glen Shires [mailto:gshires@google.com]
Sent: Wednesday, April 25, 2012 8:11 AM
To: Hans Wennborg
Cc: Young, Milan; Satish S; public-speech-api@w3.org
Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

confidenceThreshold

I think we all agree that speech recognizers have a concept of confidence, and that it can be mapped to a monotonically increasing range of 0.0 to 1.0.  However, specific values (for example 0.5) do not correspond to the same level of confidence for different recognizers.

I believe that if the developer does not set the confidenceThreshold, the speech recognizer should use a default value that is appropriate for that recognizer.

A complication with a confidenceThreshold attribute is defining the default value (if the value is read, but not written, what value does the BROWSER return? - particularly because the optimal default value may vary from one RECOGNIZER to another).

Perhaps instead of an attribute, this should be a write-only value, specifically a setConfidenceThreshold method.

/Glen Shires
On Wed, Apr 25, 2012 at 6:43 AM, Hans Wennborg <hwennborg@google.com> wrote:
On Tue, Apr 24, 2012 at 17:22, Young, Milan <Milan.Young@nuance.com> wrote:
> There are two reasons for including confidence that I would like this community to consider:
>  Efficiency - Similar to the argument Satish put forward for limiting the size of the nbest array, pruning the result candidates at the server is more efficient.
>  Clipping - There are many environments where background noise and side speech can trigger junk results.  If confidence is low, this will trigger a result and then the application enters a deaf period where it processes the result and discovers the content is junk.  If real speech happens during this phase, its start will be missed.
>
> Every recognizer that was ever invented has a concept of confidence.  Yes, the semantics of that value vary across platforms, but for us to push this to a custom parameter will confuse developers, and ultimately slow adoption.
Ok, I don't feel strongly about this, so I would be fine adding a
confidenceThreshold if others agree.

> Regarding the timeout family, an open-ended dialog like "Tell me what is wrong with your computer", should have generous timeouts.  Compare this to "So it's something to do with your new Google double mouse configuration, is that correct?" which should have short timeouts.
>
> Our goal should be a consistent application experience across UAs, and that's only going to happen if we standardize timeouts.  I would also like to mention that the definition of these timeouts is clear and has been industry standard for 10+ years.
What do you think about my idea of just letting the web page handle
the timeout itself, calling abort() when it decides a request is
taking too long?
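
A page-side timeout along those lines might look like the sketch below. It assumes the recognizer exposes start(), abort() and an onresult handler slot; the helper name startWithTimeout and the injectable timers argument are illustrative additions so the sketch can be exercised without a real recognizer.

```javascript
// Start recognition and abort it if no result arrives within `timeoutMs`.
// `recognizer` is assumed to expose start(), abort() and an `onresult` slot.
// `timers` defaults to thin wrappers around the global timer functions.
function startWithTimeout(
  recognizer,
  timeoutMs,
  timers = {
    setTimeout: (fn, ms) => setTimeout(fn, ms),
    clearTimeout: (handle) => clearTimeout(handle),
  }
) {
  const handle = timers.setTimeout(() => recognizer.abort(), timeoutMs);
  const userHandler = recognizer.onresult;
  recognizer.onresult = (event) => {
    timers.clearTimeout(handle); // a result arrived in time
    if (userHandler) userHandler(event);
  };
  recognizer.start();
  return handle;
}
```

The page could then pick a generous timeout for an open-ended prompt and a short one for a yes/no confirmation, which covers the two dialog styles described in the quoted text.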


Thanks,
Hans



--
Thanks!
Glen Shires




Received on Friday, 27 April 2012 20:21:17 UTC