RE: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

In this case, the needs of server implementations are a superset of those of client implementations.  So if it's true that server-side implementations are predominant (as my poll suggests), then the question becomes relevant.


From: Glen Shires [mailto:gshires@google.com]
Sent: Friday, May 04, 2012 10:11 AM
To: Young, Milan
Cc: Satish S; Jerry Carter; public-speech-api@w3.org
Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

> Does anyone on this forum plan to run the recognition on the client?

Whether or not anyone "on this forum" does, I believe we should consider the implications of both client-side and server-side recognition implementations of this API, because there exist many client-side as well as server-side implementations of speech recognition engines. The goal is to create a standard API that can be widely adopted and implemented.

Glen Shires

On Fri, May 4, 2012 at 9:43 AM, Young, Milan <Milan.Young@nuance.com> wrote:
Hello Satish,

I believe my "no harm" comment was taken out of context.  The point was that confidence is a mainstream concept in the speech industry, and it's hard to see how those outside the mainstream would be harmed by its inclusion.

Regarding nbest vs. confidence:  Hotword recognitions usually involve only a single phrase, so telling the recognizer that you want only a single result is a no-op.  The speech industry understood this point 20 years ago, and that's why we have separate parameters.
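To make the distinction concrete, here is a rough JavaScript sketch.  Note that confidenceThreshold and maxNBest are the parameters under discussion, not part of the current editor's draft, and the surrounding shape (SpeechRecognition, SpeechGrammarList, the result event) only loosely follows the draft:

    // Hypothetical sketch: confidenceThreshold and maxNBest are proposals
    // under discussion, not part of the current editor's draft.
    var recognition = new SpeechRecognition();

    // A one-phrase hotword grammar: only one interpretation is possible,
    // so capping the n-best list at 1 constrains nothing.
    var grammars = new SpeechGrammarList();
    grammars.addFromString('#JSGF V1.0; grammar hotword; public <wake> = wake up;', 1.0);
    recognition.grammars = grammars;

    recognition.maxNBest = 1;              // a no-op for a single-phrase grammar
    recognition.confidenceThreshold = 0.8; // rejects low-confidence matches instead

    recognition.onresult = function (event) {
      // With a recognizer-side threshold, low-confidence noise never reaches
      // this handler, so the page is not woken up by false positives.
      console.log('Hotword detected: ' + event.results[0][0].transcript);
    };
    recognition.start();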

Two points regarding server optimization:

- Does anyone on this forum plan to run the recognition on the client?

- If we were talking about significant changes to the architecture, I agree that performance might take a back seat.  But this is just a single parameter, so it's hard to see the value of this line of reasoning.

Thanks



From: Satish S [mailto:satish@google.com]
Sent: Friday, May 04, 2012 8:40 AM
To: Young, Milan
Cc: Jerry Carter; public-speech-api@w3.org

Subject: Re: Additional parameters to SpeechRecognition (was "Speech API: first editor's draft posted")

But "confidence" is a much easier to understand concept, and I don't see any harm to the average web developer by including it in the list.

FWIW, that shouldn't be the bar for including items in the API. Since web APIs are, in practice, supported perpetually, we should start with the most basic set and iterate based on concrete application requirements.

> One example is "hotword" recognition, which might be used to wake up the application after long periods of silence, side speech, noise, etc.  The hotword grammar is often very simple (e.g., "wake up"), and thus multiple interpretations are extremely uncommon.  Developers would use "confidence" to avoid false positives, which consume processing resources and induce the deaf periods I mentioned before.

I can see the same use case addressed by setting maxNBest=1 so that only the topmost interpretation is returned and the engine optimises resources for that.
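A rough sketch of that alternative, again with maxNBest as the proposed parameter and using the per-alternative confidence score that results already carry (the 0.8 cutoff is an arbitrary example):

    // Sketch: maxNBest is the proposed parameter; the app filters on the
    // confidence score of the single returned alternative.
    var recognition = new SpeechRecognition();
    recognition.maxNBest = 1; // ask the engine for only the topmost interpretation

    recognition.onresult = function (event) {
      var top = event.results[0][0];
      // The app, not the recognizer, decides what counts as a false positive.
      if (top.confidence >= 0.8) {
        console.log('Hotword detected: ' + top.transcript);
      }
      // Below the cutoff: ignore the event and keep listening.
    };
    recognition.start();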

I am also wondering if optimising for server-side performance should even be a consideration when designing the web speech API. Developing a simple, web-developer-facing API is our explicit goal; optimisation is something that implementors of both UAs and speech engines will do based on many parameters, so the API should not really care about it.

> I'm not sure what it means in practice to not define a confidenceThreshold (option 4). Doesn't it just mean that recognizer behavior is implementation-specific, and isn't that equivalent to option (2)? Isn't (4) subject to the same problems when changing recognizers as (2)?

Yes, I think (2) and (4) are the same, because the actual custom parameters aren't going to be defined in the spec.
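In practice, I imagine (2)/(4) would look something like the following, where every name is hypothetical and vendor-specific rather than spec-defined:

    // Hypothetical vendor-specific parameter; neither the property name nor
    // the key is defined by the spec under options (2)/(4).
    var recognition = new SpeechRecognition();
    recognition.customParameters = { 'com.example.confidenceThreshold': 0.8 };
    // Switching to a different recognizer means this key, its scale, and its
    // default may all change, which is the portability problem noted above.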

Received on Friday, 4 May 2012 17:15:45 UTC