API Hooks for Setting Speech Recognition Parameters

August 4, 2011
Deborah Dahl

This proposal covers the API hooks for specifying grammars and other recognition properties: both what these properties are and how to specify them.
This is an update of my earlier proposal.

Relevant requirements and design decisions

Strong interest

Moderate interest

Mild interest

New

  1. The web application must be able to perform continuous recognition (i.e., dictation).
  2. The web application must be able to perform an open mic scenario (i.e., always listening for keywords).
  3. The web application must be able to get interim recognition results when it is performing continuous recognition.

Design Decisions

Settable Recognition Properties

In this proposal, all parameters are set on a "recognizer" object. Another option would be to have a separate "properties" object where properties are set, but that seems more complicated. It might, however, address the question of when property settings actually take effect on the recognizer, because we could say that they take effect when the recognizer's properties object is set. Some speech APIs also have other objects, like "grammar" and "rule". I think these are primarily useful for dynamically manipulating grammars in code, and I'm not sure how much of that developers are going to do. I can think of use cases, for example enabling users to add their own words to the grammar, but defining "grammar" and "rule" objects seems like a lower priority to me.

From requirements and design decisions

recognizer.grammar() (DD11, FPR44) This means "use the default language model" of the implementation. I'm not sure we need this, though, since normally you would just not specify a grammar at all.
recognizer.grammar(URI or String) where the String is the name of a builtin grammar
recognizer.grammar(URI or String, float) where the float represents the grammar's weight relative to other active grammars. (FPR34, FPR45, FPR48, DD9, DD21, DD55, DD72)

Multiple grammars are possible (DD55), so setting a grammar doesn't necessarily mean disabling a previously set grammar. This means that there needs to be a way to explicitly disable a grammar. Either
recognizer.disablegrammar(URI) (FPR45) to disable a specific grammar only
or perhaps
recognizer.grammar(URI, boolean), where if the boolean is true, the grammar is modal and all other grammars are disabled. This is simpler but less flexible than "disablegrammar". We could also have both.
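
As a rough illustration of how weighted grammars, "disablegrammar", and the modal flag could coexist, here is a minimal TypeScript sketch. The class name GrammarSet, the active() helper, the default weight of 1.0, the example.com URIs, and the choice to fold the weight and the modal boolean into one method are all assumptions made for the example; none of them is part of the proposal.

```typescript
// Hypothetical sketch: GrammarSet, active(), and the default weight are
// illustrative assumptions, not part of the proposal.
interface GrammarEntry {
  uri: string;    // grammar URI or builtin grammar name
  weight: number; // weight relative to other active grammars
}

class GrammarSet {
  private entries: GrammarEntry[] = [];

  // Combines recognizer.grammar(URI, float) with the optional modal boolean.
  grammar(uri: string, weight: number = 1.0, modal: boolean = false): void {
    if (modal) {
      // A modal grammar disables all other grammars.
      this.entries = [];
    } else {
      // Setting the same grammar again replaces it rather than duplicating it.
      this.entries = this.entries.filter((e) => e.uri !== uri);
    }
    this.entries.push({ uri: uri, weight: weight });
  }

  // Mirrors recognizer.disablegrammar(URI): disables one specific grammar only.
  disablegrammar(uri: string): void {
    this.entries = this.entries.filter((e) => e.uri !== uri);
  }

  // Lists the URIs of the grammars that are currently active.
  active(): string[] {
    return this.entries.map((e) => e.uri);
  }
}

const grammarSet = new GrammarSet();
grammarSet.grammar("http://example.com/cities.grxml", 0.8);
grammarSet.grammar("http://example.com/dates.grxml", 0.2);
// Only the dates grammar is disabled; the cities grammar stays active.
grammarSet.disablegrammar("http://example.com/dates.grxml");
// Modal: every other grammar is disabled, so only yesno remains.
grammarSet.grammar("http://example.com/yesno.grxml", 1.0, true);
```

In this sketch the modal flag is just a shortcut for disabling every other grammar, which is why having both mechanisms costs little.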

recognizer.setmaxresults(integer) (DD36) the maximum size of the nbest list; the default is 1. The standard could set an upper limit on this value, leave the limit up to the implementation, or set a "minimum maximum", that is, every implementation has to be able to return at least N results, but implementations can return more than N if they like.
recognizer.setlanguage(String)
(FPR38, DD10) language of recognition, using standard ISO language codes. We could also have some convenience syntax so that developers can use more normal words to refer to languages, like "French" vs. "fr". The standard needs to define what happens if no language is set, or if multiple languages are set.
recognizer.recognitiontype(constant) (e.g. streaming, hotword) (NR1, NR2, DD33).
The default is recognition after the user stops speaking. For hotword recognition we need a way to specify the hotwords, perhaps
recognizer.recognitiontype(hotword, array of hotwords)
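
One way to picture the recognition type together with its optional hotword list is the TypeScript sketch below. The string constants "default", "streaming", and "hotword", the RecognitionType union, and the rule that hotword recognition requires a non-empty list are assumptions for illustration; the proposal only speaks of a "constant" and an "array of hotwords".

```typescript
// Hypothetical sketch: the constant names and the validation rules are
// assumptions, not part of the proposal.
type RecognitionType =
  | { kind: "default" }                      // recognize after the user stops speaking
  | { kind: "streaming" }                    // continuous recognition (NR1)
  | { kind: "hotword"; hotwords: string[] }; // open mic keyword listening (NR2)

function recognitiontype(
  kind: "default" | "streaming" | "hotword",
  hotwords?: string[]
): RecognitionType {
  if (kind === "hotword") {
    if (hotwords === undefined || hotwords.length === 0) {
      throw new Error("hotword recognition needs at least one hotword");
    }
    // Copy the array so later changes by the caller do not leak in.
    return { kind: "hotword", hotwords: hotwords.slice() };
  }
  if (hotwords !== undefined) {
    throw new Error("hotwords are only meaningful for hotword recognition");
  }
  return kind === "streaming" ? { kind: "streaming" } : { kind: "default" };
}

const openMic = recognitiontype("hotword", ["computer", "help"]);
const dictation = recognitiontype("streaming");
```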

recognizer.savewaveformURI(URI) (FPR57) the location where the waveform is saved
recognizer.inputwaveformURI(URI) (FPR57) when recognition starts, recognize from this saved waveform
recognizer.savewaveform(boolean) specify whether the recognizer should save waveforms

recognizer.canrerecognize(boolean) (DD76)
recognizer.endpointdetection(boolean) (DD28)
recognizer.enablefinalizebeforeend(boolean) (DD34)
recognizer.sendinterimresults(boolean) (NR3)
the recognizer sends interim results at some frequency determined by the recognizer
recognizer.sendinterimresults(integer) (NR3)
the integer indicates the requested interval between results in msec
generally, for service-specific parameters:
recognizer.setparameter(parameter name, parameter value) (DD73)
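
The two forms of sendinterimresults and the generic setparameter call could be modeled along the following lines in TypeScript. The InterimSetting shape, the use of null to mean "the recognizer decides", the rejection of non-positive intervals, and the string-only parameter values are assumptions made for this sketch.

```typescript
// Hypothetical sketch: the InterimSetting shape and the validation rules are
// assumptions, not part of the proposal.
interface InterimSetting {
  enabled: boolean;
  intervalMsec: number | null; // null: the recognizer chooses the frequency
}

// Accepts either the boolean form or the integer (msec) form.
function sendinterimresults(arg: boolean | number): InterimSetting {
  if (typeof arg === "boolean") {
    return { enabled: arg, intervalMsec: null };
  }
  if (!isFinite(arg) || arg <= 0 || Math.floor(arg) !== arg) {
    throw new RangeError("interval must be a positive integer number of msec");
  }
  return { enabled: true, intervalMsec: arg };
}

// Generic escape hatch for service-specific parameters (DD73).
const serviceParameters: { [name: string]: string } = {};

function setparameter(name: string, value: string): void {
  serviceParameters[name] = value;
}

const recognizerPaced = sendinterimresults(true); // recognizer picks the pace
const every250Msec = sendinterimresults(250);     // application asks for 250 msec
setparameter("age", "adult");
```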

Other

These are not in the requirements or design decisions, but they are commonly used in speech APIs. We should discuss adding them to the design decisions.
recognizer.confidencethreshold(float) between 0 and 1.0
recognizer.speedvsaccuracy
recognizer.profile, recognizer.gender, recognizer.age (for recognition tuned to a particular speaker or type of speaker). Values may need to be implementation-specific, or we could just use the general method for specifying service-specific parameters, e.g.
recognizer.setparameter(age, adult)
recognizer.sensitivity(float) between 0 and 1.0
recognizer.completetimeout(integer) msec
recognizer.incompletetimeout(integer) msec
recognizer.maxspeechtimeout(integer) sec
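
To show how confidencethreshold might interact with setmaxresults, here is a small TypeScript sketch that filters an nbest list. The Hypothesis shape, the selectResults function, and the order of operations (drop hypotheses below the threshold, then keep the top results) are assumptions; the proposal does not define how the two settings combine.

```typescript
// Hypothetical sketch: the Hypothesis shape and the filtering order are
// assumptions, not part of the proposal.
interface Hypothesis {
  text: string;
  confidence: number; // between 0 and 1.0
}

function selectResults(
  nbest: Hypothesis[],
  confidencethreshold: number,
  maxresults: number = 1 // the proposal's default nbest size
): Hypothesis[] {
  if (confidencethreshold < 0 || confidencethreshold > 1) {
    throw new RangeError("confidencethreshold must be between 0 and 1.0");
  }
  if (maxresults < 1 || Math.floor(maxresults) !== maxresults) {
    throw new RangeError("maxresults must be a positive integer");
  }
  return nbest
    .filter((h) => h.confidence >= confidencethreshold) // new array, input untouched
    .sort((a, b) => b.confidence - a.confidence)        // best hypothesis first
    .slice(0, maxresults);
}

const selected = selectResults(
  [
    { text: "austin", confidence: 0.4 },
    { text: "boston", confidence: 0.9 },
    { text: "houston", confidence: 0.7 },
  ],
  0.5,
  2
);
```

With a threshold of 0.5 and a maximum of two results, this keeps "boston" and "houston" in that order and drops "austin".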

Issues

Not addressed

DD29 Needs clarification

DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.

(see final report draft: http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)

Is the situation something like this: we have some captured audio in hand and we want to tell the recognizer to start recognizing at two seconds into the audio and continue through five seconds into the audio? Or are we interested in triggering the start and end of recognition by some event (like a mouse click) that occurs while audio is being captured? Or something else?
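
If the first reading is the intended one (recognize from two seconds through five seconds of audio that has already been captured), the selection itself is straightforward, as the TypeScript sketch below suggests. The selectAudio function, the Float32Array sample representation, and the clamping of out-of-range times are assumptions for illustration only; DD29 does not say how audio is represented.

```typescript
// Hypothetical sketch: selectAudio and the sample representation are
// assumptions, not part of the proposal or of DD29.
function selectAudio(
  samples: Float32Array,
  sampleRateHz: number,
  startSec: number,
  endSec: number
): Float32Array {
  if (sampleRateHz <= 0) {
    throw new RangeError("sampleRateHz must be positive");
  }
  if (startSec < 0 || endSec < startSec) {
    throw new RangeError("expected 0 <= startSec <= endSec");
  }
  // Convert times to sample indices, clamped to the captured audio.
  const start = Math.min(samples.length, Math.round(startSec * sampleRateHz));
  const end = Math.min(samples.length, Math.round(endSec * sampleRateHz));
  // subarray returns a view on the same buffer, so nothing is copied.
  return samples.subarray(start, end);
}

// Eight seconds of placeholder audio at a toy sample rate of 10 Hz.
const captured = new Float32Array(80);
for (let i = 0; i < captured.length; i++) {
  captured[i] = i;
}
const portion = selectAudio(captured, 10, 2, 5); // samples 20 through 49
```

The event-driven reading would need a different mechanism, since the start and end points are not known until the events fire.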

Setting parameters during recognition.

What happens when parameters are set while a recognition is in progress? Should there be an "updateParameters" method that is invoked after the parameter setting function is called to actually cause the parameters to take effect on the recognition object? Another option is to distinguish parameters that take effect immediately, like changing the grammar, from parameters that take effect only when the next recognition occurs (like maxnbest).

We also discussed whether there should be a way to set several parameters in one call, as in: setParameters({ param1: value1, param2: value2 }).
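
The "updateParameters" option and the multi-parameter call can be sketched together in TypeScript. In this sketch every setting is held as pending until updateParameters is invoked; the ParameterStore class, the get method, and that pending/active split are assumptions chosen to illustrate one of the options discussed, not a recommendation.

```typescript
// Hypothetical sketch: ParameterStore and its pending/active split are
// assumptions illustrating the "updateParameters" option.
type ParamValue = string | number | boolean;
type Params = { [name: string]: ParamValue };

class ParameterStore {
  private active: Params = {};
  private pending: Params = {};

  // Sets one parameter; it does not take effect yet.
  setParameter(name: string, value: ParamValue): void {
    this.pending[name] = value;
  }

  // Sets several parameters in one call; none takes effect yet.
  setParameters(params: Params): void {
    Object.keys(params).forEach((name) => {
      const value = params[name];
      if (value !== undefined) {
        this.pending[name] = value;
      }
    });
  }

  // Causes all pending settings to take effect on the recognizer.
  updateParameters(): void {
    Object.keys(this.pending).forEach((name) => {
      const value = this.pending[name];
      if (value !== undefined) {
        this.active[name] = value;
      }
    });
    this.pending = {};
  }

  // Returns the value currently in effect, if any.
  get(name: string): ParamValue | undefined {
    return this.active[name];
  }
}

const store = new ParameterStore();
store.setParameters({ maxresults: 5, language: "fr" });
const beforeUpdate = store.get("maxresults"); // still undefined: not applied yet
store.updateParameters();
const afterUpdate = store.get("maxresults");  // now 5
```

The alternative of distinguishing immediate parameters from deferred ones would need a per-parameter flag instead of a single pending set.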

Interfaces

Speech Recognition Interface

Constants

Attributes

Methods:

Example:

WebIDL