API Hooks for Setting Speech Recognition Parameters

August 4, 2011
Deborah Dahl

This document proposes the API hooks for specifying grammars and other recognition properties, covering both what these properties are and how to specify them.
This is an update of my earlier proposal.

Relevant requirements and design decisions

New

Design Decisions

DD9. It must be possible to reference ASR grammars by URI.
DD10. It must be possible to select the ASR language using language tags.
DD11. It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
DD21. A standard set of common-task grammars must be supported. The details of what those are remain TBD.
DD28. A low-latency endpoint detector must be available. It should be possible for a web app to enable and disable it, although the default setting (enabled/disabled) is TBD. The detector detects both start of speech and end of speech and fires an event in each case.
DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.
DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
DD34. It must be possible for the recognizer to return a final result before the user is done speaking.
DD36. Maxresults should be an ASR parameter representing the maximum number of results to return.
DD55. The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar.
DD72. In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD).
DD73. In Javascript, it will be possible to set parameters as dot properties and also via a getParameters method. The browser should also allow service-specific parameters to be set this way.
DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.

Settable Recognition Properties

All parameters are set on a "recognizer" object in this proposal. Another option would be to have a "properties" object where properties are set, but that seems more complicated. However, it might address the question of when the property settings actually take effect on the recognizer, because we could say that they take effect when the recognizer's properties object is set. Some speech APIs also have other objects, like "grammar" and "rule". I think these are primarily useful for dynamically manipulating grammars in code, and I'm not sure how much of that developers are going to do. I can think of use cases for this, for example, enabling users to add their own words to the grammar, but defining "grammar" and "rule" objects seems like a lower priority to me.

From requirements and design decisions

recognizer.grammar() (DD11, FPR44) This means "use the default language model" of the implementation. I'm not sure we need this, though; normally you would just not specify a grammar at all.
recognizer.grammar(URI or String) where the string is the name of a built-in grammar
recognizer.grammar(URI or String, float) where the float represents the grammar's weight relative to other active grammars. (FPR34, FPR45, FPR48, DD9, DD21, DD55, DD72)

Multiple grammars are possible (DD55), so setting a grammar doesn't necessarily mean disabling a previous grammar. This means that there needs to be a way to explicitly disable a grammar. Either
recognizer.disablegrammar(URI) (FPR45) to disable a specific grammar only
or perhaps
recognizer.grammar(URI,boolean), where if the boolean is true, the grammar is modal and all other grammars are disabled. This is simpler but less flexible than "disablegrammar". We could also have both.
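
To make the two disabling options concrete, here is a minimal Javascript sketch of a recognizer that keeps a weighted list of active grammars. It is only an illustration under stated assumptions: the GrammarSketch class, the default weight of 1.0, and the activeGrammars helper are invented for this example and are not part of the proposal, and the example.com URIs and the "date" built-in name are placeholders.

```javascript
// Hypothetical sketch of multiple weighted grammars (DD55); not a proposed implementation.
class GrammarSketch {
  constructor() {
    // Maps a grammar reference (URI or built-in name) to its weight.
    this.grammars = new Map();
  }

  // grammar(ref)        adds a grammar with an assumed default weight of 1.0
  // grammar(ref, 0.8)   adds a grammar with an explicit weight
  // grammar(ref, true)  makes the grammar modal: all other grammars are disabled
  grammar(ref, weightOrModal = 1.0) {
    if (weightOrModal === true) {
      this.grammars.clear();
    }
    const weight = typeof weightOrModal === "number" ? weightOrModal : 1.0;
    this.grammars.set(ref, weight);
    return this;
  }

  // Disables one specific grammar; returns false if it was not active.
  disablegrammar(ref) {
    return this.grammars.delete(ref);
  }

  // Lists the references of all currently active grammars.
  activeGrammars() {
    return [...this.grammars.keys()];
  }
}

const grammarDemo = new GrammarSketch();
grammarDemo
  .grammar("http://example.com/pizza.grxml", 0.8)
  .grammar("date", 0.2); // "date" stands in for a built-in grammar name
grammarDemo.disablegrammar("date"); // only the pizza grammar stays active
grammarDemo.grammar("http://example.com/yesno.grxml", true); // modal: disables the rest
console.log(grammarDemo.activeGrammars());
```

Supporting both forms costs little in this sketch, which is one argument for the "we could also have both" option.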

recognizer.setmaxresults(integer) (DD36) the maximum size of the n-best list; the default is 1. The standard could set an upper limit on maxresults, leave the limit up to the implementation, or set a "minimum maximum", that is, every implementation has to be able to return at least N results, but implementations can return more than N if they like.
recognizer.setlanguage(String) (FPR38, DD10) language of recognition, using standard ISO language codes. We could also have some convenience syntax so that developers can use more normal words to refer to languages, like "French" vs. "fr". The standard needs to define what happens if no language is set, or if multiple languages are set.
recognizer.recognitiontype(constant) (e.g. streaming, hotword) (NR1, NR2, DD33). The default is recognition after the user stops speaking. For hotword recognition we need a way to specify the hotwords, perhaps
recognizer.recognitiontype(hotword, array of hotwords)
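
The "minimum maximum" idea and the hotword form of recognitiontype can be sketched roughly as follows. This is an illustration only: the SettingsSketch class, the value 5 for the minimum maximum, the implementation limit of 10, the "default" type name, and the choice to clamp rather than raise an error are all assumptions made for this example.

```javascript
// Hypothetical "minimum maximum": every implementation must support at least this many results.
const MIN_MAX_RESULTS = 5;

class SettingsSketch {
  constructor(implementationMax = 10) {
    if (implementationMax < MIN_MAX_RESULTS) {
      throw new RangeError("implementation must support at least " + MIN_MAX_RESULTS + " results");
    }
    this.implementationMax = implementationMax;
    this.maxresults = 1;   // default from the proposal
    this.type = "default"; // recognition after the user stops speaking
    this.hotwords = [];
  }

  // Clamps the request to what this implementation can return.
  // Raising an error instead of clamping would be an equally plausible choice.
  setmaxresults(n) {
    if (!Number.isInteger(n) || n < 1) {
      throw new RangeError("maxresults must be a positive integer");
    }
    this.maxresults = Math.min(n, this.implementationMax);
    return this.maxresults;
  }

  // recognitiontype("streaming") or recognitiontype("hotword", ["computer"])
  recognitiontype(type, hotwords = []) {
    if (type === "hotword" && hotwords.length === 0) {
      throw new Error("hotword recognition needs at least one hotword");
    }
    this.type = type;
    this.hotwords = [...hotwords];
  }
}

const settingsDemo = new SettingsSketch();
settingsDemo.setmaxresults(50); // clamped to the implementation limit
settingsDemo.recognitiontype("hotword", ["computer"]);
console.log(settingsDemo.maxresults, settingsDemo.type, settingsDemo.hotwords);
```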

recognizer.savewaveformURI(URI) (FPR57) the location where the saved waveform is stored
recognizer.inputwaveformURI(URI) (FPR57) when recognition starts, recognize from this saved waveform
recognizer.savewaveform(boolean) specify that the recognizer should save waveforms

recognizer.canrerecognize(boolean) (DD76)
recognizer.endpointdetection(boolean) (DD28)
recognizer.enablefinalizebeforeend(boolean) (DD34)
recognizer.sendinterimresults(boolean) (NR3) the recognizer sends interim results at some frequency determined by the recognizer
recognizer.sendinterimresults(integer) (NR3) the integer indicates the requested interval between interim results in msec
More generally, for service-specific parameters:
recognizer.setparameter(parameter name, parameter value) (DD73)
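
As a rough illustration of DD73, the sketch below lets standard parameters be set either as dot properties or through the generic setter, and passes anything it does not recognize through as a service-specific parameter. The ParameterSketch class, the serviceParameters map, and the default values are assumptions made for this example; the proposal does not define them.

```javascript
// Hypothetical sketch of DD73; not a proposed implementation.
class ParameterSketch {
  constructor() {
    // A few standard parameters from this proposal, with assumed defaults.
    this.maxresults = 1;
    this.endpointdetection = true; // the default is TBD in DD28; true is an assumption
    this.canrerecognize = false;
    // Parameters the browser does not recognize are kept for the speech service.
    this.serviceParameters = new Map();
  }

  setparameter(name, value) {
    const isStandard =
      name !== "serviceParameters" &&
      Object.prototype.hasOwnProperty.call(this, name);
    if (isStandard) {
      this[name] = value; // same effect as setting the dot property directly
    } else {
      this.serviceParameters.set(name, value);
    }
  }
}

const parameterDemo = new ParameterSketch();
parameterDemo.maxresults = 3;                // dot property
parameterDemo.setparameter("maxresults", 5); // generic setter, same parameter
parameterDemo.setparameter("age", "adult");  // service-specific parameter
console.log(parameterDemo.maxresults, parameterDemo.serviceParameters.get("age"));
```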

Other

These are not in the requirements or design decisions, but they are commonly used in speech APIs. We should discuss adding these to the design decisions.
recognizer.confidencethreshold(float) between 0 and 1.0
recognizer.speedvsaccuracy
recognizer.profile, recognizer.gender, recognizer.age (for recognition tuned to a particular speaker or type of speaker) values may need to be implementation-specific, or we could just use the general method for specifying service-specific parameters, e.g.
recognizer.setparameter(age, adult)
recognizer.sensitivity(float) between 0 and 1.0
recognizer.completetimeout(integer) msec
recognizer.incompletetimeout(integer) msec
recognizer.maxspeechtimeout(integer) sec

Issues

Not addressed

DD29 Needs clarification

DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.

(see the final report draft: http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)

Is the situation something like this: we have some captured audio in hand and we want to tell the recognizer to start recognizing at two seconds into the audio and continue through five seconds into the audio? Or are we interested in triggering the start and end of recognition by some event (like a mouse click) that occurs while audio is being captured? Or something else?
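
The two readings of DD29 could look quite different in code, as the following sketch tries to show. Everything here is hypothetical: sliceByTime and AudioGate are names invented for this example, and the audio is assumed to be available as an array of samples with a known sample rate.

```javascript
// Reading 1: pick a time range out of audio that has already been captured.
function sliceByTime(samples, sampleRate, startSec, endSec) {
  const start = Math.max(0, Math.round(startSec * sampleRate));
  const end = Math.min(samples.length, Math.round(endSec * sampleRate));
  // subarray returns a view on the same samples, so nothing is copied.
  return samples.subarray(start, Math.max(start, end));
}

// Reading 2: events (such as mouse clicks) open and close a gate on live audio.
class AudioGate {
  constructor(sendToRecognizer) {
    this.sendToRecognizer = sendToRecognizer;
    this.open = false;
  }
  start() { this.open = true; }
  stop() { this.open = false; }
  // Called for every captured chunk; only forwarded while the gate is open.
  onChunk(chunk) {
    if (this.open) this.sendToRecognizer(chunk);
  }
}

// Ten seconds of silence at an assumed 16 kHz sample rate.
const captured = new Float32Array(16000 * 10);
const portion = sliceByTime(captured, 16000, 2, 5);
console.log(portion.length / 16000, "seconds selected");

const forwarded = [];
const gate = new AudioGate((chunk) => forwarded.push(chunk));
gate.onChunk("before click"); // dropped
gate.start();
gate.onChunk("during");       // forwarded
gate.stop();
console.log(forwarded);
```

The first reading needs start and end offsets as parameters, while the second needs methods or events rather than parameters, which is why the clarification matters for this proposal.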

Setting parameters during recognition.

What happens when parameters are set while a recognition is in progress? Should there be an "updateParameters" method that is invoked after the parameter setting function is called to actually cause the parameters to take effect on the recognition object? Another option is to distinguish parameters that take effect immediately, like changing the grammar, from parameters that take effect only when the next recognition occurs (like maxnbest).

We also discussed setting multiple parameters and whether there should be a way to set several parameters in one call, as in: setParameters({ param1: value, param2: value2}).
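
One way to picture the second option (immediate versus deferred parameters), combined with a batch setter, is sketched below. It is only a sketch under assumptions: the StagedParameters class, the choice of "grammar" as the only immediate parameter, and the start and end methods are invented for this example. Under the first option, the applyPending step would instead be exposed to the developer as something like updateParameters.

```javascript
// Assumption for this sketch: only these parameters take effect mid-recognition.
const IMMEDIATE_PARAMETERS = new Set(["grammar"]);

class StagedParameters {
  constructor() {
    this.active = {};   // values the recognizer is currently using
    this.pending = {};  // values waiting for the next recognition
    this.recognizing = false;
  }

  // Sets several parameters in one call, as in setParameters({ a: 1, b: 2 }).
  setParameters(values) {
    for (const [name, value] of Object.entries(values)) {
      if (!this.recognizing || IMMEDIATE_PARAMETERS.has(name)) {
        this.active[name] = value;
      } else {
        this.pending[name] = value;
      }
    }
  }

  // Moves all deferred values into effect.
  applyPending() {
    Object.assign(this.active, this.pending);
    this.pending = {};
  }

  start() {
    this.applyPending();
    this.recognizing = true;
  }

  end() {
    this.recognizing = false;
    this.applyPending();
  }
}

const stagedDemo = new StagedParameters();
stagedDemo.setParameters({ maxnbest: 1, grammar: "http://example.com/a.grxml" });
stagedDemo.start();
stagedDemo.setParameters({ maxnbest: 5, grammar: "http://example.com/b.grxml" });
console.log(stagedDemo.active);  // grammar already changed, maxnbest not yet
stagedDemo.end();
console.log(stagedDemo.active);  // maxnbest now in effect for the next recognition
```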

Interfaces

Speech Recognition Interface

Constants

Attributes

Methods:

Example:

WebIDL