This proposal covers the API hooks for specifying grammars and other recognition properties: both what these properties are, and how to specify them. It is an update of my earlier proposal.
All parameters are set on a "recognizer" object in this proposal. Another option would be to have a "properties" object on which properties are set, but that seems more complicated. However, it might address the issue of when the property settings actually take effect on the recognizer, because we could say that they take effect when the recognizer's properties object is set. Some speech APIs also have other objects, like "grammar" and "rule". I think these are primarily useful for dynamically manipulating grammars in code, and I'm not sure how much of that developers are going to do. I can think of use cases for this, for example, enabling users to add their own words to a grammar, but defining "grammar" and "rule" objects seems like a lower priority to me.
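To make the contrast concrete, here is a minimal sketch (all names hypothetical, not part of the proposal) of the "properties object" alternative, under the assumption that settings take effect atomically when the object is assigned:

```javascript
// Hypothetical sketch: a recognizer whose settings take effect only when
// the whole properties object is assigned. "makeRecognizer", "properties",
// and the property names are illustrative assumptions, not proposed API.
function makeRecognizer() {
  let active = {};  // the properties the recognizer is actually using
  return {
    // Assignment applies all settings at once, answering "when do they
    // take effect?" by construction.
    set properties(obj) { active = { ...obj }; },
    get properties() { return active; },
  };
}

const rec = makeRecognizer();
rec.properties = { language: "fr", maxnbest: 3 };
```

With the recognizer-only design, by contrast, each setter call would need its own rule about when it takes effect (a question returned to near the end of this proposal).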
recognizer.grammar() (DD11, FPR44)
This means "use the default language model" of the implementation. I'm not sure we need this, though; normally you would simply not specify a grammar at all.
recognizer.grammar(URI or String), where the string is the name of a builtin grammar
recognizer.grammar(URI or String, float), where the float represents the grammar's weight relative to other active grammars. (FPR34, FPR45, FPR48, DD9, DD21, DD55, DD72)
Multiple grammars are possible (DD55), so setting a grammar doesn't necessarily mean disabling a previous grammar. This means that there needs to be a way to explicitly disable a grammar. Either
recognizer.disablegrammar(URI) (FPR45)
to disable a specific grammar only, or perhaps
recognizer.grammar(URI, boolean),
where if the boolean is true, the grammar is modal and all other grammars are disabled. This is simpler but less flexible than "disablegrammar". We could also have both.
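A sketch of how the two disabling options could coexist, assuming (hypothetically) that the recognizer tracks active grammars as a URI-to-weight map:

```javascript
// Hypothetical sketch, none of it normative: active grammars as a map from
// URI to relative weight, with both disabling mechanisms from the proposal.
const recognizer = {
  grammars: new Map(),              // URI -> relative weight (assumed storage)
  grammar(uri, weightOrModal) {
    if (weightOrModal === true) {   // modal: this grammar disables all others
      this.grammars.clear();
      this.grammars.set(uri, 1.0);
    } else {                        // otherwise the second argument is a weight
      this.grammars.set(uri,
        typeof weightOrModal === "number" ? weightOrModal : 1.0);
    }
  },
  disablegrammar(uri) {             // FPR45: disable one specific grammar only
    this.grammars.delete(uri);
  },
};

recognizer.grammar("http://example.org/dates.grxml", 0.5);
recognizer.grammar("http://example.org/names.grxml", 2.0);
recognizer.disablegrammar("http://example.org/dates.grxml");
// Only the names grammar remains active at this point.
```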
recognizer.setmaxresults(integer) (DD36) the maximum size of the n-best list; the default is 1. The standard could set a maximum value for maxresults, leave the maximum up to the implementation, or set a "minimum maximum"; that is, every implementation has to be able to return at least N results, but implementations can return more than N if they like.
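The "minimum maximum" option can be sketched as a clamp; the two constants here are illustrative assumptions, not values from the requirements:

```javascript
// Hypothetical sketch of the "minimum maximum" option for setmaxresults.
const MIN_GUARANTEED = 5;   // assumed: the "at least N" the standard would set
const IMPL_MAX = 10;        // assumed: this implementation's own ceiling

function setmaxresults(requested) {
  // An implementation's ceiling may exceed MIN_GUARANTEED but never undercut it;
  // the requested value is clamped to [1, ceiling].
  const ceiling = Math.max(IMPL_MAX, MIN_GUARANTEED);
  return Math.max(1, Math.min(requested, ceiling));
}
```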
recognizer.setlanguage(String) (FPR38, DD10) the language of recognition, using standard ISO language codes. We could also have some convenience syntax so that developers can use more familiar words to refer to languages, like "French" vs. "fr". The standard needs to define what happens if no language is set, or if multiple languages are set.
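The convenience syntax could be a simple alias table in front of the ISO codes; the table below is illustrative and deliberately tiny:

```javascript
// Hypothetical sketch of convenience language names on top of ISO 639-1
// codes. The alias table is an assumption for illustration, not exhaustive.
const LANGUAGE_ALIASES = { french: "fr", english: "en", german: "de" };

function normalizeLanguage(lang) {
  const lower = lang.toLowerCase();
  // Friendly names map to codes; anything else is passed through unchanged,
  // so "fr" and "French" both end up as "fr".
  return LANGUAGE_ALIASES[lower] || lower;
}
```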
recognizer.recognitiontype(constant) (e.g. streaming, hotword) (NR1, NR2, DD33). The default is recognition after the user stops speaking. For hotword recognition we need a way to specify the hotwords, perhaps
recognizer.recognitiontype(hotword, array of hotwords)
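A sketch of that two-argument form, with assumed mode constants (the names and values are placeholders, not proposed):

```javascript
// Hypothetical sketch of recognitiontype with a hotword list. The constants
// and the validation rule are illustrative assumptions.
const RECOGNITION_TYPES = { AFTER_SPEECH: 0, STREAMING: 1, HOTWORD: 2 };

function recognitiontype(type, hotwords) {
  // Hotword mode is meaningless without hotwords, so reject that combination.
  if (type === RECOGNITION_TYPES.HOTWORD && (!hotwords || hotwords.length === 0)) {
    throw new Error("hotword mode requires at least one hotword");
  }
  return { type, hotwords: hotwords || [] };
}

const cfg = recognitiontype(RECOGNITION_TYPES.HOTWORD, ["computer", "wake up"]);
```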
recognizer.savewaveformURI(URI) (FPR57) the URI at which to save the captured waveform
recognizer.inputwaveformURI(URI) (FPR57) when recognition starts, recognize from this saved waveform
recognizer.savewaveform(boolean) specify whether the recognizer should save waveforms
recognizer.canrerecognize(boolean) (DD76)
recognizer.endpointdetection(boolean) (DD28)
recognizer.enablefinalizebeforeend(boolean) (DD34)
recognizer.sendinterimresults(boolean) (NR3) the recognizer sends interim results at some frequency determined by the recognizer
recognizer.sendinterimresults(integer) (NR3), where the integer indicates the requested frequency of results in msec
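The two overloads above can be folded into one setter by dispatching on the argument type; the return shape here is an assumption for illustration:

```javascript
// Hypothetical sketch of the two sendinterimresults overloads as one setter.
// A boolean leaves the frequency to the recognizer; an integer requests a
// specific interval in msec.
function sendinterimresults(value) {
  if (typeof value === "boolean") {
    return { enabled: value, intervalMs: null };  // recognizer-chosen frequency
  }
  return { enabled: value > 0, intervalMs: value };
}
```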
Generally, for service-specific parameters:
recognizer.setparameter(parameter name, parameter value) (DD73)
The following are not in the requirements or design decisions, but they are commonly used in speech APIs. We should discuss adding them to the design decisions.
recognizer.confidencethreshold(float) between 0 and 1.0
recognizer.speedvsaccuracy
recognizer.profile, recognizer.gender, recognizer.age (for recognition tuned to a particular speaker or type of speaker); the values may need to be implementation-specific, or we could just use the general method for specifying service-specific parameters, e.g.
recognizer.setparameter(age, adult)
recognizer.sensitivity(float) between 0 and 1.0
recognizer.completetimeout(integer) msec
recognizer.incompletetimeout(integer) msec
recognizer.maxspeechtimeout(integer) sec
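One plausible reading of how the three timeouts interact during endpointing (the interaction rule and defaults here are assumptions, not from the requirements):

```javascript
// Hypothetical sketch of the three timeouts above. Units follow the listing:
// msec for completetimeout/incompletetimeout, seconds for maxspeechtimeout.
function makeTimeouts({ completeMs = 1000, incompleteMs = 2000, maxSpeechSec = 60 } = {}) {
  return {
    // Assumed rule: silence ends recognition sooner when the utterance so far
    // already matches a complete grammar rule, later when it is incomplete.
    silenceLimitMs(utteranceMatchesGrammar) {
      return utteranceMatchesGrammar ? completeMs : incompleteMs;
    },
    // Speech longer than maxSpeechSec is cut off regardless (converted to msec).
    maxSpeechMs: maxSpeechSec * 1000,
  };
}

const t = makeTimeouts({ completeMs: 1000, incompleteMs: 2000, maxSpeechSec: 60 });
```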
DD29 needs clarification.
DD29. The API will provide control over which portions of the captured audio are sent to the recognizer.
(see the final report draft:
http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)
Is the situation something like this: we have some captured audio in hand and we want to tell the recognizer to recognize from two seconds into the audio through five seconds into the audio? Or are we interested in triggering the start and end of recognition by some event (like a mouse click) that occurs while audio is being captured? Or something else?
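The first reading can be sketched as slicing a sub-range out of captured audio before handing it to the recognizer; everything here (function name, sample representation) is an illustrative assumption:

```javascript
// Hypothetical sketch of the first reading of DD29: select the portion of
// captured audio between startSec and endSec to send to the recognizer.
// Audio is represented as a flat array of samples for illustration.
function selectAudioRange(samples, sampleRate, startSec, endSec) {
  const start = Math.floor(startSec * sampleRate);
  const end = Math.min(samples.length, Math.floor(endSec * sampleRate));
  return samples.slice(start, end);  // only this portion reaches the recognizer
}

// E.g. at a (toy) 10 Hz sample rate, seconds 2 through 5 of a 10-second clip:
const clip = new Array(100).fill(0);
const portion = selectAudioRange(clip, 10, 2, 5);
```

The second reading (event-triggered start/stop during live capture) would instead be a pair of method calls on the recognizer and involves no stored audio at all, which is why the two interpretations lead to quite different API shapes.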
What happens when parameters are set while a recognition is in progress? Should there be an "updateParameters" method, invoked after the parameter-setting functions are called, that actually causes the parameters to take effect on the recognition object? Another option is to distinguish parameters that take effect immediately (like changing the grammar) from parameters that take effect only when the next recognition occurs (like maxnbest).
We also discussed setting multiple parameters and whether there should be a way to set several parameters in one call, as in: setParameters({ param1: value1, param2: value2 }).
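One way to keep the batch form and the single-parameter form consistent is to route the batch call through the single setter; a sketch, with all names hypothetical:

```javascript
// Hypothetical sketch: setParameters applies each entry through the
// single-parameter path (setparameter, DD73), so both styles behave alike.
function makeRecognizer() {
  const params = {};
  return {
    setparameter(name, value) { params[name] = value; },
    setParameters(obj) {
      for (const [name, value] of Object.entries(obj)) {
        this.setparameter(name, value);
      }
    },
    getparameter(name) { return params[name]; },  // illustrative accessor
  };
}

const r = makeRecognizer();
r.setParameters({ maxnbest: 3, language: "fr" });
```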
Speech Recognition Interface
Constants
Attributes
Methods:
WebIDL