- From: James Salsman <j.salsman@bovik.org>
- Date: Tue, 13 Feb 2001 22:16:54 -0800 (PST)
- To: paulvm@ne.mediaone.net
- Cc: www-voice@w3.org
> After reading this specification (http://www.w3.org/TR/ngram-spec) I was
> left somewhat confused about its purpose. Could one of the authors perhaps
> explain exactly what problem this spec is trying to solve?

I am not an author of the spec, but I might be able to help explain the
value of language models. In general, they are useful for resolving
ambiguities in large-vocabulary utterances, unconstrained by a fixed
grammar. For example, the words "boy" and "poi" can sound very similar,
but rarely if ever occur in the same context, so even the neighboring
word(s) can help a great deal. That is, if you can't tell which was said,
but you know the next word is "tasted", then you can be reasonably
confident the word was "poi"; if the next word was "played", then it was
probably "boy" instead. So for automatic dictation, language models are
more important than grammars (even if grammars could be compiled
completely, they would still lack the probability information that
N-gram language models carry).

Also, please consider beginning readers as they try to correct their
mistakes while reading out loud. In this case, you may know exactly what
words they are supposed to say, and in what order. However, they will
often go back and repeat words, sometimes multiple times. A system which
tries to determine how well they pronounced the sentence could be led
astray by even very sophisticated grammars if, for instance, the reader
mispronounces something partially and then begins repeating without
finishing the word in question. If, however, you have an N-gram model of
the neighboring phonetic units (such as phonemes, syllables, or -- best
of all -- diphones), then it becomes much easier to find the correct
attribution for each portion of the utterance. This is an interesting
task that you can try for yourself at:

    http://www.bovik.org/reps-char.cgi

> How is it envisioned that data in this format would be generated and
> then used?

You can derive statistical language models with tools such as these, the
first of which works entirely on the web:

    http://www.speech.cs.cmu.edu/tools/lmtool.html
    http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
    http://www.speech.sri.com/projects/srilm

You can then use the language model with recognizers such as:

    http://htk.eng.cam.ac.uk
    http://sourceforge.net/projects/cmusphinx

Cheers,
James
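
P.S. To make the "boy"/"poi" bigram idea concrete, here is a toy Python
sketch. The probabilities are invented for illustration only; a real
system would take them from a trained N-gram model:

    # Hypothetical bigram probabilities P(next_word | word).
    # All numbers are made up for this example.
    bigram = {
        ("poi", "tasted"): 0.08,
        ("boy", "tasted"): 0.0001,
        ("boy", "played"): 0.05,
        ("poi", "played"): 0.0001,
    }

    def disambiguate(candidates, next_word):
        # Pick the acoustically ambiguous candidate that best
        # predicts the observed next word.
        return max(candidates,
                   key=lambda w: bigram.get((w, next_word), 1e-9))

    print(disambiguate(["boy", "poi"], "tasted"))  # -> poi
    print(disambiguate(["boy", "poi"], "played"))  # -> boy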
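
P.P.S. The toolkits listed above do far more (smoothing, backoff, and
ARPA-format output, for instance), but the core of deriving a language
model is just counting. A minimal sketch of the maximum-likelihood
bigram estimate, with sentence-boundary markers:

    from collections import Counter

    def train_bigrams(sentences):
        # Count bigram occurrences and history-word occurrences.
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            words = ["<s>"] + s.split() + ["</s>"]
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        # Maximum-likelihood estimate: P(w2 | w1) = c(w1 w2) / c(w1)
        return {(w1, w2): c / unigrams[w1]
                for (w1, w2), c in bigrams.items()}

    probs = train_bigrams(["the boy played", "the poi tasted good"])
    print(probs[("boy", "played")])  # 1.0 in this tiny corpus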
Received on Wednesday, 14 February 2001 01:17:29 UTC