Re: Question on Stochastic Language Models (N-Gram) Specification WD 3 January 2001 from James Salsman on 2001-02-14 (www-voice@w3.org from January to March 2001)

From: James Salsman <j.salsman@bovik.org>
Date: Tue, 13 Feb 2001 22:16:54 -0800 (PST)
To: paulvm@ne.mediaone.net
Cc: www-voice@w3.org
Message-Id: <200102140616.WAA12386@shell9.ba.best.com>

> After reading this specification (http://www.w3.org/TR/ngram-spec) I was 
> left somewhat confused about its purpose.  Could one of the authors perhaps 
> explain exactly what problem this spec is trying to solve? 

I am not an author of the spec, but I might be able to help explain 
the value of language models.  In general, they are useful for resolving
ambiguities in large-vocabulary utterances, unconstrained by a fixed 
grammar.  For example, the words "boy" and "poi" can sound very similar, 
but they rarely if ever occur in the same context, so even the 
neighboring word(s) can help a great deal: if you can't tell which was 
said, but you know the next word is "tasted", then you can be confident 
to some degree that the word was "poi", whereas if the next word is 
"played" then it was probably "boy" instead.  So for automatic 
dictation, language models are more important than grammars.  (Even if 
they could be compiled completely, grammars would still not have the 
important probability information of N-gram language models.)
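
A minimal sketch of the "boy"/"poi" idea in Python, for illustration 
only.  The five-sentence corpus, the function names, and the add-one 
smoothing are all assumptions made up for this example, not anything 
taken from the spec, and the sketch assumes the acoustic evidence for 
the two candidates is a tie, so only the bigram probability 
P(next word | candidate) decides:

```python
from collections import Counter

# Toy corpus, invented purely for illustration (not real data).
CORPUS = [
    "the boy played outside",
    "the boy played ball",
    "a boy played here",
    "the poi tasted sweet",
    "the poi tasted sour",
]

def train_bigrams(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2, alpha=1.0):
    """Estimate P(w2 | w1) with add-alpha smoothing over the vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)

def pick_word(candidates, next_word, unigrams, bigrams):
    """Pick the candidate that makes the observed next word most likely."""
    return max(
        candidates,
        key=lambda c: bigram_prob(unigrams, bigrams, c, next_word),
    )

UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)

print(pick_word(["boy", "poi"], "tasted", UNIGRAMS, BIGRAMS))  # poi
print(pick_word(["boy", "poi"], "played", UNIGRAMS, BIGRAMS))  # boy
```

A real recognizer would combine this score with the acoustic score and 
would estimate the counts from a far larger corpus.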

Also, please consider beginning readers as they try to correct their 
mistakes when reading out loud.  In this instance, you may know exactly 
what words they are supposed to say, and in what order.  However, they 
will often go back and repeat words, sometimes multiple times.  A system 
which tries to determine how well they pronounced the sentence could be 
led astray by even very sophisticated grammars, if, for instance, the 
reader mispronounces something partially and then begins repeating 
without finishing the word in question.  If, however, you have an n-gram 
model of the neighboring phonetic units (such as phonemes, syllables, or 
-- best of all -- diphones) then it becomes much easier to find the 
correct attribution for each portion of the utterance.  This is an 
interesting task that you can try for yourself at:

  http://www.bovik.org/reps-char.cgi
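
To make the attribution problem concrete, here is a rough sketch in 
Python.  It is not how the page above works.  It works on whole words 
rather than phonemes or diphones to keep the example short, and it 
swaps in a bigram-style transition model over positions in the known 
sentence, decoded with Viterbi search.  The function name and every 
probability in it are hypothetical values chosen for illustration:

```python
import math

def align_reading(expected, heard, p_forward=0.7, p_repeat=0.1, p_back=0.2):
    """Assign each heard token to an index in `expected` (Viterbi search).

    The weights are made-up illustrative values: advancing one position is
    most likely; re-saying the same position or jumping back to an earlier
    one is allowed but penalized; skipping ahead is not modeled.
    """
    if not heard:
        return []
    n = len(expected)

    def log_trans(i, j):
        if j == i + 1:
            return math.log(p_forward)
        if j == i:
            return math.log(p_repeat)
        if j < i:
            return math.log(p_back / i)  # shared among the i earlier positions
        return -math.inf

    def log_emit(j, token):
        # Hypothetical emission scores: an exact match is likely, a prefix
        # match stands in for a cut-off word, anything else is unlikely.
        if token == expected[j]:
            return math.log(0.9)
        if expected[j].startswith(token):
            return math.log(0.5)
        return math.log(0.01)

    # Assume the reader starts at the first word of the sentence.
    score = [-math.inf] * n
    score[0] = log_emit(0, heard[0])
    backpointers = []
    for token in heard[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(
                score[best_i] + log_trans(best_i, j) + log_emit(j, token)
            )
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the best final position.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    return path[::-1]

expected = "the cat sat on the mat".split()
heard = "the cat sa the cat sat on the mat".split()
print(align_reading(expected, heard))  # [0, 1, 2, 0, 1, 2, 3, 4, 5]
```

Here the cut-off "sa" is attributed to "sat", and the restart is 
recognized as a jump back to the first word rather than a match for the 
later "the".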

> How is it envisioned that data in this format would be generated and 
> then used?

You can derive statistical language models with tools such as these, the 
first of which works completely on the web:

  http://www.speech.cs.cmu.edu/tools/lmtool.html

  http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

  http://www.speech.sri.com/projects/srilm

You can then use the language model with recognizers such as:

  http://htk.eng.cam.ac.uk

  http://sourceforge.net/projects/cmusphinx

Cheers,
James

Received on Wednesday, 14 February 2001 01:17:29 UTC