FW: Query about draft Stochastic Language Models (N-Grams) Specification

Dear James Hammerton,

You are completely right. When training a language
model, you need to account for the start-of-sentence
and end-of-sentence symbols.

I'll check whether the draft covers this anywhere;
if it does not, something important is missing.

Please note that the current specification is still a
Working Draft, so it should be considered a "work in
progress": a contribution from the W3C Voice Browser
Working Group to the community, but far from a W3C
Recommendation. Unfortunately, the Voice Browser
Working Group is not currently working on this topic,
because it is a "low priority" activity and is inactive
at present. See the Voice Browser page
http://www.w3.org/Voice/ for further details.

Best regards,
Paolo Baggia, Loquendo.


> ---------- Forwarded message ----------
> Date: Thu, 30 Sep 2004 14:33:40 +0100
> From: James Hammerton <james.hammerton@gtnet.com>
> To: dsr@w3.org
> Subject: Query about draft Stochastic Language Models (N-Grams)
>     Specification
> 
> Dave,
> 
> I have a query about the draft Stochastic Language Models (N-Gram)
> Specification at http://www.w3.org/TR/2001/WD-ngram-spec-20010103/ of which
> you're listed as an author.
> 
> When extracting N-Grams from a corpus of sentences to develop a language
> model, standard practice(*) is to add (N - 1) dummy symbols to the start of
> each sentence. For example, given the sentence "john loves mary", before
> extracting bi-grams you'd add a dummy symbol '<s>', e.g. "<s> john loves
> mary", thus the bi-grams extracted would be "<s> john", "john loves" and
> "loves mary".
> 
> * E.g. Chapter 6 of "Speech and Language Processing", Jurafsky and Martin,
> Prentice-Hall, 2000, and Chapter 6 of "Foundations of Statistical Natural
> Language Processing", Manning and Schütze, MIT Press, 1999 suggest this.
> 
> However, in the draft specification no mention is made of this. It provides
> an example of a corpus containing one sentence "A B A B C" with the
> corresponding N-Gram tree holding the following tri-grams (as well as the
> uni-gram "B" and bi-gram "B C"):
> 
> "A B A",
> "B A B",
> "A B C"
> 
> This seems wrong to me. Using the approach described in the above textbooks,
> I'd convert the sentence to "<s> <s> A B A B C" and extract the following
> N-grams:
> 
> "<s> <s> A"
> "<s> A   B"
> "A   B   A"
> "B   A   B"
> "A   B   C"
> 
> You can then correctly compute the probability of a sentence starting with
> A, or with A B. This information is lost in the tree as described in the
> draft specification. Using the uni-gram probability for "A" and the bi-gram
> probability for "A B" to get the ball rolling would be to use the wrong
> probabilities since those are probabilities for A or A B occurring anywhere
> in a sentence, not just at the start.
> 
> Also explicitly including the N-Grams with a dummy token in the tree would
> involve specifying a symbol that does not and should not occur in the
> recognition result.
> 
> How is a compliant speech recogniser/voice browser meant to deal with this
> issue when interpreting files written according to this specification?
> 
> Regards,
> 
> James
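
As a rough illustration of the padding described in the quoted message,
the Python sketch below prepends (N - 1) start symbols before extracting
N-grams, then compares the sentence-initial probability of "A" with its
plain uni-gram probability. The function names padded_ngrams and
mle_probability are hypothetical and are not taken from the draft
specification; the estimate used is a simple unsmoothed relative
frequency, which is an assumption made only for this example.

```python
from collections import Counter

START = "<s>"  # dummy start-of-sentence symbol, as in the quoted message


def padded_ngrams(tokens, n, start=START):
    """Return the n-grams of one sentence after prepending n-1 start symbols."""
    padded = [start] * (n - 1) + list(tokens)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def mle_probability(ngrams, ngram):
    """Unsmoothed relative frequency: count(ngram) / count(its context)."""
    counts = Counter(ngrams)
    context_counts = Counter(g[:-1] for g in ngrams)
    return counts[ngram] / context_counts[ngram[:-1]]


sentence = "A B A B C".split()

# Tri-grams with two start symbols prepended, matching the five listed above.
trigrams = padded_ngrams(sentence, 3)

# Probability that a sentence starts with "A", given the padded tri-grams.
p_start_a = mle_probability(trigrams, (START, START, "A"))

# Uni-gram probability of "A" anywhere in the sentence, for comparison.
p_unigram_a = Counter(sentence)["A"] / len(sentence)
```

On this one-sentence corpus the sentence-initial estimate for "A" is 1.0,
while the uni-gram estimate is 0.4, which is the gap the quoted message
points out.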




Received on Tuesday, 5 October 2004 13:55:15 UTC