- From: Baggia Paolo <Paolo.Baggia@LOQUENDO.COM>
- Date: Tue, 05 Oct 2004 15:53:54 +0200
- To: www-voice@w3.org, james.hammerton@gtnet.com
- Cc: Baggia Paolo <Paolo.Baggia@LOQUENDO.COM>, dsr@w3.org
Dear James Hammerton,

You are completely right: if you want to train a language model, you need to take care of the start- and end-of-sentence symbols. I'll check whether the draft covers this somewhere; if not, something important is missing.

I would also like to point out that the current specification is still a Working Draft, so it has to be considered "work in progress", a contribution of the W3C Voice Browser Working Group to the community, and it is far from being a W3C Recommendation. Unfortunately, the Working Group is not currently working on this topic, as it is a "low priority" activity and currently inactive. See the Voice Browser page http://www.w3.org/Voice/ for further details.

Best regards,
Paolo Baggia, Loquendo

> ---------- Forwarded message ----------
> Date: Thu, 30 Sep 2004 14:33:40 +0100
> From: James Hammerton <james.hammerton@gtnet.com>
> To: dsr@w3.org
> Subject: Query about draft Stochastic Language Models (N-Grams) Specification
>
> Dave,
>
> I have a query about the draft Stochastic Language Models (N-Gram)
> Specification at http://www.w3.org/TR/2001/WD-ngram-spec-20010103/, of
> which you're listed as an author.
>
> When extracting N-grams from a corpus of sentences to develop a language
> model, standard practice (*) is to add (N - 1) dummy symbols to the start
> of each sentence. For example, given the sentence "john loves mary", before
> extracting bi-grams you would add a dummy symbol '<s>', giving
> "<s> john loves mary", so that the bi-grams extracted are "<s> john",
> "john loves" and "loves mary".
>
> (*) E.g. Chapter 6 of "Speech and Language Processing", Jurafsky and
> Martin, Prentice-Hall, 2000, and Chapter 6 of "Foundations of Statistical
> Natural Language Processing", Manning and Schutze, MIT Press, 1999, both
> suggest this.
>
> However, the draft specification makes no mention of this. It gives an
> example of a corpus containing the single sentence "A B A B C", with the
> corresponding N-Gram tree holding the following tri-grams (as well as the
> uni-gram "B" and the bi-gram "B C"):
>
> "A B A"
> "B A B"
> "A B C"
>
> This seems wrong to me. Using the approach described in the textbooks
> above, I would convert the sentence to "<s> <s> A B A B C" and extract the
> following tri-grams:
>
> "<s> <s> A"
> "<s> A B"
> "A B A"
> "B A B"
> "A B C"
>
> You can then correctly compute the probability of a sentence starting with
> "A", or with "A B". This information is lost in the tree as described in
> the draft specification. Using the uni-gram probability for "A" and the
> bi-gram probability for "A B" to get the ball rolling would be to use the
> wrong probabilities, since those are the probabilities of "A" or "A B"
> occurring anywhere in a sentence, not just at the start.
>
> Also, explicitly including the N-grams with a dummy token in the tree
> would involve specifying a symbol that does not, and should not, occur in
> the recognition result.
>
> How is a compliant speech recogniser/voice browser meant to deal with this
> issue when interpreting files written according to this specification?
>
> Regards,
>
> James
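A minimal Python sketch of the padding James describes: prepend N - 1 '<s>' symbols to each sentence before extracting N-grams. The helper name extract_ngrams is illustrative only, not anything defined in the draft specification.

    # Pad each sentence with N-1 start-of-sentence symbols, then slide a
    # window of size N over the padded token list.
    def extract_ngrams(sentence, n):
        tokens = ["<s>"] * (n - 1) + sentence.split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # For the draft's corpus sentence "A B A B C" with n=3 this prints the
    # five tri-grams James lists: ('<s>', '<s>', 'A'), ('<s>', 'A', 'B'),
    # ('A', 'B', 'A'), ('B', 'A', 'B'), ('A', 'B', 'C').
    print(extract_ngrams("A B A B C", 3))

With counts collected over such padded N-grams, the probability of a sentence beginning with "A" can be estimated from P(A | <s> <s>) rather than from the uni-gram probability of "A" occurring anywhere in a sentence.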
Received on Tuesday, 5 October 2004 13:55:15 UTC