- From: Paul van Mulbregt <paulvm@ne.mediaone.net>
- Date: Fri, 16 Feb 2001 00:16:51 -0500
- To: "Michael K. Brown" <mkb@avaya.com>
- Cc: www-voice@w3.org
At 03:41 PM 15-02-01 +0000, Michael K. Brown wrote:

Michael, I think you have rather misinterpreted my questions. But in doing
so you have actually answered a couple of them.

>Comments inserted:
>
>Paul van Mulbregt wrote:
> >
> > The proposal seems to be somewhat general in its attempt to model the
> > types of N-Gram Language Models (LMs) in use, even though it does use
> > the term "N-Gram Language Model" as though it has a unique well-defined
> > meaning. It also seems to have made some attempts at efficiency, and
> > these two goals combine to give the document a schizophrenic feel at
> > times.
>
>N-gram refers to a class of stochastic language models.

That was actually my point. There isn't one N-Gram Language Model; it is a
collection of models. Whether it be the interpolation of tri-, bi- and
uni-grams popularized by IBM, the Katz backoff approach, or a more general
backoff scheme, "N-Gram LM" does not have a unique meaning.

But as your comments also made clear, this spec isn't really specifying
N-Gram Language Models; it's more about specifying N-Gram Corpus Counts,
and the platform can do whatever it wants with the data, even potentially
ignore it. So the fact that N-Gram LM doesn't have a unique meaning doesn't
actually matter. In fact, there is no requirement that, if the platform
does build an LM from the data, it be an N-Gram LM.

...

> > It is also the case that an N-Gram model may be built by combining
> > several models (e.g. by linear or log-linear interpolation), in which
> > case counts are not an appropriate way to represent that data.
>
>I disagree. Combining models does not allow you to recover cutoff
>statistics. When you have several sources of data the accumulated
>counts can rise above the cutoff threshold and bring in additional
>statistics you could not obtain from model interpolation. Counts are the
>prime data from which all this modeling information can be derived.

Here's an example of an N-Gram model where counts are suboptimal. If I wish
to precompute a log-linear combination of several N-Gram LMs, and
approximate it by one N-Gram LM (presumably heavily pruned), then counts
are not an appropriate way of representing that LM. All discounting has
already been done; I have the probabilities, I just want to store the LM.
Sure, I could pick some total count, scale by it, and store the resulting
counts, but that just introduces error where none need exist (there is a
small sketch of this below). And it could encourage the platform to perform
further discounting on the counts.

> > Or are these counts really meant to be thought of as probabilities
> > scaled up by some factor and rounded to the nearest integer?? The
> > examples presented seem to suggest that it really is counts, not
> > probabilities, that are being stored, in which case these aren't LMs.
>
>These are counts, not probabilities. At this point (and looking
>forward) I think you may want to read some background references to get
>more familiar with general n-gram techniques. For your convenience I
>have attached one paper. You will find many others in the literature.

If you look at the paper you attached describing the CMU-CU Toolkit, and
look at section 5, Future Development, you will see that items 3 and 4
actually correspond pretty closely to the comments and pictorial
representations at the end of my previous post. But here I'm really more
interested in the LM work done since 1995.
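To make that scaling point concrete, here is a rough sketch of the round
trip I have in mind (plain Python, made-up probabilities, nothing taken
from the spec): take the probabilities of an already-smoothed LM, encode
them as integer "counts" scaled by some arbitrary total, then recover
probabilities from those counts.

# Hypothetical probabilities P(w|h) from an already-smoothed/interpolated LM.
probs = {"the": 0.4137, "a": 0.2971, "an": 0.1818, "this": 0.1074}

def round_trip(total):
    # Encode as integer "counts" scaled by an arbitrary total, then decode.
    counts = {w: round(p * total) for w, p in probs.items()}
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

for total in (100, 1000, 100000):
    recovered = round_trip(total)
    worst = max(abs(probs[w] - recovered[w]) for w in probs)
    print(total, worst)   # the coarser the scaling, the larger the drift

A large enough total makes the error negligible, of course, but then the
"counts" are really just fixed-point probabilities, and a platform that
treats them as raw counts and discounts them again is doing something I
never intended.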
This spec refers to linear and log-linear interpolated models, and distance
n-gram models, and I thought it would also be able to accommodate the
recent work on pruning that I referred to in the earlier post. But it looks
to me as though pruning here means pruning a branch of the depth-first
tree, rather than just pruning individual grams. Am I mistaken on that?
(BTW, pruning is not explicitly defined in this spec - is it defined
elsewhere? Same question for backoff weights.)

> > ************************************************
> > Q2.
> >
> > Section 2. N-Gram Grammar Representation
> > "Counts allows for a data reduction"
> >
> > Where is the reduction? In the file, or some implementation?
> > If the former, how much reduction is it?
>
>Variable. This refers to cutoffs and consequences. Please read the
>paper and prior minutes of meetings. Many of your questions were
>discussed in earlier meetings so this is not an efficient way to address
>these questions.

The minutes you refer to, are they available publicly? I didn't see them
obviously listed on <http://www.w3.org/Voice/> or
<http://www.w3.org/TR/ngram-spec/>. Can you post links to them?

> > ************************************************
> > Q3.
> >
> > Section 2. N-Gram Grammar Representation
> > "Counts are more robust in terms of manipulations of the tree"
> >
> > Could you be explicit about these manipulations?
>
>Discussed in meetings. Counts are less sensitive to data errors.

Of course, you may have already introduced large data errors just getting
the probabilities into count form.

>Discussed. Implicit. Please read the paper. I'm sorry, but I'm going
>to give up at this point and ask you to read the paper and minutes
>before we go further. If you can then come back with the probably
>smaller set of questions, I would be happy to address them.

I believe that a lot of my questions result from thinking that this spec
actually concerned N-Gram LMs, when in fact it is tackling a much simpler
problem, namely providing count data. Perhaps the title should be adjusted
to reflect what it is actually doing. There are references to LMs in the
spec, but only in a general sense (linear and log-linear interpolation),
nothing really specific to N-Gram LMs.

[The one place that does refer to something specific to N-Gram LMs is the
backoff weights, but this use of backoff weights is somewhat baffling to
me. To compute the backoff weights, one has to know the successor
probabilities both for the context in question and its backoff context.
Since only counts are provided, the probabilities have to be computed --
but the tool producing the count and backoff data can smooth/discount in
an arbitrary manner. So the tool discounts one way and produces a file
with a set of backoff weights, the platform discounts a different way, and
the backoff weights no longer match the probabilities (see the sketch at
the end of this message). Unless the data-producing tool and the platform
using the data are in cahoots - but in that case there wouldn't be the
need for a public transfer format.]

The large number of questions may be because the spec is rather vague or
non-existent on many items. I'm having to guess what you mean by certain
terms, and whether it's the same thing that I mean (e.g. even N-Gram
Models!). I'm trying to phrase questions with a partial understanding of
what the spec is trying to accomplish, and you're trying to answer with a
full knowledge of what you intended it to mean. If the minutes can help
clear up issues, that would be useful.
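For concreteness, here is the kind of computation I mean by a backoff
weight - a generic Katz-style normalization written out in Python with
made-up numbers, not something taken from the spec. The weight alpha(h)
depends directly on the discounted probabilities P*(w|h), so a platform
that discounts differently from the producing tool will arrive at a
different alpha than the one stored in the file.

def backoff_weight(p_star, p_backoff, seen):
    # alpha(h) = (1 - sum over seen w of P*(w|h))
    #          / (1 - sum over seen w of P(w|h')), h' the backoff context
    left_over = 1.0 - sum(p_star[w] for w in seen)
    denom = 1.0 - sum(p_backoff[w] for w in seen)
    return left_over / denom

# Made-up numbers for a single bigram context h:
seen = {"a", "b"}                              # words observed after h
p_star = {"a": 0.55, "b": 0.25}                # discounted P*(w|h), scheme-dependent
p_backoff = {"a": 0.30, "b": 0.20, "c": 0.50}  # backoff distribution P(w|h')

alpha = backoff_weight(p_star, p_backoff, seen)
print(alpha)                   # ~0.4 with these numbers
print(alpha * p_backoff["c"])  # P(c|h) = alpha * P(c|h') ~ 0.2; total mass ~ 1.0

Change the discounting scheme and p_star changes, and with it alpha, which
is exactly why the backoff weights shipped in the file and the
probabilities the platform derives from the counts can end up inconsistent.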
Cheers,
-Paul

-------------------------------------------------------------
Paul van Mulbregt, paulvm@ne.mediaone.net