Re: Comments on Stochastic Language Models (N-Gram) Specification WD 3 January 2001

At 03:41 PM 15-02-01 +0000, Michael K. Brown wrote:
Michael,
I think you have rather misinterpreted my questions.  But in doing so you 
have actually answered a couple of them.

>Comments inserted:
>
>Paul van Mulbregt wrote:
> >
> >
> > The proposal seems to be somewhat general in its attempt to model the
> > types of N-Gram Language Models (LMs) in use, even though it does use
> > the term "N-Gram Language Model" as though it has a unique well-defined
> > meaning.  It also seems to have made some attempts at efficiency, and
> > these two goals combine to give the document a schizophrenic feel at
> > times.
>
>N-gram refers to a class of stochastic language models.

That was actually my point. There isn't one N-Gram Language Model; it is a 
collection of models.
Whether it be the interpolation of tri-, bi-, and uni-grams popularized by 
IBM, the Katz backoff approach, or a more general backoff scheme, "N-Gram 
LM" does not have a unique meaning.
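
To be concrete, here are two of the things commonly called a trigram LM 
(my notation, not the spec's):

   P_interp(w3 | w1 w2) = \lambda_3 f(w3 | w1 w2) + \lambda_2 f(w3 | w2)
                          + \lambda_1 f(w3)

   P_Katz(w3 | w1 w2)   = d_c f(w3 | w1 w2)               if c(w1 w2 w3) > 0
                        = \alpha(w1 w2) P_Katz(w3 | w2)    otherwise

where the f's are relative frequencies from the counts, the \lambda's are 
interpolation weights, d_c is a discount, and \alpha is a backoff weight.  
Same counts, quite different models.
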
But as your comments also made clear, this spec isn't really specifying 
N-Gram Language Models; it's more about specifying N-Gram Corpus Counts, 
and the platform can do whatever it wants with the data, even potentially 
ignore it.  So the fact that N-Gram LM doesn't have a unique meaning 
doesn't actually matter.  In fact, there is no requirement that, if the 
platform does build an LM from the data, it be an N-Gram LM.

...
> > It is also the case that a N-Gram model may be built by combining
> > several models (e.g. by linear or log-linear interpolation) in which
> > case counts are not an appropriate way to represent that data.
>
>I disagree.  Combining models does not allow you to recover cutoff
>statistics.  When you have several sources of data the accumulated
>counts can rise above cutoff threshold and bring in additional
>statistics you could not obtain from model interpolation. Counts are the
>prime data from which all this modeling information can be derived.

Here's an example of an N-Gram model where counts are suboptimal.
If I wish to precompute a log-linear combination of several N-Gram LMs and 
approximate it by one N-Gram LM (presumably heavily pruned), then counts 
are not an appropriate way of representing that LM.  All discounting has 
already been done; I have the probabilities, and I just want to store the 
LM.  Sure, I could pick some total count, scale by it, and store the 
resulting counts, but that just introduces error where none need exist.  
And it could encourage the platform to perform further discounting on the 
counts.
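
A rough sketch (entirely my own toy numbers, nothing from the spec) of the 
precomputation I have in mind, for a single history h over a tiny vocabulary:

import math

vocab = ["the", "a", "of"]
lm1 = {"the": 0.5, "a": 0.3, "of": 0.2}   # P1(w | h), already discounted
lm2 = {"the": 0.2, "a": 0.4, "of": 0.4}   # P2(w | h), already discounted
lam1, lam2 = 0.7, 0.3                     # combination weights

# Log-linear combination, renormalized over the vocabulary for this history.
scores = {w: lam1 * math.log(lm1[w]) + lam2 * math.log(lm2[w]) for w in vocab}
norm = sum(math.exp(s) for s in scores.values())
combined = {w: math.exp(scores[w]) / norm for w in vocab}   # P(w | h)

The result is a set of probabilities; there are no counts left to report, 
and any "total count" attached to them afterwards is an invention.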


> > Or are these counts really meant to be thought of as probabilities
> > scaled up by some factor and rounded to the nearest integer??  The
> > examples presented seem to suggest that it really is counts, not
> > probabilities, that are being stored, in which case these aren't LMs.
>
>These are counts, not probabilities.  At this point (and looking
>forward) I think you may want to read some background references to get
>more familiar with general n-gram techniques.  For your convenience I
>have attached one paper.  You will find many others in the literature.

If you look at the paper you attached describing the CMU-CU Toolkit, and 
look at section 5, Future Development, you will see that items 3 and 4 
actually correspond pretty closely to the comments and pictorial 
representations at the end of my previous post.
But here I'm really more interested in the LM work done since 1995.  This 
spec refers to linear and log-linear interpolated models and distance 
n-gram models, and I thought it would also be able to accommodate the 
recent work on pruning that I referred to in the earlier post.  But it 
looks to me as though pruning here means pruning a branch of the 
depth-first tree, rather than just pruning individual grams.  Am I 
mistaken on that? (BTW, pruning is not explicitly defined in this spec - 
is it defined elsewhere? Same question for backoff weights.)
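
To make sure we're talking about the same thing, here is the distinction 
I'm drawing, as a toy sketch (my own terminology and made-up counts):

# Trigram counts stored as context -> successor counts.
trigrams = {
    ("in", "the"): {"house": 12, "car": 7, "end": 1},
    ("on", "the"): {"table": 9, "other": 3},
}

# Pruning a branch of the tree: an entire context, and everything under it,
# is dropped.
branch_pruned = {ctx: succ for ctx, succ in trigrams.items()
                 if ctx != ("in", "the")}

# Pruning individual grams: low-value entries are dropped one at a time
# (here by a count cutoff; an entropy-based criterion is another option),
# and the model backs off for the missing words while the context survives.
gram_pruned = {ctx: {w: c for w, c in succ.items() if c > 1}
               for ctx, succ in trigrams.items()}

Is the spec's notion of pruning the first of these, the second, or both?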

> > ************************************************
> > Q2.
> >
> > Section 2. N-Gram Grammar Representation
> >      "Counts allows for a data reduction"
> >
> > Where is the reduction?  In the file, or some implementation?
> > If the former, how much reduction is it?
>
>Variable.  This refers to cutoffs and consequences.  Please read the
>paper and prior minutes of meetings.  Many of your questions were
>discussed in earlier meetings so this is not an efficient way to address
>these questions.

The minutes you refer to, are they available publicly?  I didn't see them 
obviously listed on <http://www.w3.org/Voice/> or 
<http://www.w3.org/TR/ngram-spec/>. Can you post links to them?

> >
> > ************************************************
> > Q3.
> >
> > Section 2. N-Gram Grammar Representation
> >      "Counts are more robust in terms of manipulations of the tree"
> >
> > Could you be explicit about these manipulations?
>
>Discussed in meetings.  Counts are less sensitive to data errors.

Of course, you may have already introduced large data errors just getting 
the probabilities into count form.
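
For instance (toy numbers of my own), round-tripping probabilities through 
integer counts already perturbs them:

probs = {"the": 0.4273, "a": 0.3141, "of": 0.2586}   # already-discounted P(w | h)

total = 1000                                   # some arbitrary "total count"
counts = {w: round(p * total) for w, p in probs.items()}
recovered = {w: c / sum(counts.values()) for w, c in counts.items()}
# recovered != probs: "the" comes back as 427/1000 = 0.427, not 0.4273.

And that is before the platform applies its own discounting to those counts.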

>Discussed.  Implicit.  Please read the paper.  I'm sorry, but I'm going
>to give up at this point and ask you to read the paper and minutes
>before we go further.  If you can then come back with the probably
>smaller set of questions, I would be happy to address them.

I believe that a lot of my questions result from thinking that this spec 
actually concerned N-Gram LMs, when in fact it is tackling a much simpler 
problem, namely providing count data.
Perhaps the title should be adjusted to reflect what it is actually doing.
There are references to LMs in the spec, but only in a general sense 
(linear and log-linear interpolation); nothing really specific to N-Gram LMs.
[The one place that does refer to something specific to N-Gram LMs is the 
backoff weights, but this use of backoff weights is somewhat baffling to 
me.  To compute the backoff weights, one has to know the successor 
probabilities both for the context in question and for its backoff context 
(see the formula below).  Since only counts are provided, the 
probabilities have to be computed -- but the tool producing the count and 
backoff data can smooth/discount in an arbitrary manner.  So the tool 
discounts one way and produces a file with a set of backoff weights, the 
platform discounts a different way, and the backoff weights no longer 
match the probabilities.  Unless the data-producing tool and the platform 
using the data are in cahoots -- but in that case there wouldn't be the 
need for a public transfer format.]
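
For reference, the usual definition I have in mind (my notation; the spec 
doesn't give one) for the backoff weight of a context h with backoff 
context h' is

   \alpha(h) = (1 - \sum_{w : c(h,w) > 0} P(w | h))
               / (1 - \sum_{w : c(h,w) > 0} P(w | h'))

Both sums are over discounted probabilities, so \alpha(h) is only 
meaningful relative to a particular discounting scheme.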

The large number of questions may be because the spec is rather vague, or 
silent altogether, on many items.  I'm having to guess what you mean by 
certain terms, and whether it's the same thing that I mean (e.g. even 
N-Gram Models!).  I'm trying to phrase questions with a partial 
understanding of what the spec is trying to accomplish, and you're trying 
to answer with full knowledge of what you intended it to mean.  If the 
minutes can help clear up issues, that would be useful.

Cheers,
-Paul

-------------------------------------------------------------
Paul van Mulbregt,  paulvm@ne.mediaone.net

Received on Friday, 16 February 2001 00:14:54 UTC