Comments on Stochastic Language Models (N-Gram) Specification WD 3 January 2001

There hasn't been much discussion on this proposal in the month since
it was posted - maybe these comments will help generate some.  Since
it's a Working Draft, I'm assuming it's fair game to vigorously
question every detail.

The proposal seems to aim for some generality in modelling the various
types of N-Gram Language Models (LMs) in use, even though it uses the
term "N-Gram Language Model" as though it had a unique, well-defined
meaning.  It also seems to have made some attempts at efficiency, and
these two goals combine to give the document a schizophrenic feel at
times.

In thinking about this, I kept coming back to the question of who or
what this specification is aimed at.

Is the consumer of this data meant to build an LM from it, or to use
it as a precomputed LM?

Is the consumer meant to use the data exactly as presented, or to take
it as a "hint"?

Questions
************************************************
Q1.

Section 2. N-Gram Grammar Representation

* Counts vs Probabilities
     "Using counts leaves the decision on the actual implementation of
     the N-Gram model to the platform".

Is this referring to internal representation, or to the actual
algorithms used by the LM?

Building a quality LM from count data is not trivial.  If these counts
are meant to be raw counts (as opposed to scaled counts), then
substantial processing (smoothing, pruning etc.) is required to produce
an N-Gram LM from the data.
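
To make this concrete, here is a minimal sketch (mine, not anything
from the draft) of what a platform would have to do just to turn raw
bigram counts into probabilities.  Even the most naive smoothing
choice, add-one smoothing here, is a decision the draft currently
leaves entirely to the platform.

    from collections import defaultdict

    def bigram_lm_from_counts(bigram_counts, vocab):
        # Total count of each context u over its observed successors.
        context_totals = defaultdict(int)
        for (u, _v), c in bigram_counts.items():
            context_totals[u] += c
        vocab_size = len(vocab)

        def prob(v, u):
            # Add-one smoothing: every unseen bigram still gets some mass.
            return (bigram_counts.get((u, v), 0) + 1) / (context_totals[u] + vocab_size)

        return prob

    # With these made-up counts the smoothed value differs from the
    # unsmoothed 10/10 = 1.0:
    p = bigram_lm_from_counts({("X", "Y"): 10}, vocab=["X", "Y", "Z"])
    print(p("Y", "X"))   # 11/13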

It is also the case that an N-Gram model may be built by combining
several models (e.g. by linear or log-linear interpolation), in which
case counts are not an appropriate way to represent the data.

Or are these counts really meant to be thought of as probabilities
scaled up by some factor and rounded to the nearest integer??  The
examples presented seem to suggest that it really is counts, not
probabilities, that are being stored, in which case these aren't LMs.

************************************************
Q2.

Section 2. N-Gram Grammar Representation
     "Counts allows for a data reduction"

Where is the reduction?  In the file, or some implementation?
If the former, how much reduction is it?

************************************************
Q3.

Section 2. N-Gram Grammar Representation
     "Counts are more robust in terms of manipulations of the tree"

Could you be explicit about these manipulations?

************************************************
Q4.

Section 2. N-Gram Grammar Representation
     "Presentation of data in depth-first rather than breadth-first"

I think this is problematic for four reasons.

a.) It seems quite conceivable that an implementation may wish to use
only the unigrams from the file, and not the whole set of 2-, 3- and
4-grams, especially if the LM is just a hint rather than a requirement.
In the depth-first proposal, the implementation would have to read
every line in the file, rather than just the first section as would be
the case for a breadth-first format.  (See the sketch at the end of
this question.)

b.) This format precludes using more sophisticated backoff models, for
example models which back off to a different set of unigrams than the
regular unigrams.  It also seems to preclude any other type of backoff
model.

c.) An example in the document (Section 9.1) refers to an LM
"classlms.xml", which might well be a class model.  If the format for a
class model is to be something similar to an N-Gram model, then an
immediate problem arises: if N(u) is the count of class u, then N(u,v)
only makes sense if u is also treated as a class when it appears as a
predecessor.  That is, if the map from word to class depends on the
position of the word in the tuple, then the data does not fit into this
tree.

d.) Pruning seems to require pruning of a whole branch of the tree, not
just individual nodes.  Whilst this is fine for count-based pruning, it
is restrictive for other pruning methods (Seymore-Rosenfeld, Stolcke's
relative-entropy pruning).

Can someone give some more explanation as to why depth-first is a good
thing in the context of everything else that is being contemplated?
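
To illustrate point a.), here is a small sketch, assuming a
breadth-first ARPA-style layout (which is not what the draft proposes),
in which a reader interested only in the unigrams can stop as soon as
the unigram section ends:

    def read_unigrams(path):
        # Return {token: log10 probability} using only the unigram section.
        unigrams = {}
        in_unigrams = False
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line == "\\1-grams:":
                    in_unigrams = True
                elif in_unigrams:
                    if line.startswith("\\"):   # reached the next section; stop
                        break
                    if line:
                        fields = line.split()   # logprob, token, [backoff]
                        unigrams[fields[1]] = float(fields[0])
        return unigrams

With the proposed depth-first ordering there is no such early exit; the
unigram records are interleaved with all of the higher-order records.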

************************************************
Q5.

Section 5. Lexicon Declaration

How are Out-of-Vocabulary (OOV) tokens supposed to be recorded?

************************************************
Q6.

Section 5. Lexicon Declaration

Is there any notion of utterance start/end (corresponding to <s> and
</s> in ARPA-style LMs)?

************************************************
Q7.

Section 6.  N-Gram Event Count Declaration

What are these counts actually counting?   Is it the number of times
the token (or token tuple) appeared in the corpus?  Or is it the number
of times it appeared as a context (aka history)?

It appears to be the former from the example, and that is not
sufficient to build the LM.  (If a token X occurs 100 times as a
unigram, and the only bigram that occurs is "X Y" 10 times, what should
be the probability P(Y|X)?)
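
To spell out the example (my own arithmetic, not the draft's):

    count_X   = 100     # unigram count of token X
    count_X_Y = 10      # count of the only listed bigram starting with X

    p_if_token_counts   = count_X_Y / count_X     # 0.1, if the 100 counts every occurrence of X
    p_if_history_counts = count_X_Y / count_X_Y   # 1.0, if only continuations are counted
    # Either reading leaves 90 of the 100 occurrences of X unexplained.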

If counts are missing from the bigrams "A *" relative to the unigram
"A", how can one determine whether the counts are missing due to
pruning, or OOV, or non-existence due to being at the end of an
utterance or document?

************************************************
Q8.

Section 6.  N-Gram Event Count Declaration

Is there any restriction on the magnitude of a count in this file?  Is
a platform expected to be able to handle counts greater than 2^31 for
example? Certainly modern corpora may well soon have this many tokens.

************************************************
Q9.

Section 6.  N-Gram Event Count Declaration

Pruning appears to be problematic also.  In the example given, if "B A"
and "B C" are pruned, what probability should be given to P(B|BA)?

************************************************
Q10.

Section 6.  N-Gram Event Count Declaration

Looking at the example at the end of Section 6, I personally find the
format difficult to read and understand.  That may just be because I'm
not used to viewing things in this manner, but it looks as though it
will not be a simple matter to find any particular n-gram probability
in this file.  Readability need not be a major concern if there are
tools to turn the file into something readable.

************************************************
Q11.

7. Backoff Weight Declaration

If the backoff weight for a context is given, then the probabilities
for the backoff LM must have been known.  In particular, the smoothing
used to compute both the N-gram and the (N-1)-gram (backoff)
probabilities for that context must have been known.  But the smoothing
is not specified anywhere, and it may not be recomputable if pruning
has occurred.
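
To make the dependence explicit, here is roughly how a Katz-style
backoff weight would be computed.  This is one standard scheme, not
something the draft defines; the point is that both the smoothed
higher-order probabilities and the lower-order probabilities must
already be fixed before the weight can be written down.

    def backoff_weight(seen_words, p_star, p_lower):
        # seen_words: words observed after this context
        # p_star:     word -> smoothed P*(w | context)
        # p_lower:    word -> P(w | shorter context), from the backoff model
        left_over  = 1.0 - sum(p_star[w] for w in seen_words)
        lower_mass = 1.0 - sum(p_lower[w] for w in seen_words)
        return left_over / lower_mass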

************************************************
Q12.

9.1 Linear Interpolation

There is no requirement specified for the weights of the interpolation
elements.  Presumably they are required to be non-negative (i.e.
greater than or equal to zero), but not required to be strictly
positive?

************************************************
Q13.

9.1 Linear Interpolation

There is no requirement that the weights sum to 1.0.  Why is that?  Is
it because of floating-point issues?  I've yet to come across a
situation where the weights didn't sum to 1 that wasn't also an error
of some kind.  For instance, what action is envisioned if all the
weights are 0?
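
To illustrate why the sum matters, here is a minimal sketch (mine, not
the draft's) of linear interpolation.  The combined scores form a
probability distribution over tokens only when the weights sum to 1
(assuming each component model already does).

    def interpolate(weights, component_probs):
        # weights: the lambda_i;  component_probs: the values P_i(w|h)
        total = sum(weights)
        if abs(total - 1.0) > 1e-6:
            raise ValueError("interpolation weights sum to %g, not 1" % total)
        return sum(lam * p for lam, p in zip(weights, component_probs))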

************************************************
Q14.

9.1 Linear Interpolation

The interpolation weights here are dependent on the LMs, but not on the
contexts.  Obviously it's much more difficult to specify and/or use
context-dependent weights, so this is just a comment on their absence.

************************************************
Q15.

9.1 Linear Interpolation

No common lexicon?  What action is envisioned for a platform asked to
deal with a token which is in one LM, but not another?  Or for that
matter, for a token which appears in none of the LMs?

************************************************
Q16.

9.1 Linear Interpolation

If an implementation doesn't support interpolation, is it envisioned
that the platform doesn't use any LM at all?  Or that it do whatever it
wants?

************************************************

Overall impression

It seems as though this is a mix of an N-Gram token-count file and an
N-Gram LM, and it doesn't contain enough information to serve either
purpose.

Perhaps one question to ask is "What builds the LM?"

If the platform is expected to build the LM from this data, then that
could be a rather heavy burden on the platform.
If a separate process builds the LM, and writes it out, then that
relieves the platform of much of the effort.

One concern is that there are subtleties and non-trivialities in
building an LM, and having every platform try to do this could well
end up with a mishmash of results. (In fact, this is not a solved
problem, with much art as well as science going into the building of a
quality LM.) Having stand-alone processes that do the hard work (once)
of building a particular LM and making it available is an attractive
option to consider.

While ARPA-style LMs have some problems for today, there are many
reasons to like the idea of breadth-first (log) probabilities.
Choosing an ordering for efficiency (the draft's "more convenient
ordering for stream processing and data loading") is likely to be
efficient only for some implementations, and indeed problematic for
others.  Reducing the file size is a nice goal, but having all the
information necessary to compute the LM is more important.
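
For reference, an ARPA-style file is laid out roughly like this (the
numbers are placeholders; each line gives a log10 probability, a token
tuple, and optionally a backoff weight, with all the 1-grams listed
before all the 2-grams, and so on):

    \data\
    ngram 1=3
    ngram 2=2

    \1-grams:
    -0.81  A  -0.30
    -1.11  B  -0.30
    -0.52  C  -0.30

    \2-grams:
    -0.30  A B
    -0.70  B C

    \end\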

I raise again the question of whether this LM is meant to be a hint to
the platform or meant to be used exactly as is.  Certainly it is easy
to design a structure that no platform actually supports, and then it
is a matter of what the platform does with this LM.  If the tool
building the LM knows the target platform, then it can tailor the LM
appropriately, but that seems to defeat the purpose of a spec.  On the
other hand, once a spec is in place it is possible that it will last a
long time -- ARPA-style LMs are still a lowest common denominator.

I like to think of backoff LMs as being a collection of probability
distributions. For each context there is a main distribution, and a
chain of further distributions and backoff weights.  For ARPA-style
N-gram LMs, this can be represented as a tree, with many contexts of
length N-1 backing off to the same context of length N-2 etc.  E.g. the
distributions for ABC and DBC both back off to the same distribution
for BC, which also happens to be the main distribution for context BC.
But there is no requirement on a backoff LM that this be the case.
Supporting a structure that general may be a bit much to contemplate
at present, but it should be possible to let the distributions for
contexts of length M specify the backoff model to use, which may be
different from the model used by the distributions for contexts of
length M-1.

E.g. for a 4-gram ARPA-style model, the tree of distributions looks like
ABC
    \
     BC
    /   \
DBC     C -- unigrams
        /
     EC
...
with ABC backing off to BC backing off to C backing off to unigrams etc.


But it should be possible to support something like the following

contexts of length 3 start in this tree
ABC
    \
     BC
    /   \
DBC     C -- unigrams
        /
     EC
...

contexts of length 2 start in this tree
     BC'
        \
         C' -- unigrams'
        /
     EC'
     ...

contexts of length 1 start in this tree
         C'' -- unigrams''
         ...

contexts of length 0 start in this (1-element) tree
                unigrams'''
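
A representation general enough for this is not complicated.  Here is
a minimal sketch (purely my own illustration) in which each context
carries its own distribution and names its own backoff target, so
contexts of length M need not fall into the same chain as contexts of
length M-1:

    class ContextNode:
        def __init__(self, dist, backoff_weight=None, backoff=None):
            self.dist = dist                  # dict: token -> probability
            self.backoff_weight = backoff_weight
            self.backoff = backoff            # another ContextNode, or None

        def prob(self, token):
            if token in self.dist:
                return self.dist[token]
            if self.backoff is not None:
                return self.backoff_weight * self.backoff.prob(token)
            return 0.0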


In summary, this spec feels as though it is aiming to be very general
in some places but not in others, and it could use a bit of tightening
up here and there.  But it is also considering efficiency, and in doing
so it ends up with something which lies in between an N-Gram corpus
count file and an N-Gram LM, and I'm unable to use it for either
purpose in its current form.

Cheers,
Paul
-------------------------------------------------------------
Paul van Mulbregt,  paulvm@ne.mediaone.net
