Re: Comments on Stochastic Language Models (N-Gram) Specification WD 3 January 2001

Comments inserted:

Paul van Mulbregt wrote:
> 
> There hasn't been much discussion on this proposal in the month since
> it was posted - maybe these comments will help generate some.  Since
> it's a Working Draft, I'm assuming it's fair game to vigorously
> question every detail.
> 
> The proposal seems to be somewhat general in its attempt to model the
> types of N-Gram Language Models (LMs) in use, even though it does use
> the term "N-Gram Language Model" as though it has a unique well-defined
> meaning.  It also seems to have made some attempts at efficiency, and
> these two goals combine to give the document a schizophrenic feel at
> times.

N-gram refers to a class of stochastic language models.  The preferred
language specification has a well-defined meaning within the document.
Flexibility is provided to alter the stochastic modeling methods.  In
this sense the spec attempts to cover the most common methods and make
allowances for as much flexibility as we can predict right now.

There is a pure XML format suggested as well, but practical
implementations for very large n-grams will require an efficient
representation (as explicitly stated in the requirements document).  I
agree that it would be elegant if we could just spec a nice regular XML
markup for this, but it probably would not be practical.

> 
> In thinking about this, I kept coming back to the issue of to whom or
> what this specification is aimed.
> 
> Is it meant that a user of this data is to build an LM from this data,
> or use it as a precomputed LM?

Both.  The spec supports either event counts with a cutoff, or full
event counts if you want to build your own models.  Full event counts
allow incremental training of the models.  You can request that backoff
weights be precomputed, or compute them yourself (necessary for
incremental training).  You can alter the statistical methods used.
Essentially, you have a range of control over how much you want to
compute vs. how much you want the server to do in advance using the
preferred method.  In a later spec I would expect us to provide a richer
set of alternative pre-defined methods to be chosen by a query
parameter.  Right now the basic framework is being defined.
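
To make the backoff-weight point concrete, here is a rough Python
sketch (illustration only, not spec language; the Katz-style
formulation and all names are my own choices) of computing a backoff
weight for one context from already-discounted probabilities:

    # Sketch: Katz-style backoff weight for a context h.  Assumes the
    # discounted probabilities already exist; every name here is
    # illustrative, not from the spec.
    def backoff_weight(p_disc, p_backoff, seen_words):
        # p_disc[w]    = discounted P*(w | h) for words seen after h
        # p_backoff[w] = P(w | h'), h' being the shortened context
        # seen_words   = words observed after h (i.e., above cutoff)
        left_over = 1.0 - sum(p_disc[w] for w in seen_words)
        covered   = 1.0 - sum(p_backoff[w] for w in seen_words)
        return left_over / covered

After incrementally merging in new counts you would re-discount and
re-run this, which is exactly why a client doing incremental training
must be able to compute the weights itself.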

> 
> Is it meant that the user of this data will use it exactly as
> presented, or that the user will take it as a "hint"?

Again, both (or either).  You may load the n-gram for a specific
application and use it as your entire n-gram definition.  Alternatively,
if you have a generic n-gram model that could be made more application
specific by incrementally loading in additional n-gram data from a site,
you could augment your model on the fly to increase (perhaps *enable* in
highly specialized cases) performance for a particular domain.  You may
want to use your generic model as the n-gram base for many different
applications requiring specialized vocabularies that you could not
predict in your generic model.  Still, the generic component could be
highly valuable (and your market differentiator), since it might provide
the boilerplate for all the application-independent UI of your platform
features.

> 
> Questions
> ************************************************
> Q1.
> 
> Section 2. N-Gram Grammar Representation
> 
> * Counts vs Probabilities
>      "Using counts leaves the decision on the actual implementation of
>      the N-Gram model to the platform".
> 
> Is this referring to internal representation, or to the actual
> algorithms used by the LM?
> 
> Building a quality LM from count data is not trivial.  If these counts
> are meant to be counts (as opposed to scaled counts), then it requires
> substantial processing (smoothing, pruning etc.) to produce a N-Gram LM
> from the data.

To call it "trivial" would indeed understate the work involved, but
building n-grams has been well understood for decades.  This document is
not intended to be a tutorial on the subject.  We have existence proofs
that building the LM incrementally can be done quickly enough to be
practical.
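
As a toy illustration of the basic machinery (Python, with add-one
smoothing standing in for the better smoothing a real platform would
use):

    # Toy sketch: bigram probabilities from raw counts, using add-one
    # smoothing purely for brevity.  A production LM would use Katz,
    # Witten-Bell, Kneser-Ney, etc., but the count-driven machinery is
    # the same.
    from collections import defaultdict

    def bigram_probs(tokens, vocab_size):
        uni = defaultdict(int)   # context counts
        bi  = defaultdict(int)   # event counts
        for w1, w2 in zip(tokens, tokens[1:]):
            uni[w1] += 1
            bi[(w1, w2)] += 1
        # observed bigrams only; an unseen "w1 w" would get
        # 1 / (uni[w1] + vocab_size)
        return {(w1, w2): (c + 1.0) / (uni[w1] + vocab_size)
                for (w1, w2), c in bi.items()}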

> 
> It is also the case that a N-Gram model may be built by combining
> several models (e.g. by linear or log-linear interpolation) in which
> case counts are not an appropriate way to represent that data.

I disagree.  Combining models does not allow you to recover cutoff
statistics.  When you have several sources of data, the accumulated
counts can rise above the cutoff threshold and bring in additional
statistics you could not obtain from model interpolation.  Counts are
the prime data from which all this modeling information can be derived.
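
A small sketch of the point, with hypothetical numbers and a cutoff of
5:

    # Sketch: why merged counts beat model interpolation.  With a cutoff
    # of 5, an event seen 3 times in each of two sources appears in
    # neither source's model, but its merged count (6) survives.
    from collections import Counter

    CUTOFF = 5
    source_a = Counter({("new", "york"): 3})
    source_b = Counter({("new", "york"): 3})

    merged = source_a + source_b
    kept = {ng: c for ng, c in merged.items() if c >= CUTOFF}
    print(kept)   # {('new', 'york'): 6} -- invisible to interpolation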

> 
> Or are these counts really meant to be thought of as probabilities
> scaled up by some factor and rounded to the nearest integer??  The
> examples presented seem to suggest that it really is counts, not
> probabilities, that are being stored, in which case these aren't LMs.

These are counts, not probabilities.  At this point (and looking
forward) I think you may want to read some background references to get
more familiar with general n-gram techniques.  For your convenience I
have attached one paper.  You will find many others in the literature.

I've added a couple more comments below.

> 
> ************************************************
> Q2.
> 
> Section 2. N-Gram Grammar Representation
>      "Counts allows for a data reduction"
> 
> Where is the reduction?  In the file, or some implementation?
> If the former, how much reduction is it?

Variable.  The reduction comes from cutoffs and their consequences:
events whose counts fall below the cutoff are omitted from the file.
Please read the paper and the prior minutes of meetings.  Many of your
questions were discussed in earlier meetings, so this is not an
efficient way to address them.

> 
> ************************************************
> Q3.
> 
> Section 2. N-Gram Grammar Representation
>      "Counts are more robust in terms of manipulations of the tree"
> 
> Could you be explicit about these manipulations?

Discussed in meetings.  Counts are less sensitive to data errors.

> 
> ************************************************
> Q4.
> 
> Section 2. N-Gram Grammar Representation
>      "Presentation of data in depth-first rather than breadth-first"
> 
> I think this is problematic for four reasons.
> 
> a.) It seems very conceivable that an implementation may just wish to
> use the unigrams from the file, not the whole 2,3,4-grams for example,
> especially if the LM is just a hint, not a requirement.  In the
> depth-first proposal, the implementation would have to read every line
> in the file, rather than just the first section as would be the case
> for a breadth-first format.

Discussed in meetings.  Depth-first was chosen for a couple of reasons:
1. it makes the file somewhat smaller, since the ordering encodes part
of the information; 2. the full LM build can be done in a single pass.

You are right, however, that if you simply want unigrams you would have
to scan the whole file.
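
To make the single-pass point concrete, here is a rough Python sketch;
the (depth, token, count) record format is invented for illustration
and is not the spec's actual format:

    # Sketch: single-pass load of depth-first count records.  Because a
    # node's parent always precedes it, each record attaches under the
    # current path; no second pass or random access is needed.
    def load_depth_first(records):
        root = {"children": {}, "count": 0}
        path = [root]                 # current branch, root at depth 0
        for depth, token, count in records:
            parent = path[depth - 1]  # parent is already on the path
            node = {"children": {}, "count": count}
            parent["children"][token] = node
            del path[depth:]          # pop back to the parent's level
            path.append(node)
        return root

    # e.g. unigram A, its bigrams A B and A C, then unigram D:
    tree = load_depth_first([(1, "A", 100), (2, "B", 10),
                             (2, "C", 5), (1, "D", 40)])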

> 
> b.) This format precludes using more sophisticated backoff models. For
> example, models which backoff to a different set of unigrams than the
> regular unigrams.  It also seems to preclude any other type of backoff
> model.

I don't believe this is true, but if you have a counterexample I would
like to see it.  You can get the raw data if you ask for it.  I don't
see how this precludes anything.

> 
> c.) An example in the document (Section 9.1) refers to an LM
> "classlms.xml" which might well be a class model.  If the format for a
> class model is to use something similar to a N-Gram model, then an
> immediate problem arises. Namely, if N(u) is the count of class u, then
> N(u,v) only makes sense if u is also a class as a predecessor. I.e. if
> the map from word to class depends on the position of the word in the
> tuple, then the data doesn't fit into this tree.

Discussed.  Class models were intended to gather sets of words and
superwords.  We did not intend them to be full hierarchical models, if
that is what you mean.

> 
> d.) Pruning seems to require pruning of a whole branch of a tree, not
> just individual nodes.  Whilst this is fine for count based pruning, it
> is restrictive for other pruning methods (Seymore-Rosenfeld, Stolcke's
> Relative Entropy)
> 
> Can someone give some more explanation as to why depth-first is a good
> thing in the context of everything else that is being contemplated?
> 
> ************************************************
> Q5.
> 
> Section 5. Lexicon Declaration
> 
> How are Out-of-Vocabulary (OOV) tokens supposed to be recorded?

Discussed.  <UNK> is one example, but you can choose.  I think it would
be best if you read some of the literature before we go any further.

> 
> ************************************************
> Q6.
> 
> Section 5. Lexicon Declaration
> 
> Is there any notion of utterance start/end (corresponding to <s></s> in
> ARPA-style LMs)?

Discussed; it is implicit.  Please read the paper.  I'm sorry, but I'm
going to stop at this point and ask you to read the paper and the
minutes before we go further.  If you can then come back with what will
probably be a smaller set of questions, I would be happy to address
them.

> 
> ************************************************
> Q7.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> What are these counts actually counting?   Is it the number of times
> the token (or token tuple) appeared in the corpus?  Or is it the number
> of times it appeared as a context (aka history)?
> 
> It appears to be the former from the example, and that is not
> sufficient to build the LM.  (If a token X occurs 100 times as a
> unigram, and the only bigram that occurs is "X Y" 10 times, what should
> be the probability P(Y|X)?)
> 
> If counts are missing from the bigrams "A *" relative to the unigram
> "A", how can one determine if the counts are missing due to pruning, or
> OOV, or non-existence due to being at the end of an utterance or
> document?
> 
> ************************************************
> Q8.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> Is there any restriction on the magnitude of a count in this file?  Is
> a platform expected to be able to handle counts greater than 2^31 for
> example? Certainly modern corpora may well soon have this many tokens.
> 
> ************************************************
> Q9.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> Pruning appears to be problematic also.  In the example given, if "B A"
> and "B C" are pruned, what probability should be given to P(B|BA)?
> 
> ************************************************
> Q10.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> I personally find the format difficult to read and understand, looking
> at the example at the end of section 6.  That may just be because I'm
> not used to viewing things in this manner.  But it looks as though it
> will not be a simple matter to find any particular n-gram probability in
> this file. Readability need not be a major concern if there are tools
> to turn it into something readable.
> 
> ************************************************
> Q11.
> 
> 7. Backoff Weight Declaration
> 
> If the backoff weight for a context is given, then the probabilities
> for the backoff LM must have been known.  In particular, the smoothing
> used to compute both the N-gram and (N-1)-gram (backoff) probabilities
> for that context must have been known, but the smoothing is not
> specified anywhere, and may not be recomputable if pruning has occurred.
> 
> ************************************************
> Q12.
> 
> 9.1 Linear Interpolation
> 
> There is no requirement specified for the weights of the interpolation
> elements.  Presumably, they are required to be non-negative, but not
> required to be positive, i.e., greater than or equal to zero?
> 
> ************************************************
> Q13.
> 
> 9.1 Linear Interpolation
> 
> There is no requirement that the sum of the weights be 1.0. Why is
> that? Is it because of the floating point issue?  I've yet to come
> across a situation where the sum of the weights wasn't 1, which wasn't
> also an error of some kind.   For instance, what action is envisioned
> if all the weights are 0?
> 
> ************************************************
> Q14.
> 
> 9.1 Linear Interpolation
> 
> The interpolation weights here are dependent on the LMs, but not on the
> contexts.  Obviously it's much more difficult to specify and/or use
> context-dependent weights, so this is just a comment on their absence.
> 
> ************************************************
> Q15.
> 
> 9.1 Linear Interpolation
> 
> No common lexicon?  What action is envisioned for a platform asked to
> deal with a token which is in one LM, but not another?  Or for that
> matter, for a token which appears in none of the LMs?
> 
> ************************************************
> Q16.
> 
> 9.1 Linear Interpolation
> 
> If an implementation doesn't support interpolation, is it envisioned
> that the platform doesn't use any LM at all?  Or that it do whatever it
> wants?
> 
> ************************************************
> 
> Overall impression
> 
> It seems as though this is a mix of N-Gram token counts, and N-Gram LM,
> and it doesn't have enough information for either situation.
> 
> Perhaps one question to ask is "What builds the LM?"
> 
> If the platform is expected to build the LM from this data, then that
> could be a rather heavy burden on the platform.
> If a separate process builds the LM, and writes it out, then that
> relieves the platform of much of the effort.
> 
> One concern is that there are subtleties and non-trivialities in
> building an LM, and having every platform try and do this could well
> end up with a mishmash of results. (In fact, this is not a solved
> problem, with much art as well as science going into the building of a
> quality LM.) Having stand-alone processes that do the hard work (once)
> of building a particular LM and making it available is an attractive
> option to consider.
> 
> While ARPA-style LMs have some problems for today, there are many
> reasons to like the idea of breadth-first (log) probabilities.
> Choosing efficiency ("more convenient ordering for stream processing and
> data loading") is likely to be efficient only for some implementations
> and indeed problematic for others.  Reducing the file size is a nice
> goal, but having all the information necessary to compute the LM is
> more important.
> 
> I raise again the question of whether this LM is meant to be a hint to
> the platform or meant to be used exactly as is.  Certainly it is easy
> to design a structure that no platform actually supports, and then it
> is a matter of what the platform does with this LM.  If the tool
> building the LM knows the target platform, then it can tailor the LM
> appropriately, but that seems to defeat the purpose of a spec.  On the
> other hand, once a spec is in place it is possible that it will last a
> long time -- ARPA-style LMs are still a lowest common denominator.
> 
> I like to think of backoff LMs as being a collection of probability
> distributions. For each context there is a main distribution, and a
> chain of further distributions and backoff weights.  For ARPA-style
> N-gram LMs, this can be represented as a tree, with many contexts of
> length N-1 backing off to the same context of length N-2 etc.  E.g. the
> distributions for ABC and DBC both back off to the same distribution
> for BC, which also happens to be the main distribution for context BC.
> But there is no requirement on a backoff LM that this be the case.
> Supporting a structure that general may be a bit much to contemplate
> currently but it should be possible to support having the distributions
> for contexts of length M specify the backoff model to use, which may be
> different from the distributions for length M-1.
> 
> E.g. For a 4-gram ARPA-style model the tree of distributions looks like
> ABC
>     \
>      BC
>     /   \
> DBC     C -- unigrams
>         /
>      EC
> ...
> with ABC backing off to BC backing off to C backing off to unigrams etc.
> 
> But it should be possible to support something like the following
> 
> contexts of length 3 start in this tree
> ABC
>     \
>      BC
>     /   \
> DBC     C -- unigrams
>         /
>      EC
> ...
> 
> contexts of length 2 start in this tree
>      BC'
>         \
>          C' -- unigrams'
>         /
>      EC'
>      ...
> 
> contexts of length 1 start in this tree
>          C'' -- unigrams''
>          ...
> 
> contexts of length 0 start in this (1-element) tree
>                 unigrams'''
> 
> In summary, this spec feels like it is aiming to be very general in
> some places but not so in others.  It could use a bit of tightening up
> here and there.  But it is also considering efficiency, and in doing so
> it ends up with something which lies in between a N-Gram corpus count
> file and an N-Gram LM, and I'm unable to use it for either purpose in
> its current form.
> 
> Cheers,
> Paul
> -------------------------------------------------------------
> Paul van Mulbregt,  paulvm@ne.mediaone.net

-- 
		Michael K. Brown
		Avaya Labs, Rm. 2D-534, (908) 582-5044
		600 Mountain Ave., Murray Hill, NJ 07974
		mkb@avaya.com

Received on Thursday, 15 February 2001 10:44:05 UTC