Re: Comments on Stochastic Language Models (N-Gram) Specification WD 3 January 2001

I'm in the process of catching up on email.  There are two other n-gram
responses that I will attempt to answer shortly.  I may pass some of
this on to Andi Kellner as well for his expert advice.

In general, I think a lot of these questions can/should be answered by
reading the existing literature on n-grams, which is freely available on
the web in various places.  The specification should be just that, a
spec rather than a tutorial, because experts will not want to wade through
a lot of prose to get to the relevant details.  To help with this I
think we should probably add some citations, but I'm not much inclined
to add a lot of prose and duplicate information readily available on
this subject.

This is the continuation of my responses to these questions.  Andi may
want to comment as well.

> 
> ************************************************
> Q7.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> What are these counts actually counting?   Is it the number of times
> the token (or token tuple) appeared in the corpus?  Or is it the number
> of times it appeared as a context (aka history)?

They are both, depending on how you interpret the data.  They are
intended to be n-gram (tuple) counts, but of course these also serve as
histories for longer token sequences.  Since the format follows the
basic paradigm of the ARPA model, you can learn much more by reading
that literature.
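
Just to illustrate how such counts are typically produced (a minimal
sketch of my own, not taken from the spec; the corpus and names here
are made up):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count every k-gram for k = 1..n over a token sequence."""
        counts = Counter()
        for k in range(1, n + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
        return counts

    corpus = "the cat sat on the mat".split()
    c = ngram_counts(corpus, 3)
    print(c[("the",)])        # unigram event count for "the" -> 2
    print(c[("the", "cat")])  # bigram count; also the history count for ("the", "cat", *)

The same bigram count thus serves both as an event count and as the
history count for the trigrams that extend it.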

> 
> It appears to be the former from the example, and that is not
> sufficient to build the LM.  (If a token X occurs 100 times as a
> unigram, and the only bigram that occurs is "X Y" 10 times, what should
> be the probability P(Y|X)?)

It is the former, and it is entirely sufficient: the counts represent
an early-stage representation in the ARPA modeling process.
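
To make the arithmetic of your example concrete (a hedged sketch only;
add-one smoothing and the vocabulary size are arbitrary choices of
mine, not anything the spec mandates):

    count_X = 100       # unigram count of X
    count_XY = 10       # bigram count of "X Y"
    vocab_size = 5000   # assumed vocabulary size

    p_mle = count_XY / count_X                        # 0.10, unsmoothed estimate
    p_add1 = (count_XY + 1) / (count_X + vocab_size)  # ~0.0022 with add-one smoothing

The counts give you the raw material; which estimate you derive from
them, and how you redistribute the leftover mass, is up to the
platform.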

> 
> If counts are missing from the bigrams "A *" relative to the unigram
> "A", how can one determine if the counts are missing due to pruning, or
> OOV, or non-existence due to being at the end of a utterance or
> document?

You can't.  That's why we provide the option of not pruning.

> 
> ************************************************
> Q8.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> Is there any restriction on the magnitude of a count in this file?  Is
> a platform expected to be able to handle counts greater than 2^31 for
> example? Certainly modern corpora may well soon have this many tokens.

There is no explicit limitation.  LISP has not had a problem with this
for many years, and modern C now supports 64-bit long longs, so I don't
see this as a real issue.

> 
> ************************************************
> Q9.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> Pruning appears to be problematic also.  In the example given, if "B A"
> and "B C" are pruned, what probability should be given to P(B|BA)?

This is for you to decide.  We simply provide the counts.  There are a
variety of smoothing methods for this purpose, and the group did not
want to impose any particular one on the user.
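
For example, interpolated absolute discounting is one such method (a
sketch under my own assumptions about data structures, not anything
from the spec):

    def bigram_prob(w, h, bigrams, unigrams, total, d=0.75):
        """Interpolated absolute discounting over raw counts."""
        c_h = unigrams[h]
        seen = [w2 for (h2, w2) in bigrams if h2 == h]
        p_seen = max(bigrams.get((h, w), 0) - d, 0) / c_h   # discounted observed mass
        alpha = d * len(seen) / c_h                         # mass freed by discounting
        return p_seen + alpha * unigrams.get(w, 0) / total  # redistributed via unigrams

The mass freed by discounting also covers bigrams that were pruned from
the count file.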

> 
> ************************************************
> Q10.
> 
> Section 6.  N-Gram Event Count Declaration
> 
> I personally find the format difficult to read and understand, looking
> at the example at the end of section 6.  That may just be because I'm
> not used to viewing things in this manner.  But it looks as though it
> will not be a simple matter to find any particular n-gram probability in
> this file. Readability need not be a major concern if there are tools
> to turn it into something readable.

It's not intended for humans to read.  Real n-gram data will be far
too large for any human to want to look at directly.  Tools are the
only sensible way to work with this data.

Even in a "more readable" format this data would be largely unreadable
by humans.  That's why we opted for an efficient format.

> 
> ************************************************
> Q11.
> 
> 7. Backoff Weight Declaration
> 
> If the backoff weight for a context is given, then the probabilities
> for the backoff LM must have been known.  In particular, the smoothing
> used to compute both the N-gram and (N-1)-gram (backoff) probabilities
> for that context must have been known, but the smoothing is not
> specified anywhere, and may not be recomputable if pruning has occurred.

At the request of some of the group members, precomputed backoff
weights were provided as a convenience and an efficiency hack.
Strictly speaking, we should probably not provide them at all, because
doing so implies a particular methodology that we wanted to avoid
imposing.
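
For reference, an ARPA-style backoff weight is typically derived along
these lines once discounted probabilities have been fixed (a sketch;
the data structures are my own):

    def backoff_weight(context, probs, lower_probs):
        """alpha(context) = leftover mass / lower-order mass of the unseen events."""
        seen = [w for (h, w) in probs if h == context]
        leftover = 1.0 - sum(probs[(context, w)] for w in seen)
        lower = 1.0 - sum(lower_probs.get((context[1:], w), 0.0) for w in seen)
        return leftover / lower if lower > 0 else 0.0

As the question points out, the resulting weight is only meaningful
relative to the smoothing that produced those probabilities.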

> 
> ************************************************
> Q12.
> 
> 9.1 Linear Interpolation
> 
> There is no requirement specified for the weights of the interpolation
> elements.  Presumably, they are required to be non-negative, but not
> required to be positive? I.E. Greater than or equal to zero?

We had no intention of placing any restrictions, but reality probably
does dictate certain constraints.

> 
> ************************************************
> Q13.
> 
> 9.1 Linear Interpolation
> 
> There is no requirement that the sum of the weights be 1.0. Why is
> that? Is it because of the floating point issue?  I've yet to come
> across a situation where the sum of the weights wasn't 1, which wasn't
> also an error of some kind.   For instance, what action is envisioned
> if all the weights are 0?

Some members insisted on integer data.  This means the weights have to
be normalized by their total mass, and the user has to make the
appropriate adjustments.
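
So an interpreter would normalize the integer weights itself, roughly
like this (illustrative only; the spec does not name these structures):

    raw = [3, 1, 1]                     # integer weights for three component LMs
    total = sum(raw)                    # all-zero weights would make this undefined
    weights = [w / total for w in raw]  # [0.6, 0.2, 0.2]

    def interpolate(probs, weights):
        """Linear interpolation of per-LM probabilities for one token."""
        return sum(w * p for w, p in zip(weights, probs))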

> 
> ************************************************
> Q14.
> 
> 9.1 Linear Interpolation
> 
> The interpolation weights here are dependent on the LMs, but not on the
> contexts.  Obviously it's much more difficult to specify and/or use
> context-dependent weights, so this is just a comment on their absence.
> 
> ************************************************
> Q15.
> 
> 9.1 Linear Interpolation
> 
> No common lexicon?  What action is envisioned for a platform asked to
> deal with a token which is in one LM, but not another?  Or for that
> matter, for a token which appears in none of the LMs?

I'm not sure what you mean by this.  OOV covers the last case.

> 
> ************************************************
> Q16.
> 
> 9.1 Linear Interpolation
> 
> If an implementation doesn't support interpolation, is it envisioned
> that the platform doesn't use any LM at all?  Or that it do whatever it
> wants?

Then I assume you mostly lose the low-frequency n-grams, but not
everything.  Of course, depending on the LM statistics, you could be
entirely hosed.

> 
> ************************************************
> 
> Overall impression
> 
> It seems as though this is a mix of N-Gram token counts, and N-Gram LM,
> and it doesn't have enough information for either situation.

Design by committee is probably never optimal, but the format does
have the basic information needed.

> 
> Perhaps one question to ask is "What builds the LM?"

You do.  Or rather, your platform does.

> 
> If the platform is expected to build the LM from this data, then that
> could be a rather heavy burden on the platform.

Actually, the processing is tiny compared to ASR processing.  For this,
however, I refer you to Andi Kellner at Philips where they have more
experience with on-the-fly model processing.

> If a separate process builds the LM, and writes it out, then that
> relieves the platform of much of the effort.

In our large-scale platforms the host would do this while EC/ASR/TTS
is done in firmware running on subprocessors, so it doesn't really
impact the speech processing performance at all.  However, if the host
is also doing XML interpretation, etc., then it does add to that load.
Like I said above, I don't think this is a tremendous increase in
computational load, particularly if you take care to cache computed
results as we do for larger-scale platforms.  Then the load gets
distributed over the user population.
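
The caching I have in mind is nothing exotic; roughly (a sketch, with
made-up names):

    import hashlib

    _cache = {}

    def get_compiled_lm(ngram_doc: bytes, compile_fn):
        """Build the model once per distinct document; reuse it afterwards."""
        key = hashlib.sha1(ngram_doc).hexdigest()
        if key not in _cache:
            _cache[key] = compile_fn(ngram_doc)   # the expensive build happens once
        return _cache[key]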

> 
> One concern is that there are subtleties and non-trivialities in
> building an LM, and having every platform try and do this could well
> end up with a mishmash of results. (In fact, this is not a solved
> problem, with much art as well as science going into the building of a
> quality LM.) Having stand-alone processes that do the hard work (once)
> of building a particular LM and making it available is an attractive
> option to consider.

Precomputing models imposes constraints, and the group has opposed
constraints pretty consistently.  This is where you get your chance to
beat the other guy in the marketplace.  This is also why we still
support basic research.

> 
> While ARPA-style LMs have some problems for today, there are many
> reasons to like the idea of breadth-first (log) probabilities.
> Choosing efficiency "more convenient ordering for stream processing and
> data loading" is likely to be efficient only for some implementations
> and indeed problematic for others.  Reducing the file size is a nice
> goal, but having all the information necessary to compute the LM is
> more important.

Andi Kellner has said that Philips found depth-first more useful for
building models incrementally.  Perhaps Andi would like to respond to
this question further.
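
To show what the two orderings look like over a toy count trie
(illustrative only; this is not the spec's serialization):

    trie = {"A": {"_count": 5, "B": {"_count": 3, "C": {"_count": 1}}},
            "B": {"_count": 4}}

    def depth_first(node, prefix=()):
        """Emit each history together with all of its extensions."""
        for tok, child in node.items():
            if tok == "_count":
                continue
            yield prefix + (tok,), child["_count"]
            yield from depth_first(child, prefix + (tok,))

    # depth-first:  ("A",) 5, ("A","B") 3, ("A","B","C") 1, ("B",) 4
    # breadth-first (ARPA-style) would list all 1-grams, then all 2-grams, ...

Depth-first keeps each history adjacent to its extensions, which is
what makes incremental construction convenient.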

> 
> I raise again the question of whether this LM is meant to be a hint to
> the platform or meant to be used exactly as is.  Certainly it is easy
> to design a structure that no platform actually supports, and then it
> is a matter of what the platform does with this LM.  If the tool
> building the LM knows the target platform, then it can tailor the LM
> appropriately, but that seems to defeat the purpose of a spec.  On the
> other hand, once a spec is in place it is possible that it will last a
> long time -- ARPA-style LMs are still a lowest common denominator.

Presumably your tools would run for your platform and be designed
accordingly.  You can use the data in whatever way gives you the best
advantage.

The lowest common denominator is, I've discovered, typical of design
by committee: not exciting, but most generally useful to the largest
audience.  As Dave Raggett has stated on several occasions, "The W3C is
not a research organization."  Therefore, you can expect the outcome to
be somewhat less than leading edge.

> 
> I like to think of backoff LMs as being a collection of probability
> distributions. For each context there is a main distribution, and a
> chain of further distributions and backoff weights.  For ARPA-style
> N-gram LMs, this can be represented as a tree, with many contexts of
> length N-1 backing off to the same context of length N-2 etc.  E.g. the
> distributions for ABC and DBC both back off to the same distribution
> for BC, which also happens to be the main distribution for context BC.
> But there is no requirement on a backoff LM that this be the case.
> Supporting a structure that general may be a bit much to contemplate
> currently but it should be possible to support having the distributions
> for contexts of length M specify the backoff model to use, which may be
> different from the distributions for length M-1.
> 
> E.g. For a 4-gram ARPA-style model the tree of distributions looks like
> ABC
>     \
>      BC
>     /   \
> DBC     C -- unigrams
>         /
>      EC
> ...
> with ABC backing off to BC backing off to C backing off to unigrams etc.
> 
> But it should be possible to support something like the following
> 
> contexts of length 3 start in this tree
> ABC
>     \
>      BC
>     /   \
> DBC     C -- unigrams
>         /
>      EC
> ...
> 
> contexts of length 2 start in this tree
>      BC'
>         \
>          C' -- unigrams'
>         /
>      EC'
>      ...
> 
> contexts of length 1 start in this tree
>          C'' -- unigrams''
>          ...
> 
> contexts of length 0 start in this (1-element) tree
>                 unigrams'''
> 
> In summary, this spec feels like it is aiming to be very general in
> some places but not so in others.  It could use a bit of tightening up
> here and there.  But it is also considering efficiency, and in doing so
> it ends up with something which lies in between a N-Gram corpus count
> file and an N-Gram LM, and I'm unable to use it for either purpose in
> its current form.

In general, I would suggest and welcome contributions to improve the
specification.  If you think there is something specific that can be
added or altered that adds value to the spec without subtracting
elsewhere, then you should write a brief description that can serve as
the basis for a group discussion.  As with all such changes, the
suggestions will probably get criticized and modified in group
discussion, so you should also be prepared to defend them and to
compromise on the final form.

> 
> Cheers,
> Paul
> -------------------------------------------------------------
> Paul van Mulbregt,  paulvm@ne.mediaone.net
