- From: Michael K. Brown <mkb@avaya.com>
- Date: Thu, 01 Mar 2001 18:16:41 -0500
- To: Paul van Mulbregt <paulvm@ne.mediaone.net>, www-voice@w3.org
- Message-ID: <3A9ED859.91A6C4EF@avaya.com>
I'm in the process of catching up on email. There are two other n-gram messages that I will attempt to answer shortly. I may pass some of this on to Andi Kellner as well for his expert advice.

In general, I think a lot of these questions can/should be answered by reading the existing literature on n-grams, which is freely available on the web in various places. The specification should be just that, a spec rather than a tutorial, because experts will not want to wade through a lot of prose to get to the relevant details. To help with this I think we should probably add some citations, but I'm not much inclined to add a lot of prose and duplicate information readily available on this subject.

This is the continuation of my responses to these questions. Andi may want to comment as well.

>
> ************************************************
> Q7.
>
> Section 6. N-Gram Event Count Declaration
>
> What are these counts actually counting? Is it the number of times
> the token (or token tuple) appeared in the corpus? Or is it the number
> of times it appeared as a context (aka history)?

They are both, depending on how you interpret the data. They are intended to be n-gram (tuple) counts, but of course these are also histories for longer token sequences. Since we model after the basic paradigm of the ARPA model, you can learn much more by reading that literature.

> It appears to be the former from the example, and that is not
> sufficient to build the LM. (If a token X occurs 100 times as a
> unigram, and the only bigram that occurs is "X Y" 10 times, what should
> be the probability P(Y|X)?)

It is the former, and it is entirely sufficient; it represents an early-stage representation in the ARPA modeling process.

> If counts are missing from the bigrams "A *" relative to the unigram
> "A", how can one determine if the counts are missing due to pruning, or
> OOV, or non-existence due to being at the end of an utterance or
> document?

You can't. That's why we provide the option of not pruning.

>
> ************************************************
> Q8.
>
> Section 6. N-Gram Event Count Declaration
>
> Is there any restriction on the magnitude of a count in this file? Is
> a platform expected to be able to handle counts greater than 2^31, for
> example? Certainly modern corpora may well soon have this many tokens.

There is no explicit limitation. LISP has not had a problem with this for many years, and modern C now supports 64-bit long longs, so I don't see this as a real issue.

>
> ************************************************
> Q9.
>
> Section 6. N-Gram Event Count Declaration
>
> Pruning appears to be problematic also. In the example given, if "B A"
> and "B C" are pruned, what probability should be given to P(B|BA)?

This is for you to decide. We simply provide the counts. There are a variety of smoothing methods for this purpose, and the group did not want to impose any particular one on the user.

>
> ************************************************
> Q10.
>
> Section 6. N-Gram Event Count Declaration
>
> I personally find the format difficult to read and understand, looking
> at the example at the end of section 6. That may just be because I'm
> not used to viewing things in this manner. But it looks as though it
> will not be a simple matter to find any particular n-gram probability in
> this file. Readability need not be a major concern if there are tools
> to turn it into something readable.

It's not intended for humans to read. Real n-gram data will be far too large for a human to ever want to look at directly. Tools are the only sensible way to work with this data. Even in a "more readable" format this data would be largely unreadable by humans. That's why we opted for an efficient format.
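As a rough illustration of the kind of tool involved, and of the Q7 point about what the counts give you, here is a minimal sketch in Python. It assumes the counts have already been parsed out of the count document into token-tuple/count pairs; the parsing step and all the names below are my own illustration, not anything defined in the spec.

```python
from collections import defaultdict

# Hypothetical parsed counts: token tuple -> observed count, as they
# might be extracted from an N-Gram Event Count document by some tool.
counts = defaultdict(int)
counts[("X",)] = 100       # the unigram X from the Q7 example
counts[("X", "Y")] = 10    # the only observed bigram starting with X

def ml_prob(history, token):
    """Maximum-likelihood estimate of P(token | history) from raw counts.

    Returns None if the history itself has no count (pruned, OOV, or
    simply unseen), at which point some smoothing or backoff scheme of
    the platform's choosing has to take over.
    """
    h = tuple(history)
    if counts[h] == 0:
        return None
    return counts[h + (token,)] / counts[h]

print(ml_prob(("X",), "Y"))   # 10/100 = 0.1
```

The remaining 0.9 of the mass in the Q7 example is then redistributed over unseen successors by whatever smoothing scheme the platform applies, which is exactly the decision the spec leaves to the implementer.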
>
> ************************************************
> Q11.
>
> 7. Backoff Weight Declaration
>
> If the backoff weight for a context is given, then the probabilities
> for the backoff LM must have been known. In particular, the smoothing
> used to compute both the N-gram and (N-1)-gram (backoff) probabilities
> for that context must have been known, but the smoothing is not
> specified anywhere, and may not be recomputable if pruning has occurred.

At the request of some of the group members, precomputed backoff weights were provided as a convenience and efficiency hack. Strictly speaking, we should probably not provide them at all, because they imply a commitment to a particular methodology that we wanted to avoid.

>
> ************************************************
> Q12.
>
> 9.1 Linear Interpolation
>
> There is no requirement specified for the weights of the interpolation
> elements. Presumably, they are required to be non-negative, but not
> required to be positive? I.e. greater than or equal to zero?

We had no intention of placing any restrictions, but reality probably does dictate certain constraints.

>
> ************************************************
> Q13.
>
> 9.1 Linear Interpolation
>
> There is no requirement that the sum of the weights be 1.0. Why is
> that? Is it because of the floating point issue? I've yet to come
> across a situation where the sum of the weights wasn't 1, which wasn't
> also an error of some kind. For instance, what action is envisioned
> if all the weights are 0?

Some members insisted on integer data, which means the weights have to be rescaled to the total mass. The user has to make the appropriate adjustments.

>
> ************************************************
> Q14.
>
> 9.1 Linear Interpolation
>
> The interpolation weights here are dependent on the LMs, but not on the
> contexts. Obviously it's much more difficult to specify and/or use
> context-dependent weights, so this is just a comment on their absence.
>
> ************************************************
> Q15.
>
> 9.1 Linear Interpolation
>
> No common lexicon? What action is envisioned for a platform asked to
> deal with a token which is in one LM, but not another? Or for that
> matter, for a token which appears in none of the LMs?

I'm not sure what you mean by this. OOV covers the last case.

>
> ************************************************
> Q16.
>
> 9.1 Linear Interpolation
>
> If an implementation doesn't support interpolation, is it envisioned
> that the platform doesn't use any LM at all? Or that it do whatever it
> wants?

Then I assume you lose mostly the low-frequency n-grams, but not everything. Of course, depending on the LM statistics you could be entirely hosed.

>
> ************************************************
>
> Overall impression
>
> It seems as though this is a mix of N-Gram token counts and N-Gram LM,
> and it doesn't have enough information for either situation.

Design by committee is probably never optimal, but it does have the basic information needed.

> Perhaps one question to ask is "What builds the LM?"

You do. Or rather, your platform does.
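To make "your platform does" concrete, here is a toy sketch of one way a platform might turn the counts into a smoothed bigram model. Add-one smoothing is my choice for brevity of the example, not anything the spec mandates; a real platform would use a better discounting scheme and sparse storage, and the names below are hypothetical.

```python
def build_bigram_lm(counts, vocab):
    """Estimate P(w | h) for every history h and token w in vocab,
    using add-one smoothing over the raw bigram counts.

    `counts` maps token tuples to integer counts and `vocab` is the
    token list; both are assumed to have been read out of the count
    document by some earlier tool (not shown here).
    """
    v = len(vocab)
    lm = {}
    for h in vocab:
        # Use the sum of observed "h w" bigram counts as the denominator
        # rather than the unigram count of h, so the distribution still
        # normalizes even when pruning or end-of-utterance effects make
        # the two disagree (the issue raised in Q7 and Q9).
        h_total = sum(counts.get((h, w), 0) for w in vocab)
        lm[h] = {w: (counts.get((h, w), 0) + 1) / (h_total + v)
                 for w in vocab}
    return lm
```

The computation itself is small next to ASR decoding, which is the point made in the next answer.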
> If the platform is expected to build the LM from this data, then that
> could be a rather heavy burden on the platform.

Actually, the processing is tiny compared to ASR processing. For this, however, I refer you to Andi Kellner at Philips, where they have more experience with on-the-fly model processing.

> If a separate process builds the LM, and writes it out, then that
> relieves the platform of much of the effort.

In our large-scale platforms the host would do this while EC/ASR/TTS is done in firmware running on subprocessors, so it doesn't really impact the speech processing performance at all. However, if the host is also doing XML interpretation, etc., then it does add to that load. Like I said above, I don't think this is a tremendous increase in computational load, particularly if you take care to cache computed results as we do for larger-scale platforms. Then the load gets distributed over the user population.

> One concern is that there are subtleties and non-trivialities in
> building an LM, and having every platform try and do this could well
> end up with a mishmash of results. (In fact, this is not a solved
> problem, with much art as well as science going into the building of a
> quality LM.) Having stand-alone processes that do the hard work (once)
> of building a particular LM and making it available is an attractive
> option to consider.

Precomputing models imposes constraints, and the group has opposed constraints pretty consistently. This is where you get your chance to beat the other guy in the marketplace. This is also why we still support basic research.

> While ARPA-style LMs have some problems for today, there are many
> reasons to like the idea of breadth-first (log) probabilities.
> Choosing efficiency ("more convenient ordering for stream processing and
> data loading") is likely to be efficient only for some implementations
> and indeed problematic for others. Reducing the file size is a nice
> goal, but having all the information necessary to compute the LM is
> more important.

Andi Kellner has said that Philips found depth-first more useful for building models incrementally. Perhaps Andi would like to respond to this question further.

> I raise again the question of whether this LM is meant to be a hint to
> the platform or meant to be used exactly as is. Certainly it is easy
> to design a structure that no platform actually supports, and then it
> is a matter of what the platform does with this LM. If the tool
> building the LM knows the target platform, then it can tailor the LM
> appropriately, but that seems to defeat the purpose of a spec. On the
> other hand, once a spec is in place it is possible that it will last a
> long time -- ARPA-style LMs are still a lowest common denominator.

Presumably your tools would run for your platform and be designed accordingly. You can use the data in whatever way gives you the best advantage. A lowest common denominator is, I've discovered, typical of design by committee: not exciting, but most generally useful to the largest audience. As Dave Raggett has stated on several occasions, "The W3C is not a research organization." Therefore, you can expect the outcome to be somewhat less than leading edge.
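For what it's worth, the precomputed backoff weights from Q11 are intended for the usual ARPA-style lookup recursion, along the lines of the sketch below. This is my illustration, not spec text; the table names are hypothetical, with log probabilities and backoff weights keyed by token tuples.

```python
def backoff_logprob(logprob, backoff, history, token):
    """ARPA-style backoff lookup.

    If the full n-gram is listed, return its log probability directly;
    otherwise add the backoff weight of the history (log of 1, i.e. 0.0,
    if none was listed) and recurse on the shortened history.
    """
    ngram = tuple(history) + (token,)
    if ngram in logprob:
        return logprob[ngram]
    if not history:
        # The token is not in the model at all: a true OOV, which the
        # platform has to handle by its own policy.
        return float("-inf")
    bow = backoff.get(tuple(history), 0.0)
    return bow + backoff_logprob(logprob, backoff, history[1:], token)
```

The tree picture quoted below is essentially a diagram of this recursion, with each context chaining down to ever shorter histories.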
> I like to think of backoff LMs as being a collection of probability
> distributions. For each context there is a main distribution, and a
> chain of further distributions and backoff weights. For ARPA-style
> N-gram LMs, this can be represented as a tree, with many contexts of
> length N-1 backing off to the same context of length N-2, etc. E.g. the
> distributions for ABC and DBC both back off to the same distribution
> for BC, which also happens to be the main distribution for context BC.
> But there is no requirement on a backoff LM that this be the case.
> Supporting a structure that general may be a bit much to contemplate
> currently, but it should be possible to support having the distributions
> for contexts of length M specify the backoff model to use, which may be
> different from the distributions for length M-1.
>
> E.g. for a 4-gram ARPA-style model the tree of distributions looks like
>
>     ABC
>        \
>         BC
>        /  \
>     DBC    C   -- unigrams
>           /
>         EC
>         ...
>
> with ABC backing off to BC backing off to C backing off to unigrams, etc.
>
> But it should be possible to support something like the following:
>
> contexts of length 3 start in this tree
>
>     ABC
>        \
>         BC
>        /  \
>     DBC    C   -- unigrams
>           /
>         EC
>         ...
>
> contexts of length 2 start in this tree
>
>     BC'
>        \
>         C'   -- unigrams'
>        /
>      EC'
>      ...
>
> contexts of length 1 start in this tree
>
>     C''   -- unigrams''
>     ...
>
> contexts of length 0 start in this (1-element) tree
>
>     unigrams'''
>
> In summary, this spec feels like it is aiming to be very general in
> some places but not so in others. It could use a bit of tightening up
> here and there. But it is also considering efficiency, and in doing so
> it ends up with something which lies in between an N-Gram corpus count
> file and an N-Gram LM, and I'm unable to use it for either purpose in
> its current form.

In general I would suggest and welcome contributions to improve the specification. If you think there is something specific that can be added or altered that adds to the value of the spec without subtracting elsewhere, then you should write a brief description that can serve as the basis for a group discussion. As with all such changes, these suggestions will probably get criticized and modified in group discussion, so you should also be prepared to defend them and to compromise on the final form.

> Cheers,
> Paul
> -------------------------------------------------------------
> Paul van Mulbregt, paulvm@ne.mediaone.net
Received on Thursday, 1 March 2001 18:28:03 UTC