From: Paul van Mulbregt <paulvm@ne.mediaone.net>
Date: Thu, 15 Feb 2001 00:42:57 -0500
To: www-voice@w3.org
There hasn't been much discussion on this proposal in the month since it was posted - maybe these comments will help generate some. Since it's a Working Draft, I'm assuming it's fair game to vigorously question every detail.

The proposal seems to be somewhat general in its attempt to model the types of N-Gram Language Models (LMs) in use, even though it uses the term "N-Gram Language Model" as though it has a unique, well-defined meaning. It also seems to have made some attempts at efficiency, and these two goals combine to give the document a schizophrenic feel at times. In thinking about this, I kept coming back to the issue of at whom, or what, this specification is aimed. Is the user of this data meant to build an LM from it, or to use it as a precomputed LM? Is the user meant to use the data exactly as presented, or to take it as a "hint"?

Questions

************************************************
Q1. Section 2. N-Gram Grammar Representation

* Counts vs Probabilities

"Using counts leaves the decision on the actual implementation of the N-Gram model to the platform"

Is this referring to internal representation, or to the actual algorithms used by the LM? Building a quality LM from count data is not trivial. If these counts are meant to be counts (as opposed to scaled counts), then it requires substantial processing (smoothing, pruning, etc.) to produce an N-Gram LM from the data; the sketch below illustrates how much is left open. It is also the case that an N-Gram model may be built by combining several models (e.g. by linear or log-linear interpolation), in which case counts are not an appropriate way to represent that data. Or are these counts really meant to be thought of as probabilities, scaled up by some factor and rounded to the nearest integer? The examples presented seem to suggest that it really is counts, not probabilities, that are being stored, in which case these aren't LMs.
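To make that concrete, here is a rough sketch (Python, with the tokens, counts, and the choice of add-one smoothing all invented for illustration - none of it comes from the draft) of how two equally reasonable platforms could turn the same counts into different LMs:

    # Same bigram counts, two different estimators -> two different LMs.
    # The draft does not say which estimator (if either) a platform should use.

    vocab = ["A", "B", "C"]
    unigram_count = {"A": 100}                        # token A seen 100 times
    bigram_count  = {("A", "B"): 60, ("A", "C"): 20}  # observed successors of A

    def p_mle(w, h):
        # Maximum-likelihood estimate: relative frequency of (h, w).
        return bigram_count.get((h, w), 0) / float(unigram_count[h])

    def p_addone(w, h):
        # Add-one (Laplace) smoothing: just one of many possible choices.
        return (bigram_count.get((h, w), 0) + 1) / float(unigram_count[h] + len(vocab))

    for w in vocab:
        print(w, p_mle(w, "A"), p_addone(w, "A"))
    # MLE:     A 0.0,     B 0.6,     C 0.2      (sums to 0.8 - where did the rest go?)
    # Add-one: A ~0.0097, B ~0.5922, C ~0.2039

Until the spec says which scheme (one of these two, or any of the many others) is intended, the same file describes many different LMs.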
************************************************
Q2. Section 2. N-Gram Grammar Representation

"Counts allows for a data reduction"

Where is the reduction? In the file, or in some implementation? If the former, how much of a reduction is it?

************************************************
Q3. Section 2. N-Gram Grammar Representation

"Counts are more robust in terms of manipulations of the tree"

Could you be explicit about these manipulations?

************************************************
Q4. Section 2. N-Gram Grammar Representation

"Presentation of data in depth-first rather than breadth-first"

I think this is problematic for four reasons.

a.) It seems very conceivable that an implementation may wish to use just the unigrams from the file, and not the whole set of 2-, 3- and 4-grams, for example, especially if the LM is just a hint, not a requirement. In the depth-first proposal, the implementation would have to read every line in the file, rather than just the first section as would be the case for a breadth-first format.

b.) This format precludes using more sophisticated backoff models, for example models which back off to a different set of unigrams than the regular unigrams. It also seems to preclude any other type of backoff model.

c.) An example in the document (Section 9.1) refers to an LM "classlms.xml" which might well be a class model. If the format for a class model is to use something similar to an N-Gram model, then an immediate problem arises. Namely, if N(u) is the count of class u, then N(u,v) only makes sense if u is also a class as a predecessor; i.e., if the map from word to class depends on the position of the word in the tuple, then the data doesn't fit into this tree.

d.) Pruning seems to require pruning of a whole branch of a tree, not just individual nodes. Whilst this is fine for count-based pruning, it is restrictive for other pruning methods (Seymore-Rosenfeld, Stolcke's relative entropy).

Can someone give some more explanation as to why depth-first is a good thing in the context of everything else that is being contemplated?

************************************************
Q5. Section 5. Lexicon Declaration

How are Out-of-Vocabulary (OOV) tokens supposed to be recorded?

************************************************
Q6. Section 5. Lexicon Declaration

Is there any notion of utterance start/end (corresponding to <s> and </s> in ARPA-style LMs)?

************************************************
Q7. Section 6. N-Gram Event Count Declaration

What are these counts actually counting? Is it the number of times the token (or token tuple) appeared in the corpus? Or is it the number of times it appeared as a context (aka history)? It appears to be the former from the example, and that is not sufficient to build the LM. (If a token X occurs 100 times as a unigram, and the only bigram that occurs is "X Y", 10 times, what should the probability P(Y|X) be?) If counts are missing from the bigrams "A *" relative to the unigram "A", how can one determine whether the counts are missing due to pruning, or OOV, or non-existence due to being at the end of an utterance or document?

************************************************
Q8. Section 6. N-Gram Event Count Declaration

Is there any restriction on the magnitude of a count in this file? Is a platform expected to be able to handle counts greater than 2^31, for example? Certainly modern corpora may well soon have this many tokens.

************************************************
Q9. Section 6. N-Gram Event Count Declaration

Pruning appears to be problematic also. In the example given, if "B A" and "B C" are pruned, what probability should be given to P(B|BA)?

************************************************
Q10. Section 6. N-Gram Event Count Declaration

I personally find the format difficult to read and understand, looking at the example at the end of Section 6. That may just be because I'm not used to viewing things in this manner, but it looks as though it will not be a simple matter to find any particular n-gram probability in this file. Readability need not be a major concern if there are tools to turn it into something readable.

************************************************
Q11. Section 7. Backoff Weight Declaration

If the backoff weight for a context is given, then the probabilities for the backoff LM must have been known. In particular, the smoothing used to compute both the N-gram and (N-1)-gram (backoff) probabilities for that context must have been known, but the smoothing is not specified anywhere, and may not be recomputable if pruning has occurred. The sketch below makes the dependency explicit.
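For what it's worth, here is a small sketch of that dependency, using a Katz-style backoff weight with probabilities invented for the example. The weight for a context can only be computed once the smoothed higher-order probabilities and the lower-order distribution are both fixed - which is exactly the information the draft leaves unspecified:

    # Katz-style backoff weight for a context h.  P_hi holds the (already
    # discounted/smoothed) probabilities P(w|h) for n-grams kept in the model;
    # P_lo holds the backoff (n-1)-gram probabilities for those same words.
    # All numbers are invented.

    P_hi = {"B": 0.55, "C": 0.18}   # kept, discounted higher-order probs
    P_lo = {"B": 0.30, "C": 0.10}   # lower-order probs of the same words

    def backoff_weight(p_hi, p_lo):
        # alpha(h) = (1 - sum of kept higher-order probs)
        #          / (1 - sum of lower-order probs of those same words)
        return (1.0 - sum(p_hi.values())) / (1.0 - sum(p_lo.values()))

    print(backoff_weight(P_hi, P_lo))   # (1 - 0.73) / (1 - 0.40) = 0.45

Change the smoothing (and hence P_hi), or prune a different set of n-grams, and the weight changes; a stored weight therefore silently commits the platform to whatever smoothing the producer used.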
************************************************
Q12. Section 9.1. Linear Interpolation

There is no requirement specified for the weights of the interpolation elements. Presumably they are required to be non-negative, but not required to be positive, i.e. greater than or equal to zero?

************************************************
Q13. Section 9.1. Linear Interpolation

There is no requirement that the sum of the weights be 1.0. Why is that? Is it because of the floating point issue? I've yet to come across a situation where the sum of the weights wasn't 1 which wasn't also an error of some kind. For instance, what action is envisioned if all the weights are 0?

************************************************
Q14. Section 9.1. Linear Interpolation

The interpolation weights here are dependent on the LMs, but not on the contexts. Obviously it's much more difficult to specify and/or use context-dependent weights, so this is just a comment on their absence.

************************************************
Q15. Section 9.1. Linear Interpolation

No common lexicon? What action is envisioned for a platform asked to deal with a token which is in one LM but not another? Or, for that matter, a token which appears in none of the LMs?

************************************************
Q16. Section 9.1. Linear Interpolation

If an implementation doesn't support interpolation, is it envisioned that the platform doesn't use any LM at all? Or that it do whatever it wants?

************************************************
Overall impression

It seems as though this is a mix of N-Gram token counts and N-Gram LM, and it doesn't have enough information for either situation. Perhaps one question to ask is "What builds the LM?" If the platform is expected to build the LM from this data, then that could be a rather heavy burden on the platform. If a separate process builds the LM and writes it out, then that relieves the platform of much of the effort. One concern is that there are subtleties and non-trivialities in building an LM, and having every platform try to do this could well end up with a mishmash of results. (In fact, this is not a solved problem, with much art as well as science going into the building of a quality LM.) Having stand-alone processes that do the hard work (once) of building a particular LM and making it available is an attractive option to consider.

While ARPA-style LMs have some problems for today, there are many reasons to like the idea of breadth-first (log) probabilities. Choosing efficiency - a "more convenient ordering for stream processing and data loading" - is likely to be efficient only for some implementations, and indeed problematic for others. Reducing the file size is a nice goal, but having all the information necessary to compute the LM is more important.

I raise again the question of whether this LM is meant to be a hint to the platform or meant to be used exactly as is. Certainly it is easy to design a structure that no platform actually supports, and then it is a matter of what the platform does with this LM. If the tool building the LM knows the target platform, then it can tailor the LM appropriately, but that seems to defeat the purpose of a spec. On the other hand, once a spec is in place it is possible that it will last a long time -- ARPA-style LMs are still a lowest common denominator.

I like to think of backoff LMs as being a collection of probability distributions. For each context there is a main distribution, and a chain of further distributions and backoff weights. For ARPA-style N-gram LMs, this can be represented as a tree, with many contexts of length N-1 backing off to the same context of length N-2, etc. E.g. the distributions for ABC and DBC both back off to the same distribution for BC, which also happens to be the main distribution for context BC. But there is no requirement on a backoff LM that this be the case (a rough sketch of this view follows).
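Here is a rough sketch of that view (names and numbers invented, and not a properly normalized model); the point is only that each context carries its own distribution, backoff weight and backoff pointer, and nothing forces that pointer to be the ARPA-style suffix of the context:

    # Each context holds an explicit distribution over some next tokens, a
    # backoff weight, and the distribution it backs off to (None = stop).

    class ContextNode:
        def __init__(self, probs, bow=1.0, backoff=None):
            self.probs = probs      # explicit P(w | this context) for listed words
            self.bow = bow          # backoff weight applied to unlisted words
            self.backoff = backoff  # next distribution in the chain

        def prob(self, w):
            if w in self.probs:
                return self.probs[w]
            if self.backoff is not None:
                return self.bow * self.backoff.prob(w)
            return 0.0

    # ARPA-style chain: context "A B C" -> "B C" -> "C" -> unigrams.
    unigrams = ContextNode({"X": 0.01, "Y": 0.02, "Z": 0.97})
    c_ctx    = ContextNode({"X": 0.5}, bow=0.4, backoff=unigrams)
    bc_ctx   = ContextNode({"Y": 0.6}, bow=0.3, backoff=c_ctx)
    abc_ctx  = ContextNode({"Z": 0.7}, bow=0.2, backoff=bc_ctx)

    print(abc_ctx.prob("X"))   # 0.2 * 0.3 * 0.5 = 0.03
    # A non-ARPA variant would simply point abc_ctx.backoff at some other
    # ContextNode, e.g. a separate set of "backoff unigrams".

An ARPA-style model is the special case where every context of length M backs off to its length M-1 suffix; the trees below show that case and a more general one.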
Supporting a structure that general may be a bit much to contemplate currently, but it should be possible to support having the distributions for contexts of length M specify the backoff model to use, which may be different from the distributions for length M-1.

E.g. for a 4-gram ARPA-style model the tree of distributions looks like

ABC
   \
    BC
   /  \
DBC    C -- unigrams
      /
    EC
...

with ABC backing off to BC, backing off to C, backing off to unigrams, etc. But it should be possible to support something like the following:

contexts of length 3 start in this tree

ABC
   \
    BC
   /  \
DBC    C -- unigrams
      /
    EC
...

contexts of length 2 start in this tree

BC'
   \
    C' -- unigrams'
   /
EC'
...

contexts of length 1 start in this tree

C'' -- unigrams''
...

contexts of length 0 start in this (1-element) tree

unigrams'''

In summary, this spec feels like it is aiming to be very general in some places but not so in others, and it could use a bit of tightening up here and there. But it is also considering efficiency, and in doing so it ends up with something which lies in between an N-Gram corpus count file and an N-Gram LM, and I'm unable to use it for either purpose in its current form.

Cheers,
Paul

-------------------------------------------------------------
Paul van Mulbregt, paulvm@ne.mediaone.net