LingPipe provides two API methods in the language model package that allow either collocations or new phrases to be extracted. Both are implemented by classical hypothesis testing. There’s an Interesting Phrases Tutorial on our web site.

Collocation extraction works over a single corpus (text collection). It tests the independence of a phrase. For a bigram phrase like “language model” it uses the classic four-way comparison of bigram counts: count(+language,+model), count(-language,+model), count(+language,-model), count(-language,-model) in a chi-squared test of independence. Basically, the question is: does the phrase “language model” appear more than you would expect it to by chance based on the probability of “language” and “model” occurring independently?
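The 2x2 comparison can be sketched with the standard shortcut formula for the chi-squared statistic of a 2x2 contingency table. This is not LingPipe's code, just a minimal illustration with made-up counts:

```python
def chi_squared_2x2(c11, c12, c21, c22):
    """Chi-squared independence statistic for a 2x2 contingency table.

    c11 = count(+language, +model)   c12 = count(+language, -model)
    c21 = count(-language, +model)   c22 = count(-language, -model)
    """
    n = c11 + c12 + c21 + c22
    # Shortcut for 2x2 tables: n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    num = n * (c11 * c22 - c12 * c21) ** 2
    den = (c11 + c12) * (c21 + c22) * (c11 + c21) * (c12 + c22)
    return num / den

# "language model" occurring far more often than independence predicts
# yields a huge statistic (hypothetical counts):
print(chi_squared_2x2(150, 850, 450, 98550))  # large => reject independence
```

A high statistic means the joint count is far from what the marginal counts alone would predict, which is exactly the "more than you would expect by chance" question above.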

New terminology extraction works over two corpora. The goal is to find phrases that are mentioned significantly more often in the second corpus than in the first. Here we use a standard binomial significance test, measuring whether the probability in the foreground corpus is “significantly” higher than the probability in the background corpus. This test is based on language models. The standard binomial (large-sample) test computes a z-score, which essentially measures the number of standard deviations by which the observed count lies above or below its expected value. This is a beautiful formula:

         countFG(x) - totalCountFG * probBG(x)
z = ------------------------------------------------
    sqrt(totalCountFG * probBG(x) * (1 - probBG(x)))
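As a sketch (with made-up numbers; the names `count_fg`, `total_count_fg`, and `prob_bg` are mine, not LingPipe's), the z-score is just:

```python
import math

def z_score(count_fg, total_count_fg, prob_bg):
    """Standard deviations between the observed foreground count and the
    count expected under the background model."""
    expected = total_count_fg * prob_bg
    sd = math.sqrt(total_count_fg * prob_bg * (1.0 - prob_bg))
    return (count_fg - expected) / sd

# A phrase seen 50 times among 10,000 foreground phrases, with background
# probability 0.001, is far above its expected count of 10:
print(z_score(50, 10_000, 0.001))  # roughly +12.7
```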

Here `totalCountFG` is the total count of phrases in the foreground corpus, and `probBG(x)` is the background model’s estimate of the probability of phrase x. High positive values of z mean that the phrase x occurs many more times than expected; low negative values mean many times fewer than expected. Recall that `sqrt(probBG(x) * (1 - probBG(x)))` is the standard deviation of the background Bernoulli distribution (which is iterated a number of times equal to the total count for the binomial model). This is just the standard hypothesis test for large-sample binomial distributions using a normal approximation of the binomial variance, the error of which is bounded by the central limit theorem (isn’t everything in stats?).

The actual scaling with respect to corpus counts makes more sense if we replace `countFG(x)` with an estimate based on a foreground language model, because our best estimate of the foreground count is `totalCountFG * probFG(x)`, where `probFG(x)` is a foreground model probability. This gives us:

    totalCountFG * probFG(x) - totalCountFG * probBG(x)
z = ----------------------------------------------------
      sqrt(totalCountFG * probBG(x) * (1 - probBG(x)))

         totalCountFG * (probFG(x) - probBG(x))
  = ------------------------------------------------
    sqrt(totalCountFG * probBG(x) * (1 - probBG(x)))

     sqrt(totalCountFG) * (probFG(x) - probBG(x))
  = ----------------------------------------------
           sqrt(probBG(x) * (1 - probBG(x)))

Here we see that our confidence in the significance of a (deviation scaled) difference goes up with the square root of the total count (as we’d expect, again from the central limit theorem).
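A quick numerical check (with made-up values) that the three forms in the derivation are the same quantity, and that z grows with the square root of the total count:

```python
import math

total, p_fg, p_bg = 10_000, 0.005, 0.001

z1 = (total * p_fg - total * p_bg) / math.sqrt(total * p_bg * (1 - p_bg))
z2 = total * (p_fg - p_bg) / math.sqrt(total * p_bg * (1 - p_bg))
z3 = math.sqrt(total) * (p_fg - p_bg) / math.sqrt(p_bg * (1 - p_bg))

# all three algebraic forms agree
assert abs(z1 - z2) < 1e-9 and abs(z2 - z3) < 1e-9

# quadrupling the corpus size doubles the z-score
z3_big = math.sqrt(4 * total) * (p_fg - p_bg) / math.sqrt(p_bg * (1 - p_bg))
assert abs(z3_big - 2 * z3) < 1e-9
```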

In ordering these results for terminology extraction over a single foreground/background pair, the total counts are the same for every phrase, so they can be ignored. They are useful if you want to make comparisons across corpora. You can also get rid of the square roots by squaring everything, which is much faster to compute. So the final ordering (as implemented inside LingPipe) is:

  (probFG(x) - probBG(x))^2
-----------------------------
 probBG(x) * (1 - probBG(x))


Thus we’re ordering by probability differences scaled by deviation. But note that `1 - probBG(x)` is so close to 1 when `probBG(x)` is small (as it usually is here) that it can safely be dropped from the denominator of the above equation (as suggested in Manning and Schuetze’s book).
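Here is a sketch of the squared ordering score alongside the simplified version that drops `1 - probBG(x)`; the function names and probabilities are my own, for illustration:

```python
def score(p_fg, p_bg):
    """Squared probability difference scaled by the background variance."""
    return (p_fg - p_bg) ** 2 / (p_bg * (1.0 - p_bg))

def score_simplified(p_fg, p_bg):
    """Same score with the (1 - p_bg) factor dropped from the denominator."""
    return (p_fg - p_bg) ** 2 / p_bg

# For small background probabilities the two scores barely differ:
p_fg, p_bg = 0.004, 0.0005
print(score(p_fg, p_bg), score_simplified(p_fg, p_bg))
```

With `p_bg = 0.0005` the two values differ by a factor of 0.9995, so the orderings they induce are effectively identical.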

Now if the background model is just a unigram (or other lower-order model), you can use this second setup to test independence; things that are significantly more likely under the higher-order model than under the lower-order model are more phrase-like.

Turns out I’m not the only one who thought of doing things this way. Yesterday, I read an interesting paper by Takashi Tomokiyo and Matthew Hurst, *A language model approach to keyphrase extraction*, from a 2003 ACL workshop. They distinguish two issues. The first is *phrasiness* (our notion of collocation), which they measure over a single corpus by taking the foreground model to be an n-gram model and the background model to be a unigram model. The second is *informativeness*, which we’ve called *newness* or *significance*. They measure informativeness by comparing a foreground model to a background model of the same order. What’s interesting is that they also consider crossing these measures, with the best results coming from a linear interpolation of phrasiness and newness. We should try this, too.

Here’s where things get interesting. Instead of using classical hypothesis testing for significance of differences, they use compression loss, which totally makes sense. They wind up with a formula:

                        probFG(x)
delta = probFG(x) * log ---------
                        probBG(x)

      = probFG(x) * [log probFG(x) - log probBG(x)]

Note that the log difference is the loss (in bits if logs are base 2) in compression for coding a single instance (that is, it’s the cross-entropy), and the whole thing is scaled by an estimate of the number of instances that need to be coded.

It has all the same parts as our estimate, but counter-intuitively, they’re in different polarities. Although the difference of logs is going in the same direction as our squared difference, Tomokiyo and Hurst multiply by foreground probability rather than (approximately) dividing by background probability. My guess is that this’ll skew their results toward less significant differences (in the statistical sense) that are more likely.
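A toy comparison (all probabilities made up) shows how the two measures can rank the same pair of phrases in opposite orders: a rare phrase with a large relative jump versus a common phrase with a smaller one.

```python
import math

def z_style(p_fg, p_bg):
    """Our squared-difference score, scaled by the background variance."""
    return (p_fg - p_bg) ** 2 / (p_bg * (1 - p_bg))

def th_delta(p_fg, p_bg):
    """Tomokiyo and Hurst's compression-loss delta."""
    return p_fg * (math.log(p_fg) - math.log(p_bg))

rare = (0.002, 0.0001)   # 20x more likely in the foreground, but rare
common = (0.02, 0.01)    # only 2x more likely, but common

print(z_style(*rare) > z_style(*common))    # True: rare wins on significance
print(th_delta(*rare) < th_delta(*common))  # True: common wins on delta
```

This is consistent with the guess above: multiplying by the foreground probability favors the common phrase, while dividing by the background probability favors the statistically more surprising one.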

I’ll have to check out the combination approach and see what it looks like. I can also compare to what they did, because with the visitor pattern I used to encode terminology extraction, this’ll be easy to code.

November 4, 2006 at 4:10 am

The hypothesis test that is commonly used these days is the chi-square goodness-of-fit test. I’ve never heard of binomial hypothesis testing.

November 6, 2006 at 3:46 pm

What I describe in the post is the textbook method of hypothesis testing of a binomial model. For details, check out any math stats book; I liked Larsen and Marx’s *Intro to Mathematical Statistics*. Section 6.3 is “Testing Binomial Data”. They apparently teach binomial hypothesis testing in school in the UK! You can think of binomial hypothesis testing as a single-point chi-squared test. Wikipedia’s Chi-Square Test entry is nice in that it shows how each table entry in the chi-squared test is a binomial hypothesis test, and how those binomials are estimated by normals for high-count data. With multiple cells in the chi-squared test, the final statistic is a sum of squared normal approximations, which is distributed approximately chi-squared. So really chi-squared tests are just series of binomial tests combined by summation.
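The single-point connection can be checked numerically: squaring the binomial z-score gives exactly the Pearson chi-squared statistic for the two-cell success/failure table. The counts below are made up for illustration:

```python
import math

n, k, p = 1000, 65, 0.05   # n trials, k successes, null probability p

# binomial z-score under the null hypothesis
z = (k - n * p) / math.sqrt(n * p * (1 - p))

# Pearson chi-squared over the two cells (successes, failures)
expected = [n * p, n * (1 - p)]
observed = [k, n - k]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# z^2 and the chi-squared statistic agree exactly
assert abs(z ** 2 - chi_sq) < 1e-9
```

The agreement is algebraic, not approximate: the two cell terms sum to `(k - np)^2 / (np(1 - p))`, which is z squared.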

In our application, it’s a binomial generated using independence assumptions versus a binomial model using our higher-order n-gram models. If you read the collocation chapter of Chris Manning and Hinrich Schuetze’s book on statistical NLP, they go over both chi-square and t-tests for collocation data.

LingPipe uses the chi-squared test for independence in generating collocations through the collocation method for token language models. You can also do this with a binomial by building a unigram background model. Our interesting terms method uses the binomial hypothesis test in measuring two distributions against each other. If you read the code or doc carefully enough, you’ll see I don’t really take into account the variance of both estimates, which I should be doing.