All About Language Models

Posted in NLP Research by rsmudge on March 4, 2010

One of the challenges with most natural language processing tasks is getting data and collapsing it into a usable model. Prepping a large data set is hard enough. Once you’ve prepped it, you have to put it into a language model. My old NLP lab (consisting of two computers I paid $100 for from Cornell University), it took 18 hours to build my language models. You probably have better hardware than I did.

Save the Pain

I want to save you some pain and trouble if I can. That’s why I’m writing today’s blog post. Did you know After the Deadline has prebuilt bigram language models for English, German, Spanish, French, Italian, Polish, Indonesian, Russian, Dutch, and Portuguese? That’s 10 languages!

Also, did you know that the After the Deadline language model is a simple serialized Java object. In fact, the only dependency to use it is one Java class. Now that I’ve got you excited, let’s ask… what can you do with a language model?

Language Model API

A bigram language model has the count of every sequence of two words seen in a collection of text. From this information you can calculate all kinds of interesting things.

As an administrative note, I will use the Sleep programming language for these examples. This code is trivial to port to Java but I’m on an airplane and too lazy to whip out the compiler.

Let’s load up the Sleep interactive interpreter and load the English language model. You may assume all these commands are executed from the top-level directory of the After the Deadline source code distribution.

$ java -Xmx1536M -jar lib/sleep.jar 
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
import * from: lib/spellutils.jar;
$handle = openf('models/model.bin');
$model = readObject($handle);
closef($handle);
println("Loaded $model");
.
Loaded org.dashnine.preditor.LanguageModel@5cc145f9
done

And there you have it. A language model ready for your use. I’ll walk you through each API method.

Count Words

The count method returns the number of times the specified word was seen.

Examples:

> x [$model count: "hello"]
153
> x [$model count: "world"]
26355
> x [$model count: "the"]
3046771

Word Probability

The Pword method returns the probability of a word. The Java way to call this is model.Pword(“word”).

> x [$model Pword: "the"]
0.061422322118108906
> x [$model Pword: "Automattic"]
8.063923690767558E-7
> x [$model Pword: "fjsljnfnsk"]
0.0

Word Probability with Context

That’s the simple stuff. The fun part of the language model comes in when you can look at context. Imagine the sentence: “I want to bee an actor”. With the language model we can compare the fit of the word bee with the fit of the word be given the context. The contextual probability functions let you do that.

Pbigram1(“previous”, “word”)

This method calculates P(word|previous) or the probability of the specified word given the previous word. This is the most straight forward application of our bigram language model. After all, we have every count of “previous word” that was seen in the corpus we trained with. We simply divide this by the count of previous to arrive at an answer. Here we use our contextual probability to look at be vs. bee.

> x [$model Pbigram1: "to", "bee"]
1.8397294594205855E-5
> x [$model Pbigram1: "to", "be"]
0.06296975819264979

Pbigram2(“word”, “next”)

This method calculates P(word|next) or the probability of the specified word given the next word. How does it do it? It’s a simple application of Bayes Theorem. Bayes Theorem lets us flip the conditional in a probability. It’s calculated as: P(word|next) = P(next|word) * P(word) / P(next). Here we use it to further investigate the probability of be vs. bee:

> x [$model Pbigram2: "bee", "an"]
0.0
> x [$model Pbigram2: "be", "an"]
0.014840446919206074

If you were a computer, which word would you assume the writer meant?

A Little Trick

These methods will also accept a sequence of two words in the parameter that you’re calculating the probability of. I use this trick to segment a misspelled word with a space between each pair of letters and compare the results to the other spelling suggestions.

> x [$model Pword: "New York"]
3.3241509434266565E-4
> x [$model Pword: "a lot"]
2.1988303923800437E-4
> x [$model Pbigram1: "it", "a lot"]
8.972553689218159E-5
> x [$model Pbigram2: "a lot", '0END.0']
6.511467636360339E-7

You’ll notice 0END.0. This is a special word. It represents the end of a sentence. 0BEGIN.0 represents the beginning of a sentence. The only punctuation tracked by these models is the ','. You can refer to it directly.

Harvest a Dictionary

One of my uses for the language model is to dump a spell checker dictionary. I do this by harvesting all words that occur two or more times. When I add enough data, I’ll raise this number to get a higher quality dictionary. To harvest a dictionary:

> x [$model harvest: 1000000]
[a, of, to, and, the, 0END.0, 0BEGIN.0]

This command harvests all words that occur a million or more times. As you can see, there aren’t too many. The language model I have now was derived 75 million words of text.

The Next Step

That’s the After the Deadline language model in a nutshell. There is also a method to get the probability of a word given the two words that came before it. This is done using trigrams. I didn’t write about it here because AtD stores trigrams for words tracked by the misused word detector only.

That said, there’s a lot of fun you can have with this kind of data.

Download the After the Deadline open source distribution. You’ll find the English language model at models/model.bin. You can also get spellutils.jar from the lib directory.

If you want to experiment with bigrams in other languages, the After the Deadline language pack has the trained language models for nine other languages.

The English language model was trained from many sources including Project Gutenberg and Wikipedia. The other language models were trained from their respective Wikipedia dumps.

Good luck and have fun.

After the Deadline is an open source grammar, style, and spell checker. Unlike other tools, it uses context to make smart suggestions for errors. Plugins are available for Firefox, WordPress, and others.

5 comments

5 Responses

Subscribe to comments with RSS.

Kevin said, on March 11, 2010 at 8:51 am

Nice walkthrough! And I like the example use of Bayes Theorem; they should put that in statistics text books to accompany the endless list of breast cancer problems =P

Looking forward to when I have time to actually play around with this stuff 🙂
- rsmudge said, on March 11, 2010 at 1:34 pm
  
  Thanks. I would have liked it if my education was full of more relevant examples too. Unfortunately many who teach haven’t been exposed to R&D outside of the academic system, so breast cancer problems it is. *sigh*
Measuring the Real Word Error Corrector « After the Deadline said, on April 9, 2010 at 11:47 pm

[…] next two words requires trigrams and a little algebra using Bayes’ Theorem. I wrote a post on language models earlier. To bring all this together, After the Deadline uses a neural network to combine these […]
OpenOffice.org Grammar Checkers « After the Deadline said, on June 14, 2010 at 9:58 am

[…] know about After the Deadline. It’s a proofreading software service. After the Deadline uses statistical language models to offer smarter grammar and style recommendations. It also uses the same language models to detect […]
After the Deadline Bigram Corpus – Our Gift to You… « After the Deadline said, on July 20, 2010 at 8:46 pm

[…] in NLP Research, News by rsmudge on July 20, 2010 The zesty sauce of After the Deadline is our language model. We use our language model to improve our spelling corrector, filter ill-fitting grammar checker […]

Comments are closed.

After the Deadline