All About Language Models
One of the challenges with most natural language processing tasks is getting data and collapsing it into a usable model. Prepping a large data set is hard enough; once you've prepped it, you still have to turn it into a language model. In my old NLP lab (two computers I bought from Cornell University for $100), it took 18 hours to build my language models. You probably have better hardware than I did.
Save the Pain
I want to save you some pain and trouble if I can. That’s why I’m writing today’s blog post. Did you know After the Deadline has prebuilt bigram language models for English, German, Spanish, French, Italian, Polish, Indonesian, Russian, Dutch, and Portuguese? That’s 10 languages!
Also, did you know that the After the Deadline language model is a simple serialized Java object? In fact, the only dependency needed to use it is a single Java class. Now that I've got you excited, let's ask: what can you do with a language model?
Language Model API
A bigram language model has the count of every sequence of two words seen in a collection of text. From this information you can calculate all kinds of interesting things.
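If that idea feels abstract, here's a toy sketch in Python (not AtD code; the corpus is made up for illustration) of exactly what a bigram model stores:

```python
from collections import Counter

# Toy corpus -- AtD's real model is trained on tens of millions of words.
corpus = "i want to be an actor . i want to see a play .".split()

# A bigram language model is just the count of every adjacent pair of words.
bigrams = Counter(zip(corpus, corpus[1:]))

print(bigrams[("i", "want")])  # 2
print(bigrams[("to", "be")])   # 1
```

Because Counter returns 0 for unseen keys, a pair that never occurred (like "to bee") naturally gets a count of zero.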
As an administrative note, I will use the Sleep programming language for these examples. This code is trivial to port to Java but I’m on an airplane and too lazy to whip out the compiler.
Let’s load up the Sleep interactive interpreter and load the English language model. You may assume all these commands are executed from the top-level directory of the After the Deadline source code distribution.
$ java -Xmx1536M -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
import * from: lib/spellutils.jar;
$handle = openf('models/model.bin');
$model = readObject($handle);
closef($handle);
println("Loaded $model");
.
Loaded org.dashnine.preditor.LanguageModel@5cc145f9
done
And there you have it. A language model ready for your use. I'll walk you through each API method.
The count method returns the number of times the specified word was seen.
> x [$model count: "hello"]
153
> x [$model count: "world"]
26355
> x [$model count: "the"]
3046771
The Pword method returns the probability of a word. The Java way to call this is model.Pword("word").
> x [$model Pword: "the"]
0.061422322118108906
> x [$model Pword: "Automattic"]
8.063923690767558E-7
> x [$model Pword: "fjsljnfnsk"]
0.0
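A unigram probability like this is presumably just a word's count divided by the total number of words seen, with unseen words getting 0.0. A toy Python sketch of the idea (not AtD's implementation):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_word(w):
    # P(word) = count(word) / total words seen; unseen words get 0.0.
    return counts[w] / total

print(p_word("the"))  # 2 occurrences out of 6 words
print(p_word("zzz"))  # 0.0
```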
Word Probability with Context
That's the simple stuff. The fun part of a language model comes when you look at context. Imagine the sentence "I want to bee an actor". With the language model we can compare the fit of the word bee against the fit of the word be in that context. The contextual probability functions let you do exactly that.
The Pbigram1 method calculates P(word|previous), the probability of the specified word given the previous word. This is the most straightforward application of our bigram language model: we have the count of every (previous, word) pair seen in the corpus we trained with, so we simply divide that pair count by the count of previous to arrive at an answer. Here we use this contextual probability to look at be vs. bee:
> x [$model Pbigram1: "to", "bee"]
1.8397294594205855E-5
> x [$model Pbigram1: "to", "be"]
0.06296975819264979
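The arithmetic behind P(word|previous) fits in a few lines of Python (a toy count table, not AtD's code):

```python
from collections import Counter

corpus = "i want to be an actor and i want to be a writer".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_given_prev(word, prev):
    # P(word | previous) = count(previous, word) / count(previous)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(p_given_prev("be", "to"))   # every "to" here is followed by "be"
print(p_given_prev("bee", "to"))  # 0.0
```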
The Pbigram2 method calculates P(word|next), the probability of the specified word given the next word. How does it do that? With a simple application of Bayes' Theorem, which lets us flip the conditional in a probability. It's calculated as:

P(word|next) = P(next|word) * P(word) / P(next)

Here we use it to further investigate the probability of be vs. bee:
> x [$model Pbigram2: "bee", "an"]
0.0
> x [$model Pbigram2: "be", "an"]
0.014840446919206074
If you were a computer, which word would you assume the writer meant?
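The Bayes flip is easy to replicate yourself. Here's a toy Python sketch of P(word|next) over a made-up corpus (not AtD's code):

```python
from collections import Counter

corpus = "i want to be an actor . you have to be an adult .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_word(w):
    return unigrams[w] / total

def p_next_given(word, nxt):
    # P(next | word) = count(word, next) / count(word)
    if unigrams[word] == 0:
        return 0.0
    return bigrams[(word, nxt)] / unigrams[word]

def p_given_next(word, nxt):
    # Bayes' Theorem flips the conditional:
    # P(word | next) = P(next | word) * P(word) / P(next)
    if p_word(nxt) == 0.0:
        return 0.0
    return p_next_given(word, nxt) * p_word(word) / p_word(nxt)

print(p_given_next("be", "an"))   # every "an" here follows "be"
print(p_given_next("bee", "an"))  # 0.0 -- "bee" was never seen
```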
A Little Trick
These methods will also accept a sequence of two words as the parameter you're calculating the probability of. I use this trick to segment a misspelled word, inserting a space between each pair of letters, and compare the results against the other spelling suggestions.
> x [$model Pword: "New York"]
3.3241509434266565E-4
> x [$model Pword: "a lot"]
2.1988303923800437E-4
> x [$model Pbigram1: "it", "a lot"]
8.972553689218159E-5
> x [$model Pbigram2: "a lot", '0END.0']
6.511467636360339E-7
Notice 0END.0? It's a special word representing the end of a sentence. Likewise, 0BEGIN.0 represents the beginning of a sentence. The only punctuation tracked by these models is the comma (','), and you can refer to it directly.
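Here's a toy Python sketch of the segmentation idea (made-up corpus and scoring; AtD's actual handling of a two-word argument lives inside LanguageModel): score a two-word candidate by its pair count, and a single word by its unigram count.

```python
from collections import Counter

corpus = "i learned a lot from reading a lot of posts".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_phrase(phrase):
    # Two-word candidate: probability of the pair over all bigram slots.
    # Single word: plain unigram probability.
    words = phrase.split()
    if len(words) == 2:
        return bigrams[tuple(words)] / (total - 1)
    return unigrams[phrase] / total

# "a lot" was seen twice; the run-together "alot" was never seen at all.
print(p_phrase("a lot") > p_phrase("alot"))  # True
```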
Harvest a Dictionary
One of my uses for the language model is to dump a spell checker dictionary. I do this by harvesting all words that occur two or more times. When I add enough data, I'll raise this number to get a higher quality dictionary. To harvest a dictionary:
> x [$model harvest: 1000000]
[a, of, to, and, the, 0END.0, 0BEGIN.0]
This command harvests all words that occur a million or more times. As you can see, there aren't too many. The language model I have now was derived from 75 million words of text.
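Harvesting is just a threshold filter over the word counts. A toy Python sketch of the idea (not AtD's code):

```python
from collections import Counter

corpus = "the cat and the dog and the bird".split()
counts = Counter(corpus)

def harvest(threshold):
    # Keep every word seen at least `threshold` times.
    return sorted(w for w, c in counts.items() if c >= threshold)

print(harvest(2))  # only "and" and "the" clear the bar
```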
The Next Step
That's the After the Deadline language model in a nutshell. There is also a method to get the probability of a word given the two words that came before it, using trigrams. I didn't write about it here because AtD stores trigrams only for the words tracked by the misused word detector.
That said, there's a lot of fun you can have with this kind of data.
Download the After the Deadline open source distribution. You'll find the English language model at models/model.bin. You can also get spellutils.jar from the lib directory.
If you want to experiment with bigrams in other languages, the After the Deadline language pack has the trained language models for nine other languages.
Good luck and have fun.
After the Deadline is an open source grammar, style, and spell checker. Unlike other tools, it uses context to make smart suggestions for errors. Plugins are available for Firefox, WordPress, and others.