After the Deadline

After the Deadline Bigram Corpus – Our Gift to You…

Posted in News, NLP Research by rsmudge on July 20, 2010

The zesty sauce of After the Deadline is our language model. We use our language model to improve our spelling corrector, filter ill-fitting grammar checker suggestions, and even detect if you used the wrong word.

It’s not hard to build a language model, but it can be time-consuming. Our binary model files have always been available through our GPL After the Deadline distribution.

Today, as our gift to you, we’re releasing ASCII dumps of our language models under a creative commons attribution license. There is no over-priced consortium to join and you don’t need a university affiliation to get these files.

Here are the files:

unigrams.txt.gz

This file contains each word token, a tab separator, and a count. There are 164,834 words from 76,932,676 occurrences. Our spell checker dictionary is made of words that occur two or more times in this list.

beneficently    4
Yolande 12
Fillmore's      4
kantar  2
Kayibanda       3
Solyman 2
discourses      92
Yolanda 11
discourser      1

bigrams.txt.gz

This file is a dump of each two-word sequence that occurs in our corpus. It has 5,612,483 word pairs, each associated with a count. You can use this information to calculate the probability of a word given the previous or next word.

military annexation     4
military deceptions     1
military language       1
military legislation    1
military sophistication 1
military officer        61
military riot   1
military conspiracy     1
military retirement     2

trigrams-homophones.txt.gz

This file has a limited set of trigrams (sequences of three words). Each trigram begins or ends with a word in our confusion set text file. You will need the information from the bigram corpus to construct trigram probabilities for these words.


a puppy given   1
a puppy for     4
a puppy dies    1
a puppy and     4
a puppy named   2
a puppy is      3
a puppy of      3
a puppy with    1
a puppy when    2
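
To sanity-check the files and see how they fit together, here is a minimal sketch (not part of the AtD distribution) that loads the dumps and estimates P(officer | military) by dividing a bigram count by a unigram count. The class and method names are mine; only the file names and the tab-separated layout come from the descriptions above.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

public class CorpusProbability {
    // Each line is an n-gram, a tab, and a count (assuming the bigram and trigram
    // dumps use the same layout described for unigrams.txt.gz above).
    static Map<String, Long> load(String file) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file)), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.lastIndexOf('\t');
                if (tab > 0) {
                    counts.put(line.substring(0, tab),
                               Long.parseLong(line.substring(tab + 1).trim()));
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> unigrams = load("unigrams.txt.gz");
        Map<String, Long> bigrams  = load("bigrams.txt.gz");

        // P(officer | military) = count("military officer") / count("military")
        double p = bigrams.getOrDefault("military officer", 0L)
                 / (double) unigrams.getOrDefault("military", 1L);
        System.out.println(p);
    }
}

The trigram file works the same way: divide a trigram count by the matching bigram count to get the probability of a word given its two-word context. Expect to need a generous Java heap if you load the full bigram dump this way.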

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

The Design of a Proofreading Software Service

Posted in NLP Research by rsmudge on June 9, 2010

I spent Sunday at the Computational Linguistics and Writing Workshop on Writing Processes and Authoring Aids held at NAACL-HLT 2010. There I presented After the Deadline. After the Deadline is an open source proofreading software service. If you’re curious about how contextual grammar and spell checkers work, then you’ll want to read on. Here are the materials:

Presentation

Slides (embedded on Scribd)

Paper (embedded on Scribd)

More Depth

If you want more depth, I write about AtD quite often. Here are a few related blog posts that may interest you:

Measuring the Real Word Error Corrector

Posted in NLP Research by rsmudge on April 9, 2010

Before we begin: Did you notice my fancy and SEO friendly post title? Linguists refer to misused words as real word errors. When I write about real word errors in this post, I’m really referring to the misused word detector in After the Deadline.

One of my favorite features in After the Deadline is the real word error corrector. In this post, I’ll talk about this feature and how it works. I will also present an evaluation of this tool compared to Microsoft Word 2007 for Windows which has a similar feature, one they call a contextual spelling checker.

Confusion Sets

After the Deadline has a list of 1,603 words it looks at. In this list, words are grouped into confusion sets. A confusion set is two or more words that may be misused during the act of writing. Some surprise me; for example, I saw someone write portrait on Hacker News when they meant portray. The words portrait and portray are an example of a confusion set.

Confusion sets are a band-aid and a limitation, but they have their place for now. In an ideal program, the software would look at any word you use and try to decide if you meant some other word, using various criteria at its disposal. After the Deadline doesn’t do this because it would mean storing more information about every word and how it’s used than my humble server could handle. Because of memory (and CPU) constraints, I limit the words I check based on these fixed confusion sets.

Finding Errors

To detect real word errors, After the Deadline scans your document looking for any words in that potentially misused word list. When AtD encounters one of these words, it looks at the word’s confusion set and checks if any other word is a better fit.

How does this happen? It’s pretty simple. AtD looks two words to the left and two words to the right of the potentially misused word and tries to decide which word from the confusion set is the best fit. This looks like:

I’ll admit, the site does portrait credibility.

Here, After the Deadline uses the following statistical features to decide which word W (portrait or portray) you want:

P(W | site, does)
P(W | credibility, END)
P(W | does)
P(W | credibility)
P(W)

The probability of a word given the previous two words is calculated using a trigram. When After the Deadline learns a language, it stores every sequence of two words it finds, plus every sequence of three words that begins or ends with a word from a confusion set. Calculating the probability of a word given the next two words also requires trigrams and a little algebra using Bayes’ Theorem. I wrote a post on language models earlier. To bring all this together, After the Deadline uses a neural network to combine these statistical features into one score between 0.0 and 1.0. The word with the highest score wins. To bias against false positives, the score of the current word is multiplied by 10 to make sure it wins in ambiguous cases.
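
To make this concrete, here is a minimal sketch of the decision step. It is my illustration rather than AtD’s code: the trigram features are omitted and a plain sum stands in for the neural network, but the 10x bias toward the word the writer actually typed is the idea described above.

public class ConfusionSetSketch {

    // Just names for the probabilities discussed above; not AtD's real interface.
    interface LanguageModel {
        double Pword(String word);                      // P(W)
        double Pbigram1(String previous, String word);  // P(W | previous)
        double Pbigram2(String word, String next);      // P(W | next)
    }

    static String pickWord(String typed, String[] confusionSet,
                           String previous, String next, LanguageModel lm) {
        String best = typed;
        double bestScore = Double.NEGATIVE_INFINITY;

        for (String candidate : confusionSet) {
            // AtD feeds features like these to a neural network; a sum stands in here.
            double score = lm.Pbigram1(previous, candidate)
                         + lm.Pbigram2(candidate, next)
                         + lm.Pword(candidate);

            if (candidate.equals(typed)) {
                score *= 10;   // bias against false positives: the typed word wins ambiguous cases
            }

            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}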

Let’s Measure It

Ok, so the natural question is, how well does this work? Every time I rebuild After the Deadline, I run a series of tests that check the neural network scoring function and tell me how often it’s correct and how often it’s wrong. This kind of evaluation serves as a good unit test, but it’s hard to approximate real-world performance from it.

Fortunately, Dr. Jennifer Pedler’s PhD thesis has us covered. In her thesis she developed and evaluated techniques for detecting real word errors to help writers with dyslexia. Part of her research consisted of collecting writing samples from writers with dyslexia and annotating the errors along with the expected corrections. I took a look at her data and found that 97.8% of the 835 errors are real word errors. Perfect for an evaluation of a real word error corrector.

Many things we consider the realm of the grammar checker are actually real word errors. Errors that involve the wrong verb tense (e.g., built and build), indefinite articles (a and an), and wrong determiners (the, their, etc.) are real word errors. You may ask, can real word error detection be applied to grammar checking? Yes, others are working on it. It makes sense to test how well After the Deadline as a complete system (grammar checker, misused word detector, etc.) performs correcting these errors.

To test After the Deadline, I wrote a Sleep script to compare a corrected version of Dr. Pedler’s error corpus to the original corpus with errors. The software measures how many errors were found and changed to something (the recall) and how often these changes were correct (the precision). This test does not measure the number of elements outside the annotated errors that were changed correctly or incorrectly.

To run it:

grep -v '\-\-\-\-' corrected.txt >corrected.clean.txt
java -jar sleep.jar measure.sl corrected.clean.txt
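
In terms of the counts the script gathers, the two measures boil down to something like this (my paraphrase of the description above, not the contents of measure.sl):

public class PrecisionRecall {
    // How often a change the system made matched the annotated correction.
    static double precision(int changesCorrect, int changesMade) {
        return changesCorrect / (double) changesMade;
    }

    // How many of the annotated errors the system changed to something at all.
    static double recall(int errorsChanged, int annotatedErrors) {
        return errorsChanged / (double) annotatedErrors;
    }
}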

Now we can compare one writing system to another using Dr. Pedler’s data. All we need to do is paste the error corpus into a system, accept every suggestion, and run the measure script against it. To generate the error file, I wrote another script that reads in Dr. Pedler’s data and removes the annotations:

java -jar sleep.jar errors.sl > errors.txt

Now we’re ready. Here are the numbers comparing After the Deadline to Microsoft Word 2007 on Windows, Microsoft Word 2008 on the Mac, and Apple’s Grammar and Spell checker built into MacOS X 10.6. I include Microsoft Word 2008 and the Apple software because neither of these have a contextual spell checker. They still correct some real word errors with grammar checking technology.

System                               Precision  Recall
MS Word 2007 – Windows               90.0%      40.8%
After the Deadline                   89.4%      27.1%
MS Word 2008 – MacOS X               79.7%      17.7%
MacOS X Spell and Grammar Checker    88.5%       9.3%

As you can see, Microsoft Word 2007 on Windows performs well in the recall department. Every error in the dyslexic corpus that is not in After the Deadline’s confusion sets is an automatic hit against recall. Still, the precision of both systems is similar.

You can try this experiment yourself. The code and data are available. I also created a page where you can paste in a document and accept all suggestions with one click.

Another Evaluation

Another evaluation of MS Word 2007’s real-word error detection was published by Graeme Hirst of the University of Toronto in 2008. He found that MS Word has lower recall on these errors. To evaluate MS Word, he randomly inserted real-word errors (1 in 200 words) into the Wall Street Journal corpus and then measured the system’s performance. It would benefit the research community (especially *cough*those of us outside of universities*cough*) if such tests were conducted on data that one could redistribute.

Final Thoughts

After running this experiment, I added the syntactically distinct words from Dr. Pedler’s confusion sets to AtD’s existing confusion sets, pushed the magical button to rebuild models, and reran these tests. I saw AtD’s recall rise to 33.9% with a precision of 86.6%. Expanding the confusion sets used in AtD will improve AtD’s real-word error correction performance.

I’ll continue to work on expanding the number of confusion sets in After the Deadline. One challenge with wholesale importing of new words is that some words create more false positives than others when checked using the method described here (hence the precision drop when adding several unevaluated new words to the misused word detector).

If you’re using Microsoft Word 2007 on Windows, you’re in good hands. Still, After the Deadline is comparable in the real-word error detection department when MS Word 2007 isn’t available.

If you’re using a Mac, you should use After the Deadline.

Humble Origins: PolishMyWriting.com

Posted in NLP Research, Talking to myself by rsmudge on March 19, 2010

I found an old screenshot today and thought I’d share it to give you an idea of (1) how bad my design eye is and (2) some history of After the Deadline. After the Deadline started life as a web-based style checker hosted at PolishMyWriting.com. My goal? Convince people to paste in their documents to receive feedback on the fly, while I made tons of money from ad revenue. It seemed like a good idea at the time.

Screenshot: The Original PolishMyWriting.com

PolishMyWriting.com did not check spelling, misused words, or grammar. It relied on 2,283 rules to look for phrases to avoid, jargon terms, biased phrases, clichés, complex phrases, foreign words, and redundant phrases. The most sophisticated natural language processing in the system detected passive voice and hidden verbs. I wouldn’t call it sophisticated though. I referenced a chart and wrote simple heuristics to capture the different forms of passive voice. Funny, it’s the same passive voice detector used in After the Deadline today.

This rule-based system presents all of its suggestions (with no filtering or reordering) to the end user. A Hacker News user agreed with 50% of what it had to say, and that’s not too bad. After the Deadline today looks at the context of any phrase it flags and tries to decide whether a suggestion is appropriate or not. A recent reviewer of After the Deadline says he agrees with 75% of what it says. An improvement!

How PolishMyWriting.com Worked

My favorite part of PolishMyWriting.com was how it stored rules. All the rules were collapsed into a tree. From each word position in the document, the system would walk the tree looking for the deepest match. In this way PolishMyWriting.com only had to evaluate rules that were relevant to a given position in the text. It was also easy for a match to fail right away (hey, the current word doesn’t match any of the possible starting words for a rule). With this I was able to create as many rules as I liked without impacting the performance of the system. The rule-tree in After the Deadline today has 33,331 end-states. Not too bad.

PolishMyWriting.com Rule Tree

The rule-tree above matches six rules. Suppose I give it the sentence: I did not remember to write a lot of posts. The system would start with I and realize there is nowhere to go. The next word is did. It would look at did and check if any children of did in the tree match the word after did in the sentence. In this case not matches. The system repeats this process from not. The next word that matches is remember. Here PolishMyWriting.com would present the suggestions for did not remember to the user. If the search had failed to turn up an end state for did not remember, the system would advance to not and repeat the process from the beginning. Since it found a match, the process would start again from to write a lot of posts. The result: I [forgot] to write [many, much] posts.

What happened to PolishMyWriting.com?

I eventually dumped the PolishMyWriting.com name, mainly because I kept hearing jokes from people online (and in person) about how great it was that I developed a writing resource for the Polish people. It still exists, mainly as a place to demonstrate After the Deadline.

And don’t forget, if PolishMyWriting.com helps you out… you can link to it using our nifty banner graphic. 😉

All About Language Models

Posted in NLP Research by rsmudge on March 4, 2010

One of the challenges with most natural language processing tasks is getting data and collapsing it into a usable model. Prepping a large data set is hard enough. Once you’ve prepped it, you have to put it into a language model. In my old NLP lab (two computers I bought from Cornell University for $100), it took 18 hours to build my language models. You probably have better hardware than I did.

Save the Pain

I want to save you some pain and trouble if I can. That’s why I’m writing today’s blog post. Did you know After the Deadline has prebuilt bigram language models for English, German, Spanish, French, Italian, Polish, Indonesian, Russian, Dutch, and Portuguese? That’s 10 languages!

Also, did you know that the After the Deadline language model is a simple serialized Java object? In fact, the only dependency needed to use it is one Java class. Now that I’ve got you excited, let’s ask… what can you do with a language model?
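
If you’d rather skip Sleep, here is a hedged sketch of loading the model from plain Java. The model path and spellutils.jar come from the AtD distribution; the rest is standard object deserialization, and you’ll see the concrete class name in the interactive session below.

import java.io.FileInputStream;
import java.io.ObjectInputStream;

public class LoadModel {
    public static void main(String[] args) throws Exception {
        // Keep lib/spellutils.jar on the classpath so the serialized class can be resolved.
        try (ObjectInputStream in = new ObjectInputStream(
                new FileInputStream("models/model.bin"))) {
            Object model = in.readObject();   // an org.dashnine.preditor.LanguageModel
            System.out.println("Loaded " + model);
        }
    }
}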

Language Model API

A bigram language model has the count of every sequence of two words seen in a collection of text. From this information you can calculate all kinds of interesting things.

As an administrative note, I will use the Sleep programming language for these examples. This code is trivial to port to Java but I’m on an airplane and too lazy to whip out the compiler.

Let’s load up the Sleep interactive interpreter and load the English  language model. You may assume all these commands are executed from the top-level directory of the After the Deadline source code distribution.

$ java -Xmx1536M -jar lib/sleep.jar 
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
import * from: lib/spellutils.jar;
$handle = openf('models/model.bin');
$model = readObject($handle);
closef($handle);
println("Loaded $model");
.
Loaded org.dashnine.preditor.LanguageModel@5cc145f9
done

And there you have it. A language model ready for your use. I’ll walk you through each API method.

Count Words

The count method returns the number of times the specified word was seen.

Examples:

> x [$model count: "hello"]
153
> x [$model count: "world"]
26355
> x [$model count: "the"]
3046771

 

Word Probability

The Pword method returns the probability of a word. The Java way to call this is model.Pword(“word”).

> x [$model Pword: "the"]
0.061422322118108906
> x [$model Pword: "Automattic"]
8.063923690767558E-7
> x [$model Pword: "fjsljnfnsk"]
0.0

 

Word Probability with Context

That’s the simple stuff. The fun part of the language model comes in when you can look at context. Imagine the sentence: “I want to bee an actor”. With the language model we can compare the fit of the word bee with the fit of the word be given the context. The contextual probability functions let you do that.

Pbigram1(“previous”, “word”)

This method calculates P(word|previous), or the probability of the specified word given the previous word. This is the most straightforward application of our bigram language model. After all, we have the count of every “previous word” pair seen in the corpus we trained with. We simply divide this by the count of previous to arrive at an answer. Here we use our contextual probability to look at be vs. bee.

> x [$model Pbigram1: "to", "bee"]
1.8397294594205855E-5
> x [$model Pbigram1: "to", "be"]
0.06296975819264979

 

Pbigram2(“word”, “next”)

This method calculates P(word|next), or the probability of the specified word given the next word. How does it do it? It’s a simple application of Bayes’ Theorem. Bayes’ Theorem lets us flip the conditional in a probability. It’s calculated as: P(word|next) = P(next|word) * P(word) / P(next). Here we use it to further investigate the probability of be vs. bee:

> x [$model Pbigram2: "bee", "an"]
0.0
> x [$model Pbigram2: "be", "an"]
0.014840446919206074

 

If you were a computer, which word would you assume the writer meant?

A Little Trick

These methods will also accept a sequence of two words in the parameter that you’re calculating the probability of. I use this trick to segment a misspelled word with a space between each pair of letters and compare the results to the other spelling suggestions.

> x [$model Pword: "New York"]
3.3241509434266565E-4
> x [$model Pword: "a lot"]
2.1988303923800437E-4
> x [$model Pbigram1: "it", "a lot"]
8.972553689218159E-5
> x [$model Pbigram2: "a lot", '0END.0']
6.511467636360339E-7

You’ll notice 0END.0. This is a special word. It represents the end of a sentence. 0BEGIN.0 represents the beginning of a sentence. The only punctuation tracked by these models is the ','. You can refer to it directly.

Harvest a Dictionary

One of my uses for the language model is to dump a spell checker dictionary. I do this by harvesting all words that occur two or more times. When I add enough data, I’ll raise this number to get a higher quality dictionary. To harvest a dictionary:

> x [$model harvest: 1000000]
[a, of, to, and, the, 0END.0, 0BEGIN.0]

 

This command harvests all words that occur a million or more times. As you can see, there aren’t too many. The language model I have now was derived from 75 million words of text.

The Next Step

That’s the After the Deadline language model in a nutshell. There is also a method to get the probability of a word given the two words that came before it. This is done using trigrams. I didn’t write about it here because AtD stores trigrams for words tracked by the misused word detector only.

That said, there’s a lot of fun you can have with this kind of data.

Download the After the Deadline open source distribution. You’ll find the English language model at models/model.bin. You can also get spellutils.jar from the lib directory.

If you want to experiment with bigrams in other languages, the After the Deadline language pack has the trained language models for nine other languages.

The English language model was trained from many sources including Project Gutenberg and Wikipedia. The other language models were trained from their respective Wikipedia dumps.

Good luck and have fun.

After the Deadline is an open source grammar, style, and spell checker. Unlike other tools, it uses context to make smart suggestions for errors. Plugins are available for Firefox, WordPress, and others.

N-Gram Language Guessing with NGramJ

Posted in Multi-Lingual AtD, NLP Research by rsmudge on February 8, 2010

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and the Java API. I’ll also show you how to generate a new language profile. I’m doing this so I don’t have to figure out how to do it again.

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

$ cat >a.txt
This is a test.
$ java -jar cngram.jar -lang2 a.txt
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ cat >b.txt
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ java -jar cngram.jar -lang2 b.txt
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077

Using

Something I like about the API for this program: it’s simple. It is also thread-safe. You can instantiate a static reference for the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

import de.spieleck.app.cngram.NGramProfiles;

protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
 NGramProfiles.Ranker ranker = profiles.getRanker();
 ranker.account(text);
 NGramProfiles.RankResult result = ranker.getRankResult();
 return result.getName(0);
}

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles but you may have a need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain-text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

$ java -jar cngram.jar -create id_big id_corpus.txt
new profile 'id_big.ngp' was created.

This will create an id_big.ngp file. I also noticed this file is huge: several hundred kilobytes compared to the 30K of the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script to read in the id_big.ngp file and cut any 3-gram and 4-gram sequences that occur fewer than 20K times. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1000 as a cut-off. This leads me to believe they were trained on 6MB of text data versus my 114MB of Indonesian text.

Here is the script:

%grams = ohash();
setMissPolicy(%grams, { return @(); });

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   if (strlen($gram) <= 2 || $count > 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

sub sortTuple {
   return $2[1] <=> $1[1];
}

println($banner);

printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

To run the script:

$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:

$ cat >c.txt
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ java -jar cngram.jar -lang2 c.txt
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872

As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.

How I Trie to Make Spelling Suggestions

Posted in NLP Research by rsmudge on January 29, 2010

Spell checking is a three-step process.  Check if a word is in a dictionary, generate potential suggestions, and then sort the suggestions–hopefully with the intended word on top.  Dr. Peter Norvig wrote a great essay on this and I suggest you read it.

Knowing about this three-step process, can you guess what the slowest part of it is?  It’s not the dictionary check.  It’s not the sorting of the suggestions either.  Generating the potential suggestions is the killer.  Today I’ll discuss a clever way to speed this process up using Tries.

The Problem: How I Generate Suggestions

The AtD spelling corrector generates potential suggestions by considering all words within two edits of the misspelled word.  An edit is inserting, deleting, or substituting a letter.  Transposing two letters is also an edit.  From my own tests I found 96% of the time the correct word is within two edits.  1.5% of the time the desired word is within three edits.  Beyond that, who cares?

Why is this operation so slow?  The naïve algorithm is to generate all possible edits to the misspelled word to get every edit within an edit distance of one.  For a word of length n, we have 54n+25 edits, assuming a lowercase alphabet of a-z.  To get all words within an edit distance of two, apply all possible edits to the previously generated edits.  This number gets big quickly; forget about three edits with this algorithm.

The Trie

A Trie is a tree data structure for storing values associated with a key.  Each character of the key represents a branch of the Trie.  Retrieving a value consists of finding the branch associated with the first character of the key, chopping it off, and repeating this process until there is no key left.  Tries are nice because common prefixes of the keys are shared, making them space-efficient for storing things like dictionaries.

Here is the Sleep code to build a Trie:

sub trie {
	local('$trie $word $x $root $letter');

	$trie = %();

	foreach $word ($1) {
		$root = $trie;

		for ($x = 0; $x < strlen($word); $x++) {
			$letter = charAt($word, $x);

			if ($letter !in $root) {
				$root[$letter] = %();
			}
			$root = $root[$letter];
		}

		$root['__'] = $word;
	}
	return $trie;
}

This function creates a Trie represented as a Sleep hash.  Each branch in the Trie has letters which point to other Trie branches. A branch that represents a word will have a __ key with the associated word.

A word is added to the Trie by looking for the first letter of the word in the root branch. If the letter is not there, a new branch is made. This first letter is then removed from the word and the branch associated with the letter becomes the root branch. This process is repeated until there are no letters left in the word. At this point the __ key is set to mark that this branch represents a complete word.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
>
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
include("trie.sl");
$trie = trie(@("apple", "bat", "bart", "battle", "cat"));
println($trie);
.
%(b =>
       %(a =>
              %(t =>
                     %(t =>
                            %(l =>
                                   %(e =>
                                          %(__ => 'battle')
                                    )
                             ),
                __ => 'bat'),
                r =>
                     %(t =>
                            %(__ => 'bart')
                      )
                )
         ),
  c => %(a => %(t => %(__ => 'cat'))),
  a => %(p => %(p => %(l => %(e => %(__ => 'apple')))))
 )

If you’ve been exposed to Tries before, you’ve probably seen them used to check if a word is in the dictionary or not.  Here is the Sleep code to test a word:

sub inTrie {
	local('$x $word $trie $letter');
	($word, $trie) = @_;

	for ($x = 0; $x < strlen($word); $x++) {
		$letter = charAt($word, $x);

		if ($letter !in $trie) {
			return;
		}
		$trie = $trie[$letter];
	}

	return iff ($trie['__'] eq $1);
}

This function walks the Trie using a process similar to adding a word. If there is no branch for the current letter then the test fails. If we’ve walked through all the letters, the __ key is checked to see if we indeed have a word match.

Typically __ doesn’t contain the word. It’s usually a boolean flag. I’m lazy though and put the word in this slot to make debugging the Trie code easier and to make it easier for me when testing if a word is in the Trie.

Generating Suggestions

AtD uses this Trie structure to generate all edits for a misspelled word.  The Trie is nice because I only have to traverse paths that may yield words and I don’t have to keep a lot of traversal information in memory.  Here is the code:

sub edits {
    local('$trie $word $results $depth $root $branch $letter');
	($trie, $word, $results, $depth) = @_;

	if (strlen($word) == 0 && $depth >= 0 && $trie['__'] ne '') {
		$results[ $trie['__'] ] = 1;
	}

	if ($depth >= 1) {

		# deletion. [remove the current letter, and try it on the current branch--see what happens]
		if (strlen($word) > 1) {
			edits($trie, substr($word, 1), $results, $depth - 1);
		}
		else {
			edits($trie, "", $results, $depth - 1);
		}

		foreach $letter => $branch ($trie) {
			if ($letter eq '__') { continue; }

			# insertion. [pass the current word, no changes, to each of the branches for processing]
			edits($branch, $word, $results, $depth - 1);

			# substitution. [pass the current word, sans first letter, to each of the branches for processing]
			if (strlen($word) > 1) {
				edits($branch, substr($word, 1), $results, $depth - 1);
			}
			else {
				edits($branch, "", $results, $depth - 1);
			}
		}

		# transposition. [swap the first and second letters]
		if (strlen($word) > 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0) . substr($word, 2), $results, $depth - 1);
		}
		else if (strlen($word) == 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0), $results, $depth - 1);
		}
	}

	# move on to the next letter. (no edits have happened)

	if (strlen($word) >= 1 && charAt($word, 0) in $trie) {
		$letter = charAt($2, 0);
		if (strlen($word) > 1) {
			edits($trie[$letter], substr($word, 1), $results, $depth);
		}
		else if (strlen($word) == 1) {
			edits($trie[$letter], "", $results, $depth);
		}
	}

	# results are stored in a hash to prevent duplicate words
	return keys($results);
}

This function considers four types of edits: deletion, insertion, substitution, and transposition. This function returns a list of words within N edits from the specified word. It does this by considering the first letter of the word, the rest of the word, and the current branch of the Trie.

It handles deletions by applying this function to the current Trie branch and the rest of the word. In this way, all edits that could be generated from this letter being deleted are found.

With each call to edits, the function subtracts 1 from the max depth counter. When the max depth value is zero, no more traversals into the Trie are allowed. This counter makes sure we generate only words within the desired number of edits.

Insertion iterates through all child branches (representing each possible next letter) and applies the edit function to those branches with the current as an argument.  In this way each branch is treated as a possible insertion before the current letter in the word.

Substitution iterates through each child branch and applies the edit function to those branches with the rest of the word as an argument.

Transposition swaps the first and second letters of the current word and applies the edit function on this new word using the current branch.

A Trie branch that maps to a word is added to the suggestion pool when it’s called with an empty string for the word and the depth counter is greater than zero (meaning we’re within the allowed number of edits).

Once these steps have completed, the algorithm finds the Trie branch for the next letter in the word and calls edits with the branch for that letter as the Trie and the rest of the word.  The depth counter isn’t changed since no edits were made; this step just advances the algorithm to the next letter.

Here is a printout of edits applied to ‘sucess’ using a Trie constructed from the public domain 1939 Webster’s dictionary.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.

include("trie.sl");

$dictionary = readAll(openf("/usr/share/dict/words"));
println(size($dictionary));
.
98569

$trie = trie($dictionary);
printAll(edits($trie, "sucess", %(), 2));
.
surest
sunless
sauces
sucks
sauce's
excess
recess
stress
Luce's
sices
access
duress
saucers
duchess
supers
guess
suers
suckers
success
sues

I’ve found this method is faster than the naïve edits generation algorithm. Let me know how it works for you.

This code is from the After the Deadline software service which is available under the LGPL license. You can find the code to the rest of the system at http://open.afterthedeadline.com

Thoughts on a tiny contextual spell checker

Posted in NLP Research by rsmudge on December 9, 2009

Spell checkers have a bad rap because they give poor suggestions, don’t catch real word errors, and usually have out of date dictionaries. With After the Deadline I’ve made progress on these three problems. The poor suggestions problem is solved by looking at context as AtD’s contextual spell checker does. AtD again uses context to help detect real word errors. It’s not flawless but it’s not bad either. AtD has also made progress on the dictionary front by querying multiple data sources (e.g., Wikipedia) to find missing words.

Problem Statement

So despite this greatness, contextual spell checking isn’t very common. I believe this is because contextual spell checking requires a language model. Language models keep track of every sequence of two words seen in a large corpus of text. From this data the spell checker can calculate P(currentWord|previousWord) and P(currentWord|nextWord). For a client side application, this information amounts to a lot of memory or disk space.

Is it possible to deliver the benefits of a contextual spell checker in a smaller package?

Why would someone want to do this? If this could be done, then it’d be possible to embed the tiny contextual spell checker into programs like Firefox, OpenOffice, and others. Spell check as you type would be easy and responsive as the client could download the library and execute everything client side.

Proposed Solution

I believe it’s possible to reduce the accuracy of the language model without greatly impacting its benefits. Context makes a difference when spell checking (because it’s extra information), but I think the mere idea that “this word occurs in this context a lot more than this other one” is enough information to help the spell checker. Usually the spell checker is making a choice between 3-6 words anyways.

One way to store low fidelity language model information is to associate each word with some number of bloom filters. Each bloom filter would represent a band of probabilities. For example a word could have three bloom filters associated with it to keep track of words occurring in the top-25%, middle-50%, and bottom-25%. This means the data size for the spell checker will be N*|dictionary| but this is better than having a language model that trends towards a size of |dictionary|^2.

A bloom filter is a data structure for tracking whether something belongs to a set or not. They’re very small and the trade-off is they may give false positives but they won’t give false negatives. It’s also easy to calculate the false positive rate in a bloom filter given the number of set entries expected, the bloom filter size, and the number of hash functions used. To optimize for space, the size of the bloom filter for each band and word could be determined from the language model.
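
To make the idea concrete, here is a toy sketch of one band’s filter. This is my illustration, not AtD code: the seeded string hash and the sizes are placeholders, and a real implementation would size each filter from the language model as described above.

import java.util.BitSet;

public class BandFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BandFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // A simple seeded string hash; stands in for a proper family of hash functions.
    private int index(String key, int seed) {
        int h = seed;
        for (int i = 0; i < key.length(); i++) {
            h = h * 31 + key.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    // Record that "word" was seen next to "context" in this probability band.
    public void add(String word, String context) {
        for (int i = 1; i <= hashes; i++) {
            bits.set(index(word + "|" + context, i));
        }
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String word, String context) {
        for (int i = 1; i <= hashes; i++) {
            if (!bits.get(index(word + "|" + context, i))) {
                return false;
            }
        }
        return true;
    }
}

A word would then carry one of these filters per band (say top-25%, middle-50%, and bottom-25%), and the spell checker would ask each band whether a candidate word appears next to the context word.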

If this technique works for spelling, could it also work for misused word detection? Imagine tracking trigrams (sequences of three words) for each potentially misused word using a bloom filter.

After looking further into this, it looks like others have attacked the problem of using bloom filters to represent a language model. This makes the approach even more interesting now.


Generating a Plain Text Corpus from Wikipedia

Posted in Multi-Lingual AtD, NLP Research by rsmudge on December 4, 2009

AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of Extracting Text from Wikipedia by Evan Jones.

Evan’s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I’ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:

  • finish before I’m old  enough to collect social security
  • tolerate errors and run to completion without my intervention

Today, we’re going to do the French Wikipedia. I’m working on multi-lingual AtD and French seems like a fun language to go with. Our systems guy, Stephane, speaks French. That’s as good a reason as any.

Step 1: Download the Wikipedia Extractors Toolkit

Evan made available a bunch of code for extracting plaintext from Wikipedia. To meet the two goals above I made some modifications*. So the first thing you’ll want to do is download this toolkit and extract it somewhere:

wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text

(* see the CHANGES file to learn what modifications were made)

Step 2: Download and Extract the Wikipedia Data Dump

You can do this from http://download.wikimedia.org/. The archive you’ll want for any language is *-pages-articles.xml.bz2. Here is what I did:

wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2

Step 3: Extract Article Data from the Wikipedia Data

Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.

Create a directory for your output and run xmldump2files.py against the .XML file you obtained in the last step:

mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out

This step will take a few hours depending on your hardware.

Step 4: Parse the Article Wiki Markup into XML

The next step is to take the extracted articles and parse the Wikimedia markup into an XML form that we can later recover the plain text from. There is a shell script to generate XML files for all the files in our out directory. If you have a multi-core machine, I don’t recommend running it. I prefer using a shell script for each core that executes the Wikimedia-to-XML command on part of the file set (aka poor man’s concurrent programming).

To generate these shell scripts:

find out -type f | grep '\.txt$' >fr.files

Then split fr.files into several .sh files:

java -jar sleep.jar into8.sl fr.files

You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.

cat >launch.sh
./files0.sh &
./files1.sh &
./files2.sh &
...
./files15.sh &
^D

Next, launch these shell scripts.

./launch.sh

Unfortunately this journey is filled with peril. The command run by these scripts for each file has the following comment: Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an “infinite” loop sometimes. This statement is true. The PHP processes will freeze or crash. My first time through this process, I had to manually watch top and kill errant processes, which made everything take longer than it should. To help, I’ve written a script that kills any php process that has run for more than two minutes. To launch it:

java -jar sleep.jar watchthem.sl

Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.

Step 5: Extract Plain Text from the Articles

Next we want to extract the article plaintext from the XML files. To do this:

./wikiextract.py out french_plaintext.txt

This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.

Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing

If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I’ve included a script to do this:

mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus

And that’s it. You now have a language corpus extracted from Wikipedia.

Learning from your mistakes — Some ideas

Posted in NLP Research by rsmudge on November 23, 2009

I’m often asked if AtD gets smarter the more it’s used. The answer is not yet. To stimulate the imagination and give an idea of what’s coming, this post presents some ideas about how AtD can learn the more you use it.

There are two immediate sources I can harvest: the text you send and the phrases you ignore.

Learning from Submitted Text

The text you send is an opportunity for AtD to learn from the exact type of writing that is being checked. The process would consist of saving parts of submitted documents that pass some correctness filter and retrieving these for use in the AtD language model later.

Submitted documents can be added directly to the AtD training corpus which will impact almost all features positively. AtD uses statistical information from the training data to find the best spelling suggestions, detect misused words, and find exceptions to the style and grammar rules.

More data will let me create a spell checker dictionary with fewer misspelled or irrelevant words. A larger training corpus means more words to derive a spell checker dictionary from. AtD’s spell checker requires that a word be seen a certain number of times before it’s included in the dictionary. With enough automatically obtained data, AtD can find the highest threshold that still results in a dictionary of about 120K words.
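
As a hypothetical sketch of that threshold search (nothing in AtD does this today): given word counts from the training corpus, keep raising the cut-off until the dictionary would fall below the target size.

import java.util.Map;

public class DictionaryThreshold {
    static int findThreshold(Map<String, Long> wordCounts, long targetSize) {
        int threshold = 2;   // AtD's current minimum: a word must be seen at least twice
        while (true) {
            final long next = threshold + 1;
            long size = wordCounts.values().stream()
                                  .filter(c -> c >= next).count();
            if (size < targetSize) {
                return threshold;   // raising the cut-off further shrinks the dictionary below target
            }
            threshold++;
        }
    }
}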

Learning from Ignored Phrases

When you click “Ignore Always”, this preference is saved to a database. I have access to this information for the WordPress.com users and went through it once to see what I could learn.

This information could be used to find new words to add to AtD. Any word that is ignored by multiple people may be a correct spelling that the AtD dictionary is missing. I can set a threshold of how many times a word must be ignored before it’s added to the AtD word lists. This is a tough problem, though, as I don’t want commonly ignored errors like ‘alot’ finding their way into the AtD dictionary; some protection against this is necessary.

Misused words that are flagged as ignored represent an opportunity to focus data collection efforts on that word. This is tougher but I may be able to query a service like wordnik to get more context on the word and add the extra context to the AtD corpus. If a word passes a threshold of being ignored by a lot of people, it may also be a good idea to consider removing it from the misused word list automatically. The statistical detection approach doesn’t work well with some words.

Aiming Higher

These ideas represent some of the low-hanging fruit to make AtD learn as you use the service. Let’s imagine AtD could track any behavior and the context around it. There are good learning opportunities when you accept a suggestion, and the misused word feature has more room to improve if context is attached to your accepting or ignoring a suggestion.

These are my thoughts, what are yours?