Measuring the Real Word Error Corrector
Before we begin: Did you notice my fancy and SEO friendly post title? Linguists refer to misused words as real word errors. When I write about real word errors in this post, I’m really referring to the misused word detector in After the Deadline.
One of my favorite features in After the Deadline is the real word error corrector. In this post, I’ll talk about this feature and how it works. I will also present an evaluation of this tool compared to Microsoft Word 2007 for Windows which has a similar feature, one they call a contextual spelling checker.
After the Deadline has a list of 1,603 words it looks at. In this list, words are grouped into confusion sets. A confusion set is two or more words that may be misused during the act of writing. Some surprise me, for example I saw someone write portrait on Hacker News when they meant portray. The words portrait and portray are an example of a confusion set.
Confusion sets are a band-aid and a limitation but they have their place for now. In an ideal program, the software would look at any word you use and try to decide if you meant some other word using various criteria at its disposal. After the Deadline doesn’t do this because it would mean storing more information about every word and how it’s used than my humble server could handle. Because of memory (and CPU constraints), I limit the words I check based on these fixed confusion sets.
To detect real word errors, After the Deadline scans your document looking for any words in that potentially misused word list. When AtD encounters one of these words, it looks at the word’s confusion set and checks if any other word is a better fit.
How does this happen? It’s pretty simple. AtD looks two words to the left and two words to the right of the potentially misused word and tries to decide which of the words from the confusion set are the best fit. This looks like:
I’ll admit, the site does portrait credibility.
Here, After the Deadline uses the following statistical features to decide which word W (portrait or portray) you want:
P(W | site, does)
P(W | credibility, END)
P(W | does)
P(W | credibility)
The probability of a word given the previous two words is calculated using a trigram. When After the Deadline learns a language, it stores every sequence of two words it finds and sequences of three words that begin or end with a confusion set. To calculate the probability of a word given the next two words requires trigrams and a little algebra using Bayes’ Theorem. I wrote a post on language models earlier. To bring all this together, After the Deadline uses a neural network to combine these statistical features into one score between 0.0 and 1.0. The word with the highest score, wins. To bias against false positives, the current word is multiplied by 10 to make sure it wins in ambiguous cases.
Let’s Measure It
Ok, so the natural question is, how well does this work? Every time I rebuild After the Deadline, I run a series of tests that check the neural network scoring function and tell me how often it’s correct and how often it’s wrong. This kind of evaluation serves as a good unit test but it’s hard to approximate real word world performance from it.
Fortunately, Dr. Jennifer Pedler’s PhD thesis has us covered. In her thesis she developed and evaluated techniques for detecting real word errors to help writers with dyslexia. Part of her research consisted of collecting writing samples from writer’s with dyslexia and annotating the errors along with the expected corrections. I took a look at her data and found that 97.8% of the 835 errors are real word errors. Perfect for an evaluation of a real word error corrector.
Many things we consider the realm of the grammar checker are actually real word errors. Errors that involve the wrong verb tense (e.g., built and build), indefinite articles (a and an), and wrong determiners (the, their, etc.) are real word errors. You may ask, can real word error detection be applied to grammar checking? Yes, others are working on it. It makes sense to test how well After the Deadline as a complete system (grammar checker, misused word detector, etc.) performs correcting these errors.
To test After the Deadline, I wrote a Sleep script to compare a corrected version of Dr. Pedler’s error corpus to the original corpus with errors. The software measures how many errors were found and changed to something (the recall) and how often these changes were correct (the precision). This test does not measure the number of elements outside the annotated errors that were changed correctly or incorrectly.
To run it:
grep -v ‘\-\-\-\-‘ corrected.txt >corrected.clean.txt
java -jar sleep.jar measure.sl corrected.clean.txt
Now we can compare one writing system to another using Dr. Pedler’s data. All we need to do is paste the error corpus into a system, accept every suggestion, and run the measure script against it. To generate the error file, I wrote another script that reads in Dr. Pedler’s data and removes the annotations:
java -jar sleep.jar errors.sl > errors.txt
Now we’re ready. Here are the numbers comparing After the Deadline to Microsoft Word 2007 on Windows, Microsoft Word 2008 on the Mac, and Apple’s Grammar and Spell checker built into MacOS X 10.6. I include Microsoft Word 2008 and the Apple software because neither of these have a contextual spell checker. They still correct some real word errors with grammar checking technology.
|MS Word 2007 – Windows||90.0%||40.8%|
|After the Deadline||89.4%||27.1%|
|MS Word 2008 – MacOS X||79.7%||17.7%|
|MacOS X Spell and Grammar Checker||88.5%||9.3%|
As you can see, Microsoft Word 2007 on Windows performs well in the recall department. Every error in the dyslexic corpus that is not in After the Deadline’s confusion sets is an automatic hit against recall. Still, the precision for both systems are similar.
Another evaluation of MS Word 2007’s real-word error detection was published by Graeme Hirst, University of Toronto in 2008. He found that MS Word has lower recall on errors. To evaluate MS Word, he randomly inserted (1:200 words) real-word errors into the Wall Street Journal corpus, and then measured the system performance. It would be a benefit to the research community (especially *cough*those of us outside of universities*cough*) if such tests were conducted on data that one could redistribute.
After running this experiment, I added the syntactically distinct words from Dr. Pedler’s confusion sets to AtD’s existing confusion sets, pushed the magical button to rebuild models, and reran these tests. I saw AtD’s recall rise to 33.9% with a precision of 86.6%. Expanding the confusion sets used in AtD will improve AtD’s real-word error correction performance.
I’ll continue to work on expanding the number of confusion sets in After the Deadline. One challenge to whole-sale importing several words is that some words create more false positives than others when checked using the method described here (hence the precision drop when adding several unevaluated new words to the misused word detector).
If you’re using Microsoft Word 2007 on Windows, you’re in good hands. Still, After the Deadline is comparable in the real-word error detection department when MS Word 2007 isn’t available.
If you’re using a Mac, you should use After the Deadline.