Making the spell checker… more awesome.

Posted in NLP Research by rsmudge on November 17, 2009

I don’t talk much about the After the Deadline spell checker. Many people believe spell checking is a solved problem and they must “work the same way”. AtD does more than spell checking, so I talk about these other things.

Despite the lack of talk, AtD’s spell checker is awesome and I’m constantly improving it. That’s what I’m going to write about today. This past weekend I was writing a post and I tried to spell “conoisur”. As you call tell, I don’t know how to spell it and my attempt isn’t even close. It’s bad enough that I make spell checkers and can’t spell, it’s terrible that AtD couldn’t give me any suggestions. I felt left out in the cold.

Suggestions before this post 🙂

Most spelling errors are one or two changes away from the intended word. AtD takes advantage of this and limits its suggestion search to words within two changes. By limiting the search space for words, AtD has less to choose from. This means AtD give you the right answer more often.

Peter Norvig conducted a quick experiment where he learned that 98.9% of the errors in the Birkbeck spelling error corpus were within two edits and 76% were one edit away. These numbers are close to what I found in my own experiments months ago. I’d present those numbers but I… lost them. *cough*

From these numbers, I felt it was safe to ignore words three or more edits away from the misspelling. This works well except when I try to spell words like connoisseur or bureaucracy. For my attempts, AtD generated no suggestions.

Then it hit me! If there are no suggestions within two edits… why don’t I try to find words within three edits. And if there are still no suggestions then, why I don’t I try words within four edits. I could go on like this forever. Unfortunately nothing is free and doing this would kill AtD’s performance. So I decided to limit my experiment to find words within three edits when no words are within two edits.

AtD’s spellchecker uses a neural network to score suggestions and compare them to each other. During training the neural network converges on an optimal weighting for each feature I give it. One of the features I give to the neural network is the edit distance normalized between 0.0 and 1.0. Edit distance is weighted highly as a value close to 1.0 is correct 76% of the time. I feared that introducing high edit distances that are almost always correct would create something that looks like a valley to the neural network. Meaning a high value is usually correct, a low value is usually correct, and the stuff in the middle is probably not so correct. Neural networks will make good guesses but they’re dependent on being told the story in the right way. This valley thing wasn’t going to work. So I modified the features so that an edit distance of 1 is a weight of 1.0 and everything else is 0.0.

Here are the numbers on AtD’s spell checker before I made these changes:

Word Pool Accuracy

Data Set	Accuracy
sp_test_wpcm_nocontext.txt	97.80%
sp_test_aspell_nocontext.txt	79.06%

Spelling Corrector Accuracy

Data Set	Neural Network	Freq/Edit	Random
sp_test_gutenberg_context1.txt	93.97%	83.16%	29.71%
sp_test_gutenberg_context2.txt	77.14%	59.68%	19.68%

The word pool accuracy is a measure of how often the pool of suggested words AtD generates has the intended word for a misspelled one in the test data. If the correct word is not here, AtD won’t be able to suggest it. The corrector accuracy is a measurement of how accurate AtD is at sorting the suggestion pool and putting the intended word on the top. The neural network score is what AtD uses in production and includes context. The joint frequency/edit distance score is similar to the statistical corrector in How to write a spelling corrector. Without context I don’t expect most spell checkers will do much better than that.

Here are the numbers with the changes:

Word Pool Accuracy

Data Set	Accuracy
sp_test_wpcm_nocontext.txt	98.19% +0.39%
sp_test_aspell_nocontext.txt	82.58% +3.52%

Spelling Corrector Accuracy

Data Set	Neural Network	Freq/Edit	Random
sp_test_gutenberg_context1.txt	94.51% +0.54%	83.08%	30.18%
sp_test_gutenberg_context2.txt	79.38% +2.24%	60.31%	23.69%

What this means for you

These little changes let AtD expand its search for suggestions when none were found. This means better recommendations for hard to spell words without impacting AtD’s stellar correction accuracy. I hope you enjoy it.

Suggestions you'll get now 🙂

How does this compare to MS Word, ASpell, and others?

If you want to compare these numbers with other systems, I presented numbers from similar data in another blog post. Be sure to multiply the spelling corrector accuracy with the word pool accuracy when comparing these numbers in the ones in the other post. For example: 0.9451 * 0.9891 = 0.9348 = 93.48%

If you’d like to play around with spelling correction and neural networks, consider downloading the After the Deadline source code. Everything you need to conduct your own experiments is included and it’s open source. The data sets used here are available in the bootstrap data distribution.

Tagged with: algorithms, edit distance, linguistics, NLP, science, spell check, spelling

Comments Off on Making the spell checker… more awesome.

After the Deadline