After the Deadline

Tweaking the AtD Spellchecker

Posted in NLP Research by rsmudge on September 4, 2009

Conventional wisdom says a spellchecker dictionary should have around 90,000 words.  Too few words and the spellchecker will mark many correct things as wrong.   Too many words and it’s more likely a typo could result in a rarely used word going unnoticed by most spellcheckers.

Assembling a good dictionary is a challenge.  Many wordlists are available online but often times these are either not comprehensive or they’re too comprehensive and contain many misspellings.

AtD tries to get around this problem by intersecting a collection of wordlists with words it sees used in a corpus ( a corpus is a directory full of books, Wikipedia articles, and blog posts I “borrowed” from you).  Currently AtD accepts any word seen once leading to a dictionary of  161,879 words.  Too many.

Today I decided to experiment with different thresholds for how many times a word needs to be seen before it’s allowed entrance into the coveted spellchecker wordlist.  My goal was to increase the accuracy of the AtD spellchecker and drop the number of misspelled words in the dictionary.

Here are the results, AtD:n means AtD requires a word be seen n times before AtD includes it in the dictionary.

ASpell Dataset (Hard to correct errors)

Engine Words Accuracy * Present Words
AtD:1 161,879 55.0% 73
AtD:2 116,876 55.8% 57
AtD:3 95,910 57.3% 38
AtD:4 82,782 58.0% 30
AtD:5 73,628 58.5% 27
AtD:6 66,666 59.1% 23
ASpell (normal) n/a 56.9% 14
Word 97 n/a 59.0% 18
Word 2000 n/a 62.6% 20
Word 2003 n/a 62.8% 20

Wikipedia Dataset (Easy to correct errors)

Engine Words Accuracy * Present Words
AtD:1 161,879 87.9% 233
AtD:2 116,876 87.8% 149
AtD:3 95,910 88.0% 104
AtD:4 82,782 88.3% 72
AtD:5 73,628 88.3% 59
AtD:6 66,666 88.62% 48
ASpell (normal) n/a 84.7% 44
Word 97 n/a 89.0% 31
Word 2000 n/a 92.5% 42
Word 2003 n/a 92.6% 41

Experiment data and comparison numbers from: Deorowicz, S., Ciura M. G., Correcting spelling errors by modelling their causes, International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285.

* Accuracy numbers show spell checking without context as the Word and ASpell checkers are not contextual (and therefor the data isn’t either).

After seeing these results, I’ve decided to settle on a threshold of 2 to start and I’ll move to 3 after no one complains about 2.

I’m not too happy that the present word count is so sky high but as I add more data to AtD and up the minimum word threshold this problem should go away.  This is progress though.  Six months ago I had so little data I wouldn’t have been able to use a threshold of 2, even if I wanted to.