After the Deadline

Tweaking the AtD Spellchecker

Posted in NLP Research by rsmudge on September 4, 2009

Conventional wisdom says a spellchecker dictionary should have around 90,000 words. Too few words and the spellchecker marks many correct words as misspelled. Too many words and it's more likely a typo will match some rarely used word and slip through unnoticed.

Assembling a good dictionary is a challenge. Many wordlists are available online, but they're often either not comprehensive enough or so comprehensive that they contain many misspellings.

AtD tries to get around this problem by intersecting a collection of wordlists with the words it sees used in a corpus (a corpus is a directory full of books, Wikipedia articles, and blog posts I “borrowed” from you). Currently AtD accepts any word seen at least once, leading to a dictionary of 161,879 words. Too many.

Today I decided to experiment with different thresholds for how many times a word needs to be seen before it's allowed entrance into the coveted spellchecker wordlist. My goal was to increase the accuracy of the AtD spellchecker and cut the number of misspelled words in the dictionary.
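To make that concrete, here's a rough Python sketch of the kind of filter I'm describing: count how often each word appears in the corpus, then admit a word only if it shows up in a trusted wordlist and clears the threshold. The file names and function names here are made up for illustration; this isn't AtD's actual code.

    import re
    from collections import Counter

    def corpus_counts(paths):
        # Count how often each lowercased word appears across the corpus files.
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    counts.update(re.findall(r"[a-z']+", line.lower()))
        return counts

    def build_dictionary(wordlists, counts, threshold):
        # Keep a word only if a trusted wordlist contains it AND the corpus
        # has seen it at least `threshold` times (the AtD:n idea below).
        trusted = set().union(*wordlists)
        return {w for w, c in counts.items() if w in trusted and c >= threshold}

    # Sweeping the threshold shrinks the dictionary, e.g.:
    # for n in range(1, 7):
    #     print(n, len(build_dictionary(wordlists, counts, n)))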

Here are the results. AtD:n means AtD requires a word to be seen n times before including it in the dictionary.

ASpell Dataset (Hard to correct errors)

Engine            Words     Accuracy *   Present Words
AtD:1             161,879   55.0%        73
AtD:2             116,876   55.8%        57
AtD:3              95,910   57.3%        38
AtD:4              82,782   58.0%        30
AtD:5              73,628   58.5%        27
AtD:6              66,666   59.1%        23
ASpell (normal)   n/a       56.9%        14
Word 97           n/a       59.0%        18
Word 2000         n/a       62.6%        20
Word 2003         n/a       62.8%        20

Wikipedia Dataset (Easy to correct errors)

Engine            Words     Accuracy *   Present Words
AtD:1             161,879   87.9%        233
AtD:2             116,876   87.8%        149
AtD:3              95,910   88.0%        104
AtD:4              82,782   88.3%        72
AtD:5              73,628   88.3%        59
AtD:6              66,666   88.6%        48
ASpell (normal)   n/a       84.7%        44
Word 97           n/a       89.0%        31
Word 2000         n/a       92.5%        42
Word 2003         n/a       92.6%        41

Experiment data and comparison numbers from: Deorowicz, S. and Ciura, M. G., “Correcting spelling errors by modelling their causes,” International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285.

* Accuracy numbers reflect spell checking without context, since the Word and ASpell checkers are not contextual (and therefore neither is the comparison data).
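If you're curious how the two columns relate, here's a minimal sketch of how they could be computed over a list of (misspelling, intended word) pairs. The corrector here is a crude stand-in (difflib's closest match), not AtD's; "present words" are test misspellings the dictionary accepts as real words and therefore never flags.

    from difflib import get_close_matches

    def present_words(errors, dictionary):
        # Misspellings the checker would silently accept as real words.
        return sum(1 for err, _ in errors if err in dictionary)

    def accuracy(errors, dictionary):
        # Fraction of misspellings whose top suggestion is the intended word.
        # get_close_matches is a stand-in for a real spelling corrector.
        words = list(dictionary)
        hits = 0
        for err, target in errors:
            best = get_close_matches(err, words, n=1)
            if best and best[0] == target:
                hits += 1
        return hits / len(errors)

    # errors = [("acheive", "achieve"), ("teh", "the")]
    # print(accuracy(errors, dictionary), present_words(errors, dictionary))

Note why the two columns pull in opposite directions: every extra word in the dictionary is another chance for a misspelling to be "present" and sneak through.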

After seeing these results, I've decided to settle on a threshold of 2 to start, and I'll move to 3 once no one complains about 2.

I'm not too happy that the present-word count is so high, but as I add more data to AtD and raise the minimum word threshold, this problem should go away. Still, this is progress. Six months ago I had so little data that I couldn't have used a threshold of 2 even if I wanted to.

5 Responses


  1. […] cool thing about this new technology is it’s getting better every day — Raphael is constantly adding new rules, heuristics, and learning from millions of blog posts on WP.com to make the contextual […]

  2. […] cool thing about this new technology is that it’s getting better every day — Raphael is constantly adding new rules and heuristics, and the technology is learning from millions of blog posts on WP.com to […]

  3. […] Another wonderful thing about this new technology is that the tool is getting better day by day. Raphael is constantly adding new rules and heuristics, and to improve the parts that require contextual interpretation, it is learning from millions of blog posts. […]

  4. […] you want to compare these numbers with other systems, I presented numbers from similar data in another blog post. Be sure to multiply the spelling corrector accuracy with the word pool accuracy when comparing […]

  5. […] words that occur two or more times. When I add enough data, I’ll raise this number to get a higher quality dictionary. To harvest a […]
