Thoughts on a tiny contextual spell checker
Spell checkers have a bad rap because they give poor suggestions, don’t catch real word errors, and usually have out of date dictionaries. With After the Deadline I’ve made progress on these three problems. The poor suggestions problem is solved by looking at context as AtD’s contextual spell checker does. AtD again uses context to help detect real word errors. It’s not flawless but it’s not bad either. AtD has also made progress on the dictionary front by querying multiple data sources (e.g., Wikipedia) to find missing words.
So despite this greatness, contextual spell checking isn’t very common. I believe this is because contextual spell checking requires a language model. Language models keep track of every sequence of two words seen in a large corpus of text. From this data the spell checker can calculate
P(currentWord|nextWord). For a client side application, this information amounts to a lot of memory or disk space.
Is it possible to deliver the benefits of a contextual spell checker in a smaller package?
Why would someone want to do this? If this could be done, then it’d be possible to embed the tiny contextual spell checker into programs like Firefox, OpenOffice, and others. Spell check as you type would be easy and responsive as the client could download the library and execute everything client side.
I believe it’s possible to reduce the accuracy of the language model without greatly impacting its benefits. Context makes a difference when spell checking (because it’s extra information), but I think the mere idea that “this word occurs in this context a lot more than this other one” is enough information to help the spell checker. Usually the spell checker is making a choice between 3-6 words anyways.
One way to store low fidelity language model information is to associate each word with some number of bloom filters. Each bloom filter would represent a band of probabilities. For example a word could have three bloom filters associated with it to keep track of words occurring in the top-25%, middle-50%, and bottom-25%. This means the data size for the spell checker will be
N*|dictionary| but this is better than having a language model that trends towards a size of
A bloom filter is a data structure for tracking whether something belongs to a set or not. They’re very small and the trade-off is they may give false positives but they won’t give false negatives. It’s also easy to calculate the false positive rate in a bloom filter given the number of set entries expected, the bloom filter size, and the number of hash functions used. To optimize for space, the size of the bloom filter for each band and word could be determined from the language model.
If this technique works for spelling, could it also work for misused word detection? Imagine tracking trigrams (sequences of three words) for each potentially misused word using a bloom filter.
After looking further into this, it looks like others have attacked the problem of using bloom filters to represent a language model. This makes the approach even more interesting now.