Tweaking the Misused Word Detector
I’ve been hard at work improving the statistical misused word detection in After the Deadline. Nothing in the world of NLP is ever perfect. If computers understood English, we’d have bigger problems, like rapidly evolving robots trying to enslave us. I’m no longer interested in making robots to enslave people; that’s why I left my old job.
AtD’s statistical misused word detection relies on a database of confusion sets. A confusion set is a group of words that are likely to be mistaken for each other. An example is cite, sight, and site. I tend to type right when I mean write, but only when I’m really tired.
AtD keeps track of about 1,500 of these likely-to-be-confused words. When it encounters one in your writing, it checks whether any of the other words from that confusion set are a statistically better fit. If one of them is, your original word is marked as an error and you’re presented with the options.
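The check described above can be sketched roughly as follows. This is a minimal illustration, not AtD’s actual code; the confusion sets shown and the `score` function are hypothetical stand-ins for AtD’s database and language model.

```python
# Illustrative sketch of a confusion-set check. The sets and the
# scoring function here are hypothetical, not AtD's real data or code.

CONFUSION_SETS = [
    {"cite", "sight", "site"},
    {"right", "write"},
]

def lookup(word):
    """Return the confusion set containing word, if any."""
    for s in CONFUSION_SETS:
        if word in s:
            return s
    return None

def check(words, i, score):
    """Suggest alternatives that fit the context better than words[i].

    score(candidate, words, i) is assumed to return a probability-like
    measure of how well candidate fits at position i.
    """
    s = lookup(words[i])
    if s is None:
        return []
    current = score(words[i], words, i)
    better = [w for w in s if w != words[i] and score(w, words, i) > current]
    return sorted(better, key=lambda w: score(w, words, i), reverse=True)
```

If an alternative scores higher than the word you typed, it becomes a suggestion; otherwise nothing is flagged.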
AtD’s misused word detection uses the immediate left and right context of a word to decide which candidate is the better fit. This information comes from the bigram language model I assembled from public domain books, Wikipedia, and many of your blog entries. Here is how well the statistical misused word detection fares with this information:
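To make the left-and-right-context idea concrete, here is one simple way to score a candidate word with bigrams: combine the probability of the word given its left neighbor with the probability of the right neighbor given the word. The counts, vocabulary size, and add-one smoothing here are toy assumptions for illustration; AtD’s actual model and smoothing may differ.

```python
from collections import Counter

# Toy bigram and unigram counts; a real model is built from a large corpus.
bigrams = Counter({("i", "write"): 40, ("i", "right"): 2,
                   ("write", "every"): 10, ("right", "every"): 1})
unigrams = Counter({"i": 100, "write": 50, "right": 30, "every": 20})

def bigram_fit(word, left, right):
    """Score word in context as P(word | left) * P(right | word),
    with add-one smoothing over an assumed vocabulary size V."""
    V = 1000  # assumed vocabulary size
    p_left = (bigrams[(left, word)] + 1) / (unigrams[left] + V)
    p_right = (bigrams[(word, right)] + 1) / (unigrams[word] + V)
    return p_left * p_right
```

With these toy counts, "write" outscores "right" in the context "I ___ every", which is exactly the kind of signal the detector uses.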
Table 1. AtD Misused Word Detection Using Bigrams
Of course, I’ve been working to improve this. Today I got my memory usage under control, and I can now afford to use trigrams, or sequences of three words. With a trigram I can use the previous two words to try to predict the third. Here is how the statistical misused word detection fares with this extra information:
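Using the previous two words to predict the third boils down to estimating a conditional probability from trigram and bigram counts. Again, the counts and smoothing below are illustrative assumptions, not AtD’s real model:

```python
from collections import Counter

# Toy counts; a real model is estimated from a large corpus.
trigrams = Counter({("going", "to", "write"): 8, ("going", "to", "right"): 1})
bigrams = Counter({("going", "to"): 20})

def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2),
    with add-one smoothing over an assumed vocabulary size V."""
    V = 1000  # assumed vocabulary size
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)
```

A longer context like "going to ___" disambiguates cases where a single neighboring word is not enough, which is why the extra information helps.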
Table 2. AtD Misused Word Detection Using Bigrams and Trigrams
AtD uses neural networks to decide how to weight all the information fed to it. These tables represent the before and after of introducing trigrams. As you can see, trigrams significantly boost precision (how often AtD is right when it marks a word as wrong) and recall (how often AtD marks a word as wrong when it really is wrong).
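For reference, precision and recall fall out of the counts of true positives (wrong words correctly flagged), false positives (correct words flagged), and false negatives (wrong words missed):

```python
def precision(tp, fp):
    """Of the words flagged as wrong, what fraction really were wrong?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the words that really were wrong, what fraction got flagged?"""
    return tp / (tp + fn)
```

The two pull against each other: flagging more aggressively raises recall but feeds false positives into the precision denominator.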
If you think these numbers are impressive (and they’re pretty good), you should know a few things. The separation between the training data and the testing data isn’t very good. I generated separate datasets for training and testing, but both were drawn from text used in the AtD corpus. More clearly put: these numbers assume 100% coverage of all use cases in the English language, because all the training and test cases have trigrams and bigrams associated with them. If I were writing an academic paper, I’d take care to make this separation better. Since I’m benchmarking a feature and how new information affects it, I’m sticking with what I’ve got.
This feature is pretty good, and I’m gonna let you see it, BUT…
I still have some work to do. In production I bias against false positives, which hurts the recall of the system. I do this because you’re more likely to use the correct word, and I don’t want AtD flagging your words unless it’s sure the word really is wrong.
Biasing the bigram model put the statistical misused word detection at a recall of 65-70%. I expect to reach 85-90% when I’m done experimenting with the trigrams. My target false positive rate is under 0.5%, or 1 in 200 correctly used words flagged as wrong.
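One simple way to picture this kind of bias is as a margin requirement: only flag a word when the best alternative isn’t just more likely, but much more likely. The `bias` factor below is purely illustrative; it is not the threshold AtD actually uses.

```python
def should_flag(current_score, best_alt_score, bias=10.0):
    """Flag the current word only when the best alternative beats it
    by a wide margin. The bias factor is illustrative, not AtD's value."""
    return best_alt_score > bias * current_score
```

Raising the bias trades recall for precision, which is the knob behind the 65-70% versus 85-90% recall figures above.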
With any luck this updated feature will be pushed into production tomorrow.