Learning from your mistakes — Some ideas
I’m often asked if AtD gets smarter the more it’s used. The answer is not yet. To stimulate the imagination and give an idea of what’s coming, this post presents some ideas about how AtD can learn the more you use it.
There are two immediate sources I can harvest: the text you send and the phrases you ignore.
Learning from Submitted Text
The text you send is an opportunity for AtD to learn from the exact type of writing being checked. The process would consist of saving the parts of submitted documents that pass some correctness filter and later feeding them into the AtD language model.
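To make "some correctness filter" concrete, here is a minimal sketch of one possible filter: keep only sentences whose words are all already in the dictionary. This is a hypothetical illustration, not AtD's actual filter; the function name, punctuation handling, and acceptance rule are all assumptions.

```python
def clean_sentences(sentences, dictionary):
    """Crude correctness filter: keep only sentences whose words are
    all already known, before adding text to the training corpus.
    This is a sketch; a real filter would be far more forgiving."""
    kept = []
    for sentence in sentences:
        # Lowercase and strip trailing punctuation before lookup.
        words = [w.strip('.,!?;:"\'').lower() for w in sentence.split()]
        if words and all(w in dictionary for w in words):
            kept.append(sentence)
    return kept
```

A sentence like "Teh cat sat." would be dropped because "teh" isn't in the dictionary, while "The cat sat." would survive and enter the corpus.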
Submitted documents can be added directly to the AtD training corpus, which would improve almost every feature. AtD uses statistical information from this training data to rank spelling suggestions, detect misused words, and find exceptions to its style and grammar rules.
More data would also let me build a spell checker dictionary with fewer misspelled or irrelevant words. AtD requires that a word appear a certain number of times in the training corpus before it's included in the spell checker dictionary. With enough automatically collected data, AtD can find the highest threshold that still yields a dictionary of about 120K words.
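The threshold search described above can be sketched in a few lines of Python. This is my own illustration, not AtD's code; the function name and the tiny example counts are made up.

```python
def dictionary_threshold(word_counts, target_size=120_000):
    """Find the highest occurrence threshold that still yields a
    dictionary of at most `target_size` words.  `word_counts` maps
    each word seen in the corpus to how many times it appeared."""
    threshold = 1
    # Raising the threshold only shrinks the dictionary, so walk up
    # until the dictionary first fits within the target size.
    while sum(1 for c in word_counts.values() if c >= threshold) > target_size:
        threshold += 1
    return threshold
```

With more raw data, rare misspellings fall further below the threshold while legitimate rare words accumulate enough sightings to clear it.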
Learning from Ignored Phrases
When you click “Ignore Always”, this preference is saved to a database. I have access to this information for WordPress.com users and went through it once to see what I could learn.
This information could be used to find new words to add to AtD. Any word ignored by multiple people may be a correct spelling that the AtD dictionary is missing. I can set a threshold for how many times a word must be ignored before it's added to the AtD word lists. This is a tough problem, though: I don't want commonly ignored errors like ‘alot’ finding their way into the AtD dictionary, so some protection against this is necessary.
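One simple form that protection could take is a blocklist of known common errors checked alongside the ignore threshold. A minimal sketch, assuming a hypothetical `min_users` threshold and counts of distinct users per ignored word:

```python
def words_to_add(ignore_counts, known_errors, min_users=10):
    """Promote a word to the dictionary only when enough distinct
    users ignored it AND it isn't a known common error like 'alot'.
    `min_users` is an illustrative threshold, not a tuned value."""
    return {word for word, users in ignore_counts.items()
            if users >= min_users and word not in known_errors}
```

Counting distinct users rather than raw ignore events also guards against one prolific writer's pet misspelling getting promoted.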
Misused words that are flagged and then ignored represent an opportunity to focus data collection efforts on that word. This is tougher, but I may be able to query a service like Wordnik for more context on the word and add that context to the AtD corpus. If a word is ignored by enough people, it may also make sense to remove it from the misused word list automatically; the statistical detection approach simply doesn't work well for some words.
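Automatic demotion could compare how often a flag is ignored versus accepted and prune the rules that misfire. A hypothetical sketch; the ratio, minimum sample size, and data shape are all assumptions:

```python
def prune_misused_words(misused, stats, ignore_ratio=0.8, min_events=25):
    """Drop a word from the misused-word list when most users who saw
    the flag ignored it.  `stats` maps word -> (ignored, accepted)."""
    keep = []
    for word in misused:
        ignored, accepted = stats.get(word, (0, 0))
        total = ignored + accepted
        if total >= min_events and ignored / total >= ignore_ratio:
            continue  # demote: the rule misfires more often than it helps
        keep.append(word)
    return keep
```

The `min_events` floor matters: without it, a word ignored once by one person would be demoted before the detector ever had a fair trial.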
These ideas represent some of the low-hanging fruit for making AtD learn as you use the service. Now imagine AtD could track any behavior and the context around it. Accepting a suggestion is a good learning opportunity, and the misused word feature has even more room to improve if context is attached to each suggestion you accept or ignore.
These are my thoughts, what are yours?