Learning from your mistakes — Some ideas
I’m often asked if AtD gets smarter the more it’s used. The answer is not yet. To stimulate the imagination and give an idea of what’s coming, this post presents some ideas about how AtD can learn the more you use it.
There are two immediate sources I can harvest: the text you send and the phrases you ignore.
Learning from Submitted Text
The text you send is an opportunity for AtD to learn from the exact type of writing being checked. The process would consist of saving the parts of submitted documents that pass some correctness filter and later feeding them into the AtD language model.
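To make "some correctness filter" concrete, here is a minimal sketch of one possible filter: keep only sentences whose words are all already in the dictionary. This is a hypothetical illustration, not AtD's actual filter; the function name, punctuation handling, and acceptance rule are all assumptions.

```python
def clean_sentences(sentences, dictionary):
    """Crude correctness filter: keep only sentences whose words are
    all already known, before adding text to the training corpus.
    This is a sketch; a real filter would be far more forgiving."""
    kept = []
    for sentence in sentences:
        # Lowercase and strip trailing punctuation before lookup.
        words = [w.strip('.,!?;:"\'').lower() for w in sentence.split()]
        if words and all(w in dictionary for w in words):
            kept.append(sentence)
    return kept
```

A sentence like "Teh cat sat." would be dropped because "teh" isn't in the dictionary, while "The cat sat." would survive and enter the corpus.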
Submitted documents can be added directly to the AtD training corpus, which would improve almost every feature. AtD uses statistical information from this training data to rank spelling suggestions, detect misused words, and find exceptions to its style and grammar rules.
More data would also let me build a spell checker dictionary with fewer misspelled or irrelevant words. AtD requires that a word appear a certain number of times in the training corpus before it's included in the spell checker dictionary. With enough automatically collected data, AtD can find the highest threshold that still yields a dictionary of about 120K words.
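The threshold search described above can be sketched in a few lines of Python. This is my own illustration, not AtD's code; the function name and the tiny example counts are made up.

```python
def dictionary_threshold(word_counts, target_size=120_000):
    """Find the highest occurrence threshold that still yields a
    dictionary of at most `target_size` words.  `word_counts` maps
    each word seen in the corpus to how many times it appeared."""
    threshold = 1
    # Raising the threshold only shrinks the dictionary, so walk up
    # until the dictionary first fits within the target size.
    while sum(1 for c in word_counts.values() if c >= threshold) > target_size:
        threshold += 1
    return threshold
```

With more raw data, rare misspellings fall further below the threshold while legitimate rare words accumulate enough sightings to clear it.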
Learning from Ignored Phrases
When you click “Ignore Always”, this preference is saved to a database. I have access to this information for WordPress.com users and went through it once to see what I could learn.
This information could be used to find new words to add to AtD. Any word ignored by multiple people may be a correct spelling that the AtD dictionary is missing. I can set a threshold for how many times a word must be ignored before it's added to the AtD word lists. This is a tough problem, though: I don't want commonly ignored errors like ‘alot’ finding their way into the AtD dictionary, so some protection against this is necessary.
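One simple form that protection could take is a blocklist of known common errors checked alongside the ignore threshold. A minimal sketch, assuming a hypothetical `min_users` threshold and counts of distinct users per ignored word:

```python
def words_to_add(ignore_counts, known_errors, min_users=10):
    """Promote a word to the dictionary only when enough distinct
    users ignored it AND it isn't a known common error like 'alot'.
    `min_users` is an illustrative threshold, not a tuned value."""
    return {word for word, users in ignore_counts.items()
            if users >= min_users and word not in known_errors}
```

Counting distinct users rather than raw ignore events also guards against one prolific writer's pet misspelling getting promoted.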
Misused words that are flagged and then ignored represent an opportunity to focus data collection efforts on that word. This is tougher, but I may be able to query a service like Wordnik for more context on the word and add that context to the AtD corpus. If a word is ignored by enough people, it may also make sense to remove it from the misused word list automatically; the statistical detection approach simply doesn't work well for some words.
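Automatic demotion could compare how often a flag is ignored versus accepted and prune the rules that misfire. A hypothetical sketch; the ratio, minimum sample size, and data shape are all assumptions:

```python
def prune_misused_words(misused, stats, ignore_ratio=0.8, min_events=25):
    """Drop a word from the misused-word list when most users who saw
    the flag ignored it.  `stats` maps word -> (ignored, accepted)."""
    keep = []
    for word in misused:
        ignored, accepted = stats.get(word, (0, 0))
        total = ignored + accepted
        if total >= min_events and ignored / total >= ignore_ratio:
            continue  # demote: the rule misfires more often than it helps
        keep.append(word)
    return keep
```

The `min_events` floor matters: without it, a word ignored once by one person would be demoted before the detector ever had a fair trial.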
These ideas represent some of the low-hanging fruit for making AtD learn as you use the service. Now imagine AtD could track any behavior and the context around it. Accepting a suggestion is a good learning opportunity, and the misused word feature has even more room to improve if context is attached to each suggestion you accept or ignore.
These are my thoughts, what are yours?