The number one search term on this blog is “After the Deadline Chrome”. You’d think some people wanted to see After the Deadline on Google Chrome.
After the Deadline is a powerful proofreading technology that's also available for Firefox. If you haven't used After the Deadline before, check out our demonstrations. Now you can feel safe when you push send, tweet without alerting the writing police, and get your status updates correct on Facebook.
This extension is built to put you in control of the proofreading experience. You can disable AtD on certain sites, ignore phrases, enable more proofreading features, control auto-proofread, and set a keyboard shortcut.
For the best user experience, we recommend the latest beta channel of Chrome. The browser is constantly evolving and we took advantage of new features to bring the AtD experience to Chrome.
Before we begin: Did you notice my fancy and SEO-friendly post title? Linguists refer to misused words as real word errors. When I write about real word errors in this post, I'm really referring to the misused word detector in After the Deadline.
One of my favorite features in After the Deadline is the real word error corrector. In this post, I'll talk about this feature and how it works. I will also present an evaluation of this tool against Microsoft Word 2007 for Windows, which has a similar feature, one they call a contextual spelling checker.
After the Deadline has a list of 1,603 words it looks at. In this list, words are grouped into confusion sets. A confusion set is two or more words that may be misused during the act of writing. Some surprise me; for example, I saw someone write portrait on Hacker News when they meant portray. The words portrait and portray are an example of a confusion set.
Confusion sets are a band-aid and a limitation but they have their place for now. In an ideal program, the software would look at any word you use and try to decide if you meant some other word using various criteria at its disposal. After the Deadline doesn't do this because it would mean storing more information about every word and how it's used than my humble server could handle. Because of memory (and CPU) constraints, I limit the words I check based on these fixed confusion sets.
To detect real word errors, After the Deadline scans your document looking for any words in that potentially misused word list. When AtD encounters one of these words, it looks at the word’s confusion set and checks if any other word is a better fit.
How does this happen? It's pretty simple. AtD looks two words to the left and two words to the right of the potentially misused word and tries to decide which word from the confusion set is the best fit. This looks like:
I’ll admit, the site does portrait credibility.
Here, After the Deadline uses the following statistical features to decide which word W (portrait or portray) you want:
P(W | site, does)
P(W | credibility, END)
P(W | does)
P(W | credibility)
The probability of a word given the previous two words is calculated using a trigram. When After the Deadline learns a language, it stores every sequence of two words it finds, and sequences of three words that begin or end with a confusion set. Calculating the probability of a word given the next two words requires trigrams and a little algebra using Bayes' Theorem. I wrote a post on language models earlier. To bring all this together, After the Deadline uses a neural network to combine these statistical features into one score between 0.0 and 1.0. The word with the highest score wins. To bias against false positives, the current word's score is multiplied by 10 to make sure it wins in ambiguous cases.
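To make the scoring concrete, here's a minimal Python sketch of the idea. This is hypothetical code, not AtD's actual implementation: the probabilities are toy numbers and a simple average stands in for the trained neural network.

```python
# Hypothetical sketch of confusion-set scoring. In AtD the contextual
# probabilities come from bigram/trigram counts over a large corpus and
# a neural network combines them; here toy numbers and an average do.

CONFUSION_SETS = [{"portrait", "portray"}, {"be", "bee"}]

# Toy values for P(word | context word).
CONTEXT_PROB = {
    ("does", "portray"): 0.004,
    ("does", "portrait"): 0.00001,
    ("portray", "credibility"): 0.002,
    ("portrait", "credibility"): 0.00001,
}

def score(word, prev_word, next_word):
    # Combine the contextual features; AtD uses a neural network,
    # here we simply average the probabilities.
    features = [
        CONTEXT_PROB.get((prev_word, word), 0.0),
        CONTEXT_PROB.get((word, next_word), 0.0),
    ]
    return sum(features) / len(features)

def best_fit(word, prev_word, next_word, bias=10.0):
    # Find the confusion set for this word and score every member.
    candidates = next((s for s in CONFUSION_SETS if word in s), {word})
    scores = {w: score(w, prev_word, next_word) for w in candidates}
    scores[word] *= bias  # bias toward the writer's original word
    return max(scores, key=scores.get)

print(best_fit("portrait", "does", "credibility"))  # -> portray
```

Even with the 10x bias toward the original word, the context around "does ... credibility" favors portray so strongly that the correction wins.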
Let’s Measure It
Ok, so the natural question is, how well does this work? Every time I rebuild After the Deadline, I run a series of tests that check the neural network scoring function and tell me how often it's correct and how often it's wrong. This kind of evaluation serves as a good unit test, but it's hard to approximate real-world performance from it.
Fortunately, Dr. Jennifer Pedler's PhD thesis has us covered. In her thesis she developed and evaluated techniques for detecting real word errors to help writers with dyslexia. Part of her research consisted of collecting writing samples from writers with dyslexia and annotating the errors along with the expected corrections. I took a look at her data and found that 97.8% of the 835 errors are real word errors. Perfect for an evaluation of a real word error corrector.
Many things we consider the realm of the grammar checker are actually real word errors. Errors that involve the wrong verb tense (e.g., built and build), indefinite articles (a and an), and wrong determiners (the, their, etc.) are real word errors. You may ask, can real word error detection be applied to grammar checking? Yes, others are working on it. It makes sense to test how well After the Deadline as a complete system (grammar checker, misused word detector, etc.) performs correcting these errors.
To test After the Deadline, I wrote a Sleep script to compare a corrected version of Dr. Pedler’s error corpus to the original corpus with errors. The software measures how many errors were found and changed to something (the recall) and how often these changes were correct (the precision). This test does not measure the number of elements outside the annotated errors that were changed correctly or incorrectly.
To run it:
grep -v '\-\-\-\-' corrected.txt >corrected.clean.txt
java -jar sleep.jar measure.sl corrected.clean.txt
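For the curious, the core of the measurement can be sketched in a few lines of Python. This is a hypothetical re-creation, not the actual Sleep script, and the annotation format is an assumption.

```python
# Hypothetical sketch of the precision/recall measurement. Each
# annotation is (token position, error word, expected correction).

def measure(corrected_tokens, annotations):
    changed = 0   # annotated errors changed to something (recall)
    correct = 0   # annotated errors changed to the expected word (precision)
    for pos, error_word, expected in annotations:
        if corrected_tokens[pos] != error_word:
            changed += 1
            if corrected_tokens[pos] == expected:
                correct += 1
    recall = changed / len(annotations)
    precision = correct / changed if changed else 0.0
    return precision, recall

# The error corpus contained "no" where "know" was meant; the checker
# changed it, so both precision and recall are perfect here.
corrected = ["i", "know", "the", "answer"]
precision, recall = measure(corrected, [(1, "no", "know")])
print(precision, recall)  # -> 1.0 1.0
```

As in the real test, anything changed outside the annotated errors is simply not counted.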
Now we can compare one writing system to another using Dr. Pedler’s data. All we need to do is paste the error corpus into a system, accept every suggestion, and run the measure script against it. To generate the error file, I wrote another script that reads in Dr. Pedler’s data and removes the annotations:
java -jar sleep.jar errors.sl > errors.txt
Now we're ready. Here are the numbers comparing After the Deadline to Microsoft Word 2007 on Windows, Microsoft Word 2008 on the Mac, and Apple's grammar and spell checker built into Mac OS X 10.6. I include Microsoft Word 2008 and the Apple software because neither of these has a contextual spell checker. They still correct some real word errors with grammar checking technology.
| System | Precision | Recall |
| --- | --- | --- |
| MS Word 2007 – Windows | 90.0% | 40.8% |
| After the Deadline | 89.4% | 27.1% |
| MS Word 2008 – Mac OS X | 79.7% | 17.7% |
| Mac OS X Spell and Grammar Checker | 88.5% | 9.3% |
As you can see, Microsoft Word 2007 on Windows performs well in the recall department. Every error in the dyslexic corpus that is not in After the Deadline's confusion sets is an automatic hit against recall. Still, the precision of both systems is similar.
Another evaluation of MS Word 2007's real-word error detection was published by Graeme Hirst of the University of Toronto in 2008. He found that MS Word had lower recall on errors. To evaluate MS Word, he randomly inserted real-word errors (1 per 200 words) into the Wall Street Journal corpus, and then measured the system performance. It would be a benefit to the research community (especially *cough*those of us outside of universities*cough*) if such tests were conducted on data that one could redistribute.
After running this experiment, I added the syntactically distinct words from Dr. Pedler’s confusion sets to AtD’s existing confusion sets, pushed the magical button to rebuild models, and reran these tests. I saw AtD’s recall rise to 33.9% with a precision of 86.6%. Expanding the confusion sets used in AtD will improve AtD’s real-word error correction performance.
I'll continue to work on expanding the number of confusion sets in After the Deadline. One challenge to wholesale importing several words is that some words create more false positives than others when checked using the method described here (hence the precision drop when adding several unevaluated new words to the misused word detector).
If you’re using Microsoft Word 2007 on Windows, you’re in good hands. Still, After the Deadline is comparable in the real-word error detection department when MS Word 2007 isn’t available.
If you’re using a Mac, you should use After the Deadline.
I’m a firm believer that the opportunity for innovation happens when suddenly the right pieces become available and someone puts them together.
If you consider AtD innovative, know that I didn’t invent anything crazy and new with After the Deadline. I applied simple algorithms to a lot of data and achieved good results. The availability of data, cheap CPU power, and a cultural readiness to accept a software feature that depends on a remote server made After the Deadline possible. I simply put the pieces together (and spent some sweat trying lots of simple algorithm/data combinations that didn’t work) and voilà, a proofreading system.
I’d like to share with you two resources that helped me appreciate the innovation process.
The first is The Myths of Innovation by Scott Berkun. Scott talked to the Automattic crew during our October meetup and I really enjoyed the time we had with him. I read his book and found the historical examples relevant. It's easy for us to think that innovation happens in a vacuum with one lone hero pulling it all together at the magic moment. In reality, innovation is a lot of iteration, and it only becomes a magical moment when enough iteration has happened that others notice.
The other resource is the Connections documentary series by James Burke. I’m on my third time watching. It’s an amazing journey through history. The series is from the 70s and has a lot of speculation about the future and where technology may take us. James Burke speculates about the threat computers present to privacy and how technology might connect us in a way that folks couldn’t even imagine then. The series discusses history in terms of problems and the inventions they led to that later led to other problems and inventions. For example, the Faith in Numbers episode charts a historical journey from the Jacquard Loom, to the United States census, and finally to computers programmed with punch cards.
The message in Scott's book and Burke's series is the same. If you look at them, you won't think about innovation the same way again.
I was cleaning up tickets on the AtD TRAC today and noticed a lot of sub-projects. If I were new, I’d be a little confused. I’m writing this post to clarify what projects exist under the AtD umbrella and how they fit together. I’ll end this post with resources related to these projects and let you know how you can get involved.
This diagram shows some of the projects and how they relate:
The Server Side
The heart of AtD is the server software. This is where the proofreading happens. The software is written in a mix of Sleep and Java. Applications are welcome to communicate with it through an XML protocol. It’s released under the GNU General Public License. There is a treasure trove of natural language processing [1, 2, 3, 4] code here.
Of course a server is no good without a client. After writing a plugin for TinyMCE and later jQuery, I noticed duplicate functionality. To ease my burden fixing bugs in both and make it easier to port AtD to other applications, the AtD core library was born. This library has functions to parse the AtD XML protocol into a data structure that it uses to highlight errors in a DOM. It’s also capable of resolving the correct suggestions for an error (given its context). This library is the foundation of AtD’s front-end. One change to it immediately benefits the WordPress plugin, the Firefox add-on, the TinyMCE plugin, and the jQuery plugin. This is why you will often see me announce multiple things at once.
TinyMCE is a WYSIWYG editor used in many applications including WordPress. It was the first editor I supported with After the Deadline. This plugin makes it possible to add AtD to any application that already has TinyMCE installed. Some things like persisting settings for “Ignore Always” are left to the developer. For the most part though, this is a complete package.
Missing from this diagram is the CKEditor plugin. I made an After the Deadline CKEditor plugin in 8 hours to prove the utility of the AtD Core library. It’s missing some of the polished functionality that the TinyMCE and jQuery plugins have. As none of our projects use CKEditor I haven’t had a need to update it. It’s looking for a maintainer.
These building blocks are released under the GNU LGPL license.
The pieces I’ve written about so far are merely building blocks. They’re useless unless they’re applied somewhere. The WordPress plugin is built on the TinyMCE and jQuery plugins. Through these plugins I’m able to offer AtD support in both the visual and HTML editor. Thanks to AtD core, I’m able to make sure both editors (mis)behave in the same way.
The Firefox add-on was the first Automattic project to use the AtD core library directly. With it, Mitcho was able to start highlighting errors as soon as he got the add-on talking to the AtD server. This was pretty exciting as it meant we had a lot of functionality right away.
And of course having these building blocks also means you’re able to add After the Deadline to your application.
And finally, there are the project resources. As much as it seems like things are scattered around, it’s not really that bad.
Most of the AtD code maintained by Automattic lives in a common Subversion repository as well.
This is the official After the Deadline blog for all the AtD projects maintained by Automattic.
Anything related to the front-end is hosted on the AtD developer’s page.
Information about the NLP research and the proofreading software service are kept at: http://open.afterthedeadline.com
It’s not very active, but there is a mailing list on Google Groups for all AtD related projects.
If you develop an After the Deadline plugin for an application, let me know. I’ll gladly link to it.
Hopefully this post helps lay out the landscape of the AtD project and some of the sub-projects involved with it.
I have a bag of goodies for you today.
Thanks to the help of wonderful volunteers, our WordPress plugin has been updated with translations for German, Spanish, Italian, Japanese, Polish, and Russian. It also includes updated translations for Portuguese and French.
This update also fixes several bugs related to finding and highlighting errors. I recommend this update for all AtD users.
You can get the latest from the WordPress plugin repository.
If you use After the Deadline for Firefox, the 1.2 release is available on the early access page. The early access release exists because it takes time for the volunteer editors to review the release. An approval is necessary for it to show up as an automatic update.
I like to think of it as a free code review from an experienced developer, just for participating in the Mozilla community. You can wait for the automatic update or get the early access release now. Don’t you want to be the first on your block to run the latest AtD for Firefox?
The latest bbPress plugin adds an auto-proofread option, the ability to select which errors you see, and an ignore always option.
If you’re a developer using AtD, don’t worry–I haven’t left you out. Our TinyMCE plugin is up to date with the latest bug fixes. Also if you’re using the jQuery plugin, the cross-domain AJAX calls now support all the languages AtD supports.
You can always find the latest AtD libraries on our Developer Resources page.
I found an old screenshot today and thought I’d share it to give you an idea of (1) how bad my design eye is and (2) some history of After the Deadline. After the Deadline started life as a web-based style checker hosted at PolishMyWriting.com. My goal? Convince people to paste in their documents to receive feedback on the fly, while I made tons of money from ad revenue. It seemed like a good idea at the time.
PolishMyWriting.com did not check spelling, misused words, or grammar. It relied on 2,283 rules to look for phrases to avoid, jargon terms, biased phrases, clichés, complex phrases, foreign words, and redundant phrases. The most sophisticated natural language processing in the system detected passive voice and hidden verbs. I wouldn't call it sophisticated, though. I referenced a chart and wrote simple heuristics to capture the different forms of passive voice. Funny enough, it's the same passive voice detector used in After the Deadline today.
This rule-based system presents all of its suggestions (with no filtering or reordering) to the end-user. A Hacker News user agreed with 50% of what it had to say, and that's not too bad. After the Deadline today looks at the context of any phrase it flags and tries to decide whether a suggestion is appropriate or not. A recent reviewer of After the Deadline says he agrees with 75% of what it says. An improvement!
How PolishMyWriting.com Worked
My favorite part of PolishMyWriting.com was how it stored rules. All the rules were collapsed into a tree. From each word position in the document, the system would walk the tree looking for the deepest match. In this way PolishMyWriting.com only had to evaluate rules that were relevant to a given position in the text. It was also easy for a match to fail right away (hey, the current word doesn't match any of the possible starting words for a rule). With this I was able to create as many rules as I liked without impacting the performance of the system. The rule-tree in After the Deadline today has 33,331 end-states. Not too bad.
The rule-tree above matches six rules. Suppose I give it the sentence: I did not remember to write a lot of posts. The system would start with I and realize there is nowhere to go. The next word is did. It would look at did and check if any children of did in the tree match the word after did in the sentence. In this case not matches. The system repeats this process from not. The next word that matches is remember. Here PolishMyWriting.com would present the suggestions for did not remember to the user. If the search had failed to turn up an end state for did not remember, the system would advance to not and repeat the process from there. Since it found a match, the process starts again from to write a lot of posts. The result: I [forgot] to write [many, much] posts.
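Here's a hypothetical Python sketch of this rule-tree idea. The rule set and names are made up for illustration; the real system's tree and matching code are different, but the deepest-match walk is the same.

```python
# Hypothetical sketch of rule-tree matching: rule phrases are collapsed
# into a trie, and matching walks from each word position looking for
# the deepest end-state.

RULES = {
    "did not succeed": ["failed"],
    "did not remember": ["forgot"],
    "a lot of": ["many", "much"],
}

def build_tree(rules):
    tree = {}
    for phrase, suggestions in rules.items():
        node = tree
        for word in phrase.split():
            node = node.setdefault(word, {})
        node["0END"] = suggestions  # end-state marks a complete rule
    return tree

def check(words, tree):
    hits, i = [], 0
    while i < len(words):
        node, j, match = tree, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if "0END" in node:        # remember the deepest match so far
                match = (i, j, node["0END"])
        if match:
            hits.append(match)
            i = match[1]              # resume after the matched phrase
        else:
            i += 1                    # no rule starts here; move on
    return hits

sentence = "i did not remember to write a lot of posts".split()
print(check(sentence, build_tree(RULES)))
# -> [(1, 4, ['forgot']), (6, 9, ['many', 'much'])]
```

Because a match can only start at words that are roots of the tree, most positions fail immediately, which is why adding rules barely affects performance.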
What happened to PolishMyWriting.com?
I eventually dumped the PolishMyWriting.com name, mainly because I kept hearing jokes from people online (and in person) about how great it was that I developed a writing resource for the Polish people. It still exists, mainly as a place to demonstrate After the Deadline.
And don't forget, if PolishMyWriting.com helps you out, you can link to it using our nifty banner graphic. ;)
Its common for users to rely entirely on the in built proofreading capabilities of a word processor. Since the technology became standard in Microsofts Word in the 90’s countless cubicle dwellers and students have stopped carefully proofreading they’re own writing they have instead trust the automated spellcheck and grammar correcting features of their office product of choice to identify errors. We have carefully crafted this text to test the accuracy of these features, there are roughly 10 common grammatical mistakes in this paragraph. No matter good these tools perform there no replacement for carefully rereading you’re writing.
I agree and I think it’s time people rethink their relationship with their spell checker.
My friend Karen once told me a story about giving her husband feedback on a school paper. She noticed that he really liked semicolons. She confronted him on this and he said that Microsoft Word kept suggesting them and he kept accepting them. This is not a good situation.
Many writers rely on their spell checker to a fault. They see their spell checker as a tool to verify that a document is correct and ready to go with no effort on their part. If you want to verify that a document is correct, you need to reread it and look for errors. A great technique is to read the document backwards. Purdue’s Online Writing Lab has more tips like this.
If writers need to reread their documents, then what is the use of tools like After the Deadline? I look at After the Deadline as a tool that teaches users about writing. When asked what I do, I sometimes reply that I’m an English teacher with many thousands of students. No one gets the joke. It’s ok.
After the Deadline does a good job of finding its/it's errors. It does not find all of them. I think this is OK. If a user checks their document and has a habit of misusing its/it's, they'll probably see a lot of errors. If this user is inquisitive, they may quickly click explain. By doing this they'll learn why the error is an error. Because the feedback arrives during the writing process, the lesson has the most potential to sink in.
Feedback is most valuable when it’s immediate. After the Deadline makes you a better writer through immediate feedback.
One of the challenges with most natural language processing tasks is getting data and collapsing it into a usable model. Prepping a large data set is hard enough. Once you've prepped it, you have to put it into a language model. On my old NLP lab (consisting of two computers I bought from Cornell University for $100), it took 18 hours to build my language models. You probably have better hardware than I did.
Save the Pain
I want to save you some pain and trouble if I can. That’s why I’m writing today’s blog post. Did you know After the Deadline has prebuilt bigram language models for English, German, Spanish, French, Italian, Polish, Indonesian, Russian, Dutch, and Portuguese? That’s 10 languages!
Also, did you know that the After the Deadline language model is a simple serialized Java object? In fact, the only dependency needed to use it is one Java class. Now that I've got you excited, let's ask: what can you do with a language model?
Language Model API
A bigram language model has the count of every sequence of two words seen in a collection of text. From this information you can calculate all kinds of interesting things.
As an administrative note, I will use the Sleep programming language for these examples. This code is trivial to port to Java but I’m on an airplane and too lazy to whip out the compiler.
Let’s load up the Sleep interactive interpreter and load the English language model. You may assume all these commands are executed from the top-level directory of the After the Deadline source code distribution.
$ java -Xmx1536M -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
import * from: lib/spellutils.jar;
$handle = openf('models/model.bin');
$model = readObject($handle);
closef($handle);
println("Loaded $model");
.
Loaded org.dashnine.preditor.LanguageModel@5cc145f9
done
And there you have it. A language model ready for your use. I’ll walk you through each API method.
The count method returns the number of times the specified word was seen.
> x [$model count: "hello"]
153
> x [$model count: "world"]
26355
> x [$model count: "the"]
3046771
The Pword method returns the probability of a word. The Java way to call this is model.Pword("word").
> x [$model Pword: "the"]
0.061422322118108906
> x [$model Pword: "Automattic"]
8.063923690767558E-7
> x [$model Pword: "fjsljnfnsk"]
0.0
Word Probability with Context
That’s the simple stuff. The fun part of the language model comes in when you can look at context. Imagine the sentence: “I want to bee an actor”. With the language model we can compare the fit of the word bee with the fit of the word be given the context. The contextual probability functions let you do that.
The Pbigram1 method calculates P(word|previous), the probability of the specified word given the previous word. This is the most straightforward application of our bigram language model. After all, we have a count for every "previous word" pair seen in the corpus we trained with. We simply divide this by the count of previous to arrive at an answer. Here we use our contextual probability to look at be vs. bee:
> x [$model Pbigram1: "to", "bee"]
1.8397294594205855E-5
> x [$model Pbigram1: "to", "be"]
0.06296975819264979
The Pbigram2 method calculates P(word|next), the probability of the specified word given the next word. How does it do it? It's a simple application of Bayes' Theorem, which lets us flip the conditional in a probability:

P(word|next) = P(next|word) * P(word) / P(next)

Here we use it to further investigate the probability of be vs. bee:
> x [$model Pbigram2: "bee", "an"]
0.0
> x [$model Pbigram2: "be", "an"]
0.014840446919206074
If you were a computer, which word would you assume the writer meant?
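To see how the Bayes' Theorem flip works with actual counts, here's a hypothetical Python sketch. The bigram and unigram counts are toy numbers, not AtD's model, but the arithmetic is the same.

```python
# Hypothetical sketch of the Pbigram2 / Bayes' Theorem flip, using toy
# counts in place of a trained language model.

BIGRAMS = {("be", "an"): 120, ("bee", "an"): 0, ("to", "be"): 900}
UNIGRAMS = {"be": 2000, "bee": 15, "an": 400, "to": 5000}
TOTAL = sum(UNIGRAMS.values())

def P(word):
    # P(word): unigram probability.
    return UNIGRAMS.get(word, 0) / TOTAL

def P_next_given_word(nxt, word):
    # P(next|word): read straight off the bigram counts.
    if UNIGRAMS.get(word, 0) == 0:
        return 0.0
    return BIGRAMS.get((word, nxt), 0) / UNIGRAMS[word]

def P_word_given_next(word, nxt):
    # P(word|next) = P(next|word) * P(word) / P(next)  -- Bayes' Theorem.
    if P(nxt) == 0:
        return 0.0
    return P_next_given_word(nxt, word) * P(word) / P(nxt)

print(P_word_given_next("be", "an"))   # nonzero: "be an" was seen
print(P_word_given_next("bee", "an"))  # -> 0.0: "bee an" never seen
```

With these toy counts, P(be|an) works out to 120/400 = 0.3, while bee never precedes an, so the computer picks be every time.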
A Little Trick
These methods will also accept a sequence of two words as the parameter you're calculating the probability of. I use this trick to segment a misspelled word with a space between each pair of letters and compare the results to the other spelling suggestions.
> x [$model Pword: "New York"] 3.3241509434266565E-4 > x [$model Pword: "a lot"] 2.1988303923800437E-4 > x [$model Pbigram1: "it", "a lot"] 8.972553689218159E-5 > x [$model Pbigram2: "a lot", '0END.0'] 6.511467636360339E-7
0END.0 is a special word: it represents the end of a sentence. Likewise, 0BEGIN.0 represents the beginning of a sentence. The only punctuation tracked by these models is the comma ','. You can refer to it directly.
Harvest a Dictionary
One of my uses for the language model is to dump a spell checker dictionary. I do this by harvesting all words that occur two or more times. When I add enough data, I’ll raise this number to get a higher quality dictionary. To harvest a dictionary:
> x [$model harvest: 1000000]
[a, of, to, and, the, 0END.0, 0BEGIN.0]
This command harvests all words that occur a million or more times. As you can see, there aren't too many. The language model I have now was derived from 75 million words of text.
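The harvest operation itself is simple to sketch in Python (hypothetical code with toy counts, not the actual implementation):

```python
# Hypothetical sketch of harvesting a dictionary from a language model:
# keep every word whose count meets the threshold.

UNIGRAMS = {"a": 2_100_000, "of": 1_800_000, "rare": 3, "typo": 1}

def harvest(counts, threshold):
    return sorted(w for w, c in counts.items() if c >= threshold)

print(harvest(UNIGRAMS, 1_000_000))  # -> ['a', 'of']
print(harvest(UNIGRAMS, 2))          # drops one-off words like 'typo'
```

Raising the threshold trades dictionary coverage for quality: words seen only once are more likely to be typos than real vocabulary.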
The Next Step
That’s the After the Deadline language model in a nutshell. There is also a method to get the probability of a word given the two words that came before it. This is done using trigrams. I didn’t write about it here because AtD stores trigrams for words tracked by the misused word detector only.
That said, there’s a lot of fun you can have with this kind of data.
Download the After the Deadline open source distribution. You’ll find the English language model at models/model.bin. You can also get spellutils.jar from the lib directory.
If you want to experiment with bigrams in other languages, the After the Deadline language pack has the trained language models for nine other languages.
Good luck and have fun.
After the Deadline is an open source grammar, style, and spell checker. Unlike other tools, it uses context to make smart suggestions for errors. Plugins are available for Firefox, WordPress, and others.
This release of After the Deadline for Firefox works in more places. Here is a screenshot of After the Deadline working with Google Docs:
This release also:
- Adds proofreading for French, German, Portuguese, and Spanish
- Fixes several bugs and reported add-on conflicts
You can read the full list of changes at http://firefox.afterthedeadline.com/upgrades/1.1/
So many things to announce, how do I do it in one blog post? Let's do a list. Drum roll, please.
5. Good-bye API keys
We’ve gotten rid of the AtD API keys. I was pushing to ask for more information and force folks to download a white paper before getting anything. Needless to say, I lost that battle. Using After the Deadline no longer requires registering with us. It’s still free for personal use. If you have a commercial need, grab our open source software.
4. Open Source Software – Updated Release
Finally, after all this time, After the Deadline’s server software is in a public subversion repository. We’ve also repackaged the current code and updated some of the documentation. Now you can check out the server software and stay in sync with what we’re using. We also have a mechanism (a local.sl file) where you can make local changes and not worry about us breaking them during future updates.
3. AtD speaks multiple languages
Yes, now AtD speaks multiple languages. We’ve put servers in place for French, German, Portuguese, and Spanish. We have more languages ready to go and we’ll make those available in the future. We’re providing contextual spell checking for these languages. French and German have grammar checking courtesy of the excellent Language Tool project. Misused word detection is under development.
The AtD Language Pack on our open source server page has everything you need.
2. bbPress Plugin Update
As if some otherworldly power was driving him, Gautam released an update to AtD/bbPress with support for French, German, Portuguese, and Spanish on Friday. How he knew about all this stuff before us, I don’t know :) But it’s great and if you use bbPress you need to get the plugin.
1. WordPress Plugin Update with Translations
And yes, our WordPress plugin has been updated to banish the API key nag-screen and to support proofreading in French, German, Portuguese, and Spanish.
The updated WordPress plugin uses your WPLANG setting to decide which language it should proofread in. If you blog in many languages or this setting doesn’t work for you, visit your profile page (the same place where all the AtD settings are) and enable the proofread with detected language option. With this turned on, After the Deadline will detect your language and apply the correct proofreader to it.
Thanks to the wonderful WordPress community volunteers, the AtD plugin has translations for Portuguese, Hindi, Japanese, French, Finnish, Bosnian, and Persian.
0. An Extra Bonus
I originally wanted to provide 10 exciting news items and this post became way too long with too much stuff at the top. So now you get a bonus item. We’ve also released updates to the AtD front-end components. They’re L10n ready and AtD/jQuery is now compatible with jQuery 1.4.