After the Deadline

After the Deadline: Acquired

Posted in News by rsmudge on September 8, 2009

Today I have big news to announce for After the Deadline.  But first, I have to tell you a story.

I left the Air Force in March 2008 to pursue my dream of launching a startup and to finish graduate school.

Coming from the US Air Force Research Lab, I wanted to solve a problem and invent something cool.  The most recent problem I had faced when leaving the Air Force was writing technical reports.  I knew what I wanted to say but always had doubts about my style.  So I decided to hunker down and write a style checker.  I launched this tool as PolishMyWriting.com in July 08 and it went… nowhere.

Later I wrote to a friend of mine in NYC who showed PolishMyWriting.com to his boss at TheLadders.com.  His boss wrote something about it in their customer newsletter and thousands of people came to my site.  They processed many documents and wrote to tell me how much this style checking tool helped them.  This inspired me.  I asked myself, “If I’m selling umbrellas, where is it raining?” and I saw an opportunity in the web application space.  My goal: bring word-processor-quality proofreading tools to the web.  It was at this moment After the Deadline was born.

I submitted the style checker embedded in TinyMCE as part of an application to Y Combinator, Spring 09.  Later, I was greeted with a rejection letter.  But that was OK!  I knew I didn’t need permission to start a business.  So on I went, adhering to the proposed schedule and milestones.

By Mar 09, I had a pretty kick-ass system going.  The spellchecker accuracy was showing potential to rival even MS Word (context makes a big difference) and I knew the style checker was comparable to similar commercial software.  My favorite feature, though, was the misused word detection.  Outside of the latest MS Word, no one else really had this.

I applied to several other seed funds, and after an encouraging meeting with a seed program in Boston, I found a trusted and qualified friend to handle business development if we were funded.  At this time I took some cost-cutting measures (*cough* sold everything, moved in with my sister *cough*) to keep going.  My partner and I landed an interview and prepped like crazy for it.  We weren’t funded, and the feedback I received was that the idea was good but the problem was too hard.  I’m thankful we had that interview, though, and encouraged that we made the first cut.

My short-lived partner went off to a real job, and I devoted another month to coding before launching in June 09.  As part of the promotion process, I posted about AtD to Hacker News. I also left this comment, hoping to impress someone:

This paragraph is from a NY Times article. Can you find the error in it?

Still Mr. Franken said the whole experience had been disconcerting. “It’s a weird thing: people are always asking me and Franni, ‘Are you okay?’ ” he said, referring to his wife. “As sort of life crises go, this is low on the totem poll. But it is weird, it’s a strange thing.”

Neither can the spell checker in your browser. Why? Because most spell checkers do not look at context. After the Deadline does.

Besides misused word detection (and contextual spell checking), After the Deadline checks grammar and style as well.

Visit http://www.polishmywriting.com/nyt.html to see the answer.

That must have worked, because later, I received an email from Matt Mullenweg asking me about bringing After the Deadline to Automattic.  Matt and I are both big believers in open source.  We like to eat but at the same time see a bigger picture where impact matters.   He is also an incredibly smooth chatter and anti-aliasing does wonders for his online presence.  I knew this was an opportunity I couldn’t say no to.

And so here I am.  We did the deal in July 09, and since then I’ve moved After the Deadline to Automattic’s infrastructure, rewritten the plugin, and improved the algorithms.  Today it started checking the spelling, style, and grammar for millions of bloggers.

So what’s next?

I’m continuing this natural language processing research under the Automattic banner.   We’re planning to expand AtD to support other languages.

After the Deadline will stay free for non-commercial use and we hope to see others build on the service.

And finally, our goal is to raise the quality of writing on the internet and give folks confidence in their voice.  We’re planning to open source the After the Deadline engine and the rule-sets that go with it.  This will be the most comprehensive proofreading suite available under an open source license.  I’m excited about the opportunity to be a part of this contribution.

And some thanks…

The hardest part of waiting to make this announcement is that I haven’t yet had a chance to publicly thank those who believed in this project from the beginning.  I’d like to thank Mr. Elmer White for his legal counsel and support (congratulations on your second grandson); Ms. Hye Yon Yi for her support and for putting up with me when I was completely unavailable; Mr. Dug Song, Patron Saint of MI Hackers, for advising me on the business side; Ms. Katrina Campau for coaching me on the investor interview; A2 New Tech (Ann Arbor, MI) for letting me present and being encouraging; Mr. David Groesbeck and Ms. Michelle Evanson at Why International for being my earliest business cheerleaders; Mr. Brandon Mumby for providing hosting, even after AtD caused a hardware failure; my colleagues, who serve at the Air Force Research Lab, for inspiring me to stay curious and keeping me in the community; and the crew in #startups on FreeNode and the Hacker News community, for showing it can be done.  Of course, thanks to my family (especially my sister, who let me turn the basement into a “command center”) and the makers of Mint Chocolate Chip ice cream.

Tweaking the AtD Spellchecker

Posted in NLP Research by rsmudge on September 4, 2009

Conventional wisdom says a spellchecker dictionary should have around 90,000 words.  Too few words, and the spellchecker will mark many correct things as wrong.  Too many words, and it’s more likely a typo will match a rarely used word and go unnoticed.

Assembling a good dictionary is a challenge.  Many wordlists are available online, but oftentimes these are either not comprehensive enough or too comprehensive, containing many misspellings.

AtD tries to get around this problem by intersecting a collection of wordlists with words it sees used in a corpus (a corpus is a directory full of books, Wikipedia articles, and blog posts I “borrowed” from you).  Currently, AtD accepts any word seen at least once, leading to a dictionary of 161,879 words.  Too many.
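
To make the idea concrete, here is a minimal sketch of threshold-based dictionary building (the file layout and names are hypothetical; the real AtD pipeline is more involved).  The threshold parameter is what today’s experiment varies:

    # Sketch: intersect a trusted wordlist with corpus counts, keeping
    # only words seen at least `threshold` times.  Paths are hypothetical.
    from collections import Counter

    def build_dictionary(wordlist_path, corpus_path, threshold=1):
        with open(wordlist_path) as f:
            wordlist = {line.strip().lower() for line in f if line.strip()}

        counts = Counter()
        with open(corpus_path) as f:
            for line in f:
                counts.update(word.lower() for word in line.split())

        # A word makes the dictionary only if a trusted wordlist contains
        # it AND the corpus uses it often enough.
        return {word for word in wordlist if counts[word] >= threshold}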

Today I decided to experiment with different thresholds for how many times a word needs to be seen before it’s allowed entrance into the coveted spellchecker wordlist.  My goal was to increase the accuracy of the AtD spellchecker and drop the number of misspelled words in the dictionary.

Here are the results; AtD:n means AtD requires a word to be seen n times before including it in the dictionary.

ASpell Dataset (hard-to-correct errors)

Engine            Words     Accuracy*   Present Words
AtD:1             161,879   55.0%       73
AtD:2             116,876   55.8%       57
AtD:3              95,910   57.3%       38
AtD:4              82,782   58.0%       30
AtD:5              73,628   58.5%       27
AtD:6              66,666   59.1%       23
ASpell (normal)   n/a       56.9%       14
Word 97           n/a       59.0%       18
Word 2000         n/a       62.6%       20
Word 2003         n/a       62.8%       20

Wikipedia Dataset (easy-to-correct errors)

Engine            Words     Accuracy*   Present Words
AtD:1             161,879   87.9%       233
AtD:2             116,876   87.8%       149
AtD:3              95,910   88.0%       104
AtD:4              82,782   88.3%       72
AtD:5              73,628   88.3%       59
AtD:6              66,666   88.6%       48
ASpell (normal)   n/a       84.7%       44
Word 97           n/a       89.0%       31
Word 2000         n/a       92.5%       42
Word 2003         n/a       92.6%       41

Experiment data and comparison numbers from: Deorowicz, S. and Ciura, M. G., “Correcting spelling errors by modelling their causes,” International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285.

* Accuracy numbers reflect spell checking without context, as the Word and ASpell checkers are not contextual (and therefore the data isn’t either).

After seeing these results, I’ve decided to start with a threshold of 2, and I’ll move to 3 once it’s clear no one is complaining about 2.

I’m not too happy that the present word count is so high, but as I add more data to AtD and raise the minimum word threshold, this problem should go away.  This is progress, though.  Six months ago I had so little data that I couldn’t have used a threshold of 2 even if I wanted to.

WordPress Plugin Update

Posted in News by rsmudge on September 3, 2009

Today, I’ve released an updated WordPress plugin for After the Deadline.  This plugin really revamps the AtD experience; I think you’ll like it.  [Get it here]

The most obvious change: most error types are now optional and disabled by default.  AtD still checks your spelling and grammar, but most of the style options are set on your profile page (/wp-admin/profile.php).

[Screenshot: After the Deadline options on the WordPress profile page]

Here is a summary of the options:

  • Bias language may offend or alienate different groups of readers.
  • Clichés are overused phrases with little reader impact.
  • Complex phrases are words or phrases with simpler everyday alternatives.
  • A double negative is one negative phrase followed by another. The negatives cancel each other out, making the meaning hard to understand.
  • A hidden verb is a verb made into a noun. These often need extra verbs to make sense.
  • Jargon phrases are foreign words and phrases that only make sense to certain people.
  • Passive voice obscures or omits the sentence subject. Frequent use of passive voice makes your writing hard to understand.
  • Phrases to avoid are wishy-washy or indecisive phrases.
  • Redundant phrases can be shortened by removing an unneeded word.

These settings follow the user, so each user on your blog can have different settings.

Also, the AtD plugin code has been updated to the WordPress coding standards; it’s much cleaner.  The TinyMCE plugin has received a rigorous overhaul as well.  If you have any issues, let me know.

— Raphael

(Update 4 Sept 09 – 10am) Well, for those of you who had issues, thanks for letting me know. 🙂 One of the AtD files had trailing whitespace at the end, causing WordPress to die with a “Cannot modify header information” warning.  To make this especially maddening, it only affected those who have output_buffering disabled in php.ini, which depends on who packaged PHP and/or how it was installed.  The issue has been resolved and the latest version is in the WordPress repository.

Grammar Checkers – Not so much magic

Posted in NLP Research by rsmudge on August 31, 2009

I once read a quote in which an early pioneer of computerized grammar checking expressed disappointment about how little the technology had evolved.  It’s amazing to me how grammar checking is simultaneously simple (rule-based) and complicated.  The complicated part is the technology that makes abstractions over the raw text possible.

It helps to start with a definition: what is a grammar checker?  When I write “grammar checker” here, I’m really referring to a writing checker that looks for phrases that represent an error.  Despite advances in AI, complete understanding of unstructured text is still beyond the reach of computers.  Grammar checkers work by finding patterns that a human created and flagging text that matches these patterns.

Patterns can be as simple as always flagging “your welcome” and suggesting “you’re welcome”.  While these patterns are simple to write a checker for, they don’t offer much power.  A grammar checker with any kind of coverage would need tens or hundreds of thousands of rules to be useful.
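
To make this concrete, here is a minimal sketch of a literal-pattern checker (the rules and code are illustrative only, not AtD’s actual rule set or engine):

    # Sketch: a literal-pattern "grammar checker" with two toy rules.
    import re

    RULES = {
        r"\byour welcome\b": "you're welcome",
        r"\bcould care less\b": "couldn't care less",
    }

    def check(text):
        for pattern, suggestion in RULES.items():
            for match in re.finditer(pattern, text, re.IGNORECASE):
                print(f"Flagged {match.group(0)!r}; suggest {suggestion!r}")

    check("Thanks for coming! Your welcome to stay.")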

Realizing this, NLP researchers came up with ways to infer information about text at a higher level and to write rules that take advantage of this higher-level information.  One example is part-of-speech tagging.  In elementary school you may have learned grammar by labeling the words in a sentence as verb, noun, adjective, etc.  Most grammar checkers do this too; the process is called tagging.  With part-of-speech information, one rule can capture many writing errors.  In After the Deadline, I use part-of-speech tags to find errors where a plural noun is used with a determiner that expects a singular noun, e.g., “There is categories.”
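
Here is a minimal sketch of such a tag-based rule, using NLTK’s off-the-shelf tagger as a stand-in for AtD’s own (assumes NLTK’s tokenizer and tagger data packages are installed):

    # Sketch: flag a singular verb ("is") directly followed by a
    # plural noun (tag NNS).  Uses NLTK's tagger, not AtD's.
    import nltk

    def flag_agreement(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for (word, _), (next_word, next_tag) in zip(tagged, tagged[1:]):
            if word.lower() == "is" and next_tag == "NNS":
                print(f"Possible agreement error: '{word} {next_word}'")

    flag_agreement("There is categories for everything.")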

While tagging gives extra information about each word, the rules we can write are still limited.  Some words should be grouped together, for example proper nouns like “New York”.  The process of grouping words that belong together is known as chunking.  Through chunking, rule writers can have more confidence that what follows a tagged word (or chunk) really has the characteristics (plural, singular) assumed by the rule.  For example, “There is many categories” should be flagged, and a decent chunker makes it easier for a rule to see that the phrase following “There is” is a plural noun phrase.
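
A minimal sketch of chunking with NLTK’s RegexpParser follows (a toy noun-phrase grammar, not AtD’s chunker):

    # Sketch: chunk noun phrases, then flag "is" followed by a chunk
    # whose head (last) noun is plural (NNS).
    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, nouns
    chunker = nltk.RegexpParser(grammar)

    tagged = nltk.pos_tag(nltk.word_tokenize("There is many categories here."))
    tree = chunker.parse(tagged)

    for i, node in enumerate(tree):
        if isinstance(node, tuple) and node[0].lower() == "is":
            if i + 1 < len(tree) and isinstance(tree[i + 1], nltk.Tree):
                head_word, head_tag = tree[i + 1].leaves()[-1]
                if head_tag == "NNS":
                    print(f"Flag: 'is' followed by plural chunk {tree[i + 1].leaves()}")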

The next abstraction is the full parse.  A full parse is where the computer tries to infer the subject, object, and verbs of a sentence.  The structure of the sentence is placed into a tree-like data structure that rules can refer to at will.  With a full parse, a grammar checker can offer suggestions that drastically restructure the sentence (e.g., reword passive voice), decide which sentences make no sense, and find errors between words that are far apart (subject-verb agreement errors).
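
And a minimal sketch of a rule over a full parse (a hand-built tree for illustration; a real system would get the tree from a statistical parser):

    # Sketch: check subject-verb agreement across a parse tree.
    from nltk import Tree

    parse = Tree.fromstring(
        "(S (NP (DT the) (NNS dogs)) (VP (VBZ barks) (ADVP (RB loudly))))")

    subject = next(t for t in parse if t.label() == "NP")
    verb_phrase = next(t for t in parse if t.label() == "VP")

    subject_tag = subject.pos()[-1][1]   # tag of the head noun: NNS (plural)
    verb_tag = verb_phrase.pos()[0][1]   # tag of the main verb: VBZ (singular)

    if subject_tag == "NNS" and verb_tag == "VBZ":
        print("Subject-verb agreement error: plural subject, singular verb")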

Regardless of the abstraction level used, grammar checkers are still rule-based.  The techniques that back these abstractions can become very sophisticated, and much of the research seems to have focused on improving them.

To move grammar checking to a new level, I expect we will need new abstractions one can use when writing rules.  I also expect developing techniques to automatically mine rules will be a valuable research subject.

It’s about time…

Posted in News by rsmudge on August 27, 2009

It’s time for After the Deadline (abbreviated AtD) to have its own blog.  For those of you who found this from a random Google search, After the Deadline is a technology that checks spelling, misused words, style, and grammar.  I spent a stint as a researcher for the government, and this is my attempt to apply the scientific method and make good things happen.

Here I’ll keep you up to date with the latest news on the technology and start sharing my results and methods.  I’m a big believer in open source and open information.  I am also frustrated by how “Computer Science” papers present lots of great ideas but often little in the way of code or data to independently verify their results.

It’s my mission to make something immensely useful but also to share my methods and advance the science behind it.  Please join me on this journey.