After the Deadline

Sentence Segmentation Survey for Java

Posted in Multi-Lingual AtD by rsmudge on November 17, 2009

Well, it’s time to get AtD working with more languages. A good place to start is sentence segmentation: the problem of taking a bunch of raw text and breaking it into sentences.

Like any researcher, I start my task with a search to see what others have done. Here is what I found:

  1. There is a standard out there called SRX for Segmentation Rules Exchange. SRX files are XML and there is an open source Segment Java library for segmenting sentences using these rule files. There is also an editor called Ratel that lets folks edit these SRX files. LanguageTool has support for SRX files.
  2. Another option is to use the OpenNLP project’s tools. They have a SentenceDetectorME class that might do the trick. The problem is that models are available only for English, German, Spanish, and Thai.
  3. I also learned that Java 1.6 has built-in tools for sentence segmentation in the java.text.* package. These were donated by IBM. Here is a quick dump of the locales supported by this package:

    java -jar sleep.jar -e 'println(join(", ", [java.text.BreakIterator getAvailableLocales]));'

    ja_JP, es_PE, en, ja_JP_JP, es_PA, sr_BA, mk, es_GT, ar_AE, no_NO, sq_AL, bg, ar_IQ, ar_YE, hu, pt_PT, el_CY, ar_QA, mk_MK, sv, de_CH, en_US, fi_FI, is, cs, en_MT, sl_SI, sk_SK, it, tr_TR, zh, th, ar_SA, no, en_GB, sr_CS, lt, ro, en_NZ, no_NO_NY, lt_LT, es_NI, nl, ga_IE, fr_BE, es_ES, ar_LB, ko, fr_CA, et_EE, ar_KW, sr_RS, es_US, es_MX, ar_SD, in_ID, ru, lv, es_UY, lv_LV, iw, pt_BR, ar_SY, hr, et, es_DO, fr_CH, hi_IN, es_VE, ar_BH, en_PH, ar_TN, fi, de_AT, es, nl_NL, es_EC, zh_TW, ar_JO, be, is_IS, es_CO, es_CR, es_CL, ar_EG, en_ZA, th_TH, el_GR, it_IT, ca, hu_HU, fr, en_IE, uk_UA, pl_PL, fr_LU, nl_BE, en_IN, ca_ES, ar_MA, es_BO, en_AU, sr, zh_SG, pt, uk, es_SV, ru_RU, ko_KR, vi, ar_DZ, vi_VN, sr_ME, sq, ar_LY, ar, zh_CN, be_BY, zh_HK, ja, iw_IL, bg_BG, in, mt_MT, es_PY, sl, fr_FR, cs_CZ, it_CH, ro_RO, es_PR, en_CA, de_DE, ga, de_LU, de, es_AR, sk, ms_MY, hr_HR, en_SG, da, mt, pl, ar_OM, tr, th_TH_TH, el, ms, sv_SE, da_DK, es_HN
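If you don't have sleep.jar handy, the same dump can be sketched in plain Java. The class and method names here (BreakIteratorLocales, availableLocales) are made up for illustration; the only real API call is BreakIterator.getAvailableLocales(). The list you get back depends on your JDK, so it will probably not match the output above exactly.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; only BreakIterator.getAvailableLocales() comes from the JDK.
class BreakIteratorLocales {
    /** Returns the locales BreakIterator reports as supported, joined with ", ". */
    static String availableLocales() {
        List<String> names = new ArrayList<>();
        for (Locale locale : BreakIterator.getAvailableLocales()) {
            names.add(locale.toString()); // e.g. "en_US", "ja_JP"
        }
        return String.join(", ", names);
    }

    public static void main(String[] args) {
        System.out.println(availableLocales());
    }
}
```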

A good survey of tools from the corpora-l mailing list is at http://mailman.uib.no/public/corpora/2007-October/005429.htm

I think I found my winner with Java’s built-in sentence segmentation tools. I haven’t evaluated the quality of the output yet (a task for tomorrow) but the fact that it supports so many locales out of the box is very appealing to me. AtD-English has made it far on my simple rule-based sentence segmentation. If this API is as good as (or, I suspect, better than) what I have, this will do quite nicely.
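For the curious, here is a minimal sketch of what using the built-in tools might look like. This is not AtD's code: the SentenceSplitter class and its split method are names I made up for the example, and the sample text is deliberately easy. The real pieces are BreakIterator.getSentenceInstance(Locale), setText, first, and next from java.text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical wrapper around the JDK's sentence BreakIterator.
class SentenceSplitter {
    /** Splits text into trimmed sentences using the rules for the given locale. */
    static List<String> split(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // next() returns each boundary offset in turn, then DONE at the end of the text.
        for (int end = boundaries.next(); end != BreakIterator.DONE; end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "This is one sentence. Here is another! Is this a third?";
        for (String sentence : split(sample, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Swapping Locale.ENGLISH for another supported locale is all it takes to try a different language. Text with abbreviations such as "Mr." is worth testing separately, since that is exactly the kind of case the quality evaluation needs to cover.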



WordCamp NYC – AtD Wrap Up

Posted in Talking to myself by rsmudge on November 15, 2009

So today is day two of WordCamp. This was my first one and I have to say it was definitely a good time. I learned a lot, got to interact with many WordPress “personalities”, and showed off AtD a bit.

I gave two presentations. At yesterday’s 2:30pm session I showed After the Deadline and its features to a packed room. To those of you who made it as far as this blog: good to see you, and I hope you stick around.

I also gave a talk at 10pm showing how to add After the Deadline to a web application using jQuery. Those present seemed like a strong jQuery crowd so this was a positive thing. I hope some of you try it out. For those who couldn’t make it (but wanted to) here is the presentation:

View this document on Scribd

As a side note: I just noticed AtD corrects wordcamp to WordCamp. I’m on the ball for you guys 🙂

After the Deadline: Acquired

Posted in News by rsmudge on September 8, 2009

Today I have big news to announce for After the Deadline.  But first, I have to tell you a story.

I left the Air Force in March 2008 to pursue my dream of launching a startup and to finish graduate school.

Coming from the US Air Force Research Lab, I wanted to solve a problem and invent something cool.  The most recent problem I had when leaving the Air Force was writing technical reports.  I knew what I wanted to say but always had doubts about my style.  So I decided to hunker down and write a style checker.  I launched this tool as PolishMyWriting.com in July 08 and it went… nowhere.

Later I wrote to a friend of mine in NYC who showed PolishMyWriting.com to his boss at TheLadders.com.  His boss wrote something about it in their customer newsletter and thousands of people came to my site.  They processed many documents and wrote to tell me how much this style checking tool helped them.  This inspired me.  I asked myself “if I’m selling umbrellas, where is it raining?” and I saw an opportunity in the web application space.  My goal–bring word processor quality proofreading tools to the web.  It was at this moment After the Deadline was born.

I submitted the style checker embedded into TinyMCE as part of an application to Y-Combinator, Spring 09.  Later, I was greeted with a rejection letter.  But that was ok!  I knew I didn’t need permission to start a business.  So on I went.  I adhered to the proposed schedule and milestones.

By Mar 09, I had a pretty kick ass system going.  The spellchecker accuracy was showing potential to rival even MS Word (context makes a big difference) and I knew the style checker was comparable to similar commercial software.  My favorite feature, though, was the misused word detection.  Outside of the latest MS Word, no one else really had this.

I applied to several other seed funds and after an encouraging meeting with a seed program in Boston, I found a trusted and qualified friend to handle business development if we were funded.  At this time I took some cost cutting measures (*cough*sold everything, moved in with my sister*cough*) to keep going.  My partner and I landed an interview and prepped like crazy for it.  We weren’t funded and I received feedback that the idea was good but the problem was too hard.  I’m thankful we had that interview though and was encouraged to have made the first cut.

My short-lived partner went to a real job and I devoted another month to coding and launched in June 09.  As part of the promotion process, I posted about AtD to Hacker News. I also left this comment, hoping to impress someone:

This paragraph is from a NY Times article. Can you find the error in it?

Still Mr. Franken said the whole experience had been disconcerting. “It’s a weird thing: people are always asking me and Franni, ‘Are you okay?’ ” he said, referring to his wife. “As sort of life crises go, this is low on the totem poll. But it is weird, it’s a strange thing.”

Neither can the spell checker in your browser. Why? Because most spell checkers do not look at context. After the Deadline does.

Besides misused word detection (and contextual spell checking), After the Deadline checks grammar and style as well.

Visit http://www.polishmywriting.com/nyt.html to see the answer.

That must have worked, because later, I received an email from Matt Mullenweg asking me about bringing After the Deadline to Automattic.  Matt and I are both big believers in open source.  We like to eat but at the same time see a bigger picture where impact matters.   He is also an incredibly smooth chatter and anti-aliasing does wonders for his online presence.  I knew this was an opportunity I couldn’t say no to.

And so here I am.  We did the deal in July 09 and since then I’ve moved After the Deadline to Automattic’s infrastructure, rewritten the plugin, and improved the algorithms, and today it started checking the spelling, style, and grammar for millions of bloggers.

So what’s next?

I’m continuing this natural language processing research under the Automattic banner.   We’re planning to expand AtD to support other languages.

After the Deadline will stay free for non-commercial use and we hope to see others build on the service.

And finally, our goal is to raise the quality of writing on the internet and give folks confidence in their voice.  We’re planning to open source the After the Deadline engine and the rule-sets that go with it.  This will be the most comprehensive proofreading suite available under an open source license.  I’m excited about the opportunity to be a part of this contribution.

And some thanks…

The hardest part of waiting to make this announcement is I haven’t yet had a chance to publicly thank those who believed in this project from the beginning.  I’d like to thank Mr. Elmer White for his legal counsel and support.  Congratulations on your second grandson.  Ms. Hye Yon Yi for her support and putting up with me when I was completely unavailable.  Mr. Dug Song, Patron Saint of MI Hackers, for advising me on the business side. Ms. Katrina Campau for coaching me on the investor interview.  The A2 New Tech (Ann Arbor, MI) for letting me present and being encouraging. Mr. David Groesbeck and Ms. Michelle Evanson at Why International for being my earliest business cheerleaders.  Mr. Brandon Mumby for providing hosting, even after AtD caused a hardware failure.  My colleagues, who serve at the Air Force Research Lab, for inspiring me to stay curious and keeping me in the community.  The crew in #startups on FreeNode and the Hacker News community–thanks for showing it can be done.  Of course my family (esp. my sister who let me turn the basement into a “command center”) and the makers of Mint Chocolate Chip ice-cream.