Sentence Segmentation Survey for Java
Well, it’s time to get AtD working with more languages. A good first place to start is sentence segmentation. Sentence segmentation is the problem of taking a bunch of raw text and breaking it into sentences.
Like any researcher, I start my task with a search to see what others have done. Here is what I found:
- There is a standard out there called SRX for Segmentation Rules Exchange. SRX files are XML and there is an open source Segment Java library for segmenting sentences using these rule files. There is also an editor called Ratel that lets folks edit these SRX files. LanguageTool has support for SRX files.
- Another option is to use the OpenNLP project’s tools. They have a SentenceDetectorME class that might do the trick. The problem is models are only available for English, German, Spanish, and Thai.
- I also learned that Java 1.6 has built-in tools for sentence segmentation in the java.text.* package. These were donated by IBM. Here is a quick dump of the locales supported by this package:
java -jar sleep.jar -e 'println(join(", ", [java.text.BreakIterator getAvailableLocales]));'
ja_JP, es_PE, en, ja_JP_JP, es_PA, sr_BA, mk, es_GT, ar_AE, no_NO, sq_AL, bg, ar_IQ, ar_YE, hu, pt_PT, el_CY, ar_QA, mk_MK, sv, de_CH, en_US, fi_FI, is, cs, en_MT, sl_SI, sk_SK, it, tr_TR, zh, th, ar_SA, no, en_GB, sr_CS, lt, ro, en_NZ, no_NO_NY, lt_LT, es_NI, nl, ga_IE, fr_BE, es_ES, ar_LB, ko, fr_CA, et_EE, ar_KW, sr_RS, es_US, es_MX, ar_SD, in_ID, ru, lv, es_UY, lv_LV, iw, pt_BR, ar_SY, hr, et, es_DO, fr_CH, hi_IN, es_VE, ar_BH, en_PH, ar_TN, fi, de_AT, es, nl_NL, es_EC, zh_TW, ar_JO, be, is_IS, es_CO, es_CR, es_CL, ar_EG, en_ZA, th_TH, el_GR, it_IT, ca, hu_HU, fr, en_IE, uk_UA, pl_PL, fr_LU, nl_BE, en_IN, ca_ES, ar_MA, es_BO, en_AU, sr, zh_SG, pt, uk, es_SV, ru_RU, ko_KR, vi, ar_DZ, vi_VN, sr_ME, sq, ar_LY, ar, zh_CN, be_BY, zh_HK, ja, iw_IL, bg_BG, in, mt_MT, es_PY, sl, fr_FR, cs_CZ, it_CH, ro_RO, es_PR, en_CA, de_DE, ga, de_LU, de, es_AR, sk, ms_MY, hr_HR, en_SG, da, mt, pl, ar_OM, tr, th_TH_TH, el, ms, sv_SE, da_DK, es_HN
A good survey of tools from the corpora-l mailing list is at http://mailman.uib.no/public/corpora/2007-October/005429.htm
I think I found my winner with Java’s built-in sentence segmentation tools. I haven’t evaluated the quality of the output yet (a task for tomorrow) but the fact it supports so many locales out of the box is very appealing to me. AtD-English has made it far on my simple rule-based sentence segmentation. If this API is near (or I suspect better than) what I have, this will do quite nicely.