Text Segmentation Follow Up
My first goal in making AtD multi-lingual is to get the spell checker working. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step, as AtD uses a statistical approach to spell checking.
Here is the Sleep code I used to test out the Java sentence and word segmentation technology:
$handle = openf(@ARGV[1]);
$text   = join(" ", readAll($handle));
closef($handle);

import java.text.*;

$locale = [new Locale: @ARGV[0]];

$bi = [BreakIterator getSentenceInstance: $locale];
assert $bi !is $null : "Language fail: $locale";

[$bi setText: $text];

$index = 0;
while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);

   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];

   $ind = 0;
   while ([$wi next] != [BreakIterator DONE])
   {
      println("\t" . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }

   $index = [$bi current];
}
You can run this with:

java -jar sleep.jar segment.sl [locale name] [text file]

I tried it against English, Japanese, Hebrew, and Swedish. I found the Java text segmentation isn't smart about abbreviations, which is a shame. I had friends look at some trivial Hebrew and Swedish output, and they said it looked good.
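For the curious, here is a minimal sketch of what that abbreviation weakness looks like, written in the same style as the script above. The sample sentence and the hard-coded "en" locale are mine, not part of the original script; with the default JDK sentence rules I would expect a break to land right after "Dr.", splitting the first sentence in two.

import java.text.*;

# a sentence with an abbreviation in the middle
$text = "I spoke to Dr. Smith today. He said hello.";

$bi = [BreakIterator getSentenceInstance: [new Locale: "en"]];
[$bi setText: $text];

$index = 0;
while ([$bi next] != [BreakIterator DONE])
{
   # the default rules likely report a boundary after "Dr. ", so
   # "I spoke to Dr. " prints as its own "sentence"
   println("[" . substr($text, $index, [$bi current]) . "]");
   $index = [$bi current];
}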
This is a key piece of bringing AtD's spell checking and misused word detection to other languages.