Text Segmentation Follow Up
My first goal in making AtD multilingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step because AtD uses a statistical approach for spell checking.
Here is the Sleep code I used to test out the Java sentence and word segmentation technology:
# read the whole text file (second argument) into one string
$handle = openf(@ARGV[1]);
$text = join(" ", readAll($handle));
closef($handle);
import java.text.*;
# build a sentence iterator for the locale named by the first argument
$locale = [new Locale: @ARGV[0]];
$bi = [BreakIterator getSentenceInstance: $locale];
assert $bi !is $null : "Language fail: $locale";
[$bi setText: $text];
$index = 0;
# walk the sentence boundaries; $index marks the start of the current sentence
while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);
   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];
   $ind = 0;
   while ([$wi next] != [BreakIterator DONE])
   {
      println("\t" . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }
   $index = [$bi current];
}
You can run this with: java -jar sleep.jar segment.sl [locale name] [text file]. I tried it against English, Japanese, Hebrew, and Swedish. The Java text segmentation isn’t smart about abbreviations, which is a shame: a period that ends an abbreviation can be mistaken for the end of a sentence. I had friends look at some trivial Hebrew and Swedish output, and they said it looked good.
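For anyone who would rather try this without Sleep, here is a rough sketch of the same idea in plain Java. The class name SegmentSketch and the helpers sentences and words are made up for illustration and are not part of AtD. The sketch also makes one choice the Sleep script does not: it trims each sentence and drops tokens that do not start with a letter or digit, on the assumption that a spell checker has no use for whitespace and punctuation tokens. It does nothing about the abbreviation problem.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of AtD.
class SegmentSketch {

    // Split text into trimmed sentences using the locale's sentence iterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Split a sentence into words. Assumption: only tokens that start with
    // a letter or digit are kept, so whitespace and punctuation are dropped.
    static List<String> words(String sentence, Locale locale) {
        BreakIterator wi = BreakIterator.getWordInstance(locale);
        wi.setText(sentence);
        List<String> out = new ArrayList<>();
        int start = wi.first();
        for (int end = wi.next(); end != BreakIterator.DONE; start = end, end = wi.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Locale locale = Locale.ENGLISH;
        // Print each sentence followed by its words, one per indented line.
        for (String s : sentences("Hello world. How are you?", locale)) {
            System.out.println(s);
            for (String w : words(s, locale)) {
                System.out.println("\t" + w);
            }
        }
    }
}
```

Swapping Locale.ENGLISH for another locale is the only change needed to try a different language, which is the same knob the Sleep script exposes through its first argument.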
This is a key piece of bringing AtD’s spell checking and misused word checking to another language.