After the Deadline

Text Segmentation Follow Up

Posted in Multi-Lingual AtD by rsmudge on November 18, 2009

My first goal with making AtD multi-lingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step as AtD uses a statistical approach for spell checking.

Here is the Sleep code I used to test out the Java sentence and word segmentation technology:

$handle = openf(@ARGV[1]);
$text = join(" ", readAll($handle));
closef($handle);

import java.text.*;

$locale = [new Locale: @ARGV[0]];
$bi = [BreakIterator getSentenceInstance: $locale];

assert $bi !is $null : "Language fail: $locale";

[$bi setText: $text];

$index = 0;

while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);

   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];

   $ind = 0;

   while ([$wi next] != [BreakIterator DONE])
   {
      println("\t" . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }

   $index = [$bi current];
}

You can run this with: java -jar sleep.jar segment.sl [locale name] [text file]. I tried it against English, Japanese, Hebrew, and Swedish. I found the Java text segmentation isn’t smart about abbreviations which is a shame. I had friends look at some trivial Hebrew and Swedish output and they said it looked good.

This is a key piece to being able to bring AtD spell and misused word checking to another language.

Tagged with: ,

Comments Off on Text Segmentation Follow Up