After the Deadline

N-Gram Language Guessing with NGramJ

Posted in Multi-Lingual AtD, NLP Research by rsmudge on February 8, 2010

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and the Java API. I’ll also show you how to generate a new language profile. I’m doing this so I don’t have to figure out how to do it again.
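To make "counts of character sequences" concrete, here is a minimal sketch of counting character n-grams in Java. This is not NGramJ's own code: the class and method names are made up for illustration, and it ignores any text normalization the library may perform before counting.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical helper, not part of NGramJ.
class NGramCountSketch {
    // Count every character sequence of length 1..maxN found in the text.
    static Map<String, Integer> countNGrams(String text, int maxN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                // Add 1 to the running count for this substring.
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=3, an=2, b=1, ba=1, n=2, na=2}
        System.out.println(countNGrams("banana", 2));
    }
}
```

A language guesser built on this idea compares the counts from an unknown text against the stored counts for each known language and picks the closest match.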

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

$ cat >a.txt
This is a test.
$ java -jar cngram.jar -lang2 a.txt
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ cat >b.txt
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ java -jar cngram.jar -lang2 b.txt
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077

Using

Something I like about the API for this program is that it’s simple. It is also thread-safe. You can instantiate a static reference for the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

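import de.spieleck.app.cngram.NGramProfiles;

// Load the language profiles once and reuse this instance.
protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
 // Get a ranker and feed it the text to classify.
 NGramProfiles.Ranker ranker = profiles.getRanker();
 ranker.account(text);
 // Return the name of the top-ranked language.
 NGramProfiles.RankResult result = ranker.getRankResult();
 return result.getName(0);
}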

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles, but you may need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

$ java -jar cngram.jar -create id_big id_corpus.txt
new profile 'id_big.ngp' was created.

This will create an id_big.ngp file. I noticed this file is huge: several hundred kilobytes, compared to about 30K for the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script that reads in the id_big.ngp file and cuts any 3-gram and 4-gram sequences that occur 20K times or fewer. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1000 as a cut-off. If counts scale with corpus size, that suggests they were trained on about 6MB of text (114MB × 1000 / 20,000) versus my 114MB of Indonesian text.
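The cut-off arithmetic can be sketched in a few lines of Java. This assumes n-gram counts grow roughly in proportion to the size of the training text, which is my guess and not something the NGramJ documentation states. The class and method names are hypothetical.

```java
// Illustrative only: assumes counts scale linearly with corpus size.
class CutoffSketch {
    // Scale a known-good cut-off to a corpus of a different size.
    static long scaledCutoff(long referenceCutoff, double referenceMb, double corpusMb) {
        return Math.round(referenceCutoff * corpusMb / referenceMb);
    }

    // Estimate the corpus size that a given cut-off implies.
    static double impliedCorpusMb(long cutoff, long referenceCutoff, double referenceMb) {
        return referenceMb * cutoff / referenceCutoff;
    }

    public static void main(String[] args) {
        // A cut-off of 1000, measured against 20000 for 114MB: prints 5.7
        System.out.println(impliedCorpusMb(1000, 20000, 114.0));
        // A starting cut-off for a 10MB corpus: prints 1754
        System.out.println(scaledCutoff(20000, 114.0, 10.0));
    }
}
```

Treat the result as a starting point and check the size of the pruned profile, since that is what the cut-off was chosen to control.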

Here is the script:

# Map each n-gram length to a list of (gram, count) tuples.
# A missing key starts out as an empty array.
%grams = ohash();
setMissPolicy(%grams, { return @(); });

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   # keep every 1-gram and 2-gram; keep longer grams only above the cut-off
   if (strlen($gram) <= 2 || $count > 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

# sort tuples by count, highest first
sub sortTuple {
   return $2[1] <=> $1[1];
}

println($banner);

# print the surviving grams, grouped by length
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

To run the script:

$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp
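If you would rather not install Sleep, the same pruning can be sketched in Java. This is a rough port of the script above, not something that ships with NGramJ. It assumes the layout the script relies on: a banner line, a line holding the ngram_count value, then one "gram count" pair per line. The class name and the UTF-8 encoding in main are my assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a hypothetical port of the Sleep script.
class ProfilePruneSketch {
    static final class Entry {
        final String gram;
        final long count;
        Entry(String gram, long count) { this.gram = gram; this.count = count; }
    }

    static List<String> prune(List<String> lines, long cutoff) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) return out;
        out.add(lines.get(0)); // banner; line 1 (ngram_count) is dropped, as in the script

        List<Entry> kept = new ArrayList<>();
        for (int i = 2; i < lines.size(); i++) {
            String[] parts = lines.get(i).split(" ");
            if (parts.length != 2) continue; // skip anything that is not "gram count"
            String gram = parts[0];
            long count = Long.parseLong(parts[1]);
            // Keep all 1- and 2-grams; keep 3- and 4-grams only above the cut-off.
            if (gram.length() <= 4 && (gram.length() <= 2 || count > cutoff)) {
                kept.add(new Entry(gram, count));
            }
        }
        // Group by gram length, then order each group by count, highest first.
        kept.sort(Comparator.comparingInt((Entry e) -> e.gram.length())
                .thenComparing(Comparator.comparingLong((Entry e) -> e.count).reversed()));
        for (Entry e : kept) out.add(e.gram + " " + e.count);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ProfilePruneSketch id_big.ngp >id.ngp
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String line : prune(lines, 20000)) System.out.println(line);
    }
}
```

The cut-off is a parameter here so you can lower it for smaller corpora without editing the filter itself.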

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:

$ cat >c.txt
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ java -jar cngram.jar -lang2 c.txt
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872

As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.