N-Gram Language Guessing with NGramJ
NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and the Java API. I’ll also show you how to generate a new language profile. I’m doing this so I don’t have to figure out how to do it again.
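To make "counts of character sequences" concrete, here is a minimal sketch of counting character n-grams in Java. It is an illustration only, not NGramJ's code: the class name NGramCounter and its count method are hypothetical, and NGramJ may normalize case or whitespace in ways this sketch does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; this is not how NGramJ itself is implemented. */
class NGramCounter {
    /**
     * Counts every character sequence of length 1..maxLength in the text.
     * The result maps each sequence to the number of times it occurs.
     */
    static Map<String, Integer> count(String text, int maxLength) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxLength; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling NGramCounter.count(text, 4) on a large corpus yields the kind of table a profile is built from; a guesser then compares the counts of an unknown text against each stored profile.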
Running
You can get a feel for how well NGramJ works by trying it on the command line. For example:
    $ cat >a.txt
    This is a test.
    $ java -jar cngram.jar -lang2 a.txt
    speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
    $ cat >b.txt
    Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
    $ java -jar cngram.jar -lang2 b.txt
    speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077
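If you call the command-line tool from another program, you may want to pull the scores out of that output. The sketch below assumes the output keeps the shape shown in the sample runs (tokens such as en:0.667 with a two- or three-letter code); that format is inferred from those runs only, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the output format is inferred from the sample runs only. */
class CngramOutput {
    // Matches tokens such as "en:0.667". "speed:" is skipped because no digits follow its colon.
    private static final Pattern SCORE = Pattern.compile("\\b([a-z]{2,3}):(\\d+\\.\\d+)");

    /** Returns language code -> score, in the order the tokens appear on the line. */
    static Map<String, Double> parseScores(String line) {
        Map<String, Double> scores = new LinkedHashMap<>();
        Matcher m = SCORE.matcher(line);
        while (m.find()) {
            scores.put(m.group(1), Double.parseDouble(m.group(2)));
        }
        return scores;
    }

    /** Returns the code with the highest score, or null if the line has no scores. */
    static String best(String line) {
        String bestCode = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : parseScores(line).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                bestCode = e.getKey();
            }
        }
        return bestCode;
    }
}
```

For in-process use the Java API is the better route; parsing text output like this is only worth it when the tool has to run as a separate process.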
Using
Something I like about this library’s API is that it’s simple. It is also thread-safe. You can instantiate a static reference for the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.
    import de.spieleck.app.cngram.NGramProfiles;

    // Load the language profiles once and reuse them for every call.
    protected NGramProfiles profiles = new NGramProfiles();

    public String getLanguage(String text) {
        // Get a ranker and feed it the text to classify.
        NGramProfiles.Ranker ranker = profiles.getRanker();
        ranker.account(text);
        NGramProfiles.RankResult result = ranker.getRankResult();
        // Return the name of the first entry in the rank result.
        return result.getName(0);
    }
Now that you know how to use the library for language guessing, I’ll show you how to add a new language.
Adding a New Language
NGramJ comes with several language profiles but you may have a need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain-text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:
    $ java -jar cngram.jar -create id_big id_corpus.txt
    new profile 'id_big.ngp' was created.
This will create an id_big.ngp file. I also noticed this file is huge: several hundred kilobytes, compared to the 30K of the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script that reads in the id_big.ngp file and cuts any 3-gram and 4-gram sequences that occur 20,000 times or fewer. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1000 as a cut-off. If the cut-off scales with the size of the corpus, this suggests they were trained on about 6MB of text data (114MB × 1000 / 20000 ≈ 5.7MB) versus my 114MB of Indonesian text.
Here is the script:
    %grams = ohash();
    setMissPolicy(%grams, { return @(); });

    $handle = openf(@ARGV[0]);
    $banner = readln($handle);
    readln($handle); # consume the ngram_count value

    while $text (readln($handle)) {
       ($gram, $count) = split(' ', $text);
       if (strlen($gram) <= 2 || $count > 20000) {
          push(%grams[strlen($gram)], @($gram, $count));
       }
    }
    closef($handle);

    sub sortTuple {
       return $2[1] <=> $1[1];
    }

    println($banner);
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));
To run the script:
$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp
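If you would rather not install Sleep, the same pruning step can be sketched in Java. This is a rough equivalent, not a tested replacement: the file layout (a banner line, an ngram_count line, then one "gram count" pair per line) is inferred from the Sleep script alone, and ProfilePruner and its prune method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the pruning step; the profile layout is inferred from the Sleep script. */
class ProfilePruner {
    /** Takes the lines of a raw profile and returns the lines of the pruned one. */
    static List<String> prune(List<String> lines, long cutoff) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        out.add(lines.get(0)); // keep the banner line
        List<String[]> kept = new ArrayList<>();
        // Index 1 holds the ngram_count value, which the Sleep script also discards.
        for (int i = 2; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            int len = parts[0].length();
            long count = Long.parseLong(parts[1]);
            // Keep all 1- and 2-grams; keep 3- and 4-grams only above the cutoff.
            if (len <= 2 || (len <= 4 && count > cutoff)) {
                kept.add(parts);
            }
        }
        // Shorter grams first, then most frequent first within each length.
        kept.sort(Comparator.comparingInt((String[] p) -> p[0].length())
                .thenComparing(Comparator.comparingLong(
                        (String[] p) -> Long.parseLong(p[1])).reversed()));
        for (String[] p : kept) {
            out.add(p[0] + " " + p[1]);
        }
        return out;
    }
}
```

To use it, read id_big.ngp into a list of lines (for example with Files.readAllLines), call ProfilePruner.prune(lines, 20000), and write the result to id.ngp. Check which character encoding the profile uses before reading it; this sketch does not assume one.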
The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:
    $ cat >c.txt
    Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
    $ java -jar cngram.jar -lang2 c.txt
    speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872
As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.