After the Deadline

N-Gram Language Guessing with NGramJ

Posted in Multi-Lingual AtD, NLP Research by rsmudge on February 8, 2010

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess the language of arbitrary text. In this post I’ll briefly show you how to use it from the command line and the Java API. I’ll also show you how to generate a new language profile. I’m writing this down so I don’t have to figure it out again.
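To make the idea concrete, here is a toy sketch of character n-gram language identification in plain Java. This is not NGramJ’s algorithm or API — the class name, training strings, and scoring below are all made up for illustration. Each “profile” is just a map of trigram counts, and we guess by counting how many of the input’s trigrams appear in each profile.

```java
import java.util.HashMap;
import java.util.Map;

public class TinyLangGuesser {
    // One profile per language: counts of character trigrams seen in training text.
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character trigrams in a string.
    private static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public void train(String lang, String corpus) {
        profiles.put(lang, trigrams(corpus));
    }

    // Score each language by how many of the text's trigrams its profile contains.
    public String guess(String text) {
        Map<String, Integer> sample = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int score = 0;
            for (String gram : sample.keySet()) {
                if (e.getValue().containsKey(gram)) score++;
            }
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyLangGuesser g = new TinyLangGuesser();
        g.train("en", "this is a test and this is another test of english text");
        g.train("de", "dies ist ein test und dies ist noch ein test deutscher texte");
        System.out.println(g.guess("this is some text")); // picks "en" for these toy corpora
    }
}
```

Real systems like NGramJ use much larger profiles, mixed n-gram lengths, and rank-based scoring, but the shape of the problem is the same.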

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

$ cat >a.txt
This is a test.
$ java -jar cngram.jar -lang2 a.txt
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ cat >b.txt
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ java -jar cngram.jar -lang2 b.txt
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077

Using

Something I like about this program’s API: it’s simple. It is also thread-safe. You can instantiate a static reference to the library once and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

import de.spieleck.app.cngram.NGramProfiles;

// The profiles are loaded once and shared; each call gets its own ranker,
// which is what makes this safe to use from multiple threads.
protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
 NGramProfiles.Ranker ranker = profiles.getRanker();
 ranker.account(text);
 NGramProfiles.RankResult result = ranker.getRankResult();
 return result.getName(0); // code of the top-ranked language, e.g. "en"
}

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles, but you may need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

$ java -jar cngram.jar -create id_big id_corpus.txt
new profile 'id_big.ngp' was created.

This will create an id_big.ngp file. I also noticed this file is huge: several hundred kilobytes, compared to about 30K for the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script to read in the id_big.ngp file and cut any 3-gram and 4-gram sequences that occur fewer than 20,000 times. I chose 20,000 because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1,000 as a cut-off. This leads me to believe they were trained on about 6MB of text data versus my 114MB of Indonesian text.

Here is the script:

%grams = ohash();
setMissPolicy(%grams, { return @(); }); # missing keys default to an empty list

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

# keep every 1- and 2-gram; keep 3- and 4-grams only above the cut-off
while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   if (strlen($gram) <= 2 || $count > 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

sub sortTuple {
   # sort tuples by count (second element), descending
   return $2[1] <=> $1[1];
}

println($banner);

printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));
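For comparison, the same cleanup can be sketched in Java. This is my translation, not part of NGramJ; it assumes the .ngp layout the Sleep script above relies on — a banner line, an ngram_count line, then one “gram count” pair per line — and, like the script, it naively splits each line on whitespace. It keeps all 1- and 2-grams, keeps longer grams only above the cut-off, and writes each length group sorted by descending count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ProfileTrim {
    // Filter raw profile lines: keep 1- and 2-grams, and longer grams above cutoff.
    public static List<String> trim(List<String> lines, int cutoff) {
        List<String> out = new ArrayList<>();
        out.add(lines.get(0));              // banner line passes through
        // lines.get(1) is the ngram_count line; dropped, as in the script
        Map<Integer, List<String[]>> groups = new TreeMap<>(); // by gram length
        for (String line : lines.subList(2, lines.size())) {
            String[] parts = line.trim().split("\\s+");
            String gram = parts[0];
            int count = Integer.parseInt(parts[1]);
            if (gram.length() <= 2 || count > cutoff) {
                groups.computeIfAbsent(gram.length(), k -> new ArrayList<>())
                      .add(parts);
            }
        }
        for (List<String[]> group : groups.values()) {
            // descending by count, mirroring sortTuple
            group.sort((a, b) -> Integer.compare(
                Integer.parseInt(b[1]), Integer.parseInt(a[1])));
            for (String[] t : group) out.add(t[0] + " " + t[1]);
        }
        return out;
    }
}
```

Reading and writing the actual files is left out; feed it the lines of id_big.ngp and write the returned lines to id.ngp.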

To run the script:

$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:

$ cat >c.txt
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ java -jar cngram.jar -lang2 c.txt
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872

As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.

6 Responses


  1. Kevin said, on March 11, 2010 at 8:59 am

    How does it compare with libtextcat? I found libtextcat extremely simple to use, building an LM is just

    $ createfp lang-fingerprint.txt

    and it seemed to work fine to guess between the two variants of Norwegian (it characterised my bad attempts at spelling Dutch as Middle-Frisian =P)

    • rsmudge said, on March 11, 2010 at 1:32 pm

      They’re probably pretty similar. I use NGramJ as it’s Java and AtD is written mostly in Sleep/Java. In my own tests I’ve found that with one or two words it’s a coin toss as to which language it will pick. Once you get beyond a full sentence that says something, it’s always correct. I have to add the “says something” caveat because I’ve found it will mischaracterize a list of names, addresses, and phone numbers.

  2. Dominique said, on March 23, 2010 at 8:07 pm

    3 very interesting articles (with “Generating a Plain Text Corpus from Wikipedia” and “All about Language Model”). Did you add Arabic or Cyrillic languages, or Japanese or Chinese?

    I created an Arabic ngp file and it works fine, but strangely Persian doesn’t work (the score is always 0). Same thing with Russian (with the ru.ngp file provided in ngramj). I suppose I’m making a mistake with the cngram API.

    • rsmudge said, on March 24, 2010 at 3:44 pm

      I haven’t tried these languages yet. One key I’ve found is to make sure your character encoding is correct all around. I use UTF-8.

      1. Set it on the command line

      export LC_CTYPE=en_US.UTF-8
      export LANG=en_US.UTF-8

      2. Make sure your files are encoded with UTF-8
      3. Make sure Java is getting the UTF-8 hint with -Dfile.encoding=UTF-8

      etc.

  3. Dominique said, on March 23, 2010 at 8:11 pm

    There is a syntax error in sortid.sl

    return $2[1] <=> $1[1];

    • rsmudge said, on March 24, 2010 at 3:46 pm

      Thanks, this is fixed now.


