After the Deadline

N-Gram Language Guessing with NGramJ

Posted in Multi-Lingual AtD, NLP Research by rsmudge on February 8, 2010

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language a piece of arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and from the Java API. I’ll also show you how to generate a new language profile. I’m writing this down so I don’t have to figure out how to do it again.
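To make the idea concrete, here is a toy sketch of n-gram language guessing: build a character n-gram count profile per language, score unknown text against each profile, and pick the best match. This is an illustration of the technique, not NGramJ's actual code; the class and method names are mine.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of character n-gram language guessing (not NGramJ's code):
// count character n-grams per language, score unknown text against each
// profile, and pick the language with the best score.
public class NGramGuess {
    // Count all character n-grams of length 1..3 in the text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Score: for each n-gram of the input, add (profile count * input count).
    static long score(Map<String, Integer> langProfile, String text) {
        long s = 0;
        for (Map.Entry<String, Integer> e : profile(text).entrySet()) {
            s += (long) langProfile.getOrDefault(e.getKey(), 0) * e.getValue();
        }
        return s;
    }

    public static void main(String[] args) {
        // In practice the profiles come from large corpora, not two phrases.
        Map<String, Integer> en = profile("this is a test the quick brown fox");
        Map<String, Integer> de = profile("das ist ein test der schnelle braune fuchs");
        String input = "this is another test";
        String guess = score(en, input) > score(de, input) ? "en" : "de";
        System.out.println(guess);
    }
}
```

Real profiles are trained on megabytes of text and normalized, but the ranking idea is the same.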

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

$ cat >a.txt
This is a test.
$ java -jar cngram.jar -lang2 a.txt
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ cat >b.txt
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ java -jar cngram.jar -lang2 b.txt
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077

Using

Something I like about this program’s API: it’s simple. It is also thread-safe, so you can instantiate a static reference to the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

import de.spieleck.app.cngram.NGramProfiles;

protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
 NGramProfiles.Ranker ranker = profiles.getRanker();
 ranker.account(text);
 NGramProfiles.RankResult result = ranker.getRankResult();
 return result.getName(0);
}

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles, but you may need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

$ java -jar cngram.jar -create id_big id_corpus.txt
new profile 'id_big.ngp' was created.

This will create an id_big.ngp file. I also noticed this file is huge: several hundred kilobytes, compared to about 30K for the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script to read in the id_big.ngp file and cut any 3-gram and 4-gram sequences that occur fewer than 20K times. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1,000 as a cut-off, which leads me to believe they were trained on roughly 6MB of text data (my cut-off is 20x theirs, and 114MB ÷ 20 ≈ 6MB) versus my 114MB of Indonesian text.

Here is the script:

%grams = ohash();
setMissPolicy(%grams, { return @(); }); # missing keys default to an empty array

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   # keep every 1- and 2-gram; keep longer grams only above the cut-off
   if (strlen($gram) <= 2 || $count > 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

# sort (gram, count) tuples by count, descending
sub sortTuple {
   return $2[1] <=> $1[1];
}

println($banner);

printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

To run the script:

$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp
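If Sleep isn’t handy, the same cleanup can be sketched in plain Java. This is my equivalent of the script above, assuming only the layout the script itself relies on: a banner line, an ngram_count line, then one "gram count" pair per line.

```java
import java.util.*;

// Same cleanup as the Sleep script, in plain Java: keep every 1- and
// 2-gram, drop 3-/4-grams at or below the cut-off, and emit each length
// group sorted by descending count, after the banner line.
public class TrimProfile {
    static final int CUTOFF = 20000;

    static List<String> trim(List<String> lines) {
        List<String> out = new ArrayList<>();
        out.add(lines.get(0));                                // keep the banner
        Map<Integer, List<String[]>> byLen = new TreeMap<>(); // group grams by length
        for (String line : lines.subList(2, lines.size())) {  // skip banner + ngram_count
            String[] parts = line.split(" ");
            if (parts[0].length() <= 2 || Integer.parseInt(parts[1]) > CUTOFF) {
                byLen.computeIfAbsent(parts[0].length(), k -> new ArrayList<>()).add(parts);
            }
        }
        for (List<String[]> group : byLen.values()) {
            group.sort((a, b) -> Integer.compare(Integer.parseInt(b[1]),
                                                 Integer.parseInt(a[1])));
            for (String[] t : group) out.add(t[0] + " " + t[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> demo = List.of("BANNER", "12345",
            "a 50000", "de 30000", "ing 25000", "xyz 100", "tion 21000");
        trim(demo).forEach(System.out::println); // "xyz 100" is cut
    }
}
```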

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to add the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar, and then you’re ready to test:

$ cat >c.txt
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ java -jar cngram.jar -lang2 c.txt
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872

As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.

Generating a Plain Text Corpus from Wikipedia

Posted in Multi-Lingual AtD, NLP Research by rsmudge on December 4, 2009

AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of Extracting Text from Wikipedia by Evan Jones.

Evan’s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I’ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:

  • finish before I’m old enough to collect social security
  • tolerate errors and run to completion without my intervention

Today, we’re going to do the French Wikipedia. I’m working on multi-lingual AtD, and French seems like a fun language to go with. Our systems guy, Stephane, speaks French. That’s as good a reason as any.

Step 1: Download the Wikipedia Extractors Toolkit

Evan made available a bunch of code for extracting plain text from Wikipedia. To meet the two goals above, I made some modifications*. So the first thing you’ll want to do is download this toolkit and extract it somewhere:

wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text

(* see the CHANGES file to learn what modifications were made)

Step 2: Download and Extract the Wikipedia Data Dump

You can do this from http://download.wikimedia.org/. The archive you’ll want for any language is *-pages-articles.xml.bz2. Here is what I did:

wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2

Step 3: Extract Article Data from the Wikipedia Data

Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.

Create a directory for your output and run xmldump2files.py against the XML file you obtained in the last step:

mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out

This step will take a few hours depending on your hardware.

Step 4: Parse the Article Wiki Markup into XML

The next step is to take the extracted articles and parse the Wikimedia markup into an XML form that we can later recover the plain text from. There is a shell script to generate XML files for all the files in our out directory. If you have a multi-core machine, I don’t recommend running it directly. I prefer using one shell script per core, each executing the Wikimedia-to-XML command on part of the file set (aka poor man’s concurrent programming).

To generate these shell scripts:

find out -type f | grep '\.txt$' >fr.files

Then split fr.files into several .sh files:

java -jar sleep.jar into8.sl fr.files
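I haven’t reproduced into8.sl here, but the splitting idea is simple: deal the file list out round-robin into N shell scripts, each running the conversion command on its share. Here is a sketch; the command string is a placeholder, not the toolkit’s actual invocation.

```java
import java.util.*;

// Sketch of what a work-splitter like into8.sl presumably does: deal the
// file list round-robin into N shell scripts, each of which runs the
// conversion command (placeholder here) on its share of the files.
public class SplitWork {
    static List<StringBuilder> split(List<String> files, int n, String command) {
        List<StringBuilder> scripts = new ArrayList<>();
        for (int i = 0; i < n; i++) scripts.add(new StringBuilder("#!/bin/sh\n"));
        for (int i = 0; i < files.size(); i++) {
            // file i goes to script (i mod n)
            scripts.get(i % n).append(command).append(" ").append(files.get(i)).append("\n");
        }
        return scripts;
    }

    public static void main(String[] args) {
        List<String> files = List.of("out/a.txt", "out/b.txt", "out/c.txt");
        // "convert-article" is a stand-in for the real wiki-to-XML command
        List<StringBuilder> scripts = split(files, 2, "convert-article");
        for (int i = 0; i < scripts.size(); i++) {
            System.out.println("--- files" + i + ".sh ---");
            System.out.print(scripts.get(i));
        }
    }
}
```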

You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.

cat >launch.sh
./files0.sh &
./files1.sh &
./files2.sh &
...
./files15.sh &
^D

Next, launch these shell scripts.

./launch.sh

Unfortunately, this journey is filled with peril. The command run by these scripts for each file carries the following comment: “Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an ‘infinite’ loop sometimes.” This statement is true: the PHP processes will freeze or crash. My first time through this process, I had to manually watch top and kill errant processes, which made the step take far longer than it should. To help, I’ve written a script that kills any PHP process that has run for more than two minutes. To launch it:

java -jar sleep.jar watchthem.sl

Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.
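The watchdog idea is easy to replicate. Here is a sketch of the same behavior in modern Java (ProcessHandle is Java 9+, so the Sleep original predates it): periodically scan for php processes older than two minutes and kill them. The class and helper names are mine, not watchthem.sl’s.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of a watchdog like watchthem.sl: every ten seconds, find php
// processes that have run longer than two minutes and kill them.
public class WatchThem {
    static final Duration LIMIT = Duration.ofMinutes(2);

    // Pure age check, separated out so it is easy to test.
    static boolean tooOld(Instant started, Instant now) {
        return Duration.between(started, now).compareTo(LIMIT) > 0;
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Instant now = Instant.now();
            ProcessHandle.allProcesses()
                .filter(p -> p.info().command().map(c -> c.contains("php")).orElse(false))
                .filter(p -> p.info().startInstant().map(s -> tooOld(s, now)).orElse(false))
                .forEach(ProcessHandle::destroy);   // politely ask it to exit
            Thread.sleep(10_000);                   // re-scan every ten seconds
        }
    }
}
```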

Step 5: Extract Plain Text from the Articles

Next we want to extract the article plaintext from the XML files. To do this:

./wikiextract.py out french_plaintext.txt

This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.

Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing

If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I’ve included a script to do this:

mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus

And that’s it. You now have a language corpus extracted from Wikipedia.
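For the curious, the splitting step can be sketched in a few lines of Java. This is my assumption of what makecorpus.sl does (chop one big plain-text file into fixed-size pieces on line boundaries), not its actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the splitting idea behind a script like makecorpus.sl:
// chop a big file's lines into fixed-size pieces on line boundaries.
public class MakeCorpus {
    // Group lines into chunks of at most linesPerFile lines each.
    static List<List<String>> chunk(List<String> lines, int linesPerFile) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerFile) {
            chunks.add(lines.subList(i, Math.min(i + linesPerFile, lines.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("one", "two", "three", "four", "five");
        List<List<String>> parts = chunk(lines, 2);
        System.out.println(parts.size()); // 3 pieces: 2 + 2 + 1 lines
    }
}
```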

Progress on the Multi-Lingual Front

Posted in Multi-Lingual AtD by rsmudge on December 3, 2009

I’m making progress on multi-lingual AtD. I’ve integrated LanguageTool into AtD. LanguageTool is a language checking tool with support for 18 languages. Creating grammar rules is a human-intensive process, and I’d prefer to go with an established project with a successful community process.

I’m also working on creating corpus data from Wikipedia. I have a pipeline of four steps. The longest step for each language takes 12+ hours to run and ties up my entire development server. So I’m limited to generating data for one language each night.

With this corpus data I have the ability to provide contextual spell checking for that language and crude statistical filtering for the LanguageTool results (assuming LT supports that language).

Here are some stats to motivate this:

66% of the blogs on WordPress.com are in English, which limits the utility of AtD to 66% of our userbase. By supporting the next six languages with AtD, we can provide proofreading tools to nearly 90% of the WordPress.com community. That’s pretty exciting.

Right now this work is in the proof of concept stage. I expect to have a French AtD (spell checking + LanguageTool grammar checking) soon. I’ll have some folks try it and tell me what their experience is. If you want to volunteer to try this out, contact me.

Text Segmentation Follow Up

Posted in Multi-Lingual AtD by rsmudge on November 18, 2009

My first goal with making AtD multi-lingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step as AtD uses a statistical approach for spell checking.

Here is the Sleep code I used to test out the Java sentence and word segmentation technology:

$handle = openf(@ARGV[1]);
$text = join(" ", readAll($handle));
closef($handle);

import java.text.*;

$locale = [new Locale: @ARGV[0]];
$bi = [BreakIterator getSentenceInstance: $locale];

assert $bi !is $null : "Language fail: $locale";

[$bi setText: $text];

$index = 0;

while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);

   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];

   $ind = 0;

   while ([$wi next] != [BreakIterator DONE])
   {
      println("\t" . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }

   $index = [$bi current];
}

You can run this with: java -jar sleep.jar segment.sl [locale name] [text file]. I tried it against English, Japanese, Hebrew, and Swedish. I found that Java’s text segmentation isn’t smart about abbreviations, which is a shame. I had friends look at some trivial Hebrew and Swedish output and they said it looked good.

This is a key piece to being able to bring AtD spell and misused word checking to another language.


Sentence Segmentation Survey for Java

Posted in Multi-Lingual AtD by rsmudge on November 17, 2009

Well, it’s time to get AtD working with more languages. A good first place to start is sentence segmentation. Sentence segmentation is the problem of taking a bunch of raw text and breaking it into sentences.

Like any researcher, I start my task with a search to see what others have done. Here is what I found:

  1. There is a standard out there called SRX (Segmentation Rules eXchange). SRX files are XML, and there is an open source Java library, Segment, for segmenting sentences using these rule files. There is also an editor called Ratel that lets folks edit these SRX files. LanguageTool has support for SRX files.
  2. Another option is to use the OpenNLP project’s tools. They have a SentenceDetectorME class that might do the trick. The problem is models are only available for English, German, Spanish, and Thai.
  3. I also learned that Java 1.6 has built-in tools for sentence segmentation in the java.text.* package. These were donated by IBM. Here is a quick dump of the locales supported by this package:

    java -jar sleep.jar -e 'println(join(", ", [java.text.BreakIterator getAvailableLocales]));'

    ja_JP, es_PE, en, ja_JP_JP, es_PA, sr_BA, mk, es_GT, ar_AE, no_NO, sq_AL, bg, ar_IQ, ar_YE, hu, pt_PT, el_CY, ar_QA, mk_MK, sv, de_CH, en_US, fi_FI, is, cs, en_MT, sl_SI, sk_SK, it, tr_TR, zh, th, ar_SA, no, en_GB, sr_CS, lt, ro, en_NZ, no_NO_NY, lt_LT, es_NI, nl, ga_IE, fr_BE, es_ES, ar_LB, ko, fr_CA, et_EE, ar_KW, sr_RS, es_US, es_MX, ar_SD, in_ID, ru, lv, es_UY, lv_LV, iw, pt_BR, ar_SY, hr, et, es_DO, fr_CH, hi_IN, es_VE, ar_BH, en_PH, ar_TN, fi, de_AT, es, nl_NL, es_EC, zh_TW, ar_JO, be, is_IS, es_CO, es_CR, es_CL, ar_EG, en_ZA, th_TH, el_GR, it_IT, ca, hu_HU, fr, en_IE, uk_UA, pl_PL, fr_LU, nl_BE, en_IN, ca_ES, ar_MA, es_BO, en_AU, sr, zh_SG, pt, uk, es_SV, ru_RU, ko_KR, vi, ar_DZ, vi_VN, sr_ME, sq, ar_LY, ar, zh_CN, be_BY, zh_HK, ja, iw_IL, bg_BG, in, mt_MT, es_PY, sl, fr_FR, cs_CZ, it_CH, ro_RO, es_PR, en_CA, de_DE, ga, de_LU, de, es_AR, sk, ms_MY, hr_HR, en_SG, da, mt, pl, ar_OM, tr, th_TH_TH, el, ms, sv_SE, da_DK, es_HN
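Option 3 is easy to try directly. Here is a minimal, self-contained java.text.BreakIterator example that splits text into sentences for a given locale:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal java.text.BreakIterator demo: split text into sentences.
public class Sentences {
    static List<String> sentences(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        // each call to next() returns the boundary after the current sentence
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        sentences("This is one sentence. Here is another! And a third?",
                  Locale.ENGLISH)
            .forEach(System.out::println); // prints three sentences
    }
}
```

BreakIterator.getWordInstance works the same way for word boundaries within a sentence.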

A good survey of tools from the corpora-l mailing list is at http://mailman.uib.no/public/corpora/2007-October/005429.htm

I think I found my winner with Java’s built-in sentence segmentation tools. I haven’t evaluated the quality of the output yet (a task for tomorrow), but the fact that it supports so many locales out of the box is very appealing. AtD-English has made it far on my simple rule-based sentence segmentation. If this API is near (or, I suspect, better than) what I have, it will do quite nicely.
