<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>After the Deadline &#187; Multi-Lingual AtD</title>
	<atom:link href="http://blog.afterthedeadline.com/category/multi-lingual-atd/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.afterthedeadline.com</link>
	<description>Natural language processing blog.</description>
	<lastBuildDate>Wed, 30 Mar 2011 01:41:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.afterthedeadline.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>After the Deadline &#187; Multi-Lingual AtD</title>
		<link>http://blog.afterthedeadline.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.afterthedeadline.com/osd.xml" title="After the Deadline" />
	<atom:link rel='hub' href='http://blog.afterthedeadline.com/?pushpress=hub'/>
		<item>
		<title>N-Gram Language Guessing with NGramJ</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/</link>
		<comments>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 23:14:04 +0000</pubDate>
		<dc:creator>rsmudge</dc:creator>
				<category><![CDATA[Multi-Lingual AtD]]></category>
		<category><![CDATA[NLP Research]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language guessing]]></category>
		<category><![CDATA[ngramj]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523</guid>
		<description><![CDATA[NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is. In this post I&#8217;ll briefly show you how to use it from the command-line and the Java API. I&#8217;ll also show you how to generate a new language profile. I&#8217;m doing [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.afterthedeadline.com&amp;blog=9202244&amp;post=523&amp;subd=atdresearch&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://ngramj.sourceforge.net/">NGramJ</a> is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is. In this post I&#8217;ll briefly show you how to use it from the command-line and the Java API. I&#8217;ll also show you how to generate a new language profile. I&#8217;m doing this so I don&#8217;t have to figure out how to do it again.</p>
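<p>To make &#8220;counts of character sequences&#8221; concrete, here is a minimal sketch in Java of how such counts could be gathered. This is not NGramJ&#8217;s code. The class name NGramCounter and the choice of 1- to 4-character sequences are assumptions for illustration, and a real profile builder likely treats word boundaries more carefully.</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of NGramJ: counts character n-grams in a string.
final class NGramCounter {
    // Returns a map from each character sequence of length 1..maxLen to its count.
    static Map<String, Integer> count(String text, int maxLen) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int len = 1; len <= maxLen; len++) {
            for (int start = 0; start + len <= text.length(); start++) {
                counts.merge(text.substring(start, start + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print every 1- to 4-gram of a tiny sample together with its count.
        count("banana", 4).forEach((gram, n) -> System.out.println(gram + " " + n));
    }
}
```

<p>A language guesser can then compare the counts from an unknown text against stored counts for each language and pick the closest match.</p>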
<h2>Running</h2>
<p>You can get a feel for how well NGramJ works by trying it on the command line. For example:</p>
<pre>$ <strong>cat &gt;a.txt</strong>
This is a test.
$ <strong>java -jar cngram.jar -lang2 a.txt</strong>
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ <strong>cat &gt;b.txt</strong>
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ <strong>java -jar cngram.jar -lang2 b.txt</strong>
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077
</pre>
<h2>Using</h2>
<p>Something I like about this program&#8217;s API is that it&#8217;s simple. It is also thread-safe: you can hold a single static reference to the library and call it from any thread later. Here is some code <a href="http://code.google.com/p/flaptor-util/source/browse/trunk/src/com/flaptor/util/NgramJLanguageIdentifier.java">adapted from</a> the <a href="http://code.google.com/p/flaptor-util/">Flaptor Utils</a> library.</p>
<p><pre class="brush: java;">import de.spieleck.app.cngram.NGramProfiles;

protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
 NGramProfiles.Ranker ranker = profiles.getRanker();
 ranker.account(text);
 NGramProfiles.RankResult result = ranker.getRankResult();
 return result.getName(0);
}</pre></p>
<p>Now that you know how to use the library for language guessing, I&#8217;ll show you how to add a new language.</p>
<h2>Adding a New Language</h2>
<p>NGramJ comes with several language profiles but you may have a need to generate one yourself. A great source of language data is Wikipedia. I&#8217;ve written about <a href="http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/">extracting plain-text from Wikipedia</a> here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:</p>
<pre>$ <strong>java -jar cngram.jar -create id_big id_corpus.txt</strong>
new profile 'id_big.ngp' was created.
</pre>
<p>This will create an id_big.ngp file. I noticed this file is huge: several hundred kilobytes, compared to about 30K for each of the other language profiles. The next step is to clean the language profile up. To do this, I created a short <a href="http://sleep.dashnine.org">Sleep</a> script to read in the id_big.ngp file and cut any 3-gram and 4-gram sequences that occur 20K times or fewer. I chose 20K because it leaves me with a file that is about 30K. If you have less data, you&#8217;ll want to adjust this number downwards. The other language profiles use 1000 as a cut-off. If the cut-off scales with the amount of training text, then scaling my 114MB of Indonesian text by 1000/20000 gives about 5.7MB, which leads me to believe they were trained on roughly 6MB of text data.</p>
<p>Here is the script:</p>
<p><pre class="brush: perl;">%grams = ohash();
setMissPolicy(%grams, { return @(); });

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   if (strlen($gram) &lt;= 2 || $count &gt; 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

sub sortTuple {
   return $2[1] &lt;=&gt; $1[1];
}

println($banner);

printAll(map({ return join(&quot; &quot;, $1); }, sort(&amp;sortTuple, %grams[1])));
printAll(map({ return join(&quot; &quot;, $1); }, sort(&amp;sortTuple, %grams[2])));
printAll(map({ return join(&quot; &quot;, $1); }, sort(&amp;sortTuple, %grams[3])));
printAll(map({ return join(&quot; &quot;, $1); }, sort(&amp;sortTuple, %grams[4])));
</pre></p>
<p>To run the script:</p>
<pre>$ <strong>java -jar lib/sleep.jar sortit.sl id_big.ngp &gt;id.ngp</strong></pre>
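<p>If you would rather not use Sleep, the same pruning rule can be sketched in Java. This assumes the layout the script above relies on: a banner line, a line holding the ngram_count value, then one &#8220;gram count&#8221; pair per line. The class name ProfilePruner and the sample data are made up for illustration.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the pruning rule from the Sleep script; the class name is hypothetical.
final class ProfilePruner {
    // lines: banner, ngram_count line, then "gram count" entries.
    static List<String> prune(List<String> lines, int cutoff) {
        List<String> out = new ArrayList<>();
        out.add(lines.get(0));                        // keep the banner line
        for (int len = 1; len <= 4; len++) {          // emit 1-grams, then 2-grams, ...
            List<String[]> kept = new ArrayList<>();
            for (String line : lines.subList(2, lines.size())) {  // skip ngram_count
                String[] parts = line.split(" ");
                if (parts.length != 2 || parts[0].length() != len) continue;
                int count = Integer.parseInt(parts[1]);
                // 1- and 2-grams always stay; longer grams must beat the cut-off.
                if (len <= 2 || count > cutoff) kept.add(parts);
            }
            // Most frequent grams first within each length.
            kept.sort(Comparator.comparingInt((String[] p) -> Integer.parseInt(p[1])).reversed());
            for (String[] p : kept) out.add(p[0] + " " + p[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data, not a real NGramJ profile.
        List<String> sample = List.of("sample banner", "ngram_count 5",
                "a 50", "an 10", "ang 30000", "yang 5", "n 70");
        prune(sample, 20000).forEach(System.out::println);
    }
}
```

<p>As in the script, the cut-off is the one number to tune: lower it when you have less training text.</p>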
<p>The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type <strong>ant</strong> in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you&#8217;re ready to test:</p>
<pre>$ <strong>cat &gt;c.txt</strong>
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ <strong>java -jar cngram.jar -lang2 c.txt</strong>
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872</pre>
<p>As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/44a44db75f21982b563b1febf38b27ad?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rsmudge</media:title>
		</media:content>
	</item>
		<item>
		<title>Generating a Plain Text Corpus from Wikipedia</title>
		<link>http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/</link>
		<comments>http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/#comments</comments>
		<pubDate>Fri, 04 Dec 2009 22:42:45 +0000</pubDate>
		<dc:creator>rsmudge</dc:creator>
				<category><![CDATA[Multi-Lingual AtD]]></category>
		<category><![CDATA[NLP Research]]></category>

		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=324</guid>
		<description><![CDATA[AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of Extracting Text from Wikipedia by Evan Jones. Evan&#8217;s post shows how to extract the top articles from [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.afterthedeadline.com&amp;blog=9202244&amp;post=324&amp;subd=atdresearch&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of <a href="http://evanjones.ca/software/wikipedia2text.html">Extracting Text from Wikipedia</a> by <a href="http://evanjones.ca/">Evan Jones</a>.</p>
<p>Evan&#8217;s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I&#8217;ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:</p>
<ul>
<li>finish before I&#8217;m old enough to collect social security</li>
<li>tolerate errors and run to completion without my intervention</li>
</ul>
<p>Today, we&#8217;re going to do the French Wikipedia. I&#8217;m working on multi-lingual AtD and French seems like a fun language to go with. Our systems guy, <a href="http://tekartist.org/">Stephane</a>, speaks French. That&#8217;s as good a reason as any.</p>
<h3>Step 1: Download the Wikipedia Extractors Toolkit</h3>
<p>Evan made available a bunch of code for extracting plaintext from Wikipedia. To meet the two goals above I made some modifications*. So the first thing you&#8217;ll want to do is <a href="http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz">download this toolkit</a> and extract it somewhere:</p>
<p><pre class="brush: bash;">wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text</pre></p>
<p>(* see the CHANGES file to learn what modifications were made)</p>
<h3>Step 2: Download and Extract the Wikipedia Data Dump</h3>
<p>You can do this from <a href="http://download.wikimedia.org/">http://download.wikimedia.org/</a>. The archive you&#8217;ll want for any language is *-pages-articles.xml.bz2. Here is what I did:</p>
<p><pre class="brush: bash;">wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2</pre></p>
<h3>Step 3: Extract Article Data from the Wikipedia Data</h3>
<p>Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.</p>
<p>Create a directory for your output and run xmldump2files.py against the .XML file you obtained in the last step:</p>
<p><pre class="brush: bash;">mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out</pre></p>
<p>This step will take a few hours depending on your hardware.</p>
<h3>Step 4: Parse the Article Wiki Markup into XML</h3>
<p>The next step is to take the extracted articles and parse the wiki markup into an XML form that we can later recover the plain text from. There is a shell script to generate XML files for all the files in our out directory. If you have a multi-core machine, I don&#8217;t recommend running it. I prefer using one shell script per core, each executing the wiki-markup-to-XML command on part of the file set (aka poor man&#8217;s concurrent programming).</p>
<p>To generate these shell scripts, first build a list of the extracted article files:</p>
<p><pre class="brush: bash;">find out -type f | grep '\.txt$' &gt;fr.files</pre></p>
<p>Then split fr.files into several .sh files:</p>
<p><pre class="brush: bash;">java -jar sleep.jar into8.sl fr.files</pre></p>
<p>You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.</p>
<p><pre class="brush: bash;">cat &gt;launch.sh
./files0.sh &amp;
./files1.sh &amp;
./files2.sh &amp;
...
./files15.sh &amp;
^D</pre></p>
<p>Next, launch these shell scripts.</p>
<p><pre class="brush: plain;">./launch.sh</pre></p>
<p>Unfortunately this journey is filled with peril. The command run by these scripts for each file has the following comment:  <em>Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an &#8220;infinite&#8221; loop sometimes</em>. This statement is true. The PHP processes will freeze or crash. My first time through this process I had to manually watch top and kill errant processes. That babysitting is tedious and makes the process take longer than it should. To help, I&#8217;ve written a script that kills any php process that has run for more than two minutes. To launch it:</p>
<p><pre class="brush: plain;">java -jar sleep.jar watchthem.sl</pre></p>
<p>Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.</p>
<h3>Step 5: Extract Plain Text from the Articles</h3>
<p>Next we want to extract the article plaintext from the XML files. To do this:</p>
<p><pre class="brush: plain;">./wikiextract.py out french_plaintext.txt</pre></p>
<p>This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.</p>
<h3>Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing</h3>
<p>If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I&#8217;ve included a script to do this:</p>
<p><pre class="brush: bash;">mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus</pre></p>
<p>And that&#8217;s it. You now have a language corpus extracted from Wikipedia.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/44a44db75f21982b563b1febf38b27ad?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rsmudge</media:title>
		</media:content>
	</item>
		<item>
		<title>Progress on the Multi-Lingual Front</title>
		<link>http://blog.afterthedeadline.com/2009/12/03/progress-on-the-multi-lingual-front/</link>
		<comments>http://blog.afterthedeadline.com/2009/12/03/progress-on-the-multi-lingual-front/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 00:09:48 +0000</pubDate>
		<dc:creator>rsmudge</dc:creator>
				<category><![CDATA[Multi-Lingual AtD]]></category>

		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=340</guid>
		<description><![CDATA[I&#8217;m making progress on multi-lingual AtD. I&#8217;ve integrated LanguageTool into AtD. LanguageTool is a language checking tool with support for 18 languages. Creating grammar rules is a human intensive process and I&#8217;d prefer to go with an established project with a successful community process. I&#8217;m also working on creating corpus data from Wikipedia. I have [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.afterthedeadline.com&amp;blog=9202244&amp;post=340&amp;subd=atdresearch&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m making progress on multi-lingual AtD. I&#8217;ve integrated LanguageTool into AtD. <a href="http://www.languagetool.org">LanguageTool</a> is a language checking tool with support for <a href="http://www.languagetool.org/languages/">18 languages</a>. Creating grammar rules is a human intensive process and I&#8217;d prefer to go with an established project with a successful community process.</p>
<p>I&#8217;m also working on creating corpus data from Wikipedia. I have a pipeline of four steps. The longest step for each language takes 12+ hours to run and ties up my entire development server. So I&#8217;m limited to generating data for one language each night.</p>
<p>With this corpus data I have the ability to provide contextual spell checking for that language and crude statistical filtering for the LanguageTool results (assuming LT supports that language).</p>
<p>Here are <a href="http://en.wordpress.com/stats/">some stats</a> to motivate this:</p>
<p>66% of the blogs on WordPress.com are in English. This limits the utility of AtD to 66% of our userbase. By supporting the next six languages with AtD, we can provide proofreading tools to nearly 90% of the WordPress.com community. That&#8217;s pretty exciting.</p>
<p>Right now this work is in the proof of concept stage. I expect to have a French AtD (spell checking + LanguageTool grammar checking) soon. I&#8217;ll have some folks try it and tell me what their experience is. If you want to volunteer to try this out, <a href="http://www.afterthedeadline.com/contact.slp">contact me</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.afterthedeadline.com/2009/12/03/progress-on-the-multi-lingual-front/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/44a44db75f21982b563b1febf38b27ad?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rsmudge</media:title>
		</media:content>
	</item>
		<item>
		<title>Text Segmentation Follow Up</title>
		<link>http://blog.afterthedeadline.com/2009/11/18/text-segmentation-follow-up/</link>
		<comments>http://blog.afterthedeadline.com/2009/11/18/text-segmentation-follow-up/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 13:00:37 +0000</pubDate>
		<dc:creator>rsmudge</dc:creator>
				<category><![CDATA[Multi-Lingual AtD]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[text segmentation]]></category>

		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=307</guid>
		<description><![CDATA[My first goal with making AtD multi-lingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step as AtD uses a statistical approach for spell checking. Here is the Sleep code I used to test out the [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.afterthedeadline.com&amp;blog=9202244&amp;post=307&amp;subd=atdresearch&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>My first goal with making AtD multi-lingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step as AtD uses a statistical approach for spell checking. </p>
<p>Here is the <a href="http://sleep.dashnine.org/">Sleep</a> code I used to test out the Java sentence and word segmentation technology:</p>
<p><pre class="brush: perl;">$handle = openf(@ARGV[1]);
$text = join(&quot; &quot;, readAll($handle));
closef($handle);

import java.text.*;

$locale = [new Locale: @ARGV[0]];
$bi = [BreakIterator getSentenceInstance: $locale];

assert $bi !is $null : &quot;Language fail: $locale&quot;;

[$bi setText: $text];

$index = 0;

while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);

   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];

   $ind = 0;

   while ([$wi next] != [BreakIterator DONE])
   {
      println(&quot;\t&quot; . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }

   $index = [$bi current];
}</pre></p>
<p>You can run this with: <code>java -jar sleep.jar segment.sl [locale name] [text file]</code>. I tried it against English, Japanese, Hebrew, and Swedish. I found that Java&#8217;s text segmentation isn&#8217;t smart about abbreviations, which is a shame. I had friends look at some trivial Hebrew and Swedish output and they said it looked good.</p>
<p>This is a key piece of bringing AtD spell and misused word checking to another language.</p>
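<p>Since the Sleep code is only calling java.text.BreakIterator, the same test is easy to sketch in plain Java. The class name Segmenter is hypothetical. Unlike the Sleep script, this version drops the pieces that hold only spaces or punctuation, since the word iterator reports those as separate chunks.</p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of the same sentence and word segmentation.
final class Segmenter {
    // Splits text at the boundaries reported by the given BreakIterator.
    static List<String> pieces(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    static List<String> sentences(String text, Locale locale) {
        return pieces(BreakIterator.getSentenceInstance(locale), text);
    }

    // Word boundaries also delimit spaces and punctuation, so keep only
    // pieces that contain at least one letter or digit.
    static List<String> words(String sentence, Locale locale) {
        List<String> out = new ArrayList<>();
        for (String piece : pieces(BreakIterator.getWordInstance(locale), sentence)) {
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        Locale locale = Locale.ENGLISH;
        for (String s : sentences("This is a test. Here is another one.", locale)) {
            System.out.println(s.trim());
            for (String w : words(s, locale)) System.out.println("\t" + w);
        }
    }
}
```

<p>Swap Locale.ENGLISH for another locale to try a different language.</p>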
]]></content:encoded>
			<wfw:commentRss>http://blog.afterthedeadline.com/2009/11/18/text-segmentation-follow-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/44a44db75f21982b563b1febf38b27ad?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rsmudge</media:title>
		</media:content>
	</item>
		<item>
		<title>Sentence Segmentation Survey for Java</title>
		<link>http://blog.afterthedeadline.com/2009/11/17/sentence-segmentation-survey-for-java/</link>
		<comments>http://blog.afterthedeadline.com/2009/11/17/sentence-segmentation-survey-for-java/#comments</comments>
		<pubDate>Tue, 17 Nov 2009 23:27:52 +0000</pubDate>
		<dc:creator>rsmudge</dc:creator>
				<category><![CDATA[Multi-Lingual AtD]]></category>
		<category><![CDATA[AtD]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[sentence segmentation]]></category>

		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=301</guid>
		<description><![CDATA[Well, it&#8217;s time to get AtD working with more languages. A good first place to start is sentence segmentation. Sentence segmentation is the problem of taking a bunch of raw text and breaking it into sentences. Like any researcher, I start my task with a search to see what others have done. Here is what [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.afterthedeadline.com&amp;blog=9202244&amp;post=301&amp;subd=atdresearch&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Well, it&#8217;s time to get AtD working with more languages. A good first place to start is sentence segmentation. Sentence segmentation is the problem of taking a bunch of raw text and breaking it into sentences.</p>
<p>Like any researcher, I start my task with a search to see what others have done. Here is what I found:</p>
<ol>
<li>There is a standard out there called SRX for Segmentation Rules Exchange. SRX files are XML and there is an open source Segment Java library for segmenting sentences using these rule files. There is also an editor called Ratel that lets folks edit these SRX files. <a href="http://languagetool.wikidot.com/customizing-sentence-segmentation-in-srx-rules">LanguageTool</a> has support for SRX files.</li>
<li>Another option is to use the <a href="http://opennlp.sourceforge.net/">OpenNLP</a> project&#8217;s tools. They have a SentenceDetectorME class that might do the trick. The problem is models are only available for English, German, Spanish, and Thai.</li>
<li>I also learned that Java 1.6 has built-in tools for sentence segmentation in the java.text.* package. These were donated by IBM. Here is a quick dump of the locales supported by this package:
<p><code>java -jar sleep.jar -e 'println(join(", ", [java.text.BreakIterator getAvailableLocales]));'</code></p>
<p>ja_JP, es_PE, en, ja_JP_JP, es_PA, sr_BA, mk, es_GT, ar_AE, no_NO, sq_AL, bg, ar_IQ, ar_YE, hu, pt_PT, el_CY, ar_QA, mk_MK, sv, de_CH, en_US, fi_FI, is, cs, en_MT, sl_SI, sk_SK, it, tr_TR, zh, th, ar_SA, no, en_GB, sr_CS, lt, ro, en_NZ, no_NO_NY, lt_LT, es_NI, nl, ga_IE, fr_BE, es_ES, ar_LB, ko, fr_CA, et_EE, ar_KW, sr_RS, es_US, es_MX, ar_SD, in_ID, ru, lv, es_UY, lv_LV, iw, pt_BR, ar_SY, hr, et, es_DO, fr_CH, hi_IN, es_VE, ar_BH, en_PH, ar_TN, fi, de_AT, es, nl_NL, es_EC, zh_TW, ar_JO, be, is_IS, es_CO, es_CR, es_CL, ar_EG, en_ZA, th_TH, el_GR, it_IT, ca, hu_HU, fr, en_IE, uk_UA, pl_PL, fr_LU, nl_BE, en_IN, ca_ES, ar_MA, es_BO, en_AU, sr, zh_SG, pt, uk, es_SV, ru_RU, ko_KR, vi, ar_DZ, vi_VN, sr_ME, sq, ar_LY, ar, zh_CN, be_BY, zh_HK, ja, iw_IL, bg_BG, in, mt_MT, es_PY, sl, fr_FR, cs_CZ, it_CH, ro_RO, es_PR, en_CA, de_DE, ga, de_LU, de, es_AR, sk, ms_MY, hr_HR, en_SG, da, mt, pl, ar_OM, tr, th_TH_TH, el, ms, sv_SE, da_DK, es_HN</p>
</li>
</ol>
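<p>The locale dump from the last item can also be produced without Sleep. Here is a small Java sketch that calls the same BreakIterator.getAvailableLocales method. The class name BreakLocales is hypothetical, and the list you get will depend on your Java version.</p>

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Lists the locales that java.text.BreakIterator reports as available.
final class BreakLocales {
    // Returns the locale names as a sorted, comma-separated string.
    static String list() {
        return Arrays.stream(BreakIterator.getAvailableLocales())
                .map(Locale::toString)
                .filter(name -> !name.isEmpty())   // skip the root locale, if present
                .sorted()
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(list());
    }
}
```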
<p>A good survey of tools from the corpora-l mailing list is at <a href="http://mailman.uib.no/public/corpora/2007-October/005429.html">http://mailman.uib.no/public/corpora/2007-October/005429.html</a>.</p>
<p>I think I found my winner with Java&#8217;s built-in sentence segmentation tools. I haven&#8217;t evaluated the quality of the output yet (a task for tomorrow) but the fact it supports so many locales out of the box is very appealing to me. AtD-English has made it far on my simple rule-based sentence segmentation. If this API is as good as (or, as I suspect, better than) what I have, this will do quite nicely.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.afterthedeadline.com/2009/11/17/sentence-segmentation-survey-for-java/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/44a44db75f21982b563b1febf38b27ad?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rsmudge</media:title>
		</media:content>
	</item>
	</channel>
</rss>
