<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: N-Gram Language Guessing with NGramJ</title>
	<atom:link href="http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/</link>
	<description>Natural language processing blog.</description>
	<lastBuildDate>Tue, 10 Aug 2010 21:54:09 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: rsmudge</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-962</link>
		<dc:creator><![CDATA[rsmudge]]></dc:creator>
		<pubDate>Wed, 24 Mar 2010 15:46:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-962</guid>
		<description><![CDATA[Thanks, this is fixed now.]]></description>
		<content:encoded><![CDATA[<p>Thanks, this is fixed now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: rsmudge</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-961</link>
		<dc:creator><![CDATA[rsmudge]]></dc:creator>
		<pubDate>Wed, 24 Mar 2010 15:44:24 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-961</guid>
		<description><![CDATA[I haven&#039;t tried these languages yet. One key I&#039;ve found is to make sure your character encoding is correct all around. I use UTF-8. 

1. Set it on the command line 

export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8

2. Make sure your files are encoded with UTF-8
3. Make sure Java is getting the UTF-8 hint with -Dfile.encoding=UTF-8

etc.]]></description>
		<content:encoded><![CDATA[<p>I haven&#8217;t tried these languages yet. One key I&#8217;ve found is to make sure your character encoding is correct all around. I use UTF-8. </p>
<p>1. Set it on the command line </p>
<p>export LC_CTYPE=en_US.UTF-8<br />
export LANG=en_US.UTF-8</p>
<p>2. Make sure your files are encoded with UTF-8<br />
3. Make sure Java is getting the UTF-8 hint with -Dfile.encoding=UTF-8</p>
<p>etc.</p>
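<p>A minimal sketch of the same idea on the Java side, assuming Java 7 or later. It passes the charset explicitly when decoding, so the result does not depend on the locale or on -Dfile.encoding. The class name Utf8Check and the helper readUtf8 are made up for illustration and are not part of ngramj.</p>

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    // Hypothetical helper: decode the whole file as UTF-8,
    // ignoring whatever the platform default charset happens to be.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Shows what the JVM picked up from the environment or -Dfile.encoding.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Round-trip a short Cyrillic string through a temporary file.
        // Unicode escapes keep this source file ASCII-only.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";
        Path tmp = Files.createTempFile("utf8check", ".txt");
        try {
            Files.write(tmp, sample.getBytes(StandardCharsets.UTF_8));
            System.out.println("round trip ok: " + sample.equals(readUtf8(tmp)));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

<p>If the text read this way looks right but the scores are still 0, the encoding is probably not the problem.</p>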
]]></content:encoded>
	</item>
	<item>
		<title>By: Dominique</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-958</link>
		<dc:creator><![CDATA[Dominique]]></dc:creator>
		<pubDate>Tue, 23 Mar 2010 20:11:11 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-958</guid>
		<description><![CDATA[There is a syntax error in sortid.sl

return $2[1] &lt;=&gt; $1[1];]]></description>
		<content:encoded><![CDATA[<p>There is a syntax error in sortid.sl</p>
<p>return $2[1] &lt;=&gt; $1[1];</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dominique</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-956</link>
		<dc:creator><![CDATA[Dominique]]></dc:creator>
		<pubDate>Tue, 23 Mar 2010 20:07:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-956</guid>
		<description><![CDATA[Three very interesting articles (with &quot;Generating a Plain Text Corpus from Wikipedia&quot; and &quot;All about Language Model&quot;). Did you add Arabic or Cyrillic languages, or Japanese or Chinese?

I created an Arabic ngp file and it works fine, but strangely Persian doesn&#039;t work (the score is always 0). Russian doesn&#039;t work either (with the ru.ngp file provided in ngramj). I suppose I&#039;m making a mistake with the cngram API.]]></description>
		<content:encoded><![CDATA[<p>Three very interesting articles (with &#8220;Generating a Plain Text Corpus from Wikipedia&#8221; and &#8220;All about Language Model&#8221;). Did you add Arabic or Cyrillic languages, or Japanese or Chinese?</p>
<p>I created an Arabic ngp file and it works fine, but strangely Persian doesn&#8217;t work (the score is always 0). Russian doesn&#8217;t work either (with the ru.ngp file provided in ngramj). I suppose I&#8217;m making a mistake with the cngram API.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: rsmudge</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-908</link>
		<dc:creator><![CDATA[rsmudge]]></dc:creator>
		<pubDate>Thu, 11 Mar 2010 13:32:36 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-908</guid>
		<description><![CDATA[They&#039;re probably pretty similar. I use ngramj as it&#039;s Java and AtD is written mostly in Sleep/Java. In my own tests I&#039;ve found that with one or two words it&#039;s a coin toss which language it will pick. Once you get to a full sentence that says something, it&#039;s always correct. I have to add the &quot;says something&quot; caveat because I&#039;ve found it will mischaracterize a list of names, addresses, and phone numbers.]]></description>
		<content:encoded><![CDATA[<p>They&#8217;re probably pretty similar. I use ngramj as it&#8217;s Java and AtD is written mostly in Sleep/Java. In my own tests I&#8217;ve found that with one or two words it&#8217;s a coin toss which language it will pick. Once you get to a full sentence that says something, it&#8217;s always correct. I have to add the &#8220;says something&#8221; caveat because I&#8217;ve found it will mischaracterize a list of names, addresses, and phone numbers.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevin</title>
		<link>http://blog.afterthedeadline.com/2010/02/08/n-gram-language-guessing-with-ngramj/#comment-906</link>
		<dc:creator><![CDATA[Kevin]]></dc:creator>
		<pubDate>Thu, 11 Mar 2010 08:59:50 +0000</pubDate>
		<guid isPermaLink="false">http://blog.afterthedeadline.com/?p=523#comment-906</guid>
		<description><![CDATA[How does it compare with &lt;a href=&quot;http://software.wise-guys.nl/libtextcat/&quot; rel=&quot;nofollow&quot;&gt;libtextcat&lt;/a&gt;? I found libtextcat extremely simple to use, building an LM is just

$ createfp  lang-fingerprint.txt

and it seemed to work fine to guess between the two variants of Norwegian (it characterised my bad attempts at spelling Dutch as Middle-Frisian =P)]]></description>
		<content:encoded><![CDATA[<p>How does it compare with <a href="http://software.wise-guys.nl/libtextcat/" rel="nofollow">libtextcat</a>? I found libtextcat extremely simple to use, building an LM is just</p>
<p>$ createfp  lang-fingerprint.txt</p>
<p>and it seemed to work fine to guess between the two variants of Norwegian (it characterised my bad attempts at spelling Dutch as Middle-Frisian =P)</p>
]]></content:encoded>
	</item>
</channel>
</rss>

