After the Deadline

N-Gram Language Guessing with NGramJ

Posted in Multi-Lingual AtD, NLP Research by rsmudge on February 8, 2010

NGramJ is a Java library for language recognition. It uses language profiles (counts of character sequences) to guess what language some arbitrary text is written in. In this post I’ll briefly show you how to use it from the command line and the Java API. I’ll also show you how to generate a new language profile. I’m writing this down so I don’t have to figure out how to do it again.

Running

You can get a feel for how well NGramJ works by trying it on the command line. For example:

$ cat >a.txt
This is a test.
$ java -jar cngram.jar -lang2 a.txt
speed: en:0.667 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=2812
$ cat >b.txt
Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten in allen Sprachen der Welt.
$ java -jar cngram.jar -lang2 b.txt
speed: de:0.857 ru:0.000 pt:0.000 .. en:0.000 |0.0E0 |0.0E0 dt=2077

Using

Something I like about the API for this program: it’s simple. It is also thread-safe. You can instantiate a static reference to the library and call it from any thread later. Here is some code adapted from the Flaptor Utils library.

import de.spieleck.app.cngram.NGramProfiles;

// load the language profiles once; NGramProfiles is thread-safe,
// so a single static reference can be called from any thread
protected NGramProfiles profiles = new NGramProfiles();

public String getLanguage(String text) {
	NGramProfiles.Ranker ranker = profiles.getRanker();
	ranker.account(text);
	NGramProfiles.RankResult result = ranker.getRankResult();
	return result.getName(0); // name of the top-ranked language profile
}
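
To use it, call getLanguage with any text. A quick usage sketch, assuming the method above lives in a class holding the profiles field:

String language = getLanguage("Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie.");
System.out.println(language); // prints the top profile name, e.g. "de"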

Now that you know how to use the library for language guessing, I’ll show you how to add a new language.

Adding a New Language

NGramJ comes with several language profiles but you may have a need to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain-text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a raw language profile. You can do this with the cngram.jar file:

$ java -jar cngram.jar -create id_big id_corpus.txt
new profile 'id_big.ngp' was created.

This creates an id_big.ngp file. I also noticed this file is huge: several hundred kilobytes, compared to the 30K of the other language profiles. The next step is to clean the language profile up. To do this, I created a short Sleep script to read in the id_big.ngp file and cut any 3-gram and 4-gram sequences that occur fewer than 20,000 times. I chose 20,000 because it leaves me with a file that is about 30K. If you have less data, you’ll want to adjust this number downwards. The other language profiles use 1,000 as a cut-off. This leads me to believe they were trained on about 6MB of text data versus my 114MB of Indonesian text.

Here is the script:

%grams = ohash();
setMissPolicy(%grams, { return @(); });

$handle = openf(@ARGV[0]);
$banner = readln($handle);
readln($handle); # consume the ngram_count value

while $text (readln($handle)) {
   ($gram, $count) = split(' ', $text);

   if (strlen($gram) <= 2 || $count > 20000) {
      push(%grams[strlen($gram)], @($gram, $count));
   }
}
closef($handle);

sub sortTuple {
   return $2[1] <=> $1[1];
}

println($banner);

printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

To run the script:

$ java -jar lib/sleep.jar sortit.sl id_big.ngp >id.ngp

The last step is to copy id.ngp into src/de/spieleck/app/cngram/ and edit src/de/spieleck/app/cngram/profiles.lst to contain the id resource. Type ant in the top-level directory of the NGramJ source code to rebuild cngram.jar and then you’re ready to test:

$ cat >c.txt
Selamat datang di Wikipedia bahasa Indonesia, ensiklopedia bebas berbahasa Indonesia
$ java -jar cngram.jar -lang2 c.txt
speed: id:0.857 ru:0.000 pt:0.000 .. de:0.000 |0.0E0 |0.0E0 dt=1872

As you can see, NGramJ is an easy library to work with. If you need to do language guessing, I recommend it.

How I Trie to Make Spelling Suggestions

Posted in NLP Research by rsmudge on January 29, 2010

Spell checking is a three-step process.  Check if a word is in a dictionary, generate potential suggestions, and then sort the suggestions–hopefully with the intended word on top.  Dr. Peter Norvig wrote a great essay on this and I suggest you read it.
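
Sketched as code, the pipeline looks something like this (a minimal sketch; dictionary, edits, and score are hypothetical helpers, not AtD’s actual API):

public String correct(String word) {
	// step 1: is the word in the dictionary?
	if (dictionary.contains(word)) {
		return word;
	}

	// step 2: generate potential suggestions within two edits
	List<String> pool = new ArrayList<String>(edits(word, 2));

	// step 3: sort the suggestions, best score first
	pool.sort(Comparator.comparingDouble((String s) -> score(s, word)).reversed());
	return pool.isEmpty() ? word : pool.get(0);
}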

Knowing about this three-step process, can you guess what the slowest part of it is?  It’s not the dictionary check.  It’s not the sorting of the suggestions either.  Generating the potential suggestions is the killer.  Today I’ll discuss a clever way to speed this process up using Tries.

The Problem: How I Generate Suggestions

The AtD spelling corrector generates potential suggestions by considering all words within two edits of the misspelled word.  An edit is inserting, deleting, or substituting a letter.  Transposing two letters is also an edit.  From my own tests I found 96% of the time the correct word is within two edits.  1.5% of the time the desired word is within three edits.  Beyond that, who cares?

Why is this operation so slow?  The naïve algorithm is to generate all possible edits to the misspelled word to get every edit within an edit distance of one.  For a word of length n, we have 54n+25 edits, assuming a lowercase alphabet of a-z.  To get all words within an edit distance of two, apply all possible edits to the previously generated edits.  This number gets big quickly, and you can forget about three edits with this algorithm.
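
To make that arithmetic concrete, here is the naïve one-edit generator as a Java sketch (the textbook Norvig-style approach, not AtD’s code). The 54n+25 figure counts operations; the set dedupes, so it ends up holding slightly fewer distinct strings:

static Set<String> edits1(String w) {
	Set<String> out = new HashSet<String>();
	int n = w.length();
	for (int i = 0; i < n; i++)        // deletions: n
		out.add(w.substring(0, i) + w.substring(i + 1));
	for (int i = 0; i < n - 1; i++)    // transpositions: n - 1
		out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
	for (int i = 0; i < n; i++)        // substitutions: 26n
		for (char c = 'a'; c <= 'z'; c++)
			out.add(w.substring(0, i) + c + w.substring(i + 1));
	for (int i = 0; i <= n; i++)       // insertions: 26(n + 1)
		for (char c = 'a'; c <= 'z'; c++)
			out.add(w.substring(0, i) + c + w.substring(i));
	return out;                        // n + (n - 1) + 26n + 26(n + 1) = 54n + 25 operations
}

Two edits means applying edits1 to each of its own results, so the work grows roughly as the square of 54n+25.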

The Trie

A Trie is a tree data structure for storing values associated with a key.  Each character of the key represents a branch of the Trie.  Retrieving a value consists of finding the branch associated with the first character of the key, chopping that character off, and repeating this process until there is no key left.  Tries are nice because common key prefixes are shared, making them space efficient for storing things like dictionaries.

Here is the Sleep code to build a Trie:

sub trie {
	local('$trie $word $x $root $letter');

	$trie = %();

	foreach $word ($1) {
		$root = $trie;

		for ($x = 0; $x < strlen($word); $x++) {
			$letter = charAt($word, $x);

			if ($letter !in $root) {
				$root[$letter] = %();
			}
			$root = $root[$letter];
		}

		$root['__'] = $word;
	}
	return $trie;
}

This function creates a Trie represented as a Sleep hash.  Each branch in the Trie has letters which point to other Trie branches. A branch that represents a word will have a __ key with the associated word.

A word is added to the Trie by looking for the first letter of the word in the root branch. If the letter is not there, a new branch is made. This first letter is then removed from the word and the branch associated with the letter becomes the root branch. This process is repeated until there are no letters left in the word. At this point the __ key is set to mark that this branch represents a complete word.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
>
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
include("trie.sl");
$trie = trie(@("apple", "bat", "bart", "battle", "cat"));
println($trie);
.
%(b =>
       %(a =>
              %(t =>
                     %(t =>
                            %(l =>
                                   %(e =>
                                          %(__ => 'battle')
                                    )
                             ),
                __ => 'bat'),
                r =>
                     %(t =>
                            %(__ => 'bart')
                      )
                )
         ),
  c => %(a => %(t => %(__ => 'cat'))),
  a => %(p => %(p => %(l => %(e => %(__ => 'apple')))))
 )

If you’ve been exposed to Tries before, you’ve probably seen them used to check if a word is in the dictionary or not.  Here is the Sleep code to test a word:

sub inTrie {
	local('$x $word $trie $letter');
	($word, $trie) = @_;

	for ($x = 0; $x < strlen($word); $x++) {
		$letter = charAt($word, $x);

		if ($letter !in $trie) {
			return;
		}
		$trie = $trie[$letter];
	}

	return iff ($trie['__'] eq $1);
}

This function walks the Trie using a process similar to adding a word. If there is no branch for the current letter then the test fails. If we’ve walked through all the letters, the __ key is checked to see if we indeed have a word match.

Typically __ doesn’t contain the word. It’s usually a boolean flag. I’m lazy though and put the word in this slot to make debugging the Trie code easier and to make it easier for me when testing if a word is in the Trie.

Generating Suggestions

AtD uses this Trie structure to generate all edits for a misspelled word.  The Trie is nice because I only have to traverse paths that may yield words and I don’t have to keep a lot of traversal information in memory.  Here is the code:

sub edits {
    local('$trie $word $results $depth $root $branch $letter');
	($trie, $word, $results, $depth) = @_;

	if (strlen($word) == 0 && $depth >= 0 && $trie['__'] ne '') {
		$results[ $trie['__'] ] = 1;
	}

	if ($depth >= 1) {

		# deletion. [remove the current letter, and try it on the current branch--see what happens]
		if (strlen($word) > 1) {
			edits($trie, substr($word, 1), $results, $depth - 1);
		}
		else {
			edits($trie, "", $results, $depth - 1);
		}

		foreach $letter => $branch ($trie) {
			if ($letter eq '__') { continue; }

			# insertion. [pass the current word, no changes, to each of the branches for processing]
			edits($branch, $word, $results, $depth - 1);

			# substitution. [pass the current word, sans first letter, to each of the branches for processing]
			if (strlen($word) > 1) {
				edits($branch, substr($word, 1), $results, $depth - 1);
			}
			else {
				edits($branch, "", $results, $depth - 1);
			}
		}

		# transposition. [swap the first and second letters]
		if (strlen($word) > 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0) . substr($word, 2), $results, $depth - 1);
		}
		else if (strlen($word) == 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0), $results, $depth - 1);
		}
	}

	# move on to the next letter. (no edits have happened)

	if (strlen($word) >= 1 && charAt($word, 0) in $trie) {
		$letter = charAt($word, 0);
		if (strlen($word) > 1) {
			edits($trie[$letter], substr($word, 1), $results, $depth);
		}
		else if (strlen($word) == 1) {
			edits($trie[$letter], "", $results, $depth);
		}
	}

	# results are stored in a hash to prevent duplicate words
	return keys($results);
}

This function considers four types of edits: deletion, insertion, substitution, and transposition. This function returns a list of words within N edits from the specified word. It does this by considering the first letter of the word, the rest of the word, and the current branch of the Trie.

It handles deletions by applying this function to the current Trie branch and the rest of the word. In this way, all edits that could be generated from this letter being deleted are found.

With each call to edits, the function subtracts 1 from the max depth counter. When the max depth value is zero, no more traversals into the Trie are allowed. This counter makes sure we generate only words within the desired number of edits.

Insertion iterates through all child branches (representing each possible next letter) and applies the edit function to those branches with the current word as an argument.  In this way each branch is treated as a possible insertion before the current letter in the word.

Substitution iterates through each child branch and applies the edit function to those branches with the rest of the word as an argument.

Transposition swaps the first and second letters of the current word and applies the edit function on this new word using the current branch.

A Trie branch that maps to a word is added to the suggestion pool when the function is called with an empty string for the word and the depth counter is zero or greater (meaning we’re within the allowed number of edits).

Once these steps have completed, the algorithm finds the Trie branch for the next letter in the word and calls edits with the branch for that letter as the Trie and the rest of the word.  The depth counter isn’t changed as no edits were made; this is just advancing the algorithm to the next letter.

Here is a printout of edits applied to ‘sucess’ using a Trie constructed from the public domain 1939 Webster’s dictionary.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.

include("trie.sl");

$dictionary = readAll(openf("/usr/share/dict/words"));
println(size($dictionary));
.
98569

$trie = trie($dictionary);
printAll(edits($trie, "sucess", %(), 2));
.
surest
sunless
sauces
sucks
sauce's
excess
recess
stress
Luce's
sices
access
duress
saucers
duchess
supers
guess
suers
suckers
success
sues

I’ve found this method is faster than the naïve edits generation algorithm. Let me know how it works for you.

This code is from the After the Deadline software service which is available under the LGPL license. You can find the code to the rest of the system at http://open.afterthedeadline.com

Text Segmentation Follow Up

Posted in Multi-Lingual AtD by rsmudge on November 18, 2009

My first goal with making AtD multi-lingual is to get the spell checker going. Yesterday I found what looks like a promising solution for splitting text into sentences and words. This is an important step as AtD uses a statistical approach for spell checking.

Here is the Sleep code I used to test out the Java sentence and word segmentation technology:

$handle = openf(@ARGV[1]);
$text = join(" ", readAll($handle));
closef($handle);

import java.text.*;

$locale = [new Locale: @ARGV[0]];
$bi = [BreakIterator getSentenceInstance: $locale];

assert $bi !is $null : "Language fail: $locale";

[$bi setText: $text];

$index = 0;

while ([$bi next] != [BreakIterator DONE])
{
   $sentence = substr($text, $index, [$bi current]);
   println($sentence);

   # print out individual words.
   $wi = [BreakIterator getWordInstance: $locale];
   [$wi setText: $sentence];

   $ind = 0;

   while ([$wi next] != [BreakIterator DONE])
   {
      println("\t" . substr($sentence, $ind, [$wi current]));
      $ind = [$wi current];
   }

   $index = [$bi current];
}

You can run this with: java -jar sleep.jar segment.sl [locale name] [text file]. I tried it against English, Japanese, Hebrew, and Swedish. I found the Java text segmentation isn’t smart about abbreviations, which is a shame. I had friends look at some trivial Hebrew and Swedish output and they said it looked good.
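
For reference, here is the same sentence loop in plain Java, which may be easier to drop into a project (a minimal sketch, without the word segmentation part):

import java.text.BreakIterator;
import java.util.Locale;

public class Segment {
	public static void main(String[] args) {
		String text = "This is one sentence. This is another one.";
		BreakIterator bi = BreakIterator.getSentenceInstance(new Locale(args[0]));
		bi.setText(text);

		int start = bi.first();
		for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
			System.out.println(text.substring(start, end).trim());
			start = end;
		}
	}
}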

This is a key piece to being able to bring AtD spell and misused word checking to another language.


Sentence Segmentation Survey for Java

Posted in Multi-Lingual AtD by rsmudge on November 17, 2009

Well, it’s time to get AtD working with more languages. A good first place to start is sentence segmentation. Sentence segmentation is the problem of taking a bunch of raw text and breaking it into sentences.

Like any researcher, I start my task with a search to see what others have done. Here is what I found:

  1. There is a standard out there called SRX for Segmentation Rules Exchange. SRX files are XML and there is an open source Segment Java library for segmenting sentences using these rule files. There is also an editor called Ratel that lets folks edit these SRX files. LanguageTool has support for SRX files.
  2. Another option is to use the OpenNLP project’s tools. They have a SentenceDetectorME class that might do the trick. The problem is that models are only available for English, German, Spanish, and Thai.
  3. I also learned that Java 1.6 has built-in tools for sentence segmentation in the java.text.* package. These were donated by IBM. Here is a quick dump of the locales supported by this package:

    java -jar sleep.jar -e 'println(join(", ", [java.text.BreakIterator getAvailableLocales]));'

    ja_JP, es_PE, en, ja_JP_JP, es_PA, sr_BA, mk, es_GT, ar_AE, no_NO, sq_AL, bg, ar_IQ, ar_YE, hu, pt_PT, el_CY, ar_QA, mk_MK, sv, de_CH, en_US, fi_FI, is, cs, en_MT, sl_SI, sk_SK, it, tr_TR, zh, th, ar_SA, no, en_GB, sr_CS, lt, ro, en_NZ, no_NO_NY, lt_LT, es_NI, nl, ga_IE, fr_BE, es_ES, ar_LB, ko, fr_CA, et_EE, ar_KW, sr_RS, es_US, es_MX, ar_SD, in_ID, ru, lv, es_UY, lv_LV, iw, pt_BR, ar_SY, hr, et, es_DO, fr_CH, hi_IN, es_VE, ar_BH, en_PH, ar_TN, fi, de_AT, es, nl_NL, es_EC, zh_TW, ar_JO, be, is_IS, es_CO, es_CR, es_CL, ar_EG, en_ZA, th_TH, el_GR, it_IT, ca, hu_HU, fr, en_IE, uk_UA, pl_PL, fr_LU, nl_BE, en_IN, ca_ES, ar_MA, es_BO, en_AU, sr, zh_SG, pt, uk, es_SV, ru_RU, ko_KR, vi, ar_DZ, vi_VN, sr_ME, sq, ar_LY, ar, zh_CN, be_BY, zh_HK, ja, iw_IL, bg_BG, in, mt_MT, es_PY, sl, fr_FR, cs_CZ, it_CH, ro_RO, es_PR, en_CA, de_DE, ga, de_LU, de, es_AR, sk, ms_MY, hr_HR, en_SG, da, mt, pl, ar_OM, tr, th_TH_TH, el, ms, sv_SE, da_DK, es_HN

A good survey of tools from the corpora-l mailing list is at http://mailman.uib.no/public/corpora/2007-October/005429.htm

I think I found my winner with Java’s built-in sentence segmentation tools. I haven’t evaluated the quality of the output yet (a task for tomorrow) but the fact that it supports so many locales out of the box is very appealing to me. AtD-English has made it far on my simple rule-based sentence segmentation. If this API is near (or, I suspect, better than) what I have, it will do quite nicely.


Making the spell checker… more awesome.

Posted in NLP Research by rsmudge on November 17, 2009

I don’t talk much about the After the Deadline spell checker. Many people believe spell checking is a solved problem and that all spell checkers must “work the same way”. AtD does more than spell checking, so I usually talk about these other things.

Despite the lack of talk, AtD’s spell checker is awesome and I’m constantly improving it. That’s what I’m going to write about today. This past weekend I was writing a post and I tried to spell “conoisur”. As you can tell, I don’t know how to spell it and my attempt isn’t even close. It’s bad enough that I make spell checkers and can’t spell; it’s terrible that AtD couldn’t give me any suggestions. I felt left out in the cold.

[Screenshot: Suggestions before this post 🙂]

Most spelling errors are one or two changes away from the intended word. AtD takes advantage of this and limits its suggestion search to words within two changes. By limiting the search space, AtD has less to choose from. This means AtD gives you the right answer more often.

Peter Norvig conducted a quick experiment where he learned that 98.9% of the errors in the Birkbeck spelling error corpus were within two edits and 76% were one edit away. These numbers are close to what I found in my own experiments months ago. I’d present those numbers but I… lost them. *cough*

From these numbers, I felt it was safe to ignore words three or more edits away from the misspelling. This works well except when I try to spell words like connoisseur or bureaucracy. For my attempts, AtD generated no suggestions.

Then it hit me! If there are no suggestions within two edits, why not try to find words within three edits? And if there are still no suggestions, why not try words within four edits? I could go on like this forever. Unfortunately, nothing is free, and doing this would kill AtD’s performance. So I decided to limit my experiment to finding words within three edits when no words are within two edits.

AtD’s spellchecker uses a neural network to score suggestions and compare them to each other. During training, the neural network converges on an optimal weighting for each feature I give it. One of the features I give to the neural network is the edit distance, normalized between 0.0 and 1.0. Edit distance is weighted highly, as a value close to 1.0 is correct 76% of the time. I feared that introducing high edit distances that are almost always correct would create something that looks like a valley to the neural network: a high value is usually correct, a low value is usually correct, and the stuff in the middle is probably not so correct. Neural networks will make good guesses but they’re dependent on being told the story in the right way. This valley thing wasn’t going to work. So I modified the feature so that an edit distance of 1 is a weight of 1.0 and everything else is 0.0.
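
In code, the feature change is tiny. A sketch, where the normalization on the first line is illustrative rather than AtD’s exact formula:

// before: normalized edit distance, e.g. 1 edit -> 1.0, 2 edits -> 0.5 (illustrative)
double editFeature = 1.0 / editDistance;

// after: a binary feature; everything beyond one edit looks the same to the network
double editFeature = (editDistance == 1) ? 1.0 : 0.0;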

Here are the numbers on AtD’s spell checker before I made these changes:

Word Pool Accuracy

Data Set                        Accuracy
sp_test_wpcm_nocontext.txt      97.80%
sp_test_aspell_nocontext.txt    79.06%

Spelling Corrector Accuracy

Data Set                        Neural Network   Freq/Edit   Random
sp_test_gutenberg_context1.txt  93.97%           83.16%      29.71%
sp_test_gutenberg_context2.txt  77.14%           59.68%      19.68%

The word pool accuracy is a measure of how often the pool of suggested words AtD generates contains the intended word for a misspelled one in the test data. If the correct word is not there, AtD won’t be able to suggest it. The corrector accuracy is a measure of how accurate AtD is at sorting the suggestion pool and putting the intended word on top. The neural network score is what AtD uses in production and includes context. The joint frequency/edit distance score is similar to the statistical corrector in How to write a spelling corrector. Without context, I don’t expect most spell checkers will do much better than that.

Here are the numbers with the changes:

Word Pool Accuracy

Data Set                        Accuracy
sp_test_wpcm_nocontext.txt      98.19% (+0.39%)
sp_test_aspell_nocontext.txt    82.58% (+3.52%)

Spelling Corrector Accuracy

Data Set                        Neural Network    Freq/Edit   Random
sp_test_gutenberg_context1.txt  94.51% (+0.54%)   83.08%      30.18%
sp_test_gutenberg_context2.txt  79.38% (+2.24%)   60.31%      23.69%

What this means for you

These little changes let AtD expand its search for suggestions when none were found. This means better recommendations for hard-to-spell words without impacting AtD’s stellar correction accuracy. I hope you enjoy it.

[Screenshot: Suggestions you'll get now 🙂]

How does this compare to MS Word, ASpell, and others?

If you want to compare these numbers with other systems, I presented numbers from similar data in another blog post. Be sure to multiply the spelling corrector accuracy by the word pool accuracy when comparing these numbers with the ones in the other post. For example: 0.9451 * 0.9891 = 0.9348 = 93.48%.

If you’d like to play around with spelling correction and neural networks, consider downloading the After the Deadline source code. Everything you need to conduct your own experiments is included and it’s open source. The data sets used here are available in the bootstrap data distribution.


Tweaking the AtD Spellchecker

Posted in NLP Research by rsmudge on September 4, 2009

Conventional wisdom says a spellchecker dictionary should have around 90,000 words.  Too few words and the spellchecker will mark many correct things as wrong.  Too many words and it’s more likely a typo will result in a rarely used word going unnoticed by most spellcheckers.

Assembling a good dictionary is a challenge.  Many wordlists are available online, but oftentimes they are either not comprehensive or too comprehensive, containing many misspellings.

AtD tries to get around this problem by intersecting a collection of wordlists with words it sees used in a corpus (a corpus is a directory full of books, Wikipedia articles, and blog posts I “borrowed” from you).  Currently AtD accepts any word seen once, leading to a dictionary of 161,879 words.  Too many.
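
A sketch of that intersection in Java (corpusWords, wordlist, and threshold are hypothetical names; the threshold is the “seen n times” cut-off the rest of this post explores):

// count how often each word appears in the corpus
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word : corpusWords) {
	Integer c = counts.get(word);
	counts.put(word, c == null ? 1 : c + 1);
}

// keep only wordlist entries seen at least `threshold` times
Set<String> dictionary = new HashSet<String>();
for (String word : wordlist) {
	Integer c = counts.get(word);
	if (c != null && c >= threshold) {
		dictionary.add(word);
	}
}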

Today I decided to experiment with different thresholds for how many times a word needs to be seen before it’s allowed entrance into the coveted spellchecker wordlist.  My goal was to increase the accuracy of the AtD spellchecker and drop the number of misspelled words in the dictionary.

Here are the results; AtD:n means AtD requires a word to be seen n times before including it in the dictionary.

ASpell Dataset (Hard to correct errors)

Engine           Words    Accuracy*  Present Words
AtD:1            161,879  55.0%      73
AtD:2            116,876  55.8%      57
AtD:3            95,910   57.3%      38
AtD:4            82,782   58.0%      30
AtD:5            73,628   58.5%      27
AtD:6            66,666   59.1%      23
ASpell (normal)  n/a      56.9%      14
Word 97          n/a      59.0%      18
Word 2000        n/a      62.6%      20
Word 2003        n/a      62.8%      20

Wikipedia Dataset (Easy to correct errors)

Engine           Words    Accuracy*  Present Words
AtD:1            161,879  87.9%      233
AtD:2            116,876  87.8%      149
AtD:3            95,910   88.0%      104
AtD:4            82,782   88.3%      72
AtD:5            73,628   88.3%      59
AtD:6            66,666   88.62%     48
ASpell (normal)  n/a      84.7%      44
Word 97          n/a      89.0%      31
Word 2000        n/a      92.5%      42
Word 2003        n/a      92.6%      41

Experiment data and comparison numbers from: Deorowicz, S., Ciura, M. G., Correcting spelling errors by modelling their causes, International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285.

* Accuracy numbers show spell checking without context, as the Word and ASpell checkers are not contextual (and therefore the data isn’t either).

After seeing these results, I’ve decided to settle on a threshold of 2 to start, and I’ll move to 3 once no one complains about 2.

I’m not too happy that the present word count is so sky high, but as I add more data to AtD and raise the minimum word threshold, this problem should go away.  This is progress though.  Six months ago I had so little data I wouldn’t have been able to use a threshold of 2, even if I wanted to.

Grammar Checkers – Not so much magic

Posted in NLP Research by rsmudge on August 31, 2009

I once read a quote where an early pioneer of computerized grammar checkers expressed disappointment about how little the technology has evolved.  It’s amazing to me how grammar checking is simultaneously simple (rule-based) and complicated.  The complicated part is the technologies that make the abstractions about the raw text possible.

It helps to start with a question: what is a grammar checker?  When I write “grammar checker” here I’m really referring to a writing checker that looks for phrases that represent an error.  Despite advances in AI, complete understanding of unstructured text is still beyond the reach of computers.  Grammar checkers work by finding patterns that a human created and flagging text that matches these patterns.

Patterns can be as simple as always flagging “your welcome” and suggesting “you’re welcome”.  While these patterns are simple to write a checker for, they don’t offer much power.  A grammar checker with any kind of coverage would need tens or hundreds of thousands of rules to be useful to anyone.

Realizing this, NLP researchers decided to come up with ways to infer information about text at a higher level and write rules that take advantage of this higher-level information.  One example is part-of-speech tagging.  In elementary school you may have learned grammar by labeling words in a sentence with verb, noun, adjective, etc.  Most grammar checkers do this too, and the process is called tagging.  With part-of-speech information, one rule can capture many writing errors.  In After the Deadline, I use part-of-speech tags to find errors where a plural noun is used with a determiner that expects a singular noun, e.g. “There is categories”.
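
As a sketch, a rule over Penn Treebank-style tags that would catch “There is categories” might look like this (the tagger and the surrounding names are assumptions for illustration, not AtD’s internals):

// tags for "There is categories" come out something like [EX, VBZ, NNS]
List<String> tags = tagger.tag(sentence); // hypothetical part-of-speech tagger

for (int i = 0; i + 1 < tags.size(); i++) {
	// a singular verb ("is") followed directly by a plural noun
	if (tags.get(i).equals("VBZ") && tags.get(i + 1).equals("NNS")) {
		System.out.println("possible agreement error near word " + (i + 1));
	}
}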

While tagging gives extra information about each word, the rules we can write are still limited.  Some words should be grouped together, for example proper nouns like “New York”.  The process of grouping words that belong together is known as chunking.  Through chunking, rule writers can have more confidence that what follows a tagged word (or chunk) really has the characteristics (plural, singular) assumed by the rule.  For example, “There is many categories” should be flagged, and a decent chunker makes it easier for a rule to realize the phrase following “There is” represents a plural noun phrase.

The next abstraction is the full parse.  A full parse is where the computer tries to infer the subject, object, and verbs in a sentence.  The structure of the sentence is placed into a tree-like data structure that the rules can refer to at will.  With a full parse, a grammar checker can offer suggestions that drastically restructure the sentence (e.g. reword passive voice), decide which sentences make no sense, and find errors whose parts are multiple words apart (subject-verb agreement errors).

Regardless of the abstraction level used, grammar checkers are still rule-based.  The techniques that back these abstractions can become very sophisticated.  It seems much research has focused on improving these abstractions.

To move grammar checking to a new level, I expect we will need new abstractions one can use when writing rules.  I also expect developing techniques to automatically mine rules will be a valuable research subject.