computer science | After the Deadline

How I Trie to Make Spelling Suggestions

Posted in NLP Research by rsmudge on January 29, 2010

Spell checking is a three-step process. Check if a word is in a dictionary, generate potential suggestions, and then sort the suggestions–hopefully with the intended word on top. Dr. Peter Norvig wrote a great essay on this and I suggest you read it.

Knowing about this three-step process, can you guess what the slowest part of it is? It’s not the dictionary check. It’s not the sorting of the suggestions either. Generating the potential suggestions is the killer. Today I’ll discuss a clever way to use to speed this process up using Tries.

The Problem: How I Generate Suggestions

The AtD spelling corrector generates potential suggestions by considering all words within two edits of the misspelled word. An edit is inserting, deleting, or substituting a letter. Transposing two letters is also an edit. From my own tests I found 96% of the time the correct word is within two edits. 1.5% of the time the desired word is within three edits. Beyond that, who cares?

Why is this operation so slow? The naïve algorithm is to generate all possible edits to the misspelled word to get every edit within an edit distance of one. For a word of length n, we have 54n+25 edits assuming a lowercase alphabet of a-z. To get all words within an edit distance of two, apply all possible edits to the previously generated edits. This number gets big quickly and forget three edits using this algorithm.

The Trie

A Trie is a tree data structure for storing values associated with a key. Each character of the key represents a branch of the Trie. Retrieving a value consists of finding the branch associated with the first character of the key, chopping it off, and repeating this process until there is no key. Tries are nice because the common values of the keys are shared making them space efficient for storing things like dictionaries.

Here is the Sleep code to build a Trie:

sub trie {
	local('$trie $word $x $root $letter');

	$trie = %();

	foreach $word ($1) {
		$root = $trie;

		for ($x = 0; $x < strlen($word); $x++) {
			$letter = charAt($word, $x);

			if ($letter !in $root) {
				$root[$letter] = %();
			}
			$root = $root[$letter];
		}

		$root['__'] = $word;
	}
	return $trie;
}

This function creates a Trie represented as a Sleep hash. Each branch in the Trie has letters which point to other Trie branches. A branch that represents a word will have a __ key with the associated word.

A word is added to the Trie by looking for the first letter of the word in the root branch. If the letter is not there, a new branch is made. This first letter is then removed from the word and the branch associated with the letter becomes the root branch. This process is repeated until there are no letters left in the word. At this point the __ key is set to mark that this branch represents a complete word.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
>
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
include("trie.sl");
$trie = trie(@("apple", "bat", "bart", "battle", "cat"));
println($trie);
.
%(b =>
       %(a =>
              %(t =>
                     %(t =>
                            %(l =>
                                   %(e =>
                                          %(__ => 'battle')
                                    )
                             ),
                __ => 'bat'),
                r =>
                     %(t =>
                            %(__ => 'bart')
                      )
                )
         ),
  c => %(a => %(t => %(__ => 'cat'))),
  a => %(p => %(p => %(l => %(e => %(__ => 'apple')))))
 )

If you’ve been exposed to Tries before, you’ve probably seen them used to check if a word is in the dictionary or not. Here is the Sleep code to test a word:

sub inTrie {
	local('$x $word $trie $letter');
	($word, $trie) = @_;

	for ($x = 0; $x < strlen($word); $x++) {
		$letter = charAt($word, $x);

		if ($letter !in $trie) {
			return;
		}
		$trie = $trie[$letter];
	}

	return iff ($trie['__'] eq $1);
}

This function walks the Trie using a process similar to adding a word. If there is no branch for the current letter then the test fails. If we’ve walked through all the letters, the __ key is checked to see if we indeed have a word match.

Typically __ doesn’t contain the word. It’s usually a boolean flag. I’m lazy though and put the word in this slot to make debugging the Trie code easier and to make it easier for me when testing if a word is in the Trie.

Generating Suggestions

AtD uses this Trie structure to generate all edits for a misspelled word. The Trie is nice because I only have to traverse paths that may yield words and I don’t have to keep a lot of traversal information in memory. Here is the code:

sub edits {
    local('$trie $word $results $depth $root $branch $letter');
	($trie, $word, $results, $depth) = @_;

	if (strlen($word) == 0 && $depth >= 0 && $trie['__'] ne '') {
		$results[ $trie['__'] ] = 1;
	}

	if ($depth >= 1) {

		# deletion. [remove the current letter, and try it on the current branch--see what happens]
		if (strlen($word) > 1) {
			edits($trie, substr($word, 1), $results, $depth - 1);
		}
		else {
			edits($trie, "", $results, $depth - 1);
		}

		foreach $letter => $branch ($trie) {
			if ($letter eq '__') { continue; }

			# insertion. [pass the current word, no changes, to each of the branches for processing]
			edits($branch, $word, $results, $depth - 1);

			# substitution. [pass the current word, sans first letter, to each of the branches for processing]
			if (strlen($word) > 1) {
				edits($branch, substr($word, 1), $results, $depth - 1);
			}
			else {
				edits($branch, "", $results, $depth - 1);
			}
		}

		# transposition. [swap the first and second letters]
		if (strlen($word) > 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0) . substr($word, 2), $results, $depth - 1);
		}
		else if (strlen($word) == 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0), $results, $depth - 1);
		}
	}

	# move on to the next letter. (no edits have happened)

	if (strlen($word) >= 1 && charAt($word, 0) in $trie) {
		$letter = charAt($2, 0);
		if (strlen($word) > 1) {
			edits($trie[$letter], substr($word, 1), $results, $depth);
		}
		else if (strlen($word) == 1) {
			edits($trie[$letter], "", $results, $depth);
		}
	}

	# results are stored in a hash to prevent duplicate words
	return keys($results);
}

This function considers four types of edits: deletion, insertion, substitution, and transposition. This function returns a list of words within N edits from the specified word. It does this by considering the first letter of the word, the rest of the word, and the current branch of the Trie.

It handles deletions by applying this function to the current Trie branch and the rest of the word. In this way, all edits that could be generated from this letter being deleted are found.

With each call to edits, the function subtracts 1 from the max depth counter. When the max depth value is zero, no more traversals into the Trie are allowed. This counter makes sure we generate only words within the desired number of edits.

Insertion iterates through all child branches (representing each possible next letter) and applies the edit function to those branches with the current as an argument. In this way each branch is treated as a possible insertion before the current letter in the word.

Substitution iterates through each child branch and applies the edit function to those branches with the rest of the word as an argument.

Transposition swaps the first and second letters of the current word and applies the edit function on this new word using the current branch.

A Trie branch that maps to a word is added to the suggestion pool when it’s called with an empty string for the word and the depth counter is greater than zero (meaning we’re within the allowed number of edits).

Once these steps have completed, the algorithm finds the Trie branch for the next letter in the word, and calls edits with the branch for that letter as the Trie and the rest of the word. The depth counter isn’t changed as no edits were made, this is just advancing the algorithm to the next letter.

Here is a printout of edits applied to ‘sucess’ using a Trie constructed from the public domain 1939 Webster’s dictionary.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.

include("trie.sl");

$dictionary = readAll(openf("/usr/share/dict/words"));
println(size($dictionary));
.
98569

$trie = trie($dictionary);
printAll(edits($trie, "sucess", %(), 2));
.
surest
sunless
sauces
sucks
sauce's
excess
recess
stress
Luce's
sices
access
duress
saucers
duchess
supers
guess
suers
suckers
success
sues

I’ve found this method is faster than the naïve edits generation algorithm. Let me know how it works for you.

This code is from the After the Deadline software service which is available under the LGPL license. You can find the code to the rest of the system at http://open.afterthedeadline.com

Tagged with: algorithms, computer science, data structures, NLP, spell checking, Trie

5 comments

Tweaking the AtD Spellchecker

Posted in NLP Research by rsmudge on September 4, 2009

Conventional wisdom says a spellchecker dictionary should have around 90,000 words. Too few words and the spellchecker will mark many correct things as wrong. Too many words and it’s more likely a typo could result in a rarely used word going unnoticed by most spellcheckers.

Assembling a good dictionary is a challenge. Many wordlists are available online but often times these are either not comprehensive or they’re too comprehensive and contain many misspellings.

AtD tries to get around this problem by intersecting a collection of wordlists with words it sees used in a corpus ( a corpus is a directory full of books, Wikipedia articles, and blog posts I “borrowed” from you). Currently AtD accepts any word seen once leading to a dictionary of 161,879 words. Too many.

Today I decided to experiment with different thresholds for how many times a word needs to be seen before it’s allowed entrance into the coveted spellchecker wordlist. My goal was to increase the accuracy of the AtD spellchecker and drop the number of misspelled words in the dictionary.

Here are the results, AtD:n means AtD requires a word be seen n times before AtD includes it in the dictionary.

ASpell Dataset (Hard to correct errors)

Engine	Words	Accuracy *	Present Words
AtD:1	161,879	55.0%	73
AtD:2	116,876	55.8%	57
AtD:3	95,910	57.3%	38
AtD:4	82,782	58.0%	30
AtD:5	73,628	58.5%	27
AtD:6	66,666	59.1%	23
ASpell (normal)	n/a	56.9%	14
Word 97	n/a	59.0%	18
Word 2000	n/a	62.6%	20
Word 2003	n/a	62.8%	20

Wikipedia Dataset (Easy to correct errors)

Engine	Words	Accuracy *	Present Words
AtD:1	161,879	87.9%	233
AtD:2	116,876	87.8%	149
AtD:3	95,910	88.0%	104
AtD:4	82,782	88.3%	72
AtD:5	73,628	88.3%	59
AtD:6	66,666	88.62%	48
ASpell (normal)	n/a	84.7%	44
Word 97	n/a	89.0%	31
Word 2000	n/a	92.5%	42
Word 2003	n/a	92.6%	41

Experiment data and comparison numbers from: Deorowicz, S., Ciura M. G., Correcting spelling errors by modelling their causes, International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285.

* Accuracy numbers show spell checking without context as the Word and ASpell checkers are not contextual (and therefor the data isn’t either).

After seeing these results, I’ve decided to settle on a threshold of 2 to start and I’ll move to 3 after no one complains about 2.

I’m not too happy that the present word count is so sky high but as I add more data to AtD and up the minimum word threshold this problem should go away. This is progress though. Six months ago I had so little data I wouldn’t have been able to use a threshold of 2, even if I wanted to.

Tagged with: computer science, dictionary, NLP, spelling

5 comments

After the Deadline