After the Deadline

How I Trie to Make Spelling Suggestions

Posted in NLP Research by rsmudge on January 29, 2010

Spell checking is a three-step process.  Check if a word is in a dictionary, generate potential suggestions, and then sort the suggestions–hopefully with the intended word on top.  Dr. Peter Norvig wrote a great essay on this and I suggest you read it.

Knowing about this three-step process, can you guess what the slowest part of it is?  It’s not the dictionary check.  It’s not the sorting of the suggestions either.  Generating the potential suggestions is the killer.  Today I’ll discuss a clever way to use to speed this process up using Tries.

The Problem: How I Generate Suggestions

The AtD spelling corrector generates potential suggestions by considering all words within two edits of the misspelled word.  An edit is inserting, deleting, or substituting a letter.  Transposing two letters is also an edit.  From my own tests I found 96% of the time the correct word is within two edits.  1.5% of the time the desired word is within three edits.  Beyond that, who cares?

Why is this operation so slow?  The naïve algorithm is to generate all possible edits to the misspelled word to get every edit within an edit distance of one.  For a word of length n, we have 54n+25 edits assuming a lowercase alphabet of a-z.  To get all words within an edit distance of two, apply all possible edits to the previously generated edits.  This number gets big quickly and forget three edits using this algorithm.

The Trie

A Trie is a tree data structure for storing values associated with a key.  Each character of the key represents a branch of the Trie.   Retrieving a value consists of finding the branch associated with the first character of the key, chopping it off, and repeating this process until there is no key.  Tries are nice because the common values of the keys are shared making them space efficient for storing things like dictionaries.

Here is the Sleep code to build a Trie:

sub trie {
	local('$trie $word $x $root $letter');

	$trie = %();

	foreach $word ($1) {
		$root = $trie;

		for ($x = 0; $x < strlen($word); $x++) {
			$letter = charAt($word, $x);

			if ($letter !in $root) {
				$root[$letter] = %();
			}
			$root = $root[$letter];
		}

		$root['__'] = $word;
	}
	return $trie;
}

This function creates a Trie represented as a Sleep hash.  Each branch in the Trie has letters which point to other Trie branches. A branch that represents a word will have a __ key with the associated word.

A word is added to the Trie by looking for the first letter of the word in the root branch. If the letter is not there, a new branch is made. This first letter is then removed from the word and the branch associated with the letter becomes the root branch. This process is repeated until there are no letters left in the word. At this point the __ key is set to mark that this branch represents a complete word.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
>
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.
include("trie.sl");
$trie = trie(@("apple", "bat", "bart", "battle", "cat"));
println($trie);
.
%(b =>
       %(a =>
              %(t =>
                     %(t =>
                            %(l =>
                                   %(e =>
                                          %(__ => 'battle')
                                    )
                             ),
                __ => 'bat'),
                r =>
                     %(t =>
                            %(__ => 'bart')
                      )
                )
         ),
  c => %(a => %(t => %(__ => 'cat'))),
  a => %(p => %(p => %(l => %(e => %(__ => 'apple')))))
 )

If you’ve been exposed to Tries before, you’ve probably seen them used to check if a word is in the dictionary or not.  Here is the Sleep code to test a word:

sub inTrie {
	local('$x $word $trie $letter');
	($word, $trie) = @_;

	for ($x = 0; $x < strlen($word); $x++) {
		$letter = charAt($word, $x);

		if ($letter !in $trie) {
			return;
		}
		$trie = $trie[$letter];
	}

	return iff ($trie['__'] eq $1);
}

This function walks the Trie using a process similar to adding a word. If there is no branch for the current letter then the test fails. If we’ve walked through all the letters, the __ key is checked to see if we indeed have a word match.

Typically __ doesn’t contain the word. It’s usually a boolean flag. I’m lazy though and put the word in this slot to make debugging the Trie code easier and to make it easier for me when testing if a word is in the Trie.

Generating Suggestions

AtD uses this Trie structure to generate all edits for a misspelled word.  The Trie is nice because I only have to traverse paths that may yield words and I don’t have to keep a lot of traversal information in memory.  Here is the code:

sub edits {
    local('$trie $word $results $depth $root $branch $letter');
	($trie, $word, $results, $depth) = @_;

	if (strlen($word) == 0 && $depth >= 0 && $trie['__'] ne '') {
		$results[ $trie['__'] ] = 1;
	}

	if ($depth >= 1) {

		# deletion. [remove the current letter, and try it on the current branch--see what happens]
		if (strlen($word) > 1) {
			edits($trie, substr($word, 1), $results, $depth - 1);
		}
		else {
			edits($trie, "", $results, $depth - 1);
		}

		foreach $letter => $branch ($trie) {
			if ($letter eq '__') { continue; }

			# insertion. [pass the current word, no changes, to each of the branches for processing]
			edits($branch, $word, $results, $depth - 1);

			# substitution. [pass the current word, sans first letter, to each of the branches for processing]
			if (strlen($word) > 1) {
				edits($branch, substr($word, 1), $results, $depth - 1);
			}
			else {
				edits($branch, "", $results, $depth - 1);
			}
		}

		# transposition. [swap the first and second letters]
		if (strlen($word) > 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0) . substr($word, 2), $results, $depth - 1);
		}
		else if (strlen($word) == 2) {
			edits($trie, charAt($word, 1) . charAt($word, 0), $results, $depth - 1);
		}
	}

	# move on to the next letter. (no edits have happened)

	if (strlen($word) >= 1 && charAt($word, 0) in $trie) {
		$letter = charAt($2, 0);
		if (strlen($word) > 1) {
			edits($trie[$letter], substr($word, 1), $results, $depth);
		}
		else if (strlen($word) == 1) {
			edits($trie[$letter], "", $results, $depth);
		}
	}

	# results are stored in a hash to prevent duplicate words
	return keys($results);
}

This function considers four types of edits: deletion, insertion, substitution, and transposition. This function returns a list of words within N edits from the specified word. It does this by considering the first letter of the word, the rest of the word, and the current branch of the Trie.

It handles deletions by applying this function to the current Trie branch and the rest of the word. In this way, all edits that could be generated from this letter being deleted are found.

With each call to edits, the function subtracts 1 from the max depth counter. When the max depth value is zero, no more traversals into the Trie are allowed. This counter makes sure we generate only words within the desired number of edits.

Insertion iterates through all child branches (representing each possible next letter) and applies the edit function to those branches with the current as an argument.  In this way each branch is treated as a possible insertion before the current letter in the word.

Substitution iterates through each child branch and applies the edit function to those branches with the rest of the word as an argument.

Transposition swaps the first and second letters of the current word and applies the edit function on this new word using the current branch.

A Trie branch that maps to a word is added to the suggestion pool when it’s called with an empty string for the word and the depth counter is greater than zero (meaning we’re within the allowed number of edits).

Once these steps have completed, the algorithm finds the Trie branch for the next letter in the word, and calls edits with the branch for that letter as the Trie and the rest of the word.  The depth counter isn’t changed as no edits were made, this is just advancing the algorithm to the next letter.

Here is a printout of edits applied to ‘sucess’ using a Trie constructed from the public domain 1939 Webster’s dictionary.

$ java -jar lib/sleep.jar
>> Welcome to the Sleep scripting language
> interact
>> Welcome to interactive mode.
Type your code and then '.' on a line by itself to execute the code.
Type Ctrl+D or 'done' on a line by itself to leave interactive mode.

include("trie.sl");

$dictionary = readAll(openf("/usr/share/dict/words"));
println(size($dictionary));
.
98569

$trie = trie($dictionary);
printAll(edits($trie, "sucess", %(), 2));
.
surest
sunless
sauces
sucks
sauce's
excess
recess
stress
Luce's
sices
access
duress
saucers
duchess
supers
guess
suers
suckers
success
sues

I’ve found this method is faster than the naïve edits generation algorithm. Let me know how it works for you.

This code is from the After the Deadline software service which is available under the LGPL license. You can find the code to the rest of the system at http://open.afterthedeadline.com

5 Responses

Subscribe to comments with RSS.

  1. dngor said, on January 29, 2010 at 8:01 pm

    that’s not how to do it.

    this is how: http://norvig.com/spell-correct.html

    • rsmudge said, on February 2, 2010 at 1:27 am

      Hi dngor,
      I’m a big fan of Dr. Norvig’s essay and learned a lot from it. That’s why I cited it in the first paragraph of this post. Dr. Norvig’s work represents a more complete spelling corrector. Here I merely present an improvement on the naive algorithm for generating the list of suggestions that a misspelled word could be. The method of sorting these suggestions is still to be presented.

      Thanks for writing in.

      — Raphael

  2. [...] an article by Raphael Mudge about using “tries” to offer accurate spelling suggestions. It’s excellent and quite thorough! Definitely recommended reading for anyone putting [...]

  3. [...] what goes on behind the scenes? Check out this recent blogpost which describes in gloriously geeky detail how ATD comes up with spelling [...]

  4. [...] How I Trie to Capture Mistakes – The paper glosses over the use of a Trie to generate a pool of suggestions. There just wasn’t enough space to explain it. This blog post covers it in detail though. Possibly related posts: (automatically generated)– Open Sourced!OSS Watch, Open Source Software Advisory Service for the UK academic communityOpen Source As A SaaS Endgame – Storytlr Joins The Elite GroupPaul Ramsey’s FOSS4G Keynote on Open Source Business [...]


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 288 other followers

%d bloggers like this: