Grammar Checkers – Not so much magic
I once read a quote in which an early pioneer of computerized grammar checking expressed disappointment at how little the technology has evolved. It amazes me how grammar checking is simultaneously simple (it’s rule-based) and complicated. The complicated part is the technology that makes abstractions over the raw text possible.
It helps to start with a question: what is a grammar checker? When I write grammar checker here, I’m really referring to a writing checker that looks for phrases that represent an error. Despite advances in AI, complete understanding of unstructured text is still beyond the reach of computers. Grammar checkers work by flagging text that matches patterns a human created.
A pattern can be as simple as always flagging “your welcome” and suggesting “you’re welcome”. While such patterns are easy to check for, they don’t offer much power. A grammar checker built from these rules alone would need tens or hundreds of thousands of them to have coverage that’s useful to anyone.
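To make that concrete, here is a minimal sketch of a phrase-level rule table in Python. The rule list and function name are my own illustration, not After the Deadline’s actual rule format:

```python
import re

# A tiny, hypothetical rule table: each entry maps a regex for the error
# phrase to a suggested replacement.
PHRASE_RULES = [
    (re.compile(r"\byour welcome\b", re.IGNORECASE), "you're welcome"),
    (re.compile(r"\bcould of\b", re.IGNORECASE), "could have"),
]

def check_phrases(text):
    """Return (matched_text, suggestion) pairs for every rule that fires."""
    hits = []
    for pattern, suggestion in PHRASE_RULES:
        for match in pattern.finditer(text):
            hits.append((match.group(0), suggestion))
    return hits

print(check_phrases("Thanks! Your welcome, by the way."))
# [('Your welcome', "you're welcome")]
```

Every new error phrase needs its own entry, which is why this approach alone doesn’t scale.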
Realizing this, NLP researchers came up with ways to infer information about text at a higher level and to write rules that take advantage of this higher-level information. One example is part-of-speech tagging. In elementary school you may have learned grammar by labeling words in a sentence with verb, noun, adjective, etc. Most grammar checkers do this too, and the process is called tagging. With part-of-speech information, one rule can capture many writing errors. In After the Deadline, I use part-of-speech tags to find errors where a plural noun is used with a determiner that expects a singular noun, e.g., “There is categories”.
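Here is a rough sketch of what such a tag-based rule looks like. The tokens are pre-tagged by hand with Penn Treebank tags so the example stays self-contained; a real checker would get them from a tagger:

```python
# Penn Treebank-style tags: EX = existential "there", VBZ = singular present
# verb, NNS = plural noun. A real checker would get these from a tagger;
# the hand-tagged input here just keeps the example self-contained.
tagged = [("There", "EX"), ("is", "VBZ"), ("categories", "NNS")]

def check_there_is(tokens):
    """Flag "is" followed by a plural noun, as in "There is categories"."""
    hits = []
    for (word, tag), (next_word, next_tag) in zip(tokens, tokens[1:]):
        if word.lower() == "is" and next_tag in ("NNS", "NNPS"):
            hits.append(f'"{word} {next_word}" -> "are {next_word}"')
    return hits

print(check_there_is(tagged))
# ['"is categories" -> "are categories"']
```

One rule like this covers every plural noun in the dictionary, instead of one phrase at a time.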
While tagging gives extra information about each word, the rules we can write are still limited. Some words should be grouped together, for example, proper nouns like “New York”. The process of grouping words that belong together is known as chunking. Through chunking, rule writers can have more confidence that what follows a tagged word (or chunk) really has the characteristics (plural, singular) assumed by the rule. For example, “There is many categories” should be flagged, and a decent chunker makes it easier for a rule to recognize that the phrase following “There is” is a plural noun phrase.
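Continuing the sketch, a toy chunker shows why this helps: the adjacent-token rule above misses “There is many categories” because “many” sits between the verb and the plural noun, but checking the head of the noun chunk catches it. The chunker below is my own simplification, not a real one:

```python
# Tags for "There is many categories": JJ = adjective. The token right after
# "is" is "many", so the adjacent-token rule above would miss this error.
tagged = [("There", "EX"), ("is", "VBZ"), ("many", "JJ"), ("categories", "NNS")]

def noun_chunk(tokens, start):
    """Toy chunker: group determiners/adjectives up to and including a noun,
    returning the chunk's words and the tag of its head noun."""
    words, head_tag = [], None
    for word, tag in tokens[start:]:
        if tag in ("DT", "JJ", "CD") or tag.startswith("NN"):
            words.append(word)
            head_tag = tag
            if tag.startswith("NN"):
                break
        else:
            break
    return words, head_tag

def check_there_is(tokens):
    hits = []
    for i, (word, _tag) in enumerate(tokens):
        if word.lower() == "is":
            chunk, head_tag = noun_chunk(tokens, i + 1)
            if head_tag in ("NNS", "NNPS"):  # plural head noun => flag it
                phrase = " ".join(chunk)
                hits.append(f'"is {phrase}" -> "are {phrase}"')
    return hits

print(check_there_is(tagged))
# ['"is many categories" -> "are many categories"']
```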
The next abstraction is the full parse. In a full parse, the computer tries to infer the subject, object, and verb of a sentence. The sentence’s structure is placed into a tree-like data structure that rules can refer to at will. With a full parse, a grammar checker can offer suggestions that drastically restructure the sentence (e.g., rewording passive voice), decide which sentences make no sense, and find errors whose parts are several words apart (subject-verb agreement errors).
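As an illustration (not how After the Deadline works), here is the same kind of agreement check written against a dependency parse using spaCy, which can relate a subject and verb even when they are several words apart. It assumes spaCy and its small English model are installed, and the output depends on how the model parses the sentence:

```python
import spacy

# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def agreement_errors(sentence):
    """Flag a singular present-tense verb (VBZ) whose subject, possibly
    several words away, parses as a plural noun (NNS/NNPS)."""
    doc = nlp(sentence)
    errors = []
    for token in doc:
        if token.tag_ == "VBZ":
            for child in token.children:
                if child.dep_ == "nsubj" and child.tag_ in ("NNS", "NNPS"):
                    errors.append((child.text, token.text))
    return errors

# The subject "categories" and the verb "is" are four words apart.
print(agreement_errors("The categories on the settings page is empty."))
# Expected (parse permitting): [('categories', 'is')]
```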
Regardless of the abstraction level used, grammar checkers are still rule-based. The techniques that back these abstractions can become very sophisticated. It seems much research has focused on improving these abstractions.
To move grammar checking to a new level, I expect we will need new abstractions that rule writers can draw on. I also expect that developing techniques to automatically mine rules will be a valuable research subject.