Attempting to Detect and Correct Out of Place Plurals and Singulars
Welcome to After the Deadline Sing-A-Long. I’m constantly trying different experiments to catch my (and occasionally your) writing errors. When AtD is open sourced, you can play along at home. Read on to learn more.
Today I’m starting a new series on this blog where I’ll show off experiments I’m conducting with AtD and share the code and ideas behind them. I call it After the Deadline Sing-A-Long. My hope is that when I finally tar up the source code and make AtD available, you can replicate some of these experiments, try your own ideas, make discoveries, and send them to me.
Detecting and Correcting Plural or Singular Words
One of the mistakes I make most often is accidentally adding an ‘s’ to a word, making it plural when I really wanted it singular. My idea? Take every plural verb and noun, convert it to its singular form, and see whether the singular form is a statistically better fit.
AtD can do this. The grammar and style checkers are rule-based, but they also use a statistical language model to filter out false positives. To set up this experiment, I created a rule file that converts plural nouns and verbs to their singular forms.
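The rule file itself didn’t survive here, but judging from the `path`, `word`, and `pivots` values in the trace output later in this post, the rules likely looked something like this — a reconstruction on my part, not the original file (I’m using `\0` as a stand-in for a back-reference to the matched word):

```
.*/NNS::word=\0:singular::pivots=\0,\0:singular
.*/VBZ::word=\0:singular::pivots=\0,\0:singular
```

The first rule matches any word tagged as a plural noun (NNS) and the second any present-tense verb with an ‘s’ (VBZ), suggesting the singular form of each.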
Each AtD rule lives on its own line and consists of declarations separated by two colons. The first declaration is a pattern representing the phrase the rule should match; it consists of one or more [word pattern]/[tag pattern] sequences, where the tag pattern is a part-of-speech tag (think noun, verb, etc.). After the pattern come the named declarations. The word= declaration is what I’m suggesting in place of the matched phrase; here I’m converting any plural noun or verb to its singular form. The pivots declaration specifies the parts of the phrase that have changed, to inform the statistical filtering I mentioned earlier.
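To make the format concrete, here is a minimal sketch of how a rule line in this shape could be split into its parts. This is my own illustration of the structure described above, not AtD’s actual parser, and the sample rule text is hypothetical:

```python
def parse_rule(line):
    """Split an AtD-style rule line into its pattern and named declarations.

    Assumes the format described above: declarations separated by '::',
    with the first declaration being the phrase pattern.
    """
    pattern, *declarations = line.split("::")
    rule = {"pattern": pattern.split()}  # one [word pattern]/[tag pattern] per token
    for decl in declarations:
        key, _, value = decl.partition("=")
        rule[key] = value
    return rule

# A hypothetical rule converting any plural noun to its singular form:
rule = parse_rule(".*/NNS::word=\\0:singular::pivots=\\0,\\0:singular")
print(rule["pattern"])  # ['.*/NNS']
print(rule["word"])     # \0:singular
```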
The next step is to create a file with some example sentences to test against. I generally do this to see if the rule does what I want. Here are two of the sentences I tried:
```
There are several people I never had a chance to thanks publicly.
After the Deadline is a tool to finds errors and correct them.
```
So, with these rules in place, here is what happened when I tested them:
```
atd@build:~/atd$ ./bin/testr.sh plural_to_singular.rules examples.txt
Warning: Dictionary loaded: 124264 words at dictionary.sl:50
Warning: Looking at: several|people|I = 0.003616585140061973 at testr.sl:24
Warning: Looking at: several|person|I = 1.1955063236931511E-4 at testr.sl:24
Warning: Looking at: to|thanks|publicly = 1.25339251574261E-6 at testr.sl:24
Warning: Looking at: to|thank|publicly = 1.7004358463574743E-4 at testr.sl:24
There are several people I never had a chance to thanks publicly.
There/EX are/VBP several/JJ people/NNS I/PRP never/RB had/VBD a/DT chance/NN to/TO thanks/NNS publicly/RB
0) [REJECT] several, people -> I
   id => 3095c361e8beeb60abebed29fe5657be
   pivots => ,:singular
   path => @('.*', 'NNS')
   word => :singular
1) [ACCEPT] to, thanks -> @('thank')
   id => 3095c361e8beeb60abebed29fe5657be
   pivots => ,:singular
   path => @('.*', 'NNS')
   word => :singular
Warning: Looking at: Deadline|is|a = 0.05030783620533642 at testr.sl:24
Warning: Looking at: Deadline|be|a = 0.00804979134152438 at testr.sl:24
Warning: Looking at: to|finds|errors = 0.0 at testr.sl:24
Warning: Looking at: to|find|errors = 0.0024240611254462076 at testr.sl:24
Warning: Looking at: finds|errors|and = 2.5553084536790455E-5 at testr.sl:24
Warning: Looking at: finds|error|and = 3.14416280096839E-4 at testr.sl:24
After the Deadline is a tool to finds errors and correct them.
After/IN the/DT Deadline/NN is/VBZ a/DT tool/NN to/TO finds/NNS errors/NNS and/CC correct/NN them/PRP
0) [REJECT] Deadline, is -> a
   id => 1ab5cd35b6146cbecbc31c8b2a6d8e96
   pivots => ,:singular
   path => @('.*', 'VBZ')
   word => :singular
1) [ACCEPT] to, finds -> @('find')
   id => 3095c361e8beeb60abebed29fe5657be
   pivots => ,:singular
   path => @('.*', 'NNS')
   word => :singular
2) [ACCEPT] finds, errors -> @('error')
   id => 3095c361e8beeb60abebed29fe5657be
   pivots => ,:singular
   path => @('.*', 'NNS')
   word => :singular
```
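The Looking at: lines show why each suggestion was accepted or rejected: for every match, the language model scores the trigram containing the original word against the same trigram with the proposed replacement. A minimal sketch of that comparison, using the probabilities from the trace above — the plain greater-than test is my simplification; AtD’s real filter may use a tuned threshold:

```python
def better_fit(p_original, p_replacement):
    """Accept a suggestion only when the replacement trigram is a
    statistically better fit than the original. A simplification of
    AtD's real filter, which may apply a tuned threshold."""
    return p_replacement > p_original

# Trigram probabilities taken from the trace output above.
cases = [
    ("several|people|I -> person", 0.003616585140061973, 1.1955063236931511e-4),  # REJECT
    ("to|thanks|publicly -> thank", 1.25339251574261e-6, 1.7004358463574743e-4),  # ACCEPT
    ("Deadline|is|a -> be",         0.05030783620533642,  0.00804979134152438),   # REJECT
    ("to|finds|errors -> find",     0.0,                  0.0024240611254462076), # ACCEPT
]
for label, p_orig, p_repl in cases:
    print(label, "ACCEPT" if better_fit(p_orig, p_repl) else "REJECT")
```

Even this crude comparison reproduces every decision in the trace: “people” stays because “several person I” is far less likely, while “thanks” and “finds” get corrected.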
And that was that. When I ran the experiment against a more substantial amount of text, too many phrases were flagged incorrectly, and I didn’t find many legitimate errors of this type. If I don’t see many obvious mistakes when applying a rule against several books, blogs, and online discussions, I ignore the rule.
In this case, the experiment showed this type of rule fails. There are options to make it better:
- Set a directive to raise the statistical threshold
- Try looking at more context (two words out, instead of one)
- Look at the output and try to create rules that look for more specific situations where this error occurs