After the Deadline Bigram Corpus – Our Gift to You…
The zesty sauce of After the Deadline is our language model. We use our language model to improve our spelling corrector, filter ill-fitting grammar checker suggestions, and even detect if you used the wrong word.
It’s not hard to build a language model, but it can be time-consuming. Our binary model files have always been available through our GPL After the Deadline distribution.
Today, as our gift to you, we’re releasing ASCII dumps of our language models under a Creative Commons Attribution license. There is no overpriced consortium to join and you don’t need a university affiliation to get these files.
Here are the files:
This file contains each word token, a tab separator, and a count. It lists 164,834 distinct words drawn from 76,932,676 occurrences. Our spell checker dictionary is made of the words that have a count of two or more in this list.
beneficently 4
Yolande 12
Fillmore's 4
kantar 2
Kayibanda 3
Solyman 2
discourses 92
Yolanda 11
discourser 1
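To make the "two or more" rule concrete, here is a minimal Python sketch that reads word counts and keeps the words that qualify for a dictionary. It assumes one `word<TAB>count` entry per line, as described above. The function names and the inline sample are made up for illustration; this is not the After the Deadline code.

```python
import io

def load_word_counts(lines):
    """Parse 'word<TAB>count' lines into a dict (assumed layout: one entry per line)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        word, sep, count = line.rpartition("\t")
        if not sep:
            continue  # skip blank lines or lines without a tab
        counts[word] = int(count)
    return counts

def build_dictionary(counts, min_count=2):
    """Keep only the words seen at least min_count times."""
    return {word for word, n in counts.items() if n >= min_count}

# A few entries from the sample, joined with tabs to match the described format.
sample = io.StringIO("beneficently\t4\nYolande\t12\nkantar\t2\ndiscourser\t1\n")
word_counts = load_word_counts(sample)
dictionary = build_dictionary(word_counts)
print(sorted(dictionary))
```

To run it against the real file, pass an open file handle to `load_word_counts` in place of the `io.StringIO` sample.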
This file is a dump of each two-word sequence that occurs in our corpus. It has 5,612,483 word pairs, each associated with a count. You can use this information to calculate the probability of a word given the word that precedes or follows it.
military annexation 4
military deceptions 1
military language 1
military legislation 1
military sophistication 1
military officer 61
military riot 1
military conspiracy 1
military retirement 2
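One simple way to turn pair counts into probabilities is a maximum-likelihood estimate: divide the count of the pair by the total count of all pairs that share the known word. The Python sketch below does this with no smoothing. It assumes each line holds two words and a count separated by whitespace, which is a guess at the dump layout, and the function names are hypothetical. It is an illustration, not necessarily how After the Deadline computes its probabilities.

```python
from collections import defaultdict

def load_bigrams(lines):
    """Return pair counts plus totals keyed by first word and by second word."""
    pair_counts = {}
    first_totals = defaultdict(int)
    second_totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumed layout: word1 word2 count
        first, second, n = fields[0], fields[1], int(fields[2])
        pair_counts[(first, second)] = n
        first_totals[first] += n
        second_totals[second] += n
    return pair_counts, first_totals, second_totals

def p_next_given_previous(model, previous, nxt):
    """Estimate P(nxt | previous) as count(previous nxt) / count(previous *)."""
    pair_counts, first_totals, _ = model
    total = first_totals.get(previous, 0)
    return pair_counts.get((previous, nxt), 0) / total if total else 0.0

def p_previous_given_next(model, previous, nxt):
    """Estimate P(previous | nxt) as count(previous nxt) / count(* nxt)."""
    pair_counts, _, second_totals = model
    total = second_totals.get(nxt, 0)
    return pair_counts.get((previous, nxt), 0) / total if total else 0.0

sample_lines = [
    "military annexation 4", "military deceptions 1", "military language 1",
    "military legislation 1", "military sophistication 1", "military officer 61",
    "military riot 1", "military conspiracy 1", "military retirement 2",
]
model = load_bigrams(sample_lines)
# 61 / 73 on these nine lines alone; the full file would give a different total.
p_officer = p_next_given_previous(model, "military", "officer")
print(p_officer)
```

Unseen pairs get a probability of zero here. A real application would likely want some form of smoothing.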
This file has a limited set of trigrams (sequences of three words). Each trigram begins or ends with a word in our confusion set text file. You will need the information from the bigram corpus to construct trigram probabilities for these words.
a puppy given 1
a puppy for 4
a puppy dies 1
a puppy and 4
a puppy named 2
a puppy is 3
a puppy of 3
a puppy with 1
a puppy when 2
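The reason the bigram corpus is needed is that a trigram probability uses a bigram count as its denominator: P(w3 | w1 w2) is roughly count(w1 w2 w3) / count(w1 w2). Here is a small Python sketch of that calculation. The whitespace-separated layout and the function names are assumptions, and the count for "a puppy" is an invented number used only to make the example run. It is not taken from the bigram file.

```python
def load_ngram_counts(lines, n):
    """Parse lines of n words followed by a count (assumed whitespace-separated)."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != n + 1:
            continue  # skip anything that does not match the assumed layout
        counts[tuple(fields[:n])] = int(fields[n])
    return counts

def trigram_probability(trigram_counts, bigram_counts, w1, w2, w3):
    """Estimate P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2), without smoothing."""
    context = bigram_counts.get((w1, w2), 0)
    if context == 0:
        return 0.0
    return trigram_counts.get((w1, w2, w3), 0) / context

trigram_lines = [
    "a puppy given 1", "a puppy for 4", "a puppy dies 1", "a puppy and 4",
    "a puppy named 2", "a puppy is 3", "a puppy of 3", "a puppy with 1",
    "a puppy when 2",
]
bigram_lines = ["a puppy 40"]  # invented count, for illustration only

trigrams = load_ngram_counts(trigram_lines, 3)
bigrams = load_ngram_counts(bigram_lines, 2)
p_for = trigram_probability(trigrams, bigrams, "a", "puppy", "for")
print(p_for)
```

The same pattern works in the other direction, P(w1 | w2 w3), by dividing by the count of the trailing bigram instead.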
This work is licensed under a Creative Commons Attribution 3.0 Unported License.