After the Deadline

After the Deadline Bigram Corpus – Our Gift to You…

Posted in News, NLP Research by rsmudge on July 20, 2010

The zesty sauce of After the Deadline is our language model. We use our language model to improve our spelling corrector, filter ill-fitting grammar checker suggestions, and even detect if you used the wrong word.

It’s not hard to build a language model, but it can be time-consuming. Our binary model files have always been available through our GPL After the Deadline distribution.

Today, as our gift to you, we’re releasing ASCII dumps of our language models under a Creative Commons Attribution license. There is no overpriced consortium to join, and you don’t need a university affiliation to get these files.

Here are the files:

unigrams.txt.gz

This file contains one word token per line, followed by a tab separator and a count. There are 164,834 unique words covering 76,932,676 occurrences. Our spell-checker dictionary is made of the words that occur two or more times in this list.

beneficently    4
Yolande 12
Fillmore's      4
kantar  2
Kayibanda       3
Solyman 2
discourses      92
Yolanda 11
discourser      1
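The entries above can be loaded with a few lines of code. This is a minimal sketch using the sample lines from this post; in practice you would stream the decompressed unigrams.txt.gz instead of an inline string:

```python
# Parse unigram entries (word<TAB>count) and apply the spell-checker
# dictionary rule from the post: keep words that occur two or more times.
sample = """beneficently\t4
Yolande\t12
Fillmore's\t4
kantar\t2
Kayibanda\t3
Solyman\t2
discourses\t92
Yolanda\t11
discourser\t1"""

unigrams = {}
for line in sample.splitlines():
    word, count = line.rsplit("\t", 1)  # split on the last tab
    unigrams[word] = int(count)

# Words seen at least twice qualify for the spelling dictionary;
# "discourser" (count 1) is excluded.
dictionary = {w for w, c in unigrams.items() if c >= 2}
```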

bigrams.txt.gz

This file is a dump of each two-word sequence that occurs in our corpus. It has 5,612,483 word pairs, each associated with a count. You can use this information to calculate the probability of a word given the word before or after it.

military annexation     4
military deceptions     1
military language       1
military legislation    1
military sophistication 1
military officer        61
military riot   1
military conspiracy     1
military retirement     2
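A maximum-likelihood estimate of P(next | prev) divides the pair’s count by the count of the first word. Here is a sketch using the nine sample pairs above; since the unigram count for “military” isn’t shown in this excerpt, it normalizes over the sample pairs only, so the probabilities are illustrative rather than corpus-accurate:

```python
from collections import defaultdict

# Bigram counts copied from the sample above.
pairs = {
    ("military", "annexation"): 4,
    ("military", "deceptions"): 1,
    ("military", "language"): 1,
    ("military", "legislation"): 1,
    ("military", "sophistication"): 1,
    ("military", "officer"): 61,
    ("military", "riot"): 1,
    ("military", "conspiracy"): 1,
    ("military", "retirement"): 2,
}

# Normalizer: total count of pairs starting with each first word.
# (With the full corpus you would use the unigram count instead.)
totals = defaultdict(int)
for (prev, _), count in pairs.items():
    totals[prev] += count

def p_next_given_prev(prev, nxt):
    """Maximum-likelihood P(next | prev) over the sample pairs."""
    return pairs.get((prev, nxt), 0) / totals[prev]
```

Even in this tiny sample, “officer” dominates: it accounts for 61 of the 73 observed continuations of “military”.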

trigrams-homophones.txt.gz

This file has a limited set of trigrams (sequences of three words). Each trigram begins or ends with a word in our confusion set text file. You will need the information from the bigram corpus to construct trigram probabilities for these words.


a puppy given   1
a puppy for     4
a puppy dies    1
a puppy and     4
a puppy named   2
a puppy is      3
a puppy of      3
a puppy with    1
a puppy when    2
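Combining the two files as the post describes, a trigram probability is P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). This sketch uses the “a puppy …” trigrams above; the bigram count for (“a”, “puppy”) is a placeholder set to the sum of the sample trigram counts, since the real value would come from bigrams.txt.gz:

```python
# Trigram counts copied from the sample above.
trigrams = {
    ("a", "puppy", "given"): 1,
    ("a", "puppy", "for"): 4,
    ("a", "puppy", "dies"): 1,
    ("a", "puppy", "and"): 4,
    ("a", "puppy", "named"): 2,
    ("a", "puppy", "is"): 3,
    ("a", "puppy", "of"): 3,
    ("a", "puppy", "with"): 1,
    ("a", "puppy", "when"): 2,
}

# Placeholder: in practice, read this count from bigrams.txt.gz.
bigram_count = {("a", "puppy"): sum(trigrams.values())}

def p_trigram(w1, w2, w3):
    """P(w3 | w1 w2) from raw counts, with no smoothing."""
    return trigrams.get((w1, w2, w3), 0) / bigram_count[(w1, w2)]
```

A real-word-error checker can compare such scores across the members of a confusion set and flag the context where an alternative word is far more probable.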

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

5 Responses


  1. Dick Jenkin said, on July 21, 2010 at 12:00 am

    There’s only one trouble with doing your sort of work – you cannot afford to make any spelling or grammar mistakes yourselves! Have a look at the “it’s” in the last line of this paragraph!

    bigrams.txt.gz
    This file is a dump of each two-word sequence that occurs in our corpus. It has 5,612,483 word pairs associated with a count. You can use this information to calculate the probability of a word given it’s next or previous words.

    HOWEVER – I see that it has been fixed already (apostrophe removed)!! Full credit for that!
    Thanks – Dick Jenkin.

    • rsmudge said, on July 21, 2010 at 3:33 am

      Thanks for the comment. It’s always good to read something constructive from a member of the community. Yes, I should not make mistakes. I’m trying to make software that will help writers break their bad habits. One of those bad habits, unfortunately, is forgetting to run the tool.

  2. tszming said, on July 26, 2010 at 9:39 am

    How do you generate the “confusion set” and keep it updated?

    To me, this doesn’t seem to be a big set; e.g., “bar” and “bra” do not appear in the file.

    • rsmudge said, on July 27, 2010 at 12:08 am

      I maintain the set by hand. There are other sets (for example, http://www.dcs.bbk.ac.uk/~jenny/resources.html) that contain more entries. More isn’t always better though. More entries (especially arbitrary ones that haven’t been tested) may lead to more false positives.

      • tszming said, on July 27, 2010 at 1:41 am

        I agree more isn’t better, so currently I am doing research to mine those set from Wikipedia’ edit history.


