WordPress Plugin and Front-End Component Updates
We’ve accomplished a lot lately, so I have some updates to share with you. The AtD/WordPress plugin, jQuery plugin, and TinyMCE plugin have all seen updates. Here is a list of what you get to look forward to:
jQuery API Updates
The AtD/jQuery API is the big winner in terms of fixes. This updated library builds on the AtD Core UI Module. The Core UI Module allows the jQuery API and TinyMCE plugin to share a lot of code. This means a bug fix in one is a fix in another.
The jQuery API includes a new jQuery-like syntax for attaching to a textarea. This is the technique powering the AtD Bookmarklet released last week. Do you want to add AtD to a webpage? Here is the code that does it:
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.3/jquery.min.js"></script>
<script src="http://static.afterthedeadline.com/atd-jquery/scripts/jquery.atd.textarea.js?ver=011210"></script>
<script src="http://static.afterthedeadline.com/atd-jquery/scripts/csshttprequest.js"></script>
<link rel="stylesheet" type="text/css" media="screen" href="http://static.afterthedeadline.com/atd-jquery/css/atd.css" />
<script>
jQuery(function() {
$('textarea').addProofreader();
});
</script>
That’s not all. The jQuery Textarea API takes advantage of the new contentEditable HTML 5 feature in non-IE browsers. If you’re using a new browser you can change your content from the proofreading view.
TinyMCE API Updates
The AtD/TinyMCE module now takes advantage of the Core UI Module.
WordPress Plugin Updates
The WordPress plugin user interface is now ready for localization. If you’d like to contribute, please read the call for volunteers. The Visual Editor and HTML Editor now share a lot of code (and fixes) thanks to the Core UI Module. If you’re using AtD on your WordPress blog, I highly recommend this update.
Towards a More Usable AtD with Content Editable
Working on the web is quite exciting. Standards are evolving and the ideal ways of doing things keep coming closer to reality. Proofreading text areas with AtD isn’t as natural as I’d like. Clicking a button activates a proofreading mode. This mode places the contents of the text area into a DIV and inserts the After the Deadline markup with the text. Under this model, you click the highlighted errors to select a suggestion from a context menu. Once you’re done proofreading your text area is restored with the updated contents.
This model works OK. GMail uses it. Meetup.com uses it. I believe I’ve seen it in many other places.
Last night I decided to play with the new contentEditable attribute. This new(ish) attribute allows a developer to flag an HTML element as editable. This means the user can interact with the contents of the element in place. They can type text, move the cursor, and anything else they would do with a text area. It’s exciting stuff. Up until now web apps have achieved in-place editing by creating an iframe displaying a blank page with the design mode attribute set to true. This solution is a bit heavy.
I’m looking at using contentEditable with After the Deadline. I’m almost surprised we haven’t seen it in more places. It’s a harmless attribute to add to an application. If it works, users can interact with the div as if it’s the text area until the text area is restored. If it doesn’t work, then the user is stuck interacting with the div using the conceptual edit and proofread modes.
The only challenge is dealing with newlines correctly. When you press enter, the browser creates a line break or a new paragraph (which one is unspecified). When emulating a text area this is undesirable. So I either get to make the proofread mode swallow enter key presses, or emulate text editing behavior by detecting which browser is in use and taking browser-specific action to insert a newline at the cursor. I shouldn’t have to do this, as my divs set the CSS property white-space to pre-wrap. Under this mode a newline in a div will produce a line break. It’d be nice if the contentEditable mode honored this.
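The second option, inserting the newline yourself, comes down to a small piece of string logic once the selection has been mapped to character offsets. Here is a rough sketch of that piece only. The insertNewline helper and its arguments are hypothetical (they are not part of the AtD libraries), and it assumes you have already translated the browser's selection into offsets within the div's text.

```javascript
// Hypothetical helper, not part of the AtD libraries.
// Replaces the selected range [start, end) with a single "\n" and
// reports where the caret should sit afterwards.
function insertNewline(text, start, end) {
    // Order and clamp the offsets so a backwards selection still works.
    var from = Math.max(0, Math.min(start, end));
    var to = Math.min(text.length, Math.max(start, end));

    return {
        text: text.slice(0, from) + '\n' + text.slice(to),
        caret: from + 1
    };
}

// Replace the space between the two words with a newline.
var demo = insertNewline('Hello world', 5, 6);
console.log(JSON.stringify(demo)); // demo.text is "Hello\nworld", demo.caret is 6
```

A keydown handler for the enter key could cancel the browser's default action, apply the returned text to the div, and restore the caret. Because the div uses white-space: pre-wrap, the inserted "\n" shows up as a line break.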
Despite this small hang-up, contentEditable is an exciting change and we look forward to bringing it to a proofreading plugin near you.
Spell and Grammar Check Bookmarklet
Today I’d like to present a new toy for you–the AtD Bookmarklet. With it, you can click “Add Proofreader” from your bookmark bar and an AtD button will magically appear above every text area on the current page. You now have the ability to check spelling, grammar, and misused words from your browser.
You can get it here. What can you do with this new bookmarklet?
1. Look smart when posting to Hacker News
2. Check your tweets before they go out
3. Avoid an embarrassing mistake on your LinkedIn profile
4. Spell check your comments–on any blog!
This proofreading technology uses an open source back-end. We also have libraries for jQuery, TinyMCE, and CKEditor to make it easy to embed After the Deadline into your application.
Chrome Update – I’ve received several reports (and have verified) that this bookmarklet does not work in Google Chrome. The AtD libraries work with Chrome and everything is happy when requests are to and from the same host. When I find a fix, I’ll post something here.
Chrome Update 2 – After investigating with Google Chrome, I believe this bookmarklet communicates in a way that conflicts with Chrome’s pop-up blocker or browser security policy.
Add Grammar and Spell Check to Any WYSIWYG Editor
After the Deadline front-end libraries are available for jQuery and TinyMCE. You’ve asked what it would take to make AtD available in another WYSIWYG environment or even in a web-based word processor. In the past I’d answer that you’d need to study the code to either the jQuery or TinyMCE extensions, visit the mountains, meditate and wait for the answer to come.
Today I’m bringing the mountain to you. Much of the code that makes After the Deadline work in TinyMCE and jQuery is similar. Painfully similar. To the point where a bug in one is a bug in the other. I’ve refactored these extensions and created an After the Deadline Core UI module. This module is browser independent with no external dependencies. It provides functionality to parse the AtD XML into an error data structure, retrieve suggestions and other information given an error phrase and a word of context, and it also abstracts away the logic to traverse a DOM and insert the AtD markup for errors.
To test this module, I gave myself one working day to port After the Deadline to CKEditor. CKEditor is a WYSIWYG editor, similar to TinyMCE. I have never used CKEditor before and this was a challenge. Still, I was able to make it quite far and in this post, I’ll show you how to add AtD to an editor using the Core UI Module.
Here is what an After the Deadline editor plugin must do:
- The first step is to set up the AtD Core UI module:
atd_core = new AtDCore();
- The next step is to define several functions that the AtD Core UI module expects. The module will use these functions to manipulate the DOM and find elements in the way that your environment expects. The full list of functions is documented in the AtD Core UI README. Here are a few of these functions from CKEditor:
atd_core.replaceWith = function(old_node, new_node) {
    return new_node.replace(CKEDITOR.dom.element.get(old_node));
};
atd_core.create = function(node_html) {
    return CKEDITOR.dom.element.createFromHtml('<span class="mceItemHidden">' + node_html + '</span>');
};
- Once these functions are defined you can set AtD-specific preferences like the list of strings to ignore and which types of errors to show.
atd_core.showTypes('Complex Expression, Diacritical Marks, Double Negatives, Redundant Expression');
atd_core.setIgnoreStrings('CKEditor, was thrown');
- When a user requests proofreading, it is up to you to extract the contents of the editor and post it to the After the Deadline service. Here is how I extract the editor contents in CKEditor:
var editor_contents = editor.document.getBody().getHtml();
- You should receive an XML document from the server with an error message or a data structure containing the AtD data. To check the XML document for an AtD error:
function ajax_callback(xml_response) {
    if (atd_core.hasErrorMessage(xml_response)) {
        alert(atd_core.getErrorMessage(xml_response));
        return;
    }
    // the rest of the callback follows in the next steps
- If there are no errors then you’ll want to parse the XML into a data structure the Core UI Module can use. Use the processXML function to do this. It will return a JavaScript object that you will be using again.
var results = atd_core.processXML(xml_response);
- Now, let’s say you want to highlight errors in your editor. Great! Extract the contents of the editor (should be an array of elements from the root element) and pass these to markMyWords. Using the prototypes you provided earlier, this function will walk through these nodes and highlight the errors. Be thankful that you didn’t have to write the code to do this.
var nodes = editor.document.getDocumentElement().getChildren().getItem(1)['$'].childNodes;
atd_core.markMyWords(nodes, results.errors);
- Earlier, I hope you attached a click or context menu listener to your editor. If you did, you can use isMarkedNode on the event target when a click occurs in your editor. If this returns true, this is your clue that a user clicked on a marked error and you should display a menu offering them suggestions. To get the suggestions, call findSuggestion using the marked element. This will return a JavaScript object with the following members that may interest you: suggestions, description, moreinfo
editor.contextMenu.addListener(function(element) {
    if (atd_core.isMarkedNode(element.$)) {
        var meta = atd_core.findSuggestion(element.$);
        var commands = {};
        addItem(editor, meta.description, function() { }, 0, commands, 'AtD_description');
        for (var x = 0; x < meta.suggestions.length; x++)
            addItem(editor, meta.suggestions[x], makeCallback(element.$, meta.suggestions[x]), x + 1, commands, 'AtD_suggestions');
        addItem(editor, 'Ignore', makeIgnoreCallback(element.$, element.$.innerHTML), 1, commands, 'AtD_ignore');
        addItem(editor, 'Ignore All', makeIgnoreAllCallback(element.$, element.$.innerHTML), 0, commands, 'AtD_ignore');
        return commands;
    }
});
- For each suggestion, I use makeCallback to generate a function to attach to the menu item. This generated function calls applySuggestion in the Core UI Module. You should use applySuggestion as it’s smart enough to do what the suggestion asks. For example, the (omit) suggestion removes the word in question:
var makeCallback = function(element, suggestion) {
    return function() {
        atd_core.applySuggestion(element, suggestion);
    };
};
- You may want to add functionality to let the user ignore the current error or to ignore all occurrences of it. The “Ignore suggestion” menu item removes the marked node, keeping its children. No need to call into the AtD Core UI module:
var makeIgnoreCallback = function(element, word) {
    return function() {
        CKEDITOR.dom.element.get(element).remove(true);
    };
};
To ignore all occurrences of an error, you can use the removeWords function in the Core UI Module.
var makeIgnoreAllCallback = function(element, word) {
    return function() {
        atd_core.removeWords(undefined, word);
    };
};
- The last thing you should do is attach a listener to remove the After the Deadline markup when the contents of the editor are grabbed. You can remove the AtD markup using the removeWords function.
And that’s it. Add in the necessary scaffolding for your editor and you have After the Deadline integration. Here are some other things you can do:
- Make an Ignore Always menu option save the user’s preference in a cookie or on your server. Resurrect this setting later with the setIgnoreStrings function.
- Keep track of the user’s preference for what types of errors to show. Use showTypes to enable these for the user.
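As a rough illustration of the first idea, here is a sketch that round-trips a list of ignored phrases through a cookie string. Everything in it is hypothetical (the atd_ignored cookie name and both helper functions are made up for this example). The only piece that comes from the Core UI Module is setIgnoreStrings, which the code above calls with a comma-separated string.

```javascript
// Hypothetical helpers; the cookie name is an assumption for this sketch.
var COOKIE_NAME = 'atd_ignored';

// Build a "name=value" cookie pair from a list of ignored phrases.
function encodeIgnoreCookie(phrases) {
    return COOKIE_NAME + '=' + encodeURIComponent(phrases.join(','));
}

// Recover the list of ignored phrases from a document.cookie-style string.
function decodeIgnoreCookie(cookieString) {
    var pairs = cookieString.split(/;\s*/);
    for (var i = 0; i < pairs.length; i++) {
        var eq = pairs[i].indexOf('=');
        if (eq !== -1 && pairs[i].slice(0, eq) === COOKIE_NAME) {
            var value = decodeURIComponent(pairs[i].slice(eq + 1));
            return value === '' ? [] : value.split(',');
        }
    }
    return []; // no saved preference yet
}

// In a browser you might wire it up like this:
//   document.cookie = encodeIgnoreCookie(ignored) + '; path=/';
//   atd_core.setIgnoreStrings(decodeIgnoreCookie(document.cookie).join(', '));
```

This simple format breaks if an ignored phrase itself contains a comma, so a real implementation would need a different separator or encoding.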
These are the kinds of things we do in WordPress to make AtD a first-class feature for our users. If you’re looking to port After the Deadline to another environment, the AtD Core UI module will save you a lot of time.
Coming up I’ll be releasing versions of the AtD/jQuery and AtD/TinyMCE Extensions using this shared module. The AtD/CKEditor extension is available now.
WordCamp NYC Ignite: After the Deadline
I see that the After the Deadline demonstration for WordCamp NYC has been posted. This short five-minute demonstration covers the plugin and its features.
Before you watch this video, can you find the error in each of these text snippets?
There is a part of me that believes that if I think about these issues, if I put myself through the emotional ringer, I somehow develop an immunity for my own family. Does writing a book about bullying protect your children from being bullied? No. I realize that this kind of thinking is completely ridiculous.’’
[Op-Ed] … Roberts marshaled a crusader’s zeal in his efforts to role back the civil rights gains of the 1960s and ’70s — everything from voting rights to women’s rights.
The success of Hong Kong residents in halting the internal security legislation in 2004, however, had an indirect affect on allowing the vigil here to grow to the huge size it was this year.
These examples come from the After Deadline blog, When Spell-Check Can’t Help. You can watch the video to learn how After the Deadline can help and what the errors are. You can also try these out at http://www.polishmywriting.com.
You can also view the WCNYC session on how to embed After the Deadline into an application.
George Orwell and After the Deadline
Ok, I have to admit something. George Orwell does not use After the Deadline. But, if he were alive now, I bet he would.
In his essay, Politics and the English Language, George Orwell defines the following rules for clear writing:
- Never use a metaphor, simile, or other figure of speech which you are used to seeing in print.
- Never use a long word where a short one will do.
- If it is possible to cut a word out, always cut it out.
- Never use the passive where you can use the active.
- Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.
- Break any of these rules sooner than say anything outright barbarous.
Did you know After the Deadline can help you with these rules? Here is how:
Rule 1: Avoid clichés
You should avoid clichés in your writing. After the Deadline flags over 650 worn out phrases. These phrases lose their power because we’re so used to seeing them.
Rule 2: Use Simple Words
After the Deadline helps you replace complex expressions with simple everyday words. Examples include use instead of utilize, set up over establish, and equal over equivalent.
Rule 3: Avoid Redundant Expressions
A common poor writing habit is using phrases with extra words that add nothing to the meaning. After the Deadline flags these so you can remove them. Examples include destroy over totally destroy, now instead of right now, and written over written down.
Rule 4: Avoid Passive Voice
Like a good copy editor, After the Deadline uses its virtual pen to find passive voice and bring it to your attention. It’s up to you if you want to revise it or not. In most cases you will make your writing much clearer.
Rule 5: Avoid Jargon
This is a hard one as each field has its own jargon. After the Deadline flags some foreign phrases and jargon words. It’s up to you to try to find the right words depending on your audience.
Rule 6: Remember, rules are meant to be broken
Rules are great but they do not cover every situation. To help, After the Deadline uses a statistical language model to filter poor suggestions.
This is a repost from the old-AtD blog. If this topic interests you visit http://www.afterthedeadline.com where you can download After the Deadline for WordPress or learn how to add it to an application.
WordPress Plugin Update
The AtD/WP.org plugin experienced some reworking this week. This release smooths the install process, adds a new feature, and fixes several bugs. Here are the highlights:
Auto-Proofread on Publish and Update
Many of you have told me “I love AtD but I keep forgetting to run it before I post”. Well, never fear. Mohammad Jangda and I have worked together to bring a new toy to you. AtD now has an auto-proofread on publish and update feature. You can enable it from your user profile page.
When enabled, this feature will run AtD against your post (or page) before a publish or update. If any errors are found, you’ll be prompted with a dialog:
It is then up to you. If you want to publish, click OK. Otherwise click Cancel to interact with the errors and make your changes. The next time you hit Publish your post will go through.
Define a Global AtD Key for WPMU Users
If you’re using WPMU and would like a way to set a global AtD key, we’ve got you covered. AtD now looks for an ATD_KEY constant before prompting for a key. If this constant exists, the page that asks for a key goes away. You can also define ATD_SERVER and ATD_PORT if you’re running your own AtD server from our open source distribution. You can set these constants in wp-config.php.
Smoother Installation Process
For most of you, installing After the Deadline is a snap. For some of you, it doesn’t work out. It seems there are two issues that pop up and this update addresses them.
The first snag is that many folks try to use their WordPress.com API key instead of their After the Deadline API key. Fortunately these have different forms and are easy to tell apart. The plugin now detects when you enter something other than an After the Deadline API key and gently lets you know that an After the Deadline API key is different.
The second snag has to do with security. Many system administrators lock down a PHP installation by disabling functions that PHP scripts use to connect to other hosts on the internet. AtD connects to a service to do its job. AtD now detects this security measure and tells the user to contact their system administrator (along with what needs to be fixed).
This should help many of you out. As always, enjoy the update.
As a side note: the AtD/jQuery and AtD/TinyMCE extensions were both updated as well. These are minor fixes but you should get them if you’re using them in your app.
After the Deadline @ Washington, DC PHP Meeting
Last night I had the privilege of presenting After the Deadline to the Washington DC PHP Meeting. Definitely one of the best audiences I’ve experienced. Thanks guys.
In this talk I demoed After the Deadline, talked about some of the NLP and AI technology under the hood, and showed how to embed AtD into an application using jQuery and TinyMCE.
Shaun Farrell was technically savvy enough to record it (I tried but my attempt failed). You can see the video here:
And the slides from the presentation are here:
If you’re looking at all this and thinking: “wow, this After the Deadline stuff is fun. I want to attend an After the Deadline live seminar in my area” then you’ve come to the right place. I’m demoing After the Deadline tonight at the Washington DC Technology Meetup in Ellicott City, MD and next week I’m giving a similar talk at the Baltimore PHP Meetup.
I’m tracking AtD-related events on the Events page of this blog. If you’d like a speaker for your event, I’m glad to take this show on the road in the Mid-Atlantic region. Feel free to contact me at raffi at automattic dot com.
Thoughts on a tiny contextual spell checker
Spell checkers have a bad rap because they give poor suggestions, don’t catch real word errors, and usually have out of date dictionaries. With After the Deadline I’ve made progress on these three problems. The poor suggestions problem is solved by looking at context as AtD’s contextual spell checker does. AtD again uses context to help detect real word errors. It’s not flawless but it’s not bad either. AtD has also made progress on the dictionary front by querying multiple data sources (e.g., Wikipedia) to find missing words.
Problem Statement
So despite this greatness, contextual spell checking isn’t very common. I believe this is because contextual spell checking requires a language model. Language models keep track of every sequence of two words seen in a large corpus of text. From this data the spell checker can calculate P(currentWord|previousWord) and P(currentWord|nextWord). For a client side application, this information amounts to a lot of memory or disk space.
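To make the memory cost concrete, here is a toy sketch of such a model. It counts every adjacent word pair and derives the two conditional probabilities from those counts. The function names are made up for this example and this is not AtD's actual implementation; it also ignores smoothing and edge effects at the ends of the corpus.

```javascript
// Count single words and adjacent word pairs in a tokenized corpus.
function buildBigramModel(tokens) {
    // Object.create(null) avoids collisions with names like "constructor".
    var unigrams = Object.create(null);
    var bigrams = Object.create(null);
    for (var i = 0; i < tokens.length; i++) {
        unigrams[tokens[i]] = (unigrams[tokens[i]] || 0) + 1;
        if (i > 0) {
            var key = tokens[i - 1] + ' ' + tokens[i];
            bigrams[key] = (bigrams[key] || 0) + 1;
        }
    }
    return { unigrams: unigrams, bigrams: bigrams };
}

// P(current | previous) ~ count("previous current") / count(previous)
function probGivenPrevious(model, current, previous) {
    var pair = model.bigrams[previous + ' ' + current] || 0;
    var total = model.unigrams[previous] || 0;
    return total === 0 ? 0 : pair / total;
}

// P(current | next) ~ count("current next") / count(next)
function probGivenNext(model, current, next) {
    var pair = model.bigrams[current + ' ' + next] || 0;
    var total = model.unigrams[next] || 0;
    return total === 0 ? 0 : pair / total;
}

var model = buildBigramModel('the cat sat on the mat'.split(' '));
console.log(probGivenPrevious(model, 'cat', 'the')); // "the" occurs twice, "the cat" once
```

The bigrams table is what grows: in the worst case it holds one entry per pair of dictionary words.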
Is it possible to deliver the benefits of a contextual spell checker in a smaller package?
Why would someone want to do this? If this could be done, then it’d be possible to embed the tiny contextual spell checker into programs like Firefox, OpenOffice, and others. Spell check as you type would be easy and responsive as the client could download the library and execute everything client side.
Proposed Solution
I believe it’s possible to reduce the accuracy of the language model without greatly impacting its benefits. Context makes a difference when spell checking (because it’s extra information), but I think the mere idea that “this word occurs in this context a lot more than this other one” is enough information to help the spell checker. Usually the spell checker is making a choice between 3-6 words anyways.
One way to store low fidelity language model information is to associate each word with some number of bloom filters. Each bloom filter would represent a band of probabilities. For example a word could have three bloom filters associated with it to keep track of words occurring in the top-25%, middle-50%, and bottom-25%. This means the data size for the spell checker will be N*|dictionary| but this is better than having a language model that trends towards a size of |dictionary|^2.
A bloom filter is a data structure for tracking whether something belongs to a set or not. They’re very small and the trade-off is they may give false positives but they won’t give false negatives. It’s also easy to calculate the false positive rate in a bloom filter given the number of set entries expected, the bloom filter size, and the number of hash functions used. To optimize for space, the size of the bloom filter for each band and word could be determined from the language model.
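Here is a minimal sketch of the idea: a small Bloom filter, the standard false positive estimate, and three filters per word standing in for the probability bands. All names, the hash function, and the sizes are assumptions made for illustration; nothing here comes from AtD itself.

```javascript
// Seeded FNV-1a style hash producing an unsigned 32-bit integer.
function hash32(str, seed) {
    var h = (2166136261 ^ seed) >>> 0;
    for (var i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 16777619) >>> 0;
    }
    return h;
}

// A Bloom filter with m bits and k hash functions.
function BloomFilter(m, k) {
    this.m = m;
    this.k = k;
    this.bits = new Uint8Array(Math.ceil(m / 8));
}

// Derive k bit positions from two hashes (double hashing).
BloomFilter.prototype.positions = function(item) {
    var h1 = hash32(item, 0);
    var h2 = (hash32(item, 0x9e3779b9) | 1) >>> 0; // odd step size
    var out = [];
    for (var i = 0; i < this.k; i++) {
        out.push((h1 + i * h2) % this.m);
    }
    return out;
};

BloomFilter.prototype.add = function(item) {
    var bits = this.bits;
    this.positions(item).forEach(function(p) {
        bits[p >> 3] |= 1 << (p & 7);
    });
};

// May return true for items never added, never false for items that were.
BloomFilter.prototype.mightContain = function(item) {
    var bits = this.bits;
    return this.positions(item).every(function(p) {
        return (bits[p >> 3] & (1 << (p & 7))) !== 0;
    });
};

// Expected false positive rate for n entries, m bits, k hash functions.
function falsePositiveRate(n, m, k) {
    return Math.pow(1 - Math.exp(-k * n / m), k);
}

// One filter per probability band for a single dictionary word.
function makeBands(m, k) {
    return {
        top: new BloomFilter(m, k),
        middle: new BloomFilter(m, k),
        bottom: new BloomFilter(m, k)
    };
}

// Report the first band that claims to contain the neighboring word.
function bandOf(bands, neighbor) {
    var names = ['top', 'middle', 'bottom'];
    for (var i = 0; i < names.length; i++) {
        if (bands[names[i]].mightContain(neighbor)) return names[i];
    }
    return null;
}

var bands = makeBands(1024, 3);
bands.top.add('cat');
console.log(bandOf(bands, 'cat'), falsePositiveRate(50, 1024, 3));
```

A spell checker choosing among a few candidates would only need to compare the bands each candidate reports for the surrounding words. Checking the top band first means a false positive in a lower band cannot hide a true top-band hit.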
If this technique works for spelling, could it also work for misused word detection? Imagine tracking trigrams (sequences of three words) for each potentially misused word using a bloom filter.
After looking further into this, it looks like others have attacked the problem of using bloom filters to represent a language model. This makes the approach even more interesting now.
Generating a Plain Text Corpus from Wikipedia
AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of Extracting Text from Wikipedia by Evan Jones.
Evan’s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I’ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:
- finish before I’m old enough to collect social security
- tolerate errors and run to completion without my intervention
Today, we’re going to do the French Wikipedia. I’m working on multi-lingual AtD and French seems like a fun language to go with. Our systems guy, Stephane, speaks French. That’s as good a reason as any.
Step 1: Download the Wikipedia Extractors Toolkit
Evan made available a bunch of code for extracting plaintext from Wikipedia. To meet the two goals above I made some modifications*. So the first thing you’ll want to do is download this toolkit and extract it somewhere:
wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text
(* see the CHANGES file to learn what modifications were made)
Step 2: Download and Extract the Wikipedia Data Dump
You can do this from http://download.wikimedia.org/. The archive you’ll want for any language is *-pages-articles.xml.bz2. Here is what I did:
wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2
Step 3: Extract Article Data from the Wikipedia Data
Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.
Create a directory for your output and run xmldump2files.py against the .XML file you obtained in the last step:
mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out
This step will take a few hours depending on your hardware.
Step 4: Parse the Article Wiki Markup into XML
The next step is to take the extracted articles and parse the Wikimedia markup into an XML form that we can later recover the plain text from. There is a shell script to generate XML files for all the files in our out directory. If you have a multi-core machine, I don’t recommend running it. I prefer using a shell script for each core that executes the Wikimedia to XML command on part of the file set (aka poor man’s concurrent programming).
To generate these shell scripts, first build a list of the article files:
find out -type f | grep '\.txt$' >fr.files
Then split fr.files into several .sh files:
java -jar sleep.jar into8.sl fr.files
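If you are curious what a splitter like this has to do, here is a rough sketch. It is a guess at the general shape, not a port of into8.sl: the splitIntoScripts and shellQuote names are made up, and CONVERT_COMMAND is a placeholder for the toolkit's real wiki-markup-to-XML command.

```javascript
// Placeholder only; substitute the toolkit's actual conversion command.
var CONVERT_COMMAND = './convert-placeholder.sh';

// Quote a file name for sh. French titles often contain apostrophes.
function shellQuote(name) {
    return "'" + name.replace(/'/g, "'\\''") + "'";
}

// Deal the files out round-robin into `parts` shell scripts and
// return the text of each script.
function splitIntoScripts(files, parts) {
    var scripts = [];
    for (var p = 0; p < parts; p++) {
        scripts.push(['#!/bin/sh']);
    }
    files.forEach(function(file, i) {
        scripts[i % parts].push(CONVERT_COMMAND + ' ' + shellQuote(file));
    });
    return scripts.map(function(lines) {
        return lines.join('\n') + '\n';
    });
}

var scripts = splitIntoScripts(['out/a.txt', 'out/b.txt', 'out/c.txt'], 2);
console.log(scripts[0]);
```

Round-robin assignment keeps the scripts close to the same length, so the cores finish at roughly the same time.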
You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.
cat >launch.sh
./files0.sh &
./files1.sh &
./files2.sh &
...
./files15.sh &
^D
Next, launch these shell scripts.
./launch.sh
Unfortunately this journey is filled with peril. The command run by these scripts for each file carries the following comment: “Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an ‘infinite’ loop sometimes.” This statement is true. The PHP processes will freeze or crash. My first time through this process I had to manually watch top and kill errant processes. This is time-consuming and makes the process take longer than it should. To help, I’ve written a script that kills any php process that has run for more than two minutes. To launch it:
java -jar sleep.jar watchthem.sl
Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.
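The watchdog idea is simple enough to sketch. This is not watchthem.sl, just an illustration of the selection logic. It assumes the input is the output of ps -eo pid,etimes,comm (the etimes column, elapsed seconds, is a procps feature and may not exist on every system) and it treats "has run for" as elapsed wall-clock time. The findStalePhp name is made up.

```javascript
// Given `ps -eo pid,etimes,comm` output, return the PIDs of php
// processes that have been alive longer than maxSeconds.
function findStalePhp(psOutput, maxSeconds) {
    var stale = [];
    psOutput.split('\n').forEach(function(line) {
        // Columns: pid, elapsed seconds, command name. The header
        // line has no leading digits, so it never matches.
        var m = /^\s*(\d+)\s+(\d+)\s+(\S+)/.exec(line);
        if (m && m[3] === 'php' && Number(m[2]) > maxSeconds) {
            stale.push(Number(m[1]));
        }
    });
    return stale;
}

var sample = [
    '  PID ELAPSED COMMAND',
    ' 4101      35 php',
    ' 4102     240 php',
    ' 4103     900 bash'
].join('\n');
console.log(findStalePhp(sample, 120));

// Under Node.js, a loop around this might call
// require('child_process').execSync('ps -eo pid,etimes,comm', { encoding: 'utf8' })
// every few seconds and pass each returned PID to process.kill(pid, 'SIGKILL').
```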
Step 5: Extract Plain Text from the Articles
Next we want to extract the article plaintext from the XML files. To do this:
./wikiextract.py out french_plaintext.txt
This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.
Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing
If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I’ve included a script to do this:
mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus
And that’s it. You now have a language corpus extracted from Wikipedia.






