Progress on the Multi-Lingual Front

December 3, 2009

I’m making progress on multi-lingual AtD. I’ve integrated LanguageTool, a language-checking tool with support for 18 languages, into AtD. Creating grammar rules is a human-intensive process, so I’d prefer to build on an established project with a successful community process.

I’m also working on creating corpus data from Wikipedia. I have a pipeline of four steps. The longest step for each language takes 12+ hours to run and ties up my entire development server, so I’m limited to generating data for one language each night.

With this corpus data I have the ability to provide contextual spell checking for that language and crude statistical filtering for the LanguageTool results (assuming LT supports that language).
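
To make the statistical filtering idea concrete, here is a minimal Python sketch. It is not AtD’s actual code: it assumes a simple bigram model with add-one smoothing built from whitespace-tokenized sentences, and the names bigram_counts, context_score, and keep_suggestion are made up for illustration. The idea is to keep a suggested replacement only when the corpus says it fits the surrounding words better than the original word does.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Pad with sentence boundary markers so edge words have context.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def context_score(word, prev_word, next_word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev_word) * P(next_word | word)."""
    vocab = len(unigrams)
    left = (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab)
    right = (bigrams[(word, next_word)] + 1) / (unigrams[word] + vocab)
    return left * right


def keep_suggestion(original, suggestion, prev_word, next_word, unigrams, bigrams):
    """Keep a suggestion only if it fits the context better than the original."""
    return (
        context_score(suggestion, prev_word, next_word, unigrams, bigrams)
        > context_score(original, prev_word, next_word, unigrams, bigrams)
    )


# Toy corpus standing in for the Wikipedia-derived data (an assumption).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat sat there",
]
unigrams, bigrams = bigram_counts(corpus)
accepted = keep_suggestion("sit", "sat", "cat", "on", unigrams, bigrams)
```

The same score could serve both purposes described above: ranking spelling candidates by context, and dropping grammar checker results whose suggestion is no more likely than what the writer typed. A real system would need proper tokenization per language and far more data than this toy corpus.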

Here are some stats to motivate this:

66% of the blogs are in English. This limits the utility of AtD to 66% of our userbase. By supporting the next six languages with AtD, we can provide proofreading tools to nearly 90% of the community. That’s pretty exciting.

Right now this work is in the proof-of-concept stage. I expect to have a French AtD (spell checking + LanguageTool grammar checking) soon. I’ll have some folks try it and tell me about their experience. If you want to volunteer to try this out, contact me.

