Generating a Plain Text Corpus from Wikipedia
AtD *thrives* on data and one of the best places for a variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. This process is a modification of Extracting Text from Wikipedia by Evan Jones.
Evan’s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I’ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:
- finish before I’m old enough to collect social security
- tolerate errors and run to completion without my intervention
Today, we’re going to do the French Wikipedia. I’m working on multi-lingual AtD and French seems like a fun language to go with. Our systems guy, Stephane, speaks French. That’s as good a reason as any.
Step 1: Download the Wikipedia Extractors Toolkit
Evan made available a bunch of code for extracting plaintext from Wikipedia. To meet the two goals above I made some modifications*. So the first thing you’ll want to do is download this toolkit and extract it somewhere:
wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text
(* see the CHANGES file to learn what modifications were made)
Step 2: Download and Extract the Wikipedia Data Dump
You can do this from http://download.wikimedia.org/. The archive you’ll want for any language is *-pages-articles.xml.bz2. Here is what I did:
wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2
Step 3: Extract Article Data from the Wikipedia Data
Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.
Create a directory for your output and run xmldump2files.py against the .xml file you obtained in the last step:
mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out
This step will take a few hours depending on your hardware.
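If you are curious what this kind of extraction involves, here is a minimal Python sketch. It is not the code in xmldump2files.py; it only illustrates streaming through a MediaWiki XML export one page at a time instead of loading the whole file. The function names and the sample document (including its namespace URI) are my own, made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_stream):
    """Yield (title, wikitext) for each <page> element in the stream."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the finished page's children to keep memory low


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Paris</title>
    <revision><text>'''Paris''' est la capitale de la [[France]].</text></revision>
  </page>
  <page>
    <title>Lyon</title>
    <revision><text>'''Lyon''' est une commune.</text></revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

For a real dump you would pass an open file object instead of the in-memory sample.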
Step 4: Parse the Article Wiki Markup into XML
The next step is to take the extracted articles and parse the MediaWiki markup into an XML form that we can later recover the plain text from.
There is a shell script to generate XML files for all the files in our out directory. If you have a multi-core machine, I don’t recommend running it. I prefer one shell script per core, each running the markup-to-XML command on its own part of the file set (aka poor man’s concurrent programming).
To generate these shell scripts, first build a list of the extracted article files:
find out -type f | grep '\.txt$' >fr.files
Then split fr.files into several .sh files:
java -jar sleep.jar into8.sl fr.files
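The idea behind the split is simple enough to sketch in Python. This is not a port of into8.sl; it is a rough illustration that deals the file list out round-robin and writes one command line per file. The command name ./convert_one.sh is a placeholder I invented, since the real per-file command lives in the toolkit.

```python
import shlex


def split_round_robin(paths, parts):
    """Deal paths into `parts` lists, like dealing cards."""
    buckets = [[] for _ in range(parts)]
    for index, path in enumerate(paths):
        buckets[index % parts].append(path)
    return buckets


def render_script(paths, command="./convert_one.sh"):
    """Build the text of one worker script.

    `command` is a hypothetical placeholder, not the toolkit's real command.
    """
    lines = ["#!/bin/sh"]
    # Quote each path so names with spaces or quotes survive the shell.
    lines += [f"{command} {shlex.quote(path)}" for path in paths]
    return "\n".join(lines) + "\n"
```

Dealing the files out round-robin, rather than cutting the list into contiguous blocks, tends to spread large and small articles more evenly across the workers.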
You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.
cat >launch.sh
./files0.sh &
./files1.sh &
./files2.sh &
...
./files15.sh &
^D
Next, launch these shell scripts.
Unfortunately this journey is filled with peril. The command these scripts run for each file carries the following comment: “Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an ‘infinite’ loop sometimes.” This statement is true. The PHP processes will freeze or crash. My first time through this process I had to manually watch top and kill errant processes, which is tedious and makes the step take longer than it should. To help, I’ve written a script that kills any php process that has run for more than two minutes. To launch it:
java -jar sleep.jar watchthem.sl
Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.
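For the curious, the watchdog idea can be sketched in Python. This is a stand-in for illustration, not what watchthem.sl actually does. It assumes a ps that accepts -o pid= -o etime= -o comm= and prints elapsed time as [[dd-]hh:]mm:ss; the function names and the 120-second default are mine, chosen to match the two-minute rule above.

```python
import os
import signal
import subprocess


def etime_to_seconds(etime):
    """Convert ps's elapsed-time format [[dd-]hh:]mm:ss to seconds."""
    days, _, clock = etime.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def stale_pids(ps_output, name="php", limit=120):
    """Return PIDs of processes named `name` that have run longer than `limit` seconds."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines and any header row
        pid, etime, comm = fields[0], fields[1], fields[2]
        # Some systems print the full path in the comm column.
        if os.path.basename(comm) == name and etime_to_seconds(etime) > limit:
            pids.append(int(pid))
    return pids


def kill_stale(name="php", limit=120):
    """One sweep: list processes with ps and SIGKILL the stale ones."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in stale_pids(out, name, limit):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
```

To use something like this you would call kill_stale() in a loop with a short sleep between sweeps for as long as the conversion scripts are running.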
Step 5: Extract Plain Text from the Articles
Next we want to extract the article plaintext from the XML files. To do this:
./wikiextract.py out french_plaintext.txt
This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.
Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing
If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I’ve included a script to do this:
mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus
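If you would rather not use the Sleep script, splitting a big text file is easy to sketch in Python. This is only an illustration and makes no claim about how makecorpus.sl picks its piece sizes or file names; the part0000.txt naming and the lines_per_file default are assumptions of mine.

```python
import os


def split_corpus(source_path, out_dir, lines_per_file=100000):
    """Copy source_path into numbered files of at most lines_per_file lines each."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(source_path, encoding="utf-8") as source:
        for number, line in enumerate(source):
            if number % lines_per_file == 0:
                # Time to start a new piece: close the old one first.
                if out:
                    out.close()
                path = os.path.join(out_dir, f"part{len(written):04d}.txt")
                out = open(path, "w", encoding="utf-8")
                written.append(path)
            out.write(line)
    if out:
        out.close()
    return written  # paths of the pieces, in order
```

Calling split_corpus("french_plaintext.txt", "corpus") would fill the corpus directory with numbered pieces. Reading line by line keeps memory use small no matter how large the plain text file is.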
And that’s it. You now have a language corpus extracted from Wikipedia.