After the Deadline

Generating a Plain Text Corpus from Wikipedia

Posted in Multi-Lingual AtD, NLP Research by rsmudge on December 4, 2009

AtD *thrives* on data, and one of the best sources for a wide variety of data is Wikipedia. This post describes how to generate a plain text corpus from a complete Wikipedia dump. The process is a modification of Extracting Text from Wikipedia by Evan Jones.

Evan’s post shows how to extract the top articles from the English Wikipedia and make a plain text file. Here I’ll show how to extract all articles from a Wikipedia dump with two helpful constraints. Each step should:

  • finish before I’m old enough to collect social security
  • tolerate errors and run to completion without my intervention

Today, we’re going to do the French Wikipedia. I’m working on multi-lingual AtD, and French seems like a fun language to go with. Our systems guy, Stephane, speaks French. That’s as good a reason as any.

Step 1: Download the Wikipedia Extractors Toolkit

Evan made available a bunch of code for extracting plaintext from Wikipedia. To meet the two goals above I made some modifications*. So the first thing you’ll want to do is download this toolkit and extract it somewhere:

wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
tar zxvf wikipedia2text_rsm_mods.tgz
cd wikipedia2text

(* see the CHANGES file to learn what modifications were made)

Step 2: Download and Extract the Wikipedia Data Dump

You can do this from http://download.wikimedia.org/. The archive you’ll want for any language is *-pages-articles.xml.bz2. Here is what I did:

wget http://download.wikimedia.org/frwiki/20091129/frwiki-20091129-pages-articles.xml.bz2
bunzip2 frwiki-20091129-pages-articles.xml.bz2

Step 3: Extract Article Data from the Wikipedia Data

Now you have a big XML file full of all the Wikipedia articles. Congratulations. The next step is to extract the articles and strip all the other stuff.

Create a directory for your output and run xmldump2files.py against the .xml file you obtained in the last step:

mkdir out
./xmldump2files.py frwiki-20091129-pages-articles.xml out

This step will take a few hours depending on your hardware.
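
If you’re curious what xmldump2files.py is doing: it streams the dump through Python’s SAX parser and writes each article to its own small file, spread across two levels of subdirectories so no single directory ends up holding millions of files. Here is a rough Python sketch of that approach; it is not the toolkit’s code, and naming the output files after an MD5 digest is my own simplification (the real script names files after the article title):

import hashlib, os, sys, xml.sax

# Sketch only: split a MediaWiki XML dump into one small file per article,
# spread across two levels of hash-based subdirectories.
class PageSplitter(xml.sax.ContentHandler):
    def __init__(self, root):
        xml.sax.ContentHandler.__init__(self)
        self.root, self.tag, self.title, self.text = root, None, "", ""

    def startElement(self, name, attrs):
        self.tag = name
        if name == "page":
            self.title, self.text = "", ""

    def characters(self, content):
        if self.tag == "title":
            self.title += content
        elif self.tag == "text":
            self.text += content

    def endElement(self, name):
        self.tag = None
        if name == "page" and self.title:
            digest = hashlib.md5(self.title.encode("utf-8")).hexdigest()
            subdir = os.path.join(self.root, digest[0:2], digest[2:4])
            os.makedirs(subdir, exist_ok=True)
            path = os.path.join(subdir, digest + ".txt")
            with open(path, "w", encoding="utf-8") as out:
                out.write(self.title + "\n\n" + self.text)

if __name__ == "__main__":
    # usage: ./splitdump.py frwiki-20091129-pages-articles.xml out
    xml.sax.parse(sys.argv[1], PageSplitter(sys.argv[2]))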

Step 4: Parse the Article Wiki Markup into XML

The next step is to take the extracted articles and parse the Wikimedia markup into an XML form that we can later recover the plain text from. There is a shell script that generates XML files for all the files in our out directory, but if you have a multi-core machine, I don’t recommend running it. I prefer using a shell script for each core, with each one executing the Wikimedia-to-XML command on part of the file set (aka poor man’s concurrent programming).

To generate these shell scripts:

find out -type f | grep '\.txt$' >fr.files

Then split fr.files into several .sh files:

java -jar sleep.jar into8.sl fr.files
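
into8.sl splits the file names in fr.files across a set of filesN.sh scripts (files0.sh through files15.sh in the launch.sh example below), each of which runs the wiki-to-XML converter on its slice of the files. If you’d rather not use sleep.jar, a few lines of Python can do the same splitting; note that CONVERT_CMD below is only a placeholder, not the toolkit’s real converter invocation, so check the generated scripts for the actual command:

import sys

# Placeholder only: substitute the wiki-to-XML converter command the toolkit uses.
CONVERT_CMD = "php wiki2xml.php"

def make_scripts(listfile, cores=16):
    # Deal the file names round-robin into files0.sh .. files15.sh.
    scripts = [open("files%d.sh" % n, "w") for n in range(cores)]
    with open(listfile) as names:
        for i, name in enumerate(names):
            name = name.strip()
            if name:
                scripts[i % cores].write("%s %s\n" % (CONVERT_CMD, name))
    for script in scripts:
        script.close()

if __name__ == "__main__":
    # usage: ./makescripts.py fr.files
    make_scripts(sys.argv[1])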

You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl.

cat >launch.sh
./files0.sh &
./files1.sh &
./files2.sh &
...
./files15.sh &
^D

Next, make the scripts executable and launch them.

chmod +x launch.sh files*.sh
./launch.sh

Unfortunately, this journey is filled with peril. The command run by these scripts for each file carries the following comment: “Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an ‘infinite’ loop sometimes.” This statement is true. The PHP processes will freeze or crash. My first time through this process, I had to watch top and kill errant processes by hand, which made everything take longer than it should. To help, I’ve written a script that kills any PHP process that has run for more than two minutes. To launch it:

java -jar sleep.jar watchthem.sl

Just let this program run and it will do its job. Expect this step to take twelve or more hours depending on your hardware.
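
For the curious, watchthem.sl is a Sleep script, but the idea is simple enough to sketch in Python: poll ps every so often and kill any php process whose elapsed time is over two minutes. This is only a rough stand-in for the bundled script, and it assumes a ps that understands the pid,etime,comm output format:

import os, signal, subprocess, time

def elapsed_seconds(etime):
    # ps prints ETIME as [[dd-]hh:]mm:ss
    days = 0
    if "-" in etime:
        d, etime = etime.split("-", 1)
        days = int(d)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

while True:
    # List every process with its elapsed time and command name.
    output = subprocess.check_output(["ps", "-eo", "pid,etime,comm"]).decode()
    for line in output.splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etime, comm = fields
        if "php" in comm and elapsed_seconds(etime) > 120:
            try:
                os.kill(int(pid), signal.SIGKILL)
            except OSError:
                pass  # the process already exited
    time.sleep(30)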

Step 5: Extract Plain Text from the Articles

Next we want to extract the article plaintext from the XML files. To do this:

./wikiextract.py out french_plaintext.txt

This command will create a file called french_plaintext.txt with the entire plain text content of the French Wikipedia. Expect this command to take a few hours depending on your hardware.

Step 6 (OPTIONAL): Split Plain Text into Multiple Files for Easier Processing

If you plan to use this data in AtD, you may want to split it up into several files so AtD can parse through it in pieces. I’ve included a script to do this:

mkdir corpus
java -jar sleep.jar makecorpus.sl french_plaintext.txt corpus
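
makecorpus.sl handles the splitting for you. If you just want the corpus in pieces and don’t care about sleep.jar, something like the following Python sketch works too; the 100,000 lines per piece is an arbitrary number I picked for the example:

import os, sys

def split_corpus(source, destdir, lines_per_piece=100000):
    # Write the big plain text file out as corpus000.txt, corpus001.txt, ...
    os.makedirs(destdir, exist_ok=True)
    piece, count, out = 0, 0, None
    with open(source, encoding="utf-8") as src:
        for line in src:
            if out is None or count >= lines_per_piece:
                if out is not None:
                    out.close()
                path = os.path.join(destdir, "corpus%03d.txt" % piece)
                out = open(path, "w", encoding="utf-8")
                piece, count = piece + 1, 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()

if __name__ == "__main__":
    # usage: ./splitcorpus.py french_plaintext.txt corpus
    split_corpus(sys.argv[1], sys.argv[2])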

And that’s it. You now have a language corpus extracted from Wikipedia.

13 Responses


  1. David R. MacIver said, on December 7, 2009 at 10:57 am

    Thanks, I ended up using the first part of this to build up a sentence corpus from the english wikipedia (http://www.drmaciver.com/2009/12/i-want-one-meelyun-sentences/).

    I ended up doing something different for stages 4+. One thing you might find useful is xargs for parallelism: if you write small scripts that take the file names to process as input arguments, then you can run as many of them in parallel as you want. For example, I ran my scripts as

    find out/ -type f | xargs -P4 -L2 ./extract.rb

    This feeds at most two files to each Ruby script (-L2) but runs four of them in parallel at once (-P4). It worked well and is easy to tweak.

    (extract.rb was essentially a poor man’s markdown text extractor. It didn’t do a perfect job, but it ran relatively fast and had the advantage of not falling into infinite loops or seg faulting)

    • rsmudge said, on December 8, 2009 at 1:38 pm

      Thanks for sharing this. Next time I’m looking to simplify this process I’ll take a look at xargs.

  2. […] to generate one yourself. A great source of language data is Wikipedia. I’ve written about extracting plain-text from Wikipedia here before. Today, I needed to generate a profile for Indonesian. The first step is to create a […]

  3. […] with most natural language processing tasks is getting data and collapsing it into a usable model. Prepping a large data set is hard enough. Once you’ve prepped it, you have to put it into a language model. My old NLP […]

  4. Kevin said, on March 11, 2010 at 9:37 am

    Wow.

    I just used this short shell script. The only thing you might want to remove afterwards is the “Category: foo” lines, but they’re each on their own line and thus vgreppable.

    It would be nice if there were an actual, robust wikipedia parser though.

    • rsmudge said, on March 11, 2010 at 1:36 pm

      That script looks pretty cool. I’ll have to play with it later. The original article I cited was really an eye opener for me to get at the WP data. I’ve since created LMs out of 10 WP collections and am now working on generating one for the big English WP.

  5. Dominique said, on March 23, 2010 at 8:10 pm

    Kevin’s script works fine with one small update to accept charsets other than iso-8859-1.

    Thanks to both of you.

    • Aengus said, on March 25, 2010 at 1:41 am

      Could you tell me what this “one small update” is?

  6. yhj said, on July 28, 2010 at 7:34 am

    Thanks for your post; it is very useful for me. There is just a small issue in the last step: I think you have missed sleep.jar after “-jar” :)

    • rsmudge said, on July 28, 2010 at 2:29 pm

      And so I did. Thanks for pointing this out. I’ve corrected the post.

  7. Aly said, on August 9, 2010 at 2:41 am

    I failed with your script on step 3.
    Any ideas what’s wrong? Thanks!

    Traceback (most recent call last):
    File "./xmldump2files.py", line 93, in <module>
    xml.sax.parse(sys.argv[1], WikiPageSplitter(sys.argv[2]))
    File "/tmp/python.6884/usr/lib/python2.5/xml/sax/__init__.py", line 33, in parse
    File "/tmp/python.6884/usr/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    File "/tmp/python.6884/usr/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    File "/tmp/python.6884/usr/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    File "/tmp/python.6884/usr/lib/python2.5/xml/sax/expatreader.py", line 304, in end_element
    File "./xmldump2files.py", line 79, in endElement
    writeArticle(self.root, self.title, self.text)
    File "./xmldump2files.py", line 41, in writeArticle
    out = open(filename, "w")
    IOError: [Errno 2] No such file or directory: '../enwiki-20100130-out/90/f9/Con.txt'

    • rsmudge said, on August 10, 2010 at 2:38 am

      File not found eh… make sure you have enough disk space and enough free inodes on that disk to create all the files you will need. This script creates a lot of them. It’s been a long time since I’ve run this process, but I was able to pull it off for 10 wikipedias (including en). I have faith you’ll get it too.

      • Aly said, on August 10, 2010 at 9:54 pm

        Hmm, it worked fine on a different version of wikipedia (the latest one: 20100728). Not sure what happened. Thanks for your quick reply anyway.

