Setting up Amazon AWS instance to crunch Wikipedia XML dump files

In this post, I’ll describe the steps to set up an Amazon AWS instance and get access to the Wikipedia XML dump files. There is a way to download wiki dumps for any project / language; the data is from early 2009. I will detail the steps as a note for future reference. The data is made available as part of the Amazon AWS Public Datasets program.

  1. Create an AWS account
  2. Log in to AWS and select the EC2 tab
  3. Click ‘Launch Instance’
  4. Select a Configuration, the Basic 64-bit Amazon Linux AMI 1.0 is fine
  5. Instance Type: select Micro (this is the cheapest) and press continue
  6. Instance Defaults: keep the defaults and press continue
  7. Enter a key/value pair, such as key=Name, value=WikiMirror (this is
    not strictly required), and press continue
  8. Create a new key pair, give it a name, and press Create &
    Download your Key Pair (this is your private EC2 key; store it
    somewhere safe).
  9. Create a security group; the default settings are fine. Press
    continue.
  10. Review your settings and press launch

This will start an Amazon EC2 instance. The next step is to make the
Wikimedia public dataset accessible to our instance.

  1. Click EBS Volumes on the upper right of the window
  2. Select Create Volume from the different tabs
  3. Snapshot pulldown: scroll down and search for Wikipedia XML Backups
    (XML) and enter 500 GB in the size input field. Make sure that the zone
    matches the zone of your primary volume and press create
  4. Click Attach Volume and enter /dev/sdf or similar

We have now made the dataset available to our EC2 instance. Let’s get the data. (I’ll assume a Windows environment; I do not think this step is necessary for Linux / OS X users.)

  1. We have a .PEM certificate, but if we would like to use PuTTY on
    Windows then we need to convert the .PEM certificate into a key file
    that PuTTY can use.
  2. Download PuTTYgen (available from the PuTTY website).
  3. Start PuTTYgen and select Load Private Key. Select the key that we
    downloaded at step 8 of creating an EC2 instance.
  4. Press Save private key and save the new .ppk key
  5. Close PuTTYgen and start PuTTY (available from the same website).
  6. Create a new session. The EC2 server name can be found by going to the EC2 dashboard on the left and then selecting Running Instances. At the bottom of the page you will find Public DNS followed by a long name; copy this name and enter it in the PuTTY session.
  7. In PuTTY, click on SSH on the left and then Auth. There is a field
    for the private key used for authentication and a browse button. Press
    the browse button and select the .ppk key from step 4.
  8. Start the session and enter ec2-user as the username; the key will
    authorize you. We are now logged on to our EC2 instance.
  9. Enter on the command line: sudo mkdir /mnt/data-store
  10. Enter on the command line: sudo mount /dev/sdf /mnt/data-store
    (the sdf depends on what you entered at step 4 of attaching the
    volume)
  11. cd /mnt/data-store, enter ls -al, and you will see all the files

The next step is either to copy a file using SCP or to start your own FTP server on the EC2 instance and download the files that you need. You are ready to start crunching Wikipedia’s XML dump files!

Posted in wikipedia | Tagged | Leave a comment

Reminiscences of a young gamer and the comeback of the adventure genre in gaming

Back in 1984…

I was six years old when my parents bought their first PC; it was 1984. They bought an Olivetti with two 5¼″ disk drives, a CGA video adapter, and a monitor with a phosphor-green screen. As soon as I heard them discussing buying this computer, I knew this was BIG. I had never seen a computer in my life before, but I knew, I just knew, that it was very important to have a computer. At that time it was a huge investment: I think they spent about ƒ4,000 guilders, which 28 years later would be a staggering $22,277.59 (calculated using an online inflation calculator).

The first things I did with this computer were playing computer games (Bouncing Babies, California Games) and writing small programs in GW-BASIC. Well, writing programs is a slight exaggeration: I would type in the source code from magazines that published listings verbatim. Usually these listings contained errors and I had to ‘debug’ the program. At age six, living in the Netherlands and not speaking English, this was mission impossible.

Around age 10, we moved to a slightly larger town and I became friends with a guy (D) one year older than I was, whose parents had an original IBM PC (with a 20 MB hard drive!). He had a game I didn’t have yet: Leisure Suit Larry in the Land of the Lounge Lizards, aka Larry I.

I copied the game and together we started playing it. At age 10, I had a rudimentary English vocabulary, and playing Larry was extremely hard; often we were stuck, just not knowing what we were supposed to do. Fortunately, Larry was often killed and that gave us clues on what to do. One of the first places we got stuck was at the very beginning: we were in a store and we needed to buy condoms (for many years I thought the English word for condom was ‘lubber’, not realizing that this was slang) and Larry needed to practise safe sex (else he would die). Progress was painstakingly slow; we could be stuck for weeks.

Meanwhile, I had met another guy (M) in the neighbourhood who was four years older and who had a lot of games, and so I copied them: Police Quest I, Space Quest I, Kings Quest I, Police Quest II, Space Quest II, Fruit of the Loom, Monkey Island I, Civilization I. D and I would play, my English vocabulary expanded slowly, and sometimes I would use the dictionary to look up words. Progress increased when our library started carrying complete walkthroughs; it felt like cheating, but finishing the game was the Holy Grail.

Early nineties…

By 1992, our PC was about eight years old and outdated: it didn’t have a hard drive, it didn’t have a colour monitor, and it was slow. Most of my friends had a 286 by this time and I still had an 8086. So I started putting pressure on my parents to buy a new PC, without success. I increasingly played games at my friends’ houses, and by this time games circulated freely among my friends and me. There was a lively culture of sharing because it allowed us to play the same games at the same time and discuss the problems, obstacles, and potential solutions at school.

My disposable income was ƒ0,-;€0,-;$0,-. And this was true for my friends as well.

Fast forward to 2012

The reason I wrote this blog post is that right now there seems to be a renaissance of the adventure game genre happening on Kickstarter. Quite a few Kickstarter projects are centered on developing an adventure game.

I have such fond memories of playing these kinds of games that I have supported all three of them. It’s time to pay back.

Posted in gaming | Leave a comment

Twittering to End Dictatorship: Ensuring the Future of Web-based Social Movements

Update: this post was originally published in June 2009. With the current events in Syria, it is important that we keep pressuring Western corporations not to sell surveillance technologies to these regimes. I have not updated the original post, so it might be a bit outdated.

People around the globe have been moved by the bravery of the Iranian people in their demand for an honest election. They have woven together Gandhi’s non-violence, the religious activism of the 1979 Iranian Revolution, and Web 2.0 tools to organize, strengthen and coordinate massive popular demonstrations. In doing so, they have inspired and touched people around the globe.

Not everybody will like what they see. The Iranian people are sending a chilling signal to oppressive governments around the world. They have demonstrated that there is no way to censor the Internet: an electron is just too small for governments to block.

The End of Censorship

For centuries, governments have worked to isolate their territory from the influence of outsiders, blocking their citizens’ free access to information, opinions, and ideas from the outside world. The notion that a country can be fenced off, that information flow can be prevented, is increasingly obsolete. Social movements, like multinational corporations, are no longer contained by national boundaries. Activists ally themselves with peers around the world using Web tools, both first-generation tools like email and 2.0 tools like Facebook and Twitter. In Iran in June 2009, every person has become a broadcaster, uploading video, short messages, and photos to YouTube, Twitter, and Flickr. Web 2.0 allows us to be witnesses of resistance and suppression in real time. It creates new allies and activists. The regime can slow the information flow within Iran and deport foreign reporters, but it has no tools to fight the armies of cell phone users arrayed in its streets. Information is still flowing in and out of the country.

This is very bad news for governments which depend on their control of mass media to restrict access to information. Their days are now numbered.

Unless, that is, such regimes begin to control the design, sale and dissemination of bandwidth, hardware and software. If they can control technology, instead of controlling the media and those information flows, they will be able to maintain control over their people.

This scenario is less unlikely than it seems.

The Role of Technology Firms:  More than Market Actors

The events in Iran make it starkly evident to anti-democratic rulers that their continued existence depends on restricting access to the Web. To gain control over the enormously distributed functions of the Internet, they will need to require that all corporations seeking access to their markets adapt the hardware of cell phones, hand-held computers, and PCs in ways that will enable such regimes to continuously monitor their use. Companies that do not comply with such demands will not get access to lucrative markets.

The biggest demands for changes to soft- and hardware are already coming from China. For example, China is demanding the source code from hardware manufacturers as an “obligatory accreditation system for IT security products” and it is drafting technical standards to “define methods of tracing the original source of Internet communications and potentially curbing the ability of users to remain anonymous”.

These demands put companies in a difficult spot: any support of democratic ideals pits the interests of shareholders against the interests of their customers in states without functioning democracies. Seldom have corporations sided with their customers. For example, American telecommunication companies, including AT&T, collaborated with the American government to build a surveillance system that was used to wiretap domestic and international communications without a warrant. The program might still be running. More recently, Western governments have proposed laws that would significantly increase the ability to track and trace their citizens. For example, Canada has proposed warrantless searches, Germany has proposed Internet censorship, and Australia has had plans to block BitTorrent.

Big Brother Free

We owe it to the Iranian people to insist that our technologies remain neutral, that our corporations do not give in to the demands of oppressive regimes by embedding remote control and tracing capabilities. We need to demand that the technologies that we use are and remain Big Brother Free.

The most powerful weapon against totalitarianism is not the gun but the cell phone. We must ensure it can be used.

This post was written by Alison Kemper and Diederik van Liere.

Posted in social movements, Twitter | Tagged , | Leave a comment

Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop


Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. This dataset in its uncompressed form is about 5.5 TB and still growing. The sheer size of this dataset poses some serious challenges for analysis. In theory, Hadoop would be a great tool to analyze this dataset, but it turns out that this is not necessarily the case.

Jimmy Lin wrote Cloud9, a Hadoop InputReader that can handle the stub Wikipedia dump files (the stub dump files contain all the variables of the full dump files, with the exception of the text of each revision). Unfortunately, his InputReader does not support the full XML dump files.

The XML dump files are organized as follows: each dump file starts with some metadata tags, and after that come the tags that contain the revisions. Hadoop has a StreamXmlRecordReader that allows you to grab an XML fragment and send it as input to a mapper. This poses two problems:

  • Some pages are so large (tens of gigabytes) that you will inevitably run into out-of-memory errors.
  • Splitting by tag leads to serious information loss, as you don’t know to which page a revision belongs.

Hence, Hadoop’s StreamXmlRecordReader is not suitable to analyze the full Wikipedia dump files.

During the last couple of weeks, the Wikimedia Foundation fellows of the Summer of Research have been working hard on tackling this problem. In particular, a big thank you to Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin. We have released a customized InputFormat for the full Wikipedia dump files that supports both compressed (bz2) and uncompressed files. The project is called WikiHadoop and the code is available on GitHub.

Features of WikiHadoop

WikiHadoop offers the following:

  • WikiHadoop uses Hadoop’s streaming interface, so you can write your own mapper in Python, Ruby, Hadoop Pipes or Java.
  • You can choose between sending 1 or 2 revisions to a mapper. If you choose two revisions then it will send two consecutive revisions from a single page to a mapper. These two revisions can be used to create a diff between them (what has been added / removed). The syntax for this option is:
    -D org.wikimedia.wikihadoop.previousRevision=false (true is the default)
  • You can specify which namespaces to include when parsing the XML files. Default behavior is to include all namespaces. You can specify this by entering a regular expression. The syntax for this option is:
    -D org.wikimedia.wikihadoop.ignorePattern='xxxx'
  • You can parse both bz2 compressed and uncompressed files using WikiHadoop.
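Because WikiHadoop uses the streaming interface, a mapper is just a program that reads records on stdin and writes tab-separated key/value pairs to stdout. A minimal Python sketch; it assumes each record arrives as one page fragment per line, so check the WikiHadoop documentation for the exact record layout your version emits:

```python
import sys
import xml.etree.ElementTree as ET

def handle_record(record_xml):
    """Extract the page title and revision ids from one input record."""
    page = ET.fromstring(record_xml)
    title = page.findtext("title")
    rev_ids = [rev.findtext("id") for rev in page.findall("revision")]
    return title, rev_ids

def main():
    # Hadoop streaming hands records to the mapper on stdin; emit
    # tab-separated key/value pairs on stdout.
    for record in sys.stdin:
        record = record.strip()
        if not record:
            continue
        title, rev_ids = handle_record(record)
        print("%s\t%s" % (title, ",".join(rev_ids)))

if __name__ == "__main__":
    main()
```
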

Getting Ready

  • Install and configure Hadoop 0.21. You need Hadoop 0.21 because it has streaming support for bz2 files, which Hadoop 0.20 lacks.
  • Download WikiHadoop and extract the source tree. Confirm there is a directory called mapreduce.
  • Download Hadoop Common and extract the source tree. Confirm there is a directory called mapreduce.
  • Move to the top directory of the source tree of your copy of Hadoop Common.
  • Merge the mapreduce directory of your copy of WikiHadoop into that of Hadoop Common.
    rsync -r ../wikihadoop/mapreduce/ mapreduce/
  • Move to the directory called mapreduce/src/contrib/streaming under the source tree of Hadoop Common.
    cd mapreduce/src/contrib/streaming
  • Run Ant to build a jar file.
    ant jar
  • Find the jar file at mapreduce/build/contrib/streaming/hadoop-${version}-streaming.jar under the Hadoop common source tree.

If everything went smoothly, you should now have built the WikiHadoop InputReader and a functioning installation of Hadoop. If you have difficulties compiling WikiHadoop, please contact us; we are happy to help you out.


So now we are ready to start crunching some serious data!

  • Download the latest full dump from dumps.wikimedia.org. Look for the files that start with enwiki-latest-pages-meta-history and end with ‘bz2’. You can also download the 7z files, but then you will need to decompress them; Hadoop cannot stream 7z files at the moment.
  • Copy the bz2 files to HDFS. Make sure you have enough space, you can delete the bz2 files from your regular partition after they have been copied to HDFS.
    hdfs dfs -copyFromLocal /path/to/dump/files/enwiki--pages-meta-history.xml.bz2 /path/on/hdfs/

    You can check whether the files were successfully copied to HDFS via:

    hdfs dfs -ls /path/on/hdfs/
  • Once the files are in HDFS, you can launch Hadoop by entering the following command:
    hadoop jar hadoop-0.<version>-streaming.jar \
    -D mapred.child.ulimit=3145728 \
    -D mapreduce.task.timeout=0 \
    -D mapreduce.input.fileinputformat.split.minsize=400000000 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
    -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 \
    -output /path/on/hdfs/out \
    -mapper <name of mapper> \
    -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat

    (mapreduce.input.fileinputformat.split.minsize sets the file split size: a smaller split size will mean more seeking and slower processing.)
  • You can customize your job with the following parameters:
  • -D org.wikimedia.wikihadoop.ignorePattern='xxxx'. This is a regular expression that determines which namespaces to include. The default is to include all namespaces.
  • -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec. This compresses the output of the mapper using the LZO compression algorithm. This is optional but it saves hard disk space.

Real Life Application

At the Wikimedia Foundation we wanted to have a more fine-grained understanding of the different types of editors that we have. To analyze this, we need to generate the diffs between two revisions to see what type of content an editor has removed and added. In the examples folder of the WikiHadoop repository you can find our mapper function that creates diffs based on the two revisions it receives from WikiHadoop. We set the number of reducers to 0 as there is no aggregation over the diffs; we just want to store them.
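The diff step at the heart of such a mapper is small. A sketch using Python’s standard difflib; the revision texts here are made up, and this is a simplification of what the actual example mapper does:

```python
import difflib

def revision_diff(old_text, new_text):
    """Return the lines added and removed between two revisions of a page."""
    delta = list(difflib.ndiff(old_text.splitlines(), new_text.splitlines()))
    added = [line[2:] for line in delta if line.startswith("+ ")]
    removed = [line[2:] for line in delta if line.startswith("- ")]
    return added, removed

# Two invented revisions of the same page:
old = "Hello world\nThis is a wiki page"
new = "Hello world\nThis is a wiki article\nwith an extra line"
added, removed = revision_diff(old, new)
print(added)    # ['This is a wiki article', 'with an extra line']
print(removed)  # ['This is a wiki page']
```
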

You can launch this as follows:

hadoop jar hadoop-0.<version>-streaming.jar \
-D mapred.child.ulimit=3145728 \
-D mapreduce.task.timeout=0 \
-D mapreduce.input.fileinputformat.split.minsize=400000000 \
-D mapred.reduce.tasks=0 \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
-input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 \
-output /path/on/hdfs/out \
-mapper /path/to/wikihadoop/examples/ \
-inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat

Depending on the number of nodes in your cluster, the number of cores in each node, and the memory on each node, this job will run for quite a while. We are running it on a three-node mini-cluster with quad-core machines, and the job takes about 14 days to parse the entire English Wikipedia dump files.

Posted in hadoop, wikipedia | Tagged , , | 28 Comments

Configuring Cassandra multinode on Ubuntu 10.10

Many, many thanks to David Strauss for guiding me through configuring Cassandra; I thought I should share this knowledge.

Installation of Cassandra

Let’s start with a clean 64-bit Ubuntu 10.10 installation. Before we can install Cassandra, make sure you have all the latest updates.

sudo apt-get update
sudo apt-get upgrade

Now you need to open /etc/apt/sources.list

sudo nano /etc/apt/sources.list

Add the following two lines:

deb unstable main
deb-src unstable main

Now, rerun apt-get update:

sudo apt-get update

You will get an error message, because we haven’t registered the PGP key yet. So let’s add the PGP key:

gpg --keyserver --recv-keys F758CE318D77295D
sudo apt-key add ~/.gnupg/pubring.gpg
sudo apt-get update

Now we are ready to install Cassandra:

sudo apt-get install cassandra

Configuring Cassandra on Box 1

By default, Cassandra is configured as a single-node installation. First, stop Cassandra:

sudo /etc/init.d/cassandra stop

Follow the next steps to create a multinode configuration for Box 1:

sudo nano /etc/cassandra/cassandra.yaml
  1. We need to change the name of the cluster. The current name is ‘Test Cluster’. Search for the string cluster_name and give your cluster a name.
  2. Next, search for the string listen_address and change it from localhost to the IP address of this machine.
  3. Search for the string rpc_address and change it from localhost to the IP address of this machine as well.

We need to delete the cached data and log files because we changed the name of the cluster.

sudo rm -rf /var/lib/cassandra/*

Finally, let’s restart Cassandra on Box 1:

sudo /etc/init.d/cassandra start

Configuring Cassandra on Box 2

Start by repeating the same steps as for Box 1. Then take two more steps while editing cassandra.yaml:

  1. Search for the string auto_bootstrap and change the value from false to true.
  2. Search for the string seeds and replace localhost with the ip address of Box 1.
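The handful of yaml edits for both boxes can also be scripted. A minimal helper sketch, assuming the stock cassandra.yaml of this era uses flat ‘key: value’ lines; the cluster name and IP addresses below are placeholders, and this is a convenience sketch, not an official tool:

```python
import re

def configure_node(yaml_text, cluster_name, ip, seed=None, bootstrap=None):
    """Apply the manual cassandra.yaml edits described above."""
    text = re.sub(r"(?m)^cluster_name:.*$",
                  "cluster_name: '%s'" % cluster_name, yaml_text)
    text = re.sub(r"(?m)^listen_address:.*$", "listen_address: %s" % ip, text)
    text = re.sub(r"(?m)^rpc_address:.*$", "rpc_address: %s" % ip, text)
    if bootstrap is not None:  # Box 2: auto_bootstrap true
        text = re.sub(r"(?m)^auto_bootstrap:.*$",
                      "auto_bootstrap: %s" % str(bootstrap).lower(), text)
    if seed is not None:       # Box 2: point the seed entry at Box 1
        text = text.replace("- localhost", "- %s" % seed)
    return text

# Placeholder excerpt of a stock config:
sample = (
    "cluster_name: 'Test Cluster'\n"
    "auto_bootstrap: false\n"
    "seeds:\n"
    "    - localhost\n"
    "listen_address: localhost\n"
    "rpc_address: localhost\n"
)
print(configure_node(sample, "WikiCluster", "",
                     seed="", bootstrap=True))
```
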

Finally, let’s restart Cassandra on Box 2:

sudo /etc/init.d/cassandra start

Inspect the log file of Cassandra on Box 1 and you should see that Box 2 has joined. You can also issue this command:

nodetool -host <ip of Box 1> ring

Posted in nosql | Tagged , | 2 Comments

Add Cairo support to iGraph 0.5

I have been working with iGraph 0.5.2 and I am really happy with its speed and the diversity of its algorithms. Of course, networks need to be visualized as well. iGraph does offer visualization capabilities, but you need Cairo installed. Unfortunately, installing the Python bindings for Cairo requires a little bit of hacking, especially if you do not want to upgrade to Python 2.6.

So, here we go: enabling Cairo support for iGraph using Python 2.5. (It’s probably way easier to use MacPorts or Fink, but I like to compile by hand.) Grab the following libraries:

and run for each library the sequence:

./configure

make

make test (not required, and not every library supports this)

(sudo) make install

Cairo also supports PDF and SVG output, but that will require additional libraries and compiling; this is the bare minimum to get Cairo running. If you run make test on the Cairo package, you are likely to have a bunch of tests fail. As far as I can tell that doesn’t really matter for iGraph, but I am sure that some features of Cairo won’t work.

Now, let’s fix pycairo-1.8.8. There are two issues:

  1. Pycairo-1.8.8 requires Python 2.6 or higher
  2. Pycairo might look for the PPC shared libraries which it can’t find.

First, open configure in a text editor that does not mess with the line breaks. I use TextWrangler for this; I tried nano first, but that gave me this error:

./configure: bad interpreter: No such file or directory

Open configure and go to lines 11116 and 11150; both will read:

minver = list(map(int, '2.6'.split('.'))) + [0, 0, 0]

and replace 2.6 with 2.5. Close the file and save it. Now we need to fix, so open it in a text editor and do the following:

Add at the top of the file:

from __future__ import with_statement

Comment import io by adding # in front of it

Go to line 76, it reads:

if sys.version_info < (2,6):

and replace 2.6 with 2.5

Save the file and close it. Now, we need to compile pycairo:

./configure LDFLAGS="-arch i386" (this will disable PPC support)

make

(sudo) make install

(sudo) python install

If everything went smoothly, fire up Python and enter:

import cairo

If you don’t get any errors then you have succeeded!

Posted in igraph | Tagged , , | Leave a comment