In this post, I’ll describe the steps to set up an Amazon AWS instance and get access to XML Wikipedia dump files. This gives you a way to download wikidumps for any project / language, although the data is from early 2009. I will detail the steps as a note for future reference. The data is made available as part of Amazon AWS Public Datasets (http://aws.amazon.com/publicdatasets/).
- Create an AWS account
- Log in to AWS and select the EC2 tab
- Click ‘Launch Instance’
- Select a configuration; the Basic 64-bit Amazon Linux AMI 1.0 is fine
- Instance Type: select Micro (this is the cheapest) and press continue
- Instance Defaults: keep the defaults and press continue
- Enter key/value pairs, such as key=name, value=WikiMirror (this is
not really required), and press continue
- Create a new key pair, give it a name, and press Create &
Download your Key Pair (this is your private EC2 key, so store it
somewhere safe).
- Create a security group (default settings are fine) and press enter
- Review your settings and press launch
This will start an Amazon EC2 instance. The next step is to make the
Wikimedia public dataset accessible to our instance.
- Click EBS Volumes on the upper right of the window
- Select Create Volume from the different tabs
- Snapshot pulldown: scroll down and search for Wikipedia XML Backups
(XML), then enter 500 GB in the size input field. Make sure that the zone
matches the zone of your primary volume and press create
- Click Attach Volume and enter /dev/sdf or similar
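For readers who prefer a command line over the console, the volume steps above can be sketched with the AWS CLI. This is only a rough equivalent, assuming the AWS CLI is installed and configured with credentials, which this post does not cover. The function names create_volume and attach_volume and the run helper are hypothetical, and the snapshot, volume, and instance IDs are arguments you have to look up yourself.

```shell
#!/bin/sh
# Sketch only: CLI counterpart of the console steps above.
# Assumes the AWS CLI is installed and configured (not covered in this post).
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: create a 500 GiB volume from a public snapshot.
# $1 = snapshot ID of the Wikipedia dataset, $2 = availability zone of the instance
create_volume() {
    run aws ec2 create-volume --snapshot-id "$1" --availability-zone "$2" --size 500
}

# Hypothetical helper: attach the new volume to the instance as /dev/sdf.
# $1 = volume ID returned by create-volume, $2 = instance ID
attach_volume() {
    run aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device /dev/sdf
}
```

With DRY_RUN=1 set, calling either function only prints the aws command it would run, which is a cheap way to check the IDs and the zone before creating anything billable.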
We have now made the dataset available to our EC2 instance. Let’s get the data. (I’ll assume a Windows environment; I do not think this step is necessary for Linux / OSX users.)
- We have a .PEM certificate, but if we would like to use Putty on
Windows then we need to convert the .PEM certificate to a key format
that Putty can use.
- Download puttygen from
- Start puttygen and select Load Private Key. Select the key that we
downloaded at step 8 of creating an EC2 instance.
- Press Save private key and save the new .ppk key
- Close puttygen and start putty (you can download it from
- Create a new session: the EC2 server name can be found by going to the EC2 dashboard on the left and then selecting Running Instance. At the bottom of the page you will find Public DNS followed by a long name; copy this name and enter it in the Putty session.
- In Putty, click on SSH on the left and then Auth. There is a field
saying use key for authentication and a browse button. Press the
browse button and select the key from step 4.
- Start the session and enter ec2-user as the username; the key will
authorize you. We are now logged on to our EC2 instance.
- Enter on the command line: sudo mkdir /mnt/data-store
- Enter on the command line: sudo mount /dev/sdf /mnt/data-store
(the sdf depends on what you entered at step 4 of creating the
- cd /mnt/data-store and enter ls -al and you will see all the files
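The mount commands above can be collected into a small script to run on the instance. This is a minimal sketch: the function names mount_dataset and run are hypothetical, and the defaults assume the volume was attached as /dev/sdf and should be mounted at /mnt/data-store, as in the steps above. On some kernels the device may show up under a different name (for example /dev/xvdf), in which case set DEVICE accordingly.

```shell
#!/bin/sh
# Sketch only: mount the attached volume and list the dump files.
# DEVICE and MOUNT_POINT default to the values used in the steps above.
# Set DRY_RUN=1 to print the commands instead of running them.

DEVICE="${DEVICE:-/dev/sdf}"
MOUNT_POINT="${MOUNT_POINT:-/mnt/data-store}"

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: create the mount point, mount the volume, list its contents.
mount_dataset() {
    run sudo mkdir -p "$MOUNT_POINT" &&
    run sudo mount "$DEVICE" "$MOUNT_POINT" &&
    run ls -al "$MOUNT_POINT"
}
```

Calling mount_dataset after sourcing the script performs the three steps in order and stops at the first one that fails.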
The next step is either to copy a file using SCP or to start your own FTP server on the EC2 instance and download the files that you need. You are ready to start crunching Wikipedia’s XML dump files!
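As a rough illustration of the SCP route from a Linux / OSX machine, the sketch below assumes OpenSSH's scp is available locally and that the original .pem key is used directly (no .ppk conversion). The function name fetch_dump and the run helper are hypothetical, and the key path, host name, and file name are placeholders for your own values.

```shell
#!/bin/sh
# Sketch only: copy one file from the mounted volume to the local machine.
# Run this locally, not on the EC2 instance.
# The key file usually needs restrictive permissions first: chmod 400 <key>.pem
# Set DRY_RUN=1 to print the command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper.
# $1 = path to the .pem key, $2 = Public DNS name of the instance,
# $3 = file name under /mnt/data-store, $4 = local destination (default: current directory)
fetch_dump() {
    key="$1"
    host="$2"
    remote_file="$3"
    dest="${4:-.}"
    run scp -i "$key" "ec2-user@${host}:/mnt/data-store/${remote_file}" "$dest"
}
```

The dump files are large, so it is worth running fetch_dump with DRY_RUN=1 once to confirm the remote path before starting a real transfer.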