Setting up an Amazon AWS instance to crunch Wikipedia XML dump files

In this post, I’ll describe the steps to set up an Amazon AWS instance and get access to the Wikipedia XML dump files. Dumps are available for any project and language; the data is from early 2009. I will detail the steps as a note for future reference. The data is made available as part of the Amazon AWS Public Datasets (http://aws.amazon.com/publicdatasets/).

  1. Create an AWS account
  2. Log in to AWS and select the EC2 tab
  3. Click ‘Launch Instance’
  4. Select a Configuration, the Basic 64-bit Amazon Linux AMI 1.0 is fine
  5. Instance Type: select Micro (this is the cheapest) and press continue
  6. Instance Defaults: keep the defaults and press continue
  7. Enter key/value pairs, such as key=name, value=WikiMirror (this is
    not strictly required), and press continue
  8. Create a new key pair, give it a name, and press Create &
    Download your Key Pair (this is your private EC2 key; store it
    somewhere safe).
  9. Create a security group (the default settings are fine) and press
    continue.
  10. Review your settings and press launch
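For reference, the same launch can also be scripted. Here is a hypothetical sketch using the AWS command-line interface; the AMI ID, key pair name, and security group are placeholders, not values from this post:

```shell
# Launch a micro instance from the command line (sketch only).
# Replace ami-xxxxxxxx with the ID of the Basic 64-bit Amazon Linux AMI
# in your region, and WikiMirror with your own key pair name.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t1.micro \
    --key-name WikiMirror \
    --security-groups default
```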

This will start an Amazon EC2 instance. The next step is to make the
Wikimedia public dataset accessible to our instance.

  1. Click EBS Volumes on the upper right of the window
  2. Select Create Volume from the different tabs
  3. Snapshot pulldown: scroll down and search for Wikipedia XML Backups
    (XML), and enter 500 GB in the size input field. Make sure that the
    zone matches the zone of your primary volume, then press create
  4. Click Attach Volume and enter /dev/sdf or similar
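The volume steps above can also be sketched on the command line. The snapshot ID of the Wikipedia dataset, the volume ID, the instance ID, and the availability zone below are all placeholders:

```shell
# Create a 500 GB volume from the public dataset snapshot (sketch only).
# The zone must match the zone of your instance's primary volume.
aws ec2 create-volume \
    --snapshot-id snap-xxxxxxxx \
    --size 500 \
    --availability-zone us-east-1a

# Attach the new volume to your instance as /dev/sdf.
aws ec2 attach-volume \
    --volume-id vol-xxxxxxxx \
    --instance-id i-xxxxxxxx \
    --device /dev/sdf
```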

We have now made the dataset available to our EC2 instance. Let’s get the data. (I’ll assume a Windows environment; the PuTTY steps below are not necessary for Linux / OS X users.)

  1. We have a .pem private key, but if we would like to use PuTTY on
    Windows, we need to convert it to a .ppk key that PuTTY can use.
  2. Download PuTTYgen from
    http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  3. Start PuTTYgen and select Load Private Key. Select the key that we
    downloaded at step 8 of creating an EC2 instance.
  4. Press Save private key and save the new .ppk key
  5. Close PuTTYgen and start PuTTY (you can download it from
    http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html as
    well)
  6. Create a new session. The EC2 server name can be found by going to the EC2 dashboard on the left and selecting Running Instances; at the bottom of the page you will find Public DNS followed by a long hostname. Copy this name and enter it in the PuTTY session.
  7. In PuTTY, click on SSH on the left and then Auth. There is a field
    labeled ‘Private key file for authentication’ with a browse button.
    Press the browse button and select the .ppk key from step 4.
  8. Start the session and enter ec2-user as the username; the key will
    authorize you. We are now logged on to our EC2 instance.
  9. Enter on the command line: sudo mkdir /mnt/data-store
  10. Enter on the command line: sudo mount /dev/sdf /mnt/data-store
    (the sdf part depends on what you entered at step 4 of attaching the
    dataset volume).
  11. cd to /mnt/data-store and enter ls -al; you will see all the
    available files.
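Steps 9–11 above amount to the following commands, run on the instance:

```shell
# Mount the attached dataset volume and list its contents.
# /dev/sdf must match the device name chosen when attaching the volume.
sudo mkdir /mnt/data-store
sudo mount /dev/sdf /mnt/data-store
cd /mnt/data-store
ls -al
```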

The next step is either to copy a file using SCP or to start your own FTP server on the EC2 instance and download the files that you need. You are ready to start crunching Wikipedia’s XML dump files!
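To give a flavor of the crunching itself: the real dumps are bzip2-compressed XML, so you would pipe them through bzcat before any text processing. The tiny hand-made stand-in file below (the real dumps are far larger and richer) illustrates pulling page titles out of the XML with grep and sed:

```shell
# A tiny stand-in for a real dump file.
cat > sample.xml <<'EOF'
<mediawiki>
  <page><title>Foo</title></page>
  <page><title>Bar</title></page>
</mediawiki>
EOF

# Extract the page titles. For a real dump, replace "cat sample.xml"
# with e.g. "bzcat enwiki-pages-articles.xml.bz2".
grep -o '<title>[^<]*</title>' sample.xml | sed 's/<[^>]*>//g'
# prints:
# Foo
# Bar
```

For anything beyond quick line-oriented inspection, a streaming XML parser is a better fit, since the full dumps do not fit in memory.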

