Category Archives: wikipedia

Setting up Amazon AWS instance to crunch Wikipedia XML dump files

In this post, I’ll describe the steps to set up an Amazon AWS instance and get access to Wikipedia XML dump files. There is a way to download wikidumps for any project / language; the data is from early 2009. I will … Continue reading

Posted in wikipedia

Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop

Background: Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. In its uncompressed form, this dataset is about 5.5 TB and still growing. The sheer size of this dataset poses some serious … Continue reading
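The post itself uses Hadoop with WikiHadoop to process the dump at scale, but the core idea — streaming through the XML instead of loading a multi-terabyte file into memory — can be sketched locally in Python. This is a minimal illustration, not the post’s actual pipeline; the embedded sample XML and the export-schema namespace are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia dump (illustrative sample, not real dump data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>Hello</text></revision></page>
  <page><title>B</title><revision><text>World</text></revision></page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def count_pages(stream):
    """Stream-parse a dump-style XML file and count <page> elements
    without ever holding the whole tree in memory."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            count += 1
            elem.clear()  # discard the parsed subtree to keep memory flat
    return count

print(count_pages(io.StringIO(SAMPLE)))  # → 2
```

On a real dump you would pass a file handle (or a `bz2.open(...)` stream) instead of the `StringIO`; the `elem.clear()` call is what keeps this approach viable on files far larger than RAM.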

Posted in hadoop, wikipedia