Tag Archives: wikipedia

Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop

Background
Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. In its uncompressed form this dataset is about 5.5 TB and still growing. The sheer size of this dataset poses some serious …
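As a rough illustration of the approach in the post title, here is a minimal sketch of a Hadoop Streaming mapper for such a dump, assuming WikiHadoop's input format hands each mapper fragments of page/revision XML on stdin; the tag parsed and the per-title count emitted are illustrative, not the post's actual job.

    #!/usr/bin/env python
    # Illustrative Hadoop Streaming mapper (assumption: stdin carries fragments
    # of Wikipedia page XML delivered by WikiHadoop's input format).
    import re
    import sys

    TITLE_RE = re.compile(r"<title>(.*?)</title>")

    def main():
        for line in sys.stdin:
            # Emit one count per <title> element seen, so a downstream
            # reducer can tally revisions per page title.
            for title in TITLE_RE.findall(line):
                print("%s\t1" % title)

    if __name__ == "__main__":
        main()

On the command line, a job like this would typically be wired up through the Hadoop Streaming jar, with the WikiHadoop jar supplied via -libjars and its record reader named with -inputformat; the exact class name and flags depend on the WikiHadoop release.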

Posted in hadoop, wikipedia | 28 Comments