I have a project where I collect all the Wikipedia articles belonging to a particular category, pull the dump from Wikipedia, and load it into our database.
So I need to parse the Wikipedia dump file to get this done. Is there an efficient parser for this job? I am a Python developer, so I'd prefer a parser written in Python. If there isn't one, suggest an alternative and I'll try to write a Python port of it and contribute it back so that others can use it, or at least try it.
In short, all I want is a Python parser for Wikipedia dump files. I have started writing a manual parser that walks each node and extracts what I need.
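For reference, a minimal streaming sketch with only the standard library, assuming the usual MediaWiki XML export format (`<page>`, `<title>`, `<revision>`, `<text>` elements under the export namespace). `iterparse` avoids loading a multi-gigabyte dump into memory:

```python
import io
import xml.etree.ElementTree as ET

# Tiny inline sample in the MediaWiki export format; a real dump would be
# opened with open("enwiki-...-pages-articles.xml") or bz2.open(...).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello [[world]]</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Stream (title, wikitext) pairs without building the whole tree."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text")
            yield title, text
            elem.clear()  # free the finished <page> subtree as we go

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages)  # [('Example', 'Hello [[world]]')]
```

The namespace URI varies with the dump's export version (0.10 here), so check the `<mediawiki>` root element of your dump and adjust `NS` accordingly.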
There is example code for this at http://jjinux.blogspot.com/2009/01/python-parsing-wikipedia-dumps-using.html
I don't know about the licensing, but it is implemented in Python and includes the source.
Another good module is mwlib – it is a pain to install with all its dependencies (at least on Windows), but it works well.
Wiki Parser is a very fast parser for Wikipedia dump files (~2 hours to parse all 55GB of English Wikipedia). It produces XML that preserves both content and article structure.
You can then use python to do anything you want with the XML output.
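As a rough sketch of that post-processing step: the element and attribute names below are hypothetical, since Wiki Parser's exact output schema isn't shown here; adjust them to whatever its XML files actually contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical output shape -- Wiki Parser's real schema may differ.
OUTPUT = """<articles>
  <article title="Python (programming language)">
    <section heading="History">Guido van Rossum began work on Python in the late 1980s.</section>
  </article>
</articles>"""

root = ET.fromstring(OUTPUT)
titles = [article.get("title") for article in root.iter("article")]
for article in root.iter("article"):
    print(article.get("title"))
    for section in article.iter("section"):
        # Each section keeps the article structure the parser preserved
        print(" -", section.get("heading"))
```

For real dump-sized output you would swap `fromstring` for `ET.iterparse` over the file, as shown in the other answers.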
pip install mwxml
Usage is pretty intuitive as demonstrated by this example from the documentation:
>>> import mwxml
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
It is part of a larger set of data analysis utilities put out by the Wikimedia Foundation and its community.