It is not uncommon for me to receive code requests from friends looking for a little help to “make something easier”. I’m talking about those pesky day-to-day things that after a while make you go: “If I could only push a button and have it done!”
This week I was chit-chatting with my good friend Evandro Pastor when he mentioned one of these situations that, according to him, makes his existence miserable. You see, Evandro is the brain, muscle and talent behind QuartoStudio, “a small studio located in the state of São Paulo, Brazil” specialized in creating websites and blogs for small to medium sized companies. We have worked together many times on several different projects, and through the years have become great friends.
Anyhow, Evandro was telling me how painful it is when a new client asks him to migrate an existing blog to a new host/domain and hands him a tarball containing the exported content of a WordPress blog in XML format. The problem is that some web hosting providers limit how big a file can be uploaded via a POST request, so depending on how big this XML file is, you may have to manually break it into smaller files first.
So during my lunch break yesterday I took it upon myself to create a script that would do this for him, so he wouldn’t have to manually split files and worry about making sure all the right tags were present and closed appropriately. Of course I searched the web for a free/open source alternative first, but after several minutes I still had not found something that would do what he needed.
"I’ll just write one then!", I said to myself. My first stop was to read up on lxml, since I figured that using something that could parse XML would be a better start than reading a file and searching for tags within it. Armed with this knowledge and still enjoying the buzz from my freshly brewed cup of coffee, I set out to write a couple of methods that take in an XML file and the number of “chunks” you want to split it into, and I was done!
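To give an idea of what such a method might look like, here is a minimal sketch of the splitting step. It is not the original script: it swaps in the standard library's xml.etree.ElementTree (whose basic API lxml.etree largely mirrors) so it runs without extra installs, and the names split_items, parent_path, tag and SAMPLE are made up for illustration. Note that ElementTree paths are relative to the root element, so the items live under "channel" here rather than "/rss/channel/item".

```python
import math
import xml.etree.ElementTree as ET


def split_items(xml_text, chunks, parent_path="channel", tag="item"):
    """Return a list of XML strings, each holding a slice of the <item> elements.

    Everything outside the items (channel title, metadata, ...) is kept
    in every piece, so each piece is a complete document on its own.
    """
    # Count the items once to work out how many go into each piece.
    total = len(ET.fromstring(xml_text).find(parent_path).findall(tag))
    size = max(1, math.ceil(total / chunks))

    pieces = []
    for start in range(0, total, size):
        # Re-parse a fresh copy and drop the items outside this slice.
        root = ET.fromstring(xml_text)
        parent = root.find(parent_path)
        for index, item in enumerate(parent.findall(tag)):
            if not (start <= index < start + size):
                parent.remove(item)
        pieces.append(ET.tostring(root, encoding="unicode"))
    return pieces


# A tiny, made-up export with five posts.
SAMPLE = (
    "<rss><channel><title>Blog</title>"
    + "".join(f"<item><title>Post {i}</title></item>" for i in range(5))
    + "</channel></rss>"
)

pieces = split_items(SAMPLE, 2)
```

Re-parsing the input for every piece is wasteful on very large exports, but it keeps the sketch short and guarantees that every output file has all the surrounding tags present and properly closed.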
There was still a problem, however: Evandro would have to use the Python interactive prompt to do what he needed. Writing a command line tool then became my next priority.
Since I have written command line tools in Python before, my first impulse was to copy some of the code I had already written somewhere else and adapt it for this specific project… and so I did. Two minutes later I had my first draft of a command line tool that could be invoked from the shell the same way that cp or ls can. It was then that it dawned on me that I had never really taken the time to learn about optparse, the “convenient, flexible, and powerful library for parsing command-line options.”
Of course, my first stop was to read through the “official” documentation. I then did a bit of Googling around and was pleased to find an amazing introductory article by Alexander Sandler. If you have ever wanted to learn how to write a command line tool and need to learn in detail about optparse, I highly recommend his article to supplement the official documentation.
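For anyone who has not seen optparse before, a bare-bones parser could look something like the sketch below. The option names (--chunks, --tag, --output), their defaults, and the helper build_parser are assumptions picked for illustration, not necessarily what the real script uses. Also keep in mind that newer Python versions recommend argparse over optparse, although optparse still ships with the standard library.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with a few hypothetical options for an XML splitter."""
    parser = OptionParser(usage="%prog [options] export.xml")
    parser.add_option(
        "-n", "--chunks", type="int", default=2,
        help="number of files to split the export into [default: %default]",
    )
    parser.add_option(
        "-t", "--tag", default="/rss/channel/item",
        help="path of the elements to distribute [default: %default]",
    )
    parser.add_option(
        "-o", "--output", default="chunk",
        help="prefix for the generated files [default: %default]",
    )
    return parser


def main(argv=None):
    # parse_args() falls back to sys.argv[1:] when argv is None.
    parser = build_parser()
    options, args = parser.parse_args(argv)
    if len(args) != 1:
        parser.error("expected exactly one XML file")
    return options, args[0]
```

Calling main(["-n", "4", "export.xml"]) would hand back the parsed options along with the file name, and running the tool with -h prints a usage message generated from the help strings.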
So Tuesday night, as I sat down to watch Lost, I added the finishing touches to my script and was able to surprise Evandro with it the next morning! Unfortunately this story does not have a 100% happy ending… The script works as expected and it lets you configure how many “chunks” to break the XML into, what tag to look for (defaults to “/rss/channel/item”) and what to name the generated smaller files. However, since a post or comment can have embedded HTML tags, lxml is converting them to HTML entities… aaaand WordPress doesn’t like that! Aaaand it turns out that the XML file generated by WordPress is not really plain XML per se but its own WXR (WordPress eXtended RSS) format. :/
Not everything is lost though. I did learn more about optparse and using lxml to parse XML files! By the way, if anyone has some more information on how I can make lxml leave embedded HTML tags alone, please drop me a line! :)