Repository crawlers for Mercurial (or why you need to learn about revsets)

30 Jan 2011


Originally posted at https://tech.labs.oliverwyman.com/blog/2011/01/30/repository-crawl-mercurial/

Recently I needed to write a tool to crawl a Mercurial repository and look for certain things in unfinished branches that could cause us problems in the future. Given I knew that Mercurial was written in Python, my first approach to this was to start digging around in its code and see if there was anything in there I could cannibalise to build what I needed. This appeared to be bearing fruit, as I found the ancestor module pretty quickly, but I rapidly realised that in order to do something relatively simple I was going to have to copy vast reams of the Mercurial code to support it.

This is the wrong approach, or at least for most purposes it’s a really bad idea, as there’s a much easier way to write this sort of tool. The primary goal of a crawler tool is to get some well-defined subset of the revision tree and then either confirm or alter some properties of those revisions. Most of the time and effort I was expending went into finding the wanted revisions, but Mercurial 1.6 added a lovely feature called “Revision Sets”, or revsets for short (which thankfully one of my colleagues pointed me towards before I got too deep into the brute force approach). hg help revsets will tell you the full syntax, but the first words of that (“Mercurial supports a functional language for selecting a set of revisions”) tell you all you really need to know.

To use them in your code, you need a little bit of boilerplate first.
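A minimal sketch of what that boilerplate can look like, assuming a reasonably recent Mercurial running under Python 3, where `ui.ui.load()` and `repo.revs()` are available and paths and revsets are passed as bytes. The function name `open_repo` is made up for this sketch, and here the repository is passed to `revs` explicitly rather than held in a module-level variable. Internal names have moved between Mercurial releases (older versions used a `revrange` helper instead), so treat the exact calls as assumptions to check against your installed version.

```python
def open_repo(path=b"."):
    """Open the Mercurial repository at `path` (bytes under Python 3)."""
    # Imported lazily: this is Mercurial's internal API and may change
    # between releases.
    from mercurial import hg, ui

    return hg.repository(ui.ui.load(), path)


def revs(repo, spec):
    """Return the changesets matching the revset `spec` as a plain list.

    A list rather than a generator, so the result can be sliced.
    """
    return [repo[rev] for rev in repo.revs(spec)]
```

With this in place, something like `revs(open_repo(), b"head()")[-5:]` would give the last five branch heads in the current directory's repository.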

The revs routine just makes it easier to use revsets (it could also be done as a generator, but I found I needed to do list slices of things too often for that to be useful), and you could make repo point to another specified location if you want, but this’ll work with the repository in the current directory.

Ok, now here are some fun things I just used in my script.
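The expressions below are illustrative stand-ins rather than the ones from the actual script. They only use revset predicates listed in hg help revsets (`head()`, `closed()`, `branch()`, `merge()`, `ancestors()`); the constant names, the `unmerged` helper and the branch name `feature-x` are hypothetical.

```python
# Heads of named branches that have not been closed (i.e. unfinished work),
# leaving out the default branch itself.
OPEN_BRANCH_HEADS = "head() and not closed() and not branch(default)"

# Merge changesets that live on the default branch.
DEFAULT_MERGES = "merge() and branch(default)"


def unmerged(branch, mainline="default"):
    """Revset for changesets reachable from `branch` but not from `mainline`.

    Names are double-quoted so that characters such as '-' are not parsed
    as revset operators.
    """
    return 'ancestors("%s") and not ancestors("%s")' % (branch, mainline)


# Example with a hypothetical branch name.
FEATURE_X_UNMERGED = unmerged("feature-x")
```

Each string can be handed to `hg log -r` as-is; when passing one to the internal API under Python 3 it needs encoding to bytes first.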

(shoving things in as strings isn’t particularly Pythonic, but such is the price of using an embedded DSL)

All of this can also be used at the command line, but in a lot of cases you then want to do some more filtering (e.g. I wanted to find certain patches whose format would be hard to specify with a regular expression, but was simple with a line or two of Python), or print out various bits of information about said revisions. The net result was that the tool I needed got written in a couple of hours, vs. potentially days’ worth of work if I’d continued down the original route.
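As a rough sketch of that extra filtering step: the function below takes changeset context objects (what you get from indexing a repository object in the internal API) and a plain Python predicate, and returns one summary line per match. The accessors `rev()`, `branch()`, `description()` and `files()` exist on Mercurial's changeset contexts and return bytes under Python 3; the names `summarise` and `needs_attention`, and the rule the predicate applies, are invented for illustration.

```python
def summarise(ctxs, wanted):
    """Return b'rev:branch:first line' for each changeset where wanted(ctx) is true."""
    lines = []
    for ctx in ctxs:
        if wanted(ctx):
            desc = ctx.description().splitlines()
            first = desc[0] if desc else b""
            lines.append(b"%d:%s:%s" % (ctx.rev(), ctx.branch(), first))
    return lines


def needs_attention(ctx):
    """Hypothetical rule: touches several files but mentions no issue."""
    return len(ctx.files()) > 1 and b"issue" not in ctx.description().lower()
```

The predicate is ordinary Python, so checks that would be painful as a regular expression can be written directly.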

All of this comes with a so far unmentioned proviso: the first section of the Mercurial API page is called “Why you shouldn’t use Mercurial’s internal API”, and that’s exactly what we’re doing here. However, by using revsets as opposed to any other part of the API, we’ve got an advantage: the Mercurial developers themselves advise using “Mercurial’s published, documented, and stable API: the command line interface”, and revsets are explicitly part of the command line interface. This doesn’t mean that they, or the interface we’re using to get to them, are guaranteed not to change in the next release, but the odds are better than for most parts of the code.
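If that risk matters more than convenience, the same revsets can be driven through the command line interface itself. The following is only a sketch of that alternative, not what the tool described here does: it shells out to `hg log -r REVSET --template ...` and parses the result. The function names and the tab-separated template are choices made for this example, and it assumes `hg` is on the PATH.

```python
import subprocess

# One changeset per line: revision number, short hash, branch name.
TEMPLATE = "{rev}\t{node|short}\t{branch}\n"


def log_command(revset):
    """Build the hg command line that lists changesets matching `revset`."""
    return ["hg", "log", "-r", revset, "--template", TEMPLATE]


def parse_log(output):
    """Turn the templated output into (rev, short_hash, branch) tuples."""
    rows = []
    for line in output.splitlines():
        rev, node, branch = line.split("\t", 2)
        rows.append((int(rev), node, branch))
    return rows


def query(revset, cwd="."):
    """Run hg in the repository at `cwd` and return the parsed rows."""
    result = subprocess.run(
        log_command(revset), cwd=cwd, check=True,
        capture_output=True, text=True,
    )
    return parse_log(result.stdout)
```

The trade-off is a process launch per query and plain strings instead of changeset objects, in exchange for depending only on the documented interface.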
