Over the weekend I decided to spend some time improving the web page for my BitTorrent project. I felt that there is a good bit going on with the project, but that the web page was not reflecting this to people who might be visiting it. I already write CVS commit log messages, and I don’t feel like writing the same things into the HTML page as well. Why not simply display the latest N commit logs on the web page? This tells the visitor at a glance a) how active the project is and b) what I’ve been working on.
Probably the traditional approach to doing this kind of thing is using CVS ‘commit hooks’. These are scripts run on every check-in that can do things with the log messages and so on. For example, the OpenBSD ‘source-changes’ mailing list is run using this feature. I didn’t really want to bother with this though - not least because the web server is not the same machine as the CVS server. Also, the commit hook approach only captures commits made after the hook has been set up. I wanted prior changes to be visible.
The output of ‘cvs log’ contains all the relevant information. The cvs log command doesn’t require direct access to the CVSROOT. Indeed, I can run it quite easily read-only over SSH, since cvs.unworkable.org uses anoncvs, so the approach needs no special privileges and works from any machine. All I needed to do was write a program to parse the output of ‘cvs log’, order it by timestamp (latest first), and display the first N entries.
It turns out to be pretty easy to do this in Python. My first question was, what should the data structures be? I reasoned that each commit could be represented by a dictionary with a few keys - timestamp, file, and commit log message. So all we need is a list of these dictionaries - easy!
After this, I saw two challenges - the parser and the sorter. The parser is a fairly simple finite state machine. String handling is pretty nice in Python - at least compared to C. It did not take too much work to get it pulling out the data I needed - great.
Now to order the list. It is not hard to grasp what we need to do - we want something to compare each dict in the list based on its timestamp. But how exactly do we get Python to do this for us? We could write our own sort routine, but that shouldn’t really be necessary. Python already has a perfectly good list.sort() method. The question thus becomes, how do we make the sort() API do what we want? Python’s internal API docs reveal:
sort(...)
    L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
    cmp(x, y) -> -1, 0, 1
This tells us very little beyond a few named parameters. I can guess the meanings of ‘reverse’ and ‘cmp’ roughly, but there is no info at all on ‘key’. Sparse documentation like this is one of my pet peeves about Python. Fortunately I was able to figure it out. ‘key’ expects a function to be passed in, which will be called on each list entry and should return the value to sort by. Aha! All we need is to write a function to return the timestamp from each list entry. Oh wait - it turns out we don’t. Python’s standard library has a module called ‘operator’ which includes a ready-made helper, ‘itemgetter’, that builds exactly this kind of function for a given key. So the entire ‘sorting’ part of the problem can be achieved in a single line:
    entries.sort(key=operator.itemgetter('dt'), reverse=True)
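To make the ‘key’ behaviour concrete, here is a minimal sketch. The three entries are made up for illustration, and the key names ‘file’ and ‘msg’ are my assumptions (only ‘dt’ appears in the line above). It also assumes timestamps are kept as strings in the ‘YYYY/MM/DD HH:MM:SS’ form that cvs log prints, which happen to sort correctly as plain text.

```python
import operator

# Made-up entries for illustration; only the 'dt' key name comes from the post.
entries = [
    {"dt": "2007/11/18 21:02:11", "file": "a.c", "msg": "oldest change"},
    {"dt": "2007/11/20 04:44:07", "file": "b.c", "msg": "newest change"},
    {"dt": "2007/11/19 10:15:00", "file": "c.c", "msg": "middle change"},
]

# itemgetter("dt") returns a callable equivalent to: lambda entry: entry["dt"]
get_dt = operator.itemgetter("dt")

# Sort in place, newest first. Fixed-width 'YYYY/MM/DD HH:MM:SS' strings
# compare in chronological order, so no date parsing is needed here.
entries.sort(key=get_dt, reverse=True)

newest_first = [e["file"] for e in entries]
print(newest_first)
```

If the timestamps were not fixed-width, converting them with datetime.strptime before sorting would be the safer choice.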
The finished program simply reads from stdin and writes to stdout. It defaults to printing the most recent three entries, but this is configurable through a single command-line argument. For example, ‘cvs log unworkable | python changes.py -n 1’ currently prints out:
network.c
revision 1.180
date: 2007/11/20 04:44:07; author: niallo; state: Exp; lines: +3 -2
guard against a NULL piece_dl in the PIECE message handler. this needs to be re-examined, but it should at least stop us segfaulting.
which is exactly what I want!
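For anyone curious what the state-machine parser mentioned earlier might look like, here is a rough sketch. It is not the actual changes.py. It assumes the usual ‘cvs log’ layout: a ‘Working file:’ header line, a line of 28 dashes before each revision, and a line of 77 equals signs at the end of each file. The function names, the ‘rev’ key, and the second sample entry are all made up for illustration. It also ignores details such as the optional ‘branches:’ line.

```python
import operator

SEP = "-" * 28  # assumed separator line between revisions
END = "=" * 77  # assumed terminator line for one file's log


def parse_cvs_log(lines):
    """Parse `cvs log` text into a list of dicts: file, rev, dt, msg."""
    entries = []
    state = "header"
    current_file = None
    entry = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == SEP or line == END:
            # A separator closes any entry in progress.
            if entry is not None:
                entries.append(entry)
            entry = None
            state = "revision" if line == SEP else "header"
        elif state == "header":
            if line.startswith("Working file: "):
                current_file = line[len("Working file: "):]
        elif state == "revision":
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "revision":
                entry = {"file": current_file, "rev": parts[1],
                         "dt": None, "msg": ""}
                state = "date"
        elif state == "date":
            # "date: 2007/11/20 04:44:07;  author: ..." -> keep the timestamp
            entry["dt"] = line.split(";")[0][len("date: "):]
            state = "message"
        elif state == "message":
            entry["msg"] += line + "\n"
    return entries


def latest(entries, n=3):
    """Return the n most recent entries, newest first."""
    return sorted(entries, key=operator.itemgetter("dt"), reverse=True)[:n]


# Sample input: the first revision is the one shown above,
# the second is a made-up older entry.
SAMPLE = "\n".join([
    "Working file: network.c",
    "description:",
    SEP,
    "revision 1.180",
    "date: 2007/11/20 04:44:07;  author: niallo;  state: Exp;  lines: +3 -2",
    "guard against a NULL piece_dl in the PIECE message handler.",
    SEP,
    "revision 1.179",
    "date: 2007/11/19 00:00:00;  author: niallo;  state: Exp;  lines: +1 -1",
    "(made-up older entry)",
    END,
])

for e in latest(parse_cvs_log(SAMPLE.splitlines()), n=1):
    print(e["file"], e["rev"], e["dt"])
    print(e["msg"], end="")
```

Keeping the state in a plain string makes each transition easy to follow, at the cost of being less strict than a real parser would need to be.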
Anyway, you can download the Python program (which I’ve BSD-licensed) here. Feel free to post any comments or feedback.