Faster BitTorrent, or, SHA1 is slow
This weekend I have been concentrating on improving the performance of my BitTorrent implementation. I somewhat believe in the mantra “premature optimisation is the root of all evil”, or at least, I’m not too worried about making everything super fast the first time around. Asynchronous network programs are complicated enough to write that I expected off the bat to have to rewrite much of the code after having sketched it out. This process of hacking up an initial approach, pushing it as far as I could, figuring out a better way, then re-doing it the better way, has brought my project reasonable success. In any case, I was happy enough with the download process that I decided to spend some time making it faster.
In order to make something fast, it helps to understand what is making it slow. Efficient BitTorrent protocol handling requires keeping lots of “in-flight” data block requests around. Both queued to individual peers, or spread out across many peers. At any one time, there are many block requests which the implementation must track, especially in order to make decisions about what it should try to download next. While working on my initial sketch which used a single linear tail queue for all “in-flight” requests, I found there are two common ways I need to access these requests. Firstly, I often ask “give me the in-flight request for offset X of piece index Y”. This maps nicely onto a binary tree structure, searched by request index and offset. Currently, these properties are guaranteed to be unique – i.e. only one request for a given index/offset pair can exist at a time – but in the future this will not be guaranteed. Think especially of “the endgame”, where we may want to request the same block from many peers. To future proof against this, I have the tree actually point to a tail queue of requests. Currently I am using a red-black tree. The second way I often need to access “in-flight” requests is by peer. I need to know “what are all the in-flight requests for peer X”. This is quite a bit easier to manage. Each peer simply has its own tail queue of its associated requests, which is at most one hundred elements long, and so cheap to search linearly. Furthermore, when inspecting a peer’s requests, it is almost always to perform an operation – e.g. deallocation – on all of them, so the the linear data structure is perfectly appropriate.
I didn’t measure exactly how much of an improvement this data structure change made – perhaps very little, since the linear lists were never more than a thousand entries long – however I’m happier to use the ‘right’ data structures in general, and not have to worry about it. After some profiling, I found another pretty obvious consumer of resources – SHA1 hashing. I had not especially optimised unworkable’s handling of the PIECE message, and it was performing a SHA1 hash of a piece as each block for it came in – regardless of whether we had received all the other blocks or not. I added a simple conditional which only does the checksum when we at least believe we have received all the blocks in the piece. This drastically improved performance, since we are now doing on the order of 16x (but depending on piece/block sizes, of course) fewer SHA1 hash operations.
I have not looked closely at other BitTorrent implementations, but I believe I can answer with relative confidence the question “why is BitTorrent so resource hungry?”. The answer: SHA1 hashing is expensive. As an aside, I do not think that OpenBSD yet supports hardware-accelerated SHA1 features in the newer VIA CPUs. Such hardware crypto features would be nice to see in more chips, maybe. Also, I wonder if other BitTorrent implementation authors have come up with better data structures or storage schemes to keep track of the stuff I am keeping track of. I must take a look sometime.







Recent Comments