Posts Tagged ‘C’

BitTorrent encryption, MSE basics

Sunday, July 29th, 2007

I have been working quite a bit recently on my BitTorrent implementation. My project is faring reasonably well so far - it can successfully download large real-world torrents at decent speed. In the course of development, I have been focusing solely on the core specification, as documented here. Having gotten support for peer connections - outgoing and incoming - working, I started to notice something. Some peers would send ostensibly random data to my client. Investigation of this led me to conclude these were peers using the Message Stream Encryption protocol.

I am not hugely interested in avoiding traffic shaping, but the encryption protocol looks like fun to implement and I’d like my software to be able to communicate with these funky peers (instead of just killing them, as it does now). I know very little about cryptography, but I know a little more now that I have begun implementing MSE. MSE really has two components - a Diffie-Hellman (D-H) key exchange to negotiate a secret, and RC4 stream cipher to encrypt the data. Both these functions are used in Secure Sockets Layer. However it should be pointed out that MSE is more about obfuscating the data to confuse traffic shapers than hiding its contents from prying eyes.

Fortunately for me, the OpenSSL library contains nice implementations of both D-H key exchange and RC4 . Before I can get to the nitty gritty of actually negotiating obfuscated streams, I had to make considerable changes to my software’s peer handling logic. I had written the peer response handler with quite hard assumptions about the nature of the incoming peer handshakes - for example, assuming the first byte would be the length of the protocol string. Furthermore I needed to add some timekeeping logic for each peer, because MSE handshake depends on timing for synchronisation. The initial peer handshake message is variably padded and so time must be considered to know when the message is complete. For example a MSE handshake from peer A may be considered invalid by peer B a) if A sent less than 96 Bytes within 30 seconds or b) if A sent more than 608 bytes.

My first step has been to re-work my peer handshake handling so that MSE handshakes can be distinguished and separately handled by the software, while traditional handshakes continue to work. Code to do this has been committed. The next step is to actually support crypto peers. For anyone curious about where the source code for my software is, I am not yet ready to release it. I have distributed it to a small group of testers to help iron out some of the bugs, but I don’t feel its ready for public release. I am being somewhat selfish and sitting on my personal project until I have polished it to where I want it to be. If you are really interested though, shoot me an email (niallo (*) niallo dot net).

BitTorrent basics - protocol etc

Tuesday, July 10th, 2007

I started working on a BitTorrent implementation in 2006. The protocol is kind of fun and there is a need for a lightweight UNIX implementation in my opinion. rTorrent is quite usable, but absolutely huge - something like over 40,000 lines of C++ last time I looked. My implementation is ~3,500 lines at present and has most of the needed functionality - although admittedly it will not support downloading of multiple torrents at a time, nor the fancy UI of rTorrent. Even assuming it doubles in size by the time its finished, that is still quite lightweight in comparison. It is written in C, Yacc and uses libevent - and its under the BSD license. I develop it completely under OpenBSD, although across multiple platforms (to weed out the endianness and pointer-size bugs). It should be easily portable to other systems with some glue (much like OpenSSH or OpenNTPd). Anyway, it has shaped up rather well and in fact I have been able to actually download some ‘real-world’ torrents with it. However, I still need to overhaul the network scheduler. Once I am satisfied with this part, I will publish the source code. The network scheduler is a complicated component - with numerous research papers on the subject! - and I plan to cover it in future postings.

For now, I want to post a quick summary of how BitTorrent works. I’m going to describe ‘vanilla’ BitTorrent - without distributed, trackerless operation and unofficial encrypted streams. The best complete “specification” I have found is this wiki page - the official specification being rather useless. Unfortunately, the official Python implementation sources are a neceesary reference at times also.

BitTorrent splits data into a number of units called pieces. Typically these pieces are 256K. Each piece has an associated 20 byte SHA-1 hash to verify integrity. The pieces are sub-divided further into what are called ‘blocks’ - typically 16K. Each peer has zero, some or all pieces. Metadata is distributed in .torrent files. Very likely you are familiar with downloading these files via HTTP and then feeding them to a BitTorrent implementation in some way. These files are encoded in a mixed binary/ASCII format called ‘bencode’ - which is a kind of mini-language supporting nested strings, integers, dictionaries and lists. Its actually quite a pain to parse in a language like C - it took me quite some time to write the Yacc parser. The metadata contains things like, number of pieces, piece sizes, hashes for each piece, tracker URL and filenames. Pieces are all zero-indexed. This gets a little complicated in multi-file torrents, where you must determine which file piece X falls within, and at what offset. My implementation uses mmap(), and has translation routines to nicely abstract these calculations away :-)

Each peer must initially ‘announce’ itself to the tracker, and is supposed to send further updates periodically. Communication with the tracker is done via HTTP GET, with the tracker returning a bencode’ed dictionary as its response. This response includes a list of other peers - whose addresses are in fixed-length (IPv4) binary format, making it a little difficult to use IPv6 for BitTorrent, at least without protocol changes. There are a variety of peer messages. Data is transferred in terms of pieces, offsets and lengths (block size). Peers tell each other which pieces they have by sending a bitmap - one bit per piece.

The basics of the protocol are fairly straight forward. Aside from some poor decisions in terms of the bencode format for encoding messages, it is not hard to code support. As I said above, things start to get tricky when it comes to peer communication strategies. One thing of note is that the most efficient algorithm for downloading pieces is believed to be ‘rarest first order’. To my knowledge, nobody is using a sequential download order - which is why partial BitTorrent downloads of audio or video data are notoriously unplayable.

UNIX, tell me if I can execute this file

Thursday, June 28th, 2007

A surprisingly complicated question on a UNIX system, in fact. UNIX files have three relevant execution bits. One for owner, group and other. To test if a file is executable at all, you can stat it, and simply test the mode bit-field - consider this short C program:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <sys/stat.h>
#include <sys/types.h>
 
#include <err.h>
#include <stdlib.h>
#include <stdio.h>
 
int
main(int argc, char **argv)
{
        struct stat sb;
 
        if (argc < 2)
                errx(1, "supply a filename");
 
        if (stat(argv[1], &sb) == -1)
                err(1, "stat %s", argv[1]);
 
        if (sb.st_mode 
            & (S_IXUSR|S_IXGRP|S_IXOTH))
                printf("%s is executable by "
                    "owner, group or other\n",
                     argv[1]);
        else
                printf("%s is not executable\n",
                    argv[1]);
 
        exit(0);
}

Now we run it against some very common UNIX files:


$ ./isexec $(which sh)
/bin/sh is executable by owner, group or other
$ ./isexec /etc/passwd
/etc/passwd is not executable

This still doesn’t tell us if the file is executable by our user. To find out, we must first find our relevant user ids and all our group memberships. We can do this quite nicely with the getresuid and getgroups functions. You may ask why I use getresuid() instead of the superficially simpler getuid() function. The answer is that the value getuid() returns can be ambiguous, getresuid() is much more explicit about what the real, effective and saved user IDs are. It makes code clearer and reduces the opportunity for subtle bugs. We add code like the following to the start of our program:

1
2
3
4
5
6
7
8
        gid_t mygroups[NGROUPS_MAX];
        uid_t r, e, s;
 
        if (getgroups(0, mygroups) == -1)
                err(1, "getgroups failure");
        if ((ngroups = getresuid(&r, &e, &s))
            == -1)
                err(1, "getresuid failure");

Now we have our uid and our group ids, we can test against the owner, group and mode bit-field of the file. If we are the owner, and the owner execute bit is not set, we may not execute it - even if the group or other execute bits are set in our favour. Same with group - if we are in the same group as the file, but group execute bit is not set, we may not execute it - unless we are the owner and the owner execute bit is set. This is may seem a little bit complicated to code at first. The important thing is to get the order right:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
        if (euid == sb.st_uid)
                if (sb.st_mode & S_IXUSR)
                        goto exec;
                else
                        goto noexec;
 
         for (j = 0; j < ngroups; j++)
                if (mygroups[j] == sb.st_gid)
                        if (sb.st_mode & S_IXGRP)
                                goto exec;
                        else
                                goto noexec;
         if (sb.st_mode & S_IXOTH)
                 goto exec;
         goto noexec;
exec:
         printf("file is executable by you\n");
         return (0);
noexec:
         printf("file is not executable by you\n");
         return (0);

The use of the goto keyword here in fact increases legibility, since it means you don’t have to have a whole bunch of ugly conditionals.

You may wonder why I bothered to write about this subject. Well, the answer is that I needed such an algorithm in order to add auto-completion to the cwm window manager. I committed the code to OpenBSD’s Xenocara tree a few days ago, so that now the exec dialog scans the default PATH, determines which files you may execute, and populates the execution menu with these values, such that you get auto-completion (aka type-ahead-find) when executing programs.

Blog is blog

Friday, June 15th, 2007

For a long time I’ve considered Word Press to be an awful pile of poo. Simply consider that its written in PHP, a bad, buggy language, and is futher full of security holes of its own making. I was thinking about writing my own implementation in Python (maybe TurboGears or Django or my own thing) or C. If I was doing it in C, I could use the BSD-licensed undeadly.org code and plug in Clearsilver templates or somesuch. It would be incredibly fast and perhaps a bit of fun. The problem is not so much the software side of it - its trivial to throw a simple schema together to represent posts in SQLObject - but more the display and interface. Writing good forms and UI and so forth is highly time-consuming and I just couldn’t be bothered. HTML and CSS and JavaScript are hard. Word Press also has a ton of extra features which I probably wouldn’t have time to implement, yet could potentially be nice for my site. I decided I would leverage the scalable nature of the Internet Word Press community, all those hundreds of people working on little plugins and tweaks etc. Even if the average quality is pretty low, in the end it should work “well enough”. Much as I like polish and perfection, sometimes its nice to be lazy and accept the warm sludge vomited up by the PHP-abusing masses. So yeah, I went with Word Press.

Well, at least its running in a chroot()-jailed Apache, with some pretty strict PF rules blocking outbound connections by the Apache user in case yet-another-PHP hole comes out before we can patch our server.

I also decided to try out the reCAPTCHA WordPress plugin to protect my posts from spam-comments. reCAPTCHA is a very clever idea - take a huge amount of old books, OCR them, and then use the words that the OCR software failed to recognise as a starting point. Next make captcha-solving help digitize these old books. I like that it encourages a kind of “arms race” between those cracking captchas (effectively the same thing as OCR) and those making better captchas which in the end, no matter what, will benefit the effort of creating digital copies of old books.