Tarsnap Architecture

DRAFT

Backups have been on my mind recently because my backup drive decided to suddenly fail last week. I didn't lose anything critical or irreplacable (it was a backup drive after all), but the incident did prompt me to think more about offsite backups.

Tarsnap is an online backup system designed and operated by Colin Percival. Its tagline is Backups for the truly paranoid because its author is a security officer for the FreeBSD project and encrypted everything up the wazoo.

Unfortunately, I did some math and it isn't suitable for my backup needs. To store my music collection would cost over $100 a month, which is way out of my price range. Music is an interesting case for backups (also photos and videos and other media in general). Compression doesn't do any good, because the music is already compressed. And the deduplication doesn't do much good either, because once i rip an album and tag it, i pretty much never modify it again.

I may try using tarsnap for some smaller, more easily compressed data, such as my code and documents.

That said, i'm also interested in tarsnap from a technical standpoint. Tarsnap manages to provide all three of compression, encryption, and deduplication, while also allowing you to delete old backups. The two other deduplicating backup systems i've looked at (camlistore and bup) don't support deletion, which is an instant disqualifier for me, since i'm always running out of disk space.

There's some technical info on the tarsnap website about the cryptography and the deduplication but there's not much info about how it fits together—the architecture as a whole. That said, it's possible to glean some information from the documentation and from the author's blog, and the source code is available (but note that tarsnap is not open-source[2]), so let's take a deeper look and see what we can figure out.

I want to first draw attention to the following from the website, under Design > Efficiency.

When creating archives, Tarsnap takes streams of archive data and splits them into variable-length blocks; these blocks are compared, and any duplicate blocks are removed (i.e., the data is "de-duplicated") before being uploaded to the Tarsnap server.

Tarsnap keeps a local cache telling it what blocks have been previously stored, and uses this when creating further archives; thus storing two archives containing the same data only takes very slightly more space (due to a small amount of per-archive overhead) than storing a single archive. More importantly, if files change between archives, only the modifications will need to be uploaded when the second archive is created.

So basically, tarsnap uses the content-addressed storage trick that venti uses. All the deduplication is done by the tarsnap client, not the server, because the data is encrypted before it is handed to the server, which makes it impossible for the server to determine if blocks are the same.

In other words, the server is just a store for binary blobs, keyed by a hash (actually a HMAC). The entire design of tarsnap is dedicated to making sure the server can't determine anything about the blobs it stores.

Let's dive into the code.

├── keygen		source for tarsnap-keygen
├── keymgmt		source for tarsnap-keymgmt
├── keyregen		source for tarsnap-keyregen
├── lib
│   ├── crypto		keys?, RSAES-OAEP, scrypt
│   ├── datastruct	patricia tree, hash table
│   ├── keyfile
│   ├── netpacket
│   ├── netproto
│   ├── network
│   ├── scryptenc	tarsnap.com/scrypt.html
│   └── util
├── libarchive
├── libcperciva		code.google.com/p/libcpercia/ (partial)
│   ├── alg		sha256
│   ├── crypto		AES-CTR, Diffie-Hellman, ...
│   ├── datastruct
│   ├── events
│   ├── network
│   └── util
├── pkg			packages for Arch Linux and Debian GNU/Linux
│   ├── archlinux
│   └── debian
├── recrypt		source for tarsnap-recrypt
└── tar			source for tarsnap (hacked-up version of bsdtar)
    ├── ccache		chunk cache
    ├── chunks
    ├── glue		connects libarchive and multitape, and implements some top-level tarsnap modes
    ├── multitape	chunkify and stuff
    └── storage		storage backend communication

Most chunk-based backup systems i've run across (camlistore, bup) use a modified version of rsync's modified adler32 rolling checksum to determine the chunk boundaries.

Tarsnap doesn't.

The chunking code is in tar/multitape/chunkify.c. At first glance, it looks a lot more mathematically sound than bup's rollsum, about which bup's design document has this to say:

(Incidentally, even though the average chunk size is 8192 bytes, the actual probability distribution of block sizes ends up being non-uniform; if we remember our stats classes correctly, which we probably don't, it's probably an "exponential distribution." The idea is that for each byte in the block, the probability that it's the last block is one in 8192. Thus, the block sizes end up being skewed toward the smaller end. That's not necessarily for the best, but maybe it is. Computer science to the rescue? You know the drill.)

Note: the maximum value of meanval, 1262226, is carefully chosen so that p will be less than 2^32/3.

Tarsnap is built in layers.

The outer layer, based on libarchive, gathers all the files to be backed up and constructs a tar stream.

The multitape layer breaks the tar stream into content-dependant chunks, performs deduplication

The chunk layer does something. Compression is performed here.

The storage layer communicates with the tarsnap server to store chunks. Encryption is performed here.

Chunk format (lib/crypto/crypto_file.c)

256		Encrypted AES key (per-archive random session key (AES256), encrypted with RSAEP-OAEP (2048) (KEY_ENCR_PRIV))
8		Nonce (64-bit)
len		Data
32		HMAC of the encrypted data (HMAC_FILE_WRITE key)

Chunks are encrypted with a "per-archive random" session key, but new archives can refer to old chunks which means that restoring an archive can require arbitrarily many session keys, and is also why the session key is stored with each chunk, rather than in a single place. The CTR nonce also has to be stored with the chunk, because there's no way to reconstruct it otherwise.

Tarsnap uses only X functions from openssl: AES_encrypt, AES_set_encrypt_key, RAND_seed, BN_num_bytes BN_bin2bn, BN_bn2bin, ERR_get_error, ERR_error_string, ERR_load_crypto_strings, RSA_new, RSA_free, RSA_generate_key, RSAPublicKey_dup, RSA_private_encrypt, RSA_private_decrypt, RSA_pubilc_encrypt, RSA_public_decrypt, // Diffie-hellman BN_CTX_new BN_new BN_mod_exp BN_mod_mul BN_set_word BN_free BN_CTX_free

The RSA functions are called with RSA_NO_PADDING; tarsnap implements its own padding. It implements these on its own: SHA-256, HMAC-SHA256, MMAC-DRBG, AES-CTR mode, RSA padding, Diffie-Hellman group 14

I'm not sure why Colin decided to reimplement SHA-256. Possibly because the OpenSSL implementation is spread across dozens of files, due to sharing common bits with other hash functions, which makes it hard to audit?

BREAKING!

There's a brief overview of the architecture in Colin's 2013 talk at BSDCon. http://www.daemonology.net/papers/EuroBSDCon13.pdf

How to build tarsnap (overview)
1.  Start with bsdtar...
	... and libarchive, which does most of the work.
2.  Deduplicate archive data.
	2.1 Split data into blocks.
	2.2 Reference-count the blocks.
3.  Add cryptography.
	3.1 Encrypt all the blocks.
	3.2 Sign the archive.
	3.3 Sign all the blocks.
4.  Upload the new data...
	... while the archive is being created.
5.  Have a server which puts the data somewhere safe...
	... and can find it for you again when you need it!

Unrelated, but: video about cryptography https://www.youtube.com/watch?v=jzY3m5Kv7Y8 slides: http://www.daemonology.net/papers/crypto1hr-small.pdf