Protecting Data

If data matters, as it does to most of us computer scientists, protecting it is important. Prompted by Greg Grossmeier’s search for encryption best practices (see also his synthesized advice), I thought I’d document some of the things I do to protect my data.

There are two principal threats I consider here:

Loss
I need to be able to keep my data, even if my computer or other storage media is lost, broken, or stolen. Backups are the principal means of protecting against loss.
Misuse
I’d really rather not have my data used if it falls into the wrong hands (e.g. my laptop gets stolen). Aside from general paranoia reasons, I have financial records, class assignment solutions, and things like that on my hard drives that shouldn’t really see the light of day without my authorization.

Backups

I have used rsnapshot for quite a while and like it a lot. It is built on rsync and takes periodic snapshots of your data, rotating them through hourly, daily, weekly, and monthly snapshots. It uses rsync and cp to maintain your backup area as a set of snapshots on the file system with common files hard-linked. This takes more space than tools like bup or tarsnap that do block-level de-duplication, but has the distinct advantage of maintaining your backups as simple file trees. No specialized software is needed to recover a file or restore a backup, and there are no custom file formats to decode (or to increase the opportunity for bugs to corrupt your backups). It can also back up remote systems.
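To make the hard-link idea concrete, here is a minimal Python sketch of a snapshot that shares unchanged files with the previous one. It is not how rsnapshot itself is implemented (rsnapshot drives rsync and cp). The function name `snapshot` and its arguments are made up for illustration, and the sketch ignores symlinks, ownership, and files that have disappeared from the source.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, previous: Optional[Path], target: Path) -> None:
    """Copy `source` into `target`, hard-linking files unchanged since `previous`.

    Hypothetical helper: `previous` is the most recent snapshot directory,
    or None when this is the first snapshot.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (target / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src = Path(dirpath) / name
            dst = target / rel / name
            old = previous / rel / name if previous is not None else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share the inode with the older snapshot (no extra space).
                os.link(old, dst)
            else:
                # New or changed: store a fresh copy, preserving timestamps.
                shutil.copy2(src, dst)
```

Each snapshot directory is then a complete, ordinary file tree, which is why a plain cp is enough to restore from one. A changed file costs a full extra copy, which is where the space overhead compared to block-level de-duplication comes from.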

rsnapshot works great for always-on systems with always-attached storage. I had been backing up my laptop by periodically (2-3 times a week) rsyncing my home directory over to the file server, which snapshotted it (along with the web/mail server and some of its own data) with rsnapshot. But I wanted something more automatic and opportunistic than that. Also, forests of rsnapshot trees can be a bit cumbersome to copy around, e.g. to make offsite backups (hard links have to be preserved, which makes rsync work pretty hard on large filesystems).

rdiff-backup is also very nice. It stores the most recent backup as a native set of files, so restoring missing files is a snap. Incremental backups are stored as reverse binary diffs using the rsync algorithm. This means you don’t have the sea of hard links lying around, and get better de-duplication for changed files, but can’t restore old versions without rdiff-backup. But that’s fine for me; restoring from the most recent backup is the common case, rdiff-backup seems like it won’t be going anywhere anytime soon, and I still have a full copy of my data directly on the filesystem. Transfers of the full backup repository should also be faster than with a forest of hard-linked snapshots.

I used to have a hand-rolled script based on the ideas of rsnapshot for laptop backups, but now I use rdiff-backup. The backup is run by a script, which in turn is driven by a pair of systemd units on my laptop: a service and a timer. The service uses systemd conditions to quietly do nothing if the USB drive isn’t plugged in. The timer unit just tries to run the service every 15 minutes.

The script itself checks if the backup has been run in the last 3 hours. If not, it runs the backup.
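The check can be sketched in a few lines of Python. This is an illustration, not the actual script: it assumes the time of the last run is recorded as the modification time of a stamp file, and the names `backup_is_due` and `run_if_due` are hypothetical. The `backup` argument is any callable that performs the backup, for example one that invokes rdiff-backup.

```python
import time
from pathlib import Path
from typing import Callable, Optional

MAX_AGE_SECONDS = 3 * 60 * 60  # the 3-hour window described above


def backup_is_due(stamp: Path, now: Optional[float] = None,
                  max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if no backup has been recorded within the last `max_age` seconds."""
    if now is None:
        now = time.time()
    try:
        last_run = stamp.stat().st_mtime
    except FileNotFoundError:
        return True  # no stamp file yet, so no backup has ever been recorded
    return now - last_run >= max_age


def run_if_due(stamp: Path, backup: Callable[[], None],
               now: Optional[float] = None) -> bool:
    """Run `backup` only when one is due; record the run by touching the stamp."""
    if not backup_is_due(stamp, now):
        return False
    backup()
    stamp.touch()  # only reached if backup() did not raise
    return True
```

Because the stamp is touched only after the backup returns without raising, a failed run is retried the next time the timer fires rather than waiting out the full window.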

The result is that my laptop updates the backup of my home directory about every 3 hours whenever my external drive is plugged in. If you care, the external drive is a 1TB Western Digital My Passport. They’re nice, slim, USB-powered drives.

There are a few ad-hoc things I do that provide additional assurance for some data. Most of my source code lives on Bitbucket or GitHub. I synchronize a lot of research code and data between my laptop and workstation with unison, in addition to pushing the code to Bitbucket. I also use git-annex to manage media files, downloaded software packages, archives of old projects, etc. It shuttles things around between my laptop, workstation, external disk, file server, and offsite storage.

There are a few outstanding TODO items in my backup regime:

  • Regular offsite backups. I have infrastructure in place for offsite backups, and take them occasionally, but still need to make a regular schedule and perhaps some automation for these.
  • Integrate server backups into my laptop-based scheme. Right now, the file server in our living room is still responsible for backing up itself and the web server; I want to bring these into the same infrastructure I use for laptop backups and get them offsite regularly.

Encryption

Encryption is the primary means I use to protect my data from misuse. Both my laptop and my wife’s use full-disk encryption with LUKS (built into Linux). Our large external drives are also encrypted with LUKS.

I use Diceware passphrases to protect the encrypted volumes. Diceware makes it easy to reason about how much entropy is in your passphrase, providing a strong lower bound on the difficulty of brute-forcing it.
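The arithmetic is simple enough to sketch. The standard Diceware list has 6^5 = 7776 words, one for each outcome of five dice rolls, so each word chosen uniformly at random contributes log2(7776), or about 12.9 bits. The function names below are made up for illustration, and the numbers only hold if the words really are picked by dice (or another uniform random source) rather than by hand.

```python
import math

DICEWARE_LIST_SIZE = 6 ** 5  # 7776 words: five rolls of a six-sided die per word


def entropy_bits(num_words: int, list_size: int = DICEWARE_LIST_SIZE) -> float:
    """Entropy in bits of a passphrase of `num_words` uniformly random words."""
    return num_words * math.log2(list_size)


def average_guesses(num_words: int, list_size: int = DICEWARE_LIST_SIZE) -> int:
    """Expected guesses for a brute-force search: half the passphrase space."""
    return list_size ** num_words // 2


if __name__ == "__main__":
    for n in (4, 5, 6, 7):
        print(f"{n} words: {entropy_bits(n):.1f} bits, "
              f"~{average_guesses(n):.2e} guesses on average")
```

The bound holds even if the attacker knows the word list and the number of words, since the entropy comes entirely from the dice rolls.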

After I have memorized a passphrase (usually takes less than a week), I destroy the written version so it lives on only in my brain.