Migrating datasets to RotKraken on BTRFS

Published: Dec 17, 2021 by luxagen

After final testing of RotKraken with some smaller datasets (COW-copied for safety), I recently used it on my ~16 TiB data collection as part of a new long-term data-management strategy designed to simplify everything without compromising on data integrity.

They say that “engineering works because it assumes it doesn’t”, thus the existence of test-driven development, which I used from the start for RotKraken. Despite also designing it never to touch file content, I wasn’t going to assume anything, so here’s how I carefully applied it to my data.

Note: You’ll see a couple of instances of the sync command below. This is because, at least on Ubuntu 20.04, BTRFS has minor bugs (presumably to do with making generation-number updates visible before they’re committed to disk) that sometimes make it act as if nothing on a subvolume has changed. While this doesn’t necessarily mean that snapshotting will use the out-of-date state, I prophylactically sync before snapshotting subvolumes I’ve just changed to be sure.

1. Take a “before” snapshot

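We first take a read-only snapshot of the data as a safety copy in case anything goes wrong. In the commands below, $PATH is a placeholder for the path of the subvolume holding the data (not the shell’s executable search path, which should be left alone), and $SNAP_PRE and $SNAP_POST are the locations of the two snapshots: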

sync

btrfs subvolume snapshot -r "$PATH" "$SNAP_PRE"
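As a minimal sketch, the sync-then-snapshot pair can be wrapped in a small shell function so that both snapshots are taken the same way. The function name snapshot_ro, the RUN variable and the example paths are assumptions made for illustration; they are not part of RotKraken or btrfs-progs. Setting RUN=echo prints the commands instead of executing them, which allows a rehearsal without root privileges.

```shell
#!/bin/sh
# Hypothetical helper: flush pending writes, then take a read-only snapshot.
# Usage: snapshot_ro SOURCE_SUBVOLUME SNAPSHOT_DEST
# Set RUN=echo to print the commands rather than run them.
snapshot_ro() {
    src=$1
    dest=$2
    # Refuse to proceed if the destination exists: given an existing
    # directory, btrfs would create the snapshot *inside* it instead.
    if [ -e "$dest" ]; then
        echo "snapshot_ro: $dest already exists" >&2
        return 1
    fi
    ${RUN:-} sync || return 1
    ${RUN:-} btrfs subvolume snapshot -r "$src" "$dest"
}
```

With real paths this would be called as snapshot_ro /mnt/data /mnt/snapshots/pre (both paths hypothetical), and again with a different destination for the later snapshot.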

2. Do the job

We run RotKraken on the data to do the initial hashing run:

rk -i "$PATH"

3. Take an “after” snapshot

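The comparison in the next step should never write to either tree, so the second snapshot is also read-only; this guards against rsync mishaps such as accidentally omitting the dry-run flag: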

sync

btrfs subvolume snapshot -r "$PATH" "$SNAP_POST"

4. Check that only the extended attributes have changed

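Finally, we use rsync as a tree differ: -n makes it a dry run that changes nothing, and -i itemizes every difference it finds between the two snapshots. Filtering out the lines that report only an extended-attribute change on a regular file should leave no output at all: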

rsync --delete -aHAXni "$SNAP_PRE/" "$SNAP_POST/" | grep -v '^\.f\.\.\.\.\.\.\.\.x'
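For reference, each itemized line starts with an 11-character code that the rsync manual lays out as YXcstpoguax: the update type, the file type, then one column per attribute, ending with x for extended attributes. A regular file whose only difference is its extended attributes appears as .f........x. In a regular expression an unescaped dot matches any character, so the sketch below escapes every dot to make sure a line such as >f.st.....x (size and modification time changed as well) is not filtered out by accident. The function name unexpected_changes and the sample filenames are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical filter: read `rsync -ni` output on stdin and print only the
# lines that are NOT "regular file, extended attributes changed, nothing else".
# Returns 0 if nothing unexpected was found, 1 if something was printed,
# and 2 if grep itself failed.
unexpected_changes() {
    # \. matches a literal dot, so any flag (c, s, t, ...) in the attribute
    # columns keeps the line from being filtered out.
    grep -Ev '^\.f\.{8}x '
    case $? in
        0) return 1 ;;   # grep printed at least one unexpected line
        1) return 0 ;;   # nothing left after filtering
        *) return 2 ;;   # grep reported an error
    esac
}
```

It can be fed directly from the comparison, for example: rsync --delete -aHAXni "$SNAP_PRE/" "$SNAP_POST/" | unexpected_changes && echo "only extended attributes changed". Deletions, which rsync reports on lines beginning with *deleting, are also passed through as unexpected.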

Et voilà! We can now dispose of the snapshots (if we wish) and get on with more interesting things — with the assurance that any future change to the content of these files will be detectable.
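If the snapshots are no longer wanted, they can be removed with btrfs subvolume delete; being read-only does not prevent a snapshot from being deleted. The following is only a sketch: the function name drop_snapshots and the RUN variable are assumptions for illustration, and setting RUN=echo prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical cleanup helper: delete each snapshot named as an argument.
# Set RUN=echo to print the commands rather than run them.
drop_snapshots() {
    for snap in "$@"; do
        # Stop at the first failure so a mistyped path is noticed.
        ${RUN:-} btrfs subvolume delete "$snap" || return 1
    done
}
```

A typical call would be drop_snapshots "$SNAP_PRE" "$SNAP_POST", ideally only after the comparison has come back clean.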