Published: Dec 17, 2021 by luxagen
After final testing of RotKraken with some smaller datasets (COW-copied for safety), I recently used it on my ~16 TiB data collection as part of a new long-term data-management strategy designed to simplify everything without compromising on data integrity.
They say that “engineering works because it assumes it doesn’t”, thus the existence of test-driven development, which I used from the start for RotKraken. Despite also designing it never to touch file content, I wasn’t going to assume anything, so here’s how I carefully applied it to my data.
Note: You’ll see a couple of instances of the sync command below. This is because, at least on Ubuntu 20.04, BTRFS has minor bugs (presumably to do with making generation-number updates visible before they’re committed to disk) that sometimes make it act as if nothing on a subvolume has changed. While this doesn’t necessarily mean that snapshotting will use the out-of-date state, I prophylactically sync before snapshotting subvolumes I’ve just changed to be sure.
1. Take a “before” snapshot
We first snapshot the data to provide a safety copy should anything bad happen:
sync
btrfs subvolume snapshot -r $DATA_PATH $SNAP_PRE
2. Do the job
We run RotKraken on the data to do the initial hashing run:
rk -i $DATA_PATH
3. Take an “after” snapshot
To guard against rsync mishaps, we take another read-only snapshot:
sync
btrfs subvolume snapshot -r $DATA_PATH $SNAP_POST
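Because the comparison that follows relies on neither snapshot being modifiable, it can be reassuring to confirm that both really are read-only before going further. The following is a minimal sketch, not part of the original procedure: it assumes the `btrfs property get` subcommand from btrfs-progs, and the helper name `is_readonly_snapshot` is purely illustrative.

```shell
# Hypothetical helper: succeeds only if the given subvolume is read-only.
is_readonly_snapshot() {
    # `btrfs property get -ts <path> ro` prints "ro=true" or "ro=false".
    [ "$(btrfs property get -ts "$1" ro)" = "ro=true" ]
}

# Example use:
#   is_readonly_snapshot "$SNAP_PRE" && is_readonly_snapshot "$SNAP_POST" \
#       || echo "warning: a snapshot is writable" >&2
```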
4. Check that only the extended attributes have changed
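Finally, we use rsync’s handy tree-diffing mode (a dry run, -n, with itemized changes, -i) to check that the only changes were to extended attributes. The grep filters out regular files whose only difference is their extended attributes, so any line that is still printed marks an unexpected change; the dots are escaped so that each matches only a literal “.” (rsync’s marker for an unchanged attribute) rather than any character:
rsync --delete -aHAXni $SNAP_PRE/ $SNAP_POST/ | grep -v '^\.f\.\.\.\.\.\.\.\.x'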
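For scripted use, the same check can be turned into a pass/fail test. Each itemized line starts with an 11-character code of the form YXcstpoguax: Y is the update type (“.” when nothing is transferred), X is the file type (“f” for a regular file), and the remaining positions flag changes to checksum, size, time, permissions, owner, group, a reserved slot, ACLs and extended attributes. The sketch below is only an illustration (the function name `only_xattrs_changed` is made up for this example); it treats every dot as a literal character, so a line such as `>f.st.....x` still counts as unexpected.

```shell
# Hypothetical helper: reads `rsync -ni` output on stdin, prints every line
# that is anything other than an xattr-only change to a regular file, and
# succeeds only if no such line was found.
only_xattrs_changed() {
    _rc=0
    # Escaped dots match a literal '.', i.e. "this attribute is unchanged".
    grep -v '^\.f\.\.\.\.\.\.\.\.x' || _rc=$?
    # grep exits with 1 when it selected no lines, i.e. nothing unexpected.
    [ "$_rc" -eq 1 ]
}

# Example use:
#   rsync --delete -aHAXni "$SNAP_PRE"/ "$SNAP_POST"/ | only_xattrs_changed \
#       && echo "OK: only extended attributes differ"
```

Any offending lines are still printed, so a failure shows exactly which paths need a closer look.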
Et voilà! We can now dispose of the snapshots (if we wish) and get on with more interesting things — with the assurance that any future change to the content of these files will be detectable.