
I have a btrfs volume, which I create regular snapshots of. The snapshots are rotated, the oldest being one year old. As a consequence, deleting large files may not actually free up the space for a year after the deletion.

About a year ago I copied the partition to a bigger drive but still kept the old one around.

Now the new drive has become corrupted, so that the only way to get the data out is btrfs-restore. As far as I know, the data on the new drive should still fit on the old, smaller drive, and files do not really change much (at most, some new ones get added or some deleted, but the overhead from a year’s worth of snapshots should not be large). So I decided to restore the data onto the old drive.

However, the restored data filled up the old drive much more quickly than I expected. I suspect this has to do with the implementation of btrfs:

  • Create a large file.
  • Create a snapshot of the volume. Space usage will not change because both files (the original one and the one in the snapshot) refer to the same extent on the disk for their payload. Modifying one of the two files would, however, increase space usage due to the copy-on-write nature of btrfs.
  • Overwrite the large file with identical content. I suspect space usage increases by the size of the file because btrfs does not realize the content has not changed: it allocates new blocks for the rewritten file and fills them with the same data, leaving two identical copies of the file in two separate sets of blocks (a sketch reproducing this follows the list).
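
Something like the following should reproduce what I mean (a rough sketch only – it assumes /mnt/data is a mounted btrfs subvolume, the names are made up, and the commands are run as root):

dd if=/dev/urandom of=/mnt/data/bigfile bs=1M count=1024    # create a large file
btrfs subvolume snapshot -r /mnt/data /mnt/data/snap1       # snapshot it: extents are shared, usage barely changes
cat /mnt/data/bigfile > /mnt/data/bigfile.tmp               # plain byte copy into freshly allocated blocks
mv /mnt/data/bigfile.tmp /mnt/data/bigfile                  # "overwrite" the file with identical content
sync
btrfs filesystem du -s /mnt/data                            # "Exclusive" grows: the rewritten file no longer shares extents with the snapshot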

Does btrfs offer a mechanism to revert this by finding files which are “genetically related” (i.e. descended from the same file by copying it and/or snapshotting the subvolume on which it resides), identical in content but stored in separate sets of blocks, and turning them back into reflinks so space can be freed up?

1 Answer


TL;DR: There are tools to accomplish this, but they are not part of the official tool suite and probably not in your distribution’s repositories. You will have to choose from a number of tools and probably build the one you pick yourself. See below for details.

Tools

The btrfs wiki has an article on deduplication, which also mentions some tools.

There are more tools out there – I looked at one, though it seems to have been unmaintained for 6 years as of this writing, so I decided to stick with what is on the btrfs wiki.

None of these are part of the official btrfs suite so far, and at least Ubuntu 20.04 does not offer packages for them – you will have to build them yourself.

dduper looked promising – it claims to do both file-based and block-based deduplication (i.e. it deduplicates entire files, as well as blocks which are identical between two or more files). It is also said to be fast, as it relies on internal btrfs indices. Being written in Python, it does not need to be built before use (you do need the prettytable package for Python on your machine, though). However, it seems to skip any files below 4 KB, which I figure is counterproductive when you have lots of small, identical files.

I decided to go with duperemove, which only does block-based deduplication. Apart from a C build environment and autotools, you will need the libsqlite3-dev package on your machine. Grab the sources and build them by running make from the source dir. duperemove can then be run directly from the source dir, for those who don’t want to make install random stuff on their system.
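
In rough outline, the build goes something like this (the package list and the repository URL are from memory – check the project’s README for the authoritative build instructions):

sudo apt install build-essential libsqlite3-dev libglib2.0-dev   # glib may also be required
git clone https://github.com/markfasheh/duperemove.git
cd duperemove
make
./duperemove --help   # run straight from the source dir, no 'make install' needed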

Running duperemove

The docs mention two ways to run duperemove: directly, or by running fdupes and piping its output into duperemove. The first is only recommended for small data sets. The second one, however, turned out to be extremely resource-hungry for my data set of 2–3 TB and some 4 million files (after a day, progress was around half a percent, and memory usage along with constant swapping rendered the system almost unusable).
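
For reference, the fdupes variant would look something like this, with /foo being a placeholder (the --fdupes switch is described in the duperemove documentation):

sudo sh -c 'fdupes -r /foo | duperemove --fdupes'   # fdupes lists duplicate files, duperemove turns them into reflinks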

What seems to work for me is

sudo duperemove -drA /foo --hashfile=/path/to/duperemove.hashfile

This will:

  • Run the whole process as root, as my account does not have access to everything on the disk
  • Deduplicate results (-d), as opposed to just collecting hashes and spitting out a list of dupes
  • Recurse into subdirectories of the path given (-r)
  • Open files read-only for deduping (-A, needed because my snapshots are read-only)
  • Store hashes in a file, not in memory (--hashfile), giving you two advantages:
    • The process can be interrupted at any time and resumed later, as long as the hash file remains in place
    • The process will not hog memory on your system (unlike running in fdupes mode), although the hash file takes up disk space: 90 bytes per block and 270 bytes per file.

duperemove runs in multiple phases:

  • Index disk content (which is what is stored in the hash file)
  • Load duplicate hashes into memory
  • Deduplicate files

duperemove can be interrupted and resumed at any point, as long as the index database remains in place.

Test results

I ran duperemove against a disk with 8 subvolumes, 5 of which are snapshotted regularly, with 24 snapshots being kept around (the last 12 monthly, 5 weekly and 7 daily ones). Snapshots included, the disk holds some 5–6 million files taking up 3.5 TB (pre-dedup, expected 1.8 TB post-dedup).
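
To get a rough before/after picture of how much data is exclusive to each subvolume or snapshot versus shared with others, btrfs filesystem du can be used (the paths below are placeholders):

sudo btrfs filesystem du -s /foo/subvol1 /foo/snapshots/*   # prints Total / Exclusive / Set shared per path
sudo btrfs filesystem df /foo                               # overall allocation per block-group type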

When duperemove starts running and displaying progress, the percentage refers only to the indexing phase, not to the whole deduplication process. Loading duplicate hashes can take much less or much more time than indexing, depending on how many blocks were examined. Deduplication, again, takes roughly the same time as indexing. Also, progress calculation for the indexing phase seems to be based solely on the number of files, not blocks, making it an unreliable indicator of the total time/space required if your large files happen to be concentrated at the beginning (or the end) of the set.

Resource usage during the indexing phase is low enough to keep my system responsive when using a hashfile, though loading duplicate data can eat into your free memory. If the index DB is larger than the amount of free memory on your system, this may cause excessive swapping and slow down your system.

Indexing everything (with the default block size of 128K) took some 28 days to complete and produced a 21 GB hash file. I ran out of memory on day 36, which left my system unresponsive, so I had to abort. Memory usage by the duperemove process had been oscillating around 12–14 GB for four days, though total memory usage kept increasing until the system became unusable.

For the next attempts, I decided to deduplicate subvolumes one by one, plus an extra run across portions of two subvolumes between which I knew I had moved data. I started out with a 1024K block size, which misses smaller duplicate blocks as well as files smaller than the block size, in exchange for better performance. This took around 24 hours and ended up freeing some 45 GB on my drive – satisfactory performance, but the space savings were not what I had expected.
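
Such a per-subvolume run with a larger block size looks roughly like this (the paths are placeholders; -b sets the dedup block size and is documented in the duperemove man page):

sudo duperemove -drA -b 1024K /foo/subvol1 --hashfile=/path/to/subvol1.hashfile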

I aborted another attempt with 4K on a 300G subvolume – indexing took roughly four times as long as with 1024K but after 3 days, loading duplicate hashes still had not finished. Another attempt at 64K completed in under 4 hours. Note that deduplication for any subsequent pass after the first one should finish faster, as only the small blocks are left to deduplicate.

Hence my suggestions, based on practical experience:

  • Hash file placement is important:
    • Make sure you have plenty of disk space (I did not, so I had to interrupt, move the file and resume).
    • /tmp may not be a good idea as it may get wiped on reboot (even if you don’t plan on rebooting, you might want to be safe in case of a system crash or power outage).
    • If you have encrypted your home dir, expect performance to suffer: in my experience, keeping the hash file on an external USB disk performed better than keeping it in my encrypted home dir on the internal HD.
  • Data cannot be deduplicated between two (or more) read-only subvolumes. If you have multiple read-only snapshots of the same subvolume, choose your paths so that only one of these snapshots gets scanned – usually the most recent one, or the one you are planning to keep around the longest. Scanning all snapshots will take much longer and will not increase efficiency – unless some files in the current set match an earlier snapshot but not the latest one.
  • Deduplicate in portions – split your filesystem into portions between which you do not expect a lot of duplicate data. This will save memory when loading duplicate hashes.
  • Determine the optimum block size as described in Choosing the right block size for duperemove. For my drive, 16K worked well for portions of 0.6–1 TB (pre-dedup), while at 8K, loading duplicate hashes went up from a few hours to more than 2 days – which means 16K was the sweet spot I was looking for. With a smaller portion, 4K has worked well. (A command line putting these suggestions together follows below.)
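
Putting these suggestions together, a run over one portion might look like this (all paths, the block size and the way the data is split into portions are examples from my setup, not recommendations for yours):

sudo duperemove -drA -b 16K --hashfile=/media/usbdisk/hashes/portion1.hashfile /foo/subvol1 /foo/snapshots/subvol1-latest

That is: one portion per run, one hash file per portion kept on a separate unencrypted disk, and only the most recent read-only snapshot of the subvolume included in the scan.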
  • Is it worth it?
    – gapsf
    Commented Sep 3, 2022 at 21:26
  • Depends on how much space you are hoping to save and how much time and resources you are willing to throw at it. I figured that, after wasting a month, I could spend another week retrying with more efficient settings.
    – user149408
    Commented Sep 4, 2022 at 15:59
  • Unfortunately, the implementation of duperemove is very inefficient (a lot of linear or quadratic matching for things that should be linear / logarithmic time). But the script is very short, so it's not unreasonable to think that it could improve!
    – Clément
    Commented Nov 5, 2022 at 5:56

