I am trying to deduplicate a BTRFS filesystem with multiple subvolumes. Altogether, it holds around 3.5 TB of data, which I expect to be slightly more than half that size after deduping. I am mostly concerned about duplicate files, not about single blocks (but I still want to deduplicate small files as well). File size varies greatly. The drive is currently in maintenance mode, i.e. no changes to files are being made while dedup is in progress.
`duperemove` is running on a system with 16 GB of physical memory and 8 GB of swap space. I use a hashfile due to the amount of data, and also because it allows me to interrupt and resume at any time.
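For reference, my invocation looks roughly like the following (the mount point and hashfile path are placeholders, not my actual paths):

```shell
# Hypothetical invocation; /mnt/data and /var/tmp/dedup.hash are placeholders.
# -r          recurse into subdirectories and subvolumes
# -d          actually submit dedupe requests to the kernel (omit for a scan-only run)
# --hashfile  keep block hashes in an on-disk database instead of RAM,
#             which also lets an interrupted run be resumed later
duperemove -rd --hashfile=/var/tmp/dedup.hash /mnt/data
```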
My initial run was with the default block size. Indexing took some 28 days to complete (producing a 21 GB hashfile), after which the system spent another 8 days loading duplicate hashes into memory before running out of memory almost completely and becoming unresponsive. (`duperemove` itself oscillated between 12 and 14 GB of memory usage for most of that time, yet free memory kept shrinking steadily, even though I did not see an increase in memory usage for any individual process on my system.)
My options to add extra memory are limited. Pretty much my only option is to add extra swap space on a USB drive, which adds another performance penalty on top of the already costly swapping mechanism. Nonetheless, I have added another 32 GB of swap space this way, just to prevent running out.
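The extra swap was set up along these lines (the mount point is a placeholder for wherever the USB drive is mounted):

```shell
# Sketch of the USB swap setup; /mnt/usb is a placeholder mount point.
fallocate -l 32G /mnt/usb/swapfile   # or dd, if the filesystem lacks fallocate support
chmod 600 /mnt/usb/swapfile          # swap files must not be world-readable
mkswap /mnt/usb/swapfile
# Swap areas activated later get a lower priority automatically,
# so the kernel should prefer the faster internal swap first.
swapon /mnt/usb/swapfile
```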
However, I have not tried using different block sizes (and the FAQ has hardly any information on that). So basically, my questions are:
- How should I choose my block size to prevent running out of memory?
- How should I choose my block size for best performance, while still maintaining a good dedup rate? (I don’t want to wait a month for a single test run again, but I can afford to waste a gigabyte or two of disk space.)
- What is the performance penalty of swapping? Does it help to cut down on memory usage so that swapping is not required, or is the benefit of not swapping offset by something else?
- Can I reuse an existing hashfile created with a different block size? If so, will changing the block size have any effect at all if everything is already hashed?