
Very often I want to find out whether the contents of given folders match, or whether the content of a single folder is redundant - maybe to check that it is a duplicate and can be removed, or to make sure there are copies of every file somewhere.

If the folder structures match and the files haven't been renamed, you can use diff -r, meld or any other tool that compares folders (but these stop working as soon as you rename files or even directories).

If you just want to find duplicates you can use tools like duff or fdupes.

But - and this leads to my question - I'd like to check/query whether two folders have the same content on a file-content basis (rather than on a file-content-and-file-path basis as with diff). Or, instead of listing the duplicates located in a given folder, I want to get the files that don't have at least one copy somewhere on my system.

A possible tool's output could look like this:

fuzzydiff folder1 folder2
Only in 1: folder1/img_1234.jpg
Only in 2: folder2/bali/very_nice_moment.jpg
Only in 2: folder2/pictures_of_me/favorite_picture.jpg

(in this example the folders bali or pictures_of_me inside folder2 might not even exist in folder1, so diff -r would just skip the directory)

Is there anyone with similar needs who has found a convenient and reliable way to retrieve the described information efficiently for file systems of several hundred GB up to a couple of TB?

I'm working on a Linux system, so the suggested approaches should be POSIX-ish and command-line based (in order to stack/combine results).

In case my description is still too fuzzy, here is an ever-recurring example of the problem I want to solve: I want to delete a big folder with pictures or videos that I have copied/moved/renamed elsewhere, and I want a (hopefully empty) list of the files inside this folder that I don't have any copies of somewhere else.
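
To make that concrete, the check I currently do by hand looks roughly like the following sketch (the paths are made up, sha256sum is just one possible content hash, and it breaks on filenames containing newlines):

# hash every file in the folder I want to delete and everywhere it might live
find folder_to_delete -type f -exec sha256sum {} + | sort > to_delete.sums
find /data /backup -type f -exec sha256sum {} + | sort > elsewhere.sums

# hashes that occur only in the folder I want to delete
cut -c1-64 to_delete.sums | sort -u > a.hashes
cut -c1-64 elsewhere.sums | sort -u > b.hashes
comm -23 a.hashes b.hashes > lonely.hashes

# map those hashes back to file names: these files have no copy anywhere else
grep -F -f lonely.hashes to_delete.sums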

I'm currently writing a tool which meets my requirements, but I doubt I'm the first one to face this kind of problem. In any case I appreciate any hint or feedback which helps development!

2 Answers


Have you tried git-annex for managing the files? It automatically keeps track of which files are on which storage and makes sure there are at least n copies of a file across repositories.

For example, if you run git annex drop Photos/2014, it will delete the files locally, but only after verifying that they also exist on another disk (and git annex get … would copy them back). There are also the inverse commands git annex move/copy --to.

If you reorganize the files, git annex add && git annex sync will update the directory structure across all repositories. There's also a "preferred content" feature that lets you specify which repositories git-annex should automatically copy files to – for example, a large backup disk would want all files, the desktop would want only whatever was retrieved manually via git annex get, your laptop would want just the "Photos/2016" directory, etc.
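
A minimal setup sketch, in case it helps (the repository names and paths here are invented):

# turn the existing photo folder into an annex
cd ~/photos
git init
git annex init "desktop"
git annex numcopies 2              # require at least two copies of each file
git annex add .
git commit -m "add photos"

# hook up a second repository on the backup disk and sync content to it
git clone ~/photos /mnt/backup/photos
(cd /mnt/backup/photos && git annex init "backup disk")
git remote add backup /mnt/backup/photos
git annex sync --content backup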

So your original question, "which files are here but not there", could be answered via:

git annex find --in . --not --in backup_vol
git annex find --in . --not --copies 2

If the repository has numcopies ≥ 2, you could use:

git annex find --approxlackingcopies 1

Note that you will likely want to enable "direct mode" via git annex direct – this makes git-annex only track the latest versions of file content, but also makes working with files themselves simpler. (v6 thin mode improves this.)

  • I've also considered using tools like git, but that way I'd just have another duplicate detector and still no answer to my question about the differences between two folders. But +1 for git-annex - I'll give it a try (though I doubt it works for terabytes of files piled up over decades)
    – frans
    Commented Jan 25, 2016 at 15:33
  • It should work. It was pretty much built for that. Only the initial git annex add will take a while, but you asked for file content checking anyway, so that's unavoidable. Commented Jan 25, 2016 at 15:38

You can do this with rmlint.

Use the following command line to find files which are only in folder3:

rmlint -k -o uniques folder1 folder2 // folder3

Edit: also to find out which files can be safely deleted from folder3 because they have copies somewhere in folder1 or folder2:

rmlint -km folder3 // folder1 folder2

This will generate a shell script (rmlint.sh) which you can use to delete the identified files. For large datasets you might want to add a progress bar by passing -g on the command line.
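
Putting it together, a session for the "can I delete folder3?" case might look like this (the folder names are placeholders; rmlint.sh is rmlint's default output name):

# files in folder3 that also exist in folder1/folder2, with a progress bar
rmlint -km -g folder3 // folder1 folder2

# review the generated script, then run it to remove the redundant copies
less rmlint.sh
sh rmlint.sh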

  • sounds promising, I'll check this out! Unfortunately it's not packaged in standard distributions and you have to compile it :/
    – frans
    Commented Jan 26, 2016 at 10:48
  • Agreed; we're having trouble finding packagers except for Arch Linux. Anyway it's a pretty quick compile.
    – thomas_d_j
    Commented Jan 26, 2016 at 11:33

