Very often I want to check whether the contents of given folders match, or whether the content of a single folder is redundant - maybe to verify that it's a duplicate and can be removed, or to make sure there is a copy of every file somewhere.
If the folder structures match and the files haven't been renamed, you can use diff -r, meld, or any other tool that compares folders (but these stop working as soon as you rename files or even directories). If you just want to find duplicates, you can use tools like duff or fdupes.
But - and this leads to my question - I'd like to check/query whether two folders have the same content on a file-content basis (rather than on a file-content-plus-file-path basis as with diff). Or, instead of listing the duplicates located in a given folder, I want to get the files that don't have at least one copy somewhere else on my system.
A possible tool's output could look like this:

    fuzzydiff folder1 folder2
    Only in 1: folder1/img_1234.jpg
    Only in 2: folder2/bali/very_nice_moment.jpg
    Only in 2: folder2/pictures_of_me/favorite_picture.jpg
(in this example the folders bali and pictures_of_me inside folder2 might not even exist in folder1, so diff -r would just skip them)
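In the meantime, a rough approximation of such a content-only comparison can be stacked together from standard tools by hashing every file and comparing the two hash sets. This is only a sketch, assuming bash (for process substitution) and GNU coreutils; the demo files and paths below are made up for illustration:

```shell
# Build a tiny demo tree (hypothetical paths, for illustration only).
mkdir -p demo/folder1 demo/folder2/bali
printf 'sunset' > demo/folder1/img_1.jpg
printf 'sunset' > demo/folder2/bali/renamed.jpg   # same content, new name/path
printf 'beach'  > demo/folder2/only_here.jpg

# Print one content hash per file in a tree, sorted and de-duplicated.
hashes() {
    find "$1" -type f -exec sha256sum {} + | awk '{print $1}' | sort -u
}

# comm -3 suppresses hashes common to both trees:
# column 1 = content only in folder1, column 2 = content only in folder2.
comm -3 <(hashes demo/folder1) <(hashes demo/folder2)
```

Here only the hash of the 'beach' file is printed, since the 'sunset' content exists in both trees despite the rename. Note that this compares content sets only; mapping hashes back to paths takes an extra join step.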
Has anyone with similar needs/requirements found a convenient and reliable way to retrieve the described information efficiently for file systems holding several hundred GB up to a couple of TB?
I'm working on a Linux system, so the suggested approaches should be POSIX-ish and command-line based (so that results can be stacked/combined).
In case my description is still too fuzzy, here is an ever-recurring example of a problem I want to solve: I want to delete a big folder with pictures or videos that I have copied/moved/renamed elsewhere, and I want a (hopefully empty) list of the files inside this folder that I don't have any copy of.
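This deletion check can also be sketched with standard tools: hash everything under a reference tree once, then list every file in the candidate folder whose content hash does not appear in that set. A minimal sketch, assuming GNU coreutils and paths without whitespace; all names below are hypothetical:

```shell
# Demo setup (hypothetical paths): one file in target has a backup copy,
# one does not.
mkdir -p safe/target safe/backup
printf 'keepme' > safe/target/a.mp4
printf 'orphan' > safe/target/b.mp4
printf 'keepme' > safe/backup/copy_of_a.mp4

# Hash everything under the reference tree once.
find safe/backup -type f -exec sha256sum {} + |
    awk '{print $1}' | sort -u > backup.hashes

# For each file in the candidate folder, print its path if its content
# hash is unknown in the reference set. (read -r splits on whitespace,
# so this sketch assumes paths without spaces.)
find safe/target -type f -exec sha256sum {} + |
while read -r hash path; do
    grep -qx "$hash" backup.hashes || printf '%s\n' "$path"
done
```

Only safe/target/b.mp4 is printed - exactly the "files without any copy" list; if that list is empty, the candidate folder is safe to delete. For TB-scale trees one would want to cache the hashes and skip re-hashing by size/mtime first, which is roughly where a dedicated tool earns its keep.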
I'm currently writing a tool which meets my requirements, but I doubt I'm the first one to face this kind of situation/problem. In any case I appreciate any hint or feedback which helps development!