
I have accidentally deleted part of a folder (before stopping the rm command). However, the backup I restored was around two weeks old, and unfortunately I had renamed and restructured the directories between the time of the backup and the deletion. I have manually restored what I know was missing, but I'm not sure I caught everything.

Is there a fast method of showing file differences that ignores the parent directories and compares only file name and modification or creation date? For example, I have the file

data/output/test1/file1.mha

which I might have moved/renamed to

data/results/mhas/first_test/file1.mha

Using diff -rq did not work for this and is also rather slow. The directory is around 2 TB with a fairly large number of files, so checking the MD5 of every file is hardly an option.


To clarify a little bit, after restoring the backup, I have:

/data_backup_restore/output/test1/file1.mha

and

/data/results/mhas/first_test/file1.mha

since the restored backup still uses the 'old' directory structure. I changed it because it was a mess, but I didn't write down all the changes/renames I made, since there were a lot of them.
I would consider both of the above the same file if file size, modification date and filename match.

  • To clarify: you have one directory with files and another directory with files. And you want to compare them? Commented Mar 6, 2023 at 9:57
  • I'd probably start with an rsync dry run
    – Tom Yan
    Commented Mar 6, 2023 at 10:32
  • @RomeoNinov Yes, but the files are in different subdirectories
    – muffinname
    Commented Mar 6, 2023 at 12:21
  • @TomYan Sorry, forgot to mention in the original post that I've also tried that with rsync -navi, and while it was really fast, I could not get it to compare just the files while ignoring the path to the file
    – muffinname
    Commented Mar 6, 2023 at 12:24
  • OK, do they have the same names, i.e. compare file1 with file1? And do you want to check whether the file exists in the second directory, or whether the same file exists in the second directory? Commented Mar 6, 2023 at 12:24

3 Answers


If I understand correctly, you want to compare the two directories recursively while ignoring the directory structure: if two files in the two trees have the same filename, creation/modification time and size (you don't mention size, but I guess it will also be useful), treat them as the same file, even if they sit at different positions in the two trees.

If this is correct, you can create a list of files with size, time and filename like this:

ls -lR --time-style=long-iso /data/output/  | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_output.txt
ls -lR --time-style=long-iso /data/results/  | grep ^- | tr -s ' ' | cut -d' ' -f5- | sort -k 4 >files_results.txt

And then compare the two lists, either with diff or some GUI like meld.
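For example, a plain diff of the two lists will show which entries exist in only one of the trees (< lines only under /data/output, > lines only under /data/results):

diff files_output.txt files_results.txt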

Details:

  • Using --time-style=long-iso avoids locale-specific peculiarities that might break the following pipes.
  • grep ^- selects only the actual files, ignoring directories and possibly other special files. Depending on your use case, you might want to add more here, e.g. symbolic links...
  • tr -s ' ' squeezes runs of consecutive spaces so that the following cut works correctly in all cases.
  • cut keeps the fields from field 5 onward (size, date, time, filename).
  • sort makes the later comparison work. -k 4 is not strictly necessary as long as you are consistent in the two commands; it sorts by the fourth field (the filename), which may be useful.
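As an aside, if your find supports -printf (GNU find does), a minimal sketch of the same listing that avoids the tr/cut parsing, and therefore also tolerates spaces in the parent directory names, could look like this (%s is the size in bytes, %TY-%Tm-%Td %TH:%TM the minute-resolution modification time, %f the bare filename):

find /data/output/ -type f -printf '%s %TY-%Tm-%Td %TH:%TM %f\n' | sort -k4 >files_output.txt
find /data/results/ -type f -printf '%s %TY-%Tm-%Td %TH:%TM %f\n' | sort -k4 >files_results.txt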

After you compare the two files and find differences, you will of course have to locate each file in the original directory tree; you can use find for this.

Update

Based on your comments, if you want to find the full paths for filenames that appear many times, you can do the following:

First get the list of entries that are missing from your second directory, e.g. like this (comm expects sorted input; if it complains about the order, re-sort both lists with a plain sort first):

comm -1 -3 files_output.txt files_results.txt >missing_files.txt

Then, for each missing file use find to find the full path of the specific file:

while read -r size date time name
do
    # match on filename, exact size in bytes, and a one-minute mtime window
    find . -name "$name" -size "${size}c" -newermt "$date $time" ! -newermt "$date $time +0000 +1 minutes"
done <missing_files.txt

Note that this is just a simple example and not optimal: it calls find once per missing file, which can be slow if the directories are as big as you indicated. In that case you should optimize it, e.g. build a single list of all files similar to the ls -lR output but containing the full paths, and match that list against missing_files.txt.
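One possible shape for that optimization, as a sketch assuming GNU find (files_with_paths.txt is a name made up for this example): build the full-path list once, then look all missing entries up in a single pass with grep. Every line of missing_files.txt is a prefix of its corresponding line in the full-path list, so a fixed-string match suffices:

# same four fields as before, plus %p (the full path) appended
find /data/results/ -type f -printf '%s %TY-%Tm-%Td %TH:%TM %f %p\n' | sort -k4 >files_with_paths.txt
# one grep pass instead of one find invocation per missing file
grep -F -f missing_files.txt files_with_paths.txt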

  • And what about same filename, same size (you do not compare sizes) but different content? Commented Mar 6, 2023 at 10:38
  • Definitely, this will not catch these. My understanding is that the OP did not want to compare content because of the big size "The directory has a size of around 2TB and a fairly large number of files, so checking the MD5 of every files is barely an option". Even without doing MD5, comparing the content would be slow (that's why the diff -rq from the OP was also slow, since even with the -q it has to compare the whole files as long as they are identical, which they will mostly be in this use case).
    – gepa
    Commented Mar 6, 2023 at 10:43
  • IMHO the situation is opposite, hashing will work faster. And will be applicable for binary files too. Commented Mar 6, 2023 at 10:46
  • Faster than diff or faster than ls? I doubt it will be faster than diff -q, which only reports whether files differ, but that will depend on the actual implementation of diff -q (which btw also compares binary files). But my point was that both diff and md5 would be too slow for the OP, since they would have to scan through the whole 2 TB disk.
    – gepa
    Commented Mar 6, 2023 at 10:55
  • This worked well and was really fast; my only problem is that now I don't know where the missing files are located, since a lot of files have the same name. Is there a way to also print the full path to each file but ignore it while comparing the original outputs?
    – muffinname
    Commented Mar 6, 2023 at 12:55

To compare file contents, you could use the following commands:

find FolderA -type f -print0 | xargs -0 cksum > FoldA.cksum
find FolderB -type f -print0 | xargs -0 cksum > FoldB.cksum

You may then sort the two files together. As the first two fields are checksum and size, you can ignore pairs of lines that share the same checksum and size; an entry whose checksum/size pair appears only once denotes a file missing from the other folder.
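A minimal sketch of that grouping with awk, using the two listing files from above: the first pass records every checksum/size pair from FoldA.cksum, the second pass prints the FoldB.cksum lines (paths included) that have no counterpart. Swap the two file names to check the other direction:

awk 'NR==FNR { seen[$1" "$2]; next } !(($1" "$2) in seen)' FoldA.cksum FoldB.cksum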

Source: Compare large directories recursively - but ignoring sub-directories - compare two backups - with gui.

  • IMHO CRC (the default for the above command) is quite a weak hash and you may see a good amount of collisions. Commented Mar 6, 2023 at 12:42
  • @RomeoNinov: It's easy enough to test. It would be interesting if the poster could do some comparative tests. I never saw checksum collisions, but I haven't used it extensively.
    – harrymc
    Commented Mar 6, 2023 at 13:32
  • I've tried both this and the sha1sum version, but both seemed far too slow and were hogging too many resources, so I had to stop them. However, it's 'only' 155k files in total, so a collision seems unlikely to me? I'm sorry if my original post made it seem like much more; it's just that most things I tried took surprisingly much longer than ls, and I didn't know how many files there were myself.
    – muffinname
    Commented Mar 6, 2023 at 17:07

One possible way is to use hashes:

cd /directory1
sha1sum * **/* >/tmp/sum
cd /directory2
sha1sum -c /tmp/sum

The odd construction **/* is there to search subdirectories (recursive globbing must be enabled, e.g. shopt -s globstar in bash). This generates hashes of the files in the first directory and then checks them against the second directory, indicating for each file whether it is OK or has a missing/mismatched hash:

#a/aa: OK
rr: OK
zzz: FAILED
sha1sum: WARNING: 1 of 3 computed checksums did NOT match

P.S. Do not be afraid of using hash functions; they are quite fast.
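If you would rather not rely on shell globbing, here is a sketch of the same idea using find, with GNU sha1sum's --quiet flag so that only failures are printed:

cd /directory1
# hash every regular file at any depth; {} + batches many files per sha1sum call
find . -type f -exec sha1sum {} + >/tmp/sum
cd /directory2
# verify against the same relative paths, printing only missing/mismatched files
sha1sum -c --quiet /tmp/sum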

  • Thanks, something like this should work. I assume sha1sum is faster than the cksum suggested by harrymc?
    – muffinname
    Commented Mar 6, 2023 at 12:47
  • @muffinname, I am not sure, but AFAIK the default hash of cksum is CRC, which creates a higher probability of collisions (different files with the same hash). Without tests, I am 99% sure cksum with CRC is faster than sha1sum :) Commented Mar 6, 2023 at 12:52
