
I have two 300 GB files on different volumes:

  • encrypted local backup
  • encrypted ‘remote’ backup (on the NAS, that is)

By design, these two files are identical in size and also mostly (>90%) identical in content...

Is there an efficient tool to "rsync" these files, copying over only the differing sections, so the target file becomes identical to the source?

Perhaps something that builds checksums of blocks to figure that out, I don't know... (anything more efficient than cp -f... rsync would, afaik, also grab the entire source file to overwrite the target)
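
For illustration, here is a minimal sketch of that block-checksum idea; the file names and the 4 MiB block size are assumptions, not part of the question:

BLOCK=$((4*1024*1024))          # assumed block size: 4 MiB
SIZE=$(stat -c %s source.img)   # both files are the same size by design
for ((off=0; off<SIZE; off+=BLOCK)); do
  # hash the same block in both files
  a=$(dd if=source.img bs="$BLOCK" skip=$((off/BLOCK)) count=1 2>/dev/null | md5sum)
  b=$(dd if=target.img bs="$BLOCK" skip=$((off/BLOCK)) count=1 2>/dev/null | md5sum)
  if [ "$a" != "$b" ]; then
    # copy only the differing block into place, without truncating the target
    dd if=source.img of=target.img bs="$BLOCK" skip=$((off/BLOCK)) \
       seek=$((off/BLOCK)) count=1 conv=notrunc 2>/dev/null
  fi
done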

  • Why take the risk of syncing sections of a file, when you could well corrupt it by doing so? Good question, though.
    – ankit7540
    Commented Feb 25, 2017 at 13:58
  • Why are the backups only 90%+ identical? I would assume they should already be the same, if that's your goal. How are they created? Related: askubuntu.com/questions/556580/…
    – Elder Geek
    Commented Feb 25, 2017 at 14:32
  • 1) I will know about corruption by using a hash. 2) The encrypted images contain only a few changes; they never get fully (or even largely) rewritten... thus it's like sectors on a hard drive (only a few changes week by week)
    – Frank N
    Commented Feb 25, 2017 at 14:39
  • Isn't it possible to start from the other end, i.e. not generating the backups as monolithic images? Maybe you could use filesystem images: mount them, back up there, unmount, and encrypt, for example.
    – wk.
    Commented Feb 25, 2017 at 15:01
  • 3
    If I'm not mistaken, rsync can use a delta-transfer algorithm to transfer only the different parts. Try to force it with the --no-W option. Try --no-whole-file if that doesn't work.
    – barotto
    Commented Feb 25, 2017 at 15:26

2 Answers


rsync can be used to do this.

The --no-whole-file (or --no-W) option makes rsync use block-level (delta-transfer) syncing instead of whole-file syncing, which is otherwise the default when both paths are local.
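
Applied to the scenario in the question, the call would look something like this (the paths are placeholders; --inplace makes rsync update the destination file directly instead of rebuilding it in a temporary copy):

rsync --no-whole-file --inplace --progress --stats /backup/local.img /mnt/nas/remote.img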


Test case

I generated random text files using /dev/random plus large chunks of text pulled from websites, as follows. These 4 files all differ in content; tf_2.dat is our target file.

~/logs/rs$ ls -tlh    
-rw-rw-r-- 1 vayu vayu 2.1G  二  25 23:11 tf_2.dat
-rw-rw-r-- 1 vayu vayu 978M  二  25 23:11 a.txt
-rw-rw-r-- 1 vayu vayu 556K  二  25 23:10 file2.txt
-rw-rw-r-- 1 vayu vayu 561K  二  25 23:09 nt.txt
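
(For reference, files of similar size could be generated like this; a rough sketch only, using /dev/urandom since /dev/random may block on large reads, with sizes approximating the listing above.)

head -c 1600M /dev/urandom | base64 > tf_2.dat   # ~2.1G after base64 expansion
head -c  730M /dev/urandom | base64 > a.txt      # ~978M
head -c  417K /dev/urandom | base64 > file2.txt  # ~556K
head -c  420K /dev/urandom | base64 > nt.txt     # ~561K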

Then I copied them to a different hard disk using rsync (the destination was empty).

rsync -r --stats rs/ /mnt/raid0/scratch/t2

The following stats were reported.

Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 3,260,939,140 bytes
Total transferred file size: 3,260,939,140 bytes
Literal data: 3,260,939,140 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3,261,735,553
Total bytes received: 92

sent 3,261,735,553 bytes  received 92 bytes  501,805,483.85 bytes/sec
total size is 3,260,939,140  speedup is 1.00

Now I merge the files to make a new file which has approx. 60% old data.

cat file2.txt a.txt >> tf_2.dat

Now I sync the two folders again, this time using the --no-W option.

rsync -r --no-W --stats rs/ /mnt/raid0/scratch/t2

Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 4,289,593,685 bytes
Total transferred file size: 4,289,593,685 bytes
Literal data: 1,025,553,047 bytes
Matched data: 3,264,040,638 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 1,026,127,265
Total bytes received: 611,604

sent 1,026,127,265 bytes  received 611,604 bytes  21,169,873.59 bytes/sec
total size is 4,289,593,685  speedup is 4.18
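
(The speedup figure is simply the total size divided by the bytes actually transferred: 4,289,593,685 / (1,026,127,265 + 611,604) ≈ 4.18, so only about a quarter of the data had to cross between sender and receiver.)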

You can see that a large amount of data is matched, giving a good speedup.

Next I try again; this time I merge several shell files into the target (tf_2.dat) such that the change is ~2%:

cat *.sh >> rs/tf_2.dat

And again I sync using rsync.

rsync -r --no-whole-file --stats rs/ /mnt/raid0/scratch/t2


Number of files: 5 (reg: 4, dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 4
Total file size: 4,289,727,173 bytes
Total transferred file size: 4,289,727,173 bytes
Literal data: 178,839 bytes
Matched data: 4,289,548,334 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 541,845
Total bytes received: 690,392

sent 541,845 bytes  received 690,392 bytes  43,236.39 bytes/sec
total size is 4,289,727,173  speedup is 3,481.25

Again we see a large amount of matched data and a big speedup, giving fast syncing.
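
To address the corruption concern raised in the comments, the result can be verified with a hash afterwards, e.g. (using the file paths from this test):

md5sum rs/tf_2.dat /mnt/raid0/scratch/t2/tf_2.dat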

  • This is why I ❤ stackoverflow. Simplest solution. Accepted!
    – Frank N
    Commented Feb 25, 2017 at 17:22
  • I wonder if the results are the same if a block somewhere in the middle changes... (unchanged blocks after it must of course keep their position (a.k.a. offset), otherwise it becomes a costly rewrite. I know that as a C programmer...)
    – Frank N
    Commented Feb 25, 2017 at 17:53
  • 1
    And I wonder, if -no-whole-file implies --inplace (these guys wonder over the opposite direction). — Either way can probably not hurt,to add that param, too.
    – Frank N
    Commented Feb 25, 2017 at 17:55
  • I tried --inplace, but that did not work.
    – ankit7540
    Commented Feb 25, 2017 at 18:35
  • I think not having --inplace just means the data is first copied to a parallel file (a .hidden one; I could watch it slowly building up, gigabyte by gigabyte), and only then is the old file deleted and the tmp file renamed to its name... anyway, with --no-whole-file this should be a must.
    – Frank N
    Commented Feb 25, 2017 at 18:51

You can also try https://bitbucket.org/ppershing/blocksync (disclaimer: I am the author of this particular fork). An advantage over rsync is that it reads each file only once (as far as I know, rsync can't be convinced to assume two files are different without computing checksums before it starts the delta transfer; needless to say, reading 160 GB hard drives twice isn't a good strategy). A note of caution: the current version of blocksync works well over short-RTT connections (e.g., localhost, LAN, and local Wi-Fi) but isn't particularly useful for syncing over long distances.

