Diff of two similar big raw binary files

Question

Let's say I have a 4 GB file abc on my local computer. I uploaded it to a distant server via SFTP, which took a few hours.

Now I have slightly modified the file (probably 50 MB maximum, but not consecutive bytes in this file) locally, and saved it into abc2. I also kept the original file abc on my local computer.

How to compute a binary diff of abc and abc2?

Applications:

I could send only a patch file (probably 100 MB at most) to the distant server instead of re-uploading the whole abc2 file (which would take a few hours again!), and recreate abc2 on the distant server from abc and the patch alone.
Locally, instead of using 8 GB to back up both abc and abc2, I could keep only abc + patch, which would take less than 4100 MB.

How to do this?

PS: for text, I know diff, but here I'm looking for something that could work for any raw binary format, it could be zip files or executables or even other types of file.

PS2: If possible, I don't want to use rsync; I know it can replicate changes between 2 computers in an efficient way (not resending data that has not changed), but here I really want to have a patch file that can be re-applied later if I have both abc and patch.

Kusalananda · Accepted Answer · 2020-02-03 17:05:22Z

For the second application/issue, I would use a deduplicating backup program like restic or borgbackup, rather than trying to manually keep track of "patches" or diffs. The restic backup program allows you to back up directories from multiple machines to the same backup repository, deduplicating the backup data both amongst fragments of files from an individual machine as well as between machines. (I have no user experience with borgbackup, so I can't say anything about that program.)

Calculating and storing a diff of the abc and abc2 files can be done with rsync.

This is an example with abc and abc2 being 153 MB. The file abc2 has been modified by overwriting the first 2.3 MB of the file with some other data:

$ ls -lh
total 626208
-rw-r--r--  1 kk  wheel   153M Feb  3 16:55 abc
-rw-r--r--  1 kk  wheel   153M Feb  3 17:02 abc2

We create our patch for transforming abc into abc2 and call it abc-diff:

$ rsync --only-write-batch=abc-diff abc2 abc

$ ls -lh
total 631026
-rw-r--r--  1 kk  wheel   153M Feb  3 16:55 abc
-rw-------  1 kk  wheel   2.3M Feb  3 17:03 abc-diff
-rwx------  1 kk  wheel    38B Feb  3 17:03 abc-diff.sh
-rw-r--r--  1 kk  wheel   153M Feb  3 17:02 abc2

The generated file abc-diff is the actual diff (your "patch file"), while abc-diff.sh is a short shell script that rsync creates for you:

$ cat abc-diff.sh
rsync --read-batch=abc-diff ${1:-abc}

This script modifies abc so that it becomes identical to abc2, given the file abc-diff:

$ md5sum abc abc2
be00efe0a7a7d3b793e70e466cbc53c6  abc
3decbde2d3a87f3d954ccee9d60f249b  abc2
$ sh abc-diff.sh
$ md5sum abc abc2
3decbde2d3a87f3d954ccee9d60f249b  abc
3decbde2d3a87f3d954ccee9d60f249b  abc2

The file abc-diff could now be transferred to wherever else you have abc. With the command rsync --read-batch=abc-diff abc, you would apply the patch to the file abc, transforming its contents to be the same as the abc2 file on the system where you created the diff.

Re-applying the patch a second time seems safe. There are no error messages, nor do the file's contents change (the MD5 checksum stays the same).

Note that unless you create an explicit "reverse patch", there is no way to easily undo the application of the patch.
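The "reverse patch" mentioned above could be produced with the same rsync options by swapping the two file arguments, as long as this is done before abc is modified. The following is only a sketch under assumptions: the file name abc-undo is made up for illustration, the two tiny stand-in files replace the real data, and -I (--ignore-times) is added so that a quick test with files of equal size and timestamp is not skipped.

```shell
# Sketch: create a forward and a reverse batch before touching abc.
# Assumes rsync is installed; abc-undo is a hypothetical file name.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Two small stand-in files of equal size that differ in a few bytes.
printf 'header-AAAA-payload-0123456789\n' > abc
printf 'header-BBBB-payload-0123456789\n' > abc2
cp abc abc.orig                     # kept only to check the result

if command -v rsync >/dev/null 2>&1; then
    # -I (--ignore-times) stops rsync from skipping files whose size
    # and modification time happen to match, as they can in a test.
    rsync -I --only-write-batch=abc-diff abc2 abc   # abc  -> abc2
    rsync -I --only-write-batch=abc-undo abc abc2   # abc2 -> abc

    rsync --read-batch=abc-diff abc   # abc now has the contents of abc2
    cmp abc abc2

    rsync --read-batch=abc-undo abc   # abc is back to its original contents
    cmp abc abc.orig
else
    echo "rsync not found, skipping" >&2
fi
```

Both batch files have to be written while the original abc still exists, because abc-undo is computed from it.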

I also tested writing the 2.3 MB modification to some other place in the abc2 data, a bit further in (at about 50 MB), as well as at the start. The generated "patch" was 4.6 MB large, suggesting that only the modified bits were stored in the patch.
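A test pair like the one described here can be built with standard tools. This is a scaled-down sketch, not the exact procedure used for the numbers above: it assumes a 1 MiB file and a 64 KiB overwrite at offset 512 KiB, both arbitrary choices.

```shell
# Sketch: build a small abc/abc2 pair that differs in one region.
# Sizes are scaled down (1 MiB file, 64 KiB change at offset 512 KiB).
set -eu
workdir=$(mktemp -d)
cd "$workdir"

head -c 1048576 /dev/urandom > abc      # 1 MiB of random data
cp abc abc2

# Overwrite 64 blocks of 1 KiB starting at block 512, leaving the rest
# of abc2 intact (conv=notrunc prevents dd from truncating the file).
dd if=/dev/urandom of=abc2 bs=1024 seek=512 count=64 conv=notrunc 2>/dev/null

wc -c abc abc2                          # both files keep the same size

# Count the byte positions at which the two files differ.
changed=$(cmp -l abc abc2 | wc -l)
echo "bytes that differ: $changed"
```

Because the replacement data is random, a few of the overwritten bytes may coincide with the old ones, so the count is at most 65536 rather than exactly that.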

Thanks a lot @Kusalananda, it's great! PS: rsync --read-batch=abc-diff ${1:-abc} (automatically generated .sh script) gave remote destination is not allowed with --read-batch rsync error: syntax or usage error (code 1) at main.c(1326) [Receiver=3.1.2], but rsync --read-batch=abc-diff abc worked successfully. What is the difference between these two similar commands? — Basj, Commented Feb 3, 2020 at 19:21
2/2 Is there a way to take abc as input and apply the patch abc-diff with --read-batch without modifying abc "in-place", writing the output to a new file abc3 instead? (if possible all with rsync, without piping, so that it will work easily on Linux as well as Windows which also has rsync.exe available) — Basj, Commented Feb 3, 2020 at 19:30
@Basj The commands would do different things if $1 had a value. ${1:-abc} means "use the first positional parameter ($1) unless it's empty or undefined. In the case that it's empty or undefined, use abc instead". I'm assuming that $1 had a value when you tried it, possibly something that it interpreted as a remote destination address. — Kusalananda, Commented Feb 3, 2020 at 20:34
@Basj I'm not entirely sure that this is possible, but I'll have a look tomorrow after sleep. — Kusalananda, Commented Feb 3, 2020 at 20:36
Thanks for your answer about ${1:-abc}. It probably failed because I tried it on Windows (I'm using rsync both on Linux for my distant server, and Windows locally). But it's perfect since rsync --read-batch=abc-diff abc works :) — Basj, Commented Feb 3, 2020 at 20:38
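The behaviour of ${1:-abc} discussed in these comments can be tried in isolation. The helper name pick_target below is hypothetical; it only prints what the generated script would pass to rsync as the target.

```shell
# Mimic how the generated abc-diff.sh picks its target file.
# pick_target is a hypothetical helper used only for illustration.
pick_target() {
    printf '%s\n' "${1:-abc}"
}

pick_target          # no argument    -> abc
pick_target ""       # empty argument -> abc
pick_target abc3     # explicit name  -> abc3
```

By the same substitution, sh abc-diff.sh abc3 would run rsync --read-batch=abc-diff abc3, so the script can be pointed at a copy of abc instead of the original.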

Kaz · Answer · 2020-02-04 07:50:20Z

How to compute a binary diff of abc and abc2?

Using bsdiff/bspatch or xdelta and others.

$ bsdiff older newer patch.bin     # patch.bin is created
[...]
$ bspatch older newer patch.bin    # newer is created

However, these admonishments from the man pages are to be noted:

bsdiff uses memory equal to 17 times the size of oldfile, and requires an absolute minimum working set size of 8 times the size of oldfile.
bspatch uses memory equal to the size of oldfile plus the size of newfile, but can tolerate a very small working set without a dramatic loss of performance.
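To put those factors into numbers for the 4 GB file from the question, here is a small arithmetic sketch. It only multiplies the figures quoted from the man page; bsdiff_memory_mib is a hypothetical helper and sizes are taken in MiB.

```shell
# Rough memory estimate for bsdiff, using the factors quoted from the
# man page: 17x the old file's size, with an 8x minimum working set.
# bsdiff_memory_mib is a hypothetical helper; sizes are in MiB.
bsdiff_memory_mib() {
    old_mib=$1
    echo "peak: $((old_mib * 17)) MiB, minimum working set: $((old_mib * 8)) MiB"
}

bsdiff_memory_mib 4096   # the 4 GB file from the question
# prints: peak: 69632 MiB, minimum working set: 32768 MiB
```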

edited Feb 4, 2020 at 7:50

answered Feb 4, 2020 at 2:18

Kaz

Could you possibly show an example?
– Kusalananda ♦
Commented Feb 4, 2020 at 6:32
Thank you for your answer. bsdiff uses memory equal to 17 times the size of oldfile so this won't usually work for 4GB files (at least on my 8GB RAM machine).
– Basj
Commented Feb 4, 2020 at 8:20
@Basj What is possible is to chop up the 4GB file into smaller ones (say 128MB each), and do individual deltas. This could be wrapped into a script. chopped-bsdiff: chop the files, do pairwise bsdiffs, tar those up into an archive. chopped-bspatch: read pairwise patches from archive, apply to chunks of input file, catenate output.
– Kaz
Commented Feb 4, 2020 at 16:28
@Kaz I see, but I'm looking more for a ready-to-use tool that can be called in 1 line (mydiff abc abc2 > patchfile and mypatch abc patchfile > abc3) regardless of the size. Also, if I chop into 128 MB chunks, what happens if the first 1GB of abc == the last (trailing) 1GB of abc2? When we compare abc-first128mb with abc2-first128mb, no match will be found, so it might not be efficient?
– Basj
Commented Feb 4, 2020 at 17:41
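The "chop" step from this exchange can be sketched with split and cmp alone. This is not the chopped-bsdiff script itself: it only finds which fixed-size chunk pairs differ and shows that the chunks reassemble correctly. The chunk size is reduced to 4 bytes for the demo, and the place where a per-chunk bsdiff call would go is marked in a comment.

```shell
# Sketch of the "chop" step only: split both files into fixed-size
# chunks and list the chunk pairs that differ. The chunk size is 4
# bytes for the demo; the comment above suggests 128 MB.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'AAAABBBBCCCCDDDD' > abc
printf 'AAAABBxBCCCCDDDD' > abc2

split -b 4 abc  abc.part.     # abc.part.aa, abc.part.ab, ...
split -b 4 abc2 abc2.part.

changed=""
for old in abc.part.*; do
    suffix=${old#abc.part.}
    new="abc2.part.$suffix"
    # A real script would run bsdiff on each differing pair here.
    if ! cmp -s "$old" "$new"; then
        changed="$changed $suffix"
    fi
done
echo "changed chunks:$changed"

# Reassembly: the suffixes sort in order, so cat restores the file.
cat abc2.part.* > abc2.rejoined
cmp abc2 abc2.rejoined
```

As the last comment points out, this only works well when unchanged data stays at the same offset. Inserting a single byte near the start of abc2 would shift every later chunk and make all of them compare as different.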

Community · Answer · 2020-06-11 14:16:50Z

Have you tried just forcing diff to treat the files as text:

diff -ua abc abc2

As explained here.

-u output NUM (default 3) lines of unified context
-a treat all files as text

This should get you a patch. The downside of this is the 'lines' could be quite long and that could bloat the patch.

edited Jun 11, 2020 at 14:16

CommunityBot

answered Feb 3, 2020 at 16:47

user1794469

Oops, yeah you don't actually want the n. I'm interested to know if it works as I'm not sure how long the "lines" will be.
– user1794469
Commented Feb 3, 2020 at 23:21
Thanks for your comment! I created two very similar 256 MB files abc and abc2. Then I tried diff -ua abc abc2 > patch, then I copied abc to abc3 and I tried to recover abc2 thanks to abc3 and patch : patch abc3 < patch, but it did not work: at the end abc3 was only 1KB instead of 256 MB. Any idea?
– Basj
Commented Feb 3, 2020 at 23:39
Hmmm, not sure what happened. I just did it on my machine and it worked better than I had expected. I took a 382M file that was random integers written out in binary to a file. I changed 3 bytes in it and did the diff and patch and it worked. The resulting files were md5sum equal.
– user1794469
Commented Feb 4, 2020 at 1:56
If a big file has no byte 0x0a, i.e. newline, or very few, I suspect it wouldn't work so well, it would be interesting to test.
– Basj
Commented Feb 4, 2020 at 8:18
Oh for sure. You can make an educated guess on a binary with wc -l, which will look for line breaks and in my experience runs very quickly. I would expect that on an arbitrary binary it would work pretty well. For example, on my machine I found a 252M mp4 that had 1.2 million "lines", and a 59M .deb that had about 230k, so average "lines" of less than 220 bytes and 258 bytes respectively. I don't see why these files would be that different than others, but you could definitely get unlucky. In practice I suspect that it would work pretty well, and if not it's still a fun hack.
– user1794469
Commented Feb 4, 2020 at 21:33
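The wc -l estimate from the last comment can be wrapped in a few lines of shell. avg_line_length is a hypothetical helper name, and the sample file is a tiny stand-in rather than a real binary.

```shell
# Estimate the average "line" length diff would see in a binary file.
# avg_line_length is a hypothetical helper name.
avg_line_length() {
    size=$(wc -c < "$1")
    lines=$(wc -l < "$1")
    # A file without any newline is one single line of $size bytes.
    if [ "$lines" -eq 0 ]; then
        lines=1
    fi
    echo $((size / lines))
}

sample=$(mktemp)
printf 'aaaa\nbb\ncccccc\n' > "$sample"   # 15 bytes, 3 newlines
avg_line_length "$sample"                 # prints 5
```

A result close to the file size means the file contains almost no 0x0a bytes, which is the case where a line-based diff is expected to produce a bloated patch.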

vonbrand · Answer · 2020-02-04 23:32:38Z

Use xdelta; it was created for exactly this type of use. Recent versions are based on VCDIFF (RFC 3284).
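A minimal sketch of how this might look with the xdelta3 command-line tool, assuming it is installed under that name. The patch file name abc-abc2.vcdiff and the two tiny stand-in files are made up for the demo.

```shell
# Sketch: create and apply a delta with xdelta3 (assumed to be installed).
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Two small stand-in files; replace with the real abc and abc2.
printf 'header-AAAA-payload-0123456789\n' > abc
printf 'header-BBBB-payload-0123456789\n' > abc2

if command -v xdelta3 >/dev/null 2>&1; then
    # Encode: source abc + target abc2 -> patch file.
    xdelta3 -e -s abc abc2 abc-abc2.vcdiff
    # Decode: source abc + patch file -> reconstructed abc3.
    xdelta3 -d -s abc abc-abc2.vcdiff abc3
    cmp abc2 abc3
else
    echo "xdelta3 not found, skipping" >&2
fi
```

Unlike the rsync batch approach, the original abc is left untouched and the result is written to a separate file.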

edited Feb 4, 2020 at 23:32

answered Feb 4, 2020 at 13:53

vonbrand

The link is non-working (is there another URL?). Also could you add an example in a few lines to show how to: 1) compute the diff patch file, and 2) restore abc2, given only abc and patch?
– Basj
Commented Feb 4, 2020 at 14:00
Sorry, fixed URL
– vonbrand
Commented Feb 4, 2020 at 23:33
Thanks @vonbrand. Would you have such an example?
– Basj
Commented Feb 5, 2020 at 7:50

Basj · Answer · 2020-02-07 21:26:12Z

Complements to other answers according to my tests:

With `diff`

I created two very similar 256 MB files abc and abc2. Then let's create the diff file:

diff -ua abc abc2 > abc-abc2.diff

Now let's try to recover abc2 thanks to the original abc file and abc-abc2.diff:

cp abc abc3
patch abc3 < abc-abc2.diff

or

cp abc abc3
patch abc3 -i abc-abc2.diff

or

patch abc -i abc-abc2.diff -o abc3

It works on Linux. I also tried on Windows (patch.exe and diff.exe are available too), but for an unknown reason it failed: the produced abc3 file is only 1KB instead of 256MB (I'll update this answer later here).

With `rsync`

As detailed in the accepted answer, this works:

rsync --only-write-batch=abc-abc2-diff abc2 abc

cp abc abc3

rsync --read-batch=abc-abc2-diff abc3

With `rdiff`

As detailed in this answer, this is a solution too:

rdiff signature abc abc-signature
rdiff delta abc-signature abc2 abc-abc2-delta

rdiff patch abc abc-abc2-delta abc3

Tested also on Windows with rdiff.exe from here and it works.

I'm guessing that patch failed on Windows because it was reading the input file in "text" mode, which signals end-of-file when it encounters a CONTROL-Z (byte 0x1A) in the input file. This is a legacy mode from early DOS days when the directory did not record the length of the file and so the file-length was computed based on the number of 512-byte sectors. If you can tell patch to open the file in binary mode, it shouldn't have this error. — Adrian Pronk, Commented Feb 29, 2020 at 21:06
