
For backup purposes, I transferred a very big binary file over a connection with comparatively slow upstream (the transfer took 2 weeks) by rsyncing it onto a mounted CIFS share (so I could, and still can, access it block-wise). After the 2 weeks, rsync showed an error (unfortunately I couldn't save it), but the file sizes matched. Also,

tail -c 1000000000 myfile.img | md5sum
head -c 1000000000 myfile.img | md5sum

match, so the beginning and end of the file are identical.

Since my downstream is much faster, I downloaded the full image again and computed MD5 sums over the whole thing, and those do NOT match. So apparently, somewhere in those 1.5 TB, there is at least one bit that differs.

Is there a way to generate a "patch" from the two files I have locally and then apply it to the remote file, so that only the wrong blocks have to be transferred again?

Please note again: I do NOT have the ability to execute code remotely or to use rsync features that require running rsync on the remote side. I guess I could still use rsync, and it would work on the order of magnitude of my download rate, but I wonder whether there is a better way that makes use of the fact that I have both versions locally. It would probably not be that hard to write something up myself, but I would prefer to use something tested and save the work.
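To illustrate, something like this untested sketch (file names and block size are illustrative) is what I have in mind for finding the differing blocks locally:

```shell
# cmp -l lists the 1-based offset of every differing byte;
# map each offset to a 0-based 1 MiB block index and deduplicate.
cmp -l copy1.img copy2.img | awk '{ print int(($1 - 1) / 1048576) }' | uniq
```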

  • I just saw an answer here that suggested bsdiff; I cannot see it anymore. I actually looked at it, and it says it runs in O((n+m) log n). Since my files have the same size, and apparently large portions are identical, I feel this should be possible in O(n): run once over the first file, look at the corresponding bit in the other, and write down whether to change it and to what.
    – mcandril
    Commented Jun 16, 2016 at 11:57
  • Now about bsdiff: the 200 MHz Pentium Pro mentioned on their page would need 9375 h for my 1.5 TB. My system isn't that slow, but it's also not a modern Core i7, so I would probably still get into the time range of a re-download, which I should also be able to achieve with rsync, using this: blog.christophersmart.com/2014/01/15/…. The other suggestion I cannot remember.
    – mcandril
    Commented Jun 16, 2016 at 12:00

2 Answers


(Assuming Linux.) If you believe just a block or so of data is corrupted, but no data was inserted or removed (so the offsets still line up), then you can use cmp -l. It compares the files byte by byte, and with -l it prints the offset of every difference. If you have a vague idea of where to start within the files, you can give an initial offset with -i. Once you have the offsets of the errors, you can use dd skip=... to extract the good bytes from the correct file and dd seek=... conv=notrunc to paste them into the broken file. (Test on a copy first.)
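A minimal sketch of this on throwaway demo files (all names illustrative; note that cmp's offsets are 1-based, while dd's skip/seek count from 0):

```shell
# Two 10-byte demo files that differ at byte 5 (1-based)
printf 'AAAAAAAAAA' > good.img
printf 'AAAAXAAAAA' > broken.img

# cmp -l lists each differing byte: its offset, then the octal
# values in the first and second file (here octal 101 'A' vs 130 'X')
cmp -l good.img broken.img

# Patch that byte: bs=1 makes skip/seek count bytes; they are
# 0-based, hence offset-1. conv=notrunc keeps the rest of the
# target file intact instead of truncating it.
dd if=good.img of=broken.img bs=1 skip=4 seek=4 count=1 conv=notrunc 2>/dev/null

cmp -s good.img broken.img && echo repaired
```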

  • Awesome, exactly what I am looking for!
    – mcandril
    Commented Jun 16, 2016 at 15:40

I would use BitTorrent to recover the file on the remote side. The protocol divides a file into small blocks and automatically re-downloads blocks whose hashes do not match the seed file.

To get it to work in a private setting:

  1. Disable DHT on the local and remote BitTorrent clients.
  2. Open the local BitTorrent ports on the firewall, or set up SSH port forwarding.
  3. Create a torrent file on the source side. Do not use a tracker. Make sure the client starts seeding the file as well.
  4. Back up the file on the remote side.
  5. Copy the torrent file to the remote side and open it with the client.
  6. Point the download location at the corrupted file and choose the option NOT to start the download! Also disable options to connect to DHT, peer exchange, etc., if available.
  7. Ask the client to recheck the downloaded file. It should report a download percentage that is almost complete.
  8. Add the local client as a peer to the download.
  9. Start the download.
  • Thanks, but as I said: I cannot run code remotely. That also means there cannot be a remote BitTorrent client. The only things I have are protocols like SCP (but NOT SSH; I cannot even get checksums calculated on the remote side), SFTP, CIFS, and WebDAV. Potentially messing things up is not a huge problem, however, since the remote storage supports snapshots.
    – mcandril
    Commented Jun 16, 2016 at 13:42
  • If you have SCP/CIFS/WebDAV access, you can mount these as local file systems and use BitTorrent as above. It would be extra slow, though... An intermediate solution would be to do this from a computer with a fast connection to the remote side, e.g. an AWS/VPS-by-the-hour provider close to the remote server.
    – billc.cn
    Commented Jun 16, 2016 at 17:03
  • Yes, but in that case I do not see how rsync wouldn't be much more straightforward. I actually do have a server with fast access to that storage, but then I would still use rsync. I should have thought of that for the initial transfer. Anyway, meuh's proposal is exactly what I want, and I cannot imagine how anything could work faster: it is O(n) locally and then transfers only the wrong bytes.
    – mcandril
    Commented Jun 17, 2016 at 7:38
