IMPORTANT NOTE: Compression is NOT the goal; archiving/tarring (packing all of the files into a single archive) is the goal.
I want to back up a single directory which contains hundreds of sub-directories and millions of small files (< 800 KB each). When using rsync to copy these files from one machine to a remote machine, I have noticed that the transfer speed is painfully low, only around 1 MB/s, whereas when I copy huge files (e.g. 500 GB) the transfer rate is around 120 MB/s. So the network connection is not the problem whatsoever.
As things stand, moving only 200 GB of such small files has taken me about 40 hours. So I am thinking of packing the entire directory into a single archive, transferring the archive to the remote machine, and then unpacking it there. I am not expecting this approach to cut 40 hours down to 5, but I suspect it would definitely take less than 40 hours.
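One way to avoid even the intermediate archive file is to stream the tar output straight through a pipe to the receiver, so packing and transfer overlap. Below is a minimal sketch of the idea; the directory names `bigdir` and `dest` are made up, and the real over-the-network form (shown in the comment) would use ssh with your own host and destination path. Here the "remote" side is simulated by a second local tar so the pipeline can be run anywhere.

```shell
#!/bin/sh
# Stream a tar archive through a pipe instead of writing it to disk first.
# Over the network the receiving end would be ssh, e.g.:
#   tar -cf - bigdir | ssh user@remote 'tar -xf - -C /backup'
# (hostname and destination path above are placeholders).
set -e

# Build a tiny sample tree standing in for the real directory.
mkdir -p bigdir/sub dest
printf 'one' > bigdir/sub/a.txt
printf 'two' > bigdir/b.txt

# Sender writes tar to stdout; "remote" side extracts from stdin.
tar -cf - bigdir | (cd dest && tar -xf -)

# Verify the copy is identical to the original.
diff -r bigdir dest/bigdir
```

Because nothing is compressed, the pipe moves data as fast as tar can read the small files, and the single stream avoids rsync's per-file round trips.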
I have access to a cluster node with two 14-core CPUs (56 threads in total; Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz) and 128 GB RAM. Therefore, CPU/RAM power is not a problem.
But what is the fastest and most efficient way to create a single archive out of so many files? I currently only know about these approaches:
- the traditional tar.gz approach
- 7zip
- pigz (parallel gzip - https://zlib.net/pigz/)
However, I do not know which of these is fastest, or how their parameters should be tuned to achieve maximum speed (for example, is it better to use all CPU cores with 7zip, or just one?).
N.B. File size and compression rate do NOT matter at all. I am NOT trying to save space at all. I am only trying to create a single archive out of so many files so that the rate of transfer will be 120 MB/s instead of 1 MB/s.
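Since only packing matters and not size, plain tar with no compression flag is the cheapest of the listed options; it just concatenates the files. A minimal sketch (the directory name `smallfiles` is made up):

```shell
#!/bin/sh
set -e

# Sample tree standing in for the millions of small files.
mkdir -p smallfiles
for i in 1 2 3; do printf 'data %s' "$i" > "smallfiles/f$i.txt"; done

# Pure packing: no -z/-j, so no compression cost at all.
tar -cf smallfiles.tar smallfiles

# If some compression is wanted anyway, pigz keeps all cores busy, e.g.:
#   tar -cf - smallfiles | pigz -p 56 > smallfiles.tar.gz

# List the archive contents to confirm everything was packed.
tar -tf smallfiles.tar
```

With compression disabled, the bottleneck is disk I/O on the small files rather than CPU, so extra cores would not speed up the plain-tar variant.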
RELATED: How to make 7-Zip faster
COMMENTS:
- gzip then tar: by gzipping many files separately you can compress multiple files at the same time (up to 1 per CPU thread). This would divide the compression time by up to 56 in your case. You can use any other compression method than gzip.
- Can't tar do the job?
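The "gzip then tar" suggestion above can be sketched with xargs running many gzip processes at once; this assumes GNU/BSD find and xargs (for `-print0`/`-0` and `-P`), and the directory name `many` is made up. Four parallel jobs are used here for the demo; on the 56-thread machine you would pass `-P 56`.

```shell
#!/bin/sh
set -e

# Sample tree standing in for the real directory.
mkdir -p many/sub
for i in 1 2 3 4; do printf 'payload %s' "$i" > "many/sub/f$i"; done

# Compress every file individually, up to 4 gzip processes in parallel
# (-P 4); each xargs invocation hands one file to gzip (-n 1).
find many -type f -print0 | xargs -0 -P 4 -n 1 gzip

# Now tar the already-compressed files into a single archive.
tar -cf many.tar many
```

Note that this rewrites the source files in place as `.gz` (gzip's default behavior), so it trades source-directory cleanliness for parallelism; plain uncompressed tar is simpler if compression truly does not matter.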