IMPORTANT NOTE: Compression is NOT the goal; archiving/tarring (packing all of the files into a single archive) is the goal.
I want to back up a single directory which contains hundreds of sub-directories and millions of small files (< 800 KB each). When using rsync to copy these files from one machine to a remote machine, I have noticed that the transfer speed is painfully low, only around 1 MB/s, whereas when I copy huge files (e.g. 500 GB) the transfer rate is around 120 MB/s. So the network connection is not the problem whatsoever.
As things stand, moving only 200 GB of such small files has taken me about 40 hours. So I am thinking of packing the entire directory into a single archive, transferring the archive to the remote machine, and then unpacking it there. I am not expecting this approach to cut 40 hours down to 5, but I suspect it would definitely take less than 40 hours.
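One way to avoid even the intermediate archive file is to stream the tar output straight through a pipe to the receiver, so packing and transfer overlap. Below is a minimal sketch of the idea; the directory names `bigdir` and `dest` are made up, and the real over-the-network form (shown in the comment) would use ssh with your own host and destination path. Here the "remote" side is simulated by a second local tar so the pipeline can be run anywhere.

```shell
#!/bin/sh
# Stream a tar archive through a pipe instead of writing it to disk first.
# Over the network the receiving end would be ssh, e.g.:
#   tar -cf - bigdir | ssh user@remote 'tar -xf - -C /backup'
# (hostname and destination path above are placeholders).
set -e

# Build a tiny sample tree standing in for the real directory.
mkdir -p bigdir/sub dest
printf 'one' > bigdir/sub/a.txt
printf 'two' > bigdir/b.txt

# Sender writes tar to stdout; "remote" side extracts from stdin.
tar -cf - bigdir | (cd dest && tar -xf -)

# Verify the copy is identical to the original.
diff -r bigdir dest/bigdir
```

Because nothing is compressed, the pipe moves data as fast as tar can read the small files, and the single stream avoids rsync's per-file round trips.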
I have access to a cluster node with two 14-core CPUs (56 threads in total; Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz) and 128 GB RAM. Therefore, CPU/RAM power is not a problem.
But what is the fastest and most efficient way to create a single archive out of so many files? I currently only know about these approaches:
- the traditional tar.gz approach
- 7zip
- pigz (parallel gzip - https://zlib.net/pigz/)
However, I do not know which of these is fastest, or how their parameters should be tuned to achieve maximum speed (for example, is it better to use all CPU cores with 7zip, or just one?).
N.B. File size and compression rate do NOT matter at all. I am NOT trying to save space at all. I am only trying to create a single archive out of so many files so that the rate of transfer will be 120 MB/s instead of 1 MB/s.
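Since only packing matters and not size, plain tar with no compression flag is the cheapest of the listed options; it just concatenates the files. A minimal sketch (the directory name `smallfiles` is made up):

```shell
#!/bin/sh
set -e

# Sample tree standing in for the millions of small files.
mkdir -p smallfiles
for i in 1 2 3; do printf 'data %s' "$i" > "smallfiles/f$i.txt"; done

# Pure packing: no -z/-j, so no compression cost at all.
tar -cf smallfiles.tar smallfiles

# If some compression is wanted anyway, pigz keeps all cores busy, e.g.:
#   tar -cf - smallfiles | pigz -p 56 > smallfiles.tar.gz

# List the archive contents to confirm everything was packed.
tar -tf smallfiles.tar
```

With compression disabled, the bottleneck is disk I/O on the small files rather than CPU, so extra cores would not speed up the plain-tar variant.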
RELATED: How to make 7-Zip faster
COMMENTS:
- gzip then tar: by gzipping many files separately you can compress multiple files at the same time (up to 1 per CPU thread). This would divide the compression time by up to 56 in your case. You can use any other compression method than gzip.
- Can't tar do the job?
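The "gzip then tar" suggestion above can be sketched with xargs running many gzip processes at once; this assumes GNU/BSD find and xargs (for `-print0`/`-0` and `-P`), and the directory name `many` is made up. Four parallel jobs are used here for the demo; on the 56-thread machine you would pass `-P 56`.

```shell
#!/bin/sh
set -e

# Sample tree standing in for the real directory.
mkdir -p many/sub
for i in 1 2 3 4; do printf 'payload %s' "$i" > "many/sub/f$i"; done

# Compress every file individually, up to 4 gzip processes in parallel
# (-P 4); each xargs invocation hands one file to gzip (-n 1).
find many -type f -print0 | xargs -0 -P 4 -n 1 gzip

# Now tar the already-compressed files into a single archive.
tar -cf many.tar many
```

Note that this rewrites the source files in place as `.gz` (gzip's default behavior), so it trades source-directory cleanliness for parallelism; plain uncompressed tar is simpler if compression truly does not matter.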