I work with large image datasets containing millions of images, and I often need to compress the output of each processing step for upload as a backup.
I have seen datasets distributed as a set of .zip files that can each be unzipped independently into the same folder to reconstruct one consistent dataset. This is convenient because it lets me pipeline the download -> decompress -> delete-archive process, which is more efficient in both time and storage space, as the following (arbitrary) times and sizes illustrate:
- With a single 100 GB .zip, say downloading takes 5 minutes and decompressing takes 10 minutes, so I need 15 minutes to get all my data. Assuming a 50% compression ratio, peak disk usage is 100 + 200 = 300 GB (the archive plus the extracted data).
- With two 50 GB .zip files, say downloading each takes 2.5 minutes and decompressing each takes 5 minutes. I can spend 2.5 minutes downloading zip1, then decompress zip1 (5 minutes) while simultaneously downloading zip2 (2.5 minutes), delete zip1, and finally decompress zip2 in 5 minutes, for a total of 2.5 + 5 + 5 = 12.5 minutes. At most zip2, folder1, and folder2 are on disk at the same time, so peak usage is 50 + 100 + 100 = 250 GB.
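The arithmetic in the two cases above can be checked with a quick sketch (all figures are the arbitrary times and sizes from the example, not measurements):

```python
# Single 100 GB archive: download and decompression cannot overlap.
dl_single, dec_single = 5, 10                 # minutes
total_single = dl_single + dec_single         # 15 minutes end to end
peak_single = 100 + 200                       # archive + extracted data, GB

# Two 50 GB archives: downloading zip2 overlaps with decompressing zip1.
dl_part, dec_part = 2.5, 5                    # minutes per part
total_split = dl_part + max(dec_part, dl_part) + dec_part  # 12.5 minutes
peak_split = 50 + 100 + 100                   # zip2 + folder1 + folder2, GB

print(total_single, peak_single)              # 15 300
print(total_split, peak_split)                # 12.5 250
```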
These time and space savings grow as the number of separate zip files increases. I am therefore looking for a way to do this.
My requirements are as follows:
- The method can work on any folder structure, no matter how deep
- Compression results in .zip files of roughly equal size
- All resulting archives can be decompressed independently to reconstruct part of the folder (sometimes I may want to use only part of the dataset for tests, in which case I don't want to have to decompress the entire dataset)
- Optional:
- The method should be able to show a progress bar
- The method is fast and efficient
I think I would be able to write a bash or python script that fits the first few requirements, but I doubt it would be fast enough.
I am aware of the -s switch in zip and the -v switch in 7z, but both require the user to have all parts of the archive to decompress any of it, which is much less desirable.