I work with large image datasets containing millions of images, and I often need to compress the output of each processing step for upload as a backup.
I have seen datasets distributed as a set of .zip files that can each be unzipped independently into the same folder to reconstruct one consistent dataset. This is convenient because it lets me pipeline the download -> decompress -> delete-archive process, which is more efficient in both time and storage space, as the following (arbitrary) times and sizes illustrate:
- With a single 100 GB .zip, say downloading takes 5 minutes and decompressing takes 10 minutes, so I need 15 minutes to get all my data. Assuming a 50% compression ratio, peak disk usage is 100 + 200 = 300 GB (the archive plus the extracted data).
- With two 50 GB .zip files, say downloading each takes 2.5 minutes and decompressing each takes 5 minutes. I can spend 2.5 minutes downloading zip1, then decompress zip1 (5 minutes) while simultaneously downloading zip2 (2.5 minutes), delete zip1, and finally decompress zip2 in 5 minutes, for a total of 2.5 + 5 + 5 = 12.5 minutes. At most zip2, folder1, and folder2 are on disk at the same time, so peak usage is 50 + 100 + 100 = 250 GB.
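The arithmetic in the two cases above can be checked with a quick sketch (all figures are the arbitrary times and sizes from the example, not measurements):

```python
# Single 100 GB archive: download and decompression cannot overlap.
dl_single, dec_single = 5, 10                 # minutes
total_single = dl_single + dec_single         # 15 minutes end to end
peak_single = 100 + 200                       # archive + extracted data, GB

# Two 50 GB archives: downloading zip2 overlaps with decompressing zip1.
dl_part, dec_part = 2.5, 5                    # minutes per part
total_split = dl_part + max(dec_part, dl_part) + dec_part  # 12.5 minutes
peak_split = 50 + 100 + 100                   # zip2 + folder1 + folder2, GB

print(total_single, peak_single)              # 15 300
print(total_split, peak_split)                # 12.5 250
```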
These time and space savings grow as the number of separate zip files increases. I am therefore looking for a way to do this.
My requirements are as follows:
- The method can work on any folder structure, no matter how deep
- Compression results in .zip files of roughly equal size
- All resulting archives can be decompressed independently to reconstruct part of the folder (sometimes I may want to use only part of the dataset for tests, in which case I don't want to have to decompress the entire dataset)
- Optional:
- The method should be able to show a progress bar
- The method is fast and efficient
I think I would be able to write a bash or python script that fits the first few requirements, but I doubt it would be fast enough.
I am aware of the -s switch in zip and the -v switch in 7z, but both require the user to have all parts of the archive to decompress any of it, which is much less desirable.