5

I need to make a zip file archiving ~100k files from a directory containing ~500k files. I get "argument list too long" errors when I try the obvious commands:

zip archive.zip *pattern*.txt                        # fails
zip archive.zip `find . -name "*pattern*.txt"`       # fails

One approach is to use the -@ option to feed a list of files in via stdin:

find . -name "*pattern*.txt" | zip -@ archive.zip

However, the zip man page says:

If a file list is specified as -@ [Not on MacOS], zip takes the list of input files from standard input instead of from the command line.

It's the "Not on MacOS" that is bugging me. I went ahead and tried the -@ option, and it seems to work; but I'm feeling nervous about whether it's really doing the right job (archiving all the files, intact).

Here are my questions:

  1. Why would -@ not be OK on MacOS?
  2. Are there some versions of MacOS/bash/zip where this warning is true, and others where it is not? Is this an obsolete warning, and if so, where is the dividing line?
  3. What would be a viable approach for this problem without using -@?

Note that the solution given in "zip: Argument list too long (80.000 files in overall)" will not work here; I need to archive some, not all, of the files in the directory.

I'm running Mac OS 10.7.5. Here is some version info:

$ bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11)
$ zip --version
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
...
Compiled with gcc 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00) for Unix (Mac OS X) on Jun 24 2011.

3 Answers

8

First of all,

zip archive.zip `find . -name "*pattern*.txt"`

is never a good idea. Filenames can contain spaces, newline characters, parts that could be interpreted as switches, and whatnot.

To perform an action for every found file, you can use the -exec switch or xargs.

find . -name "*pattern*.txt" -exec zip archive.zip {} +

will add the matching files to the zip file. Here, {} stands for the found file names.

Terminating the -exec argument with a + instead of a ; causes find to pass several files at once (as many as it can without triggering the error you're getting) rather than running zip once per file, which should be considerably faster for a large number of files.
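To make the difference between ; and + concrete, here is a minimal sketch that counts how often find runs the command in each case. The directory demo_dir and the sample file names are made up for the illustration, and a small sh -c stub stands in for zip.

```shell
# Sketch: count how often find runs the command with ';' versus '+'.
# demo_dir and the sample file names are hypothetical.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    : > "$demo_dir/pattern_$i.txt"
done

# With ';' the command runs once per file, so the stub prints one line per file.
runs_semicolon=$(find "$demo_dir" -name "*pattern*.txt" \
    -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# With '+' find appends as many names as fit on one command line,
# so five short names need only a single run.
runs_plus=$(find "$demo_dir" -name "*pattern*.txt" \
    -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "runs with ';': $runs_semicolon"
echo "runs with '+': $runs_plus"
rm -rf "$demo_dir"
```

With ~100k files, find would still need more than one run with +, but far fewer than one per file.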

find . -name "*pattern*.txt" -print0 | xargs -0 zip archive.zip

does essentially the same. xargs processes several files at once by default.

The -print0 switch to find and -0 switch to xargs make them use null characters as file separators to deal properly with strange filenames.
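As a rough illustration of why the null separators matter, the sketch below creates two awkwardly named files (hypothetical names in a throwaway directory) and compares what a line-based reader sees with what xargs -0 receives. A small sh -c stub stands in for zip and just reports its argument count.

```shell
# Sketch: line-based versus null-delimited file lists.
# demo_dir and both file names are hypothetical.
demo_dir=$(mktemp -d)
: > "$demo_dir/my pattern.txt"          # name containing a space
: > "$demo_dir/odd
pattern.txt"                            # name containing a newline

# Newline-delimited output: the second name is spread over two lines,
# so a reader that expects one name per line sees three entries.
lines=$(find "$demo_dir" -name "*pattern*.txt" | wc -l | tr -d ' ')

# Null-delimited output: each name arrives as exactly one argument.
args=$(find "$demo_dir" -name "*pattern*.txt" -print0 |
    xargs -0 sh -c 'echo "$#"' sh)

echo "lines seen by a line-based reader: $lines"
echo "arguments seen via -print0 / -0:   $args"
rm -rf "$demo_dir"
```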

I don't know why the -@ isn't recommended for Mac OS [1], but find ... | zip -@ will not handle strange filenames (specifically, filenames containing newline characters) properly. This is true regardless of the operating system.

[1] I'm guessing this applies only to Mac OS up to version 9.x, since classic Mac OS used carriage returns as newline characters, while zip -@ expects linefeeds.

3

Dennis was right, it's an OS 9 thing. I took a look at the source code for Zip 3.0. In the macos/ platform directory, there is a note that says:

This port is for Mac versions before Mac OS X. As Mac OS X is build on Unix, use the Unix port for Mac OS X. - 7 June 2008

In addition, the zip.c file wraps the declaration of the command line option in #ifndef MACOS. In other words, if I were running the "MacOS" port of zip, the -@ option would simply fail.

Dennis also provided the answer to "a viable way to accomplish the task without -@", namely,

find . -name "*pattern*.txt" -print0 | xargs -0 zip archive.zip

I agree that this is the best way to proceed in order to be safe against "weird" filenames (filenames with spaces, newlines, etc). However, there is a performance penalty. xargs will call zip multiple times, with a big set of filenames passed as command line parameters each time. zip will add those files into archive.zip on each invocation. But zip will need to read the ever-larger archive.zip on each invocation, which takes more and more time as the job progresses.
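For what it's worth, the repeated invocations can be seen on a small scale. This is only a sketch: the work directory and file names are invented, and -n 2 is there purely to force several zip runs on a tiny sample (normally xargs picks the batch size itself). Each run updates the same archive, so the end result should still contain every file.

```shell
# Sketch: several zip runs from xargs accumulate into one archive.
# work and the sample file names are hypothetical.
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir "$work/src"
    for i in 1 2 3 4 5; do
        echo "data $i" > "$work/src/pattern_$i.txt"
    done

    # -n 2 caps each zip run at two names: five files means three runs,
    # and runs two and three must reopen the archive written so far.
    find "$work/src" -name "*pattern*.txt" -print0 |
        xargs -0 -n 2 zip -q "$work/archive.zip"

    # unzip -Z1 lists the stored names, one per line.
    batched_entries=$(unzip -Z1 "$work/archive.zip" | wc -l | tr -d ' ')
    echo "entries after three zip runs: $batched_entries"
    rm -rf "$work"
else
    echo "zip or unzip not found; skipping" >&2
fi
```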

If you know for sure that the file names in question have no pathological characters like spaces or newlines, then the single-pass

find . -name "*pattern*.txt" | zip -@ archive.zip

will be faster; and it works just fine on OS X, because zip on OS X is actually the Unix port. The warning in the manpage doesn't apply.

0

As your version information shows, the base code (and thus presumably the documentation) is quite old, and MacOS has changed quite a bit in the meantime. Also, the build is much newer than the base code, so there might be changes to the code or configuration for the build that just never made it into the documentation.

In any case, better check (with a small example perhaps) that the command works, and really stores the files it is asked to. If it is important, don't believe colored squares with missing pieces on random Internet sites...
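One possible way to run such a check, sketched on a throwaway directory with made-up file names: count what find selects, build the archive with -@, let unzip test the stored data, and compare the counts. The same comparison should carry over to the real directory, assuming the names contain no newlines.

```shell
# Sketch: verify that zip -@ stored everything it was asked to.
# work and the sample file names are hypothetical.
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir "$work/src"
    for i in 1 2 3; do
        echo "data $i" > "$work/src/pattern_$i.txt"
    done
    echo "skip me" > "$work/src/other.txt"   # must not end up in the archive

    # Number of files the pattern selects.
    expected=$(find "$work/src" -name "*pattern*.txt" | wc -l | tr -d ' ')

    # Single pass: file names go to zip on stdin.
    find "$work/src" -name "*pattern*.txt" | zip -q -@ "$work/archive.zip"

    # unzip -t checks every stored member against its CRC.
    unzip -tq "$work/archive.zip"

    # Number of entries actually stored.
    stored=$(unzip -Z1 "$work/archive.zip" | wc -l | tr -d ' ')

    echo "selected: $expected, stored: $stored"
    rm -rf "$work"
else
    echo "zip or unzip not found; skipping" >&2
fi
```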
