
The find manpage states:

   -exec command {} +
          This variant of the -exec action runs the specified command on the selected files,
          but the command line is built by appending each selected file name at the end;
          the total number of invocations of the  command  will  be
          much  less than the number of matched files.

I always thought that this would cause find to execute command exactly once. Is there a way to know how many times command gets called?

Note this is important: if it is only one invocation, as I had thought, then there is a danger of building an argument list too large for command to handle; but if find ends up splitting the invocations (somewhat similar to parallel), then this would be mitigated.

  • I don't think this is answerable in a quantifiable way. The number of invocations is going to be inversely dependent on the length of each path argument. Commented Sep 15, 2020 at 22:15
  • The + is the extension that makes -exec run a list in a similar way to xargs. With a plain {}, the command does run once per filename. My Linux understands getconf ARG_MAX and tells me 2097152. Your command could also be a script that chunks the initial arg-list and calls the final command multiple times with a more sensible arg-list, and could also handle Begin and End conditions (see the sketch after these comments). Commented Sep 16, 2020 at 8:05
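As a rough sketch of that chunking idea (cmd here is a placeholder for the real command, the batch size of 1000 is arbitrary, and xargs -0 is a common GNU/BSD extension):

#!/bin/sh
# Hypothetical chunking wrapper: hand the pathnames received from find to
# cmd (a placeholder for the real command) in batches of at most 1000,
# instead of a single huge invocation.
[ "$#" -gt 0 ] || exit 0            # nothing to do
printf '%s\0' "$@" | xargs -0 -n 1000 cmd

find would then call the wrapper in place of cmd, e.g. find . -exec /path/to/wrapper {} +.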

4 Answers

3

The buffer used depends on the find version and seems to be about 256 KB on the SuSE box I have available here.

So to calculate how many times command gets invoked, you would need to know the length of each found filepath; the number of calls is then (approximately) the sum of all the path lengths, each increased by one for the separating space, divided by the buffer size (less the length of the command itself).

E.g. if you find 20,000 files with an average path length of 200 bytes, that is 4,020,000 bytes; divided by 256 KB that gives 15.33, so you'd need about 16 calls.

The exact calculation would be slightly more complex, to account for not splitting a filepath across two consecutive calls, but this gives you a ballpark figure.
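A rough way to get that ballpark figure from the shell, as a sketch: the 262144 assumes the ~256 KB buffer mentioned above (adjust it for your find build), the extra byte per pathname stands in for the separating space, and the length of the command itself is ignored:

find . -print | awk '
    BEGIN { buf = 262144 }                # assumed command-line buffer size
    { total += length($0) + 1 }           # pathname plus one separator
    END { printf "~%d invocation(s)\n", int((total + buf - 1) / buf) }'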

See here for a thread (with source code) where the size is reported to be 32 KB and deemed unnecessarily low (now that I think of it, maybe my own find is using the syslimits; I haven't experimented); coreutils's version, by inference, appears to be four times that, i.e. 128 KB.

2

The limit will depend on find(1)'s buffers, and what the command handles (kernel-dependent). Unless the very last percent of performance is critical, the defaults on your system should be fine.

If you are worrying about performance, consider the whole system that does this, and measure where the bottlenecks are. Chances are you'll be very surprised by your findings. Bentley, in his delicious "Writing Efficient Programs" (Prentice-Hall, 1982), sadly long out of print, shares several stories of careful "optimizations" that made essentially unused, fatally buggy code "faster", or that optimized an operating system's idle loop after measuring that it took up a substantial fraction of the computer's time. People are notoriously bad at guessing where inefficiencies lie. Besides, it pays off much more to work on the higher levels (system architecture, overall organization, algorithms and data structures) than on details.

2

Preliminary note: the manual and your question use command to denote the command, but since POSIX defines a utility literally named command, my answer will use cmmnd.


If you want to actually run cmmnd(s) and just count the number of invocations (to know it after find finishes), then create a wrapper that does something you can count (e.g. prints to stderr, prints to a logfile, beeps) and finally runs the cmmnd. Example:

#!/bin/sh
# log one line per invocation to stderr, then run the real command
echo "invoking cmmnd" >&2
cmmnd "$@"

Then use the wrapper instead of the cmmnd inside find.

Note that find will use /absolute/path/to/wrapper while building command lines that are not too long; then the wrapper will use /absolute/path/to/cmmnd. If the latter is longer, some command line(s) containing it may turn out to be too long anyway. So this approach is not as straightforward as we might wish. You can compensate by extending the former path, supplying it to find verbatim with additional slashes (e.g. /absolute/path/to/////wrapper).


Now I assume you want to know the number before you decide to run cmmnd(s), e.g. in a case where calling the cmmnd twice is a bad thing (for whatever reason) and you want to make sure find will run it exactly once.

The above wrapper with cmmnd "$@" commented out can be used, as sketched just below. There are a few other ideas as well (in the end not so different though).
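For instance, the count-only variant of the wrapper:

#!/bin/sh
# count-only variant: report the invocation but do not run the real command
echo "invoking cmmnd" >&2
# cmmnd "$@"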

Let's assume you want to do this:

find . -exec cmmnd … {} +

(where … denotes constant arguments). Find out what the absolute path to cmmnd really is; e.g. it can be /bin/cmmnd. Then run something like this:

find . -exec /aaa/zzzzz … {} +

where /aaa/zzzzz is a nonexistent command whose name is of the same length as /bin/cmmnd. Now find will build command lines with /aaa/zzzzz that will be of the same length as command lines with /bin/cmmnd would be. You will get

find: '/aaa/zzzzz': No such file or directory

one or more times. Count them to get the number you want. This simple approach:

find . -exec /aaa/zzzzz … {} + 2>&1 | wc -l

is not the best, because find may also print other messages (e.g. Permission denied) for some files it encounters, which would inflate the count. But if you create /aaa/zzzzz as a valid executable that prints exactly one line (it can be an empty line), then this should work:

find . -exec /aaa/zzzzz … {} + | wc -l

Another improvement is to name the tool /a (instead of /aaa/zzzzz), and call it as /////a or /////////////////a etc., depending on the length you need. Example:

find . -exec /////////a … {} + | wc -l

For completeness, this is what a may look like:

#!/bin/sh
# print exactly one (empty) line per invocation, so wc -l counts the calls
echo

It's almost like our wrapper without cmmnd "$@", except that it uses stdout.

Notes:

  • The exact number of / characters is not critical; being off by a few will not change the result drastically. If an estimate is enough, you can blindly use ///////////a or so, unless the path to the cmmnd is unusually long. Note that using exactly /a will give you a lower bound.

  • In practice you often have other tests before -exec cmmnd … {} +. If you replace cmmnd with /////////a or so, the other tests will still be performed. You should not omit them, because they decide which pathnames reach -exec in the first place. But if the tests do or change something, performing them without the cmmnd may be wrong.

    E.g. you may want to delete files with -delete -exec cmmnd … {} +, where cmmnd generates a report on files that have been deleted. In this case using /////////a will delete files without generating any report. So think before you act.

  • Make sure tests/actions/whatever other than -exec /////////a … {} + print nothing to stdout. Or let /a use some other channel.

  • Processing the given directory tree(s) and performing (other) tests may take a while even without cmmnd(s).

0

Well, the standard text says:

The size of any set of two or more pathnames shall be limited such that execution of the utility does not cause the system's {ARG_MAX} limit to be exceeded.

So it shouldn't build an argument list too large to execute. That would defeat the point of a feature like this.

How many invocations it does exactly is up to the implementation, and is probably something you shouldn't care too much about. The standard does promise that invocations of the same -exec clause don't overlap, which may be relevant for correctness if the executed command has outside state.

However, on Linux, the actual maximum size of the command line arguments is based on the stack size, and can be indirectly changed with ulimit -s. And it appears that unlike e.g. xargs, the find on my Debian and Ubuntu doesn't actually check the limit at runtime, so it's theoretically possible to hit problems.
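In the demonstration below, args.sh is only a stand-in; a minimal version (its exact contents are an assumption, since they don't matter here) just reports how many pathnames each invocation received:

#!/bin/sh
# hypothetical args.sh: print how many arguments this invocation received
echo "$#"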

$ mkdir bar
$ touch bar/{00000..99999}
$ ulimit -Ss 512
$ getconf ARG_MAX
131072
$ find bar -type f -exec sh ./args.sh {} +
find: ‘sh’: Argument list too long
find: ‘sh’: Argument list too long
...

However, the default for ulimit -s is 8192 (i.e. 8 MB), so you're not likely to hit that problem except on a very constrained system.
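For reference, on a typical Linux system with that default stack limit, the two values look like this (exact figures vary between systems):

$ ulimit -s
8192
$ getconf ARG_MAX
2097152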

