5

I have a few million .jpg files and I want to generate a .jpg.webp version for each one (foo.jpg -> foo.jpg.webp). For this, I need to find all files ending in .jpg for which there’s no .jpg.webp version.

Right now, I do it like this:

find "$path" -type f -iname "*.jpg" |
  while read -r image_path; do
      if [ ! -f "$image_path.webp" ]; then
        echo "$image_path"
      fi
  done |
  # treat only 10000 files per run
  head -n 10000 |
  ...

However, because I’m using a pipe, this creates a subshell. I wonder if there is a more efficient way to do this, especially because the more WebP images I generate, the more time the script spends filtering paths to find candidates. Would there be some way to do this using find only?

I’m using Ubuntu 20.04. Files are spread across subdirectories.

2
  • 3
    It's only one subshell total, not one subshell per file. That doesn't mean it's efficient (bash's read operates one character at a time), but it's a whole lot less inefficient than more naive approaches would be. Commented Aug 29, 2023 at 17:00
  • 3
    (that said, you'd have correctness benefits from using -print0 on the find command, and making it while IFS= read -r -d '' image_path; do) Commented Aug 29, 2023 at 17:02
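Applying both of those suggestions to the loop in the question might look like this (a sketch only: the printf '%s\0' and head -z parts are added here to keep the output NUL-delimited downstream, and head -z is a GNU coreutils extension):

find "$path" -type f -iname "*.jpg" -print0 |
  while IFS= read -r -d '' image_path; do
      if [ ! -f "$image_path.webp" ]; then
        printf '%s\0' "$image_path"
      fi
  done |
  # treat only 10000 files per run
  head -z -n 10000 |
  ...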

6 Answers

8

I'd do the following:

  1. Find all suffixed files (i.e., *.jpg.webp) and put them in a sorted list with the suffix stripped;
  2. Find all files without the suffix (i.e., *.jpg) and put them in a second sorted list;
  3. Compare the two lists, removing the entries that appear in the first list;
  4. Run your conversion on the "set difference" list that results.

So, something like

#!/bin/bash
comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| parallel -0 gm convert '{}' '{}.webp'

assuming you're using GraphicsMagick gm for conversion (in my experience, speed- and reliability-wise much preferable to ImageMagick's convert), and assuming you have GNU parallel installed (if not, xargs -0 would work just as well).
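If GNU parallel isn't available, the same pipeline could end in xargs instead (a sketch, assuming GNU xargs for -0/-P/-I; running $(nproc) jobs at once is an arbitrary choice):

comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| xargs -0 -P "$(nproc)" -I{} gm convert '{}' '{}.webp'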

6
  • @terdon: I do think <() is a POSIX shell construct, isn't it? Commented Aug 29, 2023 at 11:50
  • No, it isn't. Try it with sh or dash.
    – terdon
    Commented Aug 29, 2023 at 11:52
  • works with sh for me! (but that sh might be a self-limiting bash… so.) Commented Aug 29, 2023 at 11:52
  • 4
    That's because your sh is probably a symlink to bash, so it isn't sh but bash running in POSIX mode which is not an accurate representation of an actual POSIX shell. Dash fails with Syntax error: "(" unexpected.
    – terdon
    Commented Aug 29, 2023 at 11:53
  • You might consider using cwebp itself directly instead of having graphicsmagick or ImageMagick call it. Commented Aug 31, 2023 at 7:31
7

Try something like this:

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} +

This executes sh as few times as possible (depending on how many .jpg files find finds), limited by ARG_MAX (around 2 million bytes on Linux), and avoids the need for an excruciatingly slow while read ... loop by passing all the filenames as command-line arguments. See Why is using a shell loop to process text considered bad practice? and Why is looping over find's output bad practice?

To efficiently process batches of these files, I'd redirect the output to a file and then split that into batches of 10,000 (or however many you need), e.g. with split -l 10000.
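Put together, something like this would build the full candidate list once and split it into batches of 10,000 (a sketch; candidates.txt and the batch- prefix are arbitrary names):

# full list of .jpg files with no corresponding .jpg.webp
find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} + > candidates.txt

# split into files of 10000 names each: batch-aa, batch-ab, ...
split -l 10000 candidates.txt batch-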

NOTE: If any of your .jpg filenames contain newlines then you will need to use NUL as the separator between them, otherwise use newline as the separator. To use NUL separators, replace echo "$f" with printf "%s\0" "$f". BTW, split supports NUL-separated input with -t '\0'.
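A NUL-delimited version of the same idea might look like this (a sketch; -t '\0' requires GNU split, which Ubuntu 20.04 has):

# NUL-separated list, safe for filenames containing newlines
find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || printf "%s\0" "$f" ; done' find-sh {} + > candidates.nul

# split into batches of 10000 NUL-terminated records
split -t '\0' -l 10000 candidates.nul batch-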

The script that processes the batches should read in the filenames and check again that the corresponding .jpg.webp file doesn't exist (in case one was generated after the list was created) before running whatever it needs to generate the .jpg.webp version.

If you had to use NUL as the filename separator, it would be easiest to use readarray (AKA mapfile) to read the entire batch's list into an array and iterate over the array of filenames. Or use awk or perl to process the filenames.

Actually, using an array would be better than a while-read loop, even with newlines as separators.
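A batch-processing script along those lines might look like this (a sketch only: readarray -d '' needs bash >= 4.4, and convert_to_webp stands in for whatever actually does the conversion):

#!/bin/bash
# Usage: ./process-batch.sh batch-aa
# Reads one NUL-separated batch file into an array, re-checks each target,
# then runs the (hypothetical) conversion command.
readarray -t -d '' files < "$1"

for jpg in "${files[@]}"; do
  # skip if the .webp appeared after the list was generated
  [ -e "$jpg.webp" ] && continue
  convert_to_webp "$jpg" "$jpg.webp"
done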

6
  • 1
    Ideally, they'd do any processing on the files in the in-line script that find calls, and not downstream in some pipeline. If that means the in-line script becomes too complicated, then write it as a separate script that you call with a batch of pathnames from find.
    – Kusalananda
    Commented Aug 29, 2023 at 11:28
  • i'd agree with that, but they said they had millions of jpeg files and wanted to process them in batches of 10000. The find one-liner in my answer is just to generate the initial list of all .jpg files without corresponding .jpg.webp files - they can then split up that list however they want and use it as required to generate the .jpg.webp files.
    – cas
    Commented Aug 29, 2023 at 11:33
  • 1
    The OP's original code only starts a single shell already. By starting a shell per batch of files this is strictly worse. Commented Aug 29, 2023 at 17:02
  • 1
    Ah. This remains worse in terms of number of shells run (in that anything equal-to-or-greater-than 1 is worse than having a guarantee of exactly 1), but you're right that read reading from a pipe is not good. I'm curious where the breakeven point is; personally, I'd rewrite to awk or Python or such. Commented Aug 30, 2023 at 0:27
  • 1
    1. a one-liner that runs much faster and more efficiently than the original one-liner is in no way worse than the original, no matter how many shells are run. 2. if "number of shells run" is your metric, then that's a worse-than-useless metric. 3. personally, i'd use awk or perl (probably perl, something like find ... | perl -lne 'print unless -e "$_.webp"', optionally with -0 for NUL separators). python's pretty bad for jobs like this, and clumsy for one-liners.
    – cas
    Commented Aug 30, 2023 at 0:39
6

This sounds like a job for make. It will only generate the files that are missing, or have older modification time than the files they're generated from.

.PHONY: all
all: $(addsuffix .webp,$(shell find . -name '*.jpg'))

%.jpg.webp: %.jpg
    cwebp $< -o $@   #Some command that generates $@ from $<

Save this to a file named Makefile, and run make.
Or make -j $(nproc) to run as many parallel jobs as you have logical cores. Or pick an explicit number, perhaps the number of physical cores, to leave some idle logical cores for other work.

This will break if any files or subdirectories have spaces in their names.

%.jpg.webp: %.jpg is a pattern rule.

1
  • 3
    I've never tried using Make with a variable that expands to a few million words. Might be an interesting experiment... Commented Aug 30, 2023 at 8:03
4

The problem isn't the subshell per se, it's that you are using the shell to iterate over find's output and test each file. The shell is slow, so it would be faster to run find once to collect the lists of files and then do the filtering outside the shell. Something like this:

find . -name '*jpg' > jpgs
find . -name '*jpg.webp' > webps
awk 'NR==FNR{webps[$0]++; next} !($0".webp" in webps)' webps jpgs | 
  head -n 10000 | ...

Or even

awk 'NR==FNR{webps[$0]++; next} !($0".webp" in webps)' webps jpgs > target.files
## split into lists of 10000 file names each (assuming no newlines in your file names)
split -l 10000 target.files list
## Process each list
for list in list*; do
  while IFS= read -r jpg; do
    whatever "$jpg" > "$jpg".webp
  done < "$list"
done

But since you're dealing with millions of files, you probably want to use GNU parallel, assuming you have access to it, so you can run the conversions in parallel. Assuming the command you use to create the .webp is command foo.jpg > foo.jpg.webp, you would do:

find . -name '*jpg' > jpgs
find . -name '*jpg.webp' > webps
awk 'NR==FNR{webps[$0]++; next} 
    !($0".webp" in webps){ 
       printf "command \"%s\" > \"%s\".webp\n",$0,$0 
}' webps jpgs > script.sh
parallel -j 30 < script.sh

That will create the script script.sh with all the conversions you want to run, and then execute it, running 30 conversions in parallel at a time.
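If any filenames might contain characters that break the generated double quoting (" or $, say), an alternative sketch is to feed the filtered list straight to parallel and let it do its own quoting (command is still the hypothetical converter, and this still assumes no newlines in filenames):

awk 'NR==FNR{webps[$0]++; next} !($0".webp" in webps)' webps jpgs |
  parallel -j 30 'command {} > {}.webp'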

2

Would there be some way to do this using find only?

Without considering performance and timing, this is the simplest find command to perform this task:

find "$path" -type f -iname "*.jpg" ! -exec test -e '{}.webp' \; -print

It probably won't be as fast as the other answers, but it's included here as a reference.

By the way, if you're only looking for files that end with lowercase jpg, it's better to use -name (case-sensitive) instead of -iname (case-insensitive), which might be a little slower, especially with millions of files.

1
  • In terms of performance, this will have find fork/exec /bin/test once per .jpg file. That's not as bad as starting a whole shell for each file, but it's (much?) worse than a shell or perl loop over .jpg args. (The standalone /bin/test is a fairly small program that only links libc, but makes more system calls than really necessary for test -e file. But most of its system calls are startup overhead from libc and the dynamic-linker. And of course process creation and exit costs time, and the context switches on exit / wait() take time.) Commented Aug 30, 2023 at 3:15
1

If you use zsh, you can do this with a glob:

echo **/*.jpg(.e:'! [ -e "$REPLY.webp" ]':)

Or with the max count, if you want that:

echo **/*.jpg(.Y10000e:'! [ -e "$REPLY.webp" ]':)

How exactly you use it depends on what you want to do, how your generation script/binary args work, and whether you hit the max command line length, so give details for followup questions.
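For instance, the glob can drive a loop directly (a sketch: the N qualifier is added so an empty match doesn't abort the loop, and cwebp as the converter is an assumption):

# convert up to 10000 missing ones per run (zsh)
for jpg in **/*.jpg(.NY10000e:'! [ -e "$REPLY.webp" ]':); do
  cwebp "$jpg" -o "$jpg.webp"
done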

2
I don’t use zsh (I tagged the question with bash), but why would you use a glob rather than find? With Bash, the glob takes 1m44s to finish while find does it in 6s.
    – bfontaine
    Commented Aug 31, 2023 at 8:52
  • 1
    Depends how you're using it, in some cases it's simpler and cleaner. If it all fits on one line and the tool takes it, or if you need it in a for loop, you can use it like any other glob, those get used plenty. As for performance, I don't know if your test had a flaw or if zsh is just more optimized, but in my test, find took 5 minutes and the glob was a few seconds. In any case, the time for a command to run is normally dwarfed by the time to write it so how you end up doing it doesn't matter that much.
    – Kevin
    Commented Aug 31, 2023 at 15:28
