sort introduces blocking here: it has to wait until find has completed before outputting its results. find on a large filesystem, especially on an HDD or NFS, may take a while. You may like to sort at the very end to allow md5sum to run in parallel with find, e.g.:
find ./vendor -type f -print0 | xargs -0 md5sum | grep -vf /usr/local/bin/vchecker_ignore | sort -k2 > MD5sums
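As a small sketch of why the position of sort matters: sort cannot print anything until its input is closed, whereas a streaming filter passes lines on as they arrive. The function name slow_producer, the sample words and the one-second delay are made up for illustration; the function just stands in for a slow find.

```shell
# Hypothetical slow producer, standing in for find on a large filesystem.
slow_producer() {
    for name in cherry apple banana; do
        echo "$name"
        sleep 1
    done
}

# cat forwards each line as soon as it arrives: one line per second.
slow_producer | cat

# sort stays silent for about three seconds, then emits everything at once.
sorted=$(slow_producer | sort)
echo "$sorted"
```

Anything placed after sort in a pipeline sits idle during that wait, which is why moving sort to the end lets md5sum start working while find is still walking the tree.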
md5sum may take some time for large files. You may like to run it with GNU parallel instead of xargs if there are many files or the files are large.
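A rough sketch of what that could look like, assuming GNU parallel is installed (it falls back to xargs -P otherwise). The temporary sample tree, the fallback batch size of 50 and the process count of 4 are arbitrary choices for illustration, and the grep filter from the original command is left out to keep the sketch self-contained.

```shell
# Hypothetical sample tree so the sketch can be run anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/vendor/lib"
printf 'one\n' > "$demo/vendor/a.txt"
printf 'two\n' > "$demo/vendor/lib/b.txt"
cd "$demo"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU parallel: -0 reads NUL-delimited names,
    # -X packs as many names as fit into each md5sum call.
    find ./vendor -type f -print0 | parallel -0 -X md5sum | sort -k2 > MD5sums
else
    # Fallback: -P 4 lets xargs run up to four md5sum processes at once.
    find ./vendor -type f -print0 | xargs -0 -n 50 -P 4 md5sum | sort -k2 > MD5sums
fi

cat MD5sums
```

Because the final sort orders the lines by filename, it does not matter in which order the parallel md5sum jobs finish.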
You may also like to play with line-buffered mode. For line buffering to work, filenames have to be newline-delimited instead of NUL-delimited (that prohibits newline characters in filenames, which would be rather unusual anyway). E.g.:
stdbuf -oL find ./vendor -type f | stdbuf -oL grep -vf /usr/local/bin/vchecker_ignore | xargs -n50 -d'\n' md5sum | sort -k2 > MD5sums
The above command filters each filename through grep first and then executes md5sum on batches of 50 files. For small files you may like larger batches (and maybe remove both stdbuf -oL completely); for large files, smaller ones.
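To get a feel for the batching, here is a minimal sketch with a batch size of 2 instead of 50, assuming GNU xargs (the -d option is a GNU extension). The filenames are invented, and a tiny sh -c command stands in for md5sum so that each batch is visible.

```shell
# Five made-up filenames, one per line; one of them contains a space.
# The sh -c stand-in reports how many arguments each batch received.
batches=$(printf '%s\n' a.txt b.txt 'c d.txt' e.txt f.txt |
    xargs -n2 -d'\n' sh -c 'echo "$# files: $*"' sh)
echo "$batches"
```

With GNU xargs this prints three batches: "2 files: a.txt b.txt", "2 files: c d.txt e.txt" and "1 files: f.txt". The name with a space stays a single argument because only newlines separate the names.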
they are much better than starting a process for each input line.
What do you mean? How does it make sense to run sort on each input line (for example)? I think you are mixing something up.