
I use the following bash command to find files and run md5sum on them on my local system. In my opinion this command performs badly on large vendor directories. Is there a better, higher-performance style than chaining pipe after pipe?

find ./vendor -type f -print0 | sort -z | xargs -0 md5sum | grep -vf /usr/local/bin/vchecker_ignore > MD5sums
  • I don't think pipes are too costly; they are much better than starting a process for each input line. Commented Aug 22, 2018 at 16:08
  • Also, exclude the undesired files (those listed in vchecker_ignore) in the very first step instead of processing them first only to throw them away later. Commented Aug 22, 2018 at 16:24
  • "they are much better than starting a process for each input line" – what do you mean? How does it make sense to run sort on each input line (for example)? I think you are mixing something up.
    – hek2mgl
    Commented Aug 22, 2018 at 21:17
  • Pipes themselves are no performance problem at all. The only problem here is that you calculate checksums for files which you filter out anyway afterwards.
    – hek2mgl
    Commented Aug 22, 2018 at 21:33

1 Answer


sort introduces blocking here: it has to wait until find completes before outputting its results. find on a large filesystem, especially on an HDD or NFS, may take a while.

You may like to sort at the very end to allow md5sum to run in parallel with find, e.g.:

find ./vendor -type f -print0 | xargs -0 md5sum | grep -vf /usr/local/bin/vchecker_ignore | sort -k2 > MD5sums

md5sum may take some time for large files. You may like to run it with GNU parallel instead of xargs if there are many files or the files are large.
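A sketch of that variant (assuming GNU parallel is installed; `-0` matches find's `-print0`, and `-m` fills each md5sum invocation with as many filenames as fit, like xargs does):

```shell
# Checksum files in parallel across CPU cores; sort the combined
# output at the end, since parallel jobs may finish out of order.
find ./vendor -type f -print0 \
  | parallel -0 -m md5sum \
  | grep -vf /usr/local/bin/vchecker_ignore \
  | sort -k2 > MD5sums
```

Because the jobs run concurrently, the output order is nondeterministic, which is another reason to sort at the very end rather than at the start.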


You may also like to play with line-buffered mode. For line-buffered mode to work, this requires newline delimiters for filenames instead of the 0-delimiter (which prohibits newline characters in filenames, but those would be rather unusual). E.g.:

stdbuf -oL find ./vendor -type f | stdbuf -oL grep -vf /usr/local/bin/vchecker_ignore | xargs -n50 -d'\n' md5sum | sort -k2 > MD5sums

The above command filters each filename through that grep first and only then executes md5sum, on batches of 50 files. For small files you may like larger batches (and maybe remove both stdbuf -oL invocations completely); for large files, smaller ones.
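For instance, a sketch of the many-small-files case, with the stdbuf calls dropped and a larger batch size (same hypothetical paths as in the question):

```shell
# Filter unwanted files first, then checksum in batches of 500;
# -d'\n' makes xargs split on newlines only (GNU xargs).
find ./vendor -type f \
  | grep -vf /usr/local/bin/vchecker_ignore \
  | xargs -n500 -d'\n' md5sum \
  | sort -k2 > MD5sums
```

A side benefit of this output format: it is exactly what `md5sum -c MD5sums` expects, so the tree can be re-verified against the recorded checksums later.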

  • One could sort after the md5sum call (by column 2, the file name output of md5sum); this would ideally allow the hash computation to start while the search is still running; one is disk, the other one is CPU, so there may be a real benefit. Perhaps one should use unbuffer or stdbuf (see unix.stackexchange.com/questions/25372/…) to let md5sum start immediately. Commented Aug 22, 2018 at 16:19
  • @PeterA.Schneider Yep, line-buffered find into parallel md5sum and then collect, filter and sort md5sum outputs. Commented Aug 22, 2018 at 16:27
  • 1
    @PeterA.Schneider Thinking more about buffering, the command uses 0-separators, so that line-buffered mode won't affect it. Commented Aug 22, 2018 at 19:56
  • @MaximEgorushkin You can use a buffer size of 0 instead of line buffered if you want that.
    – hek2mgl
    Commented Aug 22, 2018 at 21:23
  • 1
    @hek2mgl I wouldn't suggest to use xargs like that. newlines in filenames would break it. - if you have newlines in filenames you may have a bigger problem. Commented Aug 22, 2018 at 23:21