0

I have the command:

awk 'BEGIN{print "Name, Number"}/value/{print FILENAME, "," $8}' *.txt >> out.csv

This works perfectly to go through my txt files in the directory, parse the value(s), and write the final csv file with the header (Name, Number).

My issue is that I have "too many" files, and so I modify it with find and xargs:

find ./ -maxdepth 1 -type f -name '*.txt' | xargs awk 'BEGIN{print "Name, Number"}/value/{print FILENAME, "," $8}' | sed 's/\.\///g' >> out.csv

This has worked in the past, but now I find that, on occasion, the header is written more than once to the final csv file. I don't know why. It does seem to be related to the total number of txt files in the directory, such that once I hit a certain number this happens, but I am not really sure.

thanks.

1
  • 1
    xargs runs the awk command as many times as necessary to avoid the "too many" problem; the BEGIN block is executed on each of those runs. Commented Oct 3, 2021 at 1:51
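The batching the comment describes is easy to reproduce: force xargs into tiny batches with -n and the BEGIN block fires once per invocation (a toy sketch; the file names, contents, and -n 2 are made up for illustration):

```shell
# Create five throwaway files, then run awk through xargs in batches of two.
# 5 files / 2 per batch = 3 awk invocations = 3 copies of the header.
dir=$(mktemp -d)
cd "$dir"
for i in 1 2 3 4 5; do echo "some value here" > "file$i.txt"; done

printf '%s\n' *.txt | xargs -n 2 awk 'BEGIN{print "Name, Number"} {print FILENAME}'
```

With the real command there is no -n, but once the argument list exceeds the system limit, xargs splits it into batches anyway, and each batch prints its own header.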

2 Answers

3

The find/xargs combination will call awk in batches of files, so the BEGIN block will be executed once per batch instead of once for all files as you want. Instead of having the shell call awk with all the files as arguments and fail with a "too many arguments" error, you can have awk read the list of files as input and populate its internal array of files to read (ARGV[]) from that:

find ./ -maxdepth 1 -type f -name '*.txt' |
awk '
    BEGIN { OFS=","; print "Name", "Number" }
    NR==FNR { ARGV[ARGC++]=$0; next }
    /value/ { print substr(FILENAME,3), $8 }
' - > out.csv

I also tidied up a couple of things in the awk script and got rid of the pipe to sed as you never need sed when you're using awk. I changed >> to > as I assume you want to create the output file from scratch whenever the above command is called rather than appending to it.

The above assumes none of your file names contain newlines. If they do, then use GNU tools and add -print0 to the end of the find command and RS="\0"; to the BEGIN section of the awk command. It also assumes your file names don't contain ", as then the output wouldn't be valid CSV. But your first script, which you said works perfectly apart from the "too many arguments" issue, would fail if your file names contained any of those, so they must not.
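As a concrete sketch of that GNU-tools variant (assumes gawk; note the added ENDFILE rule, needed because RS="\0" set only in BEGIN would otherwise also apply to the *.txt data files themselves and read each one as a single record):

```shell
# NUL-terminate the file list so names with newlines survive, then let
# gawk read that list from stdin ("-") and queue the files via ARGV[].
find ./ -maxdepth 1 -type f -name '*.txt' -print0 |
gawk '
    BEGIN   { RS="\0"; OFS=","; print "Name", "Number" }
    NR==FNR { ARGV[ARGC++]=$0; next }   # stdin: collect the file names
    ENDFILE { RS="\n" }                 # fires after each file; resets RS before the .txt files are read
    /value/ { print substr(FILENAME,3), $8 }
' - > out.csv
```

ENDFILE is gawk-specific, which is consistent with the "use GNU tools" caveat above.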

3
  • It also assumes file names don't contain the = character, and as the output is meant to be in CSV format, you'd need to wrap the filenames in "..." and escape the "s in them as "". Commented Oct 3, 2021 at 15:27
  • That's true but the script that works for the OP (aside from "too many arguments") already has those assumptions so we know their file names don't have those issues while we don't know for sure that they don't contain newlines. I added a note about that though.
    – Ed Morton
    Commented Oct 3, 2021 at 15:29
  • @StéphaneChazelas I also added a fix for the case of a file name that includes = by not stripping the ./ till the output is occurring.
    – Ed Morton
    Commented Oct 3, 2021 at 17:56
0

Run find and awk in a group command (i.e. wrapped in { and }) or in a sub-shell (i.e. wrapped in ( and )) and print the header before running find. Redirect the output from the entire group command or sub-shell to your output file.

For example:

{
  echo "Name,Number"
  find ./ -maxdepth 1 -type f -name '*.txt' -exec \
    awk -v OFS=, '
      FNR==1 { fn = FILENAME; sub(/^\.\//, "", fn) }
      /value/ { print fn, $8 }' {} +
} >> out.csv
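The sub-shell form mentioned above works the same way; the practical difference is that ( ) runs in a forked child shell while { } runs in the current one (same hypothetical file layout as the group-command example):

```shell
(
  echo "Name,Number"
  find ./ -maxdepth 1 -type f -name '*.txt' -exec \
    awk -v OFS=, '
      FNR==1 { fn = FILENAME; sub(/^\.\//, "", fn) }
      /value/ { print fn, $8 }' {} +
) >> out.csv
```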

NOTES:

  1. See man bash and search for Compound Commands
  2. You don't need xargs here - use find's -exec option instead, e.g. find ... -exec awk ... {} +.
  3. You don't need sed, either: awk's built-in sub() function can be used to remove the leading ./ from the filenames coming from find. BTW, gsub() can be used for global search and replace, like the /g modifier in sed.
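A quick illustration of note 3, using a made-up path string:

```shell
# sub() replaces only the first match; gsub() replaces every match,
# like sed's /g flag.
echo './a/./b' | awk '{ fn = $0; sub(/^\.\//, "", fn); print fn }'   # -> a/./b
echo './a/./b' | awk '{ fn = $0; gsub(/\.\//, "", fn); print fn }'   # -> a/b
```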
5
  • He's got too many files and yet you recommend -exec instead of xargs? Why? Are you aware of the extra overhead of calling a new instance of awk for every single file?
    – D4RIO
    Commented Oct 4, 2021 at 1:39
  • @D4RIO you need to learn the difference between using -exec with {} \; and using -exec with {} +. Are you aware that xargs is over-used by people who don't understand find? or of the bugs that will be introduced if you use xargs on filenames without using NUL as the separator? or that the only good reason to ever use xargs with find is if you need to first sort the filenames (with sort -z) or eliminate (or modify) some of them in ways that find itself can't do (e.g. with sed -z, perl -0, or awk -v RS='\0', etc)?
    – cas
    Commented Oct 4, 2021 at 3:14
  • No, I wasn't aware of {} + behaving that way. Curiously it's not mentioned on the manpage (this would help avoid over-use of xargs). About the bugs, -print0 is not that unknown.
    – D4RIO
    Commented Oct 4, 2021 at 4:11
  • It's mentioned several times in the man page for GNU find (search for -exec) and for BSD find. I can't find a POSIX find man page right now, but the HISTORY section of the GNU man page says it occurs in POSIX.
    – cas
    Commented Oct 7, 2021 at 7:08
  • You are right, it's just omitted in the Spanish manpage for GNU's find. Maybe I will send a patch to fix that. Thank you.
    – D4RIO
    Commented Oct 7, 2021 at 12:38
