I have a file with millions of entries in a column, so I'm using awk, which is the fastest tool I know for this kind of calculation. I need to calculate the mean of the values in a column, and I have done it this way:

allsamples="R3 SM261_T SM382_T R6"

for sample in $allsamples
do
awk 'BEGIN {print "ID","Coverage"}; {sum+=$2} END { print "Average = ",sum/NR}' $sample.dep > $sample.mean_coverage.temp >> All_samples_coverage.txt
done

The script works correctly and prints the headers I need, but I also need to print the filename next to the mean value.

I have tried this:

awk 'BEGIN {print "ID","Coverage"}; {print FILENAME} {sum+=$2} END {print "Average = ",sum/NR}'

but it prints the filename for every line of the input file (so if R3.dep has 60 million lines, it prints the filename 60 million times and then the average).

Example file would be:

Locus   Total_Depth Average_Depth_sample    Depth_for_R3
chr1:10001  4   4.00    4
chr1:10002  5   5.00    5
chr1:10003  7   7.00    7
chr1:10004  9   9.00    9

What I get is:

ID Coverage
R3.txt
R3.txt
R3.txt
R3.txt
R3.txt
Average =  5

What I would need is:

ID Coverage
R3.txt Average =  5

Any suggestions as to what I'm doing wrong?

  • Print the filename in the END block.
    – Fravadona
    Commented Nov 14, 2022 at 12:20
  • I tried this but it gave me a syntax error. I wrote: END {print "Average = ",sum/NR %% print FILENAME}'. Then I changed it to END {print "Average = ",sum/NR} END {print FILENAME} and I think it worked fine. I still need to verify it on the real files; so far I have only tested it on a test file.
    – Gf.Ena
    Commented Nov 14, 2022 at 13:37
  • I don't know what %% represents, but statements are separated by semicolons in Awk: END {print "Average = ",sum/NR; print FILENAME}. Or you can print the name and the average on the same line, like this: END {print FILENAME, "Average = ",sum/NR}
    Commented Nov 14, 2022 at 13:51
  • I expect %% was supposed to be "%"; to print a % character then end the statement before the next print.
    – Ed Morton
    Commented Nov 14, 2022 at 16:27
  • I'm sorry, it was simply a typo: %% was meant to be &&.
    – Gf.Ena
    Commented Nov 18, 2022 at 9:30
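
To make the suggestion in these comments concrete, here is a minimal sketch. It assumes a small test file named R3.dep built from the example rows in the question; the temporary directory and the `result` variable are hypothetical and exist only for the demonstration. An action without a pattern runs once per input line, while END runs once after the last line, so FILENAME placed there is printed a single time.

```shell
# Build a tiny test file (hypothetical data copied from the question's example).
tmpdir=$(mktemp -d)
printf '%s\n' \
    'Locus Total_Depth Average_Depth_sample Depth_for_R3' \
    'chr1:10001 4 4.00 4' \
    'chr1:10002 5 5.00 5' \
    'chr1:10003 7 7.00 7' \
    'chr1:10004 9 9.00 9' > "$tmpdir/R3.dep"

# Run awk from inside the directory so FILENAME is the bare file name.
# {sum+=$2} runs for every line; the END block runs once, after the last line.
result=$(cd "$tmpdir" && awk '{ sum += $2 } END { print FILENAME, "Average = ", sum / NR }' R3.dep)

echo "$result"
```

Inside END, FILENAME holds the name of the last input file processed, which is why this only gives a per-file result when awk is called on one file at a time, as the loop in the question does.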

1 Answer

From what you describe, I believe the header should not be part of the AWK statement at all. A simple bash echo before the loop is enough, since the header is shared by all the files. I would also make the "Average" label part of that header and remove it from the printf command shown below.

Your AWK statement should then become:

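awk 'BEGIN{
    sum=0 ;    # optional: awk variables already start out as 0
}{
    sum+=$2 ;  # runs once for every input line
}END{
    # Variant that keeps the label on every row:
    #printf("%10s:  Average = %s\n", FILENAME, sum/NR ) ;
    printf("%10s:  %s\n", FILENAME, sum/NR ) ;
}' "$sample.dep"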
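
Putting the pieces together, the whole job might look like the sketch below. It is only an illustration under stated assumptions: the sample names, the two-row test data, and the `workdir` and `out` variables are hypothetical stand-ins for the real files. It also assumes each .dep file starts with exactly one header line and skips it with `FNR > 1`, which is a change from the command above: dividing by NR counts the header row too, so in the question's example the result is 25/5 = 5, whereas the four data rows alone average 6.25.

```shell
# Hypothetical test setup: two small .dep files in a temporary directory.
workdir=$(mktemp -d)
out="$workdir/All_samples_coverage.txt"
(
    cd "$workdir" || exit 1
    for sample in R3 R6; do
        printf '%s\n' 'Locus Total_Depth' 'chr1:10001 4' 'chr1:10002 8' > "$sample.dep"
    done

    allsamples="R3 R6"

    # Write the shared header once, before the loop.
    echo "ID Coverage" > "$out"

    for sample in $allsamples; do
        # FNR > 1 skips the header row (assumption: one header line per file).
        # n counts the data rows, so the mean is taken over data rows only.
        awk 'FNR > 1 { sum += $2; n++ }
             END { if (n) printf("%s %s\n", FILENAME, sum / n) }' "$sample.dep" >> "$out"
    done
)

cat "$out"
```

If the header row really should be counted, as in the original command, drop the `FNR > 1` pattern and divide by NR instead of n.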
