I am dealing with the csv produced by the concatenation (via cat) of several CSVs:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig868, 1, 0.933209, -5.6100, 42
1000, lig278, 1, 0.933689, -5.7600, 40
1000, lig619, 3, 0.946354, -7.6100, 20
1000, lig211, 1, 0.960048, -5.2800, 39
1000, lig40, 2, 0.971051, -4.9900, 40
1000, lig868, 3, 0.986384, -5.5000, 29
1000, lig12, 3, 0.988506, -6.7100, 16
1000, lig800, 16, 0.995574, -4.5300, 40
1000, lig800, 1, 0.999935, -5.7900, 22
1000, lig619, 1, 1.00876, -7.9000, 3
1000, lig619, 2, 1.02254, -7.6400, 1
1000, lig12, 1, 1.02723, -6.8600, 5
1000, lig12, 2, 1.03273, -6.8100, 4
1000, lig211, 2, 1.03722, -5.2000, 19
1000, lig211, 3, 1.03738, -5.0400, 21
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 2, 0.695981, -6.8200, 91
10V1, lig278, 1, 0.764432, -7.0900, 70
10V1, lig868, 1, 0.787698, -7.3100, 62
10V1, lig211, 1, 0.83416, -6.8800, 54
10V1, lig868, 3, 0.888408, -6.4700, 44
10V1, lig278, 2, 0.915932, -6.6600, 35
10V1, lig12, 1, 0.922741, -9.3600, 19
10V1, lig12, 8, 0.934144, -7.4600, 24
10V1, lig40, 2, 0.949955, -5.9000, 34
10V1, lig800, 5, 0.964194, -5.9200, 30
10V1, lig868, 2, 0.966243, -6.9100, 20
10V1, lig12, 2, 0.972575, -8.3000, 10
10V1, lig619, 6, 0.979168, -8.1600, 9
10V1, lig619, 4, 0.986202, -8.7800, 5
10V1, lig800, 2, 0.989599, -6.2400, 20
10V1, lig619, 1, 0.989725, -9.2900, 3
10V1, lig12, 7, 0.991535, -7.5800, 9
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V2, lig40, 1, 0.525767, -6.4600, 146
10V2, lig211, 2, 0.744702, -6.8200, 78
10V2, lig278, 1, 0.749015, -7.0900, 74
10V2, lig868, 1, 0.772025, -7.3100, 66
10V2, lig211, 1, 0.799829, -6.8700, 63
10V2, lig12, 1, 0.899345, -9.1600, 25
10V2, lig12, 4, 0.899606, -7.5500, 32
10V2, lig868, 3, 0.903364, -6.4800, 40
10V2, lig278, 3, 0.913145, -6.6300, 36
10V2, lig800, 5, 0.94576, -5.9100, 35
To post-process this CSV I need 1) to remove repetitions of the header line
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
keeping the header only at the beginning of the fused CSV (on the first line!)
Then I need to sort all the lines (ignoring the 1st line header) according to the numbers in the 4th (dG(rescored)) column.
To accomplish the first task I tried the following awk one-liner, which looks for the 1st line and then removes its repeats:
awk '{first=$1;gsub("ID(Prot)","");print first,$0}' mycsv.csv > csv_without_repeats.csv
However, it did not recognize the header line, meaning that the pattern was not defined correctly.
Then, to sort the data according to the values in the 4th column, I used sort:
LC_ALL=C sort -k4,4g
How could this be piped to my awk code, or otherwise could everything be accomplished directly in awk?
E.g. I've tried
awk '{first=$1;gsub(/ID(Prot)?(\([-azA-Z]+\))?/,"");print first,$0}' | LC_ALL=C sort -k4,4g input.csv > sorted_and_without_repeats.csv
but the script never terminated, although it correctly produced a sorted CSV (still with the repeated headers, due to the problem in the awk part).
awk 'NR==1 {header=$0;print} "foo"$0!="foo"header {print}'
Replace the last print with code designed to process non-header lines. "foo" is added so that even a numerical header will be treated as a string (see this). The advantage is that you don't need to know anything about what the header looks like, so this approach does not need to be adjusted every time you need it.
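To also cover the sorting step, the same header comparison can be combined with sort. What follows is only a sketch under a few assumptions: the file name sample.csv and its six lines are a made-up excerpt of the data in the question, the comma is passed to sort as the field separator with -t, and the -g option from the question is kept (it is provided by GNU and BSD sort, not required by POSIX). head prints the header once, awk drops the first line and every later copy of it, and only the remaining data lines go through sort.

```shell
# Hypothetical sample file built from a few lines of the question's data.
cat > sample.csv <<'EOF'
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig619, 1, 1.00876, -7.9000, 3
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 2, 0.695981, -6.8200, 91
EOF

{
    # Emit the header exactly once, unsorted.
    head -n 1 sample.csv
    # Remember line 1 as the header and skip it; print every later line
    # that differs from it. The output is sorted by the 4th comma-separated
    # field, compared as a general number.
    awk 'NR == 1 { header = $0; next } "foo" $0 != "foo" header' sample.csv |
        LC_ALL=C sort -t, -k4,4g
} > sorted_and_without_repeats.csv
```

Two notes on the attempts in the question. In gsub("ID(Prot)","") the parentheses are regex grouping operators, so the pattern matches the text IDProt rather than the literal header, and gsub would in any case only delete the matched text, not the whole line. In the piped attempt, awk was given no input file, so it waits on standard input, while sort reads input.csv directly and ignores whatever awk sends through the pipe.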