Post-processing of multi-column CSV: remove duplicate lines + sort

Question

I am dealing with a CSV produced by concatenating (via cat) several CSVs:

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
1000,   lig40,  1,  0.805136,   -5.5200,    79
1000,   lig868, 1,  0.933209,   -5.6100,    42
1000,   lig278, 1,  0.933689,   -5.7600,    40
1000,   lig619, 3,  0.946354,   -7.6100,    20
1000,   lig211, 1,  0.960048,   -5.2800,    39
1000,   lig40,  2,  0.971051,   -4.9900,    40
1000,   lig868, 3,  0.986384,   -5.5000,    29
1000,   lig12,  3,  0.988506,   -6.7100,    16
1000,   lig800, 16, 0.995574,   -4.5300,    40
1000,   lig800, 1,  0.999935,   -5.7900,    22
1000,   lig619, 1,  1.00876,    -7.9000,    3
1000,   lig619, 2,  1.02254,    -7.6400,    1
1000,   lig12,  1,  1.02723,    -6.8600,    5
1000,   lig12,  2,  1.03273,    -6.8100,    4
1000,   lig211, 2,  1.03722,    -5.2000,    19
1000,   lig211, 3,  1.03738,    -5.0400,    21
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig40,  1,  0.513472,   -6.4600,    150
10V1,   lig211, 2,  0.695981,   -6.8200,    91
10V1,   lig278, 1,  0.764432,   -7.0900,    70
10V1,   lig868, 1,  0.787698,   -7.3100,    62
10V1,   lig211, 1,  0.83416,    -6.8800,    54
10V1,   lig868, 3,  0.888408,   -6.4700,    44
10V1,   lig278, 2,  0.915932,   -6.6600,    35
10V1,   lig12,  1,  0.922741,   -9.3600,    19
10V1,   lig12,  8,  0.934144,   -7.4600,    24
10V1,   lig40,  2,  0.949955,   -5.9000,    34
10V1,   lig800, 5,  0.964194,   -5.9200,    30
10V1,   lig868, 2,  0.966243,   -6.9100,    20
10V1,   lig12,  2,  0.972575,   -8.3000,    10
10V1,   lig619, 6,  0.979168,   -8.1600,    9
10V1,   lig619, 4,  0.986202,   -8.7800,    5
10V1,   lig800, 2,  0.989599,   -6.2400,    20
10V1,   lig619, 1,  0.989725,   -9.2900,    3
10V1,   lig12,  7,  0.991535,   -7.5800,    9
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V2,   lig40,  1,  0.525767,   -6.4600,    146
10V2,   lig211, 2,  0.744702,   -6.8200,    78
10V2,   lig278, 1,  0.749015,   -7.0900,    74
10V2,   lig868, 1,  0.772025,   -7.3100,    66
10V2,   lig211, 1,  0.799829,   -6.8700,    63
10V2,   lig12,  1,  0.899345,   -9.1600,    25
10V2,   lig12,  4,  0.899606,   -7.5500,    32
10V2,   lig868, 3,  0.903364,   -6.4800,    40
10V2,   lig278, 3,  0.913145,   -6.6300,    36
10V2,   lig800, 5,  0.94576,    -5.9100,    35

To post-process this CSV I first need to remove the repetitions of the header line

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)

keeping the header only at the beginning of the fused CSV (on the first line).

Then I need to sort all the lines (ignoring the header on the first line) according to the numbers in the 4th column, dG(rescored).

To accomplish the first task I tried the following awk one-liner, which is meant to find the first line and then remove its repeats:

 awk '{first=$1;gsub("ID(Prot)","");print first,$0}' mycsv.csv > csv_without_repeats.csv

However, it did not recognize the header line, which means the pattern was not defined correctly.

Then, to sort the data according to the values in the 4th column, I used sort:

LC_ALL=C sort -k4,4g

How could it be piped to my awk code, or could everything be done directly in awk?

E.g. I've tried

awk '{first=$1;gsub(/ID(Prot)?(\([-azA-Z]+\))?/,"");print first,$0}' | LC_ALL=C sort -k4,4g input.csv > sorted_and_without_repeats.csv

but the script did not terminate on its own, although it correctly produced a sorted CSV (still with the repeated headers, due to the problem in the awk part).

If there are single tab characters after commas (and I think there are) then please state it clearly. The problem is the site displays them with multiple spaces instead and users will copy your example data with spaces, so their solutions may or may not work with your actual data and this may lead to misunderstandings when a solution does not work for you despite being good for data copied from the site. — Kamil Maciorowski, Commented May 28, 2021 at 9:00
The header line (whose repeats should be removed) was produced by another AWK script via: print "ID(Prot)","ID(lig)","ID(cluster)","dG(rescored)","dG(before)", "POP(before)"; and the lines that should be sorted via: print prefix, suffix, $1, sqrt((($3-lowest_dG)/lowest_dG)^2+(($2-240)/240)^2), $3, $2 — Hot JAMS, Commented May 28, 2021 at 9:04
So the separator between columns is a comma and a space, I suppose ... — Hot JAMS, Commented May 28, 2021 at 9:04
Nice columns (achieved by displaying multiple spaces) and the source make me suspect you actually pasted tab characters. Oh well. — Kamil Maciorowski, Commented May 28, 2021 at 9:13
If all the headers are identical, an easy method to skip them is to memorize the first line and then skip lines that are identical to it. awk 'NR==1 {header=$0;print} "foo"$0!="foo"header {print}'. Replace the last print with code designed to process non-header lines. "foo" is added so that even a numerical header will be treated as a string (see this). The advantage is that you don't need to know anything about what the header looks like, so this approach does not need to be adjusted every time you need it. — Kamil Maciorowski, Commented May 28, 2021 at 9:32
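
The header-memorizing idea from the comment above can be combined with sort in a single pipeline. What follows is only a minimal sketch, assuming the separator is a comma followed by blanks and that sort accepts the -g key flag used in the question; the shortened sample file is made up for illustration, and the file names are the ones from the question.

```shell
# Hypothetical, shortened sample: two concatenated CSVs with a repeated header.
cat > mycsv.csv <<'EOF'
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig619, 1, 1.00876, -7.9000, 3
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 1, 0.83416, -6.8800, 54
EOF

{
  # Emit the header once, taken from the first line of the input.
  head -n 1 mycsv.csv
  # Remember the first line, drop every line identical to it, and sort the
  # remaining rows numerically on the 4th comma-separated field.
  awk 'NR==1 {header=$0; next} ("foo" $0) != ("foo" header)' mycsv.csv |
    LC_ALL=C sort -t, -k4,4g
} > sorted_and_without_repeats.csv

cat sorted_and_without_repeats.csv
```

Both commands in the braces write to the same redirected output, so the header always lands on the first line and only the data rows pass through sort.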

Thor · Accepted Answer · 2021-05-28 09:16:59Z

Here is one way with GNU awk:

parse.awk

BEGIN { 
  # Arrays should be sorted numerically by their index
  PROCINFO["sorted_in"] = "@ind_num_asc" 

  # Set field-separator to a comma followed by optional spaces or tabs
  FS = ",[ \t]*"
}

# Print the header
NR==1 { print; next }

# Collect lines into the `h` hashmap, keyed by dG(rescored).
# Note: rows with an identical key overwrite each other.
NR>1 && $1 !~ /^ID/ { 
  h[$4] = $0
} 

# Print the sorted hashmap `h`
END { 
  for(k in h) print h[k]
}

Run it like this:

awk -f parse.awk infile

Output:

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig40,  1,  0.513472,   -6.4600,    150
10V2,   lig40,  1,  0.525767,   -6.4600,    146
10V1,   lig211, 2,  0.695981,   -6.8200,    91
10V2,   lig211, 2,  0.744702,   -6.8200,    78
10V2,   lig278, 1,  0.749015,   -7.0900,    74
10V1,   lig278, 1,  0.764432,   -7.0900,    70
10V2,   lig868, 1,  0.772025,   -7.3100,    66
10V1,   lig868, 1,  0.787698,   -7.3100,    62
10V2,   lig211, 1,  0.799829,   -6.8700,    63
1000,   lig40,  1,  0.805136,   -5.5200,    79
10V1,   lig211, 1,  0.83416,    -6.8800,    54
10V1,   lig868, 3,  0.888408,   -6.4700,    44
10V2,   lig12,  1,  0.899345,   -9.1600,    25
10V2,   lig12,  4,  0.899606,   -7.5500,    32
10V2,   lig868, 3,  0.903364,   -6.4800,    40
10V2,   lig278, 3,  0.913145,   -6.6300,    36
10V1,   lig278, 2,  0.915932,   -6.6600,    35
10V1,   lig12,  1,  0.922741,   -9.3600,    19
1000,   lig868, 1,  0.933209,   -5.6100,    42
1000,   lig278, 1,  0.933689,   -5.7600,    40
10V1,   lig12,  8,  0.934144,   -7.4600,    24
10V2,   lig800, 5,  0.94576,    -5.9100,    35
1000,   lig619, 3,  0.946354,   -7.6100,    20
10V1,   lig40,  2,  0.949955,   -5.9000,    34
1000,   lig211, 1,  0.960048,   -5.2800,    39
10V1,   lig800, 5,  0.964194,   -5.9200,    30
10V1,   lig868, 2,  0.966243,   -6.9100,    20
1000,   lig40,  2,  0.971051,   -4.9900,    40
10V1,   lig12,  2,  0.972575,   -8.3000,    10
10V1,   lig619, 6,  0.979168,   -8.1600,    9
10V1,   lig619, 4,  0.986202,   -8.7800,    5
1000,   lig868, 3,  0.986384,   -5.5000,    29
1000,   lig12,  3,  0.988506,   -6.7100,    16
10V1,   lig800, 2,  0.989599,   -6.2400,    20
10V1,   lig619, 1,  0.989725,   -9.2900,    3
10V1,   lig12,  7,  0.991535,   -7.5800,    9
1000,   lig800, 16, 0.995574,   -4.5300,    40
1000,   lig800, 1,  0.999935,   -5.7900,    22
1000,   lig619, 1,  1.00876,    -7.9000,    3
1000,   lig619, 2,  1.02254,    -7.6400,    1
1000,   lig12,  1,  1.02723,    -6.8600,    5
1000,   lig12,  2,  1.03273,    -6.8100,    4
1000,   lig211, 2,  1.03722,    -5.2000,    19
1000,   lig211, 3,  1.03738,    -5.0400,    21
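
One caveat with the script above: rows are stored under h[$4], so two rows with the same dG(rescored) value would overwrite each other, and PROCINFO["sorted_in"] is specific to GNU awk. A minimal sketch of an alternative, assuming a sort that supports the -g key flag (as in the question), lets awk drop the repeated headers and hand the data rows to sort itself. The sample file, including its two rows with a dG(rescored) of 0.900000, is made up for illustration.

```shell
# Hypothetical sample with a repeated header and two rows sharing the same key.
cat > infile <<'EOF'
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig12, 2, 0.900000, -6.8100, 4
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig12, 7, 0.900000, -7.5800, 9
EOF

awk '
BEGIN {
  # Same field separator as in parse.awk
  FS = ",[ \t]*"
  # External command that sorts numerically on the 4th comma-separated field
  cmd = "LC_ALL=C sort -t, -k4,4g"
}
# Print the header once and flush it so it precedes the output of sort
NR == 1 { print; fflush(); next }
# Send data rows to sort; repeated header lines are dropped
$1 !~ /^ID/ { print | cmd }
# Closing the pipe lets sort finish and write its result
END { close(cmd) }
' infile > sorted.csv

cat sorted.csv
```

Because the rows are never used as array keys, both 0.900000 rows survive, and the script needs no gawk-specific features.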

Excellent solution, thank you! Cheers — Hot JAMS, Commented May 28, 2021 at 12:06
