I have a text file like this:
Fam1000: CMIN|CMIN_9-RA CMIN|ABC_7-RA GCLA|EFX5.1 GCUC|GCUC_7-RA
Fam1001: GCLA|EFX6.1 GCLA|EFX7.1
Fam1002: GCLA|EFX5.1 GCLA|EFX2.1 GCUC|GCUC_8-RA GCUC|GCUC_8-RA
Fam1003: CMIN|CMIN_001265-RA CMIN|CMIN_007282-RA
In this file, each line contains a number of space-separated values. Each value has a group identifier preceding the pipe symbol (for example, CMIN|CMIN_9-RA and CMIN|ABC_7-RA belong to the CMIN group). The text following the pipe can be any mix of letters and numbers.
I know the total number and the names of the group identifiers in the file (in this case there are 3: CMIN, GCLA and GCUC). I want to parse this file into a new file that shows the number of values from each group on each line. In the end, I would like output like this (either space- or tab-separated):
CMIN GCLA GCUC
Fam1000: 2 1 1
Fam1001: 0 2 0
Fam1002: 0 2 2
Fam1003: 2 0 0
I was thinking I should first delete everything after the | in each value, then count how many times each group identifier occurs in each row, but I couldn't figure out how to do this cleanly with awk. Can anybody please help?
Also, this is just a simplified example; the actual file is fairly large, with a few thousand lines and a couple dozen groups.
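To make the idea concrete, here is a rough sketch of the strip-and-count approach, assuming the group names are passed in by hand as a space-separated list. The function name count_groups and the awk variable groups are placeholders I made up, and I am not sure this is the idiomatic way to do it. It reads the file once, so it should cope with a few thousand lines; any value whose prefix is not in the list is simply ignored.

```shell
# count_groups (hypothetical name): print a tab-separated table of
# per-group counts for every line of the file given as $1.
count_groups() {
  awk -v groups="CMIN GCLA GCUC" '
    BEGIN {
      # Fix the column order from the known group names and print the header.
      n = split(groups, g, " ")
      header = ""
      for (i = 1; i <= n; i++) header = header (i > 1 ? "\t" : "") g[i]
      print header
    }
    {
      split("", count)                    # reset the counts for this line
      for (f = 2; f <= NF; f++) {         # field 1 is the family label
        p = index($f, "|")                # position of the pipe, 0 if absent
        if (p > 0) count[substr($f, 1, p - 1)]++
      }
      line = $1
      for (i = 1; i <= n; i++) line = line "\t" (count[g[i]] + 0)
      print line
    }
  ' "$1"
}
```

With the sample saved as input.txt (a made-up file name), I would call it as count_groups input.txt > counts.tsv. Adding 0 to each count is what turns a group that never appears on a line into a 0 instead of an empty column.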
Thanks.