I have a text file like this:

Fam1000: CMIN|CMIN_9-RA CMIN|ABC_7-RA GCLA|EFX5.1 GCUC|GCUC_7-RA
Fam1001: GCLA|EFX6.1 GCLA|EFX7.1
Fam1002: GCLA|EFX5.1 GCLA|EFX2.1 GCUC|GCUC_8-RA GCUC|GCUC_8-RA
Fam1003: CMIN|CMIN_001265-RA CMIN|CMIN_007282-RA

In this file, each line contains a number of space-separated values. Each value starts with a group identifier followed by a pipe symbol (for example, CMIN|CMIN_9-RA and CMIN|ABC_7-RA both belong to the CMIN group). The part after the pipe can be any combination of letters and numbers.

I know the number and names of the group identifiers in the file (in this case there are 3: CMIN, GCLA and GCUC). I want to parse this file into one that shows the number of values from each group on each line. In the end, I would like output like this (either space- or tab-separated):

            CMIN    GCLA    GCUC
Fam1000:    2       1       1
Fam1001:    0       2       0
Fam1002:    0       2       2
Fam1003:    2       0       0

I was thinking I should first delete everything after the | in each value and then count the occurrences of each identifier in each row, but I couldn't figure out how to do this with awk. Can anybody please help?

Also, this is just a simplified example; the actual file is fairly large, with a few thousand lines and a couple dozen groups.

Thanks.

1 Answer

Not the most beautiful solution, but it works. This script was tested on Ubuntu Linux. It relies on gawk (GNU awk) features, namely arrays of arrays (gawk 4.0 or later) and asorti, so it may not work with the default awk on a Mac.

Save the script shown further down in a file, e.g. parsetext.sh

Run this command to enable execution:

chmod +x parsetext.sh

Then run it with your inputfile.txt:

./parsetext.sh inputfile.txt

Here is the script that does the job:

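#!/bin/bash
# Strip everything from "|" up to the next space, and the colon after the
# family name, so each line becomes e.g.: Fam1000 CMIN CMIN GCLA GCUC
sed -e 's/|[^ ]*//g; s/://' "$1" |
gawk '
{
    # rows[family][group] counts values per line; keys collects all group names
    for (i = 2; i <= NF; i++) {
        rows[$1][$i]++
        keys[$i]++
    }
}
END {
    n = asorti(keys, tmp)                    # sorted group names in tmp[1..n]
    PROCINFO["sorted_in"] = "@ind_str_asc"   # make "for (r in rows)" iterate in sorted order

    # header line: one column per group
    for (i = 1; i <= n; i++) printf("\t%s", tmp[i])
    printf("\n")

    # one line per family; groups missing from a line are printed as 0
    for (r in rows) {
        printf("%s", r)
        for (i = 1; i <= n; i++) {
            k = tmp[i]
            value = (k in rows[r]) ? rows[r][k] : 0
            printf("\t%s", value)
        }
        printf("\n")
    }
}'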

Sample output:

        CMIN    GCLA    GCUC
Fam1000 2       1       1
Fam1001 0       2       0
Fam1002 0       2       2
Fam1003 2       0       0
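
If gawk is not available (for example on a Mac), something along these lines should give the same table using only POSIX awk features. This is a minimal sketch rather than a tested drop-in replacement: the function name count_groups is made up for illustration, the group is taken to be everything before the first | in each value, and families are printed in the order they appear in the input.

```shell
#!/bin/sh
# count_groups (hypothetical name): print a tab-separated table with one row
# per input line and one column per group. Reads the named files, or stdin.
count_groups() {
    awk '
    NF == 0 { next }                        # skip blank lines
    {
        fam = $1
        sub(/:$/, "", fam)                  # drop the trailing colon of the family label
        if (!(fam in seen_fam)) {           # remember families in input order
            seen_fam[fam] = 1
            fams[++nfam] = fam
        }
        for (i = 2; i <= NF; i++) {
            grp = $i
            sub(/[|].*/, "", grp)           # keep only the part before the pipe
            if (!(grp in seen_grp)) {
                seen_grp[grp] = 1
                grps[++ngrp] = grp
            }
            count[fam, grp]++               # simulated 2-D array, works in any POSIX awk
        }
    }
    END {
        # insertion sort of the group names (POSIX awk has no asorti)
        for (i = 2; i <= ngrp; i++) {
            g = grps[i]
            for (j = i - 1; j >= 1 && grps[j] > g; j--)
                grps[j + 1] = grps[j]
            grps[j + 1] = g
        }
        # header line
        for (i = 1; i <= ngrp; i++)
            printf "\t%s", grps[i]
        printf "\n"
        # one line per family; unseen combinations print as 0
        for (f = 1; f <= nfam; f++) {
            printf "%s", fams[f]
            for (i = 1; i <= ngrp; i++)
                printf "\t%d", count[fams[f], grps[i]]
            printf "\n"
        }
    }' "$@"
}
```

After defining the function (or pasting it into a script), it would be called as count_groups inputfile.txt. Since it does the prefix stripping inside awk, no separate sed step is needed.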