2

I have two text files with different number of columns in them and I want to compare values of a given column in file1.txt with values of a given column in file2.txt: if these values found in file2.txt then copy a corresponding value in a particular column from file2.txt and insert it into a new column in file1.txt. If no corresponding value found in file2.txt, return FALSE for that row in file1.txt

To make it easier to follow, I have an example as follow:

Say I have file1.txt

Query       No.     Accession   Name    DB
EFX03602.1  1006    PHI:1006    HMR1    Not_Available
EFX00827.1  101     PHI:101     ALB1    AAC39471
EFX01509.1  101     PHI:101     ALB1    AAC39471
EFX05810.1  1010    PHI:1010    SID1    XM_385547
EFX00466.1  1026    PHI:1026    bcplc1  AAB39564

And file2.txt

Accession   DB_Type     DB_Accession    Function
PHI:1006    Uniprot     I1RXX1          HMG-CoA Reductase
PHI:101     Entrez      AAC39471        Polyketide synthase
PHI:7       Entrez      CAA42824        Effector protein
PHI:1026    Entrez      AAB39564        Phospholipase C
PHI:1028    Entrez      CAC29255        pectin methylesterase
PHI:1030    Entrez      CAA93142        ABC Transporter
PHI:17      Entrez      CAA43678        Acid proteinase

I want to compare the column Accession (col. 3) of file1.txt with column Accession (col. 1) in file2.txt and if the values are found in file2.txt, then copy corresponding values in Function column (col. 4) of file2.txt and insert in a new column in file1.txt. At the end I want the result file to look like this (it doesn't matter the position of the new insert column though, it can be anywhere in the file):

Query       No.     Accession   Function            Name    DB
EFX03602.1  1006    PHI:1006    HMG-CoA Reductase   HMR1    Not_Available
EFX00827.1  101     PHI:101     Polyketide synthase ALB1    AAC39471
EFX01509.1  101     PHI:101     Polyketide synthase ALB1    AAC39471
EFX05810.1  1010    PHI:1010    FALSE               SID1    XM_385547
EFX00466.1  1026    PHI:1026    Phospholipase C     bcplc1  AAB39564

Can I do this with some shell commands or I will need a script?

Thanks in advance for your help.

2
  • Are your input files tab separated, or are they fixed-width fields? The spaces in the function text will complicate things. Commented Apr 22, 2015 at 19:32
  • My input files are tab separated. The command you gave worked but the result file is not tab separated.
    – atd
    Commented Apr 23, 2015 at 10:56

1 Answer 1

1

With tab-separated files:

awk '
    BEGIN { FS = OFS = "\t" }
    NR == FNR {fn[$1] = $4; next} 
    {print $1, $2, $3, ($3 in fn ? fn[$3] : "FALSE"), $4, $5}
' file2.txt file1.txt 
Query   No. Accession   Function    Name    DB
EFX03602.1  1006    PHI:1006    HMG-CoA Reductase   HMR1    Not_Available
EFX00827.1  101 PHI:101 Polyketide synthase ALB1    AAC39471
EFX01509.1  101 PHI:101 Polyketide synthase ALB1    AAC39471
EFX05810.1  1010    PHI:1010    FALSE   SID1    XM_385547
EFX00466.1  1026    PHI:1026    Phospholipase C bcplc1  AAB39564
0

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .