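I have a script that takes two files as input and builds a dictionary from each, reading the lines in pairs (a key on one line, its value on the next). It then replaces the values of the first dictionary with the matching entries from the second and overwrites the first file with the result.

I am looking for a way to run this script on every file pair in a folder, choosing sys.argv[1] and sys.argv[2] for each pair based on a pattern in the file names.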

import sys

datafile = sys.argv[1]
schemaseqs = sys.argv[2]

datafile_lines = []
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()]=0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i+=1

new_d = {}
with open(schemaseqs, 'r') as f:
    i=0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()]=0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i+=1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]

print(d)

# Overwrite the data file with the updated key/value pairs, one item per line.
with open(datafile, 'w') as out:
    for k, v in d.items():
        out.write(f'{k}\n{v}\n')

I have hundreds of file pairs that all share the pattern proteinXXXX, where XXXX is a number of up to four digits (e.g. 9, 99, 999 or 9999). For example, I have protein555.txt and protein555.fasta.

I have seen that I can use glob or os.listdir to list the files in a directory. However, I cannot work out how to assign each pair to variables and process the lines one pair at a time for every pair in the directory.

Any help is appreciated.

  • are you reading files from single folder or multiple folders?
    – deadshot
    Commented May 8, 2020 at 20:37
  • From two different folders, but I can join into one
    – Matteoli
    Commented May 9, 2020 at 14:09

1 Answer

Just the concept.

Import required libraries.

import glob
import os.path

Define a function that extracts only the basename (the file name without its directory and extension) from a path.

def basename(fn):
    return os.path.splitext(os.path.basename(fn))[0]
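As a quick check of what this helper returns, here is a small sketch. The sample paths are made up for illustration and do not need to exist on disk, since both os.path functions only manipulate strings.

```python
import os.path


def basename(fn):
    # Drop the directory part first, then the extension.
    return os.path.splitext(os.path.basename(fn))[0]


# Hypothetical sample paths, used only to show the result.
print(basename("data/protein555.txt"))   # protein555
print(basename("protein9999.fasta"))     # protein9999
```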

Create two sets, one with .txt files, another with .fasta files.

t = {basename(fn) for fn in glob.glob("protein*.txt")}
f = {basename(fn) for fn in glob.glob("protein*.fasta")}
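Note that the pattern protein* matches any name starting with "protein", not only names followed by one to four digits. If other such files could be present, the basenames can be filtered with a regular expression. This is an optional sketch; the names PROTEIN_NAME and strict_basenames are made up here and are not part of the original answer.

```python
import re

# "protein" followed by one to four digits, as described in the question.
PROTEIN_NAME = re.compile(r"protein\d{1,4}")


def strict_basenames(basenames):
    """Keep only basenames that match proteinXXXX exactly."""
    return {bn for bn in basenames if PROTEIN_NAME.fullmatch(bn)}


candidates = {"protein9", "protein9999", "protein12345", "protein_notes"}
print(sorted(strict_basenames(candidates)))  # ['protein9', 'protein9999']
```

The filter can be applied to each set before taking the intersection.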

Calculate the intersection of these two sets to make sure that both a .txt and a .fasta file exist for each basename. Then add the suffixes back and pass each pair to the existing code, assumed here to be wrapped in a function process(datafile, schemaseqs).

for bn in t.intersection(f):
    process(bn + ".txt", bn + ".fasta")
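Putting the pieces together, one possible end-to-end sketch follows. It makes several assumptions that are not in the original posts: the question's script is wrapped in a function named process, the helper names read_pairs and process_all are invented here, and the two folders mentioned in the comments are passed as the hypothetical parameters txt_dir and fasta_dir. Unlike the original script, read_pairs silently drops a trailing key that has no value line.

```python
import glob
import os.path


def read_pairs(path):
    """Read a file whose lines alternate key, value into a dict."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    # zip stops at the shorter slice, so an unpaired last line is dropped.
    return dict(zip(lines[0::2], lines[1::2]))


def process(datafile, schemaseqs):
    """Replace values in datafile using schemaseqs, then overwrite datafile."""
    d = read_pairs(datafile)
    new_d = read_pairs(schemaseqs)
    for key, value in d.items():
        if value in new_d:
            d[key] = new_d[value]
    with open(datafile, "w") as out:
        for k, v in d.items():
            out.write(f"{k}\n{v}\n")


def basename(fn):
    return os.path.splitext(os.path.basename(fn))[0]


def process_all(txt_dir=".", fasta_dir="."):
    """Process every basename that has both a .txt and a .fasta file."""
    txt = {basename(fn) for fn in glob.glob(os.path.join(txt_dir, "protein*.txt"))}
    fasta = {basename(fn) for fn in glob.glob(os.path.join(fasta_dir, "protein*.fasta"))}
    paired = sorted(txt & fasta)
    for bn in paired:
        process(os.path.join(txt_dir, bn + ".txt"),
                os.path.join(fasta_dir, bn + ".fasta"))
    return paired
```

Calling process_all() with no arguments handles the single-folder case. Because the .txt files are overwritten in place, it is worth trying this on a copy of the data first.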
