
I apologize if the question has been asked before, but I have been searching for days and could not find a solution in Python.

I have a large fasta file, containing headers and sequences.

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

>cavPor3_rmsk_tRNA-Ser-TCA(m) range=chrM:6875-6940 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

This is a very small fragment of what the file looks like. I want to keep only the first entry (header and sequence) if, as you can see for the last two entries, the sequences are the same.

The output would look like this:

>cavPor3_rmsk_tRNA-Leu-TTA(m) range=chrM:2643-2717 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTTAAGGTGGCAGAGCCGGTAATTGCATAAAATTTAAGACTTTACTCTCA
GAGGTTCAACTCCTCTCCTTAACAC

>cavPor3_rmsk_tRNA-Gln-CAA_ range=chrM:3745-3815 5'pad=0 3'pad=0 strand=- repeatMasking=none
AGAGGGTCATAAAGGTTATGGGGTTGGCTTGAAACCAGCTTTAGGGGGTT
CAATTCCTTCCTCTCT

The problem is that the FASTA file is over one gigabyte in size. I have found ways to solve this by removing duplicates based on duplicate IDs, or by using bash, but sadly neither works on my computer. This task is for a research project, not homework.

Thank you in advance for your help!

  • Shouldn't a FASTA file be like: >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL >SEQUENCE_2 SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
    – pippo1980
    Commented Mar 3, 2021 at 18:59
  • So you should read the first sequence and append it to a new file; then read the second sequence, check the new file for it, and if it is not present append it; and so on until the end of the first file.
    – pippo1980
    Commented Mar 3, 2021 at 19:27
  • biostars.org/p/250123 Question: Best way to processing huge fasta files using python.
    – pippo1980
    Commented Mar 3, 2021 at 19:35
  • biostars.org/p/710 : Question: Correct Way To Parse A Fasta File In Python
    – pippo1980
    Commented Mar 3, 2021 at 20:10
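The incremental idea pippo1980 sketches above can be made memory-light without re-reading the output file for every record: keep only a fingerprint (here an MD5 digest) of each sequence seen so far, so even a multi-gigabyte input needs only 16 bytes of memory per unique sequence. This is a standard-library sketch of that idea, not code from the thread; `dedupe_fasta` and the file names are illustrative:

```python
import hashlib

def dedupe_fasta(in_path, out_path):
    """Keep only the first record for each distinct sequence."""
    seen = set()  # 16-byte MD5 digests, not the sequences themselves
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, seq_lines = None, []

        def flush():
            # emit the buffered record if its sequence has not been seen yet
            if header is None:
                return
            digest = hashlib.md5("".join(seq_lines).encode()).digest()
            if digest not in seen:
                seen.add(digest)
                fout.write(header + "\n")
                fout.writelines(s + "\n" for s in seq_lines)

        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()  # a new header starts; decide on the previous record
                header, seq_lines = line, []
            elif line:
                seq_lines.append(line)
        flush()  # do not forget the last record
```

Joining the wrapped sequence lines before hashing means the comparison ignores line wrapping, matching the OP's examples.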

3 Answers


This is copied from here: Remove Redundant Sequences from FASTA file in Python.

It uses Biopython, and works with FASTA files where the headers are of the

'>header' type (see the FASTA format wiki)

from Bio import SeqIO
import time

start = time.time()

seen = []      # sequences already encountered
records = []   # records to keep

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# write the kept records to a FASTA file
SeqIO.write(records, "OUTPUT-FILE", "fasta")
end = time.time()

print(f"Run time is {(end - start) / 60} minutes")


Faster, as suggested by MattDMo, using a set instead of a list:

seen = set()
records = []

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    if str(record.seq) not in seen:
        seen.add(str(record.seq))
        records.append(record)

I've got a longer one that uses argparse, but it's slower because it counts the sequences; I can post it if needed.
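A further tweak (my own variant, not part of the answer above): instead of collecting all kept records in a `records` list, write each one out as soon as it is first seen, so only the set of sequence strings stays in memory. Function and file names here are illustrative:

```python
from Bio import SeqIO

def dedupe_streaming(in_path, out_path):
    """Write each record the first time its sequence appears; return the count kept."""
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for record in SeqIO.parse(in_path, "fasta"):
            seq = str(record.seq)
            if seq not in seen:
                seen.add(seq)
                # SeqIO.write accepts a single record as well as a list,
                # and returns the number of records written
                kept += SeqIO.write(record, out, "fasta")
    return kept
```

This trades the in-memory `records` list for immediate disk writes, which matters once the kept records themselves approach available RAM.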

  • The OP's FASTA entries are properly formatted, it just wasn't showing up due to the > character indicating a blockquote in Markdown. I have reformatted the original post.
    – MattDMo
    Commented Mar 4, 2021 at 1:17
  • Regarding your code, for large files it would be much more efficient if seen was a set instead of a list.
    – MattDMo
    Commented Mar 4, 2021 at 1:20
  • @MattDMo and how will it cope with the GB-size file problem?
    – pippo1980
    Commented Mar 4, 2021 at 7:16
  • It depends on how much memory the machine has, but any reasonably recent machine should have at least 2-4GB of RAM, so there shouldn't be a problem here. In general, though, if the size of a data structure exceeds available memory, you'll get a MemoryError.
    – MattDMo
    Commented Mar 4, 2021 at 16:28
  • @MattDMo biostars.org/p/710: Correct Way To Parse A Fasta File In Python. According to this, would my idea be feasible (I don't care if it's slow as hell)? In case of not enough memory, create 'fasta_sequences = SeqIO.parse(open(input_file), "fasta")' just once, append sequence 1 to a new file, and in a loop recreate the parser of the output file each time to check whether the nth sequence of the input file should be appended to the output file. Am I wrong? And where can I get a 2 GB FASTA file to try it out? Most sequence repos have size limits, even for BLAST alignment results.
    – pippo1980
    Commented Mar 4, 2021 at 17:27
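On the last question in the comments, where to get a multi-gigabyte FASTA file for testing: one can simply generate a synthetic file of any size locally, with a controlled duplicate rate. A standard-library sketch; `make_fasta` and all its parameters are illustrative:

```python
import random

def make_fasta(path, n_records=1000, seq_len=75, dup_rate=0.3, seed=0):
    """Write n_records FASTA entries; roughly dup_rate of them reuse an
    earlier sequence, so the file contains known duplicates to remove."""
    rng = random.Random(seed)
    uniques = []
    with open(path, "w") as out:
        for i in range(n_records):
            if uniques and rng.random() < dup_rate:
                seq = rng.choice(uniques)  # reuse an earlier sequence -> duplicate
            else:
                seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
                uniques.append(seq)
            out.write(f">synthetic_record_{i}\n")
            for j in range(0, len(seq), 50):  # wrap at 50 chars like the OP's file
                out.write(seq[j:j + 50] + "\n")
```

Raising `n_records` scales the file roughly as n_records x seq_len bytes, so a 2 GB test file is just a matter of patience.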

If the records are each on a single line in that format, so that there are 6 space-separated fields before the sequence, then this is easy. (For the wrapped, multi-line sample above you would first have to join each record's sequence onto its header line.) You will have to store all of the unique sequences in memory.

memory = set()
for line in open('x.txt'):
    if len(line) < 5:           # skip blank or very short lines
        continue
    parts = line.split(' ', 6)  # parts[-1] is the sequence field
    if parts[-1] not in memory:
        print(line.strip())
        memory.add(parts[-1])

If you want to keep both headers while removing duplicate sequences, you can use:

from Bio import SeqIO

with open("fasta.fasta") as input1:
    # record.id is only the header up to the first space
    dict_fasta = {record.id: str(record.seq) for record in SeqIO.parse(input1, "fasta")}

# invert the mapping: sequence -> list of ids sharing that sequence
tmp = {}
for key, value in dict_fasta.items():
    if value in tmp:
        tmp[value].append(key)
    else:
        tmp[value] = [key]
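The snippet above stops after grouping the ids by sequence. Writing the result back out, one record per unique sequence with the duplicate headers merged, could look like the following sketch; `write_merged`, the '|' separator, and the 50-character wrap are my own choices, not from the answer:

```python
def write_merged(tmp, out_path, width=50):
    """tmp maps a sequence string to the list of ids that share it."""
    with open(out_path, "w") as out:
        for seq, ids in tmp.items():
            # join all headers of duplicates into one header line
            out.write(">" + "|".join(ids) + "\n")
            # re-wrap the sequence to fixed-width lines
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

Note that this keeps only `record.id` (the header up to the first space), so the range/strand annotations from the original headers are lost.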
