I have genome data in a file, genomes-seq.txt. The title of each sequence begins with >, followed by the genome name:
>genome.1
atcg
atcg
atcggtc
>genome.2
atct
tgcgtgctt
attttt
>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv
>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l
>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.234
efijhusidh
siduhygfhuji
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
I want to collect all the data for genome.1 in one file, under a single header, so that it looks like this:
>genome.1
atcg
atcggtc
iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc
But every time I do it using sed I get:
>genome.1
atcg
atcg
atcggtc
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
That is, the >genome.1 header appears multiple times. How can I do this correctly, so that on a large data set I don't need to remove all the repetitions?
What was the sed command that you used?
"... sed ...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service).
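Since the sed command itself is not shown, here is one possible approach rather than a fix for it: a minimal Python sketch that prints the header once and then every sequence line filed under any block with that exact header. The function name merge_records and the script name merge_genome.py are hypothetical. The sketch assumes headers are matched exactly after stripping trailing whitespace, so genome.1 does not also pick up genome. or genome.234. It keeps every sequence line, including the repeated atcg line, which the desired output in the question shows only once.

```python
import sys


def merge_records(lines, name):
    """Return the header once, followed by all sequence lines stored under it."""
    header = ">" + name
    merged = [header]
    keep = False
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # Exact comparison: ">genome.1" must not match ">genome.234" or ">genome."
            keep = (line == header)
        elif keep:
            merged.append(line)
    return merged


# Hypothetical usage: python3 merge_genome.py genome.1 < genomes-seq.txt > genome.1.txt
if __name__ == "__main__" and len(sys.argv) == 2:
    print("\n".join(merge_records(sys.stdin, sys.argv[1])))
```

The file is read line by line in a single pass, so only the lines of the requested genome are held in memory, which should be workable on a large data set.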