Collecting specific genome data from a file under a single title

Question

I have genome data in a file, genomes-seq.txt. Each sequence title begins with >, followed by the genome name:

>genome.1
atcg
atcg
atcggtc

>genome.2
atct
tgcgtgctt
attttt

>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv

>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l

>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd


>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.234
efijhusidh
siduhygfhuji

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

I want to collect all the data for genome.1 under a single title in one file, so that it looks like this:

>genome.1
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc

But every time I do it using sed I get:

>genome.1
atcg
atcg
atcggtc

>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

That is, multiple genome.1 titles. How can I do this correctly, so that on a large data set I don't need to remove all the repetitions by hand?

Re "...every time I do it using sed...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service). — Peter Mortensen, Commented Oct 9, 2018 at 17:39

score 3 · Accepted Answer · 2018-10-09 15:39:06Z

3

$ sed -nr /\>genome.1/,/^$/p file | sed '2,${/^>genome.1$/d}'

>genome.1
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc

genome.1 is the keyword; change it depending on the list you would like to generate.

edited Oct 9, 2018 at 15:39

answered Oct 9, 2018 at 15:04

user88036

Hi Goro, can I be your sed friend? Where can I find such sed knowledge?
– schweik
Commented Oct 9, 2018 at 16:49
1) \> matches end-of-word, not the literal character >. 2) Please use \. to escape the literal dot in the regular expression that is supposed to match >genome.1. 3) You should also anchor it at the line start and end to avoid false matches: /^>genome\.1$/. 4) The -r flag on the first sed command is not required but harmless since you don't use any characters affected by it. 5) As a rule of thumb you should escape sed commands provided through a shell to avoid easily overlooked issues.
– David Foerster
Commented Oct 9, 2018 at 18:23
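
Applying the suggestions from the comment above (quote the script, anchor the pattern, escape the dot) might look like the sketch below. The sample file is a shortened, hypothetical stand-in for genomes-seq.txt. The -r flag is dropped because nothing here needs it, and a ; is placed before the closing } for portability to non-GNU sed.

```shell
#!/bin/sh
# Hypothetical sample input, a shortened version of the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>genome.1
atcg
atcggtc

>genome.2
atct

>genome.1
iruuwi
EOF

# First sed: print every block that starts at a line that is exactly
# ">genome.1" and runs to the next empty line (or to end of file).
# Second sed: from line 2 onward, delete the repeated ">genome.1" titles.
result=$(sed -n '/^>genome\.1$/,/^$/p' "$sample" | sed '2,${/^>genome\.1$/d;}')
printf '%s\n' "$result"

rm -f "$sample"
```

The empty line that closes each block is kept in the output, so the collected sequences stay separated by blank lines.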


glenn jackman · Accepted Answer · 2018-10-09 15:35:42Z

0

With Perl:

perl -00 -ne 'if (/^>genome\.1\n/) {s/// if $. > 1; print}' file

answered Oct 9, 2018 at 15:35

glenn jackman


David Foerster · Accepted Answer · 2018-10-09 18:39:20Z

0

With Awk:

{
  if (/^>/)
    in_section = 0;
  if ($0 == ">genome.1") {
    in_section = 1;
    if (!section_count++)
      print;
  } else if (in_section)
    print;
}

Usage:

awk '{ if (/^>/) in_section = 0; if ($0 == ">genome.1") { in_section = 1; if (!section_count++) print; } else if (in_section) print; }' genome.txt
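
To avoid editing the program for every genome, the title can be passed in with awk's -v option. This is only a sketch of the same logic: the function name extract_genome, the variable name header, and the sample file are assumptions made for illustration.

```shell
#!/bin/sh
# extract_genome NAME FILE  (hypothetical helper)
# Prints the title ">NAME" once, followed by the lines of every ">NAME" section.
extract_genome() {
  awk -v header=">$1" '
    /^>/         { in_section = 0 }            # any title line ends the current section
    $0 == header { in_section = 1              # a matching title starts a section
                   if (!printed++) print       # print the title the first time only
                   next }
    in_section                                 # print lines inside a matching section
  ' "$2"
}

# Hypothetical sample input, a shortened version of the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>genome.1
atcg
atcggtc

>genome.2
atct

>genome.1
iruuwi
EOF

result=$(extract_genome genome.1 "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because the title is compared as a plain string with ==, the dot in genome.1 needs no escaping and genome.12 is not matched by accident.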

answered Oct 9, 2018 at 18:39

David Foerster


schweik · Accepted Answer · 2018-10-10 07:32:36Z

0

Well, since we have started with awk, try this:

echo ">genome.1";awk 'BEGIN{RS=">"}{if($1 == "genome.1"){for(i=1;i<NF;i++){print $(i+1)}}}' file |sort -u

With RS=">", each "genome" section becomes a separate record. For each matching record, print all fields except the first, then sort the output and keep only unique lines (the -u parameter). If you set RS=">genome\.", you can write it more briefly:

echo -n ">genome.";awk 'BEGIN{RS=">genome."}/1/{print $0}' file |sort -ur
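
Note that sort -u reorders the sequence lines and drops any line that occurs more than once. If the original order and the repeated lines matter, the same RS=">" idea can be used without sort. The following is a sketch, assuming no sequence line itself contains a > character; the variable names and the sample file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample input; note the repeated "atcg" line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>genome.1
atcg
atcg

>genome.2
atct

>genome.1
iruuwi
EOF

# RS=">" makes each section one record; FS="\n" makes each line one field,
# so $1 is the genome name and $2..$NF are the sequence lines.
result=$(awk -v name="genome.1" '
  BEGIN { RS = ">"; FS = "\n" }
  $1 == name {
    if (!seen++) print ">" name        # title is printed once
    for (i = 2; i <= NF; i++)
      if ($i != "") print $i           # skip the empty separator lines
  }
' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Here both atcg lines survive and the lines appear in file order, at the cost of a slightly longer program.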

answered Oct 10, 2018 at 7:32

schweik
