
I have genome data in a file, genomes-seq.txt. Each sequence title begins with >, followed by the genome name:

>genome.1
atcg
atcg
atcggtc

>genome.2
atct
tgcgtgctt
attttt

>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv

>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l

>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd


>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.234
efijhusidh
siduhygfhuji

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

I want to collect all the data for genome.1 into one file, under a single header, so that it looks like this:

>genome.1
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc

But every time I try this with sed, I get:

>genome.1
atcg
atcg
atcggtc

>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

That is, the >genome.1 header is repeated. How can I do this correctly, so that on a large data set I don't need to remove all the repeated headers?

  • Hi @paul, which sed command did you use?
    – user88036
    Commented Oct 9, 2018 at 14:58
  • I tried but it didn't work
    – paul
    Commented Oct 9, 2018 at 15:01
  • Show what you tried and we can help fix your errors. Commented Oct 9, 2018 at 15:32
  • Re "...every time I do it using sed...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service). Commented Oct 9, 2018 at 17:39
  • The file format is FASTA. Commented Oct 9, 2018 at 17:48

4 Answers

$ sed -nr /\>genome.1/,/^$/p file | sed '2,${/^>genome.1$/d}'

>genome.1
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc

Here genome.1 is the keyword; change it to whichever sequence name you want to extract.

  • Hi Goro, can I be your sed friend? Where can I find such sed knowledge?
    – schweik
    Commented Oct 9, 2018 at 16:49
  • 1) \> matches end-of-word, not the literal character >. 2) Please use \. to escape the literal dot in the regular expression that is supposed to match >genome.1. 3) You should also anchor it at the line start and end to avoid false matches: /^>genome\.1$/. 4) The -r flag on the first sed command is not required but harmless since you don't use any characters affected by it. 5) As a rule of thumb, you should quote sed commands passed through a shell to avoid easily overlooked issues. Commented Oct 9, 2018 at 18:23
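
Applying the points from the comment above (quoting the script, escaping the dot, anchoring the pattern), the pipeline might look like the sketch below. The function name extract_genome and its two arguments are made up for illustration. It assumes blocks are separated by blank lines, as in the question's sample, and that the sequence name contains no regex metacharacters other than dots.

```shell
# Hypothetical helper: print all blocks for one sequence name, header once.
# Usage: extract_genome NAME FILE   (e.g. extract_genome genome.1 genomes-seq.txt)
extract_genome() {
    name=$1
    file=$2
    # Escape dots so they match literally in the sed regular expressions.
    pattern=$(printf '%s' "$name" | sed 's/\./\\./g')
    # First sed: print each range from an exact header line to the next blank line.
    # Second sed: from line 2 onward, delete the repeated header lines.
    sed -n "/^>${pattern}\$/,/^\$/p" "$file" | sed "2,\${/^>${pattern}\$/d;}"
}
```

Because the header pattern is anchored with ^ and $, a name such as genome.1 should no longer pick up genome.10 or genome.1x.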

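With Perl, in paragraph mode (-00), so that each blank-line-separated block is read as one record:

perl -00 -ne 'if (/^>genome\.1\n/) {s/// if $seen++; print}' file

The empty pattern in s/// reuses the last successful match, so the header is stripped from every matching block after the first.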

With Awk:

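{
  # Any header line ends the current section.
  if (/^>/)
    in_section = 0;
  if ($0 == ">genome.1") {
    in_section = 1;
    # Print the header only the first time it is seen.
    if (!section_count++)
      print;
  } else if (in_section)
    # Print the lines that belong to a genome.1 section.
    print;
}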

Usage:

awk '{ if (/^>/) in_section = 0; if ($0 == ">genome.1") { in_section = 1; if (!section_count++) print; } else if (in_section) print; }' genomes-seq.txt

Well, since we have started with awk, try this:

echo ">genome.1";awk 'BEGIN{RS=">"}{if($1 == "genome.1"){for(i=1;i<NF;i++){print $(i+1)}}}' file |sort -u

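With RS=">", awk splits the input into one record per sequence. For each record whose first field is genome.1, the loop prints every field except the first, and sort -u then removes duplicate lines (the -u parameter); note that sort also reorders the lines. If you set RS=">genome[.]" (gawk and mawk treat a multi-character RS as a regular expression, but POSIX leaves this unspecified), the name test reduces to comparing the first field with 1:

echo ">genome.1";awk 'BEGIN{RS=">genome[.]"}$1=="1"{for(i=2;i<=NF;i++)print $i}' file |sort -u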
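
Because sort -u reorders lines and drops repeated sequence lines, it may not suit data where the order matters. A possible order-preserving sketch uses awk's paragraph mode (RS="") instead. This is a different technique from the commands above, the function name collect_records is hypothetical, and it assumes every block is separated from the next by at least one blank line.

```shell
# Hypothetical helper: print all records for one sequence name, title once,
# keeping the original line order and any repeated sequence lines.
# Usage: collect_records NAME FILE   (e.g. collect_records genome.1 file)
collect_records() {
    awk -v header=">$1" '
        BEGIN { RS = ""; FS = "\n" }       # one record per block, one field per line
        $1 == header {                     # the first line of the block is the title
            if (!seen++) print $1          # emit the title only for the first block
            for (i = 2; i <= NF; i++)      # then every sequence line, in order
                print $i
        }
    ' "$2"
}
```

Since the title is compared as a whole string, genome.1 does not match genome.10, and no regex escaping is needed.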
