3

I am learning Perl, but I don't know how to solve this problem.

I have a .txt file of the following form:

1 16.3346384
2 11.43483
3 1.19819
4 1.1113829
5 1.0953443
6 1.9458343
7 1.345645
8 1.3847385794
9 1.3534344
10 2.1117454
11 1.17465
12 1.4587485

The first column only contains the line number, which is not of interest here, but it is present in the file; the values in the second column are the relevant part.

I want to output the longest contiguous sequence of lines whose second column contains a number smaller than 2.00. For the above example, this would be lines 3 to 9, and the output should be:

1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344
3
  • 8
    Never apologize for not speaking a foreign language perfectly! It is hard, and you are trying, and that is all anyone can ask!
    – terdon
    Commented Jan 4 at 10:58
  • 1
    What if there's a tie (equal longest contiguous sequence of lines satisfying the criteria)? Do you want the first such sequence, the last such sequence, or something else? Commented Jan 5 at 7:48
  • 1
    By a tie, do you mean e.g. 14 lines and 14 lines? I need only one longest sequence, the first one found. Commented Jan 5 at 9:09

7 Answers

5

Perl one line:

perl -ne '$n = (split)[1]; if ($n >= 2) {if ($i > $max) {$longest=$cur; $max=$i}; $cur=""; $i=0} else {$cur .= $n . "\n"; $i++} END {print $i > $max ? $cur : $longest}' < file.txt

Multi line for better readability:

perl -ne '
  $n = (split)[1];
  if ($n >= 2) {
    # a value of 2 or more ends the current run
    if ($i > $max) {
      $longest = $cur;
      $max = $i;
    }
    # always reset, even when the finished run was not the longest
    $cur = "";
    $i = 0;
  } else {
    $cur .= $n . "\n";
    $i++;
  }
  END {
    print $i > $max ? $cur : $longest
  }' < file.txt

One liner with awk:

awk '$2 >= 2 { if (i > max) {res=cur; max=i} cur=""; i=0 } $2 < 2 {cur = cur $2 "\n"; i++} END {if (i > max) res=cur; printf "%s", res}' file.txt

Multi line:

awk '
  $2 >= 2 {
    # a value of 2 or more ends the current run
    if (i > max) {
      res=cur
      max=i
    }
    # always reset, even when the finished run was not the longest
    cur=""
    i=0
  }
  $2 < 2 {
    cur = cur $2 "\n"
    i++
  }
  END {
    if (i > max) res=cur
    printf "%s", res
  }' file.txt
3

This is not quite a trivial task. There is also debate whether providing a finished program is helpful for others learning to solve a problem in a programming language, but I believe it has its merits, so I propose the following program (let's call it findlongestsequence.pl):

#!/usr/bin/perl
use strict;
use Getopt::Long;

my $limit; my $infile;
GetOptions( 'limit=f' => \$limit, 'infile=s' => \$infile );

my $lineno=0; my $groupstart;
my $currlength=0; my $maxlength=0; my $ingroup=0;
my @columns; my @groupbuf; my @longestgroup;

if (! open(fileinput, '<', "$infile" )) {exit 1;};
while (<fileinput>)
{
    $lineno++;
    @columns = split(/\s+/,$_);

    if ( $ingroup == 0 && $columns[1]<$limit )
    {
        $ingroup=1;
        $groupstart=$lineno;
        @groupbuf=();
    }

    if ( $ingroup == 1 )
    {
        if ($columns[1]>=$limit )
        {
            $ingroup=0;
            $currlength=$lineno-$groupstart;
    
            if ( $currlength>$maxlength )
            {
                $maxlength=$currlength;
                @longestgroup=@groupbuf;
            }
        }
        else
        {
            push(@groupbuf,$columns[1]);
        }
    }
}
close(fileinput);

if ( $ingroup == 1 )
{
    # at end-of-file the last line still belongs to the group, hence the +1
    $currlength=$lineno-$groupstart+1;
    if ( $currlength>$maxlength )
    {
        $maxlength=$currlength;
        @longestgroup=@groupbuf;
    }
}

print join("\n",@longestgroup),"\n";
exit 0;

You can call the program as

./findlongestsequence.pl --infile input.txt --limit 2.0

This will first interpret the command-line parameters using Getopt::Long.

It will then open the file and read it line by line, keeping a line counter in $lineno. Every line will be split into columns at whitespace.

  • If the program is not inside a group of lines with values < $limit ($ingroup is zero), but encounters a suitable line, it will record that it is now in such a group ($ingroup set to one), store the group start in $groupstart and start buffering the column 2 values in an array @groupbuf.
  • If the program is inside such a group, but the current value is greater than or equal to $limit, it will recognize the end-of-group and calculate its length. If this is longer than the previously recorded longest group, the content (@groupbuf) and length ($currlength) of the new longest group are copied to @longestgroup and $maxlength, respectively.

Since it is possible that a group is terminated by end-of-file rather than by a line with value >= $limit, this check is also performed if $ingroup is true at end-of-file.

At the end, the content of @longestgroup is printed with \n as token separator.
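
The end-of-file case is easy to get wrong, so here is a minimal, self-contained sketch of the same grouping idea working on an in-memory list instead of a file. The sub name longest_group and the sample values are assumptions made for illustration; they are not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first longest run of consecutive
# values that are numerically below $limit.
sub longest_group {
    my ($limit, @values) = @_;
    my (@current, @longest);

    # The trailing undef acts as a sentinel, so a run that reaches the
    # end of the data is closed by the same code path as any other run.
    for my $value (@values, undef) {
        if (defined $value && $value < $limit) {
            push @current, $value;
            next;
        }
        # Strict ">" keeps the earlier run when two runs tie in length.
        @longest = @current if @current > @longest;
        @current = ();
    }
    return @longest;
}

# The longest run (three values) ends at the last element.
print join("\n", longest_group(2.0, 5, 1.5, 1.2, 3, 1.1, 1.3, 1.4)), "\n";
```

Using a sentinel avoids repeating the "is this the longest group so far" check after the loop, which is the part that has to be duplicated in the file-based program.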

4
  • Hmmm, without regex... Very interesting Commented Jan 4 at 11:53
  • @694201970 Well, if you look at the split call, there is one regex (albeit only to specify where to split into columns) ;)
    – AdminBee
    Commented Jan 4 at 11:59
  • i wanna use this ^(?:1\.\d{2}|[0-1]\.\d{2,20}|2\.00-)$ Commented Jan 4 at 12:10
  • 2
    @694201970 I understand you want to use regex, and I can give you an answer based on regex, but maybe it's better if you try to explain how you expect your regex to work. Why ?:1 at the beginning? Why - at the end? Try to explain, and then we can tell you where you made your mistake. Also try to show this inside your own perl code.
    – aviro
    Commented Jan 4 at 12:24
2

Using any awk:

$ cat tst.awk
$2 >= 2 {
    max = getMax(cur,max)
    cur = ""
    next
}
{ cur = cur $2 ORS }
END {
    printf "%s", getMax(cur,max)
}
function getMax(a,b) {
    return ( gsub(ORS,"&",a) > gsub(ORS,"&",b) ? a : b )
}

$ awk -f tst.awk file
1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344
2
  • yea, it works, but i see in output 12.022760651465799 between 1.xxxxx and 1.xxxxx Commented Jan 5 at 17:20
  • That's impossible since 12.022760651465799 isn't present in the input. Given the input you provided in your question for us to test with, my script produces the expected output you provided for us to test with, as shown in my answer.
    – Ed Morton
    Commented Jan 5 at 17:52
1

Maybe something like:

<input perl -snle '
  if ($_ < $limit) {
    $n++;
  } else {
    $max = $n if $n > $max;
    $n = 0;
  }
  END {
    print ($n > $max ? $n : $max);
  }' -- -limit=2 -max=0

Or if instead of the number of lines in that largest group of lines you want to see those lines as per newer edits to your question:

<input perl -snle '
  if ($_ < $limit) {
    push @lines, $_;
  } else {
    @max = @lines if @lines > @max;
    @lines = ();
  }
  END {
    print for @lines > @max ? @lines : @max;
  }' -- -limit=2

If, as someone edited in your question, the line numbers are part of the data, add the -a option (awk mode where records are split into the @F array) and replace $_ (the whole record) with $F[1] (the second field, $F[0] being the first).
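
To make the -a variant concrete, the sketch below spells out roughly what -n, -l and -a do implicitly: each line is read, chomped and split on whitespace into @F, and the comparison then uses $F[1]. The sub name longest_run and the in-memory sample data are assumptions for illustration only; unlike the one-liner, this sketch also treats a line without a second field as the end of a run.

```perl
use strict;
use warnings;

# Hypothetical helper: read "lineno value" records from a filehandle and
# return the first longest run of second-column values below $limit.
sub longest_run {
    my ($fh, $limit) = @_;
    my (@lines, @max);
    while (my $line = <$fh>) {
        chomp $line;                 # -l removes the line ending on input
        my @F = split ' ', $line;    # -a splits each record into @F
        if (defined $F[1] && $F[1] < $limit) {
            push @lines, $F[1];
        } else {
            @max = @lines if @lines > @max;
            @lines = ();
        }
    }
    # A run that reaches end-of-file may still be the longest one.
    return @lines > @max ? @lines : @max;
}

# Demo on an in-memory file holding part of the sample data.
my $data = "2 11.43483\n3 1.19819\n4 1.1113829\n10 2.1117454\n11 1.17465\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print "$_\n" for longest_run($fh, 2);
close $fh;
```

For the five sample records in the demo, the two values from lines 3 and 4 form the longest run and are printed one per line.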

9
  • 1
    nope, just print 114035 (count string in a file) Commented Jan 4 at 9:52
  • @694201970 I just realised the first number in the lines of the sample in your question was not part of the data. See edit Commented Jan 4 at 10:01
  • Even though your post doesn't answer the question, your original answer (before the edit) was actually better (in terms of what you were attempting to answer) than the current answer. Before the edit you were just missing the -n for the loop. Now it just takes the entire line without any splitting.
    – aviro
    Commented Jan 4 at 13:33
  • @avriro, -a which I had before, implies -n. The OP indicates (not very clearly) that the line numbers are not part of the data Commented Jan 4 at 15:08
  • Oops, my -n comment was wrong. But anyway, the OP does say it's part of the data. In the original question he said he only wanted to check only the second column. Also now, after the edit: "The first column only contains the line number, which is not of interest here, but it is present in the file"
    – aviro
    Commented Jan 4 at 15:38
1

Idiomatic solution using <> for reading input and the flipflop operator.

#!/usr/bin/env perl
use strict;
use warnings;
# https://unix.stackexchange.com/questions/766081/how-to-print-the-longest-sequence-of-lines-featuring-numbers-smaller-than-a-thre
my $threshold = 2.00;
my ($section, $maxsection, $len, $maxlen);
my $flipflop;
while (<>) {
    # Remove leading line number
    s/^(\d+)\s+//;
    # Flip flop operator
    # https://www.effectiveperlprogramming.com/2010/11/make-exclusive-flip-flop-operators/
    if ($flipflop = $_ < $threshold .. $_ >= $threshold) {
        if ($flipflop =~ /E0$/) {
            # End of section
            if (!defined($maxlen) || $len > $maxlen) {
                $maxsection = $section;
                $maxlen = $len;
            }
            $len = 0;
            $section = "";
        } else {
            $len++;
            $section .= $_;
        }
    }
}
# One last possible end of section
if ($flipflop && $len > ($maxlen // 0)) {
    $maxsection = $section;
}
print $maxsection;
1
  • Nice solve too. Commented Jan 5 at 5:18
1

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my (@max,@tmp);  $_ .= words;  \
             if .[1]  < 2 { @tmp.push: .[1] };    \
             if .[1] !< 2 { @max = @tmp if @tmp.elems > @max.elems; @tmp = Empty };  \
             END @max.elems >= @tmp.elems ?? (.put for @max) !! (.put for @tmp);'  file

OR:

~$ raku -ne 'BEGIN my (@max,@tmp);  $_ .= words;  \
             when .[1]  < 2 { @tmp.push: .[1] };  \
             default { @max = @tmp if @tmp.elems > @max.elems; @tmp = Empty };  \
             END @max.elems >= @tmp.elems ?? (.put for @max) !! (.put for @tmp);'  file

Here are answers written in Raku, a member of the Perl-family of programming languages. Raku features rational numbers, if ever you need to maintain precision when performing simple math operations (e.g. say 0.1 + 0.2 - 0.3;).

  • The first answer reads lines into $_ using the -ne non-autoprinting linewise flags. Both a @max and a @tmp array are declared. The line is broken on whitespace into words and saved back into $_ with .= . If (if statement) the second column .[1] satisfies the criteria, the value is pushed onto the @tmp array. If not, the @tmp array overwrites the @max array if it has more elems (elements), and in either case the @tmp array is then emptied (set to Empty). At the END, to check whether a final contiguous sequence is the longest, Raku's Test ?? True !! False ternary operator is used to output the longer array.

  • The second answer is similar to the first except when statements are used. In Raku once a when conditional is satisfied its associated block is executed and control reverts to the outer block, skipping any subsequent when or default statements. See reference below.

Sample Input:

1 16.3346384
2 11.43483
3 1.19819
4 1.1113829
5 1.0953443
6 1.9458343
7 1.345645
8 1.3847385794
9 1.3534344
10 2.1117454
11 1.17465
12 1.4587485

Sample Output:

1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344

NOTE: The code above will output the first longest contiguous sequence in the case of a tie.

https://docs.raku.org/syntax/when
https://docs.raku.org/
https://raku.org

-2

If you do not want to over-engineer, try this command line one-liner:

awk '{print $2}' yourfile.txt | sort -g > youroutput.txt
  1. The first command will pick the second column of your file
  2. The second command will sort the selected column based on general numeric sort and write into the output file. For more details and fiddling, check the man pages of awk and sort.
2
  • This doesn't print the longest sequence of lines whose second field is less than two, it simply prints all the values of the second field but sorted. This isn't what the question asks for.
    – terdon
    Commented Jan 4 at 10:51
  • Ah you're right! I've obviously misread here.
    – Jas
    Commented Jan 4 at 10:56
