How can I print the longest sequence of lines featuring numbers smaller than a threshold?

Question

I am learning Perl, but I don't know how to solve this problem.

I have a .txt file of the following form:

1 16.3346384
2 11.43483
3 1.19819
4 1.1113829
5 1.0953443
6 1.9458343
7 1.345645
8 1.3847385794
9 1.3534344
10 2.1117454
11 1.17465
12 1.4587485

The first column only contains the line number, which is not of interest here, but it is present in the file; the values in the second column are the relevant part.

I want to output the longest contiguous sequence of lines whose second-column value is smaller than 2.00. For the above example, this would be lines 3 to 9, and the output should be:

1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344

Never apologize for not speaking a foreign language perfectly! It is hard, and you are trying, and that is all anyone can ask! — terdon, Commented Jan 4 at 10:58
What if there's a tie (equal longest contiguous sequence of lines satisfying the criteria)? Do you want the first such sequence, the last such sequence, or something else? — jubilatious1, Commented Jan 5 at 7:48

aviro · Accepted Answer · 2024-01-04 12:25:49Z

Perl one-liner:

perl -ne '$n = (split)[1]; if ($n >= 2) {if ($i > $max) {$longest=$cur; $max=$i}; $cur=""; $i=0} else {$cur .= "$n\n"; $i++} END {print $i > $max ? $cur : $longest}' < file.txt

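Multi-line for better readability:

perl -ne '
  $n = (split)[1];
  if ($n >= 2) {
    # A value at or above the threshold ends the current run.
    if ($i > $max) {
      $longest = $cur;
      $max = $i;
    }
    # Always reset, even when the run just ended was not the longest.
    $cur = "";
    $i = 0;
  } else {
    $cur .= "$n\n";
    $i++;
  }
  END {
    print $i > $max ? $cur : $longest;
  }' < file.txt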

One-liner with awk:

awk '$2 >= 2 { if (i > max) {res=cur; max=i} cur=""; i=0} $2 < 2 {cur = cur $2 "\n"; i++} END {if (i > max) res=cur; printf "%s", res}' file.txt

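Multi-line:

awk '
  $2 >= 2 {
    # A value at or above the threshold ends the current run.
    if (i > max) {
      res = cur
      max = i
    }
    # Always reset, even when the run just ended was not the longest.
    cur = ""
    i = 0
  }
  $2 < 2 {
    cur = cur $2 "\n"
    i++
  }
  END {
    if (i > max) res = cur
    printf "%s", res
  }' file.txt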

AdminBee · Accepted Answer · 2024-01-04 11:52:30Z

This is not quite a trivial task. There is also debate about whether providing a finished program is helpful for others learning to solve a problem in a programming language, but I believe it has its merits, so I propose the following program (let's call it findlongestsequence.pl):

#!/usr/bin/perl
use strict;
use Getopt::Long;

my $limit; my $infile;
GetOptions( 'limit=f' => \$limit, 'infile=s' => \$infile );

my $lineno=0; my $groupstart;
my $currlength=0; my $maxlength=0; my $ingroup=0;
my @columns; my @groupbuf; my @longestgroup;

open(fileinput, '<', $infile) or die "Cannot open '$infile': $!\n";
while (<fileinput>)
{
    $lineno++;
    @columns = split(/\s+/,$_);

    if ( $ingroup == 0 && $columns[1]<$limit )
    {
        $ingroup=1;
        $groupstart=$lineno;
        @groupbuf=();
    }

    if ( $ingroup == 1 )
    {
        if ($columns[1]>=$limit )
        {
            $ingroup=0;
            $currlength=$lineno-$groupstart;
    
            if ( $currlength>$maxlength )
            {
                $maxlength=$currlength;
                @longestgroup=@groupbuf;
            }
        }
        else
        {
            push(@groupbuf,$columns[1]);
        }
    }
}
close(fileinput);

if ( $ingroup == 1 )
{
    # The last line read is still part of the group, hence the +1.
    $currlength=$lineno-$groupstart+1;
    if ( $currlength>$maxlength )
    {
        $maxlength=$currlength;
        @longestgroup=@groupbuf;
    }
}

print join("\n",@longestgroup),"\n";
exit 0;

You can call the program as

./findlongestsequence.pl --infile input.txt --limit 2.0

This will first interpret the command-line parameters using Getopt::Long.

It will then open the file and read it line by line, keeping a line counter in $lineno. Every line will be split into columns at whitespace.

If the program is not inside a group of lines with values < $limit ($ingroup is zero), but encounters a suitable line, it will record that it is now in such a group ($ingroup set to one), store the group start in $groupstart and start buffering the column 2 values in an array @groupbuf.
If the program is inside such a group, but the current value is greater than or equal to $limit, it will recognize the end-of-group and calculate its length. If this is longer than the previously recorded longest group, the content (@groupbuf) and length ($currlength) of the new longest group are copied to @longestgroup and $maxlength, respectively.

Since a group may be terminated by end-of-file rather than by a line with value >= $limit, the same check is performed if $ingroup is still true at end-of-file.

At the end, the content of @longestgroup is printed with \n as token separator.
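
As a compact cross-check of the logic described above, here is a minimal awk sketch wrapped in a shell function. The function name longest_run and its argument order are assumptions made for this illustration and are not part of the program above. It also assumes that every input line has at least two whitespace-separated fields.

```shell
# longest_run LIMIT FILE   (hypothetical helper, not from the answer above)
# Prints the second-column values of the longest contiguous run of lines
# whose second column is strictly below LIMIT.
longest_run() {
  awk -v limit="$1" '
    $2 + 0 >= limit + 0 {          # this value ends the current group
      if (n > max) { max = n; best = cur }
      n = 0; cur = ""
      next
    }
    { cur = cur $2 "\n"; n++ }     # this value belongs to the current group
    END {
      if (n > max) best = cur      # group terminated by end-of-file
      printf "%s", best
    }
  ' "$2"
}
```

It could be called as longest_run 2.0 input.txt. Adding 0 to both sides forces a numeric comparison, and the END block plays the role of the end-of-file check discussed above.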

@694201970 Well, if you look at the split call, there is one regex (albeit only to specify where to split into columns) ;) – AdminBee, Commented Jan 4 at 11:59
@694201970 I understand you want to use regex, and I can give you an answer based on regex, but maybe it's better if you try to explain how you expect your regex to work. Why ?:1 at the beginning? Why - at the end? Try to explain, and then we can tell you where you made your mistake. Also try to show this inside your own perl code. – aviro, Commented Jan 4 at 12:24

Ed Morton · Accepted Answer · 2024-01-05 17:53:21Z

Using any awk:

$ cat tst.awk
$2 >= 2 {
    max = getMax(cur,max)
    cur = ""
    next
}
{ cur = cur $2 ORS }
END {
    printf "%s", getMax(cur,max)
}
function getMax(a,b) {
    return ( gsub(ORS,"&",a) > gsub(ORS,"&",b) ? a : b )
}

$ awk -f tst.awk file
1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344

Yes, it works, but I see 12.022760651465799 in the output between 1.xxxxx and 1.xxxxx – 69 420 1970, Commented Jan 5 at 17:20
That's impossible since 12.022760651465799 isn't present in the input. Given the input you provided in your question for us to test with, my script produces the expected output you provided for us to test with, as shown in my answer. – Ed Morton, Commented Jan 5 at 17:52

Stéphane Chazelas · Accepted Answer · 2024-01-04 16:41:30Z

Maybe something like:

<input perl -snle '
  if ($_ < $limit) {
    $n++;
  } else {
    $max = $n if $n > $max;
    $n = 0;
  }
  END {
    print ($n > $max ? $n : $max);
  }' -- -limit=2 -max=0

Or if instead of the number of lines in that largest group of lines you want to see those lines as per newer edits to your question:

<input perl -snle '
  if ($_ < $limit) {
    push @lines, $_;
  } else {
    @max = @lines if @lines > @max;
    @lines = ();
  }
  END {
    print for @lines > @max ? @lines : @max;
  }' -- -limit=2

If, as someone edited in your question, the line numbers are part of the data, add the -a option (awk mode where records are split into the @F array) and replace $_ (the whole record) with $F[1] (the second field, $F[0] being the first).

Nope, it just prints 114035 (the line count of the file) – 69 420 1970, Commented Jan 4 at 9:52
@694201970 I just realised the first number in the lines of the sample in your question was not part of the data. See edit – Stéphane Chazelas, Commented Jan 4 at 10:01
Even though your post doesn't answer the question, your original answer (before the edit) was actually better (in terms of what you were attempting to answer) than the current answer. Before the edit you were just missing the -n for the loop. Now it just takes the entire line without any splitting. – aviro, Commented Jan 4 at 13:33
@aviro, -a, which I had before, implies -n. The OP indicates (not very clearly) that the line numbers are not part of the data – Stéphane Chazelas, Commented Jan 4 at 15:08
Oops, my -n comment was wrong. But anyway, the OP does say it's part of the data. In the original question he said he wanted to check only the second column. Also now, after the edit: "The first column only contains the line number, which is not of interest here, but it is present in the file" – aviro, Commented Jan 4 at 15:38

Simon Branch · Accepted Answer · 2024-01-04 17:20:49Z

Idiomatic solution using <> for reading input and the flip-flop operator.

#!/usr/bin/env perl
use strict;
use warnings;
# https://unix.stackexchange.com/questions/766081/how-to-print-the-longest-sequence-of-lines-featuring-numbers-smaller-than-a-thre
my $threshold = 2.00;
my ($section, $maxsection, $len, $maxlen);
my $flipflop;
while (<>) {
    # Remove leading line number
    s/^(\d+)\s+//;
    # Flip flop operator
    # https://www.effectiveperlprogramming.com/2010/11/make-exclusive-flip-flop-operators/
    if ($flipflop = $_ < $threshold .. $_ >= $threshold) {
        if ($flipflop =~ /E0$/) {
            # End of section
            if (!defined($maxlen) || $len > $maxlen) {
                $maxsection = $section;
                $maxlen = $len;
            }
            $len = 0;
            $section = "";
        } else {
            $len++;
            $section .= $_;
        }
    }
}
# One last possible end of section (input ended while still inside a run)
if ($flipflop && $len && (!defined($maxlen) || $len > $maxlen)) {
    $maxsection = $section;
}
print $maxsection if defined $maxsection;

jubilatious1 · Accepted Answer · 2024-01-05 10:45:00Z

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my (@max,@tmp);  $_ .= words;  \
             if .[1]  < 2 { @tmp.push: .[1] };    \
             if .[1] !< 2 { @max = @tmp if @tmp.elems > @max.elems; @tmp = Empty };  \
             END @max.elems >= @tmp.elems ?? (.put for @max) !! (.put for @tmp);'  file

OR:

~$ raku -ne 'BEGIN my (@max,@tmp);  $_ .= words;  \
             when .[1]  < 2 { @tmp.push: .[1] };  \
             default { @max = @tmp if @tmp.elems > @max.elems; @tmp = Empty };  \
             END @max.elems >= @tmp.elems ?? (.put for @max) !! (.put for @tmp);'  file

Here are answers written in Raku, a member of the Perl-family of programming languages. Raku features rational numbers, if ever you need to maintain precision when performing simple math operations (e.g. say 0.1 + 0.2 - 0.3;).

The first answer reads lines into $_ using the -ne non-autoprinting linewise flags. Both a @max and a @tmp array are declared. The line is broken on whitespace into words and saved back into $_ with .= . If the second column (.[1]) satisfies the criterion, the value is pushed onto the @tmp array. If not, the @tmp array overwrites the @max array if it has more elems (elements), and in either case the @tmp array is then emptied (set to Empty). At the END, to check whether a final contiguous sequence is the longest, Raku's Test ?? True !! False ternary operator is used to output the longer array.
The second answer is similar to the first except that when statements are used. In Raku, once a when conditional is satisfied, its associated block is executed and control reverts to the outer block, skipping any subsequent when or default statements. See reference below.

Sample Input:

1 16.3346384
2 11.43483
3 1.19819
4 1.1113829
5 1.0953443
6 1.9458343
7 1.345645
8 1.3847385794
9 1.3534344
10 2.1117454
11 1.17465
12 1.4587485

Sample Output:

1.19819
1.1113829
1.0953443
1.9458343
1.345645
1.3847385794
1.3534344

NOTE: The code above will output the first longest contiguous sequence in the case of a tie.
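
To make the tie-breaking choice explicit, here is a minimal awk sketch, wrapped in a shell function, that can return either the first or the last of several equally long runs. The function name longest_run_tie and the "first"/"last" mode argument are assumptions for this illustration and do not appear in the Raku code above.

```shell
# longest_run_tie LIMIT MODE FILE   (hypothetical interface; MODE is "first" or "last")
longest_run_tie() {
  awk -v limit="$1" -v mode="$2" '
    function keep() {
      # A strictly longer run always wins; in "last" mode an equally
      # long, non-empty later run also replaces the stored one.
      if (n > max || (mode == "last" && n == max && n > 0)) { max = n; best = cur }
      n = 0; cur = ""
    }
    $2 + 0 >= limit + 0 { keep(); next }   # value ends the current run
    { cur = cur $2 "\n"; n++ }             # value extends the current run
    END { keep(); printf "%s", best }      # handle a run ending at end-of-file
  ' "$3"
}
```

For example, longest_run_tie 2 first file keeps the earliest longest run, while longest_run_tie 2 last file keeps the latest one.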

https://docs.raku.org/syntax/when
https://docs.raku.org/
https://raku.org

Jas (answered Jan 4 at 10:05; edited by terdon Jan 4 at 10:47) · Accepted Answer · 2024-01-04 10:47:55Z

If you do not want to over-engineer, try this command line one-liner:

awk '{print $2}' yourfile.txt | sort -g > youroutput.txt

The first command picks the second column of your file.
The second command sorts the selected column using general numeric sort and writes it to the output file. For more details and fiddling, check the man pages of awk and sort.

This doesn't print the longest sequence of lines whose second field is less than two, it simply prints all the values of the second field but sorted. This isn't what the question asks for. – terdon, Commented Jan 4 at 10:51
Ah you're right! I've obviously misread here. – Jas, Commented Jan 4 at 10:56
