1

Given the file listed below, I need to select every block of lines that starts at a line matching the regex pattern Word A and ends just before the next line matching the regex pattern Word D.

Word A
Word B
Word C
Word D
Word E
Word F
Word G
Word A
Word H
Word I
Word D
Word J
Word A
Word K
Word D
Word L
Word M
Word A
Word D

Note the variable number of rows between A and D. Sometimes, D is the very next row. Here's what I need the output to be:

Word A
Word B
Word C
Word A
Word H
Word I
Word A
Word K
Word A

This can be done with awk, perl, python, or sed. It doesn't matter which, as long as it's installed on the RHEL6 server where the file is.

13
  • 1
    What if the file contains 2 Word Ds after Word A - stop at the first or the last one? What if there's Word A but no Word D - print to end of file or not? Word D without Word A - print from start of file? OtherWord A - should that match Word A? If Word D appears mid-line should the part of the line before it be printed? What if both exist on the same line? etc., etc.....
    – Ed Morton
    Commented Nov 22, 2023 at 13:00
  • See how-do-i-find-the-text-that-matches-a-pattern for considerations when asking a pattern matching question on how to make your requirements clear.
    – Ed Morton
    Commented Nov 22, 2023 at 13:03
  • @EdMorton "What if the file contains 2 Word Ds after Word A"? Stop at the first one.
    – RonJohn
    Commented Nov 22, 2023 at 15:26
  • 1
    Please edit your question to state your rainy-day requirements like that and include the rainy-day cases in your sample input/output so we have something to test a potential solution with. Regarding "That's a very low probability occurrence" - not handling those is where most software bugs show up.
    – Ed Morton
    Commented Nov 22, 2023 at 15:28
  • 1
    The thing is, with any pattern matching problem it's always FAR easier to match what you want than it is to not match similar text you don't want so it's important to think through and state what those rainy-day "similar text I don't want" cases are and how they should be handled, and include them in your sample input/output.
    – Ed Morton
    Commented Nov 22, 2023 at 19:35

7 Answers

4

Using AWK:

awk '/Word A/ { m = 1 } /Word D/ { m = 0 } m'
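
Since the question also allows Python, here is a minimal sketch of the same flag logic. It is an illustration only, not part of the original answer; the function name select_blocks and its parameters are hypothetical. As in the awk one-liner, the flag is switched on by a Word A match, switched off by a Word D match, and a line is emitted only while the flag is set.

```python
import re

def select_blocks(lines, start=r"Word A", stop=r"Word D"):
    """Yield each line from a start match up to, but not including, the next stop match."""
    start_re, stop_re = re.compile(start), re.compile(stop)
    printing = False
    for line in lines:
        if start_re.search(line):
            printing = True      # same role as: /Word A/ { m = 1 }
        if stop_re.search(line):
            printing = False     # same role as: /Word D/ { m = 0 }
        if printing:
            yield line           # same role as the bare pattern: m
```

Passing an open file object as lines and writing each yielded line to standard output should reproduce the awk behaviour. Like the awk version, a line that matches both patterns is treated as a stop line.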
2

Here's an awk solution

awk \
  -vstart='Word A' \
  -vend='Word D' \
  '{
     if ($0==end  ) {flag=0;next};
     if ($0==start) {flag=1};
     if (flag==1) {print $0};
  }'

Only a minor change is required to handle regex patterns:

awk \
  -vstart='Word[ ]A' \
  -vend='Word[ ]D' \
  '{
     if ($0 ~ end  ) {flag=0;next};
     if ($0 ~ start) {flag=1};
     if (flag==1) {print $0};
  }'
4
  • Close, but it does not work when Word D comes right after Word A. Also, I apparently wasn't explicit enough when I wrote that it must work on patterns; they'll be simple regex patterns.
    – RonJohn
    Commented Nov 22, 2023 at 2:37
  • Worked in my tests where A and D are on consecutive lines, e.g. tio.run/##S0oszvj/…
    – bxm
    Commented Nov 23, 2023 at 7:50
  • If you have a requirement where matches can occur on the same line in the input, this is not clear from the question as posed, so please amend accordingly.
    – bxm
    Commented Nov 23, 2023 at 12:02
  • I worked around the situation where Word D is on the same line as Word A (which happens every time) with a sed substitution that adds "marker text" to the beginning of every Word A line. Combined with @nezabudka's answer, my problem is solved.
    – RonJohn
    Commented Nov 24, 2023 at 2:05
2

Using Raku (formerly known as Perl_6)

~$ raku -ne '.put if / Word \h A / fff^ / Word \h D /;'  file

Raku is a programming language in the Perl family. It's an "operator-rich" language that features a powerful regex engine. The command above uses the -ne flags (line-by-line processing without autoprinting) in conjunction with Raku's sed-like fff "flip-flop" operator.

Raku includes several variants of its sed-like fff infix operator: fff^, ^fff and even ^fff^. Both regexes are still matched in every variant, but a ^ caret indicates that the line matched on that side of the operator is dropped from the output:

Sample Input:

Word A
Word B
Word C
Word D
Word E
Word F
Word G
Word A
Word H
Word I
Word D
Word J
Word A
Word K
Word D
Word L
Word M
Word A
Word D

Sample Output:

Word A
Word B
Word C
Word A
Word H
Word I
Word A
Word K
Word A

The above code solves the OP's test case. But what if the /start/ and /stop/ Regexes are actually on the same line? For that problem you could try Raku's awk-like ff operator:

~$ printf 'AB\nCD\nEF\n' | raku -ne 'say $_ if /A/ ff /B/;'
AB
~$ printf 'AB\nCD\nEF\n' | raku -ne 'say $_ if /A/ ff /C/;'
AB
CD

As compared to Raku's sed-like fff operator:

~$ printf 'AB\nCD\nEF\n' | raku -ne 'say $_ if /A/ fff /B/;'
AB
CD
EF
~$ printf 'AB\nCD\nEF\n' | raku -ne 'say $_ if /A/ fff /C/;'
AB
CD
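
For readers more at home in Python, the difference between the two operators, and the effect of the ^ carets, can be modelled roughly as follows. This is a sketch of the behaviour described above, not Raku's implementation; the function flip_flop and its keyword arguments are hypothetical names chosen for this illustration.

```python
import re

def flip_flop(lines, start, stop, same_line=False, drop_start=False, drop_stop=False):
    """Rough model of a flip-flop range over lines.

    same_line=True  : awk-like (ff), stop is also tested on the line that matched start.
    same_line=False : sed-like (fff), stop is first tested on the following line.
    drop_start / drop_stop model the ^ carets that exclude the endpoint lines.
    """
    start_re, stop_re = re.compile(start), re.compile(stop)
    active = False
    for line in lines:
        if active:
            is_start = False
        else:
            if not start_re.search(line):
                continue             # outside any range
            active = True
            is_start = True
        # The stop pattern is skipped on the start line unless same_line is set.
        is_stop = (same_line or not is_start) and stop_re.search(line) is not None
        if is_stop:
            active = False
        if (is_start and drop_start) or (is_stop and drop_stop):
            continue                 # endpoint excluded by a caret
        yield line
```

Under this model, flip_flop(lines, "Word A", "Word D", drop_stop=True) plays the role of fff^ in the command at the top of this answer.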

https://docs.raku.org/routine/fff
https://docs.raku.org/routine/ff
https://raku.org

1

GNU sed only:

sed '/Word A/!d;:1;n;/Word D/d;b1' file

In more complex cases, where the input may contain invalid blocks:

sed -n '/Word A/!b;:1;/Word A/h;n;/Word D/{g;p;d};H;b1' file
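
One way to read the second command: it collects each block in the hold space and prints it only once the closing Word D arrives, while a further Word A before that point restarts the collection. The Python sketch below follows that reading. It is an interpretation rather than the answerer's code, and the name complete_blocks is hypothetical.

```python
import re

def complete_blocks(lines, start=r"Word A", stop=r"Word D"):
    """Yield only blocks that are closed by a stop line; unfinished blocks are dropped."""
    start_re, stop_re = re.compile(start), re.compile(stop)
    block = None                      # None means "not inside a block"
    for line in lines:
        if block is not None and stop_re.search(line):
            for kept in block:        # block is complete: emit it without the stop line
                yield kept
            block = None
        elif start_re.search(line):
            block = [line]            # a new start discards any unfinished block
        elif block is not None:
            block.append(line)
    # A block still open at end of input is never emitted.
```

The plain for loop over the block (instead of newer generator syntax) is a deliberate choice so that the sketch has a chance of running on older Python interpreters as well.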
1

TXR Lisp's awk macro supports this directly; the rng (range) operator has nine variants for various ways of excluding records from the start or end of a range:

$ txr -e '(awk ((rng- #/Word A/ #/Word D/)))' data
Word A
Word B
Word C
Word A
Word H
Word I
Word A
Word K
Word A

Also, unlike Awk's range operator, it combines with other operators. E.g. suppose you wanted to print records which are simultaneously in a foo to bar range, and in a start to end range, no matter how those kinds of ranges overlap in the data:

(awk ((and (rng #/foo/ #/bar/)
           (rng #/start/ #/end/))))
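
The idea of intersecting two ranges can be sketched in Python by keeping one flag per range and updating every flag on every line. This models the concept only, not TXR's rng operator: the Range class and in_all_ranges function are hypothetical, and the sketch assumes inclusive ranges in which the end pattern is first tested on the line after the start match.

```python
import re

class Range(object):
    """Inclusive range flag: true from a start match through the next stop match."""

    def __init__(self, start, stop):
        self.start_re = re.compile(start)
        self.stop_re = re.compile(stop)
        self.active = False

    def update(self, line):
        """Feed one line; return True if that line lies inside the range."""
        if not self.active:
            self.active = self.start_re.search(line) is not None
            return self.active
        if self.stop_re.search(line):
            self.active = False       # the stop line itself still counts as inside
        return True

def in_all_ranges(lines, ranges):
    """Yield the lines that fall inside every one of the given ranges."""
    for line in lines:
        # Build the full list first so every range sees every line (no short-circuit).
        flags = [r.update(line) for r in ranges]
        if all(flags):
            yield line
```

Updating all ranges before combining the flags matters: with a short-circuiting and, a range could miss its own start or stop line whenever an earlier range happened to be inactive.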
3
  • I'll have to take your word for it.
    – RonJohn
    Commented Nov 24, 2023 at 2:00
  • Never heard of TXR Lisp. Will have to investigate. Commented Dec 24, 2023 at 3:53
  • Your final example: tried similar with Raku (a.k.a Perl6) and it works! raku -ne '.put if (/ A / fff / C /) & (/ B / fff / D /);' file. Commented Dec 24, 2023 at 3:54
0

Using awk:

$ awk '
    $0 == "Word A" { f=1; rec=$0; next }
    $0 == "Word D" && f { print rec; f=0; rec="" }
    f{rec = rec ORS $0}'

# For regex pattern
$ awk '
    (/Word A/ && !/Word D/) { f=1; rec=$0; next }
    (/Word D/ && rec){ print rec; f=0; rec="" }
    f{rec = rec ORS $0}
'

If every Word A is eventually followed by a matching Word D, then the following command may be used.

$ awk '/Word A/,/Word D/ { if (!/Word D/) print }'
-2

sed lets one do arithmetic on line specifications:

sed -n -e '/Word A/,/Word D/-1p' The_File

Read man sed.

5
  • This doesn't seem to be supported by GNU sed - range addresses only appear to allow positive offsets relative to the start of the range (like /Word A/,+3p). However you could do /Word A/,/Word D/{/Word D/!p} I think. Commented Nov 22, 2023 at 1:06
  • Tested in GNU sed; does not work.
    – RonJohn
    Commented Nov 22, 2023 at 2:30
  • @steeldriver sed -n -e '/Word A/,/Word D/{/Word D/!p}' The_file works. Make this an answer, and I'll accept.
    – RonJohn
    Commented Nov 22, 2023 at 2:41
  • @RonJohn if you use range expressions then you end up having to specify the same regexp twice while if you use a flag you don't. That makes a flag solution better than a range solution. Sed doesn't have variables to use as flags but awk does.
    – Ed Morton
    Commented Nov 22, 2023 at 12:56
  • Not my downvote by the way.
    – Ed Morton
    Commented Nov 22, 2023 at 13:05
