gawk hangs when using a regex for RS combined with reading a continuous stream from stdin

Question

I'm streaming data using netcat and piping the output to gawk. Here is an example byte sequence that gawk will receive:

=AAAA;=BBBB;;CCCC==DDDD;

The data can include nearly any arbitrary characters but never contains NULL characters; = and ; are reserved as delimiters. Each chunk of arbitrary characters is always prefixed by one of the delimiters and always suffixed by one of the delimiters, but either delimiter can be used at any time: = is not always the prefix, and ; is not always the suffix. The sender never writes a chunk without also writing an appropriate prefix and suffix. As the data is parsed, I need to distinguish which delimiter was used, so that my downstream code can properly interpret that information.

Since this is a network stream, stdin remains open after this sequence is read, as it waits for future data. I'd want gawk to read until either delimiter is encountered, and then execute the body of my gawk script with whatever data was found, while ensuring that it properly handles the continuous stream of stdin. I explain this in more detail below.

Thus far

Here is what I have attempted thus far (zsh script, using gawk, on macOS). For this post, I simplified the body to just print the data - my full gawk script has a much more complicated body. I also simplified the netcat stream to instead just cat a file (along with cat'ing stdin in order to mimic the stream behavior).

cat example.txt - | gawk '
BEGIN {
    RS = "=|;";
}
{
    if ($0 != "") {
        print $0;
        fflush();
    }
}
'

example.txt

=AAAA;=BBBB;=CCCC;=DDDD;

My attempt successfully handles most of the data, up until the most recent record. It hangs waiting for more data from stdin and fails to execute the body of my script for the most recent record, despite an appropriate delimiter clearly being available in stdin.

Current output: (fails to process the most-recent record of DDDD)

AAAA
BBBB
CCCC
[hang here, waiting for future data]

Desired output: (successfully processes all records, including the most-recent)

AAAA
BBBB
CCCC
DDDD
[hang here, waiting for future data]

What, exactly, could be the cause of this problem, and how can I potentially address it? I recognize that this seems to be somewhat of an edge-case scenario. Thank you all very much for your help!

Edit: Comment consolidation, misc clarifications, and various observations/realizations

Here are some misc observations I found during debugging, both before and after I originally made this post. These edits also clarify some questions that came up in the comments, and consolidate the info scattered across various comments into a single place. Also includes some realizations I made about how gawk works internally, based on the extremely insightful information in the comments. Info in this edit supersedes any potentially conflicting info that may have been discussed in the comments.

1. I briefly investigated whether this could be a pipe buffering issue imposed by the OS. After using the stdbuf tool to disable all pipe buffering, it seems that buffering is not the problem at all, at least not in the traditional sense (see item #3).
2. I noticed that if stdin is closed and a regex is used for RS, no problems occur. Conversely, if stdin remains open and RS is not a regex (i.e. a plaintext string), no problems occur either. The problem only occurs if both stdin remains open and RS is a regex. Thus, we can reasonably assume that it's related to how the regex engine handles a continuous stream on stdin.
3. I noticed that if my RS regex (RS = "=|;";) is 3 characters long and stdin remains open, it stops hanging after exactly 3 additional characters appear in stdin. If I adjust the length of my regex to be 5 chars (RS = "(=|;)"), the number of additional characters necessary to return from hanging adjusts accordingly. Combined with the extremely insightful discussion with Kaz, this establishes that the hanging is an artifact of the regex engine itself. Like Kaz said, when the regex engine evaluates RS = "=|;";, it ends up trying to read additional characters from stdin in order to be sure that the regex is a match, despite this additional read not being strictly necessary for the regex in question, which obviously causes a hang waiting on stdin. I also tried adding lazy quantifiers to the regex, which in theory would let the regex engine return immediately, but alas it does not, as this is an implementation detail of the regex engine.
4. The gawk docs here and here state that when RS is a single character, it is treated as a plaintext string, and causes RS to match without invoking the regex engine. Conversely, if RS has 2 or more characters, it is treated as a regex, and the regex engine will be invoked (subsequently bringing the problem discussed in item #3 into play). However, this seems to be slightly misleading, due to an implementation detail of gawk. I tried RS = "xy"; (and adjusted my data accordingly), and re-tested my experiment from #3. No hanging occurred and the correct output was printed, which must mean that despite RS being 2 characters, it is still being treated as a plaintext string: the regex engine is never invoked, and the hanging problem never occurs. So, there seems to be some further filtering on whether RS is treated as plaintext or as a regex.
5. So, now that we've figured out the root cause of the problem, what do we do about it? An obvious idea would be to avoid using regex, but that points toward writing a custom data parser in C or some other language. This hypothetical custom program would parse the input entirely from scratch, and gawk/regex would never be involved anywhere in the lifecycle of my script. Although I could do this, and this would certainly solve the problem, the extent of my full data parsing is somewhat complex, so I'd rather not go down this path of weeds.
6. This brings us to Ed Morton's workaround, which is probably the best way to go, or some derivative thereof. Summarizing his approach below:

Basically, use other CLI tools to do an ahead-of-time conversion, before data is given to gawk, to add a suffixed NULL character after each potential delimiter. Next, invoke gawk with RS as the NULL character, which would treat RS as a plaintext string and not a regex, which means the hanging problem never comes into play. From there, the real delimiter and data chunk could be decoded and processed in whatever way you want.
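One possible way to add those suffixed NULL characters, as a rough sketch only: a small Python 3 pipe stage. The names mark_delimiters and pump are made up for illustration, and the sketch assumes the delimiters are exactly = and ; and that the input never contains NUL bytes.

```python
import os

DELIMS = b"=;"  # the two delimiter bytes named in the question


def mark_delimiters(chunk: bytes) -> bytes:
    """Return chunk with a NUL byte appended after every delimiter byte."""
    out = bytearray()
    for byte in chunk:
        out.append(byte)
        if byte in DELIMS:
            out.append(0)
    return bytes(out)


def pump(in_fd: int = 0, out_fd: int = 1) -> None:
    """Copy in_fd to out_fd, marking delimiters, until EOF on in_fd."""
    while True:
        # os.read returns as soon as any data is available on a pipe,
        # so a partially filled read does not wait for 4096 bytes.
        chunk = os.read(in_fd, 4096)
        if not chunk:  # EOF
            break
        os.write(out_fd, mark_delimiters(chunk))
```

A script that calls pump() could sit between netcat and gawk, with gawk run as gawk -v RS='\0'. Each non-empty record would then end in the delimiter that closed it, so the last character of $0 tells the downstream code which delimiter was used.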

Although I have now marked Ed's answer as the solution, I think that my final solution will be a hybrid of Ed's approach, Kaz's insight, some subsequent realizations I made thanks to them, and some arbitrary approach that I can come up with in order to add those suffixed NULL characters. Wish I could mark two answers as solutions! Thank you everyone for your help, especially Ed Morton and Kaz!

If you don't have newlines in your input stream you could try gawk -v FPAT='[^;=]+' '{ for(i = 1; i <= NF; i++) { print $i; fflush() }}'. — Renaud Pacalet, Commented Jul 3 at 4:49
@RenaudPacalet Unfortunately, there are newlines in the input stream, and plenty of them. The input stream can be any arbitrary characters, with the only stipulation being that = and ; are delimiters. — user12280249, Commented Jul 3 at 4:55
Then, you should probably edit your question and provide a more representative input/output example. — Renaud Pacalet, Commented Jul 3 at 5:27
@WalterA I did. I liked the fact you made it portable to non-GNU awks but I wasn't a fan of both the shell loop and the awk script needing to know the separator chars (control coupling between the 2 parts of the script) and a couple of other things so I added an alternative way to do essentially the same approach to the bottom of your script. Hope that was OK, feel free to remove it again if you like. — Ed Morton, Commented Jul 5 at 10:42
Just realised what I had added wouldn't work as it'd mean you couldn't tell an empty record between 2 ;s vs a ; with a ; I was adding after it so I removed my script again from your answer. I don't want to spend any more time on this question unless the GNU folks have another suggestion on how to do this in response to my bug report. — Ed Morton, Commented Jul 5 at 10:52

Ed Morton · Accepted Answer · 2024-07-03 13:05:03Z

A workaround is to insert a shell read loop into the pipeline to carve the original awk input (the OP's actual netcat output) up into individual characters and then feed them to awk one at a time:

cat example.txt - |
while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
awk -v RS='\0' '
    /[;=]/ { if (rec != "") { print rec; fflush() }; rec=""; next }
    { rec=rec $0 }
'
AAAA
AAAA
AAAA
AAAA

That requires GNU awk or some other that can handle a NUL character as the RS as that's non-POSIX behavior. It does assume your input can't contain NUL bytes, i.e. it's a valid POSIX text "file".

Read on for how we got there if interested...

I thought there was at least 1 bug here as I found multiple oddities (see below) so I opened a gawk bug report at https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00006.html but per the gawk provider, Arnold, the differences in behavior in this case are just implementation details of having to read ahead to ensure the regexp matches the right string.

It seems there are 3 issues at play here, e.g. using GNU awk 5.3.0 on cygwin:

1. Different supposedly equivalent regexps produce different behavior:

$ printf 'A;B;C;\n' > file

$ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
1 A

$ cat file - | awk -v RS=';|=' '{print NR, $0}'
1 A
2 B

$ cat file - | awk -v RS='[;=]' '{print NR, $0}'
1 A
2 B
3 C

(;|=), ;|= and [;=] should be equivalent but clearly they aren't in this case.

The good news is you can apparently work around that problem using a bracket expression as in the 3rd case above instead of an "or".

2. The output record trails the input record when the record separator character is the last one in the input, e.g. with no newline after the last ;:

$ printf 'A;B;C;' > file

$ cat file - | awk -v RS='(;|=)' '{print $0; fflush()}'

$ cat file - | awk -v RS=';|=' '{print $0; fflush()}'
A

$ cat file - | awk -v RS='[;=]' '{print $0; fflush()}'
A
B

The bad news is that this impacts the OP's example:

$ printf ';AAAA;BBBB;CCCC;DDDD;' > file

With a literal character RS:

$ cat file - | awk -v RS=';' '{print $0; fflush()}'

AAAA
BBBB
CCCC
DDDD

With a regexp RS that should also make that char literal:

$ cat file - | awk -v RS='[;]' '{print $0; fflush()}'

AAAA
BBBB
CCCC

$ printf ';AAAA;BBBB;CCCC;DDDD;x' > file

$ cat file - | awk -v RS='[;]' '{print $0; fflush()}'

AAAA
BBBB
CCCC
DDDD

3. Adding different characters to the RS bracket expression produces inconsistent behavior (I stumbled across this by accident):

$ printf 'A;B;C;\n' > file

$ cat file - | awk -v RS='[;|=]' '{print $0; fflush()}'
A

$ cat file - | awk -v RS='[;a=]' '{print $0; fflush()}'
A
B
C

FWIW I tried setting a timeout:

$ cat file - | awk -v RS='[;]' 'BEGIN{PROCINFO["-", "READ_TIMEOUT"]=100} {print $0; fflush()}'
A
B
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: error reading input file `-': Connection timed out

$ cat file - | awk -v RS='[;]' -v GAWK_READ_TIMEOUT=1 '{print $0; fflush()}'
A
B

and stdbuf to disable buffering:

$ cat file - | stdbuf -i0 -o0 -e0 awk -v RS='[;]' '{print $0; fflush()}'
A
B

and matching every character (thinking I could then use RT ~ /[=;]/ to find the separator):

$ cat file - | awk -v RS='(.)' '{print RT; fflush()}'
A
;
B
;
C

but none of them would let me read the last record separator so at this point I don't know what the OP could do to successfully read the last record of continuing input using a regexp other than something like this:

$ printf 'A;B;C;' > file

$ cat file - |
    while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
    awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'
A
B
C

and using the OP's sample input but with different text per record to make the mapping of input to output records clearer:

$ printf '=AAAA=BBBB;CCCC;DDDD=' > example.txt

$ cat example.txt - |
    while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
    awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'

AAAA
BBBB
CCCC
DDDD

We're using NUL chars as the delimiters and various options above to make the shell read loop robust enough to handle blank lines and other white space in the input, see https://unix.stackexchange.com/a/49585/133219 and https://unix.stackexchange.com/a/169765/133219 for details on those issues. We're additionally using a NUL char for the awk RS so it can distinguish between newlines coming from the original input vs a newline as a terminating character being added by the shell printf, otherwise rec in the awk script could never contain a newline as they'd ALL be consumed by matching the default RS.

We're using a pipe to/from the while-read loop instead of process substitution just to ease clarity since the OP is already using pipes.
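For readers who want to see the accumulate-until-delimiter logic of the awk stage outside of awk, here is a minimal Python 3 sketch of the same idea. It is not part of the pipeline above; the function name records is made up, and unlike the awk script it also reports which delimiter closed each record, since the OP needs that.

```python
from typing import Iterable, Iterator, Tuple

DELIMS = "=;"  # the two delimiter characters from the question


def records(chars: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record, delimiter) the moment a delimiter character arrives."""
    rec = []
    for ch in chars:
        if ch in DELIMS:
            if rec:  # skip empty records, like the rec != "" test in the awk script
                yield "".join(rec), ch
            rec = []
        else:
            rec.append(ch)


if __name__ == "__main__":
    for rec, delim in records("=AAAA=BBBB;CCCC;DDDD="):
        print(rec, delim)
```

Because each character is handled as it arrives and a record is emitted on the delimiter itself, nothing here has to read past the delimiter before reporting the record.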

@Armali it makes no difference to the problem of 3 supposedly equivalent regexps producing different results if the \n is there or not. I put it there to rule out the possibility that not having it (and so the input not being a valid POSIX text file) was causing the problem. — Ed Morton, Commented Jul 3 at 9:33
Even the 3rd case doesn't work without the \n (GNU Awk 4.1.4). — Armali, Commented Jul 3 at 9:42
@Armali get a newer version of gawk, that one is 8 years out of date and we're now on gawk 5.3.0, there's been several bug fixes and enhancements in between. — Ed Morton, Commented Jul 3 at 9:58
@WalterA while that solves one problem, it won't solve the OP's whole problem as item "2" from my list would still exist so they still wouldn't see the last record from the input unless they do what I show at the top of my answer. — Ed Morton, Commented Jul 3 at 14:32

Kaz · Accepted Answer · 2024-07-03 07:26:55Z

Awk is waiting for the record to be delimited. A record is delimited when one of two things happens: there is a match for the RS regex, or the input ends.

You've not given it either, because you used cat <file> -, which means that cat's output stream continues with standard input (your TTY) after <file> is exhausted.

You must use Ctrl-D on an empty line to generate the necessary EOF condition that Gawk is looking for.

Edit:

The issue is, why does the last record not appear even though it is delimited by the trailing =?

This behavior reproduces exactly in an Awk implementation that I wrote as a macro in a Lisp language, side by side with GNU Awk.

$ (echo -n 'AAAA=AAAA;AAAA;AAAA='; cat) | gawk 'BEGIN { RS = "=|;"; } { print $0; fflush(); }'
AAAA
AAAA
AAAA
# hangs here until Ctrl-D, then:
AAAA

Exactly the same thing:

$ (echo -n 'AAAA=AAAA;AAAA;AAAA='; cat) | txr -e '(awk (:set rs #/=|;/) (t))'
AAAA
AAAA
AAAA
# hangs here until Ctrl-D, then:
AAAA

In the case of the second Awk implementation, since I wrote everything from scratch, including the regex engine, I can explain its behavior, which gives a hypothesis about why Gawk behaves the same way.

The regex-delimited reading is based on a function written in C called read_until_match which is a wrapper for a helper called scan_until_common. This function works by feeding characters one by one from the stream into a regex state machine, checking the state.

Here is the thing. When the regex state machine says "we have a match!" we cannot stop there. The reason is that we need to find the longest match.

The function does not know that the regex is a trivial one-character regex, for which the first match is already the longest match. Therefore, it needs to feed one more character of the input. At that point, the regex state machine says "fail!". The function then knows that there had been a successful match previously. It backtracks to that point, pushing the extra character back into the stream.

So, of course, if there is no next character available in the stream, we get an I/O blocking hang.

Why it has to work this way is that some regexes successfully match prefixes of the longest match. A trivial example is: suppose we have #+ as a delimiter. When one # is seen, that's a match! But when another # is seen, that is also a match! We have to see all the # characters to get the full match, which means we have to see the first non-matching character which follows.
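The one-character lookahead described above can be modeled in a few lines of Python 3. This is only a toy model of the description, not the code of either Awk implementation: read_delimiter is a made-up name, and the early stop is a simplification that is valid only for patterns whose matches grow one character at a time, such as #+ or =|;.

```python
import re
from typing import Iterator, Tuple


def read_delimiter(pattern: str, stream: Iterator[str]) -> Tuple[str, int]:
    """Consume the longest delimiter at the head of stream.

    Returns (matched_text, number_of_characters_pulled_from_stream).
    """
    rx = re.compile(pattern)
    buf = ""
    matched = ""
    pulled = 0
    for ch in stream:  # on a live pipe, this is where the reader would block
        pulled += 1
        buf += ch
        if rx.fullmatch(buf):
            matched = buf  # a match, but a longer one may still follow
        else:
            break  # lookahead character; a real reader pushes it back
    return matched, pulled


if __name__ == "__main__":
    print(read_delimiter("#+", iter("###A")))
    print(read_delimiter("=|;", iter("=A")))
```

In both calls the number of characters pulled is one more than the length of the match. If that extra character has not arrived yet on an open pipe, the reader waits. The sketch does not try to reproduce the larger read-ahead distances the OP observed with gawk.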

GNU Awk cannot easily escape from doing something very similar; the theory calls for it.

A way to solve the problem would be to have a function maxmatchlen(R) which, for a regex R, reports the maximum length of a match for that regex (possibly infinite). maxmatchlen(/.*/) is Inf, but maxmatchlen(/abc/) is 3. You get the picture. With this function, we would know that if we have just fed the regex maxmatchlen characters, and the regex state machine is reporting a matching state, we are done; we don't have to look ahead into the stream.
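A minimal sketch of what such a maxmatchlen could look like, in Python 3, over a tiny hand-rolled regex tree. The tuple-based tree and the helper needs_lookahead are invented for this illustration and do not correspond to the internals of any Awk implementation.

```python
import math

# Hypothetical mini-AST, invented for this illustration only:
#   ("lit", c)       one literal character
#   ("any",)         any single character, like "."
#   ("cat", a, b)    a followed by b
#   ("alt", a, b)    a or b
#   ("star", a)      zero or more repetitions of a


def maxmatchlen(node):
    """Upper bound on the length of any match; math.inf if unbounded."""
    kind = node[0]
    if kind in ("lit", "any"):
        return 1
    if kind == "cat":
        return maxmatchlen(node[1]) + maxmatchlen(node[2])
    if kind == "alt":
        return max(maxmatchlen(node[1]), maxmatchlen(node[2]))
    if kind == "star":
        # Repeating something that can consume input has no upper bound.
        return math.inf if maxmatchlen(node[1]) > 0 else 0
    raise ValueError(f"unknown node kind: {kind!r}")


def needs_lookahead(node, matched_len):
    """True if a match of matched_len characters could still grow longer."""
    return matched_len < maxmatchlen(node)


ABC = ("cat", ("cat", ("lit", "a"), ("lit", "b")), ("lit", "c"))  # /abc/
DOT_STAR = ("star", ("any",))                                     # /.*/
EQ_OR_SEMI = ("alt", ("lit", "="), ("lit", ";"))                  # /=|;/
```

With these trees, maxmatchlen(ABC) is 3, maxmatchlen(DOT_STAR) is math.inf, and maxmatchlen(EQ_OR_SEMI) is 1. For the OP's regex, needs_lookahead(EQ_OR_SEMI, 1) is false, so a reader using this check could stop at the delimiter without touching the stream again.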

Thanks for the pointers! Unfortunately sending EOF is not an option because I'd want the network stream to continue indefinitely. You mentioned that gawk will delimit when a match for the RS regex occurs. Do you know why, specifically, my regex does not match the stdin data? I am using RS = "=|;";. I'd imagine this matches upon the first = or ;, and the most recent char in my stdin is indeed =. Correct me if I'm wrong, but I'd imagine this would cause a match, even though stdin is still open? Thanks! — user12280249, Commented Jul 3 at 5:47
I see. We have a trailing = record separator, so why does it hang without processing that last record which is clearly delimited? I suspect that the regex engine is in Gawk working with a character of lookahead, even though that specific regex doesn't require it. — Kaz, Commented Jul 3 at 7:06
I'm not abundantly clear why it's hanging, but after more debugging, I have my suspicions on two items: 1) Like you said, regex engine is trying to read more data in order to finish the regex, which would block, despite my regex not requiring this read. 2) Some sort of buffering issue. I noticed that if my input is =AAAA=AAAA;AAAA;AAAA=AA (23 characters), it hangs......but if I append another A (making it 24 characters, and crossing a 4-byte boundary).....then everything works correctly and produces my desired output. Very strange. I'm using macOS. Any suggestions perhaps? Thanks a bunch! — user12280249, Commented Jul 3 at 7:18
Yes, that regex doesn't require the read, but regexes in general do need to scan more characters in spite of hitting a matching state in their state machine. This seems worth fixing in my implementation. We can calculate the maxmatchlen property while we are compiling the regex, making it cheaply available at execution time. — Kaz, Commented Jul 3 at 7:29
Adding onto this, I also observed that....if my regex is 5 characters long....then my input data must have at least 5 characters after the most-recent delimiter for it to output proper results. If my input has less than 5 chars after, then it will do this truncation behavior. Similarly, if I adjust the length of my regex string, that also adjusts how many chars are necessary after the most-recent delimiter in order to invoke this behavior. Given that, this seems to point to being an artifact of the regex engine. — user12280249, Commented Jul 3 at 7:42

Walter A (edited by Ed Morton) · Accepted Answer · 2024-07-05 10:52:03Z

A combination of the solutions of @daweo and @EdMorton:
The OP wants logic that depends on discerning the two delimiters, and might want to use RT for it.
First use Ed's workaround for reading the input one character at a time.
When a = is found, add a ; after it as a delimiter.
In awk, fix up RT when the = is part of the record.

I will print the RT after printing $0.

cat example.txt - | 
while IFS= read -r -d '' -N1 char; do
  if [[ "$char" == '=' ]]; then
    printf "=;"
  else
    printf '%s' "$char"
  fi
done  | awk '
  BEGIN {
    RS = ";"
  }
  /=/ {
        RT="=";
        sub(/=/,"", $0) 
  }
  {
    if ($0 != "") {
        print $0 "(RT=" RT ")";
        fflush();
    }
  }
'

Result:

AAAA(RT==)
AAAA(RT=;)
AAAA(RT=;)
AAAA(RT==)

answered Jul 3 at 15:23 by Walter A; edited Jul 5 at 10:52 by Ed Morton

A good idea; still, I would use perl instead of bash +1 – Fravadona, Commented Jul 3 at 16:29
something like perl -npe 'BEGIN{$/ = \1; $| = 1} $_ .= ";" if $_ eq "="' – Fravadona, Commented Jul 3 at 18:35

Daweo · Accepted Answer · 2024-07-03 06:50:50Z

Multiple Line (The GNU Awk User's Guide) says that

RS == any single character

Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records. (...)

RS == regexp

Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records.(...)

Observe that "Leading and trailing" is mentioned only for the latter, so I suspect the source of the trouble might be how it is implemented in GNU AWK.

If you do not need to discern between = and ;, I propose the following workaround

cat -u example.txt - | sed -u 'y/;/=/' | gawk '
BEGIN {
    RS = "=";
}
{
    if ($0 != "") {
        print $0;
        fflush();
    }
}
'

which, for example.txt containing

=AAAA=AAAA;AAAA;AAAA=

gives output

AAAA
AAAA
AAAA
AAAA

and then hangs (waiting for future data). Explanation: I added GNU sed running in unbuffered mode (-u) with a single y command, which does

Transliterate any characters in the pattern space which match any of the source-chars with the corresponding character in dest-chars.

In this case it replaces ; with =. I then changed RS in the gawk command to the single-character string =.

(tested in GNU sed 4.8 and GNU Awk 5.1.0)

Thanks for the info! Unfortunately I'd need to discern between the two delimiters, as my complex gawk body script needs to handle that part. Any potential suggestions given this constraint? Thanks! — user12280249, Commented Jul 3 at 7:03
In my answer I improved above solution for discerning the delimiters. — Walter A, Commented Jul 3 at 21:21

Armali · Accepted Answer · 2024-07-04 06:06:25Z

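A solution which doesn't require changing the awk script: Since empty records are ignored by it, we can simply duplicate each record separator in a pipe stage inserted before gawk, e.g.

python -c '
import os
# Read stdin one byte at a time so that nothing is held back in a buffer.
for i in iter(lambda: os.read(0, 1), b""):
    os.write(1, i)
    # Write each delimiter twice; the copy is the lookahead byte gawk waits for.
    if i in b"=;": os.write(1, i)
' |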

That doesn't fix completely the problem with awk; at least -v RS=';|=' still won't work – Fravadona, Commented Jul 5 at 6:57
If setting RS is seen as part of the script, you're right; in this case, the less demanding "[=;]" is to be used, or the record separator be replicated as needed for the regexp readahead, e.g. four times. – Armali, Commented Jul 5 at 8:04

Ed Morton · Accepted Answer · 2024-07-05 18:47:29Z

One of the gawk providers, Andy Schorr, was unable to create a Stackoverflow account for some reason so he asked me to post his suggestion for him (see https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00012.html for the original source):

From Andy:

Have you considered trying to use the select extension and its nonblocking feature?

Something like this sort of seems to work:

(echo "A;B;C;D;"; cat -) | gawk -v 'RS=[;=]' -lselect -ltime '
BEGIN {
   fd = input_fd("")
   set_non_blocking(fd)
   PROCINFO[FILENAME, "RETRY"] = 1
   while (1) {
      delete readfds
      readfds[fd] = ""
      select(readfds, writefds, exceptfds)
      while ((rc = getline x) > 0) {
         if (rc > 0)
            printf "%d [%s]\n", ++n, x
         else if (rc != 2) {
            print "Error: non-retry error"
            exit 1
         }
      }
   }
}'
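The idea behind that suggestion (ask whether input is ready, then read without blocking) is not specific to gawk. As a rough illustration in Python 3, using only the standard os and select modules, with the made-up function name read_available, and assuming a Unix-like system:

```python
import os
import select


def read_available(fd: int, timeout: float = 0.0) -> bytes:
    """Return the bytes currently readable on fd, or b"" if none arrive in time.

    Note: b"" is also returned at EOF, so this sketch cannot tell the two apart.
    """
    os.set_blocking(fd, False)
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return b""
    chunks = []
    while True:
        try:
            chunk = os.read(fd, 4096)
        except BlockingIOError:
            break  # drained everything that was available
        if not chunk:
            break  # EOF: the writer closed the stream
        chunks.append(chunk)
    return b"".join(chunks)
```

A caller can split whatever read_available returns on = and ; itself, so the final delimiter is seen as soon as it arrives instead of after some later character. How closely this mirrors the gawk select extension is not claimed here; it only shows the non-blocking pattern.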

RARE Kpop Manifesto · Accepted Answer · 2024-07-05 15:54:05Z

I couldn't replicate it at all with any awk variant I have :

The outputs for gawk -c and gawk -P look out of place by design
None of them triggered the timeout

 for __ in 'mawk1' 'mawk2' 'nawk' 'gawk -e'   'gawk -be' \
           'gawk -ce' 'gawk -Pe'  'gawk -Mbe' 'gawk -nbe'; do

     ( time ( timeout --foreground 10 \
       echo '=AAAA;=BBBB;=CCCC;=DDDD;' | $( printf '%s' "$__" ) '

       BEGIN {  RS = "[\n=;]+"
               OFS = "\3"     
           } { 
                print NR, FNR, NR, length(), 
                       "$0 := \""($0)"\"",
                       "$1 := \""($1)"\"", 
                      "$NF := \""($NF)"\"" }' ) | gcat - ) | 

     column -s$'\3' -t

     echo "\f\t$__ done ...\n"
 done

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; ) 
        0.00s user 0.01s system 110% cpu 0.011 total
gcat -  0.00s user 0.00s system 39% cpu 0.010 total
1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"

    mawk1 done ...
 
( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; ) 
        0.00s user 0.01s system 127% cpu 0.008 total
gcat -  0.00s user 0.00s system  38% cpu 0.007 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
 
    mawk2 done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.01s system 112% cpu 0.007 total
gcat -  0.00s user 0.00s system  31% cpu 0.006 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    nawk done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.01s system 61% cpu 0.018 total
gcat -  0.00s user 0.00s system 10% cpu 0.017 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -e done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 106% cpu 0.008 total
gcat -  0.00s user 0.00s system  21% cpu 0.008 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -be done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 104% cpu 0.008 total
gcat -  0.00s user 0.00s system  19% cpu 0.007 total

1  1                                 1        
                            25  $0 := "=AAAA;=BBBB;=CCCC;=DDDD;"  
                                $1 := "=AAAA;=BBBB;=CCCC;=DDDD;" 
                               $NF := "=AAAA;=BBBB;=CCCC;=DDDD;"
 
    gawk -ce done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 108% cpu 0.007 total
gcat -  0.00s user 0.00s system  21% cpu 0.007 total

1  1                                 1        
                            25  $0 := "=AAAA;=BBBB;=CCCC;=DDDD;"  
                                $1 := "=AAAA;=BBBB;=CCCC;=DDDD;" 
                               $NF := "=AAAA;=BBBB;=CCCC;=DDDD;"
    gawk -Pe done ...

( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 79% cpu 0.011 total
gcat -  0.00s user 0.00s system 13% cpu 0.010 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
 
    gawk -Mbe done ...    
 
( timeout --foreground 10 echo '=AAAA;=BBBB;=CCCC;=DDDD;' |  ; )
        0.00s user 0.00s system 108% cpu 0.007 total
gcat -  0.00s user 0.00s system  23% cpu 0.007 total

1  1  1  0  $0 := ""      $1 := ""      $NF := ""
2  2  2  4  $0 := "AAAA"  $1 := "AAAA"  $NF := "AAAA"
3  3  3  4  $0 := "BBBB"  $1 := "BBBB"  $NF := "BBBB"
4  4  4  4  $0 := "CCCC"  $1 := "CCCC"  $NF := "CCCC"
5  5  5  4  $0 := "DDDD"  $1 := "DDDD"  $NF := "DDDD"
     
    gawk -nbe done ...
