Non-greedy match with SED regex (emulate perl's .*?)

Question

I want to use sed to replace anything in a string between the first AB and the first occurrence of AC (inclusive) with XXX.

For example, I have this string (this string is for a test only):

ssABteAstACABnnACss

and I would like output similar to this: ssXXXABnnACss.

I did this with perl:

$ echo 'ssABteAstACABnnACss' | perl -pe 's/AB.*?AC/XXX/'
ssXXXABnnACss

but I want to implement it with sed. The following (using Perl's non-greedy syntax with sed's extended regex mode, -r) does not work:

$ echo 'ssABteAstACABnnACss' | sed -re 's/AB.*?AC/XXX/'
ssXXXss

This doesn't make sense. You have a working solution in Perl, but you want to use Sed, why? — Kusalananda, Commented Jul 23, 2016 at 6:44
@Kusalananda perl may not be available on all *nix platform. Whereas sed is generally available on almost every *nix platform. — Sagar, Commented Mar 2, 2023 at 20:14
@Sagar Those are interesting statements. Let me know a Unix where Perl is unavailable as part of the base system and as a package. Also, "on almost every platform" seems to insinuate that there are Unix systems without sed. Which ones are these? — Kusalananda, Commented Mar 2, 2023 at 20:31

Will Bradley · Accepted Answer · 2020-05-24 22:20:46Z

Sed regexes match the longest match. Sed has no equivalent of non-greedy.
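As a quick check of that claim, the sketch below runs the question's pattern with and without the trailing `?`. It assumes GNU sed with `-E`; the variable names are illustrative only. POSIX leaves `*?` undefined in extended regexes, and GNU sed appears to treat the `?` as making the preceding expression optional rather than lazy, so both commands should give the same longest match.

```shell
input='ssABteAstACABnnACss'

# Plain greedy pattern.
greedy=$(printf '%s\n' "$input" | sed -E 's/AB.*AC/XXX/')

# Same pattern with Perl's non-greedy suffix; sed does not read it as lazy.
with_qmark=$(printf '%s\n' "$input" | sed -E 's/AB.*?AC/XXX/')

printf '%s\n%s\n' "$greedy" "$with_qmark"
```

Both lines come out as ssXXXss, which is the unwanted result shown in the question.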

What we want to do is match

1. AB,
followed by
2. any amount of anything other than AC,
followed by
3. AC

Unfortunately, sed can’t do #2 — at least not for a multi-character regular expression. Of course, for a single-character regular expression such as @ (or even [123]), we can do [^@]* or [^123]*. And so we can work around sed’s limitations by changing all occurrences of AC to @ and then searching for

1. AB,
followed by
2. any number of anything other than @,
followed by
3. @

like this:

sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g'

The last part changes unmatched instances of @ back to AC.

But this is a reckless approach because the input could already contain @ characters. So, by matching them, we could get false positives. However, since no shell variable will ever have a NUL (\x00) character in it, NUL is likely a good character to use in the above work-around instead of @:

$ echo 'ssABteAstACABnnACss' | sed 's/AC/\x00/g; s/AB[^\x00]*\x00/XXX/; s/\x00/AC/g'
ssXXXABnnACss

The use of NUL requires GNU sed. (To make sure that GNU features are enabled, the user must not have set the environment variable POSIXLY_CORRECT.)
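To make the risk of a printable marker concrete, here is a small sketch of the @ variant applied to input that already contains an @ between AB and the first AC. The input string is made up for illustration and is not from the question.

```shell
# Hypothetical input: a literal @ sits between AB and the first AC.
input='ssABte@stACABnnACss'

at_result=$(printf '%s\n' "$input" |
    sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g')

printf '%s\n' "$at_result"
```

The match stops at the literal @ instead of at the first AC, so the result is ssXXXstACABnnACss rather than the intended ssXXXABnnACss.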

If you are using sed with GNU's -z flag to handle NUL-separated input, such as the output of find ... -print0, then NUL will not be in the pattern space and NUL is a good choice for the substitution here.

Although NUL cannot be stored in a bash variable, it can be produced by a printf command. If your input string can contain any character at all, including NUL, then see Stéphane Chazelas' answer, which adds a clever escaping method.

I just edited your answer to add a lengthy explanation; feel free to trim it or roll it back. — G-Man Says 'Reinstate Monica', Commented Jul 23, 2016 at 5:33
@G-Man That is an excellent explanation! Very nicely done. Thank you. — John1024, Commented Jul 23, 2016 at 6:51
You can echo or printf a '\000' just fine in bash (or the input could come from a file). But in general, a string of text is of course not likely to have NULs. — ilkkachu, Commented Jul 23, 2016 at 14:39
@ilkkachu You are right about that. What I should have written is that no shell variable or parameter can contain NULs. Answer updated. — John1024, Commented Jul 23, 2016 at 19:24
Wouldn't this be a whole lot safer if you changed AC to AC@ and back again? — Michael Vehrs, Commented Jul 25, 2016 at 8:52

gresolio · Accepted Answer · 2022-07-23 02:40:59Z

To do a non-greedy match on a single character, match all characters excluding the one that terminates the match.

Greedy matching:

$ echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non greedy matching:

$ echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar

Source: sed - non greedy matching by Christoph Sieghart

answered Oct 12, 2017 at 21:49, edited Jul 23, 2022 at 2:40

The term “no-brainer” is ambiguous. In this case, it is not clear that you (or Christoph Sieghart) thought this through. In particular, it would have been nice if you had shown how to solve the specific problem in the question (where the zero-or-more-of- expression is followed by more than one character). You may find that this answer doesn’t work well in that case. — Scott - Слава Україні, Commented Oct 12, 2017 at 22:14
The rabbit hole is much deeper than it seemed to me at first glance. You are right, that workaround doesn't work well for multi-character regular expressions. — gresolio, Commented Oct 15, 2017 at 20:15

Stéphane Chazelas · Accepted Answer · 2020-04-26 09:00:38Z

Some sed implementations have support for that. ssed has a PCRE mode:

ssed -R 's/AB.*?AC/XXX/'

AT&T ast sed supports the *? operator as a non-greedy version of * in its extended (with -E) and augmented (with -A) regexps.

sed -E 's/AB.*?AC/XXX/'
sed -A 's/AB.*?AC/XXX/'

In that implementation and those -E/-A modes, more generally, perl-like regexps can be used inside (?P:perl-like regexp here), though as seen above, it's not necessary for the *? operator.

Its augmented regexps also have conjunction and negation operators:

sed -A 's/AB(.*&(.*AC.*)!)AC/XXX/'

Portably, you can use this technique: replace the end string (here AC) with a single character that doesn't occur in either the beginning or end string (like : here) so you can do s/AB[^:]*://, and in case that character may appear in the input, use an escaping mechanism that doesn't clash with the begin and end strings.

An example:

sed 's/_/_u/g; # use _ as the escape character, escape it
     s/:/_c/g; # escape our replacement character
     s/AC/:/g; # replace the end string
     s/AB[^:]*:/XXX/; # actual replacement
     s/:/AC/g; # restore the remaining end strings
     s/_c/:/g; # revert escaping
     s/_u/_/g'
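As a worked check of the escaping scheme, the sketch below feeds the same substitutions an input that already contains both the escape character _ and the marker character :. The input string is invented for illustration. The commands are written one per line with comment lines in between, which should behave the same as the semicolon-separated form.

```shell
# Hypothetical input containing "_" and ":" as ordinary data.
input='a_b:ABx:yACzAC'

escaped_result=$(printf '%s\n' "$input" | sed '
# escape the escape character itself, then the marker character
s/_/_u/g
s/:/_c/g
# turn every end string into the (now unambiguous) marker
s/AC/:/g
# the actual replacement, up to the first marker
s/AB[^:]*:/XXX/
# restore the remaining end strings, then undo the escaping
s/:/AC/g
s/_c/:/g
s/_u/_/g
')

printf '%s\n' "$escaped_result"
```

This prints a_b:XXXzAC: the original _ and : survive the round trip, and only the text from AB through the first AC is replaced.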

With GNU sed, an approach is to use newline as the replacement character. Because sed processes one line at a time, newline never occurs in the pattern space, so one can do:

sed 's/AC/\n/g;s/AB[^\n]*\n/XXX/;s/\n/AC/g'

That generally doesn't work with other sed implementations because they don't support [^\n]. With GNU sed you have to make sure that POSIX compatibility is not enabled (like with the POSIXLY_CORRECT environment variable).

Gilles 'SO- stop being evil' · Accepted Answer · 2016-07-24 19:58:07Z

No, sed regexes don't have non-greedy matching.

You can match all text up to the first occurrence of AC by using “anything not containing AC” followed by AC, which does the same as Perl's .*?AC. The thing is, “anything not containing AC” cannot be expressed easily as a regular expression: there is always a regular expression that recognizes the negation of a regular expression, but the negation regex gets complicated fast. And in portable sed, this isn't possible at all, because the negation regex requires grouping an alternation which is present in extended regular expressions (e.g. in awk) but not in portable basic regular expressions. Some versions of sed, such as GNU sed, do have extensions to BRE that make it able to express all possible regular expressions.

sed 's/AB\([^A]*\|A[^C]\)*A*AC/XXX/'
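A quick check of this expression on the question's sample, as a sketch assuming GNU sed (for `\|` in basic regexes and for `-E`). The `strict` variant is not from this answer. It is my own assumption for inputs where a run of A characters directly precedes C (such as AAC), a case in which the expression above appears to match past the first AC. The test string for that case is invented.

```shell
input='ssABteAstACABnnACss'

# The expression from the answer, in GNU BRE syntax.
bre_result=$(printf '%s\n' "$input" |
    sed 's/AB\([^A]*\|A[^C]\)*A*AC/XXX/')

# Assumed stricter variant in ERE syntax: runs of A are consumed as a unit,
# so a run of A followed by C can only be the final delimiter.
strict=$(printf '%s\n' "$input" |
    sed -E 's/AB([^A]|A+[^AC])*A+C/XXX/')

# Hypothetical input with AAC before a later AC.
strict_aac=$(printf '%s\n' 'ABxAACyAC' |
    sed -E 's/AB([^A]|A+[^AC])*A+C/XXX/')

printf '%s\n%s\n%s\n' "$bre_result" "$strict" "$strict_aac"
```

On the sample both forms give ssXXXABnnACss, and the stricter variant turns ABxAACyAC into XXXyAC.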

Because of the difficulty of negating a regex, this doesn't generalize well. What you can do instead is to transform the line temporarily. In some sed implementations, you can use newlines as a marker, since they can't appear in an input line (and if you need multiple markers, use newline followed by a varying character).

sed -e 's/AC/\
&/g' -e 's/AB[^\
]*\nAC/XXX/' -e 's/\n//g'

However, beware that backslash-newline doesn't work in a character set with some sed versions. In particular, this doesn't work in GNU sed, which is the sed implementation on non-embedded Linux; in GNU sed you can use \n instead:

sed -e 's/AC/\
&/g' -e 's/AB[^\n]*\nAC/XXX/' -e 's/\n//g'

In this specific case, it's enough to replace the first AC by a newline. The approach I presented above is more general.

A more powerful approach in sed is to save the line into the hold space, remove all but the first “interesting” part of the line, exchange the hold space and the pattern space or append the pattern space to the hold space and repeat. However, if you start doing things that are this complicated, you should really think about switching to awk. Awk doesn't have non-greedy matching either, but you can split a string and save the parts into variables.
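The awk route mentioned above could look roughly like this. It is a minimal sketch, not part of the original answer: the variable names beg, fin and repl are made up, and the delimiters are treated as fixed strings located with index() rather than as regular expressions.

```shell
input='ssABteAstACABnnACss'

awk_result=$(printf '%s\n' "$input" |
    awk -v beg='AB' -v fin='AC' -v repl='XXX' '
{
    s = index($0, beg)                      # position of the first start string
    if (s > 0) {
        rest = substr($0, s + length(beg))  # text after that start string
        e = index(rest, fin)                # first end string after it
        if (e > 0)
            $0 = substr($0, 1, s - 1) repl substr(rest, e + length(fin))
    }
    print                                   # lines without a full pair pass through
}')

printf '%s\n' "$awk_result"
```

For the sample line this prints ssXXXABnnACss. Because no regex is involved, greediness never comes into play, and only standard awk string functions are needed.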

@ilkkachu No, it doesn't. s/\n//g removes all newlines. — Gilles 'SO- stop being evil', Commented Jul 24, 2016 at 19:28
asdf. Right, my bad. — ilkkachu, Commented Jul 24, 2016 at 20:06

undercat · Accepted Answer · 2020-01-10 11:37:49Z

The solution is quite simple. .* is greedy, but it is not absolutely greedy. Consider matching ssABteAstACABnnACss against the regexp AB.*AC. The AC that follows .* must actually have a match. The problem is that because .* is greedy, the subsequent AC will match the last AC rather than the first one. .* eats up the first AC while the literal AC in the regexp matches the last one in ssABteAstACABnnACss. To prevent this from happening, simply replace the first AC with something ridiculous to differentiate it from the second one and from anything else.

echo ssABteAstACABnnACss | sed 's/AC/-foobar-/; s/AB.*-foobar-/XXX/'
ssXXXABnnACss

The greedy .* will now stop at the foot of -foobar- in ssABteAst-foobar-ABnnACss because there is no other -foobar- than this -foobar-, and the regexp -foobar- MUST have a match. The previous problem was that the regexp AC had two matches, but because .* was greedy, the last match for AC was selected. However, with -foobar-, only one match is possible, and this match proves that .* is not absolutely greedy. The bus stop for .* occurs where only one match remains for the rest of the regexp following .*.

Note that this solution will fail if an AC appears before the first AB because the wrong AC will be replaced with -foobar-. For example, after the first sed substitution, ACssABteAstACABnnACss becomes -foobar-ssABteAstACABnnACss; therefore, a match cannot be found against AB.*-foobar-. However, if the sequence is always ...AB...AC...AB...AC..., then this solution will succeed.
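The success and failure cases described above can be reproduced side by side. This is only a sketch; the variable names are illustrative.

```shell
script='s/AC/-foobar-/; s/AB.*-foobar-/XXX/'

# Sequence is ...AB...AC...AB...AC...: works as intended.
ok=$(printf '%s\n' 'ssABteAstACABnnACss' | sed "$script")

# An AC precedes the first AB: the marker lands in the wrong place,
# the second substitution finds nothing, and the marker is left behind.
bad=$(printf '%s\n' 'ACssABteAstACABnnACss' | sed "$script")

printf '%s\n%s\n' "$ok" "$bad"
```

The first line is ssXXXABnnACss. The second is -foobar-ssABteAstACABnnACss, so the failure case does not just miss the match, it also leaves the placeholder in the output.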

bu5hman · Accepted Answer · 2020-01-10 16:28:05Z

One alternative is to transform the string so that a greedy match is exactly what you want:

echo "ssABtCeCAstACABnnACss" | rev | sed -E "s/(.*)CA.*BA(.*)/\1CA+-+-+-+-BA\2/" | rev

Use rev to reverse the string, reverse your match criteria, run sed in the usual fashion, and then reverse the result:

ssAB-+-+-+-+ACABnnACss

AdminBee · Accepted Answer · 2023-03-03 09:24:12Z

It doesn't appear that typical/vanilla sed supports non-greedy RegEx repetitions (aka Minimal Repetitions), so our solution cannot rely on that if portability matters. With that sed however (pun intended), re_format(7) does appear to document it as a potential feature achievable by appending ? to a repetition operator.

Anyway, although it may be a bit hard to understand, I believe that the following is a concise and straightforward solution. I am going to avoid injecting special delimiters (such as NUL characters or unique strings), as other answers have already exercised that strategy.

sed -E 's/ABA*(C|([^CA]+C*A)+C)/XXX/g' <<< 'ssABteAstACABnnACss'
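One detail worth checking is the trailing g flag: it replaces every AB...AC pair on the line, whereas the question asks for the first pair only. The sketch below compares the two, assuming a sed that accepts -E; the variable names are illustrative.

```shell
input='ssABteAstACABnnACss'

# As written above: every AB...AC pair is replaced.
all=$(printf '%s\n' "$input" |
    sed -E 's/ABA*(C|([^CA]+C*A)+C)/XXX/g')

# Without g: only the first pair is replaced, as in the question.
first=$(printf '%s\n' "$input" |
    sed -E 's/ABA*(C|([^CA]+C*A)+C)/XXX/')

printf '%s\n%s\n' "$all" "$first"
```

With g the output is ssXXXXXXss; without it the output is ssXXXABnnACss, matching the Perl result.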

Some things to note here:

There's no use of makeshift delimiters (e.g. injecting null chars/bytes, unique strings, etc.) here.
This solution assumes that the input is a single line. Otherwise, the sed script first needs to collect every line of the input into the pattern space verbatim before doing anything else. This step is mandatory if a line break can appear between an AB and a following AC, or inside a delimiter itself (as in A\nB or A\nC). To be extra safe, also check that patterns like [^CA], . or [[:space:]] match end-of-line characters, given the end-of-line character(s) of the input, the version of sed, and the locale in use when the sed program runs. [^CA] should match any character other than C or A, including control and EOL characters; just beware of input containing NUL characters or multibyte characters (plus locale, encoding, etc.). Lastly, keep in mind that the pattern space is likely limited to a fixed number of bytes, so the following may error or overflow on large inputs.
```
sed -n -E \
    -e ':start' \
    -e '$!N' \
    -e '$!bstart' \
    -e 's/ABA*(C|([^CA]+C*A)+C)/XXX/g' \
    -e 'p' <<\EOF
ssABteAstACABnnACss
ssABteAstACABnnACss
ssABteA
stACABnnACss
EOF
```
There's no special handling or consideration for an AB that occurs between an AB and the very next AC, or, an AC occurring after an AC with no AB in between them (if such cases are even possible in your input?).
This solution uses the -E flag to enable extended regex patterns, which is required for alternation (i.e. |). This flag may not be portable to older seds, but it should be available on any modern BSD, GNU, or macOS sed. Here's what the GNU sed manual has to say about it:

-E, -r, --regexp-extended

Use extended regular expressions rather than basic regular expressions. Extended regexps are those that egrep accepts; they can be clearer because they usually have fewer backslashes.

Historically this was a GNU extension, but the -E extension has since been added to the POSIX standard, so use -E for portability. GNU sed has accepted -E as an undocumented option for years, and *BSD seds have accepted -E for years as well, but scripts that use -E might not port to other older systems. See Extended regular expressions.

P.S I'm fairly certain that this can also be solved/accomplished just by utilizing multiple "sed commands" (i.e. utilizing multiple s commands, the hold space, conditional branching via the t command, etc.), and such an alternative solution would probably be more readable, understandable, and less error-prone than this is (maybe I'll add it to my answer sometime in the future if I find the time 🙂).

On what system does re_format(7) document "potential features"? — Kusalananda, Commented Mar 2, 2023 at 20:33
@Kusalananda hmmm you're right. Good question. So I was just referring to the manual on my system (macOS 13.2) currently which is dated Sept 29, 2011. It is documented under Minimal Repetitions (available for enhanced extended REs only) under ENHANCED FEATURES. See here or here. It seems that the one provided with macOS is outdated and rather unique? — tmillr, Commented Mar 2, 2023 at 21:23
Ah, yes. The macOS regular expression library seems to implement these expressions, but they are unavailable with sed on that platform. I would classify this as "rather unique" and specific for the macOS system. I'm not sure what utilities on macOS actually use this. Possibly the native Perl. — Kusalananda, Commented Mar 2, 2023 at 21:36
