strange regex matching with grep/egrep

Question

GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
grep (GNU grep) 2.20
grep-2.20-3.el7.x86_64

Can someone explain this puzzle? I'm getting false matches with grep/egrep.

echo "somestringthing" | egrep  '\bstring*'
(no output as expected)
echo "somestringthing" | egrep '\bsomestring*'
somestringthing
echo "somestringthing" | egrep '\bsomestringthingy*'
somestringthing
echo "somestringthing" | egrep '\bsomestringthing1*'
somestringthing
echo "somestringthing" | egrep '\bsomestringthingX*'
somestringthing

The last three should NOT match because of the single character before the wildcard. Experimenting, I've found that any string matches as if the single character before the wildcard did not exist.

'\b' is a word boundary, FYI.

So am I missing something here, or is this a bug in grep? (Talk about hair-pulling madness trying to debug code you think is working properly.)

* in a regexp means zero-or-more, so y* means zero or more y characters. Use y+ (or y\+ in GNU BRE) if you mean "one or more y characters", or use .* if you mean "followed by zero or more of any other characters". – cas, Commented Aug 30, 2019 at 3:17
Also worth mentioning: unless you're capturing the match (e.g. with grep's -o option), grep -E '\bstring*' is functionally identical to grep -E '\bstrin'. – cas, Commented Aug 30, 2019 at 3:24

Kusalananda · Accepted Answer · 2019-08-30 06:33:48Z

The y*, 1* and X* at the end of the last three regular expressions will match zero or more y, 1 and X respectively.

At the end of the input string somestringthing you do actually have zero or more of these characters (exactly zero), so all three expressions match.
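
To see this directly, here is a minimal sketch that uses grep's -o option (mentioned in the comments above) to print only the matched text. The variable names are made up for illustration, and -o is assumed to be available, as it is in GNU grep.

```shell
# Print only the matched text (-o) to see what the trailing "X*" consumed.
input='somestringthing'

# X* matches zero X characters here, so the match is just the literal prefix.
matched=$(printf '%s\n' "$input" | grep -E -o 'somestringthingX*')
printf 'matched: %s\n' "$matched"       # matched: somestringthing

# With actual X characters present, X* extends the match to include them.
matched_x=$(printf '%s\n' 'somestringthingXX' | grep -E -o 'somestringthingX*')
printf 'matched: %s\n' "$matched_x"     # matched: somestringthingXX
```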

If you want to match one or more y at the end of the string, use y+ or y{1,} in an extended regular expression, or yy* or y\{1,\} in a basic regular expression (grep without -E):

echo somestringthing | grep -E 'somestringthingy+'

(this produces no output)
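
As a rough side-by-side of the forms listed above, the following sketch checks each "one or more" spelling against the same input. The helper names ere_matches and bre_matches are hypothetical, introduced only for this illustration.

```shell
# Hypothetical helpers: print "yes" if the pattern ($1) matches the input ($2).
ere_matches() {
    if printf '%s\n' "$2" | grep -E -q -e "$1"; then echo yes; else echo no; fi
}
bre_matches() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then echo yes; else echo no; fi
}

# Extended regular expressions: y+ and y{1,} both require at least one y.
ere_matches 'somestringthingy+'    'somestringthing'     # prints "no"
ere_matches 'somestringthingy{1,}' 'somestringthing'     # prints "no"
ere_matches 'somestringthingy+'    'somestringthingyy'   # prints "yes"

# Basic regular expressions: yy* and y\{1,\} express the same thing.
bre_matches 'somestringthingyy*'     'somestringthing'   # prints "no"
bre_matches 'somestringthingy\{1,\}' 'somestringthing'   # prints "no"
bre_matches 'somestringthingyy*'     'somestringthingy'  # prints "yes"
```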

Also note that egrep is deprecated and you should be using grep -E. If you want to match complete words only, use grep -E -w (this would require a word boundary at the start and end of the match in the input).
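
A small sketch of what -w changes, assuming a grep that supports it (GNU grep does). The helper name word_matches is hypothetical.

```shell
# Hypothetical helper: print "yes" if the pattern ($1) matches the input ($2)
# as a whole word, i.e. with grep -E -w.
word_matches() {
    if printf '%s\n' "$2" | grep -E -w -q -e "$1"; then echo yes; else echo no; fi
}

# "string" occurs only inside a longer word, so -w rejects it.
word_matches 'string' 'somestringthing'              # prints "no"

# Here "string" stands alone as a word.
word_matches 'string' 'some string thing'            # prints "yes"

# y* still matches zero y characters; the whole word matches the pattern.
word_matches 'somestringthingy*' 'somestringthing'   # prints "yes"
```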

Zizzyzizzy · Accepted Answer · 2019-08-30 02:15:42Z

Bahh..more messing around and it seems the character before the * wildcard is being treated as a .

The proper wildcard use for grep is apparently .* not just *

Also, the \b was not required once I used the .* as the wildcard. The -w flag works as expected:

echo "somestringthing" | egrep -w 'somestring.*'
somestringthing

echo "somestringthing" | egrep -w 'somestringy.*'
(no output as expected)


No, the character before the * is NOT treated as a . unless it IS a . character. It's treated as zero-or-more of whatever character it happens to be. .* isn't the "proper wildcard for grep"; it's a pattern that matches zero-or-more of any character (. matches any character). And unless you want to capture to the end of the line, you generally don't need a .* at the end of a regexp pattern. Regular expressions are not globs (shell/filename wildcards). They may look similar and share some common features, but they are not the same. – cas, Commented Aug 30, 2019 at 13:11
See man 7 glob and man 7 regex. – cas, Commented Aug 30, 2019 at 13:13
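
To make the glob-versus-regex difference concrete, here is a minimal sketch that tests the same text once as a shell glob (via case) and once as an extended regular expression. The variable names are invented for the illustration.

```shell
name='somestringthing'

# As a shell glob, * on its own stands for "any run of characters".
case $name in
    somestring*) glob_plain=yes ;;
    *)           glob_plain=no ;;
esac

# As a glob, the X is a required literal, so this does not match.
case $name in
    somestringX*) glob_x=yes ;;
    *)            glob_x=no ;;
esac

# As a regex, X* means "zero or more X", and the match is not anchored.
if printf '%s\n' "$name" | grep -E -q -e 'somestringX*'; then
    regex_x=yes
else
    regex_x=no
fi

# The regex closest to the glob somestringX* is anchored and uses .* instead.
if printf '%s\n' "$name" | grep -E -q -e '^somestringX.*$'; then
    regex_anchored=yes
else
    regex_anchored=no
fi

printf 'glob somestring*:      %s\n' "$glob_plain"      # yes
printf 'glob somestringX*:     %s\n' "$glob_x"          # no
printf 'regex somestringX*:    %s\n' "$regex_x"         # yes
printf 'regex ^somestringX.*$: %s\n' "$regex_anchored"  # no
```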
