I wrote a regular expression which works well in a certain program (grep, sed, awk, perl, python, ruby, ksh, bash, zsh, find, emacs, vi, vim, gedit, …). But when I use it in a different program (or on a different unix variant), it stops matching. Why?

Unfortunately, for historical reasons, different tools have slightly different regular expression syntax, and sometimes some implementations have extensions that are not supported by other tools. While there is a common ground, it seems like every tool writer made some different choices.

The consequence is that if you have a regular expression that works in one tool, you may need to modify it to work in another tool. The main differences between common tools are:

  • whether the operators +?|(){} require a backslash;
  • what extensions are supported beyond the basics .[]*^$ and usually +?|()
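
To make the first difference concrete, here is a minimal Python sketch that flips the backslash convention for +?|(){} between BRE-style and ERE-style syntax. The function name bre_to_ere is hypothetical, and the sketch assumes GNU-style BRE input (where \+, \? and \| are operators). It copies bracket expressions unchanged and ignores subtler differences, such as * being literal at the start of a BRE, so treat it as an illustration rather than a complete converter.

```python
def bre_to_ere(bre: str) -> str:
    """Swap the escaping of + ? | ( ) { } (hypothetical helper, sketch only)."""
    special = "+?|(){}"
    out = []
    i, n = 0, len(bre)
    while i < n:
        c = bre[i]
        if c == "\\" and i + 1 < n:
            nxt = bre[i + 1]
            # \( in BRE is an operator, written as bare ( in ERE;
            # any other escape (such as \1 or \.) is copied as is.
            out.append(nxt if nxt in special else c + nxt)
            i += 2
        elif c == "[":
            # Copy a bracket expression verbatim: nothing changes inside it.
            j = i + 1
            if j < n and bre[j] == "^":
                j += 1
            if j < n and bre[j] == "]":  # a leading ] is a literal
                j += 1
            while j < n and bre[j] != "]":
                if bre[j] == "[" and j + 1 < n and bre[j + 1] in ":.=":
                    # Skip [:class:], [.coll.] and [=equiv=] as one unit.
                    end = bre.find(bre[j + 1] + "]", j + 2)
                    if end == -1:
                        j = n
                        break
                    j = end + 2
                else:
                    j += 1
            out.append(bre[i:j + 1])
            i = j + 1
        elif c in special:
            # A bare + in BRE is a literal, so it must be quoted in ERE.
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(bre_to_ere(r"\(fo*\)\(ba*\)\1"))  # (fo*)(ba*)\1
print(bre_to_ere(r"foo\|bar"))          # foo|bar
print(bre_to_ere("a+b"))                # a\+b
```

The same loop run in the other direction would turn an ERE into a GNU-style BRE, which is why the two syntaxes are usually described as differing only in where the backslashes go.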

In this answer, I list the main standards. Check the documentation of the tools you're using for the details.

Wikipedia's comparison of regular expression engines has a table listing the features supported by common implementations.

Basic regular expressions (BRE)

Basic regular expressions are codified by the POSIX standard. This is the syntax used by grep, sed and vi. It provides the following features:

  • ^ and $ match only at the beginning and end of a line.
  • . matches any character (or any character except a newline).
  • […] matches any one character listed inside the brackets (character set). If the first character after the opening bracket is a ^, the characters which are not listed are matched instead. To include a ], put it immediately after the opening [ (or after [^ if it's a negative set). If - is between two characters, it denotes a range; to include a literal -, put it where it can't be parsed as a range.
  • Backslash before any of ^$.*\[ quotes the next character.
  • * matches the preceding character or subexpression 0, 1 or more times.
  • \(…\) is a syntactic group, for use with the * operator or backreferences and \DIGIT replacements.
  • Backreferences \1, \2, … match the exact text matched by the corresponding group, e.g. \(fo*\)\(ba*\)\1 matches foobaafoo but not foobaafo. There is no standard way to refer to the 10th group and beyond (the standard meaning of \10 is the first group followed by a 0).
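
The backreference example can be checked with a short Python sketch. Python's re module is not a BRE engine and writes groups as bare (…) rather than \(…\), so the pattern below is the same expression in Python's spelling; the meaning of \1 is what is being illustrated.

```python
import re

# Same pattern as \(fo*\)\(ba*\)\1, written with Python's bare parentheses.
pattern = re.compile(r"(fo*)(ba*)\1")

full = pattern.fullmatch("foobaafoo")
partial = pattern.fullmatch("foobaafo")

# \1 must repeat exactly what group 1 matched ("foo"), not merely
# something that fo* could match.
print(full.group(1), full.group(2))  # foo baa
print(partial)                        # None
```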

The following features are also standard, but missing from some restricted implementations:

  • \{m,n\} matches the preceding character or subexpression between m and n times; n or m can be omitted, and \{m\} means exactly m.
  • Inside brackets, character classes can be used, for example [[:alpha:]] matches any letter. Modern implementations also support collating elements like [.ll.] and equivalence classes like [=a=] in bracket expressions.

The following are common extensions (especially in GNU tools), but they are not found in all implementations. Check the manual of the tool you're using.

  • \| for alternation: foo\|bar matches foo or bar.
  • \? (short for \{0,1\}) and \+ (short for \{1,\}) match the preceding character or subexpression at most 1 time, or at least 1 time respectively.
  • \n matches a newline, \t matches a tab, etc.
  • \w matches any word constituent (short for [_[:alnum:]] but with variation when it comes to localisation) and \W matches any character that isn't a word constituent.
  • \< and \> match the empty string only at the beginning or end of a word respectively; \b matches either, and \B matches where \b doesn't.
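
As a small illustration of the word-related extensions, here is a sketch using Python's re module, which supports \w, \W, \b and \B with the meanings described above. Python has no \< and \>, so only \b and \B are shown.

```python
import re

text = "cat concat cat's"

# \b matches at either edge of a word: "cat" inside "concat" is skipped.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]
print(whole_words)  # [0, 11]

# \B matches where \b does not: only the "cat" glued to "con" qualifies.
inside_words = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside_words)  # [7]

# \w covers letters, digits and underscore, so "-" splits the words.
constituents = re.findall(r"\w+", "foo_bar baz-1")
print(constituents)  # ['foo_bar', 'baz', '1']
```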

Note that tools without the \| operator do not have the full power of regular expressions. Backreferences allow a few extra things that can't be done with regular expressions in the mathematical sense.

Extended regular expressions (ERE)

Extended regular expressions are codified by the POSIX standard. Their major advantage over BRE is regularity: all standard operators are bare punctuation characters, and a backslash before a punctuation character always quotes it. This is the syntax used by awk, grep -E or egrep, BSD (and GNU and soon POSIX) sed -E (formerly sed -r in GNU sed), and bash / ksh93 / yash / zsh¹'s =~ operator. This syntax provides the following features:

  • ^ and $ match only at the beginning and end of a line.
  • . matches any character (or any character except a newline).
  • […] matches any one character listed inside the brackets (character set). Complementation with an initial ^ and ranges work like in BRE (see above). Character classes can be used but are missing from a few implementations. Modern implementations also support equivalence classes and collating elements. A backslash inside brackets quotes the next character in some but not all implementations; use \\ to mean a backslash for portability.
  • (…) is a syntactic group, for use with * or \DIGIT replacements.
  • | for alternation: foo|bar matches foo or bar.
  • *, + and ? match the preceding character or subexpression a number of times: 0 or more for *, 1 or more for +, 0 or 1 for ?.
  • Backslash quotes the next character if it is not alphanumeric.
  • {m,n} matches the preceding character or subexpression between m and n times (missing from some implementations); n or m can be omitted, and {m} means exactly m.
  • Some common extensions as in BRE: \DIGIT backreferences (notably absent in awk except in the busybox implementation where you can use $0 ~ "(...)\\1"); special characters \n, \t, etc.; word boundaries \b and \B, word constituents \w and \W, …
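
The bare operators can be tried out with a short Python sketch. Python's re module is not a POSIX ERE engine (for instance, it picks the first alternative that works rather than the longest match), but it spells these operators the same way. The helper name matches is hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"(ab|cd)+e?", "abcdab"))  # True: group, alternation, + and ?
print(matches(r"a{2,3}", "aaaa"))        # False: at most 3 repetitions
print(matches(r"a{2,}", "aaaa"))         # True: upper bound omitted
print(matches(r"a\+b", "a+b"))           # True: backslash quotes the +
```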

PCRE (Perl-compatible regular expressions)

PCRE are extensions of ERE, originally introduced by Perl and adopted by GNU grep -P and many modern tools and programming languages, usually via the PCRE library. See the Perl documentation for nice formatting with examples. Not all features of the latest version of Perl are supported by PCRE (e.g. Perl code execution is only supported in Perl). See the PCRE manual for a summary of supported features. The main additions to ERE are:

  • (?:…) is a non-capturing group: like (…), but does not count for backreferences.
  • (?=FOO)BAR (lookahead) matches BAR, but only if there is also a match for FOO starting at the same position. This is most useful to anchor a match without including the following text in the match: foo(?=bar) matches foo but only if it's followed by bar.
  • (?!FOO)BAR (negative lookahead) matches BAR, but only if there is not also a match for FOO starting at the same position. For example (?!foo)[a-z]+ matches any lowercase word that does not start with foo; [a-z]+(?![0-9]) matches any lowercase word that is not followed by a digit (so in foo123, it matches fo but not foo).
  • (?<=FOO)BAR (lookbehind) matches BAR, but only if it is immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<=^| )foo matches foo but only if it's preceded by a space or by the beginning of the string.
  • (?<!FOO)BAR (negative lookbehind) matches BAR, but only if it is not immediately preceded by a match for FOO. FOO must have a known length (you can't use repetition operators such as *). This is most useful to anchor a match without including the preceding text in the match: (?<![a-z])foo matches foo but only if it is not preceded by a lowercase letter.
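
These constructs can be tried in Python, whose re module is not the PCRE library but accepts the same syntax for the operators listed above. This is only a sketch: Python requires the alternatives inside a lookbehind to have the same width, so it uses (?<= )foo instead of the (?<=^| )foo example.

```python
import re

# Non-capturing group: (?:ab) repeats without taking a group number.
groups = re.match(r"(?:ab)+(c)", "ababc").groups()
print(groups)  # ('c',)

# Lookahead: "bar" must follow but is not part of the match.
ahead_hit = re.search(r"foo(?=bar)", "foobar").group()
ahead_miss = re.search(r"foo(?=bar)", "foobaz")
print(ahead_hit, ahead_miss)  # foo None

# Negative lookahead: the engine backs off one letter to avoid the digit.
not_before_digit = re.search(r"[a-z]+(?![0-9])", "foo123").group()
print(not_before_digit)  # fo

# Anchored at the start, (?!foo) rejects a word beginning with "foo".
starts_with_foo = re.match(r"(?!foo)[a-z]+", "foobar")
print(starts_with_foo)  # None

# Lookbehind and negative lookbehind: only the second "foo" qualifies.
after_space = re.search(r"(?<= )foo", "xfoo foo").start()
not_after_letter = re.search(r"(?<![a-z])foo", "xfoo foo").start()
print(after_space, not_after_letter)  # 5 5
```

Note that an unanchored search for (?!foo)[a-z]+ in foobar would still succeed by starting one character later and matching oobar, so negative lookahead usually needs an anchor or a word boundary in front of it.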

Emacs

Emacs's syntax is intermediate between BRE and ERE. In addition to Emacs, it is the default syntax for -regex in GNU find. Emacs offers the following operators:

  • ^, $, ., […], *, +, ? as in ERE
  • \(…\), \|, \{…\}, \DIGIT as in BRE
  • more backslash-letter sequences; \< and \> for word boundaries; and more in recent versions of Emacs, that are often not supported in other engines with an Emacs-like syntax.

Shell globs

Shell globs (wildcards) perform pattern matching with a syntax that is completely different from regular expressions and less powerful. In addition to shells, these wildcards are available with other tools such as find -name and rsync filters. POSIX patterns include the following features:

  • ? matches any single character.
  • […] is a character set as in common regular expression syntaxes. Some shells do not support character classes. Some shells require ! instead of ^ to negate the set.
  • * matches any sequence of characters (often except / when matching file paths; if / is excluded from *, then ** sometimes includes /, but check the tool's documentation).
  • Backslash quotes the next character.
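
For experimenting with glob semantics outside a shell, Python's standard fnmatch module implements a similar wildcard syntax. This is a sketch of one implementation, not of POSIX behaviour in general: fnmatch's * also matches /, and it has no backslash quoting (a special character is quoted by wrapping it in brackets instead).

```python
import fnmatch
import re

star = fnmatch.fnmatchcase("notes.txt", "*.txt")
question = fnmatch.fnmatchcase("ab.c", "?.c")
negated = fnmatch.fnmatchcase("b1", "[!a]1")
across_slash = fnmatch.fnmatchcase("dir/notes.txt", "*.txt")

print(star)          # True: * matches any sequence of characters
print(question)      # False: ? matches exactly one character
print(negated)       # True: [!a] is any character other than "a"
print(across_slash)  # True: in fnmatch, * is not stopped by /

# A glob can be translated into an equivalent regular expression.
as_regex = fnmatch.translate("*.txt")
same_result = re.fullmatch(as_regex, "notes.txt") is not None
print(same_result)   # True
```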

Ksh offers additional features which give its pattern matching the full power of regular expressions. These features are also available in bash after running shopt -s extglob. Zsh has a different syntax but can also support ksh's syntax after setopt ksh_glob.


¹ unless the rematchpcre option is enabled in zsh in which case =~ uses PCREs there. ksh93's extended regexps also support some of perl's extended operators such as the look-around ones.

  • Other rich REs you may want to mention are vim's and AT&T libast (as in ksh93) ones. Commented May 5, 2014 at 6:35
  • @StéphaneChazelas Apart from vim, what program uses vim regexps? Apart from ksh, what program uses libast? Commented Sep 7, 2014 at 1:42
  • all of the AT&T tool set uses the AT&T REs (grep, tw, expr...). Except for ksh, that toolset is rarely found outside of AT&T though. Commented Sep 7, 2014 at 14:16
  • According to my understanding (and Wikipedia's), your term "Character class" actually refers to "POSIX character class" ... however, regex(7) agrees with you and calls [these] "bracket expressions" and (within "bracket expressions") [:these:] "character classes." I'm not sure how to best address that.
    – Adam Katz
    Commented Jan 16, 2015 at 6:47
  • Whatever you call them, they support ranges. It's definitely worth noting that - specifies a range and should either be escaped, first (after the optional ^), or last if it is to be taken literally. (I've seen plenty of bugs stemming from e.g. [A-z] –note the change in case–, which matches characters of codes 65 to 122 and accidentally includes each of: [\]^_`. I've also seen the valid yet confusing [!-~] to match all printable characters in ANSI, which I prefer to see as [\x21-\x7e], which is at least straightforward in its action though confusing in a different dimension.)
    – Adam Katz
    Commented Jan 16, 2015 at 6:54
