
My example string is as follows:

This is 02G05 a test string 20-Jul-2012

From the above string I want to extract 02G05. To do that, I tried the following regex with sed:

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -n '/\d+G\d+/p'

But the above command prints nothing, and I believe the reason is that the pattern I supplied to sed does not match anything.

So my question is: what am I doing wrong here, and how do I correct it?

When I try the same string and pattern with Python, I get the result I expect:

>>> import re
>>> st = "This is 02G05 a test string 20-Jul-2012"
>>> re.findall(r'\d+G\d+', st)
['02G05']
  • 6
    Python is definitely not sed. Their regex flavors are quite different.
    – tripleee
    Commented Dec 12, 2013 at 11:45

6 Answers


How about using grep -E?

echo "This is 02G05 a test string 20-Jul-2012" | grep -Eo '[0-9]+G[0-9]+'
  • 4
    +1 This is simpler, and will also correctly handle the case of multiple matches on the same line. A complex sed script could be devised for that case, but why bother?
    – tripleee
    Commented Jul 20, 2012 at 7:28
  • egrep uses extended regexps; sed and grep use basic regexps by default; egrep, grep -e, or sed -E use extended regexps; and the Python code in the question uses Perl-style regular expressions (similar to PCRE, Perl Compatible Regular Expressions). GNU grep can use PCRE with the -P option. Commented Aug 22, 2016 at 13:46
  • @FelipeBuccioni actually that should be egrep or grep -E or sed -r Commented Apr 13, 2018 at 15:44
  • For a single (first) match, append | head -1 to the command, as per this answer to another question. Commented Apr 13, 2018 at 15:55
  • 2
    grep has -m 1 to stop after the first match.
    – tripleee
    Commented Apr 20, 2018 at 3:42
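
To illustrate the points raised in these comments, here is a small sketch. It assumes a grep that supports -o (GNU and BSD grep do, though POSIX does not require it). The sample line and the first_match helper name are invented for the example.

```shell
#!/bin/sh
# Sample input with two matching tokens on one line (invented for illustration).
sample='ids 02G05 and 11G7 on one line'

# -o prints each match on its own line, so both tokens come out.
printf '%s\n' "$sample" | grep -Eo '[0-9]+G[0-9]+'

# Hypothetical helper: keep only the first match from the whole input.
first_match() {
  grep -Eo '[0-9]+G[0-9]+' | head -n 1
}

printf '%s\n' "$sample" | first_match
```

Note that grep -m 1 stops after the first matching line rather than the first match, so combined with -o it would still print every match found on that line.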

The pattern \d might not be supported by your sed. Try [0-9] or [[:digit:]] instead.

To only print the actual match (not the entire matching line), use a substitution.

sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p'

The parentheses capture the text they match into a back reference. Here, the first (and only) pair of parentheses captures the string we want to keep, and we replace the entire line with just the captured string \1. (The p flag prints the line after a successful substitution, and the -n option suppresses sed's default printing of every line, so only lines where the substitution succeeded are printed.)
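
One thing worth checking with this approach is how the greedy leading .* interacts with the digits in front of the G. The sketch below compares the pattern above with a variant that forces the character before the match to be a non-digit. It assumes a sed that accepts -E (GNU and BSD sed both do); the function names are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
line='This is 02G05 a test string 20-Jul-2012'

# Pattern from the answer: the greedy .* also swallows leading digits,
# leaving only one digit for the capture group.
greedy_extract() {
  sed -n 's/.*\([0-9][0-9]*G[0-9][0-9]*\).*/\1/p'
}

# Variant: the optional prefix must end in a non-digit, so the capture
# group gets the whole run of digits before the G.
full_extract() {
  sed -nE 's/(.*[^0-9])?([0-9]+G[0-9]+).*/\2/p'
}

printf '%s\n' "$line" | greedy_extract   # prints 2G05
printf '%s\n' "$line" | full_extract     # prints 02G05
```

The prefix group is optional so that a match at the very start of the line is still found.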

  • 6
    Thanks, it worked fine. But I have a question: why is .* necessary in your regex? When I try sed -n 's/\([0-9]\+G[0-9]\+\)/\1/p' it just prints the entire line.
    – RanRag
    Commented Jul 19, 2012 at 20:47
  • 7
    That's why, isn't it? Replace whatever comes before and after the match with nothing, then print the whole line.
    – tripleee
    Commented Jul 19, 2012 at 21:01
  • 1
    @tripleee This only prints 2G05 not 02G05. The expression that works is 's/.*\([0-9][0-9]G[0-9][0-9]*\).*/\1/p' Commented Dec 12, 2013 at 10:06
  • 1
    That hard-codes it to exactly two digits. Something like sed -n 's/\(.*[^0-9]\)\?\([0-9][0-9]*G[0-9][0-9]*\).*/\2/p' would be more general. (I assume your sed supports \? for zero or one occurrence.)
    – tripleee
    Commented Dec 12, 2013 at 11:53
  • See also stackoverflow.com/a/48898886/874188 for how to replace various other common Perl escapes like \w, \s, etc.
    – tripleee
    Commented Aug 16, 2019 at 5:28

Try this instead:

echo "This is 02G05 a test string 20-Jul-2012" | sed 's/.* \([0-9]\+G[0-9]\+\) .*/\1/'

But note that if there are two matches on one line, it will print the second.

  • Or more generally the last one if there are multiple matches.
    – tripleee
    Commented Jul 19, 2016 at 13:28
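
The last-match behaviour can be seen with a short sketch. It uses the same pattern written in -E spelling (assuming a sed that accepts -E); the sample line and the last_token helper name are invented for the example.

```shell
#!/bin/sh
# Invented sample line with two space-delimited tokens of the same shape.
two='first 02G05 then 11G07 end'

# The greedy leading .* pushes the capture group to the last candidate
# on the line.
last_token() {
  sed -E 's/.* ([0-9]+G[0-9]+) .*/\1/'
}

printf '%s\n' "$two" | last_token   # prints 11G07
```

Because this command has no -n option or p flag, a line without any match is passed through unchanged rather than dropped.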

sed doesn't recognize \d; use [[:digit:]] instead. You will also need to escape the + or use the -r switch (-E on OS X).

Note that [0-9] works as well for Hindu-Arabic numerals.
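
As a rough sketch of these options, the commands below print the matching line in two ways. The first sticks to basic regular expressions and spells "one or more" as XX*, since an escaped \+ is a GNU extension rather than POSIX. The second assumes a sed that accepts -E.

```shell
#!/bin/sh
line='This is 02G05 a test string 20-Jul-2012'

# Basic regex (BRE): write "one or more digits" as a digit followed by digit*.
printf '%s\n' "$line" | sed -n '/[[:digit:]][[:digit:]]*G[[:digit:]][[:digit:]]*/p'

# Extended regex (ERE): + works unescaped with -E.
printf '%s\n' "$line" | sed -nE '/[[:digit:]]+G[[:digit:]]+/p'
```

Both commands print the whole input line, because p after an address prints the pattern space rather than just the matched text.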


We can use sed -En to simplify the regular expression, where:

-n: suppress automatic printing of pattern space
-E: use extended regular expressions in the script

$ echo "This is 02G05 a test string 20-Jul-2012" | sed -En 's/.*([0-9][0-9]+G[0-9]+).*/\1/p'

02G05

Try using rextract. It will let you extract text using a regular expression and reformat it.

Example:

$ echo "This is 02G05 a test string 20-Jul-2012" | ./rextract '([\d]+G[\d]+)' '${1}'

2G05
  • If this uses standard regex, the square brackets around \d are completely superfluous.
    – tripleee
    Commented Nov 26, 2019 at 6:16
