Escaping regex string

Question

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have a special meaning in regex?

For example, the user wants to search for Word (s): the regex engine will take the (s) as a group. I want it to be treated as the literal string "(s)". I could run replace on the user input and replace the ( with \( and the ) with \), but then I would need a replace for every possible regex symbol.

Is there a better way?

what is the usual use for this in the context of regexes and matching patterns/capture groups to big strings? — Charlie Parker, Commented Jul 19, 2022 at 16:15
This is an important question with many valid use cases, but it is important not to use regex where it isn't necessary. If the goal is simply to check whether the text contains some other literal user_input string, that is built in and there is no reason to use regex - simply check whether user_input in text. See Does Python have a string 'contains' substring method?. — Karl Knechtel, Commented Aug 7, 2022 at 9:58
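A minimal sketch of the distinction drawn in the comment above. The sample text and the `user_input` value are made up for illustration. Plain containment covers the exact-substring case; `re.escape()` only becomes useful once the literal has to be combined with regex features, here a case-insensitive search.

```python
import re

text = "Total: 3.50 (approx.) per unit"
user_input = "3.50 (approx.)"  # hypothetical user input

# Exact substring check: no regex and no escaping needed.
found_plain = user_input in text

# Case-insensitive search needs a regex, so the literal must be escaped first.
shouted = text.upper()
found_ci = re.search(re.escape(user_input), shouted, re.IGNORECASE) is not None
```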

200_success · Accepted Answer · 2014-04-16 17:33:37Z

466

Use the re.escape() function for this:

4.2.3 re Module Contents

escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.

import re

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    # re.search scans the whole text; re.match would only match at the start
    return re.search(word_or_plural, text)
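To tie this back to the question, here is a small sketch using the "Word (s)" input from the question. The surrounding text is invented for the example.

```python
import re

user_input = "Word (s)"  # the input from the question
text = "Please define Word (s) before use"  # made-up text to search

# Unescaped, "(s)" is a capture group, so the pattern means "Word s".
unescaped = re.search(user_input, text)

# Escaped, the parentheses are matched literally.
escaped = re.search(re.escape(user_input), text)
```

Here `unescaped` is `None`, because the text contains no "Word s", while `escaped` finds the literal "Word (s)".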

edited Apr 16, 2014 at 17:33

200_success


answered Nov 11, 2008 at 9:37

ddaa


3

I don't understand why this has so many upvotes. It doesn't explain why or when we'd want to use the escape, or even mention why raw strings are relevant, which IMHO is important to make sense of when to use this.
– Charlie Parker
Commented Jul 21, 2022 at 15:01
@CharlieParker A lot of Python canonicals are a mess. I've found it's especially bad for topics related to string escaping, string representation ("why do I get this stuff in the REPL output if I don't use print? Why do I get this other stuff if I do?"), and regular expressions. It needs top-down planning and design, which doesn't come from the organic Stack Overflow question-asking process.
– Karl Knechtel
Commented Oct 16, 2022 at 20:41
@CharlieParker I have seen this being used in the langchain text-splitter api.python.langchain.com/en/latest/_modules/langchain/… in the RecursiveCharacterTextSplitter class.
– bhoomeendra
Commented Jan 9 at 16:04
1

@CharlieParker There are lots of cases where you want to build a regex that looks for some string whose value won't be known until runtime -- say because it comes from user input. You can't just drop the string straight into your regex, because it might contain special characters. For instance, if the input string is "foo.bar", the regex will respond that "foozbar" is a match, because a "." in regex indicates "any character." Using re.escape("foo.bar") returns "foo\.bar", which will match the exact text.
– Dausuul
Commented Mar 18 at 15:42
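The "foo.bar" case from the comment above can be checked directly. This is only a small sketch; the variable names are chosen for illustration.

```python
import re

needle = "foo.bar"
escaped = re.escape(needle)

# Unescaped, "." matches any character, so "foozbar" is accepted.
loose = re.search(needle, "foozbar")

# Escaped, only a literal dot is accepted.
strict = re.search(escaped, "foozbar")
exact = re.search(escaped, "foo.bar")
```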


Neuron · Accepted Answer · 2022-02-17 09:01:17Z

86

You can use re.escape():

re.escape(string) Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'

In Python versions before 3.7, re.escape() also escapes non-alphanumeric characters that have no special meaning in regular expressions.

From Python 3.3 up to (but not including) 3.7, the underscore (_) is the one exception and is left unescaped.
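Whichever characters a given Python version chooses to escape, the result should still match the original text literally. A quick sketch of that property follows; the sample strings and the helper name `matches_itself` are made up for illustration.

```python
import re

samples = ["^a.*$", "Word (s)", "a|b", "C:\\temp\\x", "50% [off]"]

def matches_itself(literal):
    """Return True if the escaped literal matches the original string in full."""
    return re.fullmatch(re.escape(literal), literal) is not None

results = [matches_itself(s) for s in samples]

# For contrast: used unescaped, '^a.*$' does not match its own text,
# because '^' is an anchor and the next character must then be 'a'.
unescaped_self_match = re.fullmatch("^a.*$", "^a.*$")
```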

edited Feb 17, 2022 at 9:01

Neuron


answered Nov 11, 2008 at 9:49

gimel


wouldn't passing a raw string be enough or are you trying to match the literal ^? I usually use re.escape to force it to match things I want matched literally like parens and spaces.
– Charlie Parker
Commented Jul 21, 2022 at 15:16
@CharlieParker the assumption inherent in the question is that we must be able to match literal ^.
– Karl Knechtel
Commented Oct 16, 2022 at 20:42


Owen · Accepted Answer · 2017-02-23 18:06:31Z

12

Unfortunately, re.escape() is not suited for the replacement string (the output below is from a Python version before 3.3, where '_' was still escaped):

>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'

A solution is to put the replacement in a lambda:

>>> re.sub('a', lambda _: '_', 'aa')
'__'

because the return value of the lambda is treated by re.sub() as a literal string.
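On current Python versions the same problem shows up with replacement text that contains a backslash. The sketch below uses a made-up replacement string, `C:\1`, that happens to look like a group reference. It compares the function approach with doubling the backslashes, which is what the `re.escape()` documentation suggests for replacement strings.

```python
import re

replacement = r"C:\1"  # hypothetical literal text that looks like a group reference

# Option 1: a function's return value is inserted as-is, with no template processing.
via_function = re.sub(r"(x)", lambda m: replacement, "x")

# Option 2: double the backslashes so the template yields them literally.
via_escaping = re.sub(r"(x)", replacement.replace("\\", "\\\\"), "x")

# For contrast: passed unchanged, the string is a template and \1 expands to group 1.
as_template = re.sub(r"(x)", replacement, "x")
```

Both options give the literal text `C:\1`, while the unprocessed template gives `C:x`.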

answered Feb 23, 2017 at 18:06

Owen


9

The repl argument to re.sub is a string, not a regex; applying re.escape to it doesn't make any sense in the first place.
– tripleee
Commented Jan 29, 2018 at 6:54
12

@tripleee That's incorrect, the repl argument is not a simple string, it is parsed. For instance, re.sub(r'(.)', r'\1', 'X') will return X, not \1.
– Flimm
Commented Apr 20, 2018 at 13:45
10

Here's the relevant question for escaping the repl argument: stackoverflow.com/q/49943270/247696
– Flimm
Commented Apr 20, 2018 at 13:54
9

Changed in version 3.3: The '_' character is no longer escaped. Changed in version 3.7: Only characters that can have special meaning in a regular expression are escaped. (Why did it take so long?)
– Cees Timmerman
Commented Aug 11, 2018 at 21:58
@Flimm It's the \\1 that returns X, not the fact the second argument is a regex. You could do re.sub(r'(.)', '\\1', 'X') to get the exact same result. As far as I know, there is no reason to use a regex for the second argument. I could only assume it worked for you because it was using the string representation of the regex, and str(r"\1") == "\\1".
– Seth Falco
Commented Jun 19, 2023 at 23:06


milahu · Accepted Answer · 2024-03-10 18:27:47Z

1

re.escape does too much: it also escapes space, backslash, and more.

To escape only the regex special characters ][(){}?*+.^$:

import re

def regex_escape_fixed_string(string):
    """Escape a fixed string for use in a regex."""
    if isinstance(string, bytes):
        return re.sub(rb"[][(){}?*+.^$]", lambda m: b"\\" + m.group(), string)
    return re.sub(r"[][(){}?*+.^$]", lambda m: "\\" + m.group(), string)

assert (
    regex_escape_fixed_string("a[b]c(d)e{f}g?h*i+j.k^l$m") ==
    'a\\[b\\]c\\(d\\)e\\{f\\}g\\?h\\*i\\+j\\.k\\^l\\$m'
)
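One caveat worth checking: the character class above does not include "|" or the backslash, both of which are regex syntax outside of character classes. The sketch below is a hypothetical variant (the name `escape_metachars` is made up, and it handles str only) that adds those two characters, next to the narrower class for comparison.

```python
import re

# Hypothetical variant: same idea as above, with "|" and "\" added to the class.
_META = re.compile(r"[][(){}?*+.^$|\\]")

def escape_metachars(text):
    """Backslash-escape regex metacharacters in a str (non-verbose mode only)."""
    return _META.sub(lambda m: "\\" + m.group(), text)

# The narrower class leaves "|" untouched, so "a|b" still means "a or b".
narrow = re.sub(r"[][(){}?*+.^$]", lambda m: "\\" + m.group(), "a|b")

wide = escape_metachars("a|b")
path = escape_metachars("C:\\dir")
```

Note that neither version escapes the space or "#", so the result is not safe for patterns compiled with re.VERBOSE, which is a case re.escape() does cover.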

edited Mar 10 at 18:27

answered Mar 10 at 15:57

milahu



Charlie Parker · Accepted Answer · 2022-07-21 15:15:00Z

Escaping the string that you feed into a regex makes the regex treat its characters literally. There are two layers of interpretation to keep apart.

The first is Python's own string syntax. When you see \n in your editor, it is just two characters, a backslash followed by n. It only becomes a newline when the Python parser reads it inside a normal string literal. If you write r"\n" instead, Python keeps exactly what you typed (as far as I understand).

The second is the regex syntax. The regex parser interprets the string it receives by its own rules, which differ from those of Python string literals. I believe this is why we are recommended to pass raw strings like r"(\n+)", so that the regex receives what you actually typed. However, the regex will still receive a parenthesis and won't match it as a literal parenthesis unless you tell it to, using the regex's own syntax rules. For that you need something like r"(\fun \( x : nat \) :)". Here the outer parentheses are not matched literally, since without backslashes they form a capture group, but the inner ones are matched as literal parentheses.

Thus we usually call re.escape() on text that we want interpreted literally, i.e. characters that the regex parser would otherwise treat as syntax, such as parentheses. For example, code I have in my app:

    # Escape non-alphanumerics so that _ppt is matched as an arbitrary literal string,
    # e.g. "(" is matched literally instead of opening a group. I think this is here
    # to keep the literal text apart from the regex inserted on the next line.
    __ppt = re.escape(_ppt)

e.g. see these strings:

_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'

the double backslashes I believe are there so that the regex receives a literal backslash.

btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes but I honestly expected only 2 would have been needed due to the raw string r construct.
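On the two open points above, a short sketch with a made-up input "(x)": the doubled backslashes come from how the string is displayed, not from its contents. The REPL and the `=` specifier in f-strings show `repr()`, which writes every backslash twice, while the string itself holds one backslash per escaped character. Matching one literal backslash takes two backslashes at the regex level, written as `r"\\"` in a raw string or `"\\\\"` in a normal one.

```python
import re

escaped = re.escape("(x)")

# repr() doubles each backslash for display; the string itself has only two.
shown_by_repr = repr(escaped)
real_length = len(escaped)  # backslash, "(", "x", backslash, ")"

# Two equivalent spellings of a pattern that matches one literal backslash.
raw_pattern = r"\\"
plain_pattern = "\\\\"
hits = re.findall(raw_pattern, "a\\b")  # the text is: a, backslash, b
```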

Please read How to Answer and note well that this is not a discussion forum. — Karl Knechtel, Commented Oct 16, 2022 at 21:05
