Complex Regex not working in Spacy entity ruler

Question

I'm trying to identify entities by passing a regular expression (regex) to a spaCy model through the entity ruler, but spaCy does not find any entities with the regex below.

However, I tested the regex here, and it works.

import model_training
import spacy

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe("spacytextblob")

nlp = model_training.train_model_with_regex(nlp)

model_training.py

def train_model_with_regex(nlp):
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    patterns = [
        {
            "label": "VOLUME",
            "pattern": [{"LOWER": {'REGEX': "(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+"}}]
        }
    ]

    ruler.add_patterns(patterns)
    return nlp

This is what I want to achieve for the example below:

text = "I have spent 5 million to buy house and 70 thousand for the furniture"

expected output:

{'result': [
    {'label': 'VOLUME', 'text': '5 million'},
    {'label': 'VOLUME', 'text': '70 thousand'}
]}

You are trying to match several tokens with a single regex, but REGEX is applied to each token separately. — Wiktor Stribiżew, Commented Sep 26, 2022 at 11:40
@WiktorStribiżew Thanks for the response, I just tried with this {"label": "VOLUME", "pattern": [{"LOWER": {'REGEX': r"(?:\d+\s(?:million)*\s*)+"}}]} but still didn't work — Kamal, Commented Sep 26, 2022 at 11:47
Backslashes in strings need to be doubled, or you need to use a raw string r"..." for the regex. — tripleee, Commented Sep 26, 2022 at 11:47
I know you tried it, and it is wrong. You must provide a pattern with REGEXs for several tokens. — Wiktor Stribiżew, Commented Sep 26, 2022 at 12:11

Wiktor Stribiżew · Accepted Answer · 2022-09-26 13:47:01Z

The problem is that your pattern is supposed to match at least two tokens, while the REGEX operator is applied to a single token.
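To see why the original pattern cannot match, here is a minimal sketch that uses only the standard `re` module. Splitting on whitespace is an assumption standing in for spaCy's tokenizer (it happens to give the same tokens for this simple sentence), and `re.search` on each token approximates how the REGEX operator is applied to one token at a time.

```python
import re

# The multi-token regex from the question, written as a raw string.
multi_token = re.compile(r"(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+")

text = "I have spent 5 million to buy house and 70 thousand for the furniture"

# Whitespace splitting is only a stand-in for spaCy's tokenizer here.
tokens = text.split()

# Against the whole string, the regex finds both amounts.
whole_text_matches = [m.group().strip() for m in multi_token.finditer(text)]

# Against individual tokens it finds nothing: the regex requires a
# whitespace character (\s) after the digits, and no token contains one.
per_token_matches = [t for t in tokens if multi_token.search(t)]

print(whole_text_matches)
print(per_token_matches)
```

The regex works in an online tester because the tester sees the full sentence, while the entity ruler only ever hands it one token.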

A solution can look like

"pattern": [
    {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
    {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}}
]

The LIKE_NUM token attribute is defined in the spaCy source code mostly as a string of digits once all dots and commas are removed, so the ^\d+(?:[,.]\d+)*$ pattern looks good enough. It matches a token that starts with one or more digits, followed by zero or more groups of a comma or dot plus one or more digits, up to the end of the token.
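Put together, the two-token pattern can be tried end to end. This is a sketch under some assumptions: it assumes spaCy v3 and uses a blank English pipeline instead of en_core_web_trf, since the entity ruler only needs the tokenizer. The `result` dictionary is just an illustration that mirrors the output format from the question.

```python
import spacy

# Assumption: a blank pipeline is used so no trained model has to be downloaded.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {
        "label": "VOLUME",
        "pattern": [
            # Token 1: digits, optionally grouped with commas or dots (5, 1,000, 2.5).
            {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
            # Token 2: a magnitude word with an optional plural "s".
            {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}},
        ],
    }
])

text = "I have spent 5 million to buy house and 70 thousand for the furniture"
doc = nlp(text)

# Collect the entities in the shape shown in the question.
result = {"result": [{"label": ent.label_, "text": ent.text} for ent in doc.ents]}
print(result)
```

In a trained pipeline, adding the ruler with before="ner", as in the question, means the statistical NER component sees these VOLUME spans as already set and works around them.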
