
I'm trying to identify entities by passing a regular expression (regex) to a spaCy model using the EntityRuler, but spaCy does not recognize any entities with the regex below.

However, I tested the regex here and it works.

import model_training
import spacy

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe("spacytextblob")

nlp = model_training.train_model_with_regex(nlp)

model_training.py

def train_model_with_regex(nlp):
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    patterns = [
        {
            "label": "VOLUME",
            "pattern": [{"LOWER": {'REGEX': "(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+"}}]
        }
    ]

    ruler.add_patterns(patterns)
    return nlp

I want to achieve the following for the example below:

text = "I have spent 5 million to buy house and 70 thousand for the furniture"

expected output:

{'result': [
    {'label': 'VOLUME', 'text': '5 million'},
    {'label': 'VOLUME', 'text': '70 thousand'}
]}
  • You are trying to match several tokens with a single regex, but REGEX is applied to each token separately.
    Commented Sep 26, 2022 at 11:40
  • @WiktorStribiżew Thanks for the response, I just tried with this {"label": "VOLUME", "pattern": [{"LOWER": {'REGEX': r"(?:\d+\s(?:million)*\s*)+"}}]} but still didn't work
    – Kamal
    Commented Sep 26, 2022 at 11:47
  • Backslashes in strings need to be doubled, or you need to use a raw string r"..." for the regex.
    – tripleee
    Commented Sep 26, 2022 at 11:47
  • @tripleee Yes, I tried but didn't work
    – Kamal
    Commented Sep 26, 2022 at 11:51
  • I know you tried it, and it is wrong. You must provide a pattern with a REGEX for each of several tokens.
    Commented Sep 26, 2022 at 12:11

1 Answer


The problem is that your pattern is supposed to match at least two tokens, while the REGEX operator is applied to a single token.
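The effect can be reproduced without spaCy. The sketch below is only an approximation: it uses `str.split()` as a rough stand-in for spaCy's tokenizer (an assumption, since the real tokenizer is more elaborate) and applies the original regex once to the whole string and once to each token on its own.

```python
import re

text = "I have spent 5 million to buy house and 70 thousand for the furniture"

# The regex from the question: it expects digits, whitespace, and a unit word
# to appear together in the string it is matched against.
multi_token = re.compile(r"(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+")

# Matched against the full text, the regex finds both phrases.
whole_text_hits = [m.group().strip() for m in multi_token.finditer(text)]

# Matched against one token at a time (whitespace split as a stand-in for
# spaCy's tokenizer), it finds nothing: no single token contains whitespace.
tokens = text.split()
per_token_hits = [tok for tok in tokens if multi_token.search(tok)]

print(whole_text_hits)
print(per_token_hits)
```

This is why the regex works in an online tester, which sees the whole string, but not inside a token pattern, where each dictionary describes exactly one token.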

A solution can look like

"pattern": [
    {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
    {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}}
]

The LIKE_NUM token attribute is implemented in the spaCy source code mostly as a check that the token is a string of digits once all dots and commas are removed, so the ^\d+(?:[,.]\d+)*$ pattern looks good enough. It matches a token that starts with one or more digits, followed by zero or more groups consisting of a comma or dot and one or more digits, up to the end of the token.
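For completeness, here is a minimal end-to-end sketch of the two-token pattern. It assumes spaCy v3 and uses a blank English pipeline rather than en_core_web_trf so that it runs without downloading a model. A blank pipeline has no "ner" component, so the ruler is simply appended; with a trained pipeline you would keep before="ner" as in the question.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to show the rule.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One dictionary per token: a number token followed by a unit-word token.
ruler.add_patterns([
    {
        "label": "VOLUME",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
            {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}},
        ],
    }
])

text = "I have spent 5 million to buy house and 70 thousand for the furniture"
doc = nlp(text)

# Collect the entities in the shape the question asks for.
result = {"result": [{"label": ent.label_, "text": ent.text} for ent in doc.ents]}
print(result)
```

Note that TEXT is case-sensitive, so "5 Million" would not match; switching the second token to LOWER is one way to relax that.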
