I tried to find the FRT entity with EntityRuler like this:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

I then got this outcome:

[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('are', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]

Could you please show me how to fix my code so that I get this result:

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

Thank you in advance.

2 Answers

You can fix the code by using this patterns declaration:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

There are two problems: 1) the REGEX operator does not work unless it is nested under a top-level token attribute such as TEXT or LOWER, and 2) the regex itself is wrong because it uses a character class where a grouping construct is needed.

Note that [e|es], being a regex character class, matches a single character: e, s or |. So, if you have an Appl| is red. string, the result will contain ('Appl|', 'FRT'). You need to use either a non-capturing group, (?:e|es), or just es?, which matches an e followed by an optional s.
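
The difference between the character class and the group can be checked with Python's re module alone. This is a minimal sketch: the sample strings are made up for illustration, and fullmatch is used here only to make the comparison strict (it is not a claim about how spaCy applies the pattern).

```python
import re

# "[e|es]" is a character class: exactly ONE character out of 'e', '|', 's'.
char_class = re.compile(r"[Aa]ppl[e|es]")
# "(?:e|es)" is a non-capturing group: the string "e" or the string "es".
group = re.compile(r"[Aa]ppl(?:e|es)")
# "es?" is the shorter equivalent: an "e" followed by an optional "s".
optional_s = re.compile(r"[Aa]pples?")

# Illustrative inputs, including the two false positives of the character class.
samples = ["Apple", "apples", "Appl|", "Appls"]

for text in samples:
    print(
        text,
        bool(char_class.fullmatch(text)),
        bool(group.fullmatch(text)),
        bool(optional_s.fullmatch(text)),
    )
```

The character class accepts Appl| and Appls but rejects apples as a whole word, while the group and the es? form accept only Apple and apples.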

Also, cf. these scenarios:

  • [{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
  • [{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc. and also stapples (a misspelling of staples)
  • [{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES, nor stapples since \b are word boundaries.
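
The three scenarios can be approximated without spaCy. The sketch below assumes that spaCy tests a REGEX pattern with re.search against the chosen token attribute (the token text for TEXT, the lowercased text for LOWER). The helper token_matches and the sample tokens are hypothetical and exist only for this illustration.

```python
import re


def token_matches(token_text, attr, pattern):
    """Hypothetical stand-in for a single-token REGEX pattern.

    Assumption: the regex is applied with re.search to the token attribute.
    """
    value = token_text.lower() if attr == "LOWER" else token_text
    return re.search(pattern, value) is not None


# Illustrative tokens, one per casing plus the "stapples" misspelling.
tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

scenarios = {
    "TEXT [Aa]pples?": ("TEXT", r"[Aa]pples?"),
    "LOWER apples?": ("LOWER", r"apples?"),
    r"TEXT \b[Aa]pples?\b": ("TEXT", r"\b[Aa]pples?\b"),
}

# For each scenario, collect the tokens that the pattern would accept.
results = {
    name: [tok for tok in tokens if token_matches(tok, attr, pattern)]
    for name, (attr, pattern) in scenarios.items()
}

for name, matched in results.items():
    print(name, "->", matched)
```

Under that assumption, the unanchored TEXT pattern also picks up stapples, because a search succeeds anywhere inside the token; only the version with \b on both sides rejects it.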
6
  • Thanks, Wiktor. As always, your explanation is comprehensive and provides additional resources for reference. Going by your 3 scenarios, is there no regex that will match only Apple, apple, Apples, apples, APPLES, aPPleS but WILL ignore stapples (since it's not strictly a fruit of the family Rosaceae)?
    – Nemo
    Commented Aug 27, 2019 at 8:29
  • @Nemo [{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will NOT match stapples due to the word boundaries. Please pay special attention to the r prefix, which allows using a single backslash to define regex escapes like \b.
    Commented Aug 27, 2019 at 8:30
  • Thanks, Wiktor. I always find regex so confusing. Given your obvious knowledge of the topic, could you please recommend a book or resource that would help me get up to speed?
    – Nemo
    Commented Aug 27, 2019 at 8:40
  • @Nemo My usual advice is: do all the lessons at regexone.com for beginners, then read through regular-expressions.info, the regex SO tag description (with many other links to great online resources), and the community SO post called What does the regex mean. rexegg.com is worth having a look at. You should also check the Python re docs.
    Commented Aug 27, 2019 at 8:43
  • @Nemo You do not need a computer science degree. The basic things to learn here are: 1) strings and their representation in Python code, and 2) basic regex constructs. Play with the patterns at regex101.com, follow the [python] and [regex] tags at SO for a month or two, try answering good on-topic questions, and you will learn a lot in no time.
    Commented Aug 27, 2019 at 8:49
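
Two points from the comments above can be checked in plain Python: why the r prefix matters for \b, and how a lowercase comparison combined with word boundaries accepts every casing of apple/apples while rejecting stapples. This is a sketch with the re module only. Pairing the LOWER attribute with the anchored regex in spaCy is an assumption that follows from the scenarios in the answer; it is not stated in the thread.

```python
import re

# Without the r prefix, "\b" is the backspace character (ASCII 8),
# not the two-character regex escape for a word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"
assert len("\b") == 1 and len(r"\b") == 2

# Word boundaries on both sides, written as a raw string.
whole_word = re.compile(r"\bapples?\b")

words = ["Apple", "apple", "Apples", "apples", "APPLES", "aPPleS", "stapples"]

# Lowercasing first mimics the value a LOWER-based pattern would compare against.
accepted = [w for w in words if whole_word.search(w.lower())]
print(accepted)
```

If that assumption holds, the corresponding spaCy pattern would be [{"LOWER" : {"REGEX": r"\bapples?\b"}}], which should cover the case Nemo asked about.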

You have left out the top-level token attribute that your regex is supposed to match against. Since the top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".

Working code

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # spaCy v2 API; in spaCy v3, use ruler = nlp.add_pipe("entity_ruler") instead

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

Output

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

In fact, you can also use the pattern below for apple:

{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "apples?"}}]}

2
  • Thanks for your straight-to-the-point answer, mujjiga. How can I also accept your answer? Can we accept multiple answers?
    – Nemo
    Commented Aug 27, 2019 at 8:44
  • @Nemo Thanks, glad it helped. It is OK :) The accepted answer is much more comprehensive :)
    – mujjiga
    Commented Aug 27, 2019 at 9:56