Using RegEx for phrase pattern in EntityRuler

Question

I tried to find the FRT entity with EntityRuler like this:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

I then got this outcome:

[('Apple', 'FRT'), ('is', 'FRT'), ('red', 'FRT'), ('.', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT'), ('is', 'FRT'), ('green', 'FRT'), ('.', 'FRT')]

Could you please show me how to fix my code so that I get this result:

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

Thank you in advance.

Wiktor Stribiżew · Accepted Answer · 2019-08-27 07:49:10Z

8

You need to fix the code by using this patterns declaration:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

There are two problems: 1) the REGEX operator does not work unless you nest it under a token attribute such as TEXT or LOWER, and 2) the regex itself is wrong, because it uses a character class where a grouping construct is needed.

Note that [e|es], being a regex character class, matches a single character: e, s or |. So, given the string Appl| is red., the result will contain ('Appl|', 'FRT'). You need to use either a non-capturing group, (?:e|es), or just es?, which matches an e followed by an optional s.
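The difference between the character class and the two suggested fixes can be checked with Python's re module alone, without spaCy. This is a minimal sketch: the sample strings and the accepted helper are made up for illustration, and fullmatch is used only to show what each construct accepts as a whole word.

```python
import re

# Character class: [e|es] matches exactly ONE character out of 'e', '|', 's'.
char_class = re.compile(r"[Aa]ppl[e|es]")
# Non-capturing group: matches either the alternative 'e' or the alternative 'es'.
group = re.compile(r"[Aa]ppl(?:e|es)")
# Optional quantifier: 'e' followed by an optional 's'.
optional = re.compile(r"[Aa]pples?")

samples = ["Apple", "apples", "Appl|", "Appls"]

def accepted(rx):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if rx.fullmatch(s)]

print("char_class:", accepted(char_class))  # ['Apple', 'Appl|', 'Appls']
print("group:     ", accepted(group))       # ['Apple', 'apples']
print("optional:  ", accepted(optional))    # ['Apple', 'apples']
```

The character class accepts the junk strings Appl| and Appls and rejects apples as a whole word, while the group and the optional quantifier accept exactly Apple and apples.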

Also, cf. these scenarios:

[{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
[{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc. and also stapples (a misspelling of staples)
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES, nor stapples since \b are word boundaries.
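The three scenarios can be approximated in plain Python. The sketch below assumes that REGEX is applied to the chosen token attribute with re.search semantics (an assumption about spaCy's behavior, not something stated above). The found helper and the token list are hypothetical and stand in for a tokenized Doc.

```python
import re

def found(attr, pattern, tokens):
    """Hypothetical stand-in for one token pattern: search `pattern` in each
    token's text (TEXT) or in its lowercased text (LOWER)."""
    rx = re.compile(pattern)
    result = []
    for tok in tokens:
        value = tok.lower() if attr == "LOWER" else tok
        if rx.search(value):
            result.append(tok)
    return result

tokens = ["Apple", "apples", "APPLES", "aPPleS", "stapples"]

print(found("TEXT", r"[Aa]pples?", tokens))      # ['Apple', 'apples', 'stapples']
print(found("LOWER", r"apples?", tokens))        # all five tokens
print(found("TEXT", r"\b[Aa]pples?\b", tokens))  # ['Apple', 'apples']
```

Under the search assumption, the first pattern also finds stapples, because nothing anchors the match to the start of the token. Only the word-boundary version excludes it.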


Thanks, Wiktor. As always, your explanation is comprehensive and provides additional resources for reference. Going by your 3 scenarios, is there no regex that will match only Apple, apple, Apples, apples, APPLES, aPPleS but WILL ignore stapples (since it's not strictly a fruit of the family Rosaceae)?
– Nemo
Commented Aug 27, 2019 at 8:29

@Nemo [{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will NOT match stapples due to word boundaries. Please pay special attention to r prefix that allows using a single backslash to define regex escapes like \b.
– Wiktor Stribiżew
Commented Aug 27, 2019 at 8:30
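The effect of the r prefix can be seen in plain Python. The sketch also tries the word-boundary pattern on lowercased text, which is one possible way (an assumption, mirroring a LOWER pattern with REGEX) to cover every case variant while still skipping stapples. The token list is made up for illustration.

```python
import re

# Without the r prefix, "\b" is a single backspace character (U+0008).
assert "\b" == "\x08"
# With the r prefix, the backslash is kept, so the regex engine sees a word boundary.
assert r"\b" == "\\b"

word = re.compile(r"\bapples?\b")   # word boundaries around apple/apples
broken = re.compile("\bapples?\b")  # literal backspaces around apple/apples

tokens = ["Apple", "apple", "Apples", "apples", "APPLES", "aPPleS", "stapples"]

# Lowercase each token first, as a LOWER-based pattern would.
matched = [t for t in tokens if word.search(t.lower())]
print(matched)  # every token except 'stapples'

# The non-raw pattern never matches ordinary words.
print(broken.search("apples"))  # None
```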
Thanks, Wiktor. I always find RegEx so confusing. Given your obvious knowledge of the topic, could you please recommend a book/resource that would help me get up to speed?
– Nemo
Commented Aug 27, 2019 at 8:40

@Nemo My usual advice is: I can suggest doing all lessons at regexone.com for beginners, reading through regular-expressions.info, regex SO tag description (with many other links to great online resources), and the community SO post called What does the regex mean. rexegg.com is worth having a look at. You should also check Python re docs.
– Wiktor Stribiżew
Commented Aug 27, 2019 at 8:43

@Nemo You do not need any computer science degree. The basic things to learn here are: 1) strings and their representation in Python code, 2) basic regex constructs. Play with the patterns at regex101.com, follow the [python] [regex] tags at SO for a month or two, try answering good on-topic questions and you will learn a lot in no time.
– Wiktor Stribiżew
Commented Aug 27, 2019 at 8:49


mujjiga · Answer · 2019-08-27 07:17:26Z

4

You are missing the top-level token attribute that the regex should be matched against. Since the top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".
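A mistake like this can be caught before the patterns reach the ruler. The following is a hypothetical pure-Python check, not part of spaCy: misplaced_regex is an illustrative name, and it only looks for a REGEX key sitting directly in a token dict.

```python
def misplaced_regex(patterns):
    """Return (label, token_index) for every token pattern that has REGEX
    at the top level instead of nested under an attribute like TEXT or LOWER."""
    problems = []
    for entry in patterns:
        pattern = entry["pattern"]
        if isinstance(pattern, str):  # phrase patterns are plain strings
            continue
        for i, token in enumerate(pattern):
            if "REGEX" in token:
                problems.append((entry["label"], i))
    return problems

bad = [{"label": "FRT", "pattern": [{"REGEX": "[Aa]pples?"}]},
       {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]
good = [{"label": "FRT", "pattern": [{"TEXT": {"REGEX": "[Aa]pples?"}}]}]

print(misplaced_regex(bad))   # [('FRT', 0)]
print(misplaced_regex(good))  # []
```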

Working code

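from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
# REGEX must be nested under a token attribute such as TEXT or LOWER.
# "es?" is used instead of the character class "[e|es]", which would also match "Appl|".
patterns = [{"label": "FRT", "pattern": [{"TEXT": {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
# spaCy 2.x API; in spaCy 3.x, create the ruler with nlp.add_pipe("entity_ruler") instead.
nlp.add_pipe(ruler)

doc = nlp("Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])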

Output

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

In fact, you can also use the pattern below for apple:

{"label": "FRT", "pattern": [{"LOWER": {"REGEX": "apples?"}}]}


Thanks for your straight-to-the-point answer, mujjiga. How can I also accept your answer? Can we accept multiple answers?
– Nemo
Commented Aug 27, 2019 at 8:44
@Nemo Glad it helped. It is OK :) The accepted answer is much more comprehensive :)
– mujjiga
Commented Aug 27, 2019 at 9:56
