This link shows how to create a custom entity ruler.

I basically copied and modified the code for another custom entity ruler and used it to find a match in a doc as follows:

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_lg')
ruler = EntityRuler(nlp)

grades = ["Level 1", "Level 2", "Level 3", "Level 4"]
for item in grades:
    ruler.add_patterns([{"label": "LEVEL", "pattern": item}])

nlp.add_pipe(ruler)

doc = nlp('Level 2 employee first 12 months 1032.70')

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])

matcher = Matcher(nlp.vocab)
pattern = [{'ENT_TYPE': {'REGEX': 'LEVEL'}}, {'ORTH': 'employee'}]
matcher.add('PAY_LEVEL', None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

However, when I run the code (in a Jupyter notebook), nothing is returned.

Could you please tell me:

  1. If the code returns nothing, does it mean no match was found?

  2. Why couldn't my code find a match although it's almost identical to the original (except for the patterns added to the ruler)? What did I do wrong?

Thank you.

1 Answer

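The problem is an interaction between the NER component provided in the English model and your EntityRuler component. Because nlp.add_pipe(ruler) appends the ruler to the end of the pipeline, it runs after NER. By then the NER component has already tagged 2 as a number (CARDINAL), and entities aren't allowed to overlap, so by default the EntityRuler skips its "Level 2" match. No LEVEL entity is created, and the Matcher pattern has nothing to match.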

You can either add your EntityRuler before the NER component:

nlp.add_pipe(ruler, before='ner')

Or tell the EntityRuler that it's allowed to overwrite existing entities:

ruler = EntityRuler(nlp, overwrite_ents=True)
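
The interaction described above can be illustrated with a small toy model in plain Python. This is only a sketch of the overlap rule, not spaCy's actual implementation: apply_ruler is a hypothetical helper, spans are (start, end, label) token offsets with an exclusive end, and the only NER entity assumed is the CARDINAL on "2" mentioned above.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def apply_ruler(existing, matches, overwrite=False):
    """Toy model of merging rule-based matches into existing entities.

    overwrite=False: a match is dropped if it overlaps an existing entity.
    overwrite=True:  existing entities that overlap a match are dropped.
    """
    if overwrite:
        kept = [e for e in existing if not any(overlaps(e, m) for m in matches)]
        return sorted(kept + matches)
    new = [m for m in matches if not any(overlaps(m, e) for e in existing)]
    return sorted(existing + new)


# Tokens: Level(0) 2(1) employee(2) first(3) 12(4) months(5) 1032.70(6)
ner_ents = [(1, 2, "CARDINAL")]     # assumed NER output: "2"
ruler_matches = [(0, 2, "LEVEL")]   # rule-based match: "Level 2"

# Ruler runs after NER without overwriting: the LEVEL match is lost.
print(apply_ruler(ner_ents, ruler_matches))

# Ruler runs after NER with overwriting: LEVEL replaces CARDINAL.
print(apply_ruler(ner_ents, ruler_matches, overwrite=True))

# Ruler runs before NER: there is nothing to collide with yet.
print(apply_ruler([], ruler_matches))
```

In the first case the LEVEL span never becomes an entity, which is why the Matcher pattern on ENT_TYPE finds nothing. Either fix above gets the LEVEL span into the entity list.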

  • A third possibility is to disable the default NER component if you are not using it: nlp = spacy.load('en_core_web_sm', disable=['ner'])
    – DBaker
    Commented Aug 18, 2019 at 14:41
  • Thanks to both @aab and DBaker for the great tips! This leads to another question: which method is best? According to the spaCy documentation, one should add a custom component to the pipeline as late as possible, so I guess the overwrite_ents=True option is better than the before='ner' option? On the other hand, would overwrite_ents=True or disable=['ner'] lead to undesirable results downstream?
    – Nemo
    Commented Aug 19, 2019 at 1:38
  • It depends entirely on your task/goals: are you using the built-in NER component at all in your analysis? If not, disable it and then you only have your custom entities and don't need to worry about the interactions. Otherwise, you have to figure out which entities you need and which component should have priority.
    – aab
    Commented Aug 19, 2019 at 7:26
  • I agree with your advice, @aab. The trick is to decide which component should have priority. Do you have any rules of thumb for it? An example (even a contrived one) would be appreciated.
    – Nemo
    Commented Aug 21, 2019 at 2:55
