
The following link shows how to add a custom entity rule for entities that span more than one token. The code to do that is below:

import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', parse=True, tag=True, entity=True)

animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)

doc = nlp("There is no cat in the house and no artic fox in the basement")

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])
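As a rough picture of what the merge step above does, here is a plain-Python sketch that joins token ranges into single tokens. It is a toy model only, not spaCy's retokenizer: the helper name merge_spans and the hard-coded span indices are assumptions made for illustration.

```python
def merge_spans(tokens, spans):
    """Return a new token list where each (start, end) span becomes one token.

    `spans` are half-open, non-empty, non-overlapping index ranges,
    analogous to ent.start / ent.end in the snippet above.
    """
    end_by_start = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in end_by_start:
            end = end_by_start[i]
            merged.append(" ".join(tokens[i:end]))  # collapse the span
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "There is no cat in the house and no artic fox in the basement".split()
# Hypothetical entity spans: "cat" is token 3, "artic fox" is tokens 9-10.
entity_spans = [(3, 4), (9, 11)]
merged_tokens = merge_spans(tokens, entity_spans)
print(merged_tokens)
```

After merging, "artic fox" occupies a single position in the list, which is the effect the retokenizer has on the entity spans in doc.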

I tried to add another custom entity ruler as follows:

flower = ["rose", "tulip", "african daisy"]
ruler = EntityRuler(nlp)
for f in flower:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
nlp.add_pipe(ruler)

but I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-702f460a866f> in <module>()
      4 for f in flower:
      5     ruler.add_patterns([{"label": "flower", "pattern": f}])
----> 6 nlp.add_pipe(ruler)
      7 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\spacy\language.py in add_pipe(self, component, name, before, after, first, last)
    296                 name = repr(component)
    297         if name in self.pipe_names:
--> 298             raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
    299         if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
    300             raise ValueError(Errors.E006)

ValueError: [E007] 'entity_ruler' already exists in pipeline. Existing names: ['tagger', 'parser', 'ner', 'entity_ruler']

My questions are:

  1. How can I add another custom entity ruler?

  2. Is it best practice to use capital letters for the label? For example, should one write ruler.add_patterns([{"label": "ANIMAL", "pattern": a}]) instead of ruler.add_patterns([{"label": "animal", "pattern": a}])?

1 Answer

You can add another custom entity ruler to your pipeline by changing its name (to avoid a name collision). Here is some code to illustrate, but please read the remark below:

import spacy
from spacy.pipeline import EntityRuler

# Load the model without the statistical NER component
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# One ruler per category
flowers = ["rose", "tulip", "african daisy"]
rulerPlants = EntityRuler(nlp, overwrite_ents=True)
for f in flowers:
    rulerPlants.add_patterns([{"label": "flower", "pattern": f}])

animals = ["cat", "dog", "artic fox"]
rulerAnimals = EntityRuler(nlp, overwrite_ents=True)
for a in animals:
    rulerAnimals.add_patterns([{"label": "animal", "pattern": a}])

# Give each ruler a unique name so that add_pipe does not raise E007
rulerPlants.name = 'rulerPlants'
rulerAnimals.name = 'rulerAnimals'
nlp.add_pipe(rulerPlants)
nlp.add_pipe(rulerAnimals)

doc = nlp("cat and artic fox, plant african daisy")
for ent in doc.ents:
    print(ent.text, '->', ent.label_)

# Output:
# cat -> animal
# artic fox -> animal
# african daisy -> flower

We can verify that the pipeline does contain both entity rulers:

print(nlp.pipe_names)
# ['tagger', 'parser', 'rulerPlants', 'rulerAnimals']
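To see why distinct names matter, the uniqueness check visible in the question's traceback can be modeled in a few lines of plain Python. This is only a sketch of the idea: ToyPipeline is a hypothetical stand-in, not spaCy's Language class, and it tracks names only.

```python
class ToyPipeline:
    """Plain-Python stand-in for a pipeline that enforces unique component names."""

    def __init__(self, names=()):
        self.pipe_names = list(names)

    def add_pipe(self, name):
        # Same idea as the check in the traceback: each name may be used once.
        if name in self.pipe_names:
            raise ValueError(
                f"'{name}' already exists in pipeline. "
                f"Existing names: {self.pipe_names}"
            )
        self.pipe_names.append(name)


pipeline = ToyPipeline(["tagger", "parser"])
pipeline.add_pipe("rulerPlants")
pipeline.add_pipe("rulerAnimals")  # distinct name, accepted

try:
    pipeline.add_pipe("rulerAnimals")  # same name again, rejected
    collision = None
except ValueError as err:
    collision = str(err)

print(pipeline.pipe_names)
print(collision)
```

Two rulers left with the same default name hit the rejected branch, which is what happened with 'entity_ruler' in the question.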

Remark: I would suggest using the simpler and more natural approach of making a new entity ruler which contains the rules of both entity rulers:

rulerAll = EntityRuler(nlp)
rulerAll.add_patterns(rulerAnimals.patterns)
rulerAll.add_patterns(rulerPlants.patterns)

Finally, concerning your question about best practices for entity labels: it is common practice to use abbreviations written in capital letters (see the spaCy NER documentation), for example ORG, LOC, PERSON, etc.
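If you already have patterns with lower-case labels, one possible way to follow that convention is to upper-case the labels before adding the patterns to a ruler. The sketch below works on plain dictionaries; the helper name with_uppercase_labels is hypothetical and not part of spaCy.

```python
def with_uppercase_labels(patterns):
    """Return copies of the pattern dictionaries with their labels upper-cased."""
    return [{**p, "label": p["label"].upper()} for p in patterns]


lowercase_patterns = [
    {"label": "animal", "pattern": "cat"},
    {"label": "flower", "pattern": "african daisy"},
]
uppercase_patterns = with_uppercase_labels(lowercase_patterns)
print(uppercase_patterns)
```

The original list is left untouched, so both versions can be compared before deciding which one to pass to add_patterns.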

Edits following questions:

1) If you do not need spaCy's default Named Entity Recognition (NER), I would suggest disabling it, as that will speed up computation and avoid interference (see discussion about this here). Disabling NER will not cause unexpected downstream results; your document just won't be tagged with the default entities (LOC, ORG, PERSON, etc.).

2) There is this idea in programming that "Simple is better than complex." (see here). There can be some subjectivity as to what constitutes a simpler solution. I would think that a processing pipeline with fewer components is simpler (i.e. the pipeline containing both entity rulers seems more complex to me). However, depending on your needs in terms of profiling, adjustability, etc., it might be simpler for you to have several different entity rulers, as described in the first part of this solution. It would be nice to get the authors of spaCy to give their view on these two design choices.

3) Naturally, the single entity ruler above can be directly created as follows:

rulerAll = EntityRuler(nlp, overwrite_ents=True)
for f in flowers:
    rulerAll.add_patterns([{"label": "flower", "pattern": f}])
for a in animals:
    rulerAll.add_patterns([{"label": "animal", "pattern": a}])

The earlier code for constructing rulerAll is meant to illustrate how we can query an entity ruler for the list of patterns that have been added to it. In practice we would construct rulerAll directly, without first constructing rulerPlants and rulerAnimals, unless we wanted to test and profile these two rulers individually.
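Since add_patterns takes a list of pattern dictionaries, the full list can also be assembled up front and added in one call. A minimal sketch, assuming a hypothetical helper build_patterns that maps each label to its phrases:

```python
def build_patterns(phrases_by_label):
    """Turn {label: [phrase, ...]} into a flat list of pattern dictionaries."""
    return [
        {"label": label, "pattern": phrase}
        for label, phrases in phrases_by_label.items()
        for phrase in phrases
    ]


patterns = build_patterns({
    "flower": ["rose", "tulip", "african daisy"],
    "animal": ["cat", "dog", "artic fox"],
})
print(len(patterns))
print(patterns[0])

# With the ruler from the answer available, the list could then be added at once:
# rulerAll.add_patterns(patterns)
```

Keeping the phrases in one mapping makes it easy to add a new category later without writing another loop.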

  • Thanks for your answer, @DBaker! Would you mind elaborating on why you had the option disable = ['ner'] in your code? Would that cause any unexpected results downstream? I omitted that option and your code still works, so when is it necessary/essential to use disable = ['ner']? Also, why did you say it would be simpler and more natural to use rulerAll? I thought that would be more complex, in that we had to first create rulerAnimals and rulerPlants before writing 3 additional lines of code for rulerAll.
    – Nemo
    Commented Aug 19, 2019 at 2:29
  • Thanks for your feedback @Nemo, I am glad that I could help. I have added an edit to the answer in order to clarify the points mentioned in your comment.
    – DBaker
    Commented Aug 19, 2019 at 14:19
  • Love your added edits, which give me more food for thought (particularly the last sentence of point 2 about the spaCy authors' view). Thank you!
    – Nemo
    Commented Aug 20, 2019 at 2:16
