Revisions to Add multiple EntityRuler with spaCy (ValueError: 'entity_ruler' already exists in pipeline)

edited Aug 19, 2019 at 14:22

The code shown above for constructing rulerAll from the other two rulers is meant to illustrate how we can query an entity ruler for the list of patterns that have been added to it. In practice we would construct rulerAll directly, without first constructing rulerPlants and rulerAnimals, unless we wanted to test and profile those two rulers individually.

edited Aug 19, 2019 at 14:15

DBaker

Edits following questions:

1) If you do not need spaCy's default Named Entity Recognition (NER), then I would suggest disabling it, as that will speed up computations and avoid interference (see discussion about this here). Disabling NER will not cause unexpected downstream results (your document just won't be tagged for the default entities LOC, ORG, PERSON, etc.).

2) There is this idea in programming that "Simple is better than complex." (see here). There can be some subjectivity as to what constitutes a simpler solution. I would think that a processing pipeline with fewer components is simpler (i.e. the pipeline containing both entity rulers would seem more complex to me). However, depending on your needs in terms of profiling, adjustability, etc., it might be simpler for you to have several different entity rulers as described in the first part of this solution. It would be nice to get the authors of spaCy to give their view on these two different design choices.
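As a rough sketch of point 1, assuming spaCy v3.x is installed (this answer was written against v2): a component can be switched off by name and switched back on later. The blank pipeline and the untrained "ner" component are stand-ins of mine so that the snippet runs without downloading a model. With a real model you would more likely pass disable=['ner'] to spacy.load.

```python
import spacy

# Stand-in pipeline (assumption): blank English plus an untrained "ner"
# component, so no model download is needed for the illustration.
nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Switch NER off: it stays attached to the pipeline but is skipped
# whenever nlp(text) is called.
nlp.disable_pipe("ner")
active_after_disable = list(nlp.pipe_names)
disabled_after_disable = list(nlp.disabled)

# It can be re-enabled later if the default entities are needed again.
nlp.enable_pipe("ner")
active_after_enable = list(nlp.pipe_names)

print(active_after_disable, disabled_after_disable, active_after_enable)
```

Disabling rather than removing keeps the option open to compare results with and without the default NER.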

Naturally, the single entity ruler above can be directly created as follows:

rulerAll = EntityRuler(nlp, overwrite_ents=True)
for f in flowers:
    rulerAll.add_patterns([{"label": "flower", "pattern": f}])
for a in animals:
    rulerAll.add_patterns([{"label": "animal", "pattern": a}])

The other code shown above for constructing rulerAll is meant to illustrate how we can query an entity ruler for the list of patterns which have been added to it.

edited Aug 18, 2019 at 16:27

DBaker

You can add another custom entity ruler to your pipeline by changing its name (to avoid name collision). Here is some code to illustrate, but please read the remark below:

# Written against spaCy v2.x, where add_pipe accepts a component object.
import spacy
from spacy.pipeline import EntityRuler

# Load the model without its statistical NER so only the rulers set entities
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# First ruler: flowers
rulerPlants = EntityRuler(nlp, overwrite_ents=True)
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    rulerPlants.add_patterns([{"label": "flower", "pattern": f}])

# Second ruler: animals
animals = ["cat", "dog", "arctic fox"]
rulerAnimals = EntityRuler(nlp, overwrite_ents=True)
for a in animals:
    rulerAnimals.add_patterns([{"label": "animal", "pattern": a}])

# Give each ruler a unique name to avoid the 'entity_ruler' name collision
rulerPlants.name = 'rulerPlants'
rulerAnimals.name = 'rulerAnimals'
nlp.add_pipe(rulerPlants)
nlp.add_pipe(rulerAnimals)

doc = nlp("cat and arctic fox, plant african daisy")
for ent in doc.ents:
    print(ent.text, '->', ent.label_)

#output:
#cat -> animal
#arctic fox -> animal
#african daisy -> flower

We can verify that the pipeline does contain both entity rulers:

print(nlp.pipe_names)
# ['tagger', 'parser', 'rulerPlants', 'rulerAnimals']
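For readers on spaCy v3.x, where add_pipe takes the registered factory name as a string instead of a component object, the same idea might look roughly like this. This is a sketch under my own assumptions, not part of the original answer: I use spacy.blank("en") so no model download is needed, and the snake_case variable names are mine.

```python
import spacy

# Blank English pipeline (assumption): tokenizer only, no trained components.
nlp = spacy.blank("en")

# In v3 the ruler is created by its factory name; `name=` gives each
# instance a unique name, which avoids the "already exists" error.
ruler_plants = nlp.add_pipe(
    "entity_ruler", name="rulerPlants", config={"overwrite_ents": True}
)
ruler_plants.add_patterns(
    [{"label": "flower", "pattern": f} for f in ["rose", "tulip", "african daisy"]]
)

ruler_animals = nlp.add_pipe(
    "entity_ruler", name="rulerAnimals", config={"overwrite_ents": True}
)
ruler_animals.add_patterns(
    [{"label": "animal", "pattern": a} for a in ["cat", "dog", "arctic fox"]]
)

doc = nlp("cat and arctic fox, plant african daisy")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)
print(ents)
```

The second ruler only replaces entities that overlap with its own matches, so the flower entity found by the first ruler is kept.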

Remark: I would suggest using the simpler and more natural approach of making a new entity ruler which contains the rules of both entity rulers:

rulerAll = EntityRuler(nlp)
rulerAll.add_patterns(rulerAnimals.patterns)
rulerAll.add_patterns(rulerPlants.patterns)
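A minimal sketch of the single-ruler variant, again assuming spaCy v3.x and a blank English pipeline (both my assumptions, as is the name ruler_all). It also shows how the ruler can be queried for the patterns and labels it currently holds.

```python
import spacy

flowers = ["rose", "tulip", "african daisy"]
animals = ["cat", "dog", "arctic fox"]

nlp = spacy.blank("en")  # assumption: no trained model is needed here

# One ruler holds the rules for both entity types.
ruler_all = nlp.add_pipe(
    "entity_ruler", name="rulerAll", config={"overwrite_ents": True}
)
ruler_all.add_patterns([{"label": "flower", "pattern": f} for f in flowers])
ruler_all.add_patterns([{"label": "animal", "pattern": a} for a in animals])

# Query the ruler for what has been added to it.
patterns = ruler_all.patterns   # list of {"label": ..., "pattern": ...} dicts
labels = set(ruler_all.labels)

doc = nlp("cat and arctic fox, plant african daisy")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(len(patterns), labels, ents)
```

The pipeline then has one component to profile and configure, at the cost of no longer being able to toggle the plant and animal rules separately.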

Finally, concerning your question about best practices for entity labels: it is common practice to use abbreviations written in capital letters (see the spaCy NER documentation), for example ORG, LOC, PERSON, etc.

edited Aug 18, 2019 at 16:17

DBaker

created Aug 18, 2019 at 16:11

DBaker
