added 1643 characters in body
Source Link
DBaker
  • 2.1k
  • 10
  • 15

The other code shown above for constructing rulerAll is meant to illustrate how we can query an entity ruler for the list of patterns that have been added to it. In practice we would construct rulerAll directly, without first constructing rulerPlants and rulerAnimals, unless we wanted to test and profile those two rulers individually.

Edits following questions:

1) If you do not need spaCy's default Named Entity Recognition (NER), then I would suggest disabling it, as that will speed up computations and avoid interference (see discussion about this here). Disabling NER will not cause unexpected downstream results (your document just won't be tagged with the default entities LOC, ORG, PERSON, etc.).

2) There is this idea in programming that "Simple is better than complex." (see here). There can be some subjectivity as to what constitutes a simpler solution. I would think that a processing pipeline with fewer components is simpler (i.e. the pipeline containing both entity rulers seems more complex to me). However, depending on your needs in terms of profiling, adjustability, etc., it might be simpler for you to have several different entity rulers, as described in the first part of this solution. It would be nice to get the authors of spaCy to give their view on these two design choices.

3) Naturally, the single entity ruler above can be created directly as follows:
rulerAll = EntityRuler(nlp, overwrite_ents=True)
for f in flowers:
    rulerAll.add_patterns([{"label": "flower", "pattern": f}])
for a in animals:
    rulerAll.add_patterns([{"label": "animal", "pattern": a}])

The other code shown above for constructing rulerAll is meant to illustrate how we can query an entity ruler for the list of patterns that have been added to it.
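As a rough sketch of the same idea, the pattern dictionaries can be assembled in plain Python first and then handed to the ruler in one go. The helper name `build_patterns` and the `terms_by_label` argument are made up for this illustration and are not part of spaCy:

```python
# Term lists as used elsewhere in this answer.
flowers = ["rose", "tulip", "african daisy"]
animals = ["cat", "dog", "arctic fox"]


def build_patterns(terms_by_label):
    """Return one {"label": ..., "pattern": ...} dict per term (hypothetical helper)."""
    return [
        {"label": label, "pattern": term}
        for label, terms in terms_by_label.items()
        for term in terms
    ]


patterns = build_patterns({"flower": flowers, "animal": animals})
print(len(patterns))  # 6
print(patterns[0])    # {'label': 'flower', 'pattern': 'rose'}
```

Since add_patterns takes a list of such dictionaries, the resulting list could then be passed in a single call, e.g. rulerAll.add_patterns(patterns), instead of one call per term.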


added 76 characters in body
Source Link
DBaker
  • 2.1k
  • 10
  • 15

You can add another custom entity ruler to your pipeline by changing its name (to avoid name collision). Here is some code to illustrate, but please read the remark below:

import spacy
from spacy.pipeline import EntityRuler

# Written against the spaCy v2 API, where add_pipe is given the component object.
# Load the model without the default NER component.
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# First entity ruler: flower names.
flowers = ["rose", "tulip", "african daisy"]
rulerPlants = EntityRuler(nlp, overwrite_ents=True)
rulerPlants.add_patterns([{"label": "flower", "pattern": f} for f in flowers])

# Second entity ruler: animal names.
animals = ["cat", "dog", "arctic fox"]
rulerAnimals = EntityRuler(nlp, overwrite_ents=True)
rulerAnimals.add_patterns([{"label": "animal", "pattern": a} for a in animals])

# Give each ruler a unique name so that both can be added to the pipeline.
rulerPlants.name = 'rulerPlants'
rulerAnimals.name = 'rulerAnimals'
nlp.add_pipe(rulerPlants)
nlp.add_pipe(rulerAnimals)

doc = nlp("cat and arctic fox, plant african daisy")
for ent in doc.ents:
    print(ent.text, '->', ent.label_)

# output:
# cat -> animal
# arctic fox -> animal
# african daisy -> flower

We can verify that the pipeline does contain both entity rulers:

print(nlp.pipe_names)
# ['tagger', 'parser', 'rulerPlants', 'rulerAnimals']

Remark: I would suggest using the simpler and more natural approach of making a new entity ruler which contains the rules of both entity rulers:

# ruler.patterns returns the list of patterns previously added to that ruler.
rulerAll = EntityRuler(nlp)
rulerAll.add_patterns(rulerAnimals.patterns)
rulerAll.add_patterns(rulerPlants.patterns)
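If the two rulers might share some patterns, one option is to merge the pattern lists in plain Python and drop exact duplicates before adding them to the combined ruler. This is only a sketch: `merge_patterns` is a hypothetical helper, and the two sample lists stand in for what rulerAnimals.patterns and rulerPlants.patterns would return.

```python
import json


def merge_patterns(*pattern_lists):
    """Concatenate pattern lists, dropping exact duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for patterns in pattern_lists:
        for entry in patterns:
            # A pattern can be a string or a list of token dicts,
            # so serialize the entry to obtain a hashable key.
            key = json.dumps(entry, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged


# Hypothetical stand-ins for rulerAnimals.patterns and rulerPlants.patterns.
animal_patterns = [
    {"label": "animal", "pattern": "cat"},
    {"label": "animal", "pattern": "dog"},
]
plant_patterns = [
    {"label": "flower", "pattern": "rose"},
    {"label": "animal", "pattern": "cat"},  # duplicate of an entry above
]

all_patterns = merge_patterns(animal_patterns, plant_patterns)
print(len(all_patterns))  # 3
```

The merged list could then be passed to rulerAll.add_patterns(all_patterns) in a single call.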

Finally, concerning your question about best practices for entity labels: it is common practice to use abbreviations written in capital letters (see the spaCy NER documentation), for example ORG, LOC, PERSON, etc.
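To follow that convention with the custom labels used here, the label strings could be normalized before the patterns are built. A small sketch, where `to_label` is a hypothetical helper and joining words with underscores is my own assumption about how to handle multi-word names:

```python
def to_label(name):
    """Turn a free-form category name into an upper-case label, e.g. 'flower' -> 'FLOWER'."""
    # Upper-case the name and join words with underscores (an assumed convention).
    return "_".join(name.strip().upper().split())


flowers = ["rose", "tulip", "african daisy"]
patterns = [{"label": to_label("flower"), "pattern": f} for f in flowers]

print(patterns[0])              # {'label': 'FLOWER', 'pattern': 'rose'}
print(to_label("wild animal"))  # WILD_ANIMAL
```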

