I want to build a spaCy model that recognises organisation names. Each organisation name has between 1 and 4 words, which can be title-cased or fully capitalised. I have added more than 3500 organisation names as patterns like this:

patterns = []
for organisation in organisations_list:
    patterns.append({"label": "ORG", "pattern": organisation.strip()})

So now I have a list of patterns that looks like this:

for p in patterns:
    print(p)

Result:

{'label': 'ORG', 'pattern': 'BLS AG'}
{'label': 'ORG', 'pattern': 'Chemins de fer du Jura'}
{'label': 'ORG', 'pattern': 'Comlux'}
{'label': 'ORG', 'pattern': 'CRH Gétaz Group'}
{'label': 'ORG', 'pattern': 'DKSH Management AG'}
{'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'}
{'label': 'ORG', 'pattern': 'Galenica'}
{'label': 'ORG', 'pattern': 'Givaudan'}
{'label': 'ORG', 'pattern': 'Heliswiss'}
{'label': 'ORG', 'pattern': 'Jet Aviation'}
{'label': 'ORG', 'pattern': 'Kolmar'}
...
...

So the patterns object looks like this:

patterns = [{'label': 'ORG', 'pattern': 'BLS AG'},
{'label': 'ORG', 'pattern': 'Chemins de fer du Jura'},
{'label': 'ORG', 'pattern': 'Comlux'},
{'label': 'ORG', 'pattern': 'CRH Gétaz Group'},
{'label': 'ORG', 'pattern': 'DKSH Management AG'},
{'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'},
{'label': 'ORG', 'pattern': 'Galenica'},
{'label': 'ORG', 'pattern': 'Givaudan'},
{'label': 'ORG', 'pattern': 'Heliswiss'},
{'label': 'ORG', 'pattern': 'Jet Aviation'},
{'label': 'ORG', 'pattern': 'Kolmar'}, ...]

Then I created a blank model:

nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler')
ruler.add_patterns(patterns)

And then I tested it like this:

for full_text in list_of_texts:
    doc = nlp(full_text)
    print([(ent.text, ent.label_) for ent in doc.ents])

It does not recognise anything (even when I test it on a sentence that contains the exact name of an organisation). I have also tried adding a tagger and parser to my blank model along with the entity_ruler, but the result is always the same.

These are some of the example texts I used for testing (each company name in the test texts is also in the patterns, with the same capitalisation and spelling):

t1 = "I work in company called DKSH Management AG its very good company"
t2 = "I have stayed in Holiday Inn Express and I really liked it"
t3 = "Have you head for company named AKKA Technologies SE"
t4 = "what do you think about ERYTECH Pharma"
t5 = "did you get an email from ESI Group"
t6 = "Esso S.A.F. sent me an email last week"

What am I doing wrong? I have noticed that it works if I do it like this:

ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe('entity_ruler', before = 'tagger')
# if I print(nlp.pipeline) I can see entity_ruler added before tagger.

But then I do not know whether it works because of my entity_ruler or because of the pre-trained model. I have tested it on 20 example texts and it gives me the same results with and without the entity_ruler, so I can't tell whether it works better or not.

What am I doing wrong?

1 Answer

You're not adding the EntityRuler correctly. You're creating an EntityRuler from scratch and adding rules to it, and then telling the pipeline to create an EntityRuler that's completely unrelated.

This is the problem code:

ruler = EntityRuler(nlp)     # ruler 1
ruler.add_patterns(patterns) # ruler 1
nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler') # this creates an unrelated ruler 2

This is what you should do:

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

That should work.


In spaCy v2 the flow for creating a pipeline component was to create the object and then add it to the pipeline, but in v3 the flow is to ask the pipeline to create the component and then use the returned object.
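
The v3 flow can be checked with a minimal sketch like the following. It assumes spaCy v3 is installed and reuses two organisation names from the question. It confirms that the object returned by add_pipe is the same component the pipeline runs, so patterns added to it are the ones that get used.

```python
import spacy

# Blank English pipeline: tokenizer only, no pre-trained components.
nlp = spacy.blank("en")

# v3 flow: the pipeline creates the component and returns it.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Jet Aviation"},
    {"label": "ORG", "pattern": "Givaudan"},
])

# The returned object is the component stored in the pipeline.
same_object = nlp.get_pipe("entity_ruler") is ruler

print(nlp.pipe_names)       # names of the components in the pipeline
print(same_object)          # whether both names refer to one ruler
print(len(ruler.patterns))  # number of patterns held by that ruler
```

If the pattern count printed here is zero for your own pipeline, the patterns were most likely added to a different ruler object.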


Based on your updated examples, here is example code using the EntityRuler to match the first sentence.

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
  {"label": "ORG", "pattern": "DKSH Management AG"},
  {"label": "ORG", "pattern": "Some other company"},
]
ruler.add_patterns(patterns)

doc = nlp("I work in company called DKSH Management AG its very good company")
print([(ent.text, ent.label_) for ent in doc.ents])
# output: [('DKSH Management AG', 'ORG')]

Does that clarify how you should structure your code?

Looking at your updated question code, your code with the blank model is almost right, but note that add_pipe returns the EntityRuler object. You should add your patterns to that object.
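
On the question of whether a match comes from your rules or from the pre-trained model, one possible approach is to give each pattern an "id". The EntityRuler copies that value to ent_id_ on the entities it creates. The sketch below assumes spaCy v3; the id value "from_ruler" is a made-up marker, and a blank pipeline is used so the example runs without downloading a model. In a pre-trained pipeline, entities predicted by the statistical NER component would normally have an empty ent_id_, but verify that on your own setup.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# "id" is optional; "from_ruler" is a hypothetical marker value.
ruler.add_patterns([
    {"label": "ORG", "pattern": "ESI Group", "id": "from_ruler"},
])

doc = nlp("did you get an email from ESI Group")

# Collect text, label and the id that the ruler attached to each entity.
found = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(found)
```

Entities whose ent_id_ equals your marker were produced by the ruler, which makes it easier to compare results with and without it.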

  • So I do not need this: ruler = EntityRuler(nlp) ? But will the companies that I have added in the pattern be in 'entity_ruler'?
    – taga
    Commented Mar 27, 2021 at 11:40
  • Still it does not work, 0 organizations found.
    – taga
    Commented Mar 27, 2021 at 11:41
  • You should not call EntityRuler(nlp) anywhere. If it's still not working please add a sample of code with one of your patterns and a sentence it should match.
    – polm23
    Commented Mar 27, 2021 at 11:49
  • I have updated code in my question so that it matches yours, and I have added 6 text examples with company names with exact spelling as in patterns.
    – taga
    Commented Mar 27, 2021 at 11:59
  • I think that your solution works only if there is 1 pattern. I copied and pasted your latest code and it works for "patterns = [{"label": "ORG", "pattern": "DKSH Management AG"}] ", but if I build the patterns the way I do (see my first block of code) and add around 3500, then it does not work. It returns an empty tuple.
    – taga
    Commented Mar 27, 2021 at 12:42
