I want to build a spaCy model that recognises organisation names. Each organisation name has between 1 and 4 words, which can be title-cased or fully capitalised. I have added more than 3,500 organisation names as patterns like this:
patterns = []
for organisation in organisations_list:
    patterns.append({"label": "ORG", "pattern": organisation.strip()})
So now I have a list of patterns that looks like this:
for p in patterns:
    print(p)
Result:
{'label': 'ORG', 'pattern': 'BLS AG'}
{'label': 'ORG', 'pattern': 'Chemins de fer du Jura'}
{'label': 'ORG', 'pattern': 'Comlux'}
{'label': 'ORG', 'pattern': 'CRH Gétaz Group'}
{'label': 'ORG', 'pattern': 'DKSH Management AG'}
{'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'}
{'label': 'ORG', 'pattern': 'Galenica'}
{'label': 'ORG', 'pattern': 'Givaudan'}
{'label': 'ORG', 'pattern': 'Heliswiss'}
{'label': 'ORG', 'pattern': 'Jet Aviation'}
{'label': 'ORG', 'pattern': 'Kolmar'}
...
...
So the patterns object looks like this:
patterns = [{'label': 'ORG', 'pattern': 'BLS AG'},
            {'label': 'ORG', 'pattern': 'Chemins de fer du Jura'},
            {'label': 'ORG', 'pattern': 'Comlux'},
            {'label': 'ORG', 'pattern': 'CRH Gétaz Group'},
            {'label': 'ORG', 'pattern': 'DKSH Management AG'},
            {'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'},
            {'label': 'ORG', 'pattern': 'Galenica'},
            {'label': 'ORG', 'pattern': 'Givaudan'},
            {'label': 'ORG', 'pattern': 'Heliswiss'},
            {'label': 'ORG', 'pattern': 'Jet Aviation'},
            {'label': 'ORG', 'pattern': 'Kolmar'}, ....]
Then I created a blank model:
nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler')
ruler.add_patterns(patterns)
Then I tested it like this:
for full_text in list_of_texts:
    doc = nlp(full_text)
    print(doc.ents.text, doc.ents.label_)
It does not recognise anything, even when I test it on a sentence that contains the exact name of an organisation. I have also tried adding a tagger and a parser to my blank model with the entity_ruler, but the result is always the same.
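For reference, here is a minimal, self-contained sketch of this blank-pipeline setup, assuming spaCy v3, where `nlp.add_pipe` returns the component it creates. It reuses two of the organisation names and test sentences from this question, and it also prints how many patterns the ruler inside the pipeline actually holds. Whether a missing assignment of that return value is the cause of the problem here is an assumption, not something confirmed.

```python
import spacy

# Two of the patterns from the list above (the full list has 3,500+ entries).
patterns = [
    {"label": "ORG", "pattern": "DKSH Management AG"},
    {"label": "ORG", "pattern": "ESI Group"},
]

nlp = spacy.blank("en")
# Assumption: spaCy v3, where add_pipe returns the newly created EntityRuler.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Number of patterns held by the ruler that is actually in the pipeline.
num_patterns = len(nlp.get_pipe("entity_ruler").patterns)
print(num_patterns)

texts = [
    "I work in company called DKSH Management AG its very good company",
    "did you get an email from ESI Group",
]

# doc.ents is a tuple of Span objects, so text and label_ are read per entity.
found = []
for doc in nlp.pipe(texts):
    for ent in doc.ents:
        found.append((ent.text, ent.label_))
print(found)
```

If the pattern count printed here is 0 in a real setup, the patterns were probably added to a different ruler object than the one in the pipeline.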
These are some of the example texts that I used for testing (each company name in the test texts is also in the patterns, with the same capitalisation and spelling):
t1 = "I work in company called DKSH Management AG its very good company"
t2 = "I have stayed in Holiday Inn Express and I really liked it"
t3 = "Have you head for company named AKKA Technologies SE"
t4 = "what do you think about ERYTECH Pharma"
t5 = "did you get an email from ESI Group"
t6 = "Esso S.A.F. sent me an email last week"
I have noticed that it works if I do it like this:
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe('entity_ruler', before = 'tagger')
# If I print(nlp.pipeline), I can see entity_ruler added before tagger.
But then I do not know whether it works because of my entity_ruler or because of the pre-trained model. I have tested it on 20 example texts and it gives me the same results with and without the entity_ruler, so I can't tell whether the ruler makes any difference.
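One way to separate the two effects is to run the same text with the ruler enabled and then temporarily disabled, and compare the entities. The sketch below assumes spaCy v3 (`nlp.select_pipes`) and uses a blank pipeline so that it stays self-contained. With `en_core_web_trf`, the same comparison should show which entities remain when only the pre-trained `ner` component is active.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "ESI Group"}])

text = "did you get an email from ESI Group"

# Entities produced with the ruler active.
with_ruler = [(ent.text, ent.label_) for ent in nlp(text).ents]

# Temporarily disable the ruler; it is restored when the block exits.
with nlp.select_pipes(disable=["entity_ruler"]):
    without_ruler = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(with_ruler)
print(without_ruler)
```

If both lists are identical on the pre-trained pipeline, the ruler is contributing nothing for that text, either because the statistical model already finds the entity or because the ruler in the pipeline holds no patterns.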
What am I doing wrong?