
I am trying to add stock symbols to the strings recognized as ORG entities. For each symbol, I do:

nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])

I can see that this symbol gets added to the patterns:

print "Patterns:", nlp.matcher._patterns

but any symbols that were not recognized before adding are not recognized after adding. Apparently, these tokens already exist in the vocabulary (that is why the vocab length does not change).

What should I be doing differently? What am I missing?

Thanks

Here is my example code:

"Brief snippet to practice adding stock ticker symbols as ORG entities"

from spacy.en import English
import spacy.en
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import os
import csv
import sys

nlp = English()  #Load everything for the English model

print "Before nlp vocab length", len(nlp.matcher.vocab)

symbol_list = [u"CHK", u"JONE", u"NE", u"DO",  u"ESV"]

txt = u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""
# Full version: u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)"""
before = nlp(txt)
for tok in before:   #Before adding entities
    print tok, tok.orth, tok.tag_, tok.ent_type_

for symbol in symbol_list:
    print "adding symbol:", symbol
    print "vocab length:", len(nlp.matcher.vocab)
    print "pattern length:", nlp.matcher.n_patterns
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])


print "Patterns:", nlp.matcher._patterns
print "Entities:", nlp.matcher._entities
for ent in nlp.matcher._entities:
    print ent.label

tokens = nlp(txt)

print "\n\nAfter:"
print "After nlp vocab length", len(nlp.matcher.vocab)

for tok in tokens:
    print tok, tok.orth, tok.tag_, tok.ent_type_
  • If you are on >1.0 you should add a callback function for each matcher pattern and merge tokens manually. Commented Nov 5, 2016 at 16:17
  • Could you provide a few more details? Commented Nov 9, 2016 at 4:27
  • Thanks for your suggestion, but could you provide a few more details? Where do I add the callback? What does the callback do? How do I merge tokens manually? Sorry, I'm just starting to use Spacy. Thanks, Herb Commented Nov 9, 2016 at 4:34

1 Answer


Here's a working example based on the docs:

import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='stock-nyse', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'NYSE'}]], on_match=merge_phrases)
matcher.add(entity_key='stock-esv', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'ESV'}]], on_match=merge_phrases)
doc = nlp(u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)""")
matcher(doc)
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])

->

['drive|', 'double|', '-|', 'digit|', 'rallies|', 'in|', 'Chesapeake|ORG', 'Energy|ORG', '(|', 'NYSE|STOCK', ':|', 'CHK|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'NE|GPE', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'DO|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'ESV|STOCK', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'JONE|ORG', ')|']

NYSE and ESV are now marked with the STOCK entity type. Basically, on each match you should manually merge tokens and/or assign the entity types you want. There's also an acceptor function which allows you to filter/reject the matches while they are being matched.
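The reason `merge_phrases` waits until the last match before doing any merging can be shown without spaCy at all: collapsing a span into one token shifts the indices of every span that comes after it. A minimal sketch with plain Python lists (the token list and `merge_all` helper here are stand-ins for illustration, not spaCy's API):

```python
def merge_all(tokens, spans):
    """Collapse each (start, end) span of tokens into a single joined token.

    Merging right-to-left means earlier spans' indices stay valid,
    which mirrors why the spaCy callback defers all merges to the
    last match instead of merging as each match arrives.
    """
    for start, end in sorted(spans, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens

tokens = ["(", "NYSE", ":", "CHK", ")", "(", "NYSE", ":", "NE", ")"]
# Two matches: token indices [1, 4) and [6, 9).
print(merge_all(tokens, [(1, 4), (6, 9)]))
# -> ['(', 'NYSE : CHK', ')', '(', 'NYSE : NE', ')']
```

If the spans were merged left-to-right instead, the second span `(6, 9)` would point at the wrong tokens after the first merge shrank the list.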

  • Thanks. I'm traveling, but I will look at this as soon as I get back. Appreciate your help. Commented Nov 15, 2016 at 17:00
