
I am trying to add stock symbols to the strings recognized as ORG entities. For each symbol, I do:

nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])

I can see that this symbol gets added to the patterns:

print "Patterns:", nlp.matcher._patterns

but any symbols that were not recognized before adding are not recognized after adding. Apparently, these tokens already exist in the vocabulary (that is why the vocab length does not change).

What should I be doing differently? What am I missing?

Thanks

Here is my example code:

"Brief snippet to practice adding stock ticker symbols as ORG entities"

from spacy.en import English
import spacy.en
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import os
import csv
import sys

nlp = English()  #Load everything for the English model

print "Before nlp vocab length", len(nlp.matcher.vocab)

symbol_list = [u"CHK", u"JONE", u"NE", u"DO",  u"ESV"]

txt = u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""
# Full version: u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)"""
before = nlp(txt)
for tok in before:   #Before adding entities
    print tok, tok.orth, tok.tag_, tok.ent_type_

for symbol in symbol_list:
    print "adding symbol:", symbol
    print "vocab length:", len(nlp.matcher.vocab)
    print "pattern length:", nlp.matcher.n_patterns
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])


print "Patterns:", nlp.matcher._patterns
print "Entities:", nlp.matcher._entities
for ent in nlp.matcher._entities:
    print ent.label

tokens = nlp(txt)

print "\n\nAfter:"
print "After nlp vocab length", len(nlp.matcher.vocab)

for tok in tokens:
    print tok, tok.orth, tok.tag_, tok.ent_type_
  • If you are on >1.0 you should add a callback function for each matcher pattern and merge tokens manually. Commented Nov 5, 2016 at 16:17
  • Could you provide a few more details? Commented Nov 9, 2016 at 4:27
  • Thanks for your suggestion, but could you provide a few more details? Where do I add the callback? What does the callback do? How do I merge tokens manually? Sorry, I'm just starting to use Spacy. Thanks, Herb Commented Nov 9, 2016 at 4:34

1 Answer


Here's a working example based on the docs:

import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='stock-nyse', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'NYSE'}]], on_match=merge_phrases)
matcher.add(entity_key='stock-esv', label='STOCK', attrs={}, specs=[[{spacy.attrs.ORTH: 'ESV'}]], on_match=merge_phrases)
doc = nlp(u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)""")
matcher(doc)
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])

->

['drive|', 'double|', '-|', 'digit|', 'rallies|', 'in|', 'Chesapeake|ORG', 'Energy|ORG', '(|', 'NYSE|STOCK', ':|', 'CHK|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'NE|GPE', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'DO|', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'ESV|STOCK', ')|', ',|', '(|', 'NYSE|STOCK', ':|', 'JONE|ORG', ')|']

NYSE and ESV are now marked with the STOCK entity type. Basically, on each match you should manually merge tokens and/or assign the entity types you want. There's also an acceptor function which allows you to filter/reject the matches while they are being matched.
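The reason `merge_phrases` waits until the last match before doing any merging can be shown without spaCy at all: collapsing a span into one token shifts the indices of every span that comes after it. A minimal sketch with plain Python lists (the token list and `merge_all` helper here are stand-ins for illustration, not spaCy's API):

```python
def merge_all(tokens, spans):
    """Collapse each (start, end) span of tokens into a single joined token.

    Merging right-to-left means earlier spans' indices stay valid,
    which mirrors why the spaCy callback defers all merges to the
    last match instead of merging as each match arrives.
    """
    for start, end in sorted(spans, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens

tokens = ["(", "NYSE", ":", "CHK", ")", "(", "NYSE", ":", "NE", ")"]
# Two matches: token indices [1, 4) and [6, 9).
print(merge_all(tokens, [(1, 4), (6, 9)]))
# -> ['(', 'NYSE : CHK', ')', '(', 'NYSE : NE', ')']
```

If the spans were merged left-to-right instead, the second span `(6, 9)` would point at the wrong tokens after the first merge shrank the list.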

  • Thanks. I'm traveling, but I will look at this as soon as I get back. Appreciate your help. Commented Nov 15, 2016 at 17:00
