Spacy Regex Phrase Matcher in Python

Question

In a large corpus of text, I want to extract every sentence that contains one of a specific list of (verb, noun) or (adjective, noun) pairs somewhere in the sentence. I have a long list, but here is a sample: in my MWE I am trying to extract sentences with "write/wrote/writing/writes" and "book/books". I have around 30 such pairs of words.

Here is what I have tried but it's not catching most of the sentences:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"TEXT": {"REGEX": ".+"}},{"LEMMA": "book"}]
matcher.add("testy", None, pattern)

for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)

Unfortunately, I am only getting one match:

"While writing this book, he had to fend off aliens and dinosaurs."

I expected to get the "He wrote his first book" sentence as well. The other sentences with "book" use "writer" as a noun, so it is good that they are not matching.

polm23 · Accepted Answer · 2021-05-29 08:53:20Z

The issue is that in the Matcher, by default each dictionary in the pattern corresponds to exactly one token. So your regex doesn't match any number of characters between the two words; it matches exactly one token, which isn't what you want.

To get what you want, you can use the OP value to specify that you want to match any number of tokens. See the operators or quantifiers section in the docs.
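
As a rough illustration of the quantifier approach applied to a whole list of pairs (the question mentions around 30), the sketch below only builds the pattern lists; it does not call spaCy. The helper name `token_patterns` and the sample pairs are hypothetical, and it assumes each result would later be registered with `matcher.add(name, [pattern])` as in spaCy v3.

```python
def token_patterns(pairs):
    """Return {name: pattern} with one Matcher pattern per word pair."""
    patterns = {}
    for first, second in pairs:
        patterns[f"{first}_{second}"] = [
            {"LEMMA": first},   # exactly one token whose lemma is `first`
            {"OP": "*"},        # zero or more arbitrary tokens in between
            {"LEMMA": second},  # exactly one token whose lemma is `second`
        ]
    return patterns


# Hypothetical sample; the real list would hold all of the pairs.
pairs = [("write", "book"), ("draft", "thesis")]
patterns = token_patterns(pairs)
print(patterns["write_book"])
```

A pattern built this way only matches when the first word comes before the second in the sentence, and the wildcard puts no limit on how far apart they are.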

However, given your problem, you probably want to actually use the Dependency Matcher instead, so I rewrote your code to use that as well. Try this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him. 
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around 
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"OP": "*"},{"LEMMA": "book"}]
matcher.add("testy", [pattern])

print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)

print("----- Using Dependency Matcher -----")

deppattern = [
        {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
        {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book", 
            "RIGHT_ATTRS": {"LEMMA": "book"}}
        ]

from spacy.matcher import DependencyMatcher

dmatcher = DependencyMatcher(nlp.vocab)

dmatcher.add("BOOK", [deppattern])

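# Each match is (match_id, token_ids); token_ids follow the order of the
# pattern nodes, so here they are the indices of "write" and "book".
for _, (verb_i, noun_i) in dmatcher(doc):
    print(doc[verb_i].sent)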

One other, less important thing: the way you were calling the matcher was kind of weird. You can pass the matcher Docs or Spans, but they should definitely be natural text. Calling .lemma_ on the sentence and creating a fresh Doc from that worked in your case, but in general it should be avoided.
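
Since the question involves about 30 word pairs, the dependency pattern above could be generated rather than written out by hand. A minimal sketch, assuming every pair is a head word and one of its direct dependents; `dependency_patterns` and the sample pairs are made-up names, and only the pattern dicts are built here, without calling spaCy.

```python
def dependency_patterns(pairs):
    """Return {name: pattern} with one DependencyMatcher pattern per pair."""
    patterns = {}
    for head, child in pairs:
        patterns[f"{head}_{child}"] = [
            # Anchor node: the governing word of the pair.
            {"RIGHT_ID": "head", "RIGHT_ATTRS": {"LEMMA": head}},
            # ">" means the anchor is the immediate syntactic head of this token.
            {"LEFT_ID": "head", "REL_OP": ">", "RIGHT_ID": "child",
             "RIGHT_ATTRS": {"LEMMA": child}},
        ]
    return patterns


# Hypothetical sample pairs, written as (head, dependent).
pairs = [("write", "book"), ("draft", "thesis")]
dep_patterns = dependency_patterns(pairs)
print(dep_patterns["write_book"])
```

Each entry could then be registered with `dmatcher.add(name, [pattern])`. For an adjective-noun pair the noun is usually the head in the parse, so it would go first in the tuple; looking at the parse of a sample sentence is the safest way to confirm the direction.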

Thanks so much for your answer. Reading about DependencyMatcher now on google. Thanks for introducing me to it. However, when I ran your code, I got the following error: "[E098] Invalid pattern specified: expected both SPEC and PATTERN." — Amatya, Commented May 29, 2021 at 16:36
Sounds like you're using spaCy v2. My code is written for v3, I would recommend you upgrade - the Dependency Matcher isn't supported in v2. — polm23, Commented May 30, 2021 at 5:03
Yeah, my spaCy is 2.3.5; I'll upgrade. For the code, I tried this and it worked:
    deppattern = [
        {'SPEC' : {"NODE_NAME": "wrote"}, "PATTERN":{"LEMMA": "write"}},
        # {'SPEC' : {"NODE_NAME": "book"}, "PATTERN":{"LEMMA": "write"}},
        {"SPEC": {"NBOR_NAME": "wrote", "NBOR_RELOP": ">", "NODE_NAME": "book"}, "PATTERN": {"LEMMA": "book"}}
    ]
Thank you! — Amatya, Commented May 30, 2021 at 9:04
Quick question: how would one change the dependency pattern code if, instead of "write > book", I had "draft > PhD Thesis", "write > referee report", or "prepare > some long phrase"? Basically, what if the dependent on the right is not one word but a phrase, which may not necessarily be meaningful in plain English? — Amatya, Commented May 30, 2021 at 11:31
It depends on what the dependency parse looks like. Merging noun chunks might make it simpler, but I suggest you look at the dependency parse for a target sentence and figure out how to break it down. — polm23, Commented May 30, 2021 at 12:53
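
As one possible starting point for the multi-word case raised in the comments, the sketch below builds a dependency pattern for a verb plus a phrase. It rests on an assumption that will not always hold: the last word of the phrase is its head and a direct dependent of the verb, and every other word of the phrase attaches directly to that head. `phrase_pattern` is a hypothetical helper, and matching the phrase words on `LOWER` instead of `LEMMA` is also a choice made here for illustration.

```python
def phrase_pattern(verb_lemma, phrase):
    """Build a DependencyMatcher pattern: verb > phrase head > other words.

    Assumption: the last word of the phrase is its syntactic head and every
    other word is a direct dependent of that head. Check the parse first.
    """
    *modifiers, head = phrase.lower().split()
    pattern = [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": verb_lemma}},
        {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "head",
         "RIGHT_ATTRS": {"LOWER": head}},
    ]
    # One extra node per remaining word, each hanging off the phrase head.
    for i, word in enumerate(modifiers):
        pattern.append(
            {"LEFT_ID": "head", "REL_OP": ">", "RIGHT_ID": f"mod{i}",
             "RIGHT_ATTRS": {"LOWER": word}}
        )
    return pattern


pattern = phrase_pattern("write", "referee report")
for node in pattern:
    print(node)
```

With more than two nodes, each match from the dependency matcher carries one token index per node, so a loop that unpacks exactly two indices would need adjusting.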
