In a large corpus of text, I am interested in extracting every sentence which has a specific list of (Verb-Noun) or (Adjective-Noun) somewhere in the sentence. I have a long list but here is a sample. In my MWE I am trying to extract sentences with "write/wrote/writing/writes" and "book/s". I have around 30 such pairs of words.
Here is what I have tried but it's not catching most of the sentences:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')
matcher = Matcher(nlp.vocab)
pattern1 = [{"LEMMA": "write"},{"TEXT": {"REGEX": ".+"}},{"LEMMA": "book"}]
matcher.add("testy", None, pattern)
for sent in doc.sents:
if matcher(nlp(sent.lemma_)):
print(sent.text)
Unfortunately, I am only getting one match:
"While writing this book, he had to fend off aliens and dinosaurs."
Whereas, I expect to get the "He wrote his first book" sentence as well. The other write-books have writer as a noun to its good that its not matching.