Spacy matcher with regex across tokens

Question

I have the following sentences:

phrases = ['children externalize their emotions through outward behavior',
         'children externalize hidden emotions.',
         'children externalize internalized emotions.',
         'a child might externalize a hidden emotion through misbehavior',
         'a kid might externalize some emotions through behavior',
         'traumatized children externalize their hidden trauma through bad behavior.',
         'The kid is externalizing internal traumas',
         'A child might externalize emotions though his outward behavior',
         'The kid externalized a lot of his emotions through misbehavior.']

I want to catch whatever noun comes after the verb externalize (externalizing, externalizes, etc.).

In this case, we should get:

externalize their emotions
externalize hidden emotions
externalize internalized emotions
externalize a hidden emotion
externalize some emotions
externalize their hidden trauma
externalizing internal traumas
externalized a lot of his emotions

So far I am able to catch the noun only if it comes immediately after the verb externalize.

I also want to catch the noun if it comes less than 15 characters after the verb. For example, "externalize a lot of his emotions" should be matched, because " a lot of his " is only 14 characters, counting the spaces.

Here is my attempt, which is far from perfect.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# a verb immediately followed by a noun
verb_noun = [{'POS': 'VERB'}, {'POS': 'NOUN'}]
# spaCy v3 signature; in v2 this was matcher.add('verb_noun', None, verb_noun)
matcher.add('verb_noun', [verb_noun])

list_result = []
for phrase in phrases:
    doc = nlp(phrase)
    for match_id, start, end in matcher(doc):
        lemmas = [token.lemma_ for token in doc[start:end]]
        # keep only matches whose verb is a form of "externalize"
        if 'externaliz' in lemmas[0].lower():
            list_result.append(' '.join(lemmas))

Suppose the only words of interest were "externalize", "externalizing" and "externalized" and you wanted to return the remainder of the string following one of those words and the space following it. For that you could match the regular expression (?:(?<=\bexternalize )|(?<=\bexternalizing )|(?<=\bexternalized )).*. Demo... — Cary Swoveland, Commented Nov 5, 2021 at 19:47
... The problem is limiting the matched string to only a portion of the remainder of the line (e.g., "a lot of his emotions" rather than "a lot of his emotions through misbehavior."). That would require natural language processing, which is well beyond the capability of regular expressions. — Cary Swoveland, Commented Nov 5, 2021 at 20:02

polm23 · Accepted Answer · 2021-11-07 04:55:24Z

I also want to catch the noun if it comes less than 15 characters after the verb. For example, "externalize a lot of his emotions" should be matched, because " a lot of his " is only 14 characters, counting the spaces.

You can do this, though I wouldn't recommend it. You would write a regex to match against the string and use Doc.char_span to turn the matched character offsets into a Span. Since the Matcher works on tokens, a heuristic like "14 characters, including spaces" cannot reasonably be implemented with it. Also, that kind of heuristic is a hack and will perform erratically.
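For illustration only, here is a minimal sketch of the regex plus Doc.char_span route. The helper names (candidate_offsets, noun_spans), the verb pattern, and the reading of "less than 15 characters" as the gap between the end of the verb and the start of the candidate word are all assumptions, not something from the question. noun_spans expects a Doc produced by a pipeline with a part-of-speech tagger, such as en_core_web_sm.

```python
import re

# Assumed verb forms of interest; extend as needed.
VERB_RE = re.compile(r"\bexternaliz(?:e|es|ed|ing)\b", re.IGNORECASE)
WORD_RE = re.compile(r"\w+")
MAX_GAP = 15  # a candidate must start fewer than 15 characters after the verb


def candidate_offsets(text, max_gap=MAX_GAP):
    """Return (start, end) character offsets running from the verb to the end
    of each word that starts fewer than max_gap characters after the verb."""
    offsets = []
    for verb in VERB_RE.finditer(text):
        for word in WORD_RE.finditer(text, verb.end()):
            if word.start() - verb.end() >= max_gap:
                break
            offsets.append((verb.start(), word.end()))
    return offsets


def noun_spans(doc, offsets):
    """Turn character offsets into spaCy Spans and keep those ending in a noun."""
    spans = []
    for start, end in offsets:
        # "expand" snaps offsets that fall inside a token out to token boundaries
        span = doc.char_span(start, end, alignment_mode="expand")
        if span is not None and span[-1].pos_ == "NOUN":
            spans.append(span)
    return spans
```

With a loaded pipeline this would be called as noun_spans(doc, candidate_offsets(doc.text)) for each doc. Note that the regex only proposes candidates; deciding which of them is a noun still relies on the tagger, which is part of why the character-count heuristic buys so little.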

I suspect what you actually want to do is figure out what is being externalized, that is, to find the object of the verb. In that case you should use the DependencyMatcher. Here's an example of using it with a simple rule and merging noun chunks:

import spacy

from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")

texts = ['children externalize their emotions through outward behavior',
         'children externalize hidden emotions.',
         'children externalize internalized emotions.',
         'a child might externalize a hidden emotion through misbehavior',
         'a kid might externalize some emotions through behavior',
         'traumatized children externalize their hidden trauma through bad behavior.',
         'The kid is externalizing internal traumas',
         'A child might externalize emotions though his outward behavior',
         'The kid externalized a lot of his emotions through misbehavior.']

pattern = [
  {
    "RIGHT_ID": "externalize",
    "RIGHT_ATTRS": {"LEMMA": "externalize"}
  },
  {
    "LEFT_ID": "externalize",
    "REL_OP": ">",
    "RIGHT_ID": "object",
    "RIGHT_ATTRS": {"DEP": "dobj"}
  },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("EXTERNALIZE", [pattern])

# this is optional: merge noun phrases into single tokens
nlp.add_pipe("merge_noun_chunks")

# what was externalized?
for doc in nlp.pipe(texts):
    for match_id, tokens in matcher(doc):
        # tokens[0] is the verb ("externalize"), tokens[1] is its object
        print(doc[tokens[1]])

Output:

their emotions
hidden emotions
internalized emotions
a hidden emotion
some emotions
their hidden trauma
internal traumas
emotions
his outward behavior
a lot
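
If you want the whole phrase (verb plus everything attached to the object) rather than just the object token, one option is to slice from the verb through the object's subtree using Token.left_edge and Token.right_edge. This is only a sketch: object_phrase is a hypothetical helper name, and what ends up in the subtree depends entirely on the parse, so results may differ between models and sentences.

```python
def object_phrase(doc, verb_i, obj_i):
    """Return the slice of doc from the verb through the object's full subtree.

    verb_i and obj_i are token indices, e.g. the two entries of `tokens`
    returned by the DependencyMatcher for the pattern above.
    """
    obj = doc[obj_i]
    # left_edge / right_edge are the outermost tokens of the object's subtree
    start = min(verb_i, obj.left_edge.i)
    end = max(verb_i, obj.right_edge.i) + 1
    return doc[start:end]
```

Inside the matching loop this could be used as print(object_phrase(doc, tokens[0], tokens[1]).text).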
