Add new pattern in Entity Ruler Spacy with regex in multiple tokens

Question

I have this code that works well if I try to search exact words.

import spacy

# nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "ORG", "pattern": "Google"},
    {"label": "COLOR", "pattern": "yellow"},
    {"label": "COLOR", "pattern": "red"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    # Each dict in a pattern list describes a single token.
    {"label": "DIN", "pattern": [{"TEXT": {"REGEX": r"DIN\d"}}]},
    {"label": "DIAM", "pattern": [{"TEXT": {"REGEX": r"diameter\d"}}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "cubitron"}, {"LOWER": "ii"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Google red yellow DIN 789 opening its first big zinc plated office in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])

But the regex doesn't work across the whole sentence, only within each individual token.

I tried something like this to add the new entity, but the new DIN label still doesn't show up in the output.

import re

from spacy.tokens import Span
from spacy.util import filter_spans

doc = nlp("Google red yellow DIN 180 opening its first big zinc plated office in San Francisco")

pattern = r"DIN\s\d"
original_ents = list(doc.ents)
mwt_ents = []
# Search the raw text and map each character match back to a token span.
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

# Turn each match into a labelled Span and add it to the existing entities.
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="DIN")
    original_ents.append(per_ent)

doc.ents = original_ents

# Remove overlapping spans before the final assignment.
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print(ent.text, ent.label_)

What am I doing wrong? How can I add a new regex-based rule to the nlp model that searches the whole input? Thanks!

polm23 · Accepted Answer · 2022-05-25 04:48:21Z

Since your regexes are just for numeric tokens, just add a new token to your pattern.

[{"LOWER" : "diameter"}, {"IS_DIGIT": True}]
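
As a minimal sketch of that suggestion, the token pattern can be dropped into the entity ruler as shown here. The blank English pipeline, the sample sentence, and the extra DIN pattern built the same way are assumptions made for illustration, not part of the original answer.

```python
import spacy

# A blank English pipeline is enough here: the entity ruler only needs the tokenizer.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # One dict per token: the word itself, then a token made up only of digits.
    {"label": "DIAM", "pattern": [{"LOWER": "diameter"}, {"IS_DIGIT": True}]},
    {"label": "DIN", "pattern": [{"LOWER": "din"}, {"IS_DIGIT": True}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
])

# Hypothetical sample text for illustration.
doc = nlp("Zinc plated bolt DIN 789 with diameter 12")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Because "DIN" and "789" are separate tokens, a two-token pattern covers both of them, which a regex on a single token's TEXT cannot do.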

"How can I add to the nlp model new rule based on regex that searches in the whole input?"

The Matcher (and therefore the entity ruler, which is built on it) doesn't support that: its patterns are matched token by token. If you want to run regexes against the whole input, you can do that yourself and add the resulting spans to doc.ents directly; you don't need the Matcher.
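
A minimal sketch of doing that by hand follows, assuming a blank English pipeline and a shortened sample sentence. One likely reason the attempt in the question finds nothing is that r"DIN\s\d" matches only the first digit of "180", so the match ends inside a token and doc.char_span returns None. Using \d+ makes the match end on a token boundary.

```python
import re

import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Google red yellow DIN 180 opening its first big office")

# \d+ consumes the whole number, so the match ends on a token boundary.
din_pattern = re.compile(r"DIN\s\d+")

new_ents = []
for match in din_pattern.finditer(doc.text):
    start, end = match.span()
    # char_span returns None when the character offsets fall inside a token.
    span = doc.char_span(start, end, label="DIN")
    if span is not None:
        new_ents.append(span)

# filter_spans removes overlapping spans, preferring longer ones.
doc.ents = filter_spans(list(doc.ents) + new_ents)
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)

# For comparison: the single-digit pattern from the question stops mid-token.
short_match = re.search(r"DIN\s\d", doc.text)
misaligned = doc.char_span(*short_match.span())
```

Passing label="DIN" to char_span avoids building a second Span object by hand.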


Hi! Thanks :) I wanted to use the regexes in the ruler so that I can then apply displaCy. Is there some way to combine a regex over the whole input with displaCy?
– HeadOverFeet
Commented May 25, 2022 at 20:08
You don't have to use the ruler for displacy. Please see the section in the official docs on using regexes without the ruler. spacy.io/usage/rule-based-matching#regex-text
– polm23
Commented May 26, 2022 at 5:28
I thought I needed to use this for displaCy: doc2 = nlp(LONG_NEWS_ARTICLE); displacy.render(doc2, style="ent"), where the nlp model is set up with the patterns, which is where I would need to put the regex for the whole input. Do you mean that I don't have to use patterns for regex recognition over the whole input? What I want is to define a list of words or phrases and label them. Some of the phrases are like the ones above, e.g. stainless steel, but others I want to identify with a regex over the whole input and then label, e.g. a DIN norm. Is there any solution for this? I am sorry… I am lost :D
– HeadOverFeet
Commented May 26, 2022 at 9:22
displaCy just uses doc.ents. If you follow the documentation I linked to you can set doc.ents using a regex on the whole Doc, without the ruler. If that's not clear please ask a new question that's like "how do I use regexes on the whole text in spaCy and display the results with displaCy".
– polm23
Commented May 26, 2022 at 9:25
Once again thanks! I understand it now! Really big thank you!
– HeadOverFeet
Commented May 26, 2022 at 14:37
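
For readers who land here with the same goal, one possible way to combine the two approaches is sketched below: the entity ruler handles the fixed phrases, and a small custom pipeline component adds the regex matches, so nlp(text) returns a Doc that displaCy can render directly. The component name din_regex, the pattern, and the sample sentence are hypothetical choices for illustration and do not come from the thread.

```python
import re

import spacy
from spacy import displacy
from spacy.language import Language
from spacy.util import filter_spans

DIN_PATTERN = re.compile(r"DIN\s\d+")  # hypothetical pattern for DIN norms


@Language.component("din_regex")  # hypothetical component name
def din_regex(doc):
    """Add DIN entities found by a regex over the whole text."""
    spans = []
    for match in DIN_PATTERN.finditer(doc.text):
        span = doc.char_span(*match.span(), label="DIN")
        if span is not None:  # skip matches that do not align with tokens
            spans.append(span)
    # Merge with entities set by earlier components, dropping overlaps.
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
])
nlp.add_pipe("din_regex", last=True)

doc = nlp("Stainless steel screw DIN 912 in stock")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)

# jupyter=False makes render return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
```

Since displaCy only reads doc.ents, it makes no difference whether an entity came from the ruler or from the regex component.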
