I have this code, which works well when I search for exact words.

import spacy

# The statistical NER is disabled, so entities come only from the entity ruler.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "ORG", "pattern": "Google"},
    {"label": "COLOR", "pattern": "yellow"},
    {"label": "COLOR", "pattern": "red"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    # REGEX is matched against the text of a single token
    {"label": "DIN", "pattern": [{"TEXT": {"REGEX": r"DIN\d"}}]},
    {"label": "DIAM", "pattern": [{"TEXT": {"REGEX": r"diameter\d"}}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "stainless"}, {"LOWER": "steel"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "cubitron"}, {"LOWER": "ii"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Google red yellow DIN 789 opening its first big zinc plated office in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])

But the regex is not applied to the whole sentence, only to each token individually.

I tried something like the following to add a new entity, but the new label DIN still does not show up in the output.

import re

from spacy.tokens import Span
from spacy.util import filter_spans

doc = nlp("Google red yellow DIN 180 opening its first big zinc plated office in San Francisco")

pattern = r"DIN\s\d"
original_ents = list(doc.ents)
mwt_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    # char_span returns None when the character offsets do not line up with token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

for start, end, name in mwt_ents:
    original_ents.append(Span(doc, start, end, label="DIN"))

# Drop overlapping spans before assigning, since doc.ents must not overlap
doc.ents = filter_spans(original_ents)
for ent in doc.ents:
    print(ent.text, ent.label_)

What am I doing wrong? How can I add a new regex-based rule to the nlp model that searches the whole input? Thanks!

1 Answer

Since your regexes only need to match numeric tokens, add a separate token for the number to your pattern:

[{"LOWER" : "diameter"}, {"IS_DIGIT": True}]
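To show how that token pattern fits into an entity ruler, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline (tokenizer only) so that no trained model is needed; the DIN pattern and the sample sentence are illustrative additions, not taken from the question.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # One dict per token: the keyword first, then a token made only of digits.
    {"label": "DIN", "pattern": [{"TEXT": "DIN"}, {"IS_DIGIT": True}]},
    {"label": "DIAM", "pattern": [{"LOWER": "diameter"}, {"IS_DIGIT": True}]},
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
])

# Illustrative sentence (an assumption, not from the question).
doc = nlp("DIN 789 zinc plated bolt, diameter 12")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

This works because "DIN" and "789" are separate tokens, so each one gets its own entry in the pattern instead of a single regex that tries to span both.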

How can I add a new regex-based rule to the nlp model that searches the whole input?

The Matcher just doesn't support that. If you want to use regexes against the whole input, you can do that yourself and add the spans directly; you don't need the Matcher.
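A minimal sketch of that approach, assuming spaCy v3 and a blank English pipeline: run the regex over doc.text, convert each match to a Span with doc.char_span, and merge the result into doc.ents. The helper name add_regex_ents is hypothetical. Note that the pattern here ends in \d+ so that the match covers the whole number; with a single \d, a match such as "DIN 1" would stop in the middle of the token "180", and char_span returns None when the offsets do not line up with token boundaries.

```python
import re

import spacy
from spacy.util import filter_spans

# Blank pipeline plus an entity ruler for the phrase-style patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MATERIAL", "pattern": [{"LOWER": "zinc"}, {"LOWER": "plated"}]},
])


def add_regex_ents(doc, pattern, label):
    """Hypothetical helper: label every regex match in doc.text as an entity."""
    new_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        # Returns None if the match does not align with token boundaries.
        span = doc.char_span(start, end, label=label)
        if span is not None:
            new_ents.append(span)
    # filter_spans removes overlaps, which doc.ents does not allow.
    doc.ents = filter_spans(list(doc.ents) + new_ents)
    return doc


doc = add_regex_ents(nlp("Google DIN 180 zinc plated office"), r"DIN\s\d+", "DIN")
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

Because displaCy reads doc.ents, the resulting doc can be passed to displacy.render(doc, style="ent") in the same way as one labeled by the ruler alone.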

  • Hi! Thanks :) I wanted to use the regexes in the ruler so that I can then apply displaCy. Is there some way to combine a regex over the whole input with displaCy? Commented May 25, 2022 at 20:08
  • You don't have to use the ruler for displaCy. Please see the section in the official docs on using regexes without the ruler: spacy.io/usage/rule-based-matching#regex-text
    – polm23
    Commented May 26, 2022 at 5:28
  • I thought I needed to use this for displaCy: `doc2 = nlp(LONG_NEWS_ARTICLE)` followed by `displacy.render(doc2, style="ent")`, where the nlp model is set up with the patterns, which is where I would need to set the regex for the whole input. Do you mean that I don't have to use patterns for regex recognition over the whole input? What I want is to define a list of words or phrases and label them. Some of the phrases are as above, like "stainless steel", but some I want to identify by regex over the whole input and label, e.g. the DIN norm. Is there any solution for this? I am sorry… I am lost :D Commented May 26, 2022 at 9:22
  • displaCy just uses doc.ents. If you follow the documentation I linked to, you can set doc.ents using a regex on the whole Doc, without the ruler. If that's not clear, please ask a new question along the lines of "how do I use regexes on the whole text in spaCy and display the results with displaCy".
    – polm23
    Commented May 26, 2022 at 9:25
  • Once again thanks! I understand it now! Really big thank you! Commented May 26, 2022 at 14:37
