
I'm new to spaCy and have read the docs on token-based matching. I tried the spaCy Matcher with the REGEX operator, but I don't get any results.

When I use the re library to do the match it works though.

Am I doing something wrong in the code?

I'm trying to match the word "accès'd".

Thanks for your help

# REGEX
import re
text = u"accès'd est ferme aujpourd'hui"
pattern_re = re.compile("^acc?é?e?è?s?s?'?D" , re.I)
pattern_re.match(text)

# <re.Match object; span=(0, 7), match="accès'd">

# REGEX SPACY VERSION 1
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("fr_core_news_sm")

pattern = [{'TEXT': {'REGEX' : "^acc?é?e?è?s?s?'?D"}}]
matcher = Matcher(nlp.vocab)
matcher.add('AccèsD' , None , pattern)


doc = nlp(text)

matches = matcher(doc)
for match_id, start , end in matches:
    match_string = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, match_string, start , end , span.text)

# NOTHING

# REGEX SPACY VERSION 2
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("fr_core_news_sm")

accesd_flag = lambda text : bool(re.compile(r"^acc?é?e?è?s?s?'?D" , re.I).match(text))
IS_ACCESD = nlp.vocab.add_flag(accesd_flag)
pattern=  [{IS_ACCESD : True}]

matcher = Matcher(nlp.vocab)
matcher.add('AccèsD' , None , pattern)

doc = nlp(text)

matches = matcher(doc)
for match_id, start , end in matches:
    match_string = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, match_string, start , end , span.text)


# NOTHING
  • Why don't you pass an on_match function instead of None? You could then use it to help you debug.
    – nmc
    Commented Apr 6, 2019 at 21:52

1 Answer


Regex support for the spaCy Matcher was introduced in version 2.1.0.

From the documentation:

Versions before v2.1.0 don’t yet support the REGEX operator.

This might be the cause: on an older version, the REGEX operator in the pattern is not supported by the matcher.

Otherwise, I believe the problem is that the pattern does not cover every token the text is split into. Modify the regex slightly and use the LOWER attribute so that the pattern covers both tokens:

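import spacy
from spacy.matcher import Matcher

nlp = spacy.load("fr_core_news_sm")
text = u"accès'd est ferme aujpourd'hui"
# Token 1: the "accès" stem (lower-cased); token 2: the "d" that follows it
pattern = [{"LOWER": {"REGEX": "^acc?é?e?è?s?s?"}}, {"LOWER": "d"}]
matcher = Matcher(nlp.vocab)
matcher.add("accesd", None, pattern)  # spaCy 2.x signature, as in the question
doc = nlp(text)
matches = matcher(doc)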

I used LOWER because your original regex had the re.I flag: matching on the LOWER attribute compares against the lower-cased token text, so the pattern behaves case-insensitively.
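To see why the single-token pattern finds nothing, it helps to remember that each entry in a Matcher pattern is tested against one token at a time, not against the raw string. The sketch below uses only the standard re module and a hard-coded token list (the split reported in the comments on this answer) to mimic that behaviour without spaCy. The helper names single_token_hits and two_token_hits are made up for illustration and are not part of spaCy.

```python
import re

# Token split reported in the comments for the example sentence
# (hard-coded so the sketch runs without spaCy or a language model).
tokens = ["accès", "d", "est", "ferme", "aujpourd'", "hui"]

# Original single-token regex: it needs the trailing "d" in the SAME token.
full_re = re.compile(r"^acc?é?e?è?s?s?'?d", re.I)

# Two-token variant: a regex for the stem, followed by a literal "d" token.
stem_re = re.compile(r"^acc?é?e?è?s?s?")


def single_token_hits(tokens):
    """Return indices of tokens that match the original regex on their own."""
    return [i for i, tok in enumerate(tokens) if full_re.search(tok)]


def two_token_hits(tokens):
    """Return (start, end) spans where a stem token is followed by 'd'."""
    spans = []
    for i in range(len(tokens) - 1):
        if stem_re.search(tokens[i].lower()) and tokens[i + 1].lower() == "d":
            spans.append((i, i + 2))
    return spans


print(single_token_hits(tokens))  # []
print(two_token_hits(tokens))     # [(0, 2)]
```

No single token contains both the stem and the "d", so the one-token regex never fires, while the two-token version finds the span covering the first two tokens. With a real pipeline, printing [t.text for t in doc] is a quick way to check how the text was actually split.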

  • Hi Nathan, thanks for your reply. I’ve checked my spaCy version and it’s the latest one.
    – dito
    Commented Apr 6, 2019 at 19:15
  • Oh too bad, thought it could be that.
    – nmc
    Commented Apr 6, 2019 at 21:36
  • I think I found the source of the problem. It's the spaCy tokenizer, which splits the text into ['accès', 'd', 'est', 'ferme', "aujpourd'", 'hui'], and this is why the regex does not give a match.
    – dito
    Commented Apr 6, 2019 at 23:26
  • Oh I see, that's right. I'll add an answer that I believe should work.
    – nmc
    Commented Apr 7, 2019 at 5:31
