Token-based matching with spaCy REGEX

Question

I'm new to spaCy and I've read the docs about token-based matching. I tried the spaCy Matcher with the REGEX operator, but I don't get any results.

When I use the re library to do the match, it works.

Am I doing something wrong in the code?

I'm trying to match the word "accès'd".

Thanks for your help.

# REGEX
import re
text = u"accès'd est ferme aujpourd'hui"
pattern_re = re.compile("^acc?é?e?è?s?s?'?D" , re.I)
pattern_re.match(text)

# <re.Match object; span=(0, 7), match="accès'd">

# REGEX SPACY VERSION 1
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("fr_core_news_sm")

pattern = [{'TEXT': {'REGEX' : "^acc?é?e?è?s?s?'?D"}}]
matcher = Matcher(nlp.vocab)
matcher.add('AccèsD' , None , pattern)


doc = nlp(text)

matches = matcher(doc)
for match_id, start , end in matches:
    match_string = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, match_string, start , end , span.text)

# NOTHING

# REGEX SPACY VERSION 2
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("fr_core_news_sm")

accesd_flag = lambda text : bool(re.compile(r"^acc?é?e?è?s?s?'?D" , re.I).match(text))
IS_ACCESD = nlp.vocab.add_flag(accesd_flag)
pattern=  [{IS_ACCESD : True}]

matcher = Matcher(nlp.vocab)
matcher.add('AccèsD' , None , pattern)

doc = nlp(text)

matches = matcher(doc)
for match_id, start , end in matches:
    match_string = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, match_string, start , end , span.text)


# NOTHING

Why don't you pass an on_match function instead of None? Then you can use it to help you debug.
– nmc
Commented Apr 6, 2019 at 21:52

nmc · Accepted Answer · 2019-04-07 05:47:36Z


Regex support for the spaCy Matcher was introduced in version 2.1.0.

From the website:

Versions before v2.1.0 don’t yet support the REGEX operator.

Your problem might be related to this: on an older version, the Matcher does not apply the REGEX operator.
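A quick way to rule the version out is to compare it against 2.1.0. The helper below is only a sketch: `supports_regex_operator` is a hypothetical name, not part of spaCy, and it simply ignores pre-release suffixes such as `a4`.

```python
def supports_regex_operator(version: str) -> bool:
    """Return True if a version string such as "2.1.3" is at least 2.1.0.

    Hypothetical helper for illustration; not a spaCy function.
    """
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each component ("0a4" -> "0").
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as "2.1" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (2, 1, 0)
```

With spaCy installed, you would call it as `supports_regex_operator(spacy.__version__)`.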

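If the version is not the cause, I believe the problem is that the pattern does not cover all the tokens involved: the tokenizer splits "accès'd" into more than one token, and each dict in a Matcher pattern describes exactly one token. You should slightly modify the regex and use the LOWER attribute to capture the two-token context.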

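text = u"accès'd est ferme aujpourd'hui"
# One dict per token: a regex on the lowercased first token, then a literal "d"
pattern = [{"LOWER": {"REGEX": "^acc?é?e?è?s?s?"}}, {"LOWER": "d"}]
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this becomes matcher.add("accesd", [pattern])
matcher.add("accesd", None, pattern)
doc = nlp(text)
matches = matcher(doc)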

I used LOWER because you were using the re.I flag: the LOWER attribute holds the lowercased token text, so matching against it makes the pattern case-insensitive.
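To make the two-token logic concrete, here is a plain-Python emulation of what the pattern above asks for. It is a sketch, not spaCy's implementation: `find_two_token_matches` is a hypothetical helper, and the token list is the one reported in the comments rather than the output of a real tokenizer run.

```python
import re

# Tokens as reported in the comments for "accès'd est ferme aujpourd'hui"
# (an assumption; your tokenizer or model may split the text differently).
tokens = ["accès", "d", "est", "ferme", "aujpourd'", "hui"]

# Same regex as the first pattern dict in the answer.
first_token_re = re.compile(r"^acc?é?e?è?s?s?")


def find_two_token_matches(tokens):
    """Return (start, end) index pairs where two consecutive tokens match."""
    matches = []
    for i in range(len(tokens) - 1):
        # Emulates {"LOWER": {"REGEX": ...}}: regex on the lowercased text.
        first_ok = first_token_re.search(tokens[i].lower()) is not None
        # Emulates {"LOWER": "d"}: exact comparison on the lowercased text.
        second_ok = tokens[i + 1].lower() == "d"
        if first_ok and second_ok:
            matches.append((i, i + 2))
    return matches


print(find_two_token_matches(tokens))
```

Because the regex has no `$` at the end, any token that merely starts with "ac" passes the first check; adding `$` would make it stricter.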

edited Apr 7, 2019 at 5:47

answered Apr 6, 2019 at 6:18

nmc


Hi Nathan, thanks for your reply. I've checked my spaCy version and it's the latest one.
– dito
Commented Apr 6, 2019 at 19:15
Oh too bad, thought it could be that.
– nmc
Commented Apr 6, 2019 at 21:36
I think I found the source of the problem. It's the spaCy tokenizer, which splits the text into ['accès', 'd', 'est', 'ferme', "aujpourd'", 'hui'], and this is why the regex does not give a match.
– dito
Commented Apr 6, 2019 at 23:26
Oh I see, that's right. I'll add an answer that I believe should work.
– nmc
Commented Apr 7, 2019 at 5:31
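
The point made in these comments can be checked without spaCy. The sketch below applies the regex from the question once to the whole string and once to each token separately. The token list is copied from the comment above and is an assumption about what the tokenizer produced.

```python
import re

# Regex and text from the question.
pattern_re = re.compile(r"^acc?é?e?è?s?s?'?D", re.I)
text = "accès'd est ferme aujpourd'hui"

# Token list as reported in the comment (not recomputed here).
tokens = ["accès", "d", "est", "ferme", "aujpourd'", "hui"]

# re sees the whole string, so the regex can span "accès", "'" and "d".
full_match = pattern_re.match(text)

# A token-level check only ever sees one token at a time.
token_matches = [tok for tok in tokens if pattern_re.match(tok)]

print(full_match.group(0) if full_match else None)
print(token_matches)
```

No single token contains both the "acc..." prefix and the final "d", so a one-token pattern has nothing to match.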
