In Spacy, how to match a specific entity type, just after a specific word, whatever there is between?

Question

I would like to match text in Spacy with the following pattern:

If there is the word "dénomination" or "denomination", I want to match the next 'MISC' entity (entity name from Spacy), whatever there is between the two.

For example, in:

text=" Some texte about a company, company number: 254455, Dénomination\n (entire name): NAME_OF_THE_COMPANY , \n, some other informations of the... "

I'd like to extract "NAME_OF_THE_COMPANY", which Spacy recognizes as a MISC entity.

To get the entities with Spacy I do:

for txt in text_file:
    doc = nlp(txt)
    # print the label and text of every entity found in this document
    for ent in doc.ents:
        print(ent.label_, ent.text)

I then tried many patterns like the one below, without any success:

    matcher=Matcher(nlp.vocab)
    pattern = [{"REGEX" : "[D|d][é|e]nomination\s{0,}"},{"REGEX" : "[A-Za-z\n\r\s:)]{1,}"},{"ENT_TYPE" : "MISC"}]
    matcher.add('company_name', None, pattern)
    matches = matcher(doc)

Are you sure [A-Za-z\n\r\s:)]{1,} will do what you want? It is applied to a single token, so there is no point in using \n\r\s here. Besides, REGEX cannot be used as a top-level key; it has to be nested inside a token attribute such as LOWER or TEXT.
– Wiktor Stribiżew
Commented Sep 13, 2019 at 7:35
Could you provide three "real" samples of data for testing, and say which model you're using (the default French model)?
– Tiago Duque
Commented Sep 13, 2019 at 11:01
With [A-Za-z\n\r\s:)]{1,} I just wanted to accept any character, so "*" would probably have been fine. Here is a real sample: """intleur *1930**** Bt oe 29-12-2018 N° d'entreprise: 0716963***0 Objet de I'acte : Constitution Dénomination : (en entier): EDOR (en abrégé): ED Forme juridique: Association sans but lucratif Siége: 125 Chemin d'Odrimont 1380 Lasne (Ohain) Belgique""" I want the name of the company, which in this case is "EDOR". There is sometimes more text between "Dénomination" and the name of the company. Moreover, the name is recognized as a MISC entity by Spacy.
– Pier Smn
Commented Sep 17, 2019 at 7:22

aab · Accepted Answer · 2019-09-14 08:26:13Z

A few things to keep in mind:

Each dict in the pattern corresponds to one token without surrounding whitespace.
You can match any number of intervening tokens with {"OP": "*"}.
It's useful to use validate=True with Matcher() to get more feedback when you're working on new patterns.

I think your pattern might look more like:

pattern = [{"LOWER": {"REGEX" : "d[ée]nomination"}}, {"OP": "*"}, {"ENT_TYPE": "MISC"}]

The Matcher looks at the whole document, so on a long document this returns not only the match ending at the next MISC token, but a separate match for "denomination" followed by each later MISC token as well. You'd have to select the shortest match from the results separately.
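Selecting the shortest match can be done in plain Python once the matcher has run. The sketch below assumes the matches are the usual (match_id, start, end) tuples returned by matcher(doc); the helper name shortest_match_per_start and the example tuples are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (match_id, start, end), the tuple shape returned by matcher(doc)
Match = Tuple[int, int, int]


def shortest_match_per_start(matches: Iterable[Match]) -> List[Match]:
    """For each (match_id, start) pair, keep only the match with the smallest end."""
    best: Dict[Tuple[int, int], Match] = {}
    for match_id, start, end in matches:
        key = (match_id, start)
        if key not in best or end < best[key][2]:
            best[key] = (match_id, start, end)
    # Return the surviving matches in document order
    return sorted(best.values(), key=lambda m: m[1])


# Hypothetical matcher output: "denomination" at token 10 matched up to
# three different MISC tokens, plus a second "denomination" at token 50.
example = [(1, 10, 15), (1, 10, 22), (1, 10, 40), (1, 50, 53)]
shortest = shortest_match_per_start(example)
print(shortest)
```

With real matcher output, doc[end - 1] of each remaining match would be the last token of the match, i.e. the MISC token closest to "denomination".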

Thank you for your tips. That is exactly my problem: it matches every MISC after "denomination", whereas I only want the first occurrence after "denomination". Any ideas?
– Pier Smn
Commented Sep 17, 2019 at 7:11
In that case, stop the wildcard from consuming MISC entities by replacing {"OP": "*"} with {"OP": "*", "ENT_TYPE": {"NOT_IN": ["MISC"]}}. The final pattern would be pattern = [{"LOWER": {"REGEX" : "d[ée]nomination"}}, {"OP": "*", "ENT_TYPE": {"NOT_IN": ["MISC"]}}, {"ENT_TYPE": "MISC"}].
– Girishkumar
Commented Oct 14, 2022 at 15:34
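
The pattern from this last comment can be tried without a trained model. The following is only a sketch under several assumptions: it uses the spaCy v3 API (matcher.add(key, [pattern]) rather than the older matcher.add(key, None, pattern) shown in the question), a blank French pipeline, and a hand-built Doc whose words and MISC labels are set manually to stand in for what a real NER model might produce. The second entity, "ACME", is invented for the test.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span

# Blank pipeline: tokenizer and vocab only, no trained NER (assumption for the demo)
nlp = spacy.blank("fr")

# Hand-built document; entity labels are assigned manually below
words = ["Dénomination", ":", "(", "en", "entier", ")", ":", "EDOR",
         "Forme", "juridique", ":", "ACME"]
doc = Doc(nlp.vocab, words=words)
doc.ents = [Span(doc, 7, 8, label="MISC"), Span(doc, 11, 12, label="MISC")]

# validate=True makes the Matcher report malformed patterns
matcher = Matcher(nlp.vocab, validate=True)
pattern = [
    {"LOWER": {"REGEX": "^d[ée]nomination$"}},       # the trigger word
    {"ENT_TYPE": {"NOT_IN": ["MISC"]}, "OP": "*"},   # any number of non-MISC tokens
    {"ENT_TYPE": "MISC"},                            # the first MISC token reached
]
matcher.add("company_name", [pattern])

matches = matcher(doc)
# The last token of each match is the MISC token that ended it
company_names = sorted({doc[end - 1].text for _, _, end in matches})
print(company_names)
```

Because the wildcard is not allowed to step over a MISC token, the later entity cannot be reached from "Dénomination". Note that this pattern captures only a single MISC token; for multi-token entities, the enclosing span would still have to be looked up in doc.ents.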
