
I would like to match text in spaCy with the following pattern:

  • If the word "dénomination" or "denomination" occurs, I want to match the next 'MISC' entity (the entity label assigned by spaCy), whatever comes between the two.

For example, in:

text=" Some texte about a company, company number: 254455, Dénomination\n (entire name): NAME_OF_THE_COMPANY , \n, some other informations of the... "

I'd like to extract "NAME_OF_THE_COMPANY", which spaCy recognizes as a MISC entity.

To get the entities with spaCy, I do:

for txt in text_file:
    doc = nlp(txt)
    # Print the label and text of every entity spaCy found in this document
    for ent in doc.ents:
        print(ent.label_, ent.text)

But then I tried many patterns like the one below, without any success:

    matcher=Matcher(nlp.vocab)
    pattern = [{"REGEX" : "[D|d][é|e]nomination\s{0,}"},{"REGEX" : "[A-Za-z\n\r\s:)]{1,}"},{"ENT_TYPE" : "MISC"}]
    matcher.add('company_name', None, pattern)
    matches = matcher(doc)
  • Are you sure [A-Za-z\n\r\s:)]{1,} will do what you want? It will be applied to a single token, so there is no point in using \n\r\s here. Besides, REGEX can't be used as a top-level key in a token dict; it has to be nested under a token attribute such as LOWER or TEXT. Commented Sep 13, 2019 at 7:35
  • BTW, which word here do you consider to be of MISC type? Commented Sep 13, 2019 at 9:37
  • Could you provide three "real" samples of data for testing, and also say which model you're using (the default French model?) Commented Sep 13, 2019 at 11:01
  • About [A-Za-z\n\r\s:)]{1,}: I just wanted to accept any character, so "*" could have been OK, I guess. Here is a real sample: """intleur *1930**** Bt oe 29-12-2018 N° d'entreprise: 0716963***0 Objet de I'acte : Constitution Dénomination : (en entier): EDOR (en abrégé): ED Forme juridique: Association sans but lucratif Siége: 125 Chemin d'Odrimont 1380 Lasne (Ohain) Belgique""" I want the name of the company, which in this case is "EDOR". There is sometimes more text between "Dénomination" and the name of the company. Moreover, the name is recognized as a MISC entity by spaCy.
    – Pier Smn
    Commented Sep 17, 2019 at 7:22

1 Answer


A few things to keep in mind:

  • Each dict in the pattern corresponds to one token without surrounding whitespace.

  • You can match any number of intervening tokens with {"OP": "*"}.

  • It's useful to use validate=True with Matcher() to get more feedback when you're working on new patterns.

I think your pattern might look more like:

pattern = [{"LOWER": {"REGEX" : "d[ée]nomination"}}, {"OP": "*"}, {"ENT_TYPE": "MISC"}]
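A side note on the regex, illustrated with Python's own `re` module rather than spaCy: inside a character class, `|` is a literal character and not an alternation, so `[é|e]` would also accept a `|`, while `[ée]` accepts only the two letters. As far as I know, spaCy's REGEX operator searches within the token text, so `^` and `$` are needed if the whole token must match. A minimal sketch (the sample strings and the names `loose` and `strict` are made up for illustration):

```python
import re

# '|' inside [...] is a literal character, not "or"
loose = re.compile(r"d[é|e]nomination")
# Only the two letters, anchored so the whole token has to match
strict = re.compile(r"^d[ée]nomination$")

for token in ["dénomination", "denomination", "d|nomination", "dénominations"]:
    print(token, bool(loose.search(token)), bool(strict.search(token)))
```

Since the pattern matches on LOWER, the regex only has to handle the lowercase form.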

The Matcher looks at the whole document, so in a long document this returns not only the next MISC but a separate match, each starting at "denomination", for every MISC that follows it. You'd have to select the shortest match from the results separately.
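Selecting the shortest match could look roughly like the sketch below. It is plain Python and only assumes that the matches come as (match_id, start, end) tuples, which is the shape `matcher(doc)` returns; the helper name `shortest_per_start` and the sample tuples are hypothetical.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, int, int]  # (match_id, start, end)


def shortest_per_start(matches: List[Match]) -> List[Match]:
    """For every (match_id, start) pair, keep only the match that ends first."""
    best: Dict[Tuple[int, int], Match] = {}
    for match_id, start, end in matches:
        key = (match_id, start)
        if key not in best or end < best[key][2]:
            best[key] = (match_id, start, end)
    # Return the survivors in document order
    return sorted(best.values(), key=lambda m: m[1])


# Made-up matches: "denomination" at tokens 10 and 40, MISC tokens at 17, 23 and 44
matches = [(1, 10, 18), (1, 10, 24), (1, 10, 45), (1, 40, 45)]
print(shortest_per_start(matches))  # [(1, 10, 18), (1, 40, 45)]
```

With real matches, `doc[end - 1]` would then be the MISC token that closes each kept match. If the entity spans several tokens, the shortest match ends at its first token, so the full entity would still have to be looked up in `doc.ents`.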

  • Thank you for your tips. And this is exactly my problem: it matches every MISC after "denomination", whereas I just want the first occurrence after "denomination". Any ideas?
    – Pier Smn
    Commented Sep 17, 2019 at 7:11
  • In that case, stop the {"OP": "*"} token from matching MISC entities by replacing it with {"OP": "*", "ENT_TYPE": {"NOT_IN": ["MISC"]}}. So the final pattern would be pattern = [{"LOWER": {"REGEX" : "d[ée]nomination"}}, {"OP": "*", "ENT_TYPE": {"NOT_IN": ["MISC"]}}, {"ENT_TYPE": "MISC"}]. Commented Oct 14, 2022 at 15:34
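
To try the NOT_IN idea end to end, here is a minimal sketch under a few assumptions: it uses the spaCy v3 call `matcher.add(name, [pattern])` (the question uses the older `matcher.add(name, None, pattern)` form), a blank French pipeline instead of a trained model, and hand-assigned MISC labels on "EDOR" and "ED" standing in for what the NER model would predict. The text is a shortened piece of the sample from the comments.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: tokenizer-only French pipeline, no trained NER component
nlp = spacy.blank("fr")
doc = nlp("Dénomination : (en entier): EDOR (en abrégé): ED")

# Assumption: label the two names as MISC by hand, in place of model predictions
doc.ents = [Span(doc, t.i, t.i + 1, label="MISC") for t in doc if t.text in {"EDOR", "ED"}]

# validate=True makes the Matcher complain about malformed patterns
matcher = Matcher(nlp.vocab, validate=True)
pattern = [
    {"LOWER": {"REGEX": "^d[ée]nomination$"}},
    # Any number of tokens, as long as none of them is part of a MISC entity
    {"ENT_TYPE": {"NOT_IN": ["MISC"]}, "OP": "*"},
    {"ENT_TYPE": "MISC"},
]
matcher.add("company_name", [pattern])

# The last token of each match is the MISC token that closed it
names = [doc[end - 1].text for _, start, end in matcher(doc)]
print(names)
```

Because the wildcard cannot step over a MISC token, the second name "ED" should no longer produce a match of its own.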
