I would like to match text in spaCy with the following pattern:
- If the word "dénomination" or "denomination" appears, I want to match the next 'MISC' entity (an entity label from spaCy), whatever comes between the two.
For example, in:
text=" Some texte about a company, company number: 254455, Dénomination\n (entire name): NAME_OF_THE_COMPANY , \n, some other informations of the... "
I'd like to extract "NAME_OF_THE_COMPANY", which spaCy recognizes as a MISC entity.
To get the entities with spaCy I do:
for txt in text_file:
    doc = nlp(txt)
    # print the label and text of every entity found in this document
    for ent in doc.ents:
        print(ent.label_, ent.text)
But then I tried many patterns like the one below, without any success:
matcher=Matcher(nlp.vocab)
pattern = [{"REGEX" : "[D|d][é|e]nomination\s{0,}"},{"REGEX" : "[A-Za-z\n\r\s:)]{1,}"},{"ENT_TYPE" : "MISC"}]
matcher.add('company_name', None, pattern)
matches = matcher(doc)
[A-Za-z\n\r\s:)]{1,} will do what you want? It will be applied to a single token, so there is no point in using \n\r\s here. Besides, you can't use REGEX as a top-level key; it has to be nested under a token attribute, like LOWER or TEXT.
MISC type?
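To make the point about nesting REGEX concrete, here is a minimal sketch of one possible approach, not a definitive answer: match only the trigger word with the Matcher, then pick the first MISC entity that starts after it. It assumes spaCy v3, where the call is matcher.add(name, [pattern]) rather than the v2 form matcher.add(name, None, pattern) used in the question. The helper name misc_after_trigger is hypothetical, the sample text is a shortened version of the one in the question, and the MISC entity is set by hand on a blank French pipeline to stand in for whatever the real NER would return.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("fr")  # tokenizer only; no trained model is downloaded

# REGEX is nested under a token attribute (LOWER) and is tested against
# one token at a time, so it only needs to describe the trigger word.
matcher = Matcher(nlp.vocab)
matcher.add("TRIGGER", [[{"LOWER": {"REGEX": "^d[ée]nomination$"}}]])


def misc_after_trigger(doc, label="MISC"):
    """Return the first `label` entity starting after a trigger token, else None."""
    for _match_id, _start, end in matcher(doc):
        for ent in doc.ents:  # doc.ents is ordered by position in the text
            if ent.label_ == label and ent.start >= end:
                return ent
    return None


text = ("company number: 254455, Dénomination (entire name): "
        "NAME_OF_THE_COMPANY , other information")
doc = nlp(text)

# Assumption: stand in for the NER by tagging the name as MISC by hand.
name = "NAME_OF_THE_COMPANY"
start = text.index(name)
doc.ents = [doc.char_span(start, start + len(name), label="MISC")]

company = misc_after_trigger(doc)
print(company.text if company else None)
```

With a trained French pipeline the hand-set entity would not be needed; whether the company name actually gets the MISC label depends on the model, so it is worth printing doc.ents first to check.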