
I am using spaCy's phrase matching for a large number of rules. Some of these are company stock symbols, which appear in a form such as:

AAPL.NY BB.TX

Or these could appear as AAPL or BB.

When phrasematching I have been using two patterns to get these matches:

{"label": "TICKER", "pattern": [{"ORTH": {"REGEX": "AAPL\\.[A-Z]{2,3}"}}]}
{"label": "TICKER", "pattern": [{"ORTH": "AAPL"}]}

Is ORTH the right attribute to use with REGEX? It sometimes gives surprising results: it will capture something like AAPL.HSHSHSJSKKSKKS, even though that has far more than the 2 or 3 characters allowed by {2,3}.

Could anyone help me with a) whether using ORTH makes sense here, and b) how to restrict the REGEX so that it accepts only 2 or 3 characters after the period?


1 Answer


ORTH (short for orthography) predates TEXT, which was introduced in spaCy 2.1. Both refer to the exact token text, so ORTH still works, but for regex matching it is better to apply the pattern to TEXT.

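As for the regex itself, mind that it is searched for within the token text, so a partial match is enough. That is why AAPL.HSHSHSJSKKSKKS is captured: the pattern matches the AAPL.HSH part at the start of the token. To make the pattern match the entire token text, you need to use anchors, ^ and $ (or \A and \Z in Python's re syntax).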
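To see the effect of the anchors in isolation, here is a minimal sketch using only Python's re module. It assumes spaCy tests each token with a search rather than a full match; the token XAAPL.NYC is a made-up example that is not in the question.

```python
import re

# Token texts from the question, plus one hypothetical token (XAAPL.NYC)
# to show that an unanchored pattern can also match in the middle of a token.
tokens = ["AAPL.NY", "AAPL.HSHSHSJSKKSKKS", "AAPL", "XAAPL.NYC"]

unanchored = re.compile(r"AAPL\.[A-Z]{2,3}")
anchored = re.compile(r"^AAPL\.[A-Z]{2,3}$")

# search() succeeds if the pattern matches anywhere in the string,
# so only the anchored version rejects the overlong suffix.
unanchored_hits = [t for t in tokens if unanchored.search(t)]
anchored_hits = [t for t in tokens if anchored.search(t)]

print(unanchored_hits)
print(anchored_hits)
```

Only the anchored pattern limits the part after the period to 2 or 3 uppercase letters; the bare AAPL token is still handled by your second pattern.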

So, you can use

{"TEXT": {"REGEX": r"^AAPL\.[A-Z]{2,3}$"}}

Also, note the use of a raw string literal so as to avoid double escaping backslashes.
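If your patterns live in a JSONL file rather than in Python code, the raw string trick does not apply, because JSON has its own escaping rules and the backslash still has to be doubled there. A small sketch, assuming a patterns line shaped like the ones in the question (the line itself is illustrative), shows that both spellings end up as the same regex:

```python
import json
import re

# One line as it might appear in a JSONL patterns file (illustrative content).
# JSON requires the backslash to be escaped, hence "\\." inside the file.
jsonl_line = r'{"label": "TICKER", "pattern": [{"TEXT": {"REGEX": "^AAPL\\.[A-Z]{2,3}$"}}]}'
from_json = json.loads(jsonl_line)["pattern"][0]["TEXT"]["REGEX"]

# The same pattern written directly in Python with a raw string literal.
from_raw = r"^AAPL\.[A-Z]{2,3}$"

# After JSON decoding, both are the identical regex string.
same_regex = from_json == from_raw
print(same_regex)

# Sanity check against token texts from the question.
accepts_short = re.search(from_json, "AAPL.NY") is not None
accepts_long = re.search(from_json, "AAPL.HSHSHSJSKKSKKS") is not None
print(accepts_short, accepts_long)
```

So use a single backslash in a raw string when building patterns in Python, and keep the double backslash when the pattern is stored as JSON.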
