9

Is there a way to write a rule-based system to catch things like start/end dates in a contract text? Here are a few real examples. The start and end dates in each one are the entities I want spaCy to detect automatically. If you have other ideas that don't involve spaCy, that is also OK!

  1. The initial term of this Lease shall be for a period of Five (5) years commencing on February 1, 2012, (the “Lease Commencement Date”) and expiring on January 31, 2017 (the “Initial Lease Term”).

  2. Term: One (1) year commencing January 1, 2007 ("Commencement Date") and ending December 31, 2007 ("Expiration Date").

  3. This Lease Agreement is entered into for term of 15 years, beginning January 1, 2014 and ending on December 31, 2028.

6
  • Dates can be super complicated. Can you be certain that you will only be looking for dates in the format MonthName dayNum, 4DigitYear? Commented Dec 15, 2019 at 13:31
  • No guarantee what format it will be in. Could be MONTH, DAY, YEAR, or MM/DD/YYYY for example. Commented Dec 15, 2019 at 14:11
  • That makes it more difficult. Could it also be DD/MM/YYYY or DD/MM/YY, or YYYY/MM/DD, or YY/MM/DD? This is why dates are complicated in programming. Commented Dec 15, 2019 at 14:14
  • 1
    But you still need to recognize it as a date. You can't do that without knowing all the formats that a date could be in. Commented Dec 15, 2019 at 14:38
  • 1
    I know spaCy can recognize it as a date. I just want to select, from those dates, the ones that are start/end dates. Commented Dec 16, 2019 at 10:58

2 Answers

6

I think you have to make a clear distinction between two types of methods:

1) Statistical models / Machine Learning, a.k.a. NER models. These take the context of the sentence into account when trying to figure out whether a specific token, or several consecutive tokens, form a date. spaCy has pre-built NER models you can download and try out on your specific data. You'll want to look for those entities (in doc.ents) that have ent.label_ == "DATE". Once you have those entities, you can run them through a date parser to work out what the actual date is. See also here for more information.

2) Rule-based entity recognition. Here, you have to define the rules yourself by specifying what you expect a date to look like, e.g. XX/XX/XXXX with X being a digit. As user1558604 pointed out though, you'll have to write several different rules if you want to recognize different representations of dates. You can find an overview of spaCy's rule-based matching methods here.
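To make the rule-based route in 2) concrete, here is a minimal sketch that uses only Python's standard library rather than spaCy. It assumes just two date shapes ("February 1, 2012" and MM/DD/YYYY) and a small set of cue words (commencing, beginning, expiring, ending) taken from the three example sentences in the question. The names `START_CUES`, `END_CUES`, `parse_date` and `extract_term` are made up for illustration, and real contracts will need more patterns than this.

```python
import re
from datetime import date, datetime

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Assumption: only two date shapes are covered, "February 1, 2012"
# and "02/01/2012" (read as MM/DD/YYYY, which is ambiguous in general).
DATE = rf"(?:(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}|\d{{1,2}}/\d{{1,2}}/\d{{4}})"

# Hypothetical cue words, taken from the example sentences in the question.
START_CUES = "commencing|beginning"
END_CUES = "expiring|ending"

# A cue word, an optional "on", then a date captured as group 1.
START_RE = re.compile(rf"\b(?:{START_CUES})\s+(?:on\s+)?({DATE})", re.IGNORECASE)
END_RE = re.compile(rf"\b(?:{END_CUES})\s+(?:on\s+)?({DATE})", re.IGNORECASE)


def parse_date(text):
    """Try each known format in turn; return a date, or None if none fits."""
    # %B relies on English month names, i.e. the default C/English locale.
    for fmt in ("%B %d, %Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def extract_term(text):
    """Return the first cue-introduced start date and end date found in text."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    return {
        "start": parse_date(start.group(1)) if start else None,
        "end": parse_date(end.group(1)) if end else None,
    }


if __name__ == "__main__":
    sample = ("This Lease Agreement is entered into for term of 15 years, "
              "beginning January 1, 2014 and ending on December 31, 2028.")
    print(extract_term(sample))
```

The same cue-word idea should carry over to spaCy: instead of the `DATE` regex, you would take the DATE entities from doc.ents and check the tokens just before each one for a start or end cue. A baseline like this can also help with labelling data for a later statistical model.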

5
  • 1
    Thanks! Right now, we have a set of rules that select the start and end dates from all the dates spaCy recognizes. We want to build a more sophisticated rule-based approach before going to machine learning, though. A few reasons for this: 1) we will establish a baseline accuracy/recall threshold against which we can compare future statistical models; 2) we will discover more about the problem and better understand its subtleties; 3) we can use the rule-based approach to help label data efficiently for future training. Maybe we should use the parsing tool? Love to hear your thoughts. Thanks! Commented Dec 16, 2019 at 10:45
  • Ok so if I understand you correctly, you are already using the NER models in spaCy, and you want the rules to look at the surrounding sentence and extract begin/end clues ?
    – Sofie VL
    Commented Dec 17, 2019 at 7:51
  • I think using the parser could definitely help you. Actually, spaCy has a currently experimental and undocumented DependencyMatcher that could be useful to you. See also stackoverflow.com/questions/57664264/… and github.com/explosion/spaCy/issues/4433
    – Sofie VL
    Commented Dec 17, 2019 at 7:56
  • Spacy works... I tried it myself. Will update this thread once my POC project is ready for future reference Commented Feb 19, 2023 at 14:30
  • github.com/anshumankmr/rasa-chat-bot Commented Feb 19, 2023 at 18:08
-1

You can use SUTime from CoreNLP to do it easily: https://github.com/FraBle/python-sutime

2
  • Because I do not know how to use that software. Is it even in Python, or just Java? Commented Dec 15, 2019 at 14:39
  • This library is a Python wrapper on top of the original Java implementation. You can use it from Python. If you follow the link in my answer, you will find the installation instructions and sample code for it.
    – anas17in
    Commented Dec 16, 2019 at 5:10
