The following question is about the spaCy NLP library for Python, but I would be surprised if the answer differed substantially for other libraries.
What is the maximum document size that spaCy can handle under reasonable memory conditions (e.g. a 4 GB VM in my case)? I had hoped to use spaCy to search for matches in book-sized documents (100K+ tokens), but I'm repeatedly getting crashes that point to memory exhaustion as the cause.
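For reference, this is roughly what I'm running (the model choice and the file name are just illustrative):

```python
import spacy

# Illustrative model; the larger English pipelines behave the same way for me
nlp = spacy.load("en_core_web_sm")

with open("book.txt", encoding="utf-8") as f:
    text = f.read()  # a full book, i.e. 100K+ tokens

# Memory usage climbs sharply here: the whole pipeline (tagger, parser, NER)
# runs over the entire text in a single call
doc = nlp(text)
```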
I'm an NLP noob: I know the concepts academically, but I don't really know what to expect from the state-of-the-art libraries in practice. So I don't know whether what I'm asking the library to do is ridiculously hard, or so easy that the problem must be something I've screwed up in my environment.
As for why I'm using an NLP library instead of something specifically oriented toward document search (e.g. Solr): I want to do lemma-based matching rather than string-based matching, as in the sketch below.
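For concreteness, this is the kind of lemma-based match I have in mind, sketched with spaCy's rule-based Matcher (v3 API; the pattern and the example sentence are made up):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # again, model choice is illustrative

# Match every inflected form of "run" by its lemma, not its surface string
matcher = Matcher(nlp.vocab)
matcher.add("RUN_FORMS", [[{"LEMMA": "run"}]])

doc = nlp("She runs every day; yesterday she ran ten miles.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints "runs" and "ran"
```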