
The following question is about the Spacy NLP library for Python, but I would be surprised if the answer for other libraries differed substantially.

What is the maximum document size that Spacy can handle under reasonable memory conditions (e.g. a 4 GB VM in my case)? I had hoped to use Spacy to search for matches in book-size documents (100K+ tokens), but I'm repeatedly getting crashes that point to memory exhaustion as the cause.

I'm an NLP noob - I know the concepts academically, but I don't really know what to expect from the state-of-the-art libraries in practice. So I don't know whether what I'm asking the library to do is ridiculously hard, or so easy that it must be something I've screwed up in my environment.

As far as why I'm using an NLP library instead of something specifically oriented toward document search (e.g. solr), I'm using it because I would like to do lemma-based matching, rather than string-based.

  • Can you provide a little more detail about your situation? Where in your code is it failing, when you parse the document or when you try to lemmatize? If you're using spaCy, you'll have to do a little additional work to do a full-text search. If full-text search on lemmas is your main use case, you might want to look into some extensions to Solr, like this: github.com/nicholasding/solr-lemmatizer
    Commented Jan 11, 2018 at 21:02
  • It happens when I parse the document. Yes, my main use case is search on lemmas, but I want to support a few other cases. My main question was how big a document, generally, I can get away with in spaCy, given 1-2 GB of working memory. This is plain English text.
    – J B NY
    Commented Jan 13, 2018 at 0:45

1 Answer

spaCy has a default max_length limit of 1,000,000 characters. I was able to parse a document with 450,000 words just fine. The limit can be raised, or you can split the text into n chunks depending on the total size.

The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

https://github.com/explosion/spaCy/blob/master/spacy/errors.py
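For concreteness, here is a minimal sketch of both approaches: raising nlp.max_length with the parser and NER disabled, and streaming fixed-size chunks through nlp.pipe(). The model name, chunk size, file name, and the lemma being matched are assumptions for illustration, not values from the answer; naive character slicing can also split a word across a chunk boundary, so splitting on paragraph or sentence boundaries is usually a better idea.

    import spacy

    # Sketch only: model name, chunk size, input file, and the example lemma
    # "run" are illustrative assumptions. Disabling the parser and NER keeps
    # memory use low for long texts.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    with open("book.txt", encoding="utf-8") as f:
        text = f.read()

    # Option 1: raise the limit (reasonably safe with parser/NER disabled).
    nlp.max_length = len(text) + 1

    # Option 2: split the text into chunks and stream them through the
    # pipeline. Character slicing can cut a word in half at a boundary,
    # so splitting on paragraphs or sentences is preferable in practice.
    def chunks(text, size=100_000):
        for start in range(0, len(text), size):
            yield text[start:start + size]

    for doc in nlp.pipe(chunks(text)):
        for token in doc:
            if token.lemma_ == "run":  # lemma-based match, not string-based
                print(token.text, token.idx)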
