The first question is whose law you are concerned with, since in principle you might have violated copyright law in any country, and might be sued under the laws of multiple countries. The US has a concept of "fair use" which is notoriously difficult to apply. If you are sued in the US, you can defend against the allegation by arguing fair use, which turns on four statutory factors: telegraphically, purpose and character of the use, nature of the work, amount and substantiality of the portion used in relation to the whole, and effect on the market. On top of these, courts weigh transformativeness, usually as part of the first factor. The court then weighs these considerations to decide if the use is "fair". By reading existing case law on the topic (conveniently available from the US Copyright Office) you might develop a fact-based opinion of the risk, but you would be vastly better off hiring an attorney who specializes in US copyright law to do an analysis for you. Do not hire a programmer to give you legal advice, just as you would not hire an attorney to debug code.
You would "fail" on the test of substantiality, in that you are copying a highly substantial portion of the original work(s). You would "win" on purpose and character of use (research, especially non-profit research, and commentary are the underlying purposes that drive fair use law). It is not clear how you would fare with respect to nature of the work, a factor intended to distinguish the extremes "news report" and "literary and artistic work", where copying news is at the fair use end of the spectrum. It is also not clear how you would fare on "effect on market", but probably not so badly: are you avoiding some licensing fee? Taking the transformativeness consideration into account as well, you are most likely having no effect on the market, since the product that you will distribute is not the original work, but a scientific conclusion about the work.
Germany has different laws, and this article would be relevant if you cared about Germany. There was a change in the law that expanded the analog of fair use pertaining to research use. That law allows 15 percent of a work to be reproduced, distributed and made available to the public for the purpose of non-commercial scientific research. That provision, by the way, does not cover what you are planning to do (unless you also publish quotes); for personal scientific research you may reproduce up to 75 percent. Since this is a new law, only a year old, you could end up at the cutting edge of testing its limits. So the standard disclaimer applies: ask your attorney. But note section 60d of the law, which legalized data mining and is squarely on point:
(1) In order to enable the automatic analysis of large numbers of
works (source material) for scientific research, it shall be
permissible
1. to reproduce the source material, including automatically and
systematically, in order to create, particularly by means of
normalisation, structuring and categorisation, a corpus which can be
analysed and
2. to make the corpus available to the public for a specifically
limited circle of persons for their joint scientific research, as well
as to individual third persons for the purpose of monitoring the
quality of scientific research.
In such cases, the user may only pursue non-commercial purposes.
(2) If database works are used pursuant to subsection (1), this shall
constitute customary use in accordance with section 55a, first
sentence. If insubstantial parts of databases are used pursuant to
subsection (1), this shall be deemed consistent with the normal
utilisation of the database and with the legitimate interests of the
producer of the database within the meaning of section 87b (1), second
sentence, and section 87e.
(3) Once the research work has been completed, the corpus and the
reproductions of the source material shall be deleted; they may no
longer be made available to the public. It shall, however, be
permissible to transmit the corpus and the reproductions of the source
material to the institutions referred to in sections 60e and 60f for
the purpose of long-term storage.