I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When I downloaded Tesseract onto Windows, the installer asked which languages I wanted, and I selected them; that was when I first learned about the math module. I am not sure how effective the math module was, but I could see that it had been downloaded when I checked the languages.
Now I am trying to install Tesseract on Debian.
To install Tesseract I used the command:
sudo apt install -y tesseract-ocr
Then, to ensure I had the math module, I would always follow that up with:
sudo apt install tesseract-ocr-equ
I am pretty sure that installed the math module. I remember using that command successfully several times, including earlier this morning. However, when I run that command now, I get the following messages:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package tesseract-ocr-equ
Just to make sure I wasn't crazy, I looked up the language codes used by Tesseract. According to Debian.org, "equ" belongs to the "Math / equation detection module", although admittedly that listing is for an earlier version. So, I tried the following command:
sudo apt-get install -y tesseract-ocr-[equ]
Among the several lines of output that I got in response were the following:
Note, selecting 'tesseract-ocr-uzb-cyrl' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ell' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eng' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-enm' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-epo' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-est' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eus' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-que' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uig' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ukr' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-urd' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uzb' for regex 'tesseract-ocr-[equ]'
tesseract-ocr-eng is already the newest version (1:4.1.0-2).
tesseract-ocr-eng set to manually installed.
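The "for regex" notes above suggest that apt-get treated the argument as a pattern rather than as one literal package name. Here is a minimal Python sketch of why that pattern selects so many packages, assuming the match is an unanchored regular-expression search over package names. The list below is a hand-picked sample (some names from the output above, some for contrast), not the real package index:

```python
import re

# Pattern as reported in the apt-get notes. "[equ]" is a character class:
# it matches ONE character (e, q, or u), not the literal string "equ".
pattern = re.compile(r"tesseract-ocr-[equ]")

# Hand-picked sample of package names (illustrative only).
packages = [
    "tesseract-ocr",
    "tesseract-ocr-eng",
    "tesseract-ocr-que",
    "tesseract-ocr-ukr",
    "tesseract-ocr-deu",
    "tesseract-ocr-fra",
]

# Keep every name where the pattern matches somewhere in the string.
matched = [name for name in packages if pattern.search(name)]
print(matched)
# ['tesseract-ocr-eng', 'tesseract-ocr-que', 'tesseract-ocr-ukr']
```

Under that assumption, any language package whose code starts with e, q, or u gets selected, which is consistent with the list above and says nothing about whether a package named tesseract-ocr-equ exists.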
So, this made me wonder whether there is a different math module for each language, and whether the math module is automatically downloaded with whichever language you download. I just really remember using the command initially without any problem. That being said, I have had several head injuries, so my memory is not entirely reliable. It's just that if I turn out to have been mistaken here and I have not been using that command as I remember, this will be one of those deeply troubling times due to how vividly I remember this working.
So, the primary question is: how do I download the "Math / equation detection module" for Tesseract onto the Linux Beta on my Chromebook? Secondarily, could someone tell me whether the behavior of the "sudo apt install tesseract-ocr-equ" command changed recently? This is frustrating me quite a bit. I am hoping that someone just changed it this morning and math modules are now built into the languages.
equ.traineddata file, which I believe means you don't need to add it in versions >3.02 (greater than 3.02)
equ as a language code, and not a module. Since there is no language code named equ, the search/install properly fails. You need to "train" the OCR using the documentation at the link I found, by passing the trainer program the .traineddata files.
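One way to check whether equ is already available to an installed Tesseract is to ask Tesseract itself which languages it can see. This is a rough Python sketch, assuming the tesseract executable is on the PATH and that "tesseract --list-langs" prints a header line starting with "List of" followed by one code per line. The helper names parse_langs and installed_langs are made up for this example:

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` text.

    Assumes a header line starting with "List of", then one code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Return the codes Tesseract reports, or [] if tesseract is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions print the list to stderr, so both streams are combined.
    return parse_langs(result.stdout + result.stderr)


langs = installed_langs()
print("equ available:", "equ" in langs)
```

If equ shows up in that list, the traineddata file is already in place and no separate package would be needed for it.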