
I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When I installed Tesseract on Windows, the installer asked which languages I wanted, and that was when I first learned about the math module. I am not sure how effective the math module was, but I could see that it had been installed when I checked the languages.

Now I am trying to install Tesseract on Debian.

To install Tesseract I used the command:

sudo apt install -y tesseract-ocr

Then, to ensure I had the math module, I would always follow that up with:

sudo apt install tesseract-ocr-equ

I am pretty sure that would install the math module. I remember using that command successfully several times, including earlier this morning. Now, however, when I run that command, I get the following messages:

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package tesseract-ocr-equ

Just to make sure I wasn't crazy, I looked up the language codes used by Tesseract on Debian.org, which lists "equ" as the "Math / equation detection module", although admittedly that listing is for an earlier version. So, I tried the following command:

sudo apt-get install -y tesseract-ocr-equ

Among the several lines of output that I got in response were the following:

Note, selecting 'tesseract-ocr-uzb-cyrl' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ell' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eng' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-enm' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-epo' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-est' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-eus' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-que' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uig' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-ukr' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-urd' for regex 'tesseract-ocr-[equ]'
Note, selecting 'tesseract-ocr-uzb' for regex 'tesseract-ocr-[equ]'
tesseract-ocr-eng is already the newest version (1:4.1.0-2).
tesseract-ocr-eng set to manually installed.
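
For what it's worth, the pattern shown in those notes can be read as a regular expression in which `[equ]` is a character class: it matches a single character that is e, q, or u, rather than the literal string "equ". The sketch below reproduces that behavior with Python's `re` module, assuming apt applies the pattern as an unanchored regex against package names (the `tesseract-ocr-deu` entry is not from the output above and is added only for contrast).

```python
import re

# Pattern copied from the apt-get notes above. [equ] is a character class,
# so it matches ONE character that is e, q, or u, not the string "equ".
pattern = re.compile(r"tesseract-ocr-[equ]")

candidates = [
    "tesseract-ocr-eng",       # from the output above
    "tesseract-ocr-que",       # from the output above
    "tesseract-ocr-uzb-cyrl",  # from the output above
    "tesseract-ocr-ell",       # from the output above
    "tesseract-ocr-deu",       # added for contrast: code starts with "d"
]

# search() is unanchored, which is the assumption made about apt here.
matched = [name for name in candidates if pattern.search(name)]
print(matched)
```

This would explain why every selected package has a language code beginning with e, q, or u, and it does not by itself say anything about whether a math module ships with each language.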

This made me wonder whether there is a different math module for each language, and whether the math module is automatically downloaded with the language you install. I just really remember using the command initially without any problem. That being said, I have had several head injuries, so my memory is not entirely reliable. If I turn out to have been mistaken here and I have not been using that command as I remember, this will be one of those deeply troubling times, given how vividly I remember this working.

So, the primary question is: how do I install the "Math / equation detection module" for Tesseract on the Linux Beta on my Chromebook? Secondarily, could someone tell me whether the behavior of the "sudo apt install tesseract-ocr-equ" command changed recently? This is frustrating me quite a bit. I am hoping that someone just changed the behavior this morning and the math module is now built into the language packages.
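
To check what is actually installed, something like the following could be used. It is only a minimal sketch: it assumes the `tesseract` binary is on the PATH and that `tesseract --list-langs` prints a header line ending in a colon followed by one language code per line. The helper names `parse_langs` and `installed_langs` are made up for illustration.

```python
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` style output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_langs() -> List[str]:
    """Run tesseract and return the language codes it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older versions print the list to stderr, so fall back to it.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    try:
        langs = installed_langs()
    except FileNotFoundError:
        print("tesseract was not found on PATH")
    else:
        print("equ installed:", "equ" in langs)
```

If "equ" shows up in that list, the module is already present and no separate package would be needed.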

  • It took me a long time to find this, but if you look, there is an equ.traineddata file, which I believe means you don't need to add it separately in versions later than 3.02.
    – eyoung100
    Commented May 16 at 17:13
  • @eyoung100 Thank you, that definitely helps. So, does this mean I have NOT been successfully loading it into my script? That's a bummer. Thanks for your help. Commented May 17 at 7:01
  • That is correct: the regex in the "several lines" you posted is treating equ as a language code, not a module. Since there is no language code named equ, the search/install properly fails. You need to "train" the OCR using the documentation at the link I found, by passing the trainer program the .traineddata files.
    – eyoung100
    Commented May 17 at 15:39
