
I have a few machines that run tesseract-ocr 4.0 for different applications. The machines have similar configurations (4 cores, 16 GB memory), and all of them run Ubuntu 16.04.5 LTS.

However, in the course of work, at least one of these applications has diverged and is running something that significantly improves tesseract's performance. For one particular page, tesseract takes 7-7.5 s on the other instances but only 3.5-4 s on this one.

Naturally, I want to isolate the reason for this and apply it to all the other instances.

Here is all I've found so far:

1. The storage is the same for all of them, so there are no SSD/magnetic HDD performance differences.
2. The CPU is the same: i5-7400, 3 GHz.
3. The OS version (16.04.5) and kernel version (Linux 4.15.0-47-generic) are the same.
4. These are the tesseract-ocr version and dependent library details:

tesseract 4.1.0-rc1-255-g332a1
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Short of comparing every package ever installed on every one of these systems, how do I find what is causing the improvement?

  • Any chance there's a difference in the other apps running in the background?
    – fixer1234
    Commented Apr 16, 2019 at 7:34
  • Adding to what fixer1234 said, could you run top/htop on each machine and see whether something extra is running that is causing the performance hit?
    – Randomhero
    Commented Apr 16, 2019 at 8:54

1 Answer


tesseract's performance is most affected by the font it is dealing with, the size of the text in the image, the image type (TIFF produces the most accurate results, JPEG the fastest processing), and the image quality.

To counteract competition from other software running on the system, run tesseract with 'nice'.
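One way to check whether scheduling priority matters at all is to time the same command repeatedly on each machine, with and without 'nice'. The following is only a rough sketch: `time_command` and `compare` are hypothetical helper names, `page.tif` stands in for whatever test page you use, and the niceness value is an arbitrary choice. Note that a negative niceness (higher priority) normally requires root, while a positive one lowers priority.

```python
import os
import statistics
import subprocess
import time


def time_command(cmd, runs=3, niceness=None):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    cmd = list(cmd)
    if niceness is not None:
        # Prefix the command with nice(1); negative values usually need root.
        cmd = ["nice", "-n", str(niceness)] + cmd
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def compare(image_path, niceness=-5, runs=3):
    """Print the load average and median OCR time with and without nice."""
    # "stdout" as the output base makes tesseract write the text to stdout.
    cmd = ["tesseract", image_path, "stdout"]
    print("load average (1, 5, 15 min):", os.getloadavg())
    plain = time_command(cmd, runs=runs)
    niced = time_command(cmd, runs=runs, niceness=niceness)
    print("median without nice: %.2f s" % statistics.median(plain))
    print("median with nice %d: %.2f s" % (niceness, statistics.median(niced)))
```

Calling `compare("page.tif")` on the fast machine and on a slow one gives comparable numbers. If the load averages differ noticeably, background processes are a likely suspect; if the timings barely change with 'nice', the cause is probably elsewhere.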
