Tesseract 3.03 has been released recently and I have just installed it. However, the English language data is not included with the download (from https://launchpad.net/ubuntu/+source/tesseract/3.03.03-1). The Tesseract website has a "Download" link, but it only offers "English language data for Tesseract 3.02". Where can I find the data for 3.03?

2 Answers

As mentioned by others, you can use the 3.02 English language pack with 3.03. The instructions are below:

  1. Download the English language data from here: 1

  2. Install the prerequisites

    `sudo apt-fast install -y libicu-dev libpango1.0-dev libcairo2-dev`

  3. Extract Tesseract's English data pack into the tessdata directory inside the tesseract-3.03 directory. This assumes both files (the English language data and the Tesseract source .tar.gz) are in the same folder and the source is already unpacked to tesseract-3.03

    tar zxvf tesseract-ocr-3.02.eng.tar.gz
    mv tesseract-ocr/tessdata/* tesseract-3.03/tessdata/

  4. Go back to Tesseract's directory and finish the installation

    cd tesseract-3.03
    ./autogen.sh
    ./configure
    make -j
    sudo make install LANGS="eng"
    sudo ldconfig
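
If Tesseract later reports that it cannot load the English data, a first thing to check is whether eng.traineddata was actually copied into the tessdata directory in step 3. A minimal sketch of such a check follows; the helper name `has_traineddata` is made up for illustration and is not part of Tesseract, and the example path assumes the directory layout used above.

```shell
#!/bin/sh
# has_traineddata DIR LANG
# Succeeds if DIR contains LANG.traineddata (for example eng.traineddata).
# Hypothetical helper for illustration; not shipped with Tesseract.
has_traineddata() {
    dir=$1
    lang=$2
    [ -f "$dir/$lang.traineddata" ]
}

# Example, assuming the layout from the steps above:
#   has_traineddata tesseract-3.03/tessdata eng && echo "eng data found"
```

The function only tests for the file's presence; it does not verify that the data is compatible with the installed Tesseract version.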

Now test your installation with the test image in the directory:

    tesseract phototest.tif ans -l eng
    cat ans.txt

Output:

This is a lot of 12 point text to test the ocr code and see if it works on all types of file format.

The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.
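
To check the result without reading it by eye, a script can look for a phrase from the expected output in ans.txt. The sketch below assumes the output file name used above; the helper name `ocr_matches` is hypothetical.

```shell
#!/bin/sh
# ocr_matches FILE PHRASE
# Succeeds if FILE contains PHRASE as a fixed string, ignoring case.
# Hypothetical helper for illustration.
ocr_matches() {
    grep -qiF -- "$2" "$1"
}

# Example, assuming the test run above wrote ans.txt:
#   ocr_matches ans.txt "quick brown dog" && echo "OCR output looks right"
```

A fixed-string, case-insensitive match is used so that small differences in capitalization do not cause a false failure.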

NOTE: Some lines have wrong formatting. Any advice on correcting them would be great.

1

You can use the language data from 3.02 on 3.03 RC.

Also please note that 3.03 has not yet been released officially. That is an RC build.
