Tesseract 3.03 has been released recently and I have just installed it. However, the English language data is not included in the download (from https://launchpad.net/ubuntu/+source/tesseract/3.03.03-1). The Tesseract website has a "Download" link, but it only offers "English language data for Tesseract 3.02". Where can I find the data for 3.03?
2 Answers
As others have mentioned, you can use 3.02's English language pack with 3.03. The instructions are below:
Download the 3.02 English language data (tesseract-ocr-3.02.eng.tar.gz) from here: 1
Install the prerequisites and unpack the archive:
`sudo apt-fast install -y libicu-dev libpango1.0-dev libcairo2-dev`
`tar xfv tesseract-ocr-3.02.eng.tar.gz`
Extract Tesseract's English data pack into the tessdata directory inside the tesseract-3.03 directory. This assumes both files (the English language data and the Tesseract source .tar.gz) are in the same folder:
tar zxvf tesseract-ocr-3.02.eng.tar.gz
mv tesseract-ocr/tessdata/* tesseract-3.03/tessdata/
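A slightly more defensive version of this step can be sketched as a small shell function. The name `copy_langdata` is hypothetical and not part of Tesseract; the sketch assumes the pack was extracted to `tesseract-ocr/tessdata` as above, and it uses `cp -R` instead of `mv` so the extracted pack stays intact if something goes wrong.

```shell
#!/bin/sh
# copy_langdata SRC DST (hypothetical helper, not a Tesseract tool)
# Copies everything from an extracted language pack directory into a
# tessdata directory, then confirms that eng.traineddata is present.
copy_langdata() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "missing source directory: $src" >&2
        return 1
    fi
    mkdir -p "$dst" || return 1
    # "SRC/." copies the contents of SRC rather than SRC itself.
    cp -R "$src"/. "$dst"/ || return 1
    if [ ! -f "$dst/eng.traineddata" ]; then
        echo "eng.traineddata not found in $dst" >&2
        return 1
    fi
    echo "ok: $dst/eng.traineddata"
}
```

With the directory layout assumed in this answer, it would be called as `copy_langdata tesseract-ocr/tessdata tesseract-3.03/tessdata`.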
Go back to Tesseract's directory and finish the installation:
cd tesseract-3.03
./autogen.sh
./configure
make -j
sudo make install LANGS="eng"
sudo ldconfig
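Before running the test, it can help to confirm that the installed binary actually sees the English data. The sketch below relies on `tesseract --list-langs`; the helper name `has_lang` is hypothetical, and stderr is merged into stdout on the assumption that some builds print the language list there.

```shell
#!/bin/sh
# has_lang LANG (hypothetical helper)
# Reads a language listing on stdin and succeeds if LANG appears
# as a complete line of its own.
has_lang() {
    grep -Fqx -- "$1"
}

# Only query tesseract if it is installed; otherwise just say so.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang eng; then
        echo "eng language data is available"
    else
        echo "eng language data was not found; check the tessdata directory"
    fi
else
    echo "tesseract is not on PATH"
fi
```

If the check reports that `eng` is missing, the likely cause is that eng.traineddata did not end up in the tessdata directory that was installed.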
Now test your installation with the test image in the directory
tesseract phototest.tif ans -l eng
cat ans.txt
Output:
This is a lot of 12 point text to test the ocr code and see if it works on all types of file format.
The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.
NOTE: Some lines have the wrong formatting. Any advice on correcting them would be appreciated.
You can use the language data from 3.02 on 3.03 RC.
Also, please note that 3.03 has not yet been officially released; that is an RC build.