Batch OCR for many PDF files (not already OCRed)?

Is there any way to batch OCR PDFs that haven't been already OCRed? Here is, I think, the current state of things for the two issues involved:

Batch OCR PDFs


  • Acrobat – This is the most straightforward OCR engine that will batch OCR. The only problems seem to be that 1) it won't skip files that have already been OCRed, and 2) if you throw a bunch of PDFs at it (some old), it tends to crash. It is a little buggy. It will warn you at each error it runs into (though you can tell the software not to notify you). But again, it dies horribly on certain types of PDFs, so your mileage may vary.

  • ABBYY FineReader (Batch/Scansnap), Omnipage – These have got to be some of the worst-programmed pieces of software known to man. If you can find out how to fully automate (no prompting) batch OCR of PDFs, saving with the same name, then please post here. The only solutions I could find failed somewhere: renaming, not fully automated, etc. At best, there is a way to do it, but the documentation and programming are so horrible that you'll never find out.

  • ABBYY FineReader Engine, ABBYY Recognition Server - These really are more enterprise solutions. If you are a simple end user, you would probably be better off getting Acrobat to run over a folder and weeding out the PDFs that give errors or crash the program than going through the hassle of installing evaluation software. They don't seem cost-competitive for the small user.

  • Autobahn DX workstation – The cost of this product is so prohibitive that you could probably buy 6 copies of Acrobat instead. Not really an end-user solution. If you're an enterprise setup, this may be worth it for you.


  • WatchOCR – no longer developed, and basically impossible to run on modern Ubuntu distros
  • pdfsandwich – no longer developed, basically impossible to run on modern Ubuntu distros
  • ABBYY Linux OCR – this should be scriptable, and seems to have some good results:


However, like a lot of these other ABBYY products, they charge by the page. Again, you might be better off trying to get Acrobat batch OCR to work.

  • Ocrad, GOCR, OCRopus, tesseract – these may work, but there are a few problems:
  1. OCR results are not as good as, say, Acrobat for some of these (see above link).
  2. None of the programs takes in a PDF file and outputs a PDF file. You have to write a script that breaks the PDF apart into pages, runs the program over each page, and then reassembles the result as a PDF.
  3. Once you do, you may find, like I did, that tesseract creates an OCR layer that is shifted over. So if you search for the word 'the', the highlight lands on part of the word next to it.
  • Batch DjVu → Convert to PDF – haven't looked into it, but seems like a horrible round-a-bout solution.


  • PDFcubed.com – come on, not really a batch solution.
  • ABBYY Cloud OCR - not sure if this is really a batch solution, either way, you have to pay by the page and this could get quite pricey.
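
For the split/OCR/reassemble problem described for the free tools above, here is a rough sketch of what such a script could look like. It is untested against real scans and makes several assumptions: the poppler tools `pdftoppm` and `pdfunite` and a `tesseract` build with PDF output are on the PATH, the caller already knows the page count, and the function names `plan_page_commands` and `ocr_pdf` are made up for illustration.

```python
import os
import subprocess


def plan_page_commands(pdf_path, work_dir, page_count, dpi=300):
    """Build the command lines needed to OCR a PDF page by page.

    Returns (commands, page_pdfs): the argv lists to run in order, and the
    single-page searchable PDFs that tesseract is expected to write.
    """
    commands = []
    page_pdfs = []
    for page in range(1, page_count + 1):
        stem = os.path.join(work_dir, "page_%04d" % page)
        # Render exactly one page to <stem>.png
        commands.append(["pdftoppm", "-r", str(dpi), "-png",
                         "-f", str(page), "-l", str(page),
                         "-singlefile", pdf_path, stem])
        # OCR the image; the trailing "pdf" config makes tesseract write <stem>.pdf
        commands.append(["tesseract", stem + ".png", stem, "pdf"])
        page_pdfs.append(stem + ".pdf")
    return commands, page_pdfs


def ocr_pdf(pdf_path, out_path, work_dir, page_count):
    """Run the planned commands, then stitch the pages back together."""
    commands, page_pdfs = plan_page_commands(pdf_path, work_dir, page_count)
    for argv in commands:
        subprocess.check_call(argv)
    subprocess.check_call(["pdfunite"] + page_pdfs + [out_path])
```

Keeping the command planning separate from the execution makes it easy to print the commands and try them by hand on one problem file before running a whole batch.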

Cloud (update 2023--can't believe people are still looking at this)

Nowadays look into AWS, Azure, GCP to scale up your OCR, but more than likely there will be a managed solution available to you.

Identifying non-OCRed PDFs

This is a slightly easier problem, one that can be solved easily on Linux and much less easily on Windows. I was able to write a Perl script that uses pdffonts to check whether a PDF lists any fonts (a scanned, image-only PDF normally has none), and so to determine which files are not OCRed.
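
The same font check can be sketched in Python. This is only a heuristic and rests on assumptions: `pdffonts` (from poppler/xpdf) is on the PATH, and its output is a header line, a line of dashes, then one row per font. The names `has_font_rows` and `looks_unocred` are invented for this example.

```python
import subprocess


def has_font_rows(pdffonts_output):
    """Return True if pdffonts listed at least one font.

    pdffonts prints a header line and a line of dashes, then one row per
    font; a scanned, image-only PDF normally yields no rows.
    """
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return len(lines) > 2


def looks_unocred(pdf_path):
    """Heuristic: a PDF with no fonts at all probably has no text layer."""
    output = subprocess.check_output(["pdffonts", pdf_path])
    return not has_font_rows(output.decode("utf-8", "replace"))
```

Note that a PDF with a single page of real text and hundreds of scanned pages would still count as "OCRed" under this test.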

Current "solutions"

  1. Use a script to identify non-OCRed pdfs (so you don't rerun over thousands of OCRed PDFs) and copy these to a temporary directory (retaining the correct directory tree) and then use Acrobat on Windows to run over these hoping that the smaller batches won't crash.

  2. Use the same script, but get one of the Linux OCR tools to work properly, risking OCR quality.
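
The "copy to a temporary directory, retaining the directory tree" step that both options rely on could look roughly like this in Python. It is a sketch only; `mirrored_path` and `copy_keeping_tree` are hypothetical helper names, and the originals are never modified.

```python
import os
import shutil


def mirrored_path(src_root, dst_root, src_file):
    """Map a file under src_root to the same relative location under dst_root."""
    rel = os.path.relpath(src_file, src_root)
    return os.path.join(dst_root, rel)


def copy_keeping_tree(src_root, dst_root, src_file):
    """Copy src_file into dst_root, creating intermediate directories."""
    target = mirrored_path(src_root, dst_root, src_file)
    target_dir = os.path.dirname(target)
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copy2(src_file, target)
    return target
```

Because the relative path is preserved, the OCRed results can later be copied back over the source tree with the same mapping in reverse.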

I think I'm going to try #1. I'm just too worried about the results of the Linux OCR tools (I don't suppose anyone has done a comparison), and breaking the files apart and stitching them together again seems to be unnecessary coding if Adobe can actually batch OCR a directory without choking.

If you want a completely free solution, you'll have to use a script to identify the non-OCRed PDFs (or just rerun over OCRed ones), and then use one of the Linux tools to try to OCR them. Tesseract seems to have the best results, but again, some of these tools are not well supported in modern versions of Ubuntu. If you can set it up and fix the problem I had, where the image layer does not match the text layer (with tesseract), then you would have a pretty workable solution, and once again Linux > Windows.

Do you have a working solution to fully automate, batch OCR PDFs, skipping already OCRed files keeping the same name, with high quality? If so, I would really appreciate the input.

Perl script to copy non-OCRed files to a temp directory. I can't guarantee this works, and it probably needs to be rewritten, but if someone makes it work (assuming it doesn't) or work better, let me know and I'll post a better version here.


# copy non-OCRed files to a directory
# change the variables below: you need a base dir (like /home/joe/), a source directory and an output
# directory (e.g. books and tempdir)
# move all your pdfs to the source directory

use warnings;
use strict;

# need to install these modules with CPAN or your distros installer (e.g. apt-get)
use CAM::PDF;
use File::Find;
use File::Basename;
use File::Copy;

#use PDF::OCR2;
#$PDF::OCR2::CHECK_PDF   = 1;

my $basedir = '/your/base/directory';
my $sourcedirectory  = $basedir.'/books/';
my @exts       = qw(.pdf);
my $count      = 0;
my $outputroot = $basedir.'/tempdir/';
open( WRITE, '>>', $basedir.'/errors.txt' ) or die "Cannot open error log: $!";

#check file
#my $pdf = PDF::OCR2->new($basedir.'/tempfile.pdf');
#print $pdf->page(10)->text;

# walk the source tree and call process_file for every entry
find(
    {
        wanted => \&process_file,

        #       no_chdir => 1
    },
    $sourcedirectory
);

sub process_file {
    #must be a file
    if ( -f $_ ) {
        my $file = $_;
        #must be a pdf
        my ( $dir, $name, $ext ) = fileparse( $_, @exts );
        if ( $ext eq '.pdf' ) {
            #check if pdf is ocred
            my $command = "pdffonts \'$file\'";
            my $output  = `$command`;
            if ( !( $output =~ /yes/ || $output =~ /no/ ) ) {
                #print "$file - Not OCRed\n";
                my $currentdir = $File::Find::dir;
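                # \Q...\E keeps special characters in the path from being read as regex syntax
                if ( $currentdir =~ /\Q$sourcedirectory\E(.+)/ ) {
                    my $subdir = $1;
                    #if directory doesn't exist, create
                    unless ( -d $outputroot.$subdir ) {
                        # list form of system() copes with spaces in directory names
                        system( "mkdir", "-p", $outputroot.$subdir );
                    }
                    #copy over file
                    my $fromfile = "$currentdir/$file";
                    my $tofile = "$outputroot$subdir/$file";
                    print "copy from: $fromfile\n";
                    print "copy to: $tofile\n";
                    copy($fromfile, $tofile) or die "Copy failed: $!";
#                       `touch $outputroot$subdir/\'$file\'`;
                }
            }
        }
    }
}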


  Could you please share your Windows "script to identify non-OCRed pdfs (...) and copy these to a temporary directory (retaining the correct directory tree)"?
    – Erb
    Commented Dec 12, 2012 at 9:41
  ok it's up. I warn you that it might not run correctly the first time. This won't damage your pdfs at all (it just copies, it doesn't touch the originals), but you might have to modify the script. If you know perl it would be a breeze; if not, let me know, or you might be able to debug it yourself and make the minor edits necessary.
    – Joe
    Commented Dec 12, 2012 at 18:17
  Many thanks. I will try to make it work (even if I am new with perl). Thanks.
    – Erb
    Commented Dec 13, 2012 at 9:50
  Maybe another idea in Windows (worked in XP)? I have used this in the past to "remove from a folder (with subfolders) all pdf files which have no passwords". The idea was to keep all pdf files that are password protected. Copy all PDFs (with related subfolders) into a new folder ("C:\5\") with the Syncback freeware. Add pdftotext.exe and a text file renamed to del_pdf_no_password.bat. Its content: "FOR /R C:\5\ %%x IN (*.PDF) DO (pdftotext %%x NUL && DEL %%x)", where "C:\5\" is the folder to change. Then start pdftotext.exe and only then the .bat file.
    – Erb
    Commented Dec 13, 2012 at 10:09
  More details: you'll need to remove spaces (and special characters like ",") from all folder names with a freeware renamer (for example: alternativeto.net/software/renamer). Otherwise it won't work for all subfolders! PS: I did not write this script (I was helped by someone in ...2004!)
    – Erb
    Commented Dec 13, 2012 at 10:19

I too have looked for a way to batch-OCR many PDFs in an automated manner, without much luck. In the end I have come up with a workable solution similar to yours, using Acrobat with a script as follows:

  1. Copy all relevant PDFs to a specific directory.

  2. Remove PDFs already containing text (assuming they are already OCRd or already text - not ideal I know, but good enough for now).

  3. Use AutoHotKey to automatically run Acrobat, select the specific directory, and OCR all documents, appending "-ocr" to their filename.

  4. Move the OCRd PDFs back to their original location, using the presence of a "-ocr.pdf" file to determine whether it was successful.

It is a bit Heath Robinson, but actually works pretty well.
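
Step 4 could be sketched like this in Python. This is not the script I actually use, just an illustration under assumptions: the working directory is flat, Acrobat writes "x-ocr.pdf" next to "x.pdf", and `origins` is a hypothetical mapping (recorded in step 1) from each input file name to the folder it came from.

```python
import os
import shutil


def split_by_result(work_dir):
    """Sort the PDFs in work_dir into (succeeded, failed) by output presence.

    An input "x.pdf" counts as succeeded when "x-ocr.pdf" sits next to it.
    """
    names = set(os.listdir(work_dir))
    succeeded, failed = [], []
    for name in sorted(names):
        lower = name.lower()
        if not lower.endswith(".pdf") or lower.endswith("-ocr.pdf"):
            continue
        ocr_name = name[:-4] + "-ocr.pdf"
        (succeeded if ocr_name in names else failed).append(name)
    return succeeded, failed


def move_back(work_dir, origins):
    """Move each finished "-ocr.pdf" to the folder its input came from.

    Returns the inputs that have no "-ocr.pdf" yet, for a retry or a
    manual look.
    """
    succeeded, failed = split_by_result(work_dir)
    for name in succeeded:
        ocr_name = name[:-4] + "-ocr.pdf"
        shutil.move(os.path.join(work_dir, ocr_name),
                    os.path.join(origins[name], ocr_name))
    return failed
```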

  Why do you need to use AutoHotKey if Acrobat will already batch OCR a directory? If you're worried about repeating the process if Acrobat crashes, the file-modified timestamp will tell you where you left off. If you want to keep the originals, you can just copy the directory. If you just want the -ocr at the end, you can do a batch name change after you're done.
    – Joe
    Commented Oct 15, 2012 at 15:05
    By any chance, could you share how you do points 2 and 3 in Windows? Thanks in advance ;)
    – Erb
    Commented Dec 12, 2012 at 9:34

I believe you need to realize that ABBYY FineReader is an end-user solution designed to provide fast and accurate out-of-the-box OCR.

Based on my experience, OCR projects differ significantly in their details each time, and there's no way to create an out-of-the-box solution for each unique case. But I can suggest more professional tools that can do the job for you:

I was a part of the front-end development team for the cloud service specified above and can provide more info on it if necessary.

As for detecting a text layer in a PDF, I can't give any advice on that, because this task is a bit aside from OCR, which is my specialty, so I find your approach of using an external script very reasonable. Maybe you'll find this discussion helpful: http://forum.ocrsdk.com/questions/108/check-if-pdf-is-scanned-image-or-contains-text

    Well at least we know that ABBYY lacks the documentation or functionality (that is found in Acrobat) to easily batch OCR a folder of pdf. Simple batch OCR of a folder of non-OCRed docs is an extremely desired feature (much more than some of ABBYY's other features). Just google to find out how overwhelmingly common this desire is, if not, I can provide cites. Thanks for the other options, I will look into them, but for now let anyone who comes here in search of how to complete this VERY common task (cites available) know that we heard it from the horse's mouth that ABBYY cannot do this.
    – Joe
    Commented May 20, 2012 at 19:57
  Batch OCR is available in ABBYY FineReader Professional. In your question you state a need to fully automate OCR. Now you need just a batch processing. Please be clear in what exactly you need.
    – Nikolay
    Commented May 21, 2012 at 7:47
  Read above. I said 'EASILY batch OCR', 'SIMPLE batch ocr of a folder'. Further up: "If you can find out how to fully automate (no prompting) batch OCR..". It's pretty obvious what I want. So let's be clear to anyone who visits this page: * If you want to 'batch process' a folder of pdfs using a horrible, complicated interface with horrible save options in a heavily user-intensive process, ABBYY may work for you. * If you want to 'EASILY batch OCR', 'simple batch ocr' with little user interaction like thousands of others, like Acrobat already does, ABBYY FineReader is not for you.
    – Joe
    Commented May 31, 2012 at 19:11

I had some success in early 2015 doing fully hands-off batch OCR using Nuance OmniPage Ultimate on Windows. Not free; list price $500. Use the batch program "DocuDirect" that is included. It has an option "Run job without any prompts", which seems to be the direct answer to your original question.

I used DocuDirect to output one searchable PDF file for each input image (i.e., non-searchable) PDF file; it can be told to replicate the input directory tree in the output folder as well as the original input file names (almost - see below). Uses multiple cores too. The accuracy was the best of the packages I evaluated. Password-protected documents are skipped (without stopping the job, without showing a dialog).

Caveat 1: Almost the original file names - suffix ".PDF" becomes ".pdf" (i.e., from upper to lower case) because hey, it's all the same on windows. (Ugh.)

Caveat 2: No log file so diagnosing which files fail during recognition -- which they definitely do -- is back on you. DocuDirect will happily produce garbled outputs like entire pages simply missing. I wrote a python script using the PyPDF2 module to implement a crude validation: testing that the output page count matched input page count. See below.

Caveat 3: A fuzzy, indistinct input image file will cause OmniPage to hang forever, not using any CPU; it just never recovers. This really derails batch processing and I did not find any workarounds. I also reported this to Nuance, but got nowhere.

@Joe is right about the software being poorly programmed and documented. I note the core of OmniPage has amazing character-recognition technology, but the outer shell (GUI & batch processing) is enough to make you pull your hair out.

I endorse @Joe's and @Kiwi's suggestion to screen out files using scripts, so as to present the OCR package only with unprotected image documents.

My only affiliation with Nuance is as a not-exactly-satisfied customer - I have a batch of unresolved support tickets to prove it :)

@Joe: Late answer, but maybe still relevant. @SuperUser community: I hope you feel this is on topic.

** Update ** The successor package is Nuance PowerPDF Advanced, list price just $150. I had even better success with this; it is just as accurate but far more stable.

Pre/post-OCR tree validation python script follows.

"""
Script to validate OCR outputs against inputs.
Both input and output are PDF documents in a directory tree.
For each input document, checks for the corresponding output
document and its page count.

Requires PyPDF2 from https://pypi.python.org/pypi/PyPDF2
(a release before 3.0; later releases removed PdfFileReader)
"""

from __future__ import print_function
from PyPDF2 import PdfFileReader
import getopt
import os
import stat
import sys

def get_pdf_page_count(filename):
    """
    Gets number of pages in the named PDF file.
    Fails on an encrypted or invalid file, returns None.
    """
    with open(filename, "rb") as pdf_file:
        page_count = None
        err = None
        try:
            # slurp the file
            pdf_obj = PdfFileReader(pdf_file)
            # extract properties
            page_count = pdf_obj.getNumPages()
            err = ""
        except Exception:
            # Invalid PDF.
            # Limit exception so we don't catch KeyboardInterrupt etc.
            err = str(sys.exc_info())
            # This should be rare
            print("Warning: failed on file %s: %s" % (filename, err), file=sys.stderr)
            return None

    return page_count

def validate_pdf_pair(verbose, img_file, txt_file):
    """
    Checks for existence and size of target PDF file;
    number of pages should match source PDF file.
    Returns True on match, False on mismatch, None if either file is unreadable.
    """
    #if verbose: 
    #    print("Image PDF is %s" % img_file)
    #    print("Text PDF is %s" % txt_file)

    # Get source and target page counts
    img_pages = get_pdf_page_count(img_file)
    txt_pages = get_pdf_page_count(txt_file)
    if img_pages is None:
        # Bogus PDF, skip.
        print("Warning: failed to get page count for %s" % img_file, file=sys.stderr)
        return None
    if txt_pages is None:
        # Bogus PDF, skip.
        print("Warning: failed to get page count for %s" % txt_file, file=sys.stderr)
        return None

    retval = True
    if img_pages != txt_pages:
        retval = False
        print("Mismatch page count: %d in source %s, %d in target %s" % (img_pages, img_file, txt_pages, txt_file), file=sys.stderr)

    return retval

def validate_ocr_output(verbose, process_count, total_count, img_dir, txt_dir):
    """
    Walks a tree of files to compare against output tree, calling self recursively.
    Returns a tuple with PDF file counts (matched, non-matched).
    """
    # Iterate over this directory
    match = 0
    nonmatch = 0
    for dirent in os.listdir(img_dir):
        src_path = os.path.join(img_dir, dirent)
        tgt_path = os.path.join(txt_dir, dirent)
        if os.path.isdir(src_path):
            if verbose: print("Found source dir %s" % src_path)
            # check target
            if os.path.isdir(tgt_path):
                # Ok to process
                (sub_match, sub_nonmatch) = validate_ocr_output(verbose, process_count + match + nonmatch, total_count, 
                                         src_path, tgt_path)
                match += sub_match
                nonmatch += sub_nonmatch
            else:
                # Target is missing!?
                print("Fatal: target dir not found: %s" % tgt_path, file=sys.stderr)

        elif os.path.isfile(src_path):
            # it's a plain file
            if src_path.lower().endswith(".pdf"):
                # check target
                # HACK: OmniPage changes upper-case PDF suffix to pdf;
                # of course not visible in Windohs with the case-insensitive 
                # file system, but it's a problem on linux.
                if not os.path.isfile(tgt_path):
                    # Flip lower to upper and VV
                    if tgt_path.endswith(".PDF"):
                        # use a slice
                        tgt_path = tgt_path[:-4] + ".pdf"
                    elif tgt_path.endswith(".pdf"):
                        tgt_path = tgt_path[:-4] + ".PDF"
                # hopefully it will be found now!
                if os.path.isfile(tgt_path):
                    # Ok to process
                    sub_match = validate_pdf_pair(verbose, src_path, tgt_path)
                    if sub_match:
                        match += 1
                    else:
                        # mismatched page count or unreadable PDF
                        nonmatch += 1
                    if verbose: print("File %d vs %d matches: %s" % (process_count + match + nonmatch, total_count, sub_match))

                else:
                    # Target is missing!?
                    print("Fatal: target file not found: %s" % tgt_path, file=sys.stderr)
                    nonmatch += 1

        else:
            # This should never happen
            print("Warning: not a directory nor file: %s" % src_path, file=sys.stderr)
    return (match, nonmatch)

def count_pdfs_listdir(verbose, src_dir):
    """
    Counts PDF files in a tree using os.listdir, os.stat and recursion.
    Not nearly as elegant as os.walk, but hopefully very fast on
    large trees; I don't need the whole list in memory.
    """
    count = 0
    for dirent in os.listdir(src_dir):
        src_path = os.path.join(src_dir, dirent)
        # stat the entry just once
        mode = os.stat(src_path)[stat.ST_MODE]
        if stat.S_ISDIR(mode):
            # It's a directory, recurse into it
            count += count_pdfs_listdir(verbose, src_path)
        elif stat.S_ISREG(mode):
            # It's a file, count it
            if src_path.lower().endswith('.pdf'):
                count += 1
        else:
            # Unknown entry, print an error
            print("Warning: not a directory nor file: %s" % src_path, file=sys.stderr)
    return count

def main(args):
    """
    Parses command-line arguments and processes the named dirs.
    """
    try:
        opts, args = getopt.getopt(args, "vi:o:")
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    # default values
    verbose = False
    in_dir = None
    out_dir = None
    for opt, optarg in opts:
        if opt in ("-i"):
            in_dir = optarg
        elif opt in ("-o"):
            out_dir = optarg
        elif opt in ("-v"):
            verbose = True
    # validate args
    if in_dir is None or out_dir is None:
        usage()
        sys.exit(2)
    if not os.path.isdir(in_dir):
        print("Not found or not a directory: %s" % in_dir, file=sys.stderr)
        sys.exit(2)
    if not os.path.isdir(out_dir):
        print("Not found or not a directory: %s" % out_dir, file=sys.stderr)
        sys.exit(2)
    if verbose: 
        print("Validating input %s -> output %s" % (in_dir, out_dir))
    # get to work
    print("Counting files in %s" % in_dir)
    count = count_pdfs_listdir(verbose, in_dir)
    print("PDF input file count is %d" % count)
    (match,nomatch) = validate_ocr_output(verbose=verbose, process_count=0, total_count=count, img_dir=in_dir, txt_dir=out_dir) 
    print("Results are: %d matches, %d mismatches" % (match, nomatch))

def usage():
    print('Usage: validate_ocr_output.py [options] -i input-dir -o output-dir')
    print('    Compares pre-OCR and post-OCR directory trees')
    print('    Options: -v = be verbose')

# Pass all params after program name to our main
if __name__ == "__main__":
    main(sys.argv[1:])

  I have just seen your update. I will try it. I hope it does the OCR silently and without crashing! (Wow! 1GB download file!)
    – Erb
    Commented Jan 7, 2017 at 10:11

On linux

The best and easiest way out there is to use pypdfocr; it doesn't change the original PDF.

pypdfocr your_document.pdf

At the end you will have another your_document_ocr.pdf the way you want it with searchable text. The app doesn't change the quality of the image. Increases the size of the file a bit by adding the overlay text.

To batch the pdfs

ls ./p*.pdf | xargs -L1 -I {}  pypdfocr {}

If the PDFs are in sub-folders:

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {}  pypdfocr {}

Update 3rd november 2018:

pypdfocr has not been supported since 2016, and I noticed some problems due to it not being maintained. ocrmypdf (module) does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

apt install ocrmypdf

so the command would become

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {}  ocrmypdf {} {}_ocr.pdf 
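
If you would rather skip files that were already processed on an earlier run, a small Python wrapper might help. This is a rough sketch, not something battle-tested: it assumes `ocrmypdf` is on the PATH and that your installed version accepts `--skip-text`; the names `find_jobs` and `ocr_tree` and the "_ocr" suffix convention are my own choices for the example.

```python
import os
import subprocess


def find_jobs(root, suffix="_ocr"):
    """Yield (input, output) pairs for every PDF under root.

    Files that already carry the suffix are treated as earlier outputs,
    and inputs whose output already exists are skipped, so reruns are cheap.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            stem, ext = os.path.splitext(name)
            if ext.lower() != ".pdf" or stem.endswith(suffix):
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, stem + suffix + ".pdf")
            if not os.path.exists(dst):
                yield src, dst


def ocr_tree(root):
    """Run ocrmypdf on each pending file; return the inputs that failed."""
    failed = []
    for src, dst in find_jobs(root):
        # --skip-text leaves pages that already contain text untouched
        if subprocess.call(["ocrmypdf", "--skip-text", src, dst]) != 0:
            failed.append(src)
    return failed
```

Passing the file names as list items (no shell) also sidesteps the quoting problems that spaces in file names cause in the xargs pipelines.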

On Mac or Linux:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

From here.


You could consider Aquaforest's Autobahn DX : http://www.aquaforest.com/en/autobahn.asp

It is designed to process batches of PDFs and has a variety of options (e.g. skip or pass through OCRed files), as well as options for smart treatment of PDFs, which may offer a better result (e.g. if a PDF has some image pages and some text pages, it can OCR just the image pages).

  If you're affiliated with that product, please explicitly say so by editing your question.
    – slhck
    Commented May 17, 2012 at 15:14

