Assuming the djvu files contain OCR-ed text, a fast way on Linux to get that out is to use Popen to run djvutxt and grab the output.
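A minimal sketch of that approach, assuming djvutxt (from DjVuLibre) is on your PATH and that its output is UTF-8. The function name extract_text and the command parameter are my own names for illustration, not part of any library:

```python
import subprocess


def extract_text(djvu_path, command=("djvutxt",)):
    """Run djvutxt on djvu_path and return the text it prints to stdout."""
    # With no output file argument, djvutxt writes the text to stdout.
    proc = subprocess.Popen(
        list(command) + [djvu_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Surface whatever the tool reported on stderr.
        raise RuntimeError(err.decode("utf-8", "replace"))
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return out.decode("utf-8", "replace")
```

Calling extract_text("book.djvu") should then return the hidden text layer of the whole document as one string, or raise RuntimeError if djvutxt exits with an error.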
The text in a .djvu file is compressed with a DjVu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. It is a C++ implementation based on some framework.
Shameless self-promotion: I contributed to Calibre the conversion from OCR-ed .djvu, which uses djvutxt in this way. However, it falls back to my pure Python decoder implementation (sloooow) if djvutxt is not available. So you could use that code if you cannot use djvutxt.
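The fallback idea can be sketched like this. It is not Calibre's actual code: choose_extractor is a hypothetical helper, and the two callables you pass in stand for "run djvutxt" and "use the pure Python decoder":

```python
import shutil


def choose_extractor(external, fallback, tool="djvutxt"):
    """Return `external` if `tool` can be found, otherwise `fallback`."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(tool) is not None:
        return external
    return fallback
```

You would pick the extractor once at startup and then call whichever function came back, so the slow path is only taken when the fast tool is really missing.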
I have not yet put out the Python source separately from Calibre, but you can find it after downloading and extracting Calibre's source:
curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu
The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.