13

I have a collection of ebooks in DjVu, PDF and CHM formats, and I am looking for a way to search their content for keywords. I have been researching this and found a couple of suggestions for parsing PDF content, but there seems to be no way to convert the content of a DjVu file into text. Does anyone know a way to decode DjVu content into text so that I can search it easily?

Thanks

3 Answers

9

Assuming the djvu files contain OCR-ed text, a fast way to get it out on Linux is to run djvutxt with Popen and capture its output.
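A minimal sketch of that approach, assuming djvutxt (from DjVuLibre) is on your PATH and that the file really has a text layer. The helper names djvu_to_text and find_keyword are made up for illustration, and I use subprocess.run, which wraps Popen, rather than calling Popen directly:

```python
import shutil
import subprocess


def djvu_to_text(path, djvutxt="djvutxt"):
    """Return the text layer of a DjVu file, or None if djvutxt is not installed.

    Called with only an input file, djvutxt writes the text to stdout.
    """
    if shutil.which(djvutxt) is None:
        return None
    result = subprocess.run([djvutxt, path], capture_output=True, check=True)
    # Decode defensively: OCR output is not guaranteed to be clean UTF-8.
    return result.stdout.decode("utf-8", errors="replace")


def find_keyword(text, keyword):
    """Return (line_number, line) pairs containing keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if needle in line.lower()
    ]
```

You would call djvu_to_text("book.djvu") and pass the result to find_keyword. An empty result can simply mean the file has no text layer, so it is worth checking for that before concluding the keyword is absent.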

The text in a .djvu file is compressed with a djvu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. It is a C++ implementation based on some framework.

Shameless self-promotion: I contributed the conversion from OCR-ed .djvu to Calibre, which uses djvutxt in this way. If djvutxt is not available, it falls back to my pure Python decoder implementation (sloooow). So you could use that code if you cannot use djvutxt.

I have not yet released the Python source separately from Calibre. But you can find it after downloading and extracting Calibre's source:

curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.

4

python-djvulibre is a set of Python bindings to the djvulibre open source implementation of djvu -- I haven't tried it, but it looks like it should meet your needs.

1

Certainly the DjVuLibre SDK will allow access to the text layer -- if it exists (not all DjVu files have a text layer; many are purely raster images).

An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use for this.

http://dev.caminova.jp/beta/djvu-wic/
