How to extract text from djvu and other ebook formats (possibly in Python) [closed]

Question

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 9 years ago.

I have a collection of ebooks in djvu, pdf and chm formats, and I am looking for a way to search their content for keywords. I have been researching this and found a couple of suggestions for parsing pdf content, but there seems to be no way to convert the content of a djvu file into text. Does anyone know a way to decode djvu content into text so that I can search it easily?

Thanks

Anthon · Accepted Answer · 2013-03-12 18:28:45Z

Assuming the djvu files contain OCR-ed text, a fast way on Linux to get it out is to run djvutxt with Popen and capture the output.
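A minimal sketch of that approach, assuming the djvutxt command-line tool (part of DjVuLibre) is installed and on the PATH. The function names `djvu_to_text` and `contains_keyword` are illustrative and do not come from any library. The sketch also assumes djvutxt writes UTF-8 text to standard output when no output file is given.

```python
import shutil
import subprocess


def djvu_to_text(path, executable="djvutxt"):
    """Return the text layer of a DjVu file, or None if djvutxt is not installed.

    `executable` is a hypothetical parameter so a different binary name
    or full path can be supplied.
    """
    if shutil.which(executable) is None:
        return None  # tool not available; the caller needs a fallback

    # Run djvutxt and capture what it prints to stdout.
    proc = subprocess.Popen(
        [executable, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(
            "djvutxt failed: " + err.decode("utf-8", errors="replace")
        )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return out.decode("utf-8", errors="replace")


def contains_keyword(text, keyword):
    """Case-insensitive substring search over the extracted text."""
    return keyword.lower() in text.lower()
```

Usage would look like `text = djvu_to_text("book.djvu")` followed by `contains_keyword(text, "wavelet")`. A file with no text layer will most likely just yield empty output, so an empty result does not necessarily mean something went wrong.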

The text in a .djvu file is compressed with a djvu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. The existing implementation is in C++ and is based on some framework.

Shameless self-promotion: I contributed the conversion from OCR-ed .djvu to Calibre, and it uses djvutxt in this way. If djvutxt is not available, it falls back to my pure Python decoder implementation (sloooow). So you could use that code if you cannot use djvutxt.

I have not yet released the Python source separately from Calibre, but you can find it by downloading and extracting Calibre's source:

curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py and djvubzzdec.py

Alex Martelli · Accepted Answer · 2009-10-08 15:39:16Z

4

python-djvulibre is a set of Python bindings to the djvulibre open source implementation of djvu -- I haven't tried it, but it looks like it should meet your needs.

msr · Accepted Answer · 2009-12-11 04:29:44Z

1

Certainly the DjVuLibre SDK will allow access to the text layer -- if it exists (not all DjVu files have a text layer; many are purely raster images).

An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use for this.

http://dev.caminova.jp/beta/djvu-wic/

