Sometimes when I run pdftotext, the result is perfect text. I assume this is because the actual Unicode text is embedded directly in the PDF and simply read out.

But other times (around half or more of the documents that aren't just scanned images), it produces strange glyphs in place of things like diacritics and accent marks, or sometimes even what seem to be blurry letters.

For example, this Yoruba dictionary PDF has these problems. If you run this:

pdftotext yoruba.pdf yoruba.txt

You end up with these words scattered about:

expected     actual
--------     ------
lairotẹle    lairot4ille
ikọsilẹ      ikljlsil4il
logó         logb

Notice the accented ó became the letter b. But it's not as if every ó becomes a b in the document. Many do, but not all. The same goes for ẹ becoming 4il: many instances end up like this, probably all of them. My sense is that, most of the time, the more obscure diacritics, such as the dot below in ẹ and ọ, get converted into stranger characters or character sequences.
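To get a rough idea of how widespread the damage is in the extracted text, a small Python sketch along these lines could flag suspicious words. It assumes that a digit sandwiched between letters (as in lairot4ille) is a sign of a mis-decoded glyph, which is only a heuristic, and the function name is made up for illustration.

```python
import re

# Heuristic (an assumption, not a rule): a digit with letters on both
# sides, as in "lairot4ille", probably stands in for a mis-decoded glyph.
# [^\W\d_] matches any Unicode letter.
DIGIT_INSIDE_WORD = re.compile(r"[^\W\d_]\d+[^\W\d_]")


def find_suspicious_words(text):
    """Return the whitespace-separated words that contain a digit between letters."""
    return [word for word in text.split() if DIGIT_INSIDE_WORD.search(word)]


sample = "lairot4ille ikljlsil4il logb"
print(find_suspicious_words(sample))  # ['lairot4ille', 'ikljlsil4il']
```

This would not catch letter-for-letter substitutions such as ó turning into b, since logb looks like an ordinary word.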

Why is this? Is it an OCR thing? Or does the PDF actually have the plain text embedded in it (i.e. it's not a scanned image), and yet it's somehow not being decoded properly? I would like to know the answer, so that I at least know whether it's an OCR problem or an encoding/decoding problem.

If it's an encoding problem, that would be interesting. In that case, can I tell pdftotext to use some other, more obscure decoding?

I bring this up partly because I've recently discovered some webpages that are encoded in UCS-2 or Latin-1, and some even in something strange like windows2255. So I've had to tinker with the encoding/decoding to properly extract the text from HTML documents. I'm wondering if the same thing applies to PDFs in this case.
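For the HTML case, the tinkering amounts to something like the following Python sketch: try a few candidate encodings on the raw bytes and see which ones decode cleanly. The function name and the candidate list are illustrative assumptions, and whether anything comparable applies to PDFs is exactly what I'm unsure about.

```python
def try_encodings(raw, encodings=("utf-8", "latin-1", "cp1252")):
    """Decode `raw` with each candidate; map encoding name to text, or None on failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            results[name] = None
    return results


# Bytes that were written as Latin-1 but would be misread as UTF-8.
raw = "logó".encode("latin-1")
for name, text in try_encodings(raw).items():
    print(name, repr(text))
```

A clean decode does not prove the encoding is right (Latin-1 accepts any byte sequence), so the output still has to be checked by eye.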

Another document that suffers from this problem is the Navajo dictionary. I don't know if it's an OCR thing or an encoding thing. Another strange document is "Zulu-English Dictionary by Forgotten Books" (which I would link to, but the link downloads the file directly instead of rendering it in the browser). If you copy/paste the text, the letters are separated from each other by 1 or 2 spaces in seemingly random fashion. I have no idea why, and would like to have a better sense of it.

