
I have a PDF file that contains mainly text. However, some parts of the text are impossible to select and copy:

enter image description here

And when I select this:

enter image description here

So:

ult. Optimist

becomes

ult. Op mist

What is causing this, and can I somehow overcome this limitation?

    I'd suggest that this behaviour is likely to depend on the software you're running: i.e. the reader plus any underlying support libraries as determined by the operating system. Commented Jun 18, 2023 at 7:02

1 Answer


One possibility is that the program that wrote the PDF has "flattened" the "ti" ligature into a graphics line-draw/fill object, so there is no text character at that position to select.

Another is that the "ti" ligature is stored as text, but its encoding is not recognised by the target into which you pasted the text.

The first is the more likely case if your image shows the text selection in the PDF viewer itself.
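If it turns out to be the second case, the pasted text can sometimes be repaired after the fact. Below is a minimal Python sketch, not a definitive fix. It assumes the ligature arrives as a real character: either one of the standard Unicode ligature code points (U+FB00 to U+FB06, e.g. "ﬁ"), or a Private Use Area character, which is how some fonts expose a "ti" ligature, since Unicode has no standard code point for it. The function names and the `extra` mapping are hypothetical, and the code point U+E000 in the usage example is made up for illustration; you would have to find the actual one in your own pasted text.

```python
import unicodedata


def expand_ligatures(text, extra=None):
    """Expand ligature characters in pasted text.

    `extra` is a hypothetical, user-supplied mapping from font-specific
    characters (e.g. Private Use Area code points) to plain letters.
    """
    extra = extra or {}
    out = []
    for ch in text:
        if ch in extra:
            # Document-specific replacement supplied by the caller.
            out.append(extra[ch])
        elif "\ufb00" <= ch <= "\ufb06":
            # Standard Latin ligatures: NFKC expands them to plain letters.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


def private_use_chars(text):
    """Return the sorted Private Use Area characters found in `text`."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})
```

For example, `private_use_chars(pasted)` lists any font-specific characters in the pasted string, and `expand_ligatures(pasted, {"\ue000": "ti"})` would replace a (made-up) U+E000 with "ti". If the ligature was flattened into graphics, as in the first case, there is no character left in the text and this will not help.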


There are many other difficult problems in extracting plain text from all possible PDFs.

So there might not be any easy solution for this.

    This. Funnily enough, the built-in OCR of macOS handles this image fine... but OCR usually is not what you want. Encoding of ligatures and of Eastern European characters is often a weak point. For professional purposes, there are some libraries (like PDFlib TET) which usually handle this a bit better.
    – jvb
    Commented Jun 17, 2023 at 20:22
