In Adobe Acrobat Pro XI, hitting Ctrl + C when the following text is selected

[Screenshot: the text selected in Adobe Acrobat Pro XI]

will copy the following to the clipboard:

Training
1. Collect
a
set
of
representa8ve
training
documents

In Google Chrome, hitting Ctrl + C when the following text is selected

[Screenshot: the same text selected in Google Chrome]

will copy the following to the clipboard:

Training+
1. Collect+a+set+of+representa8ve+training+documents

I use Windows 7 SP1 x64 Ultimate. The PDF file can be accessed here (the screenshots above show page 16).

Why do Google Chrome and Adobe Acrobat Pro copy different text to the clipboard when I select the same text in the PDF?

2 Answers

3

The issue is already in the original document, in the way it has been created.

It looks as if the original presentation was created with PowerPoint (what else…) on a Mac (well, the presentation may have been created on Windows and then brought to a Mac to create the PDF). No OCR was involved.

The PDF was created with the Apple tools, and these tools seem to have problems with ligatures. Instead of using the ligature character from the "main" font file, they create another font subset containing the ligature character but do not properly map it to its Unicode code point. As a result, transposing its encoding to the "main" font encoding yields the character 8.
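
To illustrate the mechanism, here is a minimal Python sketch. It assumes (purely for illustration) that the ligature glyph sits at character code 0x38 in its subset font, and that a viewer falls back to the raw code value when the font provides no Unicode mapping for it. The function names, the maps, and the code value are hypothetical and were not read from the PDF in question.

```python
# Illustrative sketch only: the code value 0x38 and both mapping tables
# are assumptions, not data extracted from the actual PDF.

def extract_text(codes, to_unicode):
    """Map character codes to text, falling back to the raw code value
    when the font has no Unicode mapping entry for a code."""
    out = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
        else:
            # Fallback: interpret the code as a plain character.
            out.append(chr(code))
    return "".join(out)

# The "main" font maps every code to the letter it represents.
main_font_map = {ord(c): c for c in "representav"}

# A correctly built ligature subset would map its glyph to "ti".
good_ligature_map = {0x38: "ti"}
# The badly built subset has no mapping for the ligature glyph.
broken_ligature_map = {}

def copy_word(ligature_map):
    """Copy the word as three runs: main font, ligature subset, main font."""
    return (
        extract_text(map(ord, "representa"), main_font_map)
        + extract_text([0x38], ligature_map)
        + extract_text(map(ord, "ve"), main_font_map)
    )

print(copy_word(good_ligature_map))    # representative
print(copy_word(broken_ligature_map))  # representa8ve
```

With the mapping present, the ligature copies as "ti"; without it, the raw code 0x38 is read as the digit 8.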

As we all know, in PDF, text is a set of "words" placed on a canvas, where the "words" are separated by whitespace. The connection between the "words" to form a sentence does not exist in basic PDF. For copying, the PDF viewer applies heuristics to determine whether those "words" belong together, and/or it uses the structure information (if present). Chrome's logic is different from Acrobat's logic, and that's how the discrepancies appear.
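
As a rough sketch of how two extraction policies can give different clipboard text from the same positioned "words", consider the Python below. The `Fragment` type, the coordinates, and both functions are hypothetical; neither is claimed to be the actual logic of Acrobat or Chrome.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    """A hypothetical positioned piece of text on the page."""
    text: str
    x: float  # left edge
    y: float  # baseline (PDF coordinates: larger y is higher on the page)

def one_fragment_per_line(fragments):
    """Naive policy: emit every fragment on its own line, in reading order."""
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    return "\n".join(f.text for f in ordered)

def join_by_baseline(fragments, y_tolerance=1.0):
    """Heuristic policy: fragments sharing a baseline form one line."""
    lines = []  # list of (baseline, [fragments on that baseline])
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return "\n".join(
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    )

# Made-up positions for the text from the question.
fragments = [
    Fragment("Training", 50, 700),
    Fragment("1. Collect", 50, 650),
    Fragment("a", 130, 650),
    Fragment("set", 145, 650),
    Fragment("of", 170, 650),
    Fragment("representa8ve", 190, 650),
    Fragment("training", 290, 650),
    Fragment("documents", 350, 650),
]

print(one_fragment_per_line(fragments))
print(join_by_baseline(fragments))
```

The first policy produces one "word" per line, while the second reassembles the sentence. A real viewer would also have to decide how to render the whitespace between fragments, which this sketch does not model.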

Actually, Acrobat XI has a "Copy with Formatting" option in the context menu of the selection, and that led (after pasting into BBEdit) to:

"Training"
"1.    Collect a set of representa8ve training documents"

This option apparently uses more logic to create sentences. But the ligature is still wrong, because it cannot be properly recreated.

Verdict: a badly created PDF leads to discrepancies when attempting to repurpose its contents with different PDF viewers…

3
  • Text consists of characters that are readable (letters, numbers, symbols, punctuation), and characters that are positional (space, tab, carriage return, line feed). Are you saying that the PDF format ignores the positional characters that are part of the text and does its own thing?
    – fixer1234
    Commented Nov 30, 2014 at 16:09
  • Thanks for the clear explanation and for pointing to the "Copy with Formatting" action! Commented Nov 30, 2014 at 17:42
  • @fixer1234: In PDF, each "word" (which is a sequence of readable characters) is placed individually; in fact, depending on the editing tool, a text string containing (manual) kerning may be broken up. You find the full details in ISO 32000 (or the Portable Document Format Reference, which is part of the documentation of the Acrobat SDK, downloadable from the Adobe website).
    – Max Wyss
    Commented Nov 30, 2014 at 21:03
2

You can get to a PDF from a number of types of source documents. If you start with something saved directly from a word processor, the PDF will contain nice, editable text. If you start with an image of a page, the PDF contains a picture, which is not editable without OCR. In between are typeset documents. They contain text, but everything is hard-formatted to control the precise appearance on the page. Trying to edit those, or even clean them up for editing, can be a nightmare.

In this document, the spacing between words is controlled with tabs (or special characters interpreted as tabs), rather than spaces. The strange "8" in "representative" is probably due to the use of a ligature (a single special glyph that combines the "t" and "i"). It would not be surprising if different viewers handle the typesetting control codes differently.
