Why does this PDF display parentheses correctly but lose them when using pdftotext or copying and pasting?

Question

Here are links to some journal articles:

They all encode parentheses (and other characters such as brackets) incorrectly. However, this is only apparent when trying to convert them to text or copy and paste. For example, the first line of the body of the first article should read:

Proton exchange membrane fuel cells (PEMFCs) have received

Instead, copying and pasting from Acrobat Reader gives

Proton exchange membrane fuel cells PEMFCs have received

And "Save as text" gives

Proton exchange membrane fuel cells ^CPEMFCs�
have received

Here the opening parenthesis has become ^C (the ASCII control character 0x03), and the closing parenthesis has become Unicode 65533 (the replacement character) followed by a newline. Similarly, pdf2txt encodes it as

Proton exchange membrane fuel cells 共PEMFCs兲 have received

(Unicode 20849 and 20850) and pdftotext encodes it as

Proton exchange membrane fuel cells ͑PEMFCs͒ have received

(Unicode 849 and 850).
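A quick way to see which code points a given tool produced is to list every non-ASCII or control character in its output. The following is only a minimal sketch; the function name `non_ascii_report` is made up for illustration.

```python
import unicodedata

def non_ascii_report(text):
    """List each distinct non-ASCII or control character found in text.

    Returns (character, decimal code point, Unicode name) tuples in order
    of first appearance. Ordinary whitespace (newline, tab, CR) is ignored.
    """
    seen = []
    for ch in text:
        is_odd = ord(ch) > 126 or ord(ch) < 32
        if is_odd and ch not in "\n\t\r" and ch not in seen:
            seen.append(ch)
    # Control characters have no Unicode name, so fall back to a placeholder.
    return [(ch, ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in seen]

if __name__ == "__main__":
    line = "Proton exchange membrane fuel cells \u0351PEMFCs\u0352 have received"
    for ch, code, name in non_ascii_report(line):
        print(code, hex(code), name)
```

Running it on the output of each extractor makes it easy to compare what each one substituted for the same glyph.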

The pdftotext output also contains Unicode 851 ( ͓), 852 ( ͔), 1003 (ϫ), 1011 (ϳ), 1015 (Ϸ), 8217 (’), 8211 (–), 8722 (−), 64257 (ﬁ), 64258 (ﬂ), and the control character Ctrl-L (ASCII 12, form feed). Some of these could be normalized to ASCII fairly easily, but I think others will require manual mapping.
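As a rough sketch of what such a mapping could look like for text that has already been extracted: the parenthesis entries follow the observations above, while the bracket targets for 851 and 852 are an assumption that would need to be checked against the rendered PDF. The names `FIXES` and `normalize_extracted` are hypothetical, and this does not repair the PDF itself.

```python
import unicodedata

# Code point -> replacement. Only the parenthesis entries are confirmed by
# the output shown above; the bracket entries are an ASSUMPTION.
FIXES = {
    0x0351: "(",  # 849, pdftotext's opening parenthesis
    0x0352: ")",  # 850, pdftotext's closing parenthesis
    0x5171: "(",  # 20849, pdf2txt's opening parenthesis
    0x5172: ")",  # 20850, pdf2txt's closing parenthesis
    0x0353: "[",  # 851, assumed to be an opening bracket
    0x0354: "]",  # 852, assumed to be a closing bracket
    0x2019: "'",  # 8217, right single quotation mark
    0x2013: "-",  # 8211, en dash
    0x2212: "-",  # 8722, minus sign
}

def normalize_extracted(text):
    """Apply the manual mapping, then let NFKC expand ligatures like fi and fl."""
    # Translate first so the combining marks (849-852) are gone before
    # Unicode normalization gets a chance to reorder or compose them.
    text = text.translate(FIXES)
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    print(normalize_extracted("fuel cells \u0351PEMFCs\u0352 have received"))
```

Note that mapping 20849 and 20850 globally would corrupt any genuine CJK text in the same document, so a table like this is only safe per extractor and per document family. Characters such as 1003, 1011 and 1015 are left out because their intended meaning is not stated here.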

My questions are:

What's the best way to fix this? I've seen some similar questions, including one that uses a script to replace the mishandled characters, but setting up the mappings is non-trivial and it doesn't fix the PDF itself.
Why do different PDF readers and PDF to text utilities give such different results?

Here are the outputs of pdfinfo and pdffonts:

Title:          
Subject:        
Keywords:       
Author:         
Creator:        XPP
Producer:       Acrobat Distiller 6.0.1 (Windows)
CreationDate:   Thu Mar 23 12:07:23 2006
ModDate:        Sun Nov  4 12:48:02 2012
Tagged:         no
Pages:          6
Encrypted:      no
Page size:      657 x 855 pts
File size:      266467 bytes
Optimized:      no
PDF version:    1.4

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Helvetica                            Type 1            no  no  no      89  0
Helvetica-Oblique                    Type 1            no  no  no     109  0
Helvetica-Bold                       Type 1            no  no  no      88  0
LFNLKJ+Times-Bold                    Type 1C           yes yes no      63  0
LFNLLK+Times-Italic                  Type 1C           yes yes no      64  0
LFNLMK+Times-Roman                   Type 1C           yes yes no      65  0
LFNLML+MathematicalPi-Three          Type 1C           yes yes no      66  0
LFNLMM+MathematicalPi-One            Type 1C           yes yes no      67  0
LFNLMN+Universal-GreekwithMathPi     Type 1C           yes yes no      72  0

Aaron Brick · Accepted Answer · 2017-12-08 06:35:35Z

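The answer is in the "uni" column of the pdffonts output: every font listed, including the one used for the parentheses, shows "no", meaning it lacks an explicit mapping to Unicode. Without that mapping, each reader or converter has to guess a code point for each glyph, which is why the results differ from tool to tool. Identifying the most correct code point for an arbitrary symbol is a hard problem.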
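To check other files for the same condition, one could scan the pdffonts table for fonts whose "uni" column is "no". This is a minimal sketch that assumes the column layout shown in the question (with "uni" immediately before the two-part object ID); the function name `fonts_without_unicode_map` is made up for illustration.

```python
def fonts_without_unicode_map(pdffonts_output):
    """Return the names of fonts whose "uni" column is "no".

    Rows are parsed from the right because the "type" column can contain
    spaces (e.g. "Type 1C"): the last five fields are assumed to be
    emb, sub, uni, object number, generation number.
    """
    missing = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 7 or fields[0] == "name" or fields[0].startswith("-"):
            continue
        emb, sub, uni = fields[-5:-2]
        if uni == "no":
            missing.append(fields[0])
    return missing

if __name__ == "__main__":
    sample = (
        "name                             type     emb sub uni object ID\n"
        "-------------------------------- -------- --- --- --- ---------\n"
        "Helvetica                        Type 1   no  no  no      89  0\n"
        "LFNLMM+MathematicalPi-One        Type 1C  yes yes no      67  0\n"
    )
    print(fonts_without_unicode_map(sample))
```

The text passed in would be whatever `pdffonts file.pdf` prints; for the file in the question, every font would be reported.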
