Here are links to some journal articles:
- https://doi.org/10.1149/1.2183927
- https://doi.org/10.1149/1.2988135
- https://doi.org/10.1149/1.3021012
- https://doi.org/10.1149/1.2159298
They all encode parentheses (and other characters such as brackets) incorrectly. However, this is only apparent when trying to convert them to text or copy and paste. For example, the first line of the body of the first article should read:
Proton exchange membrane fuel cells (PEMFCs) have received
Instead, copying and pasting from Acrobat Reader gives:
Proton exchange membrane fuel cells PEMFCs have received
And "Save as text" gives:
Proton exchange membrane fuel cells ^CPEMFCs�
have received
Here the open parenthesis is ^C, the ASCII control character 03, and the closing parenthesis is Unicode 65533 (U+FFFD), the replacement character, followed by a newline.
Similarly, pdf2txt encodes it as
Proton exchange membrane fuel cells 共PEMFCs兲 have received
(Unicode 20849 and 20850)
and pdftotext encodes it as
Proton exchange membrane fuel cells ͑PEMFCs͒ have received
(Unicode 849 and 850).
The pdftotext output also contains Unicode 851 ( ͓), 852 ( ͔), 1003 (ϫ), 1011 (ϳ), 1015 (Ϸ), 8217 (’), 8211 (–), 8722 (−), 64257 (ﬁ), 64258 (ﬂ), and the control character Ctrl-L (ASCII 12, form feed). Some of these could be normalized to ASCII pretty easily, but I think others will require manual mapping.
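To make the "easy" part concrete, here is a minimal Python sketch of the kind of mapping I have in mind for the pdftotext output. The function name `normalize_pdftotext` and the table are my own; only the parenthesis mapping (849/850) is confirmed by the example above. The bracket mapping for 851/852 is an assumption, and 1003, 1011 and 1015 are deliberately left unmapped because I have not worked out what they stand for.

```python
import unicodedata

# Code point -> replacement. Only U+0351/U+0352 are confirmed by the
# "(PEMFCs)" example; the bracket entries are an unverified assumption.
PDFTOTEXT_MAP = {
    0x0351: "(",   # 849, confirmed
    0x0352: ")",   # 850, confirmed
    0x0353: "[",   # 851, assumption
    0x0354: "]",   # 852, assumption
    0x2019: "'",   # 8217, right single quotation mark
    0x2013: "-",   # 8211, en dash
    0x2212: "-",   # 8722, minus sign
    0x000C: "\n",  # Ctrl-L (form feed), used as a page separator
}


def normalize_pdftotext(text: str) -> str:
    """Apply the manual mapping, then let NFKC expand ligatures."""
    text = text.translate(PDFTOTEXT_MAP)
    # NFKC decomposes compatibility characters, so U+FB01 becomes "fi"
    # and U+FB02 becomes "fl". It also rewrites other compatibility
    # characters (e.g. superscript digits), which may not be wanted.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "fuel cells \u0351PEMFCs\u0352 have received"
    print(normalize_pdftotext(sample))
```

This only cleans up the extracted text; it does nothing to fix the PDF itself.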
My questions are:
- What's the best way to fix this? I've seen some similar questions, including one that uses a script to replace the mishandled characters, but setting up the mappings is non-trivial and it doesn't fix the PDF.
- Why do different PDF readers and PDF-to-text utilities give such different results?
Here are the outputs of pdfinfo and pdffonts:
Title:
Subject:
Keywords:
Author:
Creator: XPP
Producer: Acrobat Distiller 6.0.1 (Windows)
CreationDate: Thu Mar 23 12:07:23 2006
ModDate: Sun Nov 4 12:48:02 2012
Tagged: no
Pages: 6
Encrypted: no
Page size: 657 x 855 pts
File size: 266467 bytes
Optimized: no
PDF version: 1.4
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Helvetica                            Type 1            no  no  no      89  0
Helvetica-Oblique                    Type 1            no  no  no     109  0
Helvetica-Bold                       Type 1            no  no  no      88  0
LFNLKJ+Times-Bold                    Type 1C           yes yes no      63  0
LFNLLK+Times-Italic                  Type 1C           yes yes no      64  0
LFNLMK+Times-Roman                   Type 1C           yes yes no      65  0
LFNLML+MathematicalPi-Three          Type 1C           yes yes no      66  0
LFNLMM+MathematicalPi-One            Type 1C           yes yes no      67  0
LFNLMN+Universal-GreekwithMathPi     Type 1C           yes yes no      72  0
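
One possible lead on the second question: the uni column of pdffonts reports whether a font carries a ToUnicode map, and every embedded font above shows "no". Without that map each tool has to guess how glyph codes translate to text, which might be why the results differ, though I have not confirmed this. A minimal Python sketch for listing such fonts follows; the function name `fonts_without_tounicode` is hypothetical, and the parser assumes the last five columns are emb, sub, uni and the two-part object ID, as in the output above.

```python
def fonts_without_tounicode(pdffonts_output: str) -> list[str]:
    """Return names of embedded fonts whose 'uni' column is 'no'."""
    names = []
    for line in pdffonts_output.splitlines():
        # Split from the right so the type ("Type 1C") may contain spaces.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        head, emb, sub, uni, obj, gen = parts
        # Header and separator rows have no numeric object ID.
        if not (obj.isdigit() and gen.isdigit()):
            continue
        if emb == "yes" and uni == "no":
            # Font names contain no spaces, so the first token is the name.
            names.append(head.split()[0])
    return names


if __name__ == "__main__":
    sample = (
        "name type emb sub uni object ID\n"
        "---- ---- --- --- --- ---------\n"
        "Helvetica Type 1 no no no 89 0\n"
        "LFNLMK+Times-Roman Type 1C yes yes no 65 0\n"
    )
    print(fonts_without_tounicode(sample))
```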