How to get CJK Unicode characters from a PDF that uses supplementary private use characters?

Question

I have several PDF documents (such as this one) that appear to be written using standard Chinese ideographs, but when I extract the text, it turns out to be encoded using characters from the Unicode Supplementary Private Use Areas (planes 15 and 16).

Is there any reliable way to map from the private use characters back to the appropriate CJK characters?

user930067 · Accepted Answer · 2017-11-14 01:26:04Z

The general flow is probably:

1. Extract the embedded font from the PDF.
2. Compare the font against different known encodings and see whether it matches any of them.
3. Alternatively, the code points may be genuinely private use, that is, assigned only within this font.
4. Work out the reverse mapping: if the encoding is known, use its conversion table; otherwise, work from the font extracted from the PDF.
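
Once a reverse mapping exists, applying it to the extracted text is mechanical. The following is only a minimal sketch of that last step, assuming the table has already been worked out from the font or from a known conversion table. The names `is_supplementary_pua`, `remap_pua`, and `EXAMPLE_TABLE` are hypothetical, and the code point pairs in the table are invented for illustration; they do not come from any real font.

```python
from __future__ import annotations

# Supplementary Private Use Area-A and -B (Unicode planes 15 and 16).
SPUA_RANGES = ((0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def is_supplementary_pua(ch: str) -> bool:
    """Return True if ch lies in one of the supplementary private use areas."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPUA_RANGES)


def remap_pua(text: str, table: dict[int, str]) -> tuple[str, set[int]]:
    """Replace mapped private use characters and collect the unmapped ones."""
    out = []
    unmapped = set()
    for ch in text:
        cp = ord(ch)
        if is_supplementary_pua(ch) and cp in table:
            out.append(table[cp])
        else:
            if is_supplementary_pua(ch):
                # No known mapping: keep the character and report it.
                unmapped.add(cp)
            out.append(ch)
    return "".join(out), unmapped


# Hypothetical table: these pairs are invented for illustration only.
EXAMPLE_TABLE = {0xF0000: "\u4e2d", 0xF0001: "\u6587"}

if __name__ == "__main__":
    extracted = "\U000F0000\U000F0001 PDF \U000F0002"
    fixed, missing = remap_pua(extracted, EXAMPLE_TABLE)
    print(fixed)
    print(sorted(hex(cp) for cp in missing))
```

Returning the set of unmapped code points makes it easy to see which glyphs still need to be identified in the extracted font before the table is complete.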
