
I have several PDF documents (such as this one) that appear to be written using standard Chinese ideographs, but when I extract the text, it turns out to be encoded using characters from the Unicode Supplementary Private Use Areas.

Is there any reliable way to map from the private use characters back to the appropriate CJK characters?

1 Answer


The general flow is probably:

  • Extract the embedded font from the PDF.
  • Compare the font's character mapping against known encodings to see whether it matches one of them.
  • If it matches none, the code points may genuinely be private-use assignments specific to that font.
  • Work out the reverse mapping: if the encoding is known, use its conversion table; otherwise, derive the mapping from the glyphs in the font extracted from the PDF.
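The last step can be sketched roughly as follows. This is only a minimal illustration, not a complete solution: it assumes you have already obtained some per-glyph "fingerprint" (for example a hash of the glyph outline) for both the font extracted from the PDF and a reference CJK font. How those fingerprints are produced is left out, and the names `build_reverse_map`, `remap`, `embedded`, and `reference` are hypothetical.

```python
# Private-use ranges defined by Unicode: one in the BMP, plus planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def is_private_use(ch):
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def build_reverse_map(embedded, reference):
    """Map private-use characters to standard characters with the same glyph.

    embedded:  dict of character -> glyph fingerprint, taken from the font
               extracted from the PDF (assumed input).
    reference: dict of character -> glyph fingerprint, taken from a known
               CJK font (assumed input).
    """
    # Index the reference font by fingerprint; keep the first character seen.
    by_shape = {}
    for ch, fingerprint in reference.items():
        by_shape.setdefault(fingerprint, ch)

    # Only private-use characters with a matching reference glyph are mapped.
    return {
        pua: by_shape[fingerprint]
        for pua, fingerprint in embedded.items()
        if is_private_use(pua) and fingerprint in by_shape
    }


def remap(text, mapping):
    """Replace mapped private-use characters; leave everything else as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)
```

Exact fingerprint matching only works if the embedded glyphs are identical to those in the reference font. If the outlines differ slightly, some form of approximate shape comparison (or OCR on rendered glyphs) would be needed instead.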
