5
$\begingroup$

Is there any way to correctly import PDF text that contains superscripts as plaintext? For example,

Import["~/Downloads/example.pdf", "Plaintext"]


The imported text should be "\$35", not "\$351"; the import is confused by the superscript "1", which it appends to the number.

Here is the example PDF file with superscripts to try:

CloudImport @ "https://www.wolframcloud.com/obj/17686b9d-e609-462d-8184-03201c2d551a"
$\endgroup$
3
  • 4
    $\begingroup$ This superscript is just smaller text that has been raised. It does not use a special character like compart.com/en/unicode/U+00B9, so you would need to import it together with the font size to detect whether it is a superscript. $\endgroup$
    – SHuisman
    Commented Mar 8 at 10:23
  • 2
    $\begingroup$ How can I filter based on glyph size first? It seems like that should be built in, no? $\endgroup$
    – M.R.
    Commented Mar 8 at 18:54
  • 1
    $\begingroup$ PDF is a really bad format for exchanging computer-readable information. I appreciate you may not have an alternative, but this is likely going to be very hard. $\endgroup$
    – rhermans
    Commented Mar 26 at 20:28
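
The glyph-size idea from the comments can be sketched roughly as follows. This is only an illustration of the filtering step: the `glyphs` list is made-up data, the keys "Text", "Size", and "Baseline" and the helper `dropSuperscripts` are hypothetical names, and the 0.8 threshold is arbitrary. How to obtain per-glyph font sizes and positions from the PDF in the first place is not shown here and is assumed to be solved separately.

```wolfram
(* Hypothetical glyph records: character, font size, baseline height. *)
(* Made-up values for illustration; Import["...", "Plaintext"] does not return this shape. *)
glyphs = {
   <|"Text" -> "$", "Size" -> 10., "Baseline" -> 100.|>,
   <|"Text" -> "3", "Size" -> 10., "Baseline" -> 100.|>,
   <|"Text" -> "5", "Size" -> 10., "Baseline" -> 100.|>,
   <|"Text" -> "1", "Size" -> 6., "Baseline" -> 104.|>
   };

(* Treat a glyph as a superscript when it is clearly smaller than the
   median font size AND sits above the median baseline. The default
   ratio of 0.8 is an assumed threshold, not a recommended value. *)
dropSuperscripts[gs_List, ratio_ : 0.8] :=
 Module[{size = Median[gs[[All, "Size"]]],
   base = Median[gs[[All, "Baseline"]]]},
  StringJoin[
   Select[gs, ! (#Size < ratio*size && #Baseline > base) &][[All, "Text"]]]]

dropSuperscripts[glyphs]
```

For the sample data above, the small raised "1" is dropped and the remaining characters are joined into the string \$35. Using the median as the reference keeps the sketch simple, but it assumes that most glyphs on the line are body text; a line made up mostly of superscripts would need a different reference.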
