How to import PDF content with sub/superscripts?

Asked 4 months ago

Modified 4 months ago

Viewed 131 times

Is there any way to correctly import PDF text that contains superscripts as plaintext? For example:

Import["~/Downloads/example.pdf", "Plaintext"]

The imported text should be "\$35", not "\$351"; the import is confused by the superscript.

Here's an example PDF file with superscripts to try:

CloudImport @ "https://www.wolframcloud.com/obj/17686b9d-e609-462d-8184-03201c2d551a"

edited Mar 7 at 22:39

asked Mar 7 at 20:24

user5601

4

This superscript is just smaller text that has been raised. It is not using a special character like compart.com/en/unicode/U+00B9, so you would need to import it in a way that includes the font size in order to detect whether or not it is a superscript.
– SHuisman
Commented Mar 8 at 10:23
2

How would one filter based on glyph size first? It seems like that should be a built-in sort of thing, no?
– M.R.
Commented Mar 8 at 18:54
1

PDF is a really bad format for exchanging computer-readable information. I appreciate you may not have an alternative, but this is likely going to be very hard.
– rhermans
Commented Mar 26 at 20:28
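
The font-size idea from the comments can be sketched roughly as follows. This is only an illustration of the filtering step, not a working importer: it assumes you have already obtained, by some other means, a list of text runs with their font size and baseline position. The `runs` data, the key names ("Text", "FontSize", "Baseline"), the helper `superscriptQ`, and the 0.8 size ratio are all hypothetical; `Import[..., "Plaintext"]` does not return this information.

```wolfram
(* Hypothetical text runs: content, font size (pt), baseline height (pt).
   The values are made up for illustration. *)
runs = {
   <|"Text" -> "$35", "FontSize" -> 10., "Baseline" -> 700.|>,
   <|"Text" -> "1", "FontSize" -> 6., "Baseline" -> 704.|>,
   <|"Text" -> " per month", "FontSize" -> 10., "Baseline" -> 700.|>
   };

(* Use the median size and baseline as a crude estimate of the body text. *)
bodySize = Median[Lookup[runs, "FontSize"]];
bodyBaseline = Median[Lookup[runs, "Baseline"]];

(* A run is treated as a superscript if it is clearly smaller than the body
   text and sits above the body baseline. The 0.8 ratio is an arbitrary
   threshold chosen for this example. *)
superscriptQ[run_] :=
  run["FontSize"] < 0.8 bodySize && run["Baseline"] > bodyBaseline

(* Drop the superscript runs and join what is left. *)
StringJoin[Lookup[Select[runs, Not @* superscriptQ], "Text"]]
(* "$35 per month" *)
```

In a real document the baseline comparison would have to be done per line rather than over all runs, and the hard part remains getting the per-glyph sizes and positions out of the PDF in the first place.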

0