
OS: Ubuntu.

I need to get links or other data (for example, the binding layer from the QuarkXPress application) out of a PDF as text, in the terminal.

I tried pdftotext, but it seems links are not exported; pdfgrep is the same.

Is there any solution?

Thanks.

  • so, uh ... pdftohtml? it just works for me, maybe you could share a sample PDF file... Commented Jul 24, 2019 at 13:47
  • @Jeff Schaller Yep, I forgot about this. :) Working great. Commented Jul 24, 2019 at 15:15
  • Have a look at superuser.com/questions/698811/… where comments suggest pdfannotextractor
    – Jaleks
    Commented Sep 25, 2022 at 14:42

5 Answers


You could try to extract the /URI(...) PDF directives by hand, after first removing any compression using pdftk:

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*'
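
If the same URL is linked several times, appending sort -u collapses the duplicates; a minimal variant of the pipeline above (the sort -u is my addition, not part of the original answer):

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*' | sort -u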
  • Thank you, but native pdftohtml is enough. Commented Jul 24, 2019 at 15:15
  • @StanislavHosek OK, I think I misinterpreted your question as asking how to retrieve the list of linked URLs from a PDF file. Commented Jul 24, 2019 at 15:49

Using pdfx and keeping only the lines starting with - http:

pdfx -v file.pdf | sed -n 's/^- \(http\)/\1/p'
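
pdfx is a third-party Python tool, so it may not be installed by default; a common way to get it (assuming pip is available and the PyPI package is indeed named pdfx) is:

pip install --user pdfx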

First, you need to check whether your PDF is compressed or not; see:

How to know if a PDF file is compressed or not and to (un)compress it

If it's compressed, you need to uncompress it.

Then, you can extract links using grep and sed:

strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
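
Putting both steps together, a minimal sketch assuming pdftk is available for the uncompression step (input.pdf and uncompressed.pdf are placeholder names):

# step 1: rewrite the PDF with its streams uncompressed
pdftk input.pdf output uncompressed.pdf uncompress
# step 2: pull the /URI targets out of the raw file
strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'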
  • I love this solution because it does not require installing any additional special tools, which didn't work out for me as easily as I expected. Thanks! 👏
    – mreichelt
    Commented Jun 25 at 21:14

Test this:

pdftotext -raw "filename.pdf" && file=`ls -tr | tail -1`; grep -E "https?://.*" "${file}" && rm "${file}"

(screenshot: example of the script in use)

  • You should not parse ls ..., but you can just send the output to stdout instead of using a temporary file: pdftotext -raw "filename.pdf" - | grep ... (From man pdftotext: if text-file is '-', the text is sent to stdout.) See the sketch after these comments.
    – pLumo
    Commented Jul 24, 2019 at 14:29
  • Nope, it is not working, but thanks. Commented Jul 24, 2019 at 15:15
  • pdftotext version 0.71.0 / GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu) / grep (GNU grep) 3.3. @pLumo it works either way for me... :) Commented Jul 24, 2019 at 20:07
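
Following pLumo's comment, a sketch of the temp-file-free variant (the URL regex here is my own simplification; adjust it to your needs):

# send the extracted text to stdout and grep it directly
pdftotext -raw "filename.pdf" - | grep -Eo 'https?://[^[:space:]]+'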

You can use pdftohtml and then use lynx to pull the links from the HTML. For example:
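
A minimal sketch, assuming poppler's pdftohtml and lynx are installed (file.pdf is a placeholder):

# convert the PDF to HTML on stdout, then have lynx dump only the list of links
pdftohtml -stdout file.pdf | lynx -stdin -dump -listonly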
