
OS: Ubuntu.

I need to get links or other data (for example, the binding layer from the QuarkXPress application) out of a PDF as text, in the terminal.

I tried pdftotext, but it seems links are not exported; pdfgrep is the same.

Is there any solution?

Thanks.

  • so, uh ... pdftohtml? it just works for me, maybe you could share a sample PDF file... Commented Jul 24, 2019 at 13:47
  • @Jeff Schaller Yep, I forgot about this. :) Working great. Commented Jul 24, 2019 at 15:15
  • Have a look at superuser.com/questions/698811/… where comments suggest pdfannotextractor
    – Jaleks
    Commented Sep 25, 2022 at 14:42

5 Answers


You could try to extract the /URI(...) PDF directives by hand, after first removing any compression using pdftk:

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*'
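
If the same URL is linked several times, appending sort -u collapses the duplicates; a minimal variant of the pipeline above (the sort -u is my addition, not part of the original answer):

pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*' | sort -u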
  • Thank you, but native pdftohtml is enough. Commented Jul 24, 2019 at 15:15
  • @StanislavHosek OK, I think I misinterpreted your question as asking how to retrieve the list of linked URLs from a PDF file. Commented Jul 24, 2019 at 15:49

Using pdfx and keeping only the lines starting with - http:

pdfx -v file.pdf | sed -n 's/^- \(http\)/\1/p'
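
pdfx is a third-party Python tool, so it may not be installed by default; a common way to get it (assuming pip is available and the PyPI package is indeed named pdfx) is:

pip install --user pdfx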

First, you need to check whether your PDF is compressed or not; see:

How to know if a PDF file is compressed or not and to (un)compress it

If it's compressed, you need to uncompress it.

Then, you can extract links using grep and sed:

strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
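
Putting both steps together, a minimal sketch assuming pdftk is available for the uncompression step (input.pdf and uncompressed.pdf are placeholder names):

# step 1: rewrite the PDF with its streams uncompressed
pdftk input.pdf output uncompressed.pdf uncompress
# step 2: pull the /URI targets out of the raw file
strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'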
  • I love this solution because it does not require installing any additional special tools, which didn't work out for me as easily as I expected. Thanks! 👏
    – mreichelt
    Commented Jun 25 at 21:14

Test this:

pdftotext -raw "filename.pdf" && file=`ls -tr | tail -1`; grep -E "https?://.*" "${file}" && rm "${file}"

(screenshot: example of the script in use)

  • You should not parse ls ..., but you can just send the output to stdout instead of using a temporary file: pdftotext -raw "filename.pdf" - | grep ... (From man pdftotext: if text-file is '-', the text is sent to stdout.) See the sketch after these comments.
    – pLumo
    Commented Jul 24, 2019 at 14:29
  • Nope, it is not working, but thanks. Commented Jul 24, 2019 at 15:15
  • pdftotext version 0.71.0 / GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu) / grep (GNU grep) 3.3. @pLumo it works either way for me... :) Commented Jul 24, 2019 at 20:07
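
Following pLumo's comment, a sketch of the temp-file-free variant (the URL regex here is my own simplification; adjust it to your needs):

# send the extracted text to stdout and grep it directly
pdftotext -raw "filename.pdf" - | grep -Eo 'https?://[^[:space:]]+'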

You can use pdftohtml and then use lynx to pull the links from the HTML. For example:
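
A minimal sketch, assuming poppler's pdftohtml and lynx are installed (file.pdf is a placeholder):

# convert the PDF to HTML on stdout, then have lynx dump only the list of links
pdftohtml -stdout file.pdf | lynx -stdin -dump -listonly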
