I am looking for a way to extract the text of a PDF together with its anchor (link) information using iText.

For example:

PDF content: You can visit our website, XYZ, and do something.

Output should be: You can visit our website, XYZ (www.google.com), and do something.

Basically, I am trying to generate a text file that includes the link targets.

Regards, Lalit Kumar

  • As you said, you are trying. Can you show us what you have tried?
    – gprathour
    Commented Jul 10, 2014 at 4:36

1 Answer

The static text you can see in a PDF file is stored in content streams using PDF syntax as described in Adobe's Imaging Model.

The interactive features you can see in a PDF file are stored outside the content stream of a page, in so-called annotation dictionaries, using the Carousel Object System (COS).

You are probably making the assumption that when you see something like itextpdf.com, there is something like <a href="http://itextpdf.com/">itextpdf.com</a> inside a PDF.

There isn't.

There will be something like:

/F1 12 Tf
(itextpdf.com )Tj

somewhere in the content stream referenced by the /Contents entry of a page.

When you inspect the /Annots of a page, you will find something like:

<<
  /A<<
    /S/URI
    /URI(http://itextpdf.com)
  >>
  /Subtype/Link
  /C[0 0 1]
  /Border[0 0 0]
  /Rect[36 803.52 98.03 814.62]
>>

as an object in your PDF file.

If you want to extract all the links and the corresponding text from a document, you need to loop over all the page dictionaries, get the /Annots, check which annotations are of subtype /Link, get the action (/A), and the coordinates (/Rect).
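The filtering step described above can be sketched in plain Java. This is only an illustration of the logic and does not use iText: the annotation dictionaries are modelled as `Map` objects keyed by PDF name, and `LinkAnnotationSketch`, `Link`, `collectLinks`, and `sampleAnnots` are hypothetical names introduced for this example. With iText you would read the same entries from the page dictionaries of the document instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LinkAnnotationSketch {

    /** A URI target plus the page rectangle {llx, lly, urx, ury} it covers. */
    static final class Link {
        final String uri;
        final double[] rect;

        Link(String uri, double[] rect) {
            this.uri = uri;
            this.rect = rect;
        }
    }

    /** Keeps only /Link annotations whose action (/A) is a /URI action. */
    static List<Link> collectLinks(List<Map<String, Object>> annots) {
        List<Link> links = new ArrayList<Link>();
        for (Map<String, Object> annot : annots) {
            if (!"/Link".equals(annot.get("/Subtype"))) {
                continue; // some other annotation type
            }
            Object action = annot.get("/A");
            if (!(action instanceof Map)) {
                continue; // link without an action dictionary
            }
            Map<?, ?> a = (Map<?, ?>) action;
            Object uri = a.get("/URI");
            Object rect = annot.get("/Rect");
            if ("/URI".equals(a.get("/S")) && uri instanceof String && rect instanceof double[]) {
                links.add(new Link((String) uri, (double[]) rect));
            }
        }
        return links;
    }

    /** Hypothetical input: the link annotation shown earlier plus one non-link. */
    static List<Map<String, Object>> sampleAnnots() {
        Map<String, Object> action = new HashMap<String, Object>();
        action.put("/S", "/URI");
        action.put("/URI", "http://itextpdf.com");

        Map<String, Object> link = new HashMap<String, Object>();
        link.put("/Subtype", "/Link");
        link.put("/A", action);
        link.put("/Rect", new double[] {36, 803.52, 98.03, 814.62});

        Map<String, Object> note = new HashMap<String, Object>();
        note.put("/Subtype", "/Text");

        List<Map<String, Object>> annots = new ArrayList<Map<String, Object>>();
        annots.add(link);
        annots.add(note);
        return annots;
    }

    public static void main(String[] args) {
        for (Link l : collectLinks(sampleAnnots())) {
            System.out.println(l.uri + " at x=" + l.rect[0] + ", y=" + l.rect[1]);
        }
    }
}
```

Links that use a destination instead of a /URI action are skipped here on purpose, since they have no URL to report.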

To know which text corresponds with the link, you need to use iText's text parser classes with a "region text" strategy and extract the text at the positions defined by the /Rect entry.
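To make the idea of matching text to a rectangle concrete, here is a minimal plain-Java sketch. It does not call iText; in iText 5 the region filtering itself is normally left to the parser classes (for instance `RegionTextRenderFilter`). The names `RegionTextSketch`, `Chunk`, `inside`, `annotate`, and `sampleChunks`, as well as all coordinates, are assumptions made up for this illustration, and each text chunk is reduced to a single start position.

```java
import java.util.Arrays;
import java.util.List;

class RegionTextSketch {

    /** A run of text and the (x, y) position where it starts on the page. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** True if the chunk's start point lies in rect = {llx, lly, urx, ury}. */
    static boolean inside(double[] rect, Chunk c) {
        return c.x >= rect[0] && c.x <= rect[2] && c.y >= rect[1] && c.y <= rect[3];
    }

    /** Rebuilds the text, appending " (uri)" after the chunks covered by rect. */
    static String annotate(List<Chunk> chunks, double[] rect, String uri) {
        StringBuilder out = new StringBuilder();
        boolean wasInside = false;
        for (Chunk c : chunks) {
            boolean isInside = inside(rect, c);
            if (wasInside && !isInside) {
                out.append(" (").append(uri).append(")"); // link text just ended
            }
            out.append(c.text);
            wasInside = isInside;
        }
        if (wasInside) {
            out.append(" (").append(uri).append(")"); // link ran to the end
        }
        return out.toString();
    }

    /** Hypothetical chunks for the sentence in the question. */
    static List<Chunk> sampleChunks() {
        return Arrays.asList(
                new Chunk("You can visit our website, ", 36, 806),
                new Chunk("XYZ", 180, 806),
                new Chunk(", and do something.", 205, 806));
    }

    public static void main(String[] args) {
        double[] rect = {178, 803.52, 204, 814.62}; // hypothetical /Rect of the link
        System.out.println(annotate(sampleChunks(), rect, "www.google.com"));
        // prints: You can visit our website, XYZ (www.google.com), and do something.
    }
}
```

Real text positions are less tidy than this: a link rectangle may cover only part of a chunk or span several lines, so expect to adjust the matching rule for actual documents.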

As indicated by gprathour in the comments, you should show what you've tried. Your question risks being downvoted or closed if your next question is "Can you give a code sample?" If you study the examples on http://itextpdf.com, you'll find that some of them will get you very close to a solution.
