The static text you can see in a PDF file is stored in content streams using PDF syntax as described in Adobe's Imaging Model.
The interactive features you can see in a PDF file are stored outside the content stream of a page, in so-called annotation dictionaries, using the Carousel Object System (COS).
You are probably assuming that when you see something like itextpdf.com, there is something like <a href="http://itextpdf.com/">itextpdf.com</a>
inside a PDF.
There isn't.
There will be something like:
/F1 12 Tf
(itextpdf.com )Tj
somewhere in the content stream referenced by the /Contents entry of a page.
When you inspect the /Annots entry of a page, you will find something like:
<<
/A<<
/S/URI
/URI(http://itextpdf.com)
>>
/Subtype/Link
/C[0 0 1]
/Border[0 0 0]
/Rect[36 803.52 98.03 814.62]
>>
as an object in your PDF file.
If you want to extract all the links and the corresponding text from a document, you need to loop over all the page dictionaries, get the /Annots array, check which annotations are of subtype /Link, get the action (/A), and the coordinates (/Rect).
To know which text corresponds with the link, you need to use iText's text parser classes with a "region text" strategy and extract the text at the positions defined by the /Rect entry.
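To make the region idea concrete, here is a minimal, library-independent sketch in plain Java. It does not use iText: the class LinkRegionSketch, the TextChunk type, and the textInRegion method are hypothetical names invented for illustration, and the chunk coordinates are made-up values. It only shows the matching step, assuming you already have positioned text chunks from a page and the four numbers of a link's /Rect entry.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from iText.
class LinkRegionSketch {

    /** Stand-in for a piece of text and the start of its baseline on the page. */
    static final class TextChunk {
        final String text;
        final double x;
        final double y;

        TextChunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A /Rect may list any two diagonally opposite corners, so order them first. */
    static double[] normalize(double[] rect) {
        return new double[] {
            Math.min(rect[0], rect[2]), Math.min(rect[1], rect[3]),
            Math.max(rect[0], rect[2]), Math.max(rect[1], rect[3])
        };
    }

    /** Concatenates the chunks whose start point lies inside the rectangle. */
    static String textInRegion(double[] rect, List<TextChunk> chunks) {
        double[] r = normalize(rect);
        StringBuilder sb = new StringBuilder();
        for (TextChunk c : chunks) {
            boolean inside = c.x >= r[0] && c.x <= r[2]
                          && c.y >= r[1] && c.y <= r[3];
            if (inside) {
                sb.append(c.text);
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The /Rect values from the annotation dictionary shown earlier.
        double[] rect = {36, 803.52, 98.03, 814.62};
        // Made-up chunk positions, for illustration only.
        List<TextChunk> chunks = Arrays.asList(
            new TextChunk("Visit ", 36, 820),
            new TextChunk("itextpdf.com ", 36, 806),
            new TextChunk("for examples.", 100, 806));
        System.out.println(textInRegion(rect, chunks)); // prints "itextpdf.com"
    }
}
```

A real text parser would test the whole extent of each chunk rather than just its start point, which is one reason to rely on the library's own region filter instead of hand-rolled matching like this.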
As indicated by GPRathour in the comments, you should show what you've tried. Your question risks being downvoted or closed if your next question is "Can you give a code sample?" If you study the examples on http://itextpdf.com, you'll find that some of them will get you very close to a solution.