63

I'm looking for a way to convert a webpage to PDF while preserving the webpage's look and keeping its text selectable and searchable. (An image screenshot of the webpage would make the text neither selectable nor searchable.)

I want to print the webpage to PDF as is (as it appears in the web browser), without any changes to style or alignment and without losing any of the webpage's static components.

This would help preserve offline copies of webpages that are easily readable, annotatable, and searchable.


You don't need to read anything below to understand my question (the question is just the section above). The following section lists what I've found through research and others' answers while working toward a solution.

Research Outcomes (Suggestions that didn't solve my problem)

Outcomes till now on trying to find a solution (All still not working as a solution for this question)

I've tried these web-to-PDF printing engines, but all of them alter the pages' look, some badly enough to make pages hardly readable (example page screenshots are listed in square brackets):

  • Chrome [Original, Print Styles (Disabled | not Disabled)]
  • Firefox [Original, Print Styles (Disabled p1,p2 | not Disabled p1,p2)]
  • Readability
    • It simplifies the webpage, which is a good thing for focused reading. However, this isn't what I'm looking for. I want to keep all of the webpage's position and style properties as seen in the web browser, in PDF format, without any manipulation.
  • Foxit Reader
  • NovaPDF
  • CutyCapt [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • I'll add links after I solve the program's running issues on Windows.
  • wkhtmltopdf [Original, Zoom Factor: 0.4: Screenshots, Outputted PDF]
    • It doesn't support CSS3.

All webpage screenshot image capturing plugins (e.g. Abduction, Awesome Screenshot, Fireshot, Firefox Screenshot Developer Tool, Full Page Screen Capture, Page2Images, web-capture, ...) don't answer my question, because they don't preserve text and links.

Scrible is great at preserving webpages as is for further annotation and research, but unfortunately it is online-only and offers no conversion to PDF format.

There are two other questions in the community that are somewhat similar to mine; this one differs slightly, with these important distinctions:

More Similar questions where preserving text and links isn't a requirement (pages are captured as image screenshots mostly):


Notes

OS: Windows 10

7
  • If you want to print from a browser you first have to disable any print stylesheets to maintain the web page's screen appearance.
    – DavidPostill
    Commented Apr 12, 2016 at 15:25
  • 1
    See How to get WYSIWYP (print what you see) in a web browser?. See my answer to that question.
    – DavidPostill
    Commented Apr 12, 2016 at 15:26
  • Then you can print using CutePDF writer.
    – DavidPostill
    Commented Apr 12, 2016 at 15:27
  • @DavidPostill It seems that disabling print styles either doesn't work or doesn't affect how the browser renders the PDF. Example screenshots have been added to the edited version of the question.
    – Omar
    Commented Apr 12, 2016 at 19:11
  • 1
    Why did no one mention Opera? Opera save as pdf function will save it as exactly how it looks in a browser.
    – Alex
    Commented Nov 29, 2020 at 9:24

10 Answers

11

Contributing another answer for possible users. In Firefox, there used to be an addon called "Print pages to PDF". You can search for its last version, 0.1.9.3 (works on pre-Quantum versions only).

Currently there's this addon for both Chrome and Firefox that works quite well: PDFMage

  • Save all images in page
  • Generate text as text, not as an image, so you can search text in the generated PDF.
  • Preserve hyperlinks
  • Has the option to save a long webpage as a one-page PDF (so the images are not split between pages)
3
  • Excellent addon. Thank you. Commented Jun 18, 2021 at 6:18
  • I feel like this is the answer we're looking for. Gave it a go and it's preserved the look/layout of the site, text and links. Vote this up!
    – Daniel
    Commented Jun 3, 2022 at 6:56
  • This is the best answer! I've tested it on some web pages and it does preserve the look.
    – czxttkl
    Commented Sep 11, 2022 at 16:20
11

We faced the same problem in a University project and were able to solve it using

wkhtmltopdf

We quite enjoyed the capabilities of this tool on the command line. We also called it from Python code to render the current state of webpages. It can deliver the webpage as a PDF, which usually does not preserve the website view perfectly because of the page formatting (A4, for example), or as a PNG (which preserves the view of the page but not the links).
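
Calling it from Python can be as simple as building the argument list and handing it to subprocess. The following is a minimal sketch, not the code from our project: the function names are made up for illustration, and it assumes wkhtmltopdf is installed and on PATH. The --page-size and --zoom options are the ones related to the A4 fitting and the zoom factor mentioned in this thread.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(url, output_path, page_size="A4", zoom=1.0):
    """Return the argument list for one wkhtmltopdf run (hypothetical helper)."""
    return [
        "wkhtmltopdf",
        "--page-size", page_size,  # paper format the page is fitted to
        "--zoom", str(zoom),       # zoom factor applied when rendering
        url,
        output_path,
    ]


def render_pdf(url, output_path, **options):
    """Run wkhtmltopdf, failing early if the binary is not installed."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf is not on PATH")
    # check=True raises CalledProcessError if wkhtmltopdf exits with an error.
    subprocess.run(build_wkhtmltopdf_command(url, output_path, **options), check=True)
```

For example, render_pdf("https://example.com", "example.pdf", zoom=0.4) would try the 0.4 zoom factor. Keeping the command builder separate from the runner makes the options easy to inspect without launching the tool.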

There is also the readability project (for Python: pypi.python.org/pypi/readability-lxml) that we used; it does ad removal and content detection quite well (e.g. for newspaper articles and the like). If you just want an addon or extension for your browser, the following readability implementation might satisfy your need:

Offline now: https://www.readability.com/addons/

WaybackMachine Link: https://web.archive.org/web/20160308192045/https://readability.com/addons

7
  • Unfortunately, wkhtmltopdf didn't preserve page's elements positions. Example Page: Zoom Factor: 0.4: Screenshots, Outputted PDF
    – Omar
    Commented May 6, 2016 at 18:36
  • Readability simplifies the page (which is a good thing–However this isn't what I'm looking for). I need to keep all the page's positions/styles properties as seen on Web Browser in a PDF format without any manipulation.
    – Omar
    Commented May 6, 2016 at 19:08
  • Did you use the tool's image output (wkhtmltoimage, producing a PNG)? As a PNG the positions should be okay (at least much better than in the PDF version, where the page is fitted to A4 format).
    – sebisnow
    Commented May 9, 2016 at 6:36
  • @sebisnow Is the readability.com site deprecated? I can't access it at the moment.
    – jeppoo1
    Commented May 8, 2020 at 19:44
  • 1
    yes, seems to be offline for at least a year already. I will add a wayback machine link. web.archive.org/web/20160308192045/https://readability.com/…
    – sebisnow
    Commented May 12, 2020 at 11:36
7

I really struggled with this and tried most of the tools that have been mentioned so far. The best results I got were with Chrome's headless mode. The command on macOS looks like this:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --print-to-pdf=test.pdf http://127.0.0.1:8080

The best list of command line options I found was here.

However, there were problems with that. Specifically, my pages are very JavaScript-heavy and I couldn't make the print function wait for the scripts to finish executing, so my output didn't have the images in it.

The solution I found was a Node.js package: chrome-headless-render-pdf. Its scant documentation is here. It works and it is easily scriptable.
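
If you would rather script the plain headless Chrome command from Python, here is a minimal sketch. It is not what chrome-headless-render-pdf does internally; the helper names are hypothetical and the binary path is the macOS one from the command above. The optional --virtual-time-budget switch (in milliseconds) is something headless Chrome has offered to give page scripts time to run before printing, but treat its effect as an assumption to verify on your Chrome version and pages.

```python
import subprocess

# macOS path taken from the command above; adjust for other systems.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_chrome_pdf_command(url, output_path, chrome=CHROME, wait_ms=None):
    """Return the argument list for a headless print-to-PDF run (hypothetical helper)."""
    command = [chrome, "--headless", f"--print-to-pdf={output_path}"]
    if wait_ms is not None:
        # Assumption: lets JavaScript run for up to wait_ms of virtual time.
        command.append(f"--virtual-time-budget={wait_ms}")
    command.append(url)
    return command


def print_to_pdf(url, output_path, **options):
    """Launch Chrome and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_chrome_pdf_command(url, output_path, **options), check=True)
```

Passing the arguments as a list avoids the shell escaping needed for the spaces in the application path.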

1
  • Headless chrome works but generates horrible output. Commented Jun 18, 2021 at 6:21
4

I had the same problem, and figured it out via Chrome and with a free printer driver called PDF995. This is part of a suite of PDF utilities; the publisher's web site is http://www.pdf995.com/.

However, I think any web browser and any pdf converter will suffice. Anyway, here's what I did:

  1. select all or highlight everything.
  2. Right-click the highlighted selection or press Ctrl+P (both options give you slightly different results, but you end up with the same outcome after completion).

  3. If you right-clicked the selection in step 2 (the shortcut), click "Print"; only what you selected will appear in the print preview. Make sure you change your printer destination to whatever PDF converter you decide to use (PDF995 or other).

  4. Click "print" and it saves as a pdf document.

  5. If you pressed Ctrl+P in 2. (the slightly longer way) instead, click on "More settings" and scroll down to "Options".

  6. Tick the box that says "Selection only"; from here on, the result is the same as with the shortcut described above.

  7. Don't forget to change your printer destination to whatever pdf converter you choose (PDF995 or other).

  8. Click "print".

2

If you're on Linux, try this small command line tool CutyCapt, which depends only on Qt and QtWebkit, and exports to PDF.

1

Apart from one comment, nobody seems to have pointed out that Opera does exactly what is asked for. It saves the page as one selectable PDF page, without cutting adverts or adding page breaks, and at exactly[*] the current view width, narrow or wide.

Here we are viewing the narrow Page 1 at the bottom, so the PDF in the center is one long page; zoomed in on the right, we can see that the width has been used ALMOST[*] exactly.

[*] The difference is that the scroll bars are removed on saving, so the width is slightly wider and, as collateral damage, the contents shift slightly.

enter image description here

[*] NOTE: There are hide-scrollbar extensions for Chromium-based browsers like Opera, but results can vary. However, checking the "nominated site" with Hide Scrollbars shows that it can be activated.

enter image description here

0

Although this isn't exactly what you asked for, since it isn't PDF, if the objective is purely to keep an offline copy of webpages for later review, saving the page as a webpage would do just that.

The big caveat is that it will create a .html file and a folder with all the media content on the page rather than a single document.

In Chrome and Firefox, you can save a page by right-clicking it and choosing Save as... In Internet Explorer, you can save it under File -> Save as (press the Alt key to make the menus appear).

3
  • Saving the webpage in .html format would make it not-annotateable. So, I need it in PDF format.
    – Omar
    Commented Apr 12, 2016 at 15:34
  • That's a good point! I just remembered an extension that allows you to easily disable print-related stylesheets. A quick Google search led me to the discussion where I had first heard of it, on Super User: How to get WYSIWYP (print what you see) in a web browser?
    – Pyheme
    Commented Apr 12, 2016 at 15:42
  • I tried doing "Save As" using Chrome. It creates an .HTML file and a folder. The .HTML file was missing a whole lot of stuff from the page. Commented Dec 10, 2018 at 22:33
0

Try this service. Creates a PDF from a website as you see it in the browser. https://lomotoh.com/ (I am affiliated with this site)

3
  • This preserves links, but not selectable text, which is a requirement in the question.
    – fixer1234
    Commented Oct 15, 2016 at 23:07
  • Seems to be selectable for some sites. I think it depends what sort of custom font the site uses. Commented Oct 16, 2016 at 3:18
  • 4
    The link does not work. You should remove this answer.
    – PS Nayak
    Commented Jun 6, 2020 at 15:23
0

At least all of the text on some pages is searchable, selectable, and can be cut and pasted. I tried it on a page pasted up robotically by a computer out of text and pictures, and it turned it all into an image.

I have used these things for years. I get the best results in Linux by rebuilding the page in a XX word of your choice and exporting the result as a PDF. I can get what I want at considerable cost. From my limited use archiving, the site David Herse put up, https://lomotoh.com/ (I am NOT affiliated with this site), works as well as any I have ever used. It will be my go-to resource to convert webpages to PDFs until I find something better or it costs too much for me to pay out of my own thin purse.

0

I would suggest trying wkhtmltopdf again as suggested by @sebisnow in their answer, with some pre-processing.

Prior to running the program, open the developer tools (Ctrl+Shift+I) and adjust the elements that aren't sitting correctly. They are likely responsive for phone/desktop/tablet layouts, which means that their positions are relative to other HTML objects. Make them absolute positions instead.

Edit the source of the page, focusing on the margins and padding of the objects in question. Sometimes, simply making the canvas 10-15% larger will give even relative elements enough virtual room so they do not move.

I often use the developer tools to adjust page elements when I'm printing to PDF so that I have a reference file for later. Coupled with wkhtmltopdf, you should be able to have the site appear as it does in the browser, with the features that you're looking for, like image and link preservation as well as text.
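
One way to try the "larger canvas" idea without editing the page is to enlarge the paper size passed to wkhtmltopdf. This is only a sketch under assumptions: it treats "canvas" as the PDF page size, the helper names are hypothetical, and A4 (210 x 297 mm) is just a convenient starting point. The --page-width and --page-height options accept a value with a unit such as mm.

```python
def enlarged_page(width_mm, height_mm, percent):
    """Grow both dimensions by `percent`, rounding up to whole millimetres."""
    scale = 100 + percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return (-(-width_mm * scale // 100), -(-height_mm * scale // 100))


def build_command(source, output_path, percent=15, base_mm=(210, 297)):
    """Return wkhtmltopdf arguments for a page `percent` larger than base_mm."""
    width, height = enlarged_page(base_mm[0], base_mm[1], percent)
    return [
        "wkhtmltopdf",
        "--page-width", f"{width}mm",
        "--page-height", f"{height}mm",
        source,        # URL or path to the locally edited HTML file
        output_path,
    ]
```

The resulting list can be passed to subprocess.run(..., check=True) once wkhtmltopdf is installed. Whether the extra room actually stops elements from moving depends on the page.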

2
  • 1
    wkhtmltopdf does NOT work properly. I am a Browser War Veteran (remember 2006-09?) and that little piece of... tool gives me flashbacks. It will NOT understand page breaks, it will NOT print table gridlines thinner than 1mm (🤮) and it will NOT balance table line heights, keeping a fixed height and then dumping the remaining height on the last line. It is only useful if you go back to 1994 and print from NCSA Mosaic. I'm trying to use Selenium, headless browser and print to PDF. My solution will appear here if I ever make it run. I'm throwing in the kitchen sink and even pandoc.
    – Ricardo
    Commented Apr 11, 2022 at 12:45
  • @Ricardo I remember those days well. The last release does much better with escape characters, but as I said in my response you do need some pre-processing as wkhtmltopdf doesn't recognize newer/custom DOM elements. What I've had to do in the headless situation is have a script that modifies the HTML file (removes header and all non-necessary elements, replacing with common ones.) It is a very similar script to what browsers use for 'reading mode', which in my experience prints absolutely fine with wkhtmltopdf. Let me see if I can find the script that's been working for me to add. Commented Apr 30, 2022 at 10:56
