
I would like to download the entirety of a relatively small website (~50 pages) for offline viewing.

I could manually open every page in a browser and save it with Ctrl+S, and this would create the desired result. But with a website of this size it would take a very long time, the saved files would be tedious to organize, and there would be a lot of room for human error (missed pages, pages saved into the wrong directories, etc.).

Wget and its recursive functionality seem like a great solution, but I am having trouble getting the desired result.

The desired result

Every single page on one domain, plus all requisite resources of every page (which may be hosted on other domains), should be downloaded. Nothing else.

The problem

A lot of requisite resources are on external domains. These domains are numerous, can change at any time, and are not easy to get an accurate list of.

My best attempt

I tried this command:

wget -r -k -p -H -l inf -w 1 --limit-rate=40k -e robots=off https://my.desired.website/

  • -r is used to download pages recursively.
  • -k is used to convert links for simplified offline viewing.
  • -p is used to tell Wget to download requisite resources.
  • -H allows host spanning without restrictions.
  • -l inf is used to be certain that every single page on the desired website will be downloaded, regardless of how deep in the page hierarchy it may be.
  • -w 1 --limit-rate=40k is used to wait one second between requests and to cap the download speed, in order not to be rude to hosts.
  • -e robots=off tells Wget to ignore "robots.txt" files and "nofollow" links.

Unfortunately, due to the -H flag, this command not only downloads every single page of the desired website, but it continues following all external links and downloading the entirety of every website it finds. This would likely result in attempting to download the entire public web.

However, without the -H flag, it does not download external resources necessary for viewing the website (e.g. images, JS, CSS, etc. that are hosted on external domains).

You may then say that I should use the -D flag and whitelist every domain where external resources are kept. This is also not a great solution, because I do not have full control over where the website's resources are hosted: the list of external domains may change at any time, and I cannot reliably find every domain manually without missing any.
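
For illustration only, a whitelist-based command would look roughly like the sketch below. The -D/--domains flag itself is standard Wget, but the CDN domains listed are made up, which is exactly the problem: the real list is unknown to me and can change at any time.

# Hypothetical -D whitelist (cdn.example.com and fonts.example.net are placeholders):
# every external host the site uses would have to be known in advance and kept current.
wget -r -k -p -H -l inf -w 1 --limit-rate=40k -e robots=off \
     -D my.desired.website,cdn.example.com,fonts.example.net \
     https://my.desired.website/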

The "real" question

So essentially my question is:

Is it possible to allow Wget to span hosts only when downloading requisite resources?

If not, is there a tool that allows this type of download?

  • Maybe use a scraper to scrape just the relevant data instead of the whole website.
    – Gantendo
    Commented Apr 15, 2023 at 7:29

1 Answer


No. At this time there is no method built into Wget that allows you to span hosts only for requisite resources.

If you must use Wget, the accepted answer from this question on Stack Overflow might be helpful to you.
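
In case it is useful, here is a minimal sketch of one possible two-pass workaround (not necessarily what that answer describes), assuming the site lives at https://my.desired.website/, that Wget saves files under its default host-named directory, and that the pages end in .html; pages.txt is just a name I made up for the intermediate URL list:

# Pass 1: mirror every page on the one domain, with no host spanning.
wget -r -l inf -w 1 --limit-rate=40k -e robots=off https://my.desired.website/

# Pass 2: rebuild the page URLs from what pass 1 saved, then fetch each page's
# requisites with host spanning. Without -r, "-p -H" only downloads the page
# itself and its requisites, so it cannot crawl off into external websites.
find my.desired.website -name '*.html' | sed 's|^|https://|' > pages.txt
wget -p -H -k -w 1 --limit-rate=40k -e robots=off -i pages.txt

Pages saved without an .html extension, or directory URLs saved as index.html, may not map cleanly back to fetchable URLs, so treat this as a starting point rather than a finished solution.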

If you would like to use another tool, perhaps HTTrack is worth looking into.
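
For what it's worth, HTTrack's -n/--near option is, as far as I know, meant to fetch non-HTML files that a page needs even when they are hosted outside the mirrored scope, which is close to the behaviour asked for here. A rough sketch, using the placeholder URL from the question and an output directory name I chose:

# -O sets the output directory; --near pulls in non-HTML requisites even if they
# live on other hosts; the "+*.my.desired.website/*" filter keeps the crawl itself
# on the one domain.
httrack "https://my.desired.website/" -O ./mirror --near "+*.my.desired.website/*"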

Keep in mind that neither of these solutions executes JavaScript or saves the final page as it is rendered in a browser. So if you are trying to archive or back up a website exactly as an end user would see it, you will probably have to look more deeply into this topic, and your solution may require more than a single step and a single tool.

