I would like to download the entirety of a relatively small website (~50 pages) for offline viewing.
I could manually open every page in a browser and save each with Ctrl+S, which would produce the desired result. But with a site of this size that would take a very long time, be tedious to organize, and leave a lot of room for human error (missed pages, pages saved to the wrong directories, etc.)
Wget, with its recursive functionality, seems like a great solution, but I am having trouble getting the desired result.
The desired result
Every single page on one domain and all requisite resources of every page (which may be on other domains) to be downloaded. Nothing else.
The problem
A lot of requisite resources are on external domains. These domains are numerous, can change at any time, and are not easy to get an accurate list of.
My best attempt
I tried this command:
wget -r -k -p -H -l inf -w 1 --limit-rate=40k -e robots=off https://my.desired.website/
-r is used to download pages recursively.
-k is used to convert links for simplified offline viewing.
-p is used to tell Wget to download requisite resources.
-H allows host spanning without restrictions.
-l inf is used to be certain that every single page on the desired website is downloaded, regardless of how deep in the page hierarchy it sits.
-w 1 --limit-rate=40k limits the request rate and download speed, in order not to be rude to hosts.
-e robots=off tells Wget to ignore "robots.txt" files and "nofollow" links.
Unfortunately, due to the -H flag, this command not only downloads every page of the desired website, but also follows all external links and downloads the entirety of every website it finds. This would likely amount to attempting to download the entire public web.
However, without the -H flag, it does not download the external resources necessary for viewing the website (e.g. images, JS, and CSS hosted on other domains).
You may then say that I should use the -D flag and whitelist every domain where external resources are kept. This is also not a great solution: I do not have full control over where the website's resources are hosted, the list of external domains can change at any time, and I cannot reliably enumerate every domain by hand without missing some.
The "real" question
So essentially my question is:
Is it possible to only allow Wget to span hosts when downloading requisite resources?
If not, is there a tool that allows this type of download?