
Is there a way to separate wget's download and --convert-links functionality? For those unfamiliar with wget and/or --convert-links: long story short, wget can be used to download a website, and --convert-links modifies the downloaded HTML files so the downloaded site works offline. It does that by converting the href/src/etc. attributes to reference local files instead of the remote website.
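For concreteness, a typical recursive download with link conversion looks something like this (example.com is just a placeholder, and this particular option set is only an illustration, not part of the question):

wget --recursive --page-requisites --no-parent --adjust-extension --convert-links https://example.com/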

This is the official explanation:

-k --convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of the two ways:

• The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ../bar/img.gif. This kind of transformation works reliably for arbitrary combinations of directories.

• The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by -k will be performed at the end of all the downloads.

If a (recursive) download gets interrupted and resumed manually, or if one fails to specify -k to begin with, how can one get sane links inside the HTML files?

It seems not even --backup-converted can make the process more robust: either wget converts the links right after downloading everything (so no files are missing), or you're on your own (XPath, etc.).
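The obvious workaround seems to be re-running the whole recursive download with -k added (and, say, --timestamping so already-downloaded, unchanged files are not re-fetched), roughly like the sketch below, so that the conversion pass runs once everything is in place again. But that still re-contacts the server for every file:

wget --recursive --timestamping --page-requisites --convert-links https://example.com/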

1 Answer


Since .html files are plain text, you can post-process them with sed. Suppose, for example, that the files contain links such as http://bad.url/good.part and https://bad.url/good.part and should reference good.url instead; the following rewrites them in place, keeping each unmodified *.html file as *.html.bak:

# Replace every ://bad.url/ prefix with ://good.url/ in all .html files;
# sed -i.bak keeps each original as *.html.bak.
find . -type f -name '*.html' -print0 | \
  xargs -0 -r sed -i.bak -e 's%://bad\.url/%://good.url/%g'

Naturally, read man find, man xargs and man sed.
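To sanity-check the rewrite, a small follow-up sketch (assuming the *.html.bak backups produced by the command above) is to diff each backup against its converted file:

find . -type f -name '*.html.bak' -exec sh -c 'diff -u "$1" "${1%.bak}"' _ {} \;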

  • @DanielKaplan "complexity" increases the bug surface, unnecessarily. Set up a local webserver, lie to /etc/hosts, re-transfer all the files, oh and remember the --timestamping option. Remember to undo all of this when you're done? Just to change text in an ASCII file? My "solution" has been known among the Unix/Linux community since the beginning. I regard it as a well-known general algorithm applied to a specific case. Which "edge cases"? That's why I leave the .bak files behind. diff to check changes. You can adjust the match string to get the results you want.
    – waltinator
    Commented Mar 25, 2022 at 0:43
  • "Which edge cases?" Well, for example, what if a link starts with a / instead of having a bad.url? (See the sed sketch after these comments.) Commented Mar 25, 2022 at 3:25
  • @DanielKaplan In that case, the script would need to detect that and adapt. Why not look at how wget does it?
    – 9pfs
    Commented Mar 25, 2022 at 3:32
  • @9pfssupportsUkraine That's a fair question. I assumed I wouldn't be able to find the logic or understand it. The last time I read C code was a decade ago, and I never knew how to program specifically for Linux. I'll give it a try. Commented Mar 25, 2022 at 4:05
  • @9pfssupportsUkraine bzr.savannah.gnu.org/lh/wget/trunk/view/head:/src/convert.c Looks like the logic is ~1000 lines. I may look into it later, but it might take considerable effort to translate/extract what I'm looking for. Commented Mar 25, 2022 at 4:52
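Regarding the root-relative links raised in the comments: one possible extension of the same sed approach is to anchor href/src attributes that begin with / to the original host, mirroring what wget itself does for files it did not download. This is only a sketch (good.url is a placeholder), and it still misses protocol-relative // links, links produced by JavaScript, and unusual quoting:

find . -type f -name '*.html' -print0 | \
  xargs -0 -r sed -i.bak -E -e 's%(href|src)="/([^/])%\1="https://good.url/\2%g'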
