wget fails to download some images in a webpage

Question

When I download this webpage with wget, the text and styling work well, but some images are missing. On closer inspection, the files fail to download because the URL wget tries to retrieve them from is invalid, as the console output shows:

URL transformed to HTTPS due to an HSTS policy
--2021-07-13 21:53:51--  https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D
Reusing existing connection to [www.inhaltsangabe.de]:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D/ [following]
--2021-07-13 21:53:52--  https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D/
Reusing existing connection to [www.inhaltsangabe.de]:443.
HTTP request sent, awaiting response... 404 Not Found
2021-07-13 21:53:53 ERROR 404: Not Found.

The actual image on the website is accessible and has the following url:

https://www.inhaltsangabe.de/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg

Other images work fine in the downloaded file. This seems to have something to do with url encoding, but I have no idea on how to solve this problem.

My command:

wget -p www.inhaltsangabe.de/autoren/brecht

(also please be gentle as this is my first question asked here)

For anyone else with this problem: It seems to be a problem with javascript, see this reddit comment — ordinary_python_programmer, Commented Jul 18, 2021 at 1:01

Anaksunaman · Accepted Answer · 2021-07-18 02:42:29Z

404 Errors

This seems to have something to do with url encoding[.]

Decoding the percent-encoded portions of the failing links reveals that the "paths" are actually template placeholders (variable names) present in the document source, so e.g. %7B%7B%20data.avatar_url%20%7D%7D becomes {{ data.avatar_url }}. These unrendered placeholders, rather than the encoding itself, are the likely reason for the 404 responses.
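
As a quick check, the decoding can be reproduced with Python's standard library. This is only a minimal sketch; the variable names are mine:

```python
from urllib.parse import unquote

# Last path segment of the failing URL from the wget output in the question.
encoded = "%7B%7B%20data.avatar_url%20%7D%7D"

# Percent-decoding turns %7B into "{", %20 into a space and %7D into "}".
decoded = unquote(encoded)
print(decoded)  # {{ data.avatar_url }}
```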

The leading https://www.inhaltsangabe.de/autoren/ is probably (mis)applied by wget because each variable appears in an <img> tag src attribute:

ex. {{ data.images.thumbnail.url }}

<# if ( data.images.thumbnail ) { #>
      <img class="suggestion-post-thumbnail" src="{{ data.images.thumbnail.url }}" alt="{{ data.post_title }}">
      <# } #>

ex. {{ data.avatar_url }}

<# if ( data.avatar_url ) { #>
    <img class="suggestion-user-thumbnail" src="{{ data.avatar_url }}" alt="{{ data.display_name }}">
    <# } #>
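
To illustrate how such a placeholder could end up as the failing URL, the sketch below treats it as a relative link, resolves it against the page URL and percent-encodes the result. It assumes the page URL is https://www.inhaltsangabe.de/autoren/brecht (the command from the question, upgraded to HTTPS), and it uses Python's urljoin and quote as a stand-in for wget's own logic, which I have not traced:

```python
from urllib.parse import quote, urljoin

# Assumed page URL (the command from the question, upgraded to HTTPS).
page_url = "https://www.inhaltsangabe.de/autoren/brecht"

# Unrendered template placeholder found in the <img> src attribute.
src = "{{ data.avatar_url }}"

# A value with no scheme or leading slash is resolved relative to the
# directory of the page, i.e. /autoren/.
resolved = urljoin(page_url, src)

# Percent-encode everything except the URL delimiters ":" and "/".
requested = quote(resolved, safe=":/")
print(requested)
# https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D
```

The result matches the first URL wget requested in the console output.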

Missing JPEG

Other images work fine in the downloaded file.

Regarding brecht-276fafb8.jpeg, while admittedly a bit of an educated guess, it appears likely that wget is processing <img> tag src and srcset attributes in the document source, but not any data-src or data-srcset attributes. For example:

ex. brecht-276fafb8.jpeg -> data-src, data-srcset (Fail!)

<img class="el-image uk-border-circle uk-box-shadow-small" alt="Bertolt Brecht" data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" data-srcset="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg 350w" data-sizes="(min-width: 350px) 350px" data-width="350" data-height="350" uk-img>

ex. bradbury.jpg -> src, srcset (Success!)

<img width="300" height="300" src="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg" alt="Ray Bradbury" sizes="(min-width: 300px) 300px" srcset="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg 300w, https://www.inhaltsangabe.de/dateien/bradbury-150x150.jpg 150w, https://www.inhaltsangabe.de/dateien/bradbury.jpg 400w"/>

This makes sense as the src and srcset attributes likely affect the general presentation of the document (i.e. images to show), whereas data-* attributes are primarily aimed at scripting, etc. and don't have any presentational value on their own.

As far as I am aware, at least in prior versions, custom attributes (e.g. data-*) were generally unsupported by wget. Regarding src and srcset, you can see them explicitly mentioned in the lists of attributes to process under src/html-url.c in the source code for wget.
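
As a rough model of that behaviour (not wget's actual code, which is written in C), the sketch below scans <img> tags and follows only src and srcset. The class name ImgScanner and the FOLLOWED set are my own illustration, and the markup is shortened from the two examples above:

```python
from html.parser import HTMLParser

# Attributes this toy scanner follows on <img>; data-* is deliberately absent.
FOLLOWED = {"src", "srcset"}


class ImgScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # Valueless attributes such as uk-img arrive with value None.
            if name in FOLLOWED and value:
                self.found.append((name, value))


html_doc = """
<img alt="Bertolt Brecht" data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" uk-img>
<img src="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg" alt="Ray Bradbury"/>
"""

scanner = ImgScanner()
scanner.feed(html_doc)
print(scanner.found)
# [('src', 'https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg')]
```

Only the Bradbury image is picked up; the Brecht image, referenced solely through data-src, is never seen.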

I have no idea on how to solve this problem.

Unfortunately, I am not aware of a good solution to this issue. My thought would be to do some manual post-processing on the downloaded document source with something like BeautifulSoup to extract any relevant links. But I am not sure whether that could be considered a "good" solution or not.
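
A minimal sketch of that post-processing idea follows. To keep it self-contained it uses the standard library's html.parser instead of BeautifulSoup. The class name LazyImageCollector, the base URL and the shortened sample markup are assumptions of mine, so treat this as a starting point rather than a finished tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LazyImageCollector(HTMLParser):
    """Collects image URLs from data-src / data-srcset attributes of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if not value:
                continue
            if name == "data-src":
                candidates = [value]
            elif name == "data-srcset":
                # Naive split: each comma-separated entry is "URL [descriptor]".
                candidates = [e.split()[0] for e in value.split(",") if e.strip()]
            else:
                continue
            for candidate in candidates:
                # Skip unrendered placeholders such as {{ data.avatar_url }}.
                if "{{" in candidate:
                    continue
                absolute = urljoin(self.base_url, candidate)
                if absolute not in self.urls:
                    self.urls.append(absolute)


html_doc = (
    '<img alt="Bertolt Brecht" '
    'data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" '
    'data-srcset="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg 350w" uk-img>'
)

collector = LazyImageCollector("https://www.inhaltsangabe.de/autoren/brecht")
collector.feed(html_doc)
print(collector.urls)
# ['https://www.inhaltsangabe.de/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg']
```

The collected URLs could then be written to a text file, one per line, and fetched with wget -i. The saved HTML would still need its data-src attributes rewritten by hand for the images to show up offline.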

Thanks for your detailed explanation. I simply ignored the problem, as other sites work fine, but this seems to be a bug worth submitting to the wget creators. — ordinary_python_programmer, Commented Jul 20, 2021 at 19:15
You're welcome. I hope the report you submit gets some traction. — Anaksunaman, Commented Jul 20, 2021 at 23:15
