Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. images, CSS, JS, etc.). I only want the HTML files (basically anything that appears in an <a href="...">).
Google searches are completely useless.
Here's a command I've tried:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36" -A html --domains=www.example.com http://www.example.com
Our site is hybrid flat-PHP and CMS, so HTML "files" could be /path/to/page, /path/to/page/, /path/to/page.php, or /path/to/page.html.
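Since the pages don't share a single extension, I doubt -A html can ever match them all. Here's a rough sketch of what I mean, assuming GNU Wget 1.14 or later (where --accept-regex was added) with its default POSIX regex type; the pattern is just my attempt to express the four path forms above:

# sketch only: accept URLs whose last path segment has no extension,
# or that end in .php or .html (query strings not handled here)
wget -r -E -e robots=off --limit-rate=200k --random-wait \
  --accept-regex '(/[^./?]*|\.php|\.html)$' \
  --domains=www.example.com http://www.example.com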
I've even included -R js,css, but it still downloads the files and THEN rejects them (a pointless waste of bandwidth, CPU, and server load!).
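From what I can tell, --reject-regex (unlike -A/-R) is matched against the URL before the download starts, so something along these lines might skip the assets entirely. Again just a sketch, assuming Wget 1.14+; the extension list is only an example:

# sketch: reject asset-like URLs up front instead of deleting them after download
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E -e robots=off \
  --reject-regex '\.(css|js|png|jpe?g|gif|svg|ico)(\?.*)?$' \
  --domains=www.example.com http://www.example.com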