Using wget to recursively fetch a directory with arbitrary files in it

Question

I have a web directory where I store some config files. I'd like to use wget to pull those files down and maintain their current structure. For instance, the remote directory looks like:

http://mysite.com/configs/.vim/

.vim holds multiple files and directories. I want to replicate that on the client using wget. Can't seem to find the right combo of wget flags to get this done. Any ideas?

waldyrious · Accepted Answer · 2017-10-04 21:53:13Z

1202

You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:

wget --recursive --no-parent http://example.com/configs/.vim/

To avoid downloading the auto-generated index.html files, use the -R/--reject option:

wget -r -np -R "index.html*" http://example.com/configs/.vim/

edited Oct 4, 2017 at 21:53

waldyrious

answered Nov 7, 2008 at 21:55

Paige Ruten

70

add -nH (cuts out hostname) --cut-dirs=X (cuts out X directories). it's a bit annoying to have to manually count directories for X..
– lkraav
Commented Nov 8, 2010 at 21:49
6

Why doesn't any of these work for w3.org/History/1991-WWW-NeXT/Implementation ? It will only download robots.txt
– matteo
Commented Nov 14, 2011 at 18:56
50

@matteo because the robots.txt probably disallows crawling the website. You should add -e robots=off to force crawling.
– gaborous
Commented Dec 16, 2014 at 18:57
7

If you don't want to download the entire content, limit the depth: -l1 downloads just the directory (example.com in your case); -l2 downloads the directory and all level 1 subfolders ('example.com/something' but not 'example.com/something/foo'); and so on. If you give no -l option, wget uses -l 5 automatically. With -l 0 the depth is unlimited, so wget follows every link it finds; it stays on the starting host unless -H/--span-hosts is given, but that can still be a very large download. stackoverflow.com/a/19695143/6785908
– so-random-dude
Commented May 29, 2017 at 11:06
5

why am I always getting an index.html file instead of the directory? wget -r --no-parent -e robots=off http://demo.inspiretheme.com/templates/headlines/images/ This command will only get an index.html file
– shenkwen
Commented Jun 25, 2019 at 20:51

Sri · Accepted Answer · 2011-03-17 06:17:28Z

147

To download a directory recursively while rejecting index.html* files and omitting the hostname and the leading parent directories from the local path:

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
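
To see how -nH and --cut-dirs interact before running anything, here is a rough sketch that predicts the local directory wget would create. wget_local_dir is a made-up helper name, it never calls wget, and it assumes every path component in the URL is a directory:

```shell
# Predict the local directory wget creates for a URL, given --cut-dirs and -nH.
# Hypothetical helper for illustration only; it does not call wget.
# Assumption: every path component of the URL is a directory.
# Usage: wget_local_dir URL CUT_DIRS [nH]
wget_local_dir() {
    url=$1 cut=$2 no_host=${3:-}
    rest=${url#*://}          # drop the scheme
    host=${rest%%/*}          # hostname
    path=${rest#"$host"}      # remaining path, e.g. /dir1/dir2/data
    path=${path#/}
    path=${path%/}
    # Remove the first $cut path components, as --cut-dirs does.
    i=0
    while [ "$i" -lt "$cut" ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   path= ;;
        esac
        i=$((i + 1))
    done
    if [ "$no_host" = "nH" ]; then
        printf '%s\n' "${path:-.}"          # -nH: no hostname directory
    else
        printf '%s\n' "$host${path:+/$path}"
    fi
}
```

For the command above, wget_local_dir "http://mysite.com/dir1/dir2/data" 2 nH prints data, while leaving out nH keeps the hostname directory in front.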

answered Mar 17, 2011 at 6:17

Sri

I can't get this to work: wget -r -nH --cut-dirs=3 --no-parent --reject="index.html*" w3.org/History/1991-WWW-NeXT/Implementation. --cut-dirs=2 doesn't work either. It only downloads robots.txt, which is actually in the root folder. What am I missing?
– matteo
Commented Nov 14, 2011 at 19:04
39

@matteo try adding: -e robots=off
– Paul J
Commented Jan 8, 2012 at 4:04
To recursively obtain all the directories within a directory, use wget -r -nH --reject="index.html*" mysite.io:1234/dir1/dir2
– Prasanth Ganesan
Commented Sep 3, 2019 at 12:50

ma11hew28 · Accepted Answer · 2014-05-06 16:00:16Z

127

For anyone else having similar issues: wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
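
Since ignoring robots.txt puts the burden of politeness on you, a throttled variant is probably a better starting point. This is only a sketch: polite_wget and the WGET override are made-up names, and the 2 second wait and 200k rate limit are arbitrary values to tune for the server in question.

```shell
# Sketch: a throttled wrapper around the robots=off command.
# WGET is a hypothetical override so the command line can be previewed
# with WGET=echo instead of actually contacting the server.
polite_wget() {
    "${WGET:-wget}" --recursive --no-parent \
        -e robots=off \
        --wait=2 --random-wait \
        --limit-rate=200k \
        "$@"
}
```

Run WGET=echo polite_wget http://www.example.com/ to print the full command first, then drop the WGET=echo prefix to run it for real.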

edited May 6, 2014 at 16:00

ma11hew28

answered Nov 22, 2012 at 20:36

Sean Villani

4

When you are ignoring robots.txt you should at least throttle your requests. The behaviour suggested in this answer is highly impolite.
– Nobody
Commented Nov 19, 2019 at 9:22
2

@Nobody So what's the polite answer to this?
– Phani Rithvij
Commented Jan 6, 2020 at 9:33
1

@PhaniRithvij Rate limit your requests, wget has parameters for it. Note that some people might still take issue, and considering the robots file is explicitly telling you that it's not allowed to do what you are currently doing, you might even get into legal trouble.
– Nobody
Commented Jan 6, 2020 at 19:10
I ran into an unhelpful robots.txt file while trying this out, but found a way around it without this option: the files I needed were also hosted on an FTP server, and running wget in mirror mode on the FTP server worked fine.
– Gaurav
Commented Apr 28, 2021 at 4:37

ma11hew28 · Accepted Answer · 2014-05-06 15:59:50Z

47

You should use the -m (--mirror) flag, as it turns on timestamping, so files that have not changed are not downloaded again, and recurses to unlimited depth.

wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it would be:

wget -m -e robots=off --no-parent http://example.com/configs/.vim/

edited May 6, 2014 at 15:59

ma11hew28

answered Feb 24, 2014 at 9:21

SamGoody

esote · Accepted Answer · 2017-03-19 23:11:17Z

41

Here's the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

edited Mar 19, 2017 at 23:11

esote

answered Feb 15, 2013 at 12:26

Erich Eichinger

this didn't download all subdirectories for me
– Tom
Commented May 7, 2021 at 12:40

berezovskyi · Accepted Answer · 2020-12-24 19:56:34Z

18

First of all, thanks to everyone who posted their answers. Here is my "ultimate" wget script to download a website recursively:

wget --recursive ${comment# self-explanatory} \
  --no-parent ${comment# will not crawl links in folders above the base of the URL} \
  --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
  --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
  --no-host-directories ${comment# do not create folders with the domain name} \
  --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
  --level=inf  --accept '*' ${comment# do not limit to 5 levels or common file formats} \
  --reject="index.html*" ${comment# use this option if you need an exact mirror} \
  --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
  "$URL"

Afterwards, you may need to strip the query parameters from URLs like main.css?crc=12324567 and run a local server (e.g. via python3 -m http.server in the directory you just fetched) so that JS works. Please note that the --convert-links option kicks in only after the full crawl has completed.
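
For the query-parameter cleanup, something along these lines may help. It is a sketch, not part of the script above: strip_query_suffix is a made-up name, it only renames files on disk (references inside the downloaded HTML are left untouched), and it assumes file names contain no newlines.

```shell
# Sketch: rename files such as "main.css?crc=12324567" to "main.css".
# strip_query_suffix is a hypothetical helper; it defaults to the current directory.
strip_query_suffix() {
    dir=${1:-.}
    find "$dir" -type f -name '*\?*' | while IFS= read -r file; do
        base=${file##*/}                    # file name without its directory
        target=${file%/*}/${base%%\?*}      # same directory, name cut at the first "?"
        # Skip if the clean name already exists, to avoid overwriting it.
        if [ -e "$target" ]; then
            printf 'skipping %s: %s exists\n' "$file" "$target" >&2
        else
            mv -- "$file" "$target"
        fi
    done
}
```

Call it on the directory you just downloaded into, e.g. strip_query_suffix ./example.com (the directory name here is only an example).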

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.

answered Dec 24, 2020 at 19:56

berezovskyi

1

Thanks, @berezovskyi, this helped me get started really quickly.
– NeilG
Commented May 14, 2023 at 2:53
Would you suggest any update to this command with the -m mirror option, @berezovskyi ?
– NeilG
Commented May 14, 2023 at 3:06
@NeilG I don't recall why I ended up expanding --mirror into --recursive --level=inf, but I think it was interfering with some option (--no-parent or --cut-dirs=0, perhaps)? According to man, --mirror is ... currently equivalent to -r -N -l inf --no-remove-listing.
– berezovskyi
Commented May 15, 2023 at 16:47
Thanks @berezovskyi, I've had opportunity to use it a few different ways since then and I also found I didn't actually want to use the --mirror option - but I can't remember why, either, lol. Just one of the packed options wasn't suitable. Regards.
– NeilG
Commented May 16, 2023 at 22:57
Also --compression=auto may be needed
– Pavel Peřina
Commented Jul 14 at 8:59

user2288008 · Accepted Answer · 2013-05-16 12:39:45Z

11

If --no-parent does not help, you can use the --include option (short for -I/--include-directories).

Directory structure:

http://<host>/downloads/good
http://<host>/downloads/bad

Suppose you want to download the downloads/good directory but not downloads/bad:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

answered May 16, 2013 at 12:39

user2288008

Conor McDermottroe · Accepted Answer · 2008-11-07 21:49:42Z

8

wget -r http://mysite.com/configs/.vim/

works for me.

Perhaps you have a .wgetrc which is interfering with it?
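
One way to check is to list whatever settings are actually active in that file. A minimal sketch, where show_wgetrc_settings is a made-up helper name and ~/.wgetrc is assumed to be the per-user config location:

```shell
# Sketch: list the active (non-comment, non-blank) settings in a wgetrc file.
# show_wgetrc_settings is a hypothetical helper; it defaults to ~/.wgetrc.
show_wgetrc_settings() {
    rc=${1:-$HOME/.wgetrc}
    [ -r "$rc" ] || { echo "no readable wgetrc at $rc"; return 0; }
    # Drop lines that are blank or start with "#" (after optional whitespace).
    grep -Ev '^[[:space:]]*(#|$)' "$rc"
}
```

If something like reclevel or recursive shows up there, that is a likely culprit. Reasonably recent wget versions also accept --no-config, which skips the config files for a single run.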

answered Nov 7, 2008 at 21:49

Conor McDermottroe

jon.bray.eth · Accepted Answer · 2021-09-02 05:20:20Z

It sounds like you're trying to mirror your files. While wget has other uses, such as FTP, a simple mirror should work here. A few considerations will help make sure the download succeeds.

Respect `robots.txt`

Check whether a /robots.txt file in your public_html, www, or configs directory prevents crawling. If it does, instruct wget to ignore it by adding -e robots=off to your command:

wget -e robots=off 'http://your-site.com/configs/.vim/'

Convert remote links to local files.

Additionally, wget must be instructed to convert links into downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is using the mirror command.

Try this:

wget -mpEk 'http://your-site.com/configs/.vim/'

# If robots.txt is present:

wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'

# Good practice to only deal with the highest level directory you specify (instead of downloading all of `mysite.com` you're just mirroring from `.vim`)

wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'

Using -m instead of -r is preferred because it has no maximum recursion depth (plain -r stops at 5 levels by default) and it turns on timestamping. The extra flags round out the mirror: -p downloads all prerequisite files needed to render each page, -E saves HTML documents with an .html extension, and -k converts links to point at the local files. The output should be a preserved directory structure: your configs folder with the .vim directory inside it.

Mirror mode also works with a directory structure that's served over ftp://.

General rule of thumb:

Depending on the size of the site you are mirroring, you may be sending many requests to the server. To avoid being blacklisted or cut off, use the wait options to rate-limit your downloads. Note that --random-wait only varies the delay set by --wait, so give both:

wget -mpEk --no-parent -e robots=off --wait=1 --random-wait 'http://your-site.com/configs/.vim/'

But if you're simply downloading the ../config/.vim/ directory you shouldn't have to worry about it, as you're ignoring parent directories and fetching only a handful of files.

RomSteady · Accepted Answer · 2016-05-08 00:35:15Z

5

To fetch a directory recursively with username and password, use the following command:

wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/

edited May 8, 2016 at 0:35

RomSteady

answered Oct 21, 2014 at 3:32

prayagupadhyay

rkok · Accepted Answer · 2017-10-18 23:31:27Z

5

This version downloads recursively and doesn't create parent directories.

wgetod() {
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

Add to ~/.bashrc or paste into a terminal, then run:
wgetod "http://example.com/x/"
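
If perl is not available, the directory count can probably be done with shell parameter expansion alone. A sketch under the same assumption as the function above (keep only the last path component locally); count_cut_dirs is a made-up name:

```shell
# Sketch: count the directories to pass to --cut-dirs so that only the last
# path component is kept locally. count_cut_dirs is a hypothetical name.
count_cut_dirs() {
    path=${1#*://}                    # drop the scheme
    case $path in
        */*) path=${path#*/} ;;       # drop the hostname
        *)   path= ;;                 # URL has no path at all
    esac
    path=${path%/}                    # ignore one trailing slash
    n=0
    while :; do
        case $path in
            */*) path=${path#*/}; n=$((n + 1)) ;;   # one more leading directory
            *)   break ;;
        esac
    done
    printf '%s\n' "$n"
}
```

The result can then be passed along as --cut-dirs="$(count_cut_dirs "$1")" in place of $NCUT.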

answered Oct 18, 2017 at 23:31

rkok

pr-pal · Accepted Answer · 2019-09-07 15:07:53Z

The following options seem to be the perfect combination when dealing with recursive download:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from man pages for convenience:

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

Jordan Gee · Accepted Answer · 2018-04-09 06:02:11Z

All you need is two flags: "-r" for recursion and "--no-parent" (or -np) so that wget does not ascend to the parent directory (".."). Like this:

wget -r --no-parent http://example.com/configs/.vim/

That's it. It will download into the following local tree: ./example.com/configs/.vim . However, if you do not want the first two directories, add -nH to drop the hostname directory and --cut-dirs=1 to drop configs, along the lines suggested in earlier replies:

wget -r --no-parent -nH --cut-dirs=1 http://example.com/configs/.vim/

And it will download your file tree only into ./.vim/

In fact, I took the first command in this answer straight from the wget manual, which has a very clean example towards the end of section 4.3.

zb226 · Accepted Answer · 2017-08-23 12:47:39Z

2

Wget 1.18 may work better, e.g., I got bitten by a version 1.12 bug where...

wget --recursive (...)

...only retrieves index.html instead of all files.

Workaround was to notice some 301 redirects and try the new location — given the new URL, wget got all the files in the directory.
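
To spot such a redirect without reading the headers by eye, you could filter the server response for Location lines. A sketch: last_location is a made-up helper that reads captured header output on stdin, and the URL in the usage line is a placeholder.

```shell
# Sketch: print the last Location header found in "wget --server-response" output.
# last_location is a hypothetical helper; it reads the captured output on stdin.
last_location() {
    awk 'tolower($1) == "location:" { loc = $2 }
         END { if (loc != "") print loc }'
}
```

Typical use would be wget --spider --server-response http://example.com/dir 2>&1 | last_location, then pointing the recursive command at whatever URL it prints.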

edited Aug 23, 2017 at 12:47

zb226

answered Feb 28, 2017 at 5:42

Devon

Tumelo Mapheto · Accepted Answer · 2020-06-25 22:01:11Z

1

Recursive wget ignoring robots (for websites)

wget -e robots=off -r -np --page-requisites --convert-links 'http://example.com/folder/'

-e robots=off causes it to ignore robots.txt for that domain

-r makes it recursive

-np = no parents, so it doesn't follow links up to the parent folder

answered Jun 25, 2020 at 22:01

Tumelo Mapheto

kasperjj · Accepted Answer · 2008-11-07 21:50:44Z

0

You should be able to do it simply by adding a -r

wget -r http://stackoverflow.com/

answered Nov 7, 2008 at 21:50

kasperjj

9

This doesn't really download a directory, but all files that it can find on the server, including directories above the one you want to download.
– Luc
Commented Mar 20, 2013 at 9:38
