wget is a command-line (CLI) tool designed for exactly this.
- Download wget. The official website only provides source code, so you'll probably want a third-party Windows build (latest version, the EXE, and most likely the x64 one).
- Go to the folder where you downloaded wget.exe and [shift] + [right click] on the background of the folder. Then click "Open PowerShell Window Here".
- Now we can run commands. For example, type
.\wget.exe --help
and press enter. This should print a bunch of text about how to use wget.
Before we keep going, it's important to understand why "download all files in a webpage's directory" is kind of impossible, and how wget manages to do it anyway. On your local computer, you can open a folder and see all the files inside it. There is an HTTP extension for this (WebDAV), but almost every web server has it turned off. Some web servers have a sort-of alternative: they will automatically generate a directory index, which is just a normal HTML page containing links to every file in the directory. If the server in question does this for you, great, but it might not, for a few reasons (a quick way to check is shown right after this list):
- The server admin has them turned off (ex: with the Options -Indexes setting in Apache)
- The folder you are interested in already has a default page set (so you see that instead of the directory listing)
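By the way, you can usually tell whether a server generates an index for a particular directory just by requesting the directory URL itself: if indexes are on, you get back an HTML page of links (wget saves it as index.html by default). The URL below is only a placeholder for the directory you actually care about:
.\wget.exe "https://www.example.com/foo/"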
Ok, so we've established that we need to know the names of files to download them, and we don't have a way to just list all the files in a directory. wget can do something clever, though. It can start at a given page, find everything that page references (images, links, etc.), then find everything those pages reference, and so on. This process is referred to as "crawling" a website, and it's how search engines find things. A nice side effect of how this works is that if the server you are working with does happen to have directory indexes turned on, wget can make use of those (since an index is just a page of links).
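If you want to preview what a crawl would touch before actually downloading anything, wget has a --spider mode that walks the links without saving the files. A minimal dry-run sketch (the URL and the depth of 2 are just example values):
.\wget.exe "https://www.example.com/foo/example.html" --spider --recursive --level=2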
Now we have to write our wget command. wget has a lot of options, because there are a lot of trade-offs when crawling a website. If you crawl too fast, you might overwhelm the server and get banned. If you don't have any conditions for where to stop, you might wind up trying to download the whole internet (though wget does have default settings to prevent that). A politer, rate-limited variant is shown after the breakdown below.
.\wget.exe "https://www.example.com/foo/example.html" --recursive --no-parent --level=5
Breaking this down:
- start at https://www.example.com/foo/example.html
- --recursive - do the crawling thing
- --no-parent - never download (or even look at) a page outside https://www.example.com/foo/
- --level=5 - max out at 5 levels of links deep
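About the "crawl too fast and get banned" trade-off mentioned earlier: wget has built-in throttling flags. Here's a variant of the same command with a pause between requests and a bandwidth cap (the specific numbers are just examples of something reasonably polite, not values the server asked for):
.\wget.exe "https://www.example.com/foo/example.html" --recursive --no-parent --level=5 --wait=1 --random-wait --limit-rate=200k
--wait=1 waits a second between requests, --random-wait varies that delay a bit, and --limit-rate=200k caps the download speed at roughly 200 KB/s.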
That works pretty well if everything is in foo. It sounds like your starting point (example.html) might not be in foo, though. The simple (but inefficient) option is to just let wget download the whole site and delete the directories you don't want afterwards. By default, wget won't look at anything outside the domain (www.example.com) you give it, so this might work well enough for you:
.\wget.exe "https://www.example.com/example.html" --recursive --level=5