
Say a directory on a website has the path https://superuser.xyz/images/, but you don't know this right away. Ordinarily, webmasters don't make sub-folders browsable, i.e. the images folder has no index.html file, so entering the URL directly in Chrome, whether by guessing it or by reading it from an image's path, would simply return an error (typically 403 Forbidden or 404 Not Found).

Also, even if the directory were reachable through an index.html file, right-clicking that webpage and choosing Inspect or View Page Source would let you find the folder and its contents, but you could only save the files one at a time from the Inspect panel, which is inefficient.

In Google Chrome on Windows 10, how do you download the entire contents of an online directory in one batch, rather than one by one?


2 Answers


wget is designed to do this. It's a CLI tool.

  1. Download wget. The official website only provides source code, so you will probably want a third-party Windows build of wget (latest version, EXE, most likely the x64 one).
  2. Go to the folder where you downloaded wget.exe and [Shift] + [right-click] on the background of the folder. Then click "Open PowerShell Window Here".
  3. Now we can run commands. For example, type .\wget.exe --help and press Enter. This should print a bunch of text about how to use wget.
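
As a quick sanity check, asking wget for its version (run from the same folder as wget.exe) should print a version banner rather than an error:

.\wget.exe --version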

Before we keep going, it's important to understand why "download all files in a webpage's directory" is kind of impossible, and how wget manages to do it anyway. On your local computer, you can open a folder and see all the files inside it. HTTP has an extension for this (WebDAV), but almost every web server has it turned off. Some web servers offer a sort-of alternative: they automatically generate a directory index. These automatically generated indexes are just normal HTML pages that contain links to every file in the directory. If the server in question does this for you, great, but it might not, for several reasons:

  • The server admin has them turned off (e.g. with the Options -Indexes setting in Apache)
  • The folder you are interested in already has a default page set (so you see that instead of the directory listing)
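
By the way, if you want to check whether a particular directory URL serves anything at all, wget's --spider option requests the URL without saving anything (the /images/ path here is just the hypothetical one from the question):

.\wget.exe --spider "https://superuser.xyz/images/"

A 200 response means there is something there (a directory index or a default page); a 403 or 404 means the directory is not browsable and you will need the approach described next.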

Ok, so we've established that we need to know the names of files to download them, and we don't have a way to just list all the files in a directory. wget can do something clever, though. It can start at a given page, find everything that page references (images, links to other pages, etc.), then find everything those pages reference, and so on. This process is referred to as "crawling" a website, and it's how search engines find things. A nice side effect of this approach is that if the server you are working with does happen to have directory indexes turned on, wget can make use of those (since an index is just a page of links).

Now we have to write our wget command. wget has a lot of options, because there are a lot of trade-offs when crawling a website. If you crawl too fast you might overwhelm the server and get banned. If you don't have any conditions for where to stop you might wind up trying to download the whole internet (though wget does have default settings to prevent that).

.\wget.exe "https://www.example.com/foo/example.html" --recursive --no-parent --level=5

Breaking this down:

  • start at https://www.example.com/foo/example.html
  • --recursive - do the crawling thing
  • --no-parent - never download (or even look at) a page outside https://www.example.com/foo/
  • --level=5 - max out at 5 pages deep

That works pretty well if everything is in foo. It sounds like your starting point (example.html) might not be in foo though. The simple (but inefficient) option is to just let wget download the whole site and delete the directories you don't want afterwards. By default, wget won't look at anything outside the domain (www.example.com) you give it, so this might work well enough for you:

.\wget.exe "https://www.example.com/example.html" --recursive --level=5

Even without an index file, many servers will generate a directory listing that the browser can display, but the server can also be configured to disallow this.

If the site allows it, you could try this PowerShell solution: How to download a whole folder of files/subfolders from the web in PowerShell

For a solution on Linux, you could try this: CURL to download a directory
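
If you would rather stay in PowerShell without extra tools, here is a minimal sketch of the approach in that first link, assuming the server does expose a browsable index at the URL; the URL and the extension filter below are placeholders, not taken from the question's actual site:

$base  = "https://www.example.com/images/"   # hypothetical directory that serves an index page
$index = Invoke-WebRequest -Uri $base -UseBasicParsing
# keep only links that look like image files, then download each one into the current folder
$files = $index.Links.Href | Where-Object { $_ -match '\.(jpg|jpeg|png|gif)$' }
foreach ($f in $files) {
    $name = Split-Path $f -Leaf
    Invoke-WebRequest -Uri ($base + $name) -OutFile $name
}

This only handles a single flat directory; wget's recursive mode remains the more robust option for nested folders and for resolving relative versus absolute links.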

  • I'm not sure what PowerShell is. If it's the modern equivalent of MS-DOS, I have no idea how it can even interface to the internet. I hope you could explain how that would work.
    – user610620
    Commented Jun 11, 2022 at 7:10
  • Are you asking me how PowerShell can 'interface to the internet', or how to run the script in the link?
    Commented Jun 12, 2022 at 14:24
