11

How would I download a list of files from a file server like this one: http://www.apache.org/dist/httpd/binaries/?

I suppose I could use wget, but then it tries to get all the links and the HTML files as well. Is there a better tool to accomplish this?

  • just to clarify your question: you just want the list of files which could be downloaded from the server, not the files themselves (yet)?
    – akira
    Commented Sep 26, 2009 at 4:57
  • In what way is a command like `wget --no-verbose --spider --no-directories --recursive --level=2 apache.org/dist/httpd/binaries` not working for you? If you could be more specific, that might help.
    Commented Sep 26, 2009 at 5:02

3 Answers

13

You can specify what file extensions wget will download when crawling pages:

wget -r -A zip,rpm,tar.gz www.site.com/startpage.html

This will perform a recursive crawl and only download files with the .zip, .rpm, and .tar.gz extensions.
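
For the directory from the question, a minimal sketch of the same approach (the extension list is an assumption; adjust it to the archives you actually want) could be:

# recurse below /dist/httpd/binaries/ and keep only the matching archives;
# -np (--no-parent) stops wget from climbing above the binaries/ directory
wget -r -np -A zip,tar.gz http://www.apache.org/dist/httpd/binaries/

wget still fetches the index pages to discover links, but with an accept list it normally removes any HTML it downloaded only for crawling.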

8

Supposing you really just want a list of the files on the server without fetching them (yet):

%> wget -r -np --spider http://www.apache.org/dist/httpd/binaries/ 2>&1 | awk -f filter.awk | uniq

where 'filter.awk' looks like this:

# remember the URL ($3 of wget's "--<date> <time>--  URL" line) unless it ends in "/" (a directory)
/^--.*--  http:\/\/.*[^\/]$/ { u=$3; }
# a following "Length:" header means it is a real file: print the remembered URL
/^Length: [[:digit:]]+/ { print u; }

then you may still have to filter out some entries such as

"http://www.apache.org/dist/httpd/binaries/?C=N;O=D"
0

Ref: http://blog.incognitech.in/download-files-from-apache-server-listing-directory/

You can use the following command:

wget --execute="robots = off" --mirror --convert-links --no-parent --wait=5 <website-url>

Explanation of each option (a filled-in example for the directory from the question follows this list):

  • wget: the command-line tool that makes the HTTP requests and downloads the remote files to our local machine.
  • --execute="robots = off": ignores the robots.txt file while crawling through pages. Helpful if you're not getting all of the files.
  • --mirror: This option will basically mirror the directory structure for the given URL. It's a shortcut for -N -r -l inf --no-remove-listing which means:
    • -N: don't re-retrieve files unless newer than local
    • -r: specify recursive download
    • -l inf: maximum recursion depth (inf or 0 for infinite)
    • --no-remove-listing: don't remove '.listing' files
  • --convert-links: make links in downloaded HTML or CSS point to local files
  • --no-parent: don't ascend to the parent directory
  • --wait=5: wait 5 seconds between retrievals, so that we don't hammer the server.
  • <website-url>: the URL of the site to download the files from.
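
As a sketch for the directory from the question (the --accept list is an assumption, and --convert-links is dropped because here we only want the files themselves, not a browsable copy):

# mirror only the archives under /dist/httpd/binaries/, waiting 5 seconds between requests
wget --execute="robots = off" --mirror --no-parent --wait=5 --accept=zip,tar.gz http://www.apache.org/dist/httpd/binaries/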

Happy Downloading :smiley:
