
From time to time I find documentation on the web that I need for offline use on my notebook. Usually I fire up wget and mirror the whole site.

Many projects, however, are now switching to wikis, which means I also download every single page revision and every "edit me" link.

Is there a tool, or some configuration for wget, that lets me download only files without a query string, or only files matching a certain regexp?

Cheers,

By the way: wget has the very useful -k switch, which converts all in-site links to their local counterparts. That would be another requirement. Example: when fetching pages from http://example.com, all links to "/..." or "http://example.com/..." have to be converted to point to their downloaded counterparts.
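For reference, the kind of call I currently run looks roughly like this (flags reproduced from memory, so take it as a sketch):

    wget -r -p -k -np http://example.com/

-r recurses, -p grabs page requisites such as images and CSS, -k rewrites the links, and -np keeps wget from wandering above the starting directory.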

2 Answers


From the wget man page:

-R rejlist --reject rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

This seems like exactly what you need.
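For a typical MediaWiki install the noisy URLs carry parameters like action=edit or oldid=..., so something along these lines might do (the host name and the exact patterns are just placeholders for whatever your wiki generates):

    wget -r -k -np -R '*action=edit*,*oldid=*,*diff=*' http://wiki.example.com/

Note that -R is matched against the file name wget would save. If your wget is new enough (1.14 or later, if I remember correctly), --reject-regex is matched against the complete URL and is a more direct way to drop anything with a query string:

    wget -r -k -np --reject-regex '[?]' http://wiki.example.com/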

Note: to reduce the load on the wiki server, you might want to look at the -w and --random-wait flags.
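Combined with the rejection pattern, a gentler crawl might look like this (the two-second base delay is an arbitrary choice):

    wget -r -k -np -w 2 --random-wait -R '*action=edit*' http://wiki.example.com/

-w sets the base delay between requests, and --random-wait varies it randomly around that value so the crawl is less mechanical.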

  • Cool, I just didn't see this option. Thanks.
    – Boldewyn
    Commented Nov 3, 2009 at 18:36

Most wikis frown on that kind of bulk download, and Wikipedia actively blocks crawlers via robots.txt. I would stick to http://en.wikipedia.org/wiki/Special:Export
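If I remember correctly, you can fetch the current revision of a single page as XML by appending its title to that URL (the page name here is only an example):

    wget 'http://en.wikipedia.org/wiki/Special:Export/Wget'

The form at Special:Export itself also lets you export a whole list of pages in one go.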

  • I know that it is quite stressful for the server, but that is one of the reasons I want to download only the necessary files. Anyway, some projects just don't deliver their pages in any format other than wiki pages.
    – Boldewyn
    Commented Sep 15, 2009 at 20:41
