I would like to crawl a backup of a site I lost access to. The backup lives at subdomain.somesite.com, while the links on the pages point to www.subdomain.com.

This leads to the following situation:

The link http://subdomain.somesite.com/?page_id=number works, but the link in the actual HTML is http://www.subdomain.com/?page_id=number and doesn't work.

Any ideas how to do this without writing a custom crawler?

I have access to www.subdomain.com, which runs on WordPress. One idea is to redirect all pages matching the pattern /?page_id=number.

Example: http://www.subdomain.com/?page_id=255 would redirect to http://subdomain.somedomain/?page_id=255

2 Answers


If your problem is just redirecting requests from www.subdomain.com to subdomain.somesite.com, you can simply use a RewriteRule in Apache (or the equivalent in other web servers). With the [P] (proxy) flag, you can serve the site from the www. domain and let the web server fetch it from the backup site on the fly.
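A minimal sketch of what this could look like in the .htaccess of www.subdomain.com, assuming mod_rewrite (and, for the proxy variant, mod_proxy) is enabled; the hostnames are the ones from the question:

```apache
RewriteEngine On
# Redirect every request to the same path on the backup host.
# The query string (?page_id=255) is appended automatically.
RewriteRule ^(.*)$ http://subdomain.somesite.com/$1 [R=302,L]

# ...or, instead of redirecting, proxy the content transparently
# so visitors keep seeing the www. domain in the address bar:
# RewriteRule ^(.*)$ http://subdomain.somesite.com/$1 [P,L]
```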

If you want to crawl and modify the content, the easiest solution is wget with the mirror option (available on Linux, Windows, ...). Its built-in link-conversion option may be sufficient to turn absolute links into relative ones. Otherwise, just run a search-and-replace tool or a regular expression over the downloaded folder to rewrite the domain.
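The search-and-replace step could look like this (a sketch using sed on Linux; the hostnames are from the question, and `mirror/page.html` is just a sample file standing in for the downloaded folder):

```shell
# Create a sample downloaded page containing a dead absolute link:
mkdir -p mirror
printf '<a href="http://www.subdomain.com/?page_id=255">p</a>\n' > mirror/page.html

# Rewrite the dead domain to the working backup domain in every HTML file:
find mirror -name '*.html' \
  -exec sed -i 's|http://www\.subdomain\.com|http://subdomain.somesite.com|g' {} +

cat mirror/page.html
# prints: <a href="http://subdomain.somesite.com/?page_id=255">p</a>
```

Note that `sed -i` without a suffix argument is GNU sed syntax; on BSD/macOS you would write `sed -i ''`.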


GNU wget can do it. The option -r is for recursive download, and -k converts the links. See the man page for more information.
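Put together, the invocation might look like this (a sketch, assuming the backup host from the question is still reachable; -p additionally fetches the images and CSS each page needs):

```shell
# -r  crawl recursively
# -k  convert links in the saved pages so they work locally
# -p  download page requisites (images, stylesheets)
wget -r -k -p http://subdomain.somesite.com/
```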
