I would like to crawl a backup of a site I lost access to. The backup lives at subdomain.somesite.com, while the links on the pages point to www.subdomain.com.

This leads to the following situation:

The link http://subdomain.somesite.com/?page_id=number works, but the link in the actual HTML is http://www.subdomain.com/?page_id=number and doesn't work.

Any ideas how to do this without writing a custom crawler?

I have access to www.subdomain.com, which runs on WordPress. One idea is to redirect all pages matching the pattern /?page_id=number.

Example: http://www.subdomain.com/?page_id=255 would redirect to http://subdomain.somedomain/?page_id=255

2 Answers


If your problem is just redirecting requests from www.subdomain.com to subdomain.somesite.com, you can simply use a RewriteRule in Apache (or the equivalent in other web servers). With the [P] (proxy) flag, you can serve the site from the www. domain and let the web server fetch it from the backup site on the fly.
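A minimal sketch of what this could look like in the .htaccess of www.subdomain.com, assuming mod_rewrite (and, for the proxy variant, mod_proxy) is enabled; the hostnames are the ones from the question:

```apache
RewriteEngine On
# Redirect every request to the same path on the backup host.
# The query string (?page_id=255) is appended automatically.
RewriteRule ^(.*)$ http://subdomain.somesite.com/$1 [R=302,L]

# ...or, instead of redirecting, proxy the content transparently
# so visitors keep seeing the www. domain in the address bar:
# RewriteRule ^(.*)$ http://subdomain.somesite.com/$1 [P,L]
```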

If you want to crawl and modify the content, the easiest solution is wget with the mirror option (available on Linux, Windows, ...). Its built-in link-conversion option may be sufficient to turn absolute links into relative ones. Otherwise, just run a search-and-replace tool or a regular expression over the downloaded folder to rewrite the domain.
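The search-and-replace step could look like this (a sketch using sed on Linux; the hostnames are from the question, and `mirror/page.html` is just a sample file standing in for the downloaded folder):

```shell
# Create a sample downloaded page containing a dead absolute link:
mkdir -p mirror
printf '<a href="http://www.subdomain.com/?page_id=255">p</a>\n' > mirror/page.html

# Rewrite the dead domain to the working backup domain in every HTML file:
find mirror -name '*.html' \
  -exec sed -i 's|http://www\.subdomain\.com|http://subdomain.somesite.com|g' {} +

cat mirror/page.html
# prints: <a href="http://subdomain.somesite.com/?page_id=255">p</a>
```

Note that `sed -i` without a suffix argument is GNU sed syntax; on BSD/macOS you would write `sed -i ''`.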


GNU wget can do it. The option -r is for recursive download, and -k converts the links. See the man page for more information.
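Put together, the invocation might look like this (a sketch, assuming the backup host from the question is still reachable; -p additionally fetches the images and CSS each page needs):

```shell
# -r  crawl recursively
# -k  convert links in the saved pages so they work locally
# -p  download page requisites (images, stylesheets)
wget -r -k -p http://subdomain.somesite.com/
```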
