
Since about mid-August 2014, several Google servers have been downloading all of the (very) large binary files on my web site, about once a week. The IPs all show as owned by Google, and look like this: google-proxy-66-249-88-199.google.com. These are GET requests, and they are greatly affecting my server traffic.

Prior to this, I didn't see any traffic from these Google proxy IPs, so this seems to be something relatively new. I do see all kinds of traffic from other Google IPs, but all of it is googlebot making HEAD requests only.

I wouldn't be worried about this except that all of these files are being downloaded by Google about every week or so. The bandwidth used is starting to get excessive.

I've speculated that since many of these files are Windows executables, perhaps Google is downloading them to perform malware scans. Even if that's true, does that really need to happen every week?

Example traffic from Google proxy IPs in November so far:

google-proxy-64-233-172-95.google.com: 8.09 GB
google-proxy-66-102-6-104.google.com: 7.50 GB
google-proxy-66-249-83-245.google.com: 3.35 GB
google-proxy-66-249-84-131.google.com: 1.54 GB
google-proxy-66-249-83-131.google.com: 4.98 GB
google-proxy-66-249-83-239.google.com: 2.48 GB
google-proxy-66-249-88-203.google.com: 2.94 GB
google-proxy-66-249-88-201.google.com: 2.58 GB
google-proxy-66-249-88-199.google.com: 4.89 GB

Update #1: I forgot to mention that the files in question are already in the site's robots.txt file. To make sure the robots.txt configuration is working properly, I also used the robots.txt tester in Google Webmaster Tools, which shows that the files are definitely being blocked for all Google bots, with one exception: Adsbot-Google. I'm not sure what that's about either. I also searched Google for some of the files, and they do NOT appear in search results.
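For reference, the blocking entries in my robots.txt look something like this (a simplified sketch; /downloads/ stands in for the actual folder holding the binaries):

 # Hypothetical robots.txt excerpt: keep crawlers out of the binaries folder
 User-agent: *
 Disallow: /downloads/

As I understand it, Google's AdsBot ignores the User-agent: * wildcard unless it is named explicitly, which would explain the Adsbot-Google exception in the tester.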

Update #2: Example: between 5:12am and 5:18am PST on November 17, about half a dozen IPs (all google-proxy) did GETs on all of the binary files in question, 27 in total. On November 4 between 2:09pm and 2:15pm PST, those same IPs did basically the same thing.

Update #3: At this point it seems clear that although these are valid Google IPs, they are part of Google's proxy service, and not part of Google's web crawling system. Because these are proxy addresses, there's no way to determine where the GET requests are actually originating, or whether they are coming from one place or many. Based on the sporadic nature of the GETs, it doesn't appear that there is anything nefarious going on; it's likely just someone deciding to download all the binaries while using Google's proxy service. Unfortunately, that service seems to be completely undocumented, which doesn't help. From a site administrator's standpoint, proxies are rather annoying. I don't want to block them, because they have legitimate uses. But they can also be misused.
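If I do end up blocking them, the simplest option is probably per-host denies in the .htaccess of the folder containing the binaries. A minimal sketch, assuming Apache 2.2-era syntax and using two of the proxy hostnames from my logs (/downloads/ again stands in for the real folder):

 # Hypothetical /downloads/.htaccess: deny the google-proxy hosts seen in the logs
 # Hostname-based Deny forces a double reverse DNS lookup on each request
 Order Allow,Deny
 Allow from all
 Deny from google-proxy-66-249-88-199.google.com
 Deny from google-proxy-64-233-172-95.google.com

The obvious weakness is that this is a blacklist of individual hosts; whoever is downloading could simply come back through a different proxy IP.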

  • Good question. I up-voted it! You will want to block these using robots.txt for sure. Why Google is downloading executables is beyond me. Your theory seems like a good one, but because of the frequency, I am not sure. It seems rather strange. These do appear to be valid Googlebot IP addresses, though I do not have google-proxy-66-102-6-104.google.com in my list.
    – closetnoc
    Commented Nov 21, 2014 at 16:57
  • I forgot to mention that the files in question are already in the site's robots.txt file. See Update #1 above.
    – boot13
    Commented Nov 21, 2014 at 17:11
  • You got me confused. I have a contractor expected any minute now, so I will have to think about this. Google has been doing funny things with their domain names and IP address allocations, and there has been some overlap between various Google services, including hosting and others, where people's bots can appear on Google IP address space. However, I have not seen them using Googlebot IP address space. I wish Google would allocate clear space for the various search processes, with little or no overlap, so that security systems can properly trust these IP addresses.
    – closetnoc
    Commented Nov 21, 2014 at 17:22

1 Answer


I did some research for this question and found some interesting things, such as:

1. Is it a fake crawler? -> https://stackoverflow.com/questions/15840440/google-proxy-is-a-fake-crawler-for-example-google-proxy-66-249-81-131-google-c

Conclusion from the user:

These 'crawlers' are not crawlers but are part of the live website preview used in the Google search engine.

I have tried this, viewing one of my websites in the preview, and yes, there it is: I received a blocked-IP message.

If you want users to be able to view a preview of your website, you have to accept these 'crawlers'.

Like others said: "the root domain of that URL is google.com and that can't be easily spoofed".

Conclusion: You can trust these bots or crawlers; they are used to show a preview in Google search.

We know the live preview is not downloading your files, so let's jump to question 2.

2. Is it part of Google services? -> Is this Google proxy a fake crawler: google-proxy-66-249-81-131.google.com?

Conclusion:

I think some people are using Google services (like Google Translate, Google Mobile, etc.) to access blocked websites (in schools, etc.), but also for DoS attacks and similar activity.

My guess on this is the same as the above: someone is trying to use a Google service, such as Translate, to access your files.

If, as you say, the files are already being blocked by the robots.txt, this can only be a manual request.

EDIT: To address the OP's comment more extensively:

Can crawlers ignore the robots.txt? Yes, crawlers can, but I don't think Google does, which means these requests may be coming from other bots using Google proxies.

Can it be a bad bot? Yes, and for that I recommend:

.htaccess banning:

 RewriteEngine On
 # Ban a specific IP range (dots escaped, since they are regex metacharacters)
 RewriteCond %{REMOTE_HOST} ^209\.133\.111\. [OR]
 # Or ban by User-Agent substring
 RewriteCond %{HTTP_USER_AGENT} Spider [OR]
 RewriteCond %{HTTP_USER_AGENT} Slurp
 # Send matching requests to a harmless page instead
 RewriteRule ^.*$ X.html [L]

This code can ban IPs or user agents.

Or use a spider trap.
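If you do decide to ban, you could also scope the rules to just the binary files rather than the whole site, as closetnoc suggests in the comments. A minimal sketch, assuming the proxy hostnames all start with google-proxy- and the binaries are .exe/.zip/.msi files (adjust the extensions to match yours):

 RewriteEngine On
 # Match clients whose resolved hostname starts with "google-proxy-".
 # Note: REMOTE_HOST only holds a hostname if Apache resolves it
 # (HostnameLookups On); otherwise it falls back to the client IP
 # and this condition never matches.
 RewriteCond %{REMOTE_HOST} ^google-proxy- [NC]
 # Forbid only the binaries (extensions assumed), not the whole site
 RewriteRule \.(exe|zip|msi)$ - [F,L]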

I keep my opinion that this is a manual request.

  • I saw those answers as well, but they didn't seem to address my specific issue. You may be right that Google Proxy is being somehow misused, in which case I'll most likely block it completely, which is kind of lame. My understanding of robots.txt is that crawler software can choose to ignore it. Friendly bots are supposed to honour it, and most do, but proxies are (I guess) different.
    – boot13
    Commented Nov 21, 2014 at 17:37
  • @boot13 Be careful though. These are valid Googlebot IP addresses. So if you do block them, block them only for these files. Assuming that you use Apache, you should be able to do this with .htaccess. But that might cause other problems, so make sure you pay attention to Google Webmaster Tools for messages.
    – closetnoc
    Commented Nov 21, 2014 at 17:46
  • @boot13 I've updated my answer. Can you check whether the accesses are made at the same day/hour or are random?
    – nunorbatista
    Commented Nov 21, 2014 at 18:44
  • @nunorbatista: they seem random. I've updated my question with some times.
    – boot13
    Commented Nov 21, 2014 at 20:00
  • @nunorbatista: see Update #3 above. It's not Googlebot or any other crawler, it's Google's proxy service. It's not related to Google's live site preview. It looks like one or more people just downloaded the binaries via Google Proxy, perhaps to get around a local block or restriction. The Spider trap suggestion is unlikely to help since the traffic is apparently not a bot. I would like to block Google Proxy IPs from accessing the folder containing the binaries; I'll try using the htaccess code, but of course the downloader could always switch to another proxy, so it may be pointless.
    – boot13
    Commented Nov 30, 2014 at 12:27
