Since about mid-August 2014, several Google servers have been downloading all of the (very) large binary files on my web site, about once a week. The IPs all show as owned by Google, and look like this: google-proxy-66-249-88-199.google.com. These are GET requests, and they are greatly affecting my server traffic.
Prior to this, I didn't see any traffic from these Google proxy IPs, so this seems to be something relatively new. I do see all kinds of traffic from other Google IPs, but all of it is Googlebot traffic, and HEAD requests only.
I wouldn't be worried about this except that all of these files are being downloaded by Google about every week or so. The bandwidth used is starting to get excessive.
I've speculated that since many of these files are Windows executables, perhaps Google is downloading them to perform malware scans. Even if that's true, does that really need to happen every week?
Example traffic from google proxy IPs in November so far:
google-proxy-64-233-172-95.google.com: 8.09 GB
google-proxy-66-102-6-104.google.com: 7.50 GB
google-proxy-66-249-83-245.google.com: 3.35 GB
google-proxy-66-249-84-131.google.com: 1.54 GB
google-proxy-66-249-83-131.google.com: 4.98 GB
google-proxy-66-249-83-239.google.com: 2.48 GB
google-proxy-66-249-88-203.google.com: 2.94 GB
google-proxy-66-249-88-201.google.com: 2.58 GB
google-proxy-66-249-88-199.google.com: 4.89 GB
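For a sense of scale, the figures above can be totalled with a short script. This is only a sketch: the report text is pasted from the list above, and the function name `parse_report` is my own invention for illustration. It also assumes the dashed numbers in each host name are the proxy's IP address.

```python
import re

# Per-host traffic figures copied from the list above.
traffic_report = """\
google-proxy-64-233-172-95.google.com: 8.09 GB
google-proxy-66-102-6-104.google.com: 7.50 GB
google-proxy-66-249-83-245.google.com: 3.35 GB
google-proxy-66-249-84-131.google.com: 1.54 GB
google-proxy-66-249-83-131.google.com: 4.98 GB
google-proxy-66-249-83-239.google.com: 2.48 GB
google-proxy-66-249-88-203.google.com: 2.94 GB
google-proxy-66-249-88-201.google.com: 2.58 GB
google-proxy-66-249-88-199.google.com: 4.89 GB
"""

# Matches "google-proxy-A-B-C-D.google.com: N.NN GB".
LINE_RE = re.compile(
    r"^google-proxy-(\d+)-(\d+)-(\d+)-(\d+)\.google\.com: ([\d.]+) GB$"
)


def parse_report(text):
    """Return a list of (ip, gigabytes) tuples, one per matching line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Assumption: the dashed octets in the host name are the IP.
            ip = ".".join(match.group(1, 2, 3, 4))
            rows.append((ip, float(match.group(5))))
    return rows


rows = parse_report(traffic_report)
total_gb = sum(gb for _, gb in rows)
print(f"{len(rows)} proxy hosts, {total_gb:.2f} GB total")
# prints: 9 proxy hosts, 38.35 GB total
```

So the nine proxy hosts account for roughly 38 GB in the month so far.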
Update #1: I forgot to mention that the files in question are already in the site's robots.txt file. To make sure the robots.txt configuration is working properly, I also used the robots.txt tester in Google Webmaster Tools, which shows that the files are definitely being blocked for all Google bots, with one exception: AdsBot-Google. I'm not sure what that's about either. I also searched Google for some of the files, and they do NOT appear in search results.
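The robots.txt rules can also be checked locally with Python's standard `urllib.robotparser`. This is a sketch, not my actual configuration: the `/downloads/` path, the file name `setup.exe`, and the rules themselves are hypothetical stand-ins. As far as I understand Google's crawler documentation, AdsBot ignores wildcard (`*`) groups and only obeys a group that names it explicitly, which may explain the tester result above; the second group below is written on that assumption. Note that Python's parser applies `*` to every agent, so it does not reproduce that AdsBot behavior; it only confirms that the rules parse as intended.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /downloads/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/

User-agent: AdsBot-Google
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical file path used only for this check.
binary_path = "/downloads/setup.exe"

for agent in ("Googlebot", "AdsBot-Google"):
    allowed = parser.can_fetch(agent, binary_path)
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

# Pages outside the disallowed directory remain crawlable.
print("index:", parser.can_fetch("Googlebot", "/index.html"))
```

Keep in mind that robots.txt is only advisory. It tells well-behaved crawlers what to skip, but does nothing to stop a client that simply requests the files.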
Update #2: Example: between 5:12am and 5:18am PST on November 17, about half a dozen IPs (all google-proxy) did GETs on all of the binary files in question, 27 in total. On November 4 between 2:09pm and 2:15pm PST, those same IPs did basically the same thing.
Update #3: At this point it seems clear that although these are valid Google IPs, they are part of Google's proxy service, and not part of Google's web crawling system. Because these are proxy addresses, there's no way to determine where the GET requests are actually originating, or whether they are coming from one place or many. Based on the sporadic nature of the GETs, it doesn't appear that there is anything nefarious going on; it's likely just someone deciding to download all the binaries while using Google's proxy service. Unfortunately, that service seems to be completely undocumented, which doesn't help. From a site administrator's standpoint, proxies are rather annoying. I don't want to block them, because they have legitimate uses. But they can also be misused.