
I'm building an application that detects plagiarized answers on Stack Overflow, so I need to retrieve the content of answers programmatically.

I know I can do this using the Stack Exchange API, but the API uses rate-limiting/throttling to prevent abuse.
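
For context, this is roughly the API call I'd be making - a minimal sketch against the /2.3/answers route with the built-in withbody filter, sleeping whenever the response includes a backoff field:

```python
import time

import requests

API_URL = "https://api.stackexchange.com/2.3/answers"

def fetch_answers(page=1, page_size=100):
    """Fetch one page of answers; the 'withbody' filter adds each answer's body."""
    resp = requests.get(API_URL, params={
        "site": "stackoverflow",
        "page": page,
        "pagesize": page_size,
        "filter": "withbody",
    })
    resp.raise_for_status()
    data = resp.json()
    # The API can ask clients to pause via a 'backoff' field (in seconds).
    if "backoff" in data:
        time.sleep(data["backoff"])
    return data["items"]

for answer in fetch_answers():
    print(answer["answer_id"], len(answer["body"]))
```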

I was considering just making normal HTTP page requests and scraping those rather than going through the API, but I was wondering if page requests are also rate-limited? For example, if I make 5 page requests a second for 30 seconds, would my IP address start to get significantly rate-limited? Is it possible that my IP address would even incur a permanent ban?

I found the following comment by Martin Smith on The Complete Rate-Limiting Guide mentioning page-request rate limiting,

There's also a limit for page requests per IP address per time period which I think must have recently been tightened (saw it twice yesterday) but I don't know what the exact limit is.

but other than that I didn't find anything definitive and authoritative about this. Are the exact details intentionally kept a secret to make it harder for malicious entities to DDoS Stack Exchange?

1 Answer


Yes, there's a rate limit. No, the details aren't public (and are subject to change). Yes, if you're hammering the site you'll get blocked for some period of time, and if you keep it up that time may become "indefinite".

The ideal way to do this is to download the data dump and read that as fast as your machine allows. The next-best method is to use the API.

If you have to scrape the site, then be polite about it - you're not Google. A request every few seconds is probably reasonable for some small number of requests, but if you do get blocked then back off for a while - if you blindly keep making requests after being blocked, you're gonna stay blocked.

Also, don't do this from your shared work network - if your network admin comes complaining that we've blocked your office, we'll probably give 'em enough details to figure out it was you who peed in the pool.

This also assumes you're making light-weight requests; if you're hitting an expensive route, all bets are off - give up and hit the data dump like you should've done from the start.
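
To make "polite" concrete, here's a rough sketch of the kind of pacing I mean - the three-second delay and the exponential backoff are illustrative guesses, not published limits:

```python
import time

import requests

def polite_get(url, base_delay=3.0, max_retries=5):
    """Fetch one page, pacing requests and backing off if we appear to be blocked."""
    delay = base_delay
    for _ in range(max_retries):
        # Identify yourself; the contact address here is a placeholder.
        resp = requests.get(url, headers={
            "User-Agent": "plagiarism-checker (contact: you@example.com)",
        })
        if resp.status_code == 200:
            time.sleep(base_delay)  # a request every few seconds, not a flood
            return resp.text
        # 429/503 generally mean "slow down" - back off instead of retrying blindly.
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f"Still blocked after {max_retries} attempts: {url}")
```

And for the data-dump route, Posts.xml can be streamed row by row rather than loaded whole - a minimal sketch, assuming the standard dump layout where PostTypeId="2" marks answers:

```python
import xml.etree.ElementTree as ET

# Stream Posts.xml from the data dump; PostTypeId="2" marks answers.
for _, row in ET.iterparse("Posts.xml"):
    if row.tag == "row" and row.get("PostTypeId") == "2":
        answer_id, body = row.get("Id"), row.get("Body")
        # ... hand (answer_id, body) to the plagiarism checker ...
    row.clear()  # release each element so the whole file never sits in memory
```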
