Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
- 1. Rate-Limiting at Scale
SANS AppSec Las Vegas 2012
Nick Galbreath @ngalbreath nickg@etsy.com
- 2. Who is Etsy? Who is Nick?
• “Marketplace for Small Creative
Businesses”
• Alexa says #51 for USA traffic
• > $500MM transaction volume last year
• Billions and Billions of page views
• Nick Galbreath Director of Engineering
focusing on Security, Fraud, and other fun
stuff
- 3. What’s a Rate Limit?
Maximum number of events
per (brief) period per user
after which the resource is denied.
e.g. “no more than 2 logins per minute”
- 5. Robots gone Wild
• Robots / Crawlers (not always an intended
DDoS)
• 20,000 items in shopping cart
• spam attack!
• Can crush sites very quickly, at almost no
cost. Especially when crawl generates load
or writes to the database
- 6. Humans are Resources too
• Rate limits needed for anything that gets
reviewed by humans such as customer
service requests.
• CRMs are typically bad at dealing with
spammy stuff
- 7. Anything Involving Money
• Without rate limits on credit card
authorizations your site becomes a card
skimmer site.
• Using a website is much easier than going
to the gas station pump or other
anonymous card reader
- 9. Do Rate Limits Stop all Fraud? No, but...
• Eliminates false positives and punks
• Allows you to focus on more sophisticated
attacks
• Protects against damaging bursts of activity
(malicious or not)
- 10. Rate Limits are needed on anything that depends on an external resource
This is almost everything!
- 12. Continuous Rate Limits
• Store user identifier, event-type, timestamp
• Allows easy rate-limits for multiple ranges
• Allows easy cross-event limits
• Easy to implement in SQL
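A minimal sketch of the continuous approach, assuming an SQLite event log. The table name, column names, and helper functions are hypothetical and not taken from the talk:

```python
import sqlite3
import time

def make_store():
    """Create an in-memory event log (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (user_id TEXT, event_type TEXT, ts REAL)")
    db.execute("CREATE INDEX idx_events ON events (user_id, event_type, ts)")
    return db

def record_event(db, user_id, event_type, now=None):
    """Store one (user, event-type, timestamp) row."""
    now = time.time() if now is None else now
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (user_id, event_type, now))

def count_recent(db, user_id, event_type, period, now=None):
    """Count events in the sliding window (now - period, now]."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE user_id = ? AND event_type = ? AND ts > ?",
        (user_id, event_type, now - period),
    ).fetchone()
    return row[0]
```

Because every event is kept, the same table answers "how many in the last minute" and "how many in the last hour" by changing only `period`.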
- 13. [Figure: checks evaluated against sliding 10m and 25m windows]
- 16. Ouch!
• At scale, this is really painful for databases
to handle.
• Constant binary-tree index churn
• Use in-memory database (or run off
ramdisk) if trying this out
- 17. Quantized Rate Limits
• Stores a count in a time-window or bucket.
• Map current time to a bucket
• bucket = (int) (NOW() / period), e.g.
NOW() / 3600 gives the hour bucket.
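The time-to-bucket mapping can be sketched as follows, assuming Unix timestamps in seconds. The function name `bucket_id` is illustrative:

```python
import time

def bucket_id(period, now=None):
    """Map a Unix timestamp to the index of its fixed-size time bucket."""
    now = time.time() if now is None else now
    return int(now // period)
```

All timestamps inside the same `period`-second window share one bucket index, so a counter per bucket is enough.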
- 18. Quantized time isn’t exact
[Figure: consecutive 10m buckets (bucket-123, bucket-124, bucket-125, ...); checks near bucket boundaries read counts such as "2?", "4", and "0?"]
- 19. Direct Lookup
• Everything is a primary key lookup.
userid-event-period-bucketid
60min: “nickg-login-3600-5589007547”
10min: “nickg-login-600-33534045284”
• Multiple time-frames require multiple
buckets, which means multiple inserting and
checking.
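A sketch of building the lookup keys in the userid-event-period-bucketid layout above. The helper names are hypothetical, and the bucket ids it produces depend on the clock, so they will not match the sample keys on the slide:

```python
def bucket_key(user_id, event, period, now):
    """Build the primary key for one user/event/period bucket."""
    return f"{user_id}-{event}-{period}-{int(now // period)}"

def ladder_keys(user_id, event, periods, now):
    """One key per time-frame; each must be incremented and checked."""
    return [bucket_key(user_id, event, p, now) for p in periods]
```

Tracking both a 10-minute and a 60-minute limit therefore costs two increments and two reads per event.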
- 20. Quantized RL Accuracy
Not exact.
If you set N per Period, quantized rate-limits
may go as high as:
(N-1) x 2 per Period.
e.g. 10 per minute --> 18 per minute
Yikes. Maths!
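The worst case can be reproduced with a small simulation. It assumes the limiter denies the N-th event in a bucket, so each bucket admits N-1 events. That reading is an assumption chosen because it matches the slide's "10 per minute --> 18 per minute" figure:

```python
def allowed_events(timestamps, period, per_bucket):
    """Return the timestamps a quantized limiter lets through,
    admitting at most `per_bucket` events in each fixed bucket."""
    counts = {}
    allowed = []
    for ts in timestamps:
        b = int(ts // period)
        if counts.get(b, 0) < per_bucket:
            counts[b] = counts.get(b, 0) + 1
            allowed.append(ts)
    return allowed

def max_in_sliding_window(timestamps, period):
    """Largest number of events inside any window of length `period`.
    Expects `timestamps` sorted in ascending order."""
    best = 0
    for i, start in enumerate(timestamps):
        n = sum(1 for ts in timestamps[i:] if ts < start + period)
        best = max(best, n)
    return best

# A burst straddling the boundary between two 60-second buckets:
# one attempt per second from t=50 to t=69.
attempts = [50 + i for i in range(20)]
passed = allowed_events(attempts, period=60, per_bucket=9)
```

Here 9 events pass at the end of the first bucket and 9 more at the start of the next, so `max_in_sliding_window(passed, 60)` sees 18 events within a single minute.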
- 22. Rate-Limits at Scale
• We traded exact accuracy and flexibility for
scaling.
• Implementation using Memcache or Redis
(and perhaps SQL)
set nickg-login-60-212331231 += 1
• Well known sharding techniques
• Auto-expiration of old buckets
• Each set/get takes 1/10 of a millisecond or
less. Almost invisible.
- 23. Memory
• Say 256 bytes per bucket
• 10,000,000 buckets is a lot of buckets
• But that is only about 2.5 GB, and fixed
• This is easy on one machine.
- 25. Please write unit tests!
• Easy to get wrong, and consequences can
be unpleasant
• Edge cases and race conditions
• memcache doesn’t have an “insert or
increment” operation. Need to do
multiple steps and check error
conditions.
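One way to sketch the multi-step "insert or increment" is shown below. Memcached does provide separate `add` (store only if absent) and `incr` commands; `FakeCache` is a hypothetical in-process stand-in for a client so the example runs without a server, and it ignores expiry:

```python
class FakeCache:
    """In-process stand-in with memcache-like add/incr semantics."""

    def __init__(self):
        self.data = {}

    def add(self, key, value, expire=0):
        # Store only if the key is absent; report whether it was stored.
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta):
        # Return the new value, or None if the key does not exist.
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

def insert_or_increment(cache, key, ttl):
    """Record one event and return the bucket's count afterwards."""
    count = cache.incr(key, 1)
    if count is not None:
        return count
    if cache.add(key, 1, expire=ttl):
        return 1
    # Another writer created the key between our incr and add: retry incr.
    count = cache.incr(key, 1)
    if count is None:
        raise RuntimeError("bucket vanished between add and incr")
    return count
```

The failed-`add` branch is the race condition worth unit testing: two requests can both miss on `incr`, and only one of them wins the `add`.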
- 26. Please make an API
• Make it simple for anyone to add rate
limiting to their code.
• Make it one line
// event, period, max events
if (rate_limit_exceed("signin", 60, 5)) {
// do something
}
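A Python sketch of such a one-line API, assuming a quantized bucket counter. The slide's call takes no user argument (presumably the current user is implicit); `user_id` is passed explicitly here, and the module-level dict is a hypothetical stand-in for a shared cache that would also expire old buckets:

```python
import time

_counts = {}  # hypothetical in-process store; old buckets are never expired here

def rate_limit_exceeded(user_id, event, period, max_events, now=None):
    """Record one event and report whether the user is over the limit."""
    now = time.time() if now is None else now
    key = f"{user_id}-{event}-{period}-{int(now // period)}"
    _counts[key] = _counts.get(key, 0) + 1
    return _counts[key] > max_events
```

A caller then needs only `if rate_limit_exceeded(user, "signin", 60, 5): ...` at the point where the event happens.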
- 27. Rollout
• Once in production start with guesstimates
on rate limits
• If rate limit is triggered, take no action and
only log/graph
• Does volume match expectations?
• Wash, Rinse, Repeat until tuned
appropriately
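The log-only phase can be sketched as a flag on the enforcement step. The function name, the `dry_run` flag, and the logger name are illustrative assumptions:

```python
import logging

log = logging.getLogger("ratelimit")

def enforce(exceeded, event, user_id, dry_run=True):
    """Return True if the request should actually be blocked."""
    if not exceeded:
        return False
    # Always log the hit so volume can be graphed against expectations.
    log.warning("rate limit hit: event=%s user=%s dry_run=%s",
                event, user_id, dry_run)
    # During rollout (dry_run=True) the hit is recorded but nothing is denied.
    return not dry_run
```

Flipping `dry_run` to `False` once the logged volume matches expectations turns observation into enforcement without touching call sites.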
- 28. oh yeah, don’t forget
Put your
rate-limit
datastore
behind the
firewall
- 29. So a user hit a rate limit. Now what?
A dialog with product, customer service, and engineering:
• Do you let them know? (visible indicator)
• Do you start CAPTCHA-ing?
• Do you black hole it? (silent)
Also keep logging and graphing. You’ll need these
to debug when things go awry.
- 31. I feel bad if I don’t use a graph in a presentation
[Graph with labels: CAPTCHA, Etsy API]
- 32. How we do it
• We use Graphite for real-time graphing
http://graphite.wikidot.com/
• We use StatsD as our API
http://etsy.me/dQwVXi
https://github.com/etsy/statsd
• Our apps do this
StatsD::increment('signins');
UDP based -- can’t break the application
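What an increment looks like on the wire can be sketched in Python. StatsD counters are plain-text datagrams of the form `name:value|c`, and 8125 is StatsD's default port; the function names and the localhost default are assumptions for illustration:

```python
import socket

def statsd_counter_packet(name, value=1):
    """Encode a StatsD counter increment as a datagram payload."""
    return f"{name}:{value}|c".encode("ascii")

def statsd_increment(name, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one counter increment over UDP."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(statsd_counter_packet(name), (host, port))
    except OSError:
        pass  # metrics must never break the request path
```

Since UDP needs no connection and no reply, a missing or overloaded StatsD daemon costs the application nothing beyond a dropped packet.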
- 33. Division Built-in!
Combine, Mix and Match data in Graphite to
discover new insights.
Raw counts are seasonal data: hard to alert on.
But the ratio of two series is nearly constant:
easy to alert on.
Who knew that “1 in 5 logins is a failure”
is universal?!
p.s. Holt-Winters exponential smoothing is also built in
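The ratio idea can be sketched outside Graphite as well. The expected ratio of 0.2 comes from the "1 in 5" observation above; the tolerance and the function names are assumptions:

```python
def failure_ratio(failures, total):
    """Fraction of attempts that failed (0.0 when there is no traffic)."""
    return failures / total if total else 0.0

def ratio_alert(failures, total, expected=0.2, tolerance=0.1):
    """Alert when the failure ratio drifts away from its usual value."""
    return abs(failure_ratio(failures, total) - expected) > tolerance
```

Quiet and busy periods produce very different raw counts but the same ratio, so one threshold works at any traffic level.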
- 35. Laddering
• Use laddering to do rate limits at different
time scales for the same event.
• Set a short period and high rate to prevent
bursts
• Then set a longer period with lower rate to
prevent slow-crawling robots.
- 36. Ladder longer periods to have a smaller rate
Negative example:
2 per Minute ( ~0.033 events per sec )
or 2x60 = 120 per Hour
so laddering with
300 per Hour (~ 0.083 events per sec)
does nothing, but
100 per Hour (~ 0.028) is good.
oh no! the maths again!
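The arithmetic above can be checked mechanically. A small sketch, with hypothetical helper names, where each rung is a (max events, period in seconds) pair:

```python
def events_per_second(max_events, period):
    """Sustained rate a limit allows, in events per second."""
    return max_events / period

def rung_is_effective(short, long):
    """Each rung is (max_events, period_seconds). The longer rung only
    adds protection if its sustained rate is below the shorter rung's."""
    return events_per_second(*long) < events_per_second(*short)
```

With the numbers from the slide, `rung_is_effective((2, 60), (300, 3600))` is false because the 2-per-minute limit already caps an hour at 120 events, while `rung_is_effective((2, 60), (100, 3600))` is true.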
- 37. In Pictures...
Rate limit of “3 per 1 box” - ok
Rate Limit 5 per 3 boxes -- alert! (good)
but, say, rate limit 100 per 3 boxes does nothing
and is impossible to trigger
- 39. Anonymous Users
• hash of (IP + appropriate HTTP headers)
• order of headers matters
different browsers order them differently
• Spoofed user agents don’t always get the
order right
Different type of
Anonymous User
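A sketch of one possible fingerprint for anonymous users. The talk does not say which headers count as "appropriate" or which hash is used, so the field list, SHA-256, and the function name are all assumptions:

```python
import hashlib

def anonymous_id(ip, headers, fields=("User-Agent", "Accept-Language")):
    """Hash the client IP, the order of header names, and selected values.

    `headers` is a list of (name, value) pairs in the order received.
    """
    h = hashlib.sha256()
    h.update(ip.encode("utf-8"))
    # Header order is itself a signal: browsers differ, spoofers often slip.
    order = ",".join(name.lower() for name, _ in headers)
    h.update(b"\x00" + order.encode("utf-8"))
    lookup = {name.lower(): value for name, value in headers}
    for field in fields:
        h.update(b"\x00" + lookup.get(field.lower(), "").encode("utf-8"))
    return h.hexdigest()
```

Two requests with identical header values but a different header order produce different ids, which is the property the bullets above rely on.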
- 40. Rate Limit Every IP?
• Probably just Class C (/24) networks (only ~16M of them)
• Maybe useful for just alerting
• Probably need whitelisting (e.g. AOL)
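Grouping addresses by their /24 network can be sketched with Python's standard `ipaddress` module. This handles IPv4 only, and the function name and whitelist format are assumptions:

```python
import ipaddress

def network_key(ip, whitelist=()):
    """Return the /24 network to count against, or None if whitelisted."""
    addr = ipaddress.ip_address(ip)
    # Shared egress ranges (large ISPs, proxies) are exempted up front.
    if any(addr in ipaddress.ip_network(w) for w in whitelist):
        return None
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))
```

The returned string can stand in for the user identifier in a bucket key, so all 256 addresses in a network share one counter.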
- 41. Rate Limit Datacenters
http://github.com/client9/ipcat
Datacenter / Rent-A-Slice / “hands not on
keyboard” / leaseable CPU and network
How much traffic is coming
from them?
- 43. • Almost every action on Etsy has a laddered
rate limit
• We learn the hard way what is not limited
• Virtually no performance impact at scale
• Should we open source the driver?