Rate Limiting at Scale, from SANS AppSec Las Vegas 2012

Rate-Limiting
at Scale
SANS AppSec Las Vegas 2012
Nick Galbreath @ngalbreath nickg@etsy.com

Who is Etsy? Who is Nick?
• “Marketplace for Small Creative
Businesses”
• Alexa says #51 for USA traffic
• > $500MM transaction volume last year
• Billions and Billions of page views
• Nick Galbreath Director of Engineering
focusing on Security, Fraud, and other fun
stuff

What’s a Rate Limit?

Maximum number of events
per (brief) period per user
after which the resource is denied.

e.g. “no more than 2 logins per minute”

Robots gone Wild
• Robots / Crawlers (not always an intended
DDoS)
• 20,000 items in shopping cart
• spam attack!
• Can crush sites very quickly, at almost no
cost. Especially when crawl generates load
or writes to the database

Humans are Resources too

• Rate limits needed for anything that gets
reviewed by humans such as customer
service requests.
• CRMs are typically bad at dealing with
spammy stuff

Anything Involving
Money
• Without rate limits on credit card
authorizations your site becomes a card
skimmer site.
• Using a website is much easier than going
to the gas station pump or other
anonymous card reader

Other Behaviors

• Password Changes
• Password resets
• Credit card / email / bank info

Do Rate Limits Stop all
Fraud? No, but...
• Eliminates false positives and punks
• Allows you to focus on more sophisticated
attacks
• Protects against damaging bursts of activity
(malicious or not)

Rate Limits are needed
on anything that
depends on an external
resource
This is almost everything!

Continuous Rate Limits

• Store user identifier, event-type, timestamp
• Allows easy rate-limits for multiple ranges
• Allows easy cross-event limits
• Easy to implement in SQL

[Figure: event timeline with one check covering the last 25 minutes and two checks each covering 10 minutes]

Continuous RL Schema

Check if your database timestamps store
microseconds or not. You want ‘em.
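
The schema slide itself did not survive extraction, so the following is only a minimal sketch of what a continuous rate-limit table could look like, using Python's built-in sqlite3. The table name, column names, and helper functions are assumptions made for illustration. Timestamps are stored as fractional seconds so sub-second resolution is kept.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory table of (user, event type, timestamp) rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events ("
        " user_id TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " ts REAL NOT NULL)"  # seconds since epoch; the fraction keeps microseconds
    )
    db.execute("CREATE INDEX events_lookup ON events (user_id, event_type, ts)")
    return db


def record_event(db, user_id, event_type, now=None):
    """Store one event for this user."""
    now = time.time() if now is None else now
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (user_id, event_type, now))


def count_recent(db, user_id, event_type, period, now=None):
    """Count this user's events of one type in the last `period` seconds."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT COUNT(*) FROM events"
        " WHERE user_id = ? AND event_type = ? AND ts > ?",
        (user_id, event_type, now - period),
    ).fetchone()
    return row[0]
```

Because every event is kept, the same table answers "how many in the last 10 minutes" and "how many in the last 25 minutes" with one query each, which is the flexibility the bullets describe.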

Ouch!

• At scale, this is really painful for databases
to handle.
• Constant binary-tree index churn
• Use in-memory database (or run off
ramdisk) if trying this out

Quantized Rate Limits
• Stores a count in a time-window or bucket.
• Map current time to a bucket
• (int) (NOW()/period), e.g.
NOW()/3600 gives the hour bucket.
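
As a small sketch of the mapping above (the function name `bucket_id` is made up here), integer division of the current Unix time by the period yields the bucket number:

```python
import time


def bucket_id(period, now=None):
    """Map a Unix timestamp (seconds) to the number of its bucket."""
    now = time.time() if now is None else now
    return int(now // period)
```

Every timestamp inside the same period-sized slice of time maps to the same integer, so a counter keyed on that integer covers exactly one bucket.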

Quantized time isn’t exact

[Figure: timeline divided into 10-minute buckets (bucket-123, bucket-124, bucket-125, ...), with checks at different points reading counts of “2?”, “4”, and “0?”]

Direct Lookup

• Everything is a primary key lookup.
userid-event-period-bucketid
60min: “nickg-login-3600-5589007547”
10min: “nickg-login-600-33534045284”

• Multiple time-frames require multiple
buckets, which means multiple inserting and
checking.
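
A sketch of building the lookup keys in the userid-event-period-bucketid form shown above. The helper names are hypothetical, and the bucket numbers it produces are simply Unix time divided by the period, so they will not match the illustrative numbers on the slide.

```python
def bucket_key(user_id, event, period, now):
    """Build one primary-key string: userid-event-period-bucketid."""
    return f"{user_id}-{event}-{period}-{int(now // period)}"


def bucket_keys(user_id, event, periods, now):
    """One key per time-frame; each must be incremented and checked separately."""
    return [bucket_key(user_id, event, period, now) for period in periods]
```

For example, `bucket_keys("nickg", "login", [3600, 600], now)` returns two keys, one for the hourly bucket and one for the 10-minute bucket, which is the "multiple inserting and checking" cost the slide mentions.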

Quantized RL Accuracy
Not exact.
If you set N per Period, quantized rate-limits
may go as high as:
(N-1) x 2 per Period.
e.g. 10 per minute --> 18 per minute
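
The worst case can be reproduced with a tiny simulation. This is a sketch under one assumption taken from the formula: an event is denied once its bucket's count reaches N, so each bucket admits N-1 events. The function name and the burst timings are made up for illustration.

```python
from collections import defaultdict


def simulate(limit, period, event_times):
    """Return the timestamps a quantized limiter lets through.

    Assumption: an event is denied once its bucket count reaches `limit`,
    so each bucket admits limit - 1 events.
    """
    counts = defaultdict(int)
    allowed = []
    for t in event_times:
        bucket = int(t // period)
        counts[bucket] += 1
        if counts[bucket] < limit:
            allowed.append(t)
    return allowed


# A burst just before the minute boundary, then another just after it.
burst = [59.0 + i * 0.01 for i in range(20)] + [60.0 + i * 0.01 for i in range(20)]
allowed = simulate(10, 60, burst)
print(len(allowed), "events allowed within", round(max(allowed) - min(allowed), 2), "seconds")
```

With a limit of 10 per minute, 9 events pass at the end of one bucket and 9 more at the start of the next, so 18 events get through in a little over one second.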

Yikes. Maths!

In Pictures
Rate Limit is “10”

[Figure: two adjacent buckets, each with 9 events (OK); a window straddling the boundary sees 18 events. Ooops.]

Rate-Limits at Scale
• We traded exact accuracy and flexibility for
scaling.
• Implementation using Memcache or Redis
(and perhaps SQL)
set nickg-login-60-212331231 += 1

• Well known sharding techniques
• Auto-expiration of old buckets
• Each set/get takes 1/10 of a millisecond
or less. Almost invisible.

Memory

• Say 256 bytes per bucket
• 10,000,000 buckets is a lot of buckets
• But that is only about 2.5 GB, and fixed
• This is easy on one machine.

Please write unit tests!

• Easy to get wrong, and consequences can
be unpleasant
• Edge cases and race conditions
• memcache doesn’t have an “insert or
increment” operation. Need to do
multiple steps and check error
conditions.
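
One common way to handle the missing "insert or increment" is: try to increment, and if the key is missing try to add it, and if the add loses a race, increment again. The sketch below runs against `FakeCache`, a hypothetical in-process stand-in written for this example. It mimics memcached's semantics (add stores only if the key is absent, incr fails on a missing key) but is not a real client library.

```python
class FakeCache:
    """Hypothetical in-process stand-in for a memcache client.

    add() stores only if the key is absent; incr() returns None when the
    key is missing. The ttl is accepted but ignored in this stand-in.
    """

    def __init__(self):
        self.data = {}

    def add(self, key, value, ttl):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]


def increment_bucket(cache, key, ttl):
    """Insert-or-increment in multiple steps; returns the new count."""
    count = cache.incr(key)
    if count is not None:
        return count
    if cache.add(key, 1, ttl):
        return 1
    # Lost a race: another client created the key between our incr and add.
    count = cache.incr(key)
    if count is None:
        raise RuntimeError("bucket vanished between add and incr: " + key)
    return count
```

The lost-race branch is exactly the kind of edge case worth a unit test, since it only shows up under concurrent load.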

Please make an API
• Make it simple for anyone to add rate
limiting to their code.
• Make it one line
// event, period, max events
if (rate_limit_exceed("signin", 60, 5)) {
// do something
}

Rollout
• Once in production, start with guesstimates
on rate limits
• If rate limit is triggered, take no action and
only log/graph
• Does volume match expectations?
• Wash, Rinse, Repeat until tuned
appropriately
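
A log-only rollout can be as small as one flag. In this sketch the function name, logger name, and `enforce` flag are all made up for illustration: a triggered limit is always logged, but the request is only blocked once enforcement is switched on.

```python
import logging

log = logging.getLogger("ratelimit")


def should_block(exceeded, event, user_id, enforce=False):
    """Log every triggered limit; block only when enforcement is enabled."""
    if not exceeded:
        return False
    log.warning("rate limit hit: event=%s user=%s enforce=%s", event, user_id, enforce)
    return enforce
```

Running with `enforce=False` first lets you compare the logged volume against expectations before any user is actually denied.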

oh yeah, don’t forget
Put your
rate-limit
datastore
behind the
firewall

So a user hit a rate
limit. Now what?
a dialog with product, customer service and engineering

• Do you let them know? (visible indicator)
• Do you start CAPTCHA-ing?
• Do you black hole it? (silent)
Also keep logging and graphing. You’ll need these
to debug when things go awry.

I feel bad if I don’t use a
graph in a presentation
[Graph: CAPTCHA]

Etsy API

How we do it
• We use Graphite for real-time graphing
http://graphite.wikidot.com/
• We use StatsD as our API
http://etsy.me/dQwVXi
https://github.com/etsy/statsd
• Our apps do this
StatsD::increment('signins');
UDP based -- can’t break the application
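
To show why UDP "can't break the application", here is a minimal Python sketch of a counter increment in StatsD's plain-text wire format (`name:value|c`). The function names are made up, and the host and port are assumptions (8125 is StatsD's usual default). The packet is fire-and-forget, and any socket error is swallowed.

```python
import socket


def statsd_packet(metric, value=1):
    """Build a StatsD counter packet, e.g. b"signins:1|c"."""
    return f"{metric}:{value}|c".encode("ascii")


def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP and never raise on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(metric), (host, port))
    except OSError:
        pass  # metrics must never take the application down
    finally:
        sock.close()


statsd_increment("signins")
```

There is no connection setup and no reply to wait for, so a slow or dead metrics server costs the caller nothing.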

Division Built-in!
Combine, Mix and Match data in Graphite to
discover new insights.
Seasonal data.
Hard to alert on

But ratio of them is
nearly constant.
Easy to alert on.

Who knew that “1 in 5 logins
are failures” is universal?!

p.s. Holt-Winters exponential smoothing is also built in

Laddering

• Use laddering to do rate limits at different
time scales for the same event.
• Set a short period and high rate to prevent
bursts
• Then set a longer period with lower rate to
prevent slow-crawling robots.

Ladder longer periods
to have a smaller rate
Negative example:
2 per Minute ( ~0.033 events per sec )
or 2x60 = 120 per Hour
so laddering with

300 per Hour (~ 0.083 events per sec)
does nothing, but
100 per Hour (~ 0.028) is good.
oh no! the maths again!
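
The maths can be checked mechanically. This sketch (the function name is made up) takes a ladder as (max events, period in seconds) pairs and verifies that each longer period allows a strictly lower per-second rate than the one before it. It compares by cross-multiplying integers to avoid floating-point rounding.

```python
def ladder_is_effective(ladder):
    """True if every longer period allows a strictly lower rate than the shorter one.

    ladder: list of (max_events, period_seconds) pairs.
    """
    steps = sorted(ladder, key=lambda step: step[1])
    return all(
        # later_max / later_period < earlier_max / earlier_period, in integers
        later_max * earlier_period < earlier_max * later_period
        for (earlier_max, earlier_period), (later_max, later_period)
        in zip(steps, steps[1:])
    )


print(ladder_is_effective([(2, 60), (300, 3600)]))  # the negative example
print(ladder_is_effective([(2, 60), (100, 3600)]))  # the good example
```

The first call reports the ladder as ineffective, because 300 per hour can never trigger while 2 per minute is enforced; the second reports it as effective.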

In Pictures...
Rate limit of “3 per 1 box” - ok

Rate Limit 5 per 3 boxes -- alert! (good)
but, say, rate limit 100 per 3 boxes does nothing
and is impossible to trigger

Anonymous Users
• hash of (IP + appropriate HTTP headers)
• order of headers matters
different browsers order them differently
• Spoofed user agents don’t always get the
order right
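
A sketch of such an identifier, assuming the headers arrive as (name, value) pairs in the order the client sent them. Which headers count as "appropriate" is not specified in the talk; the ones used in the usage note below are only examples, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def anonymous_id(ip, headers):
    """Hash the IP plus selected headers, preserving the order they were sent in.

    headers: list of (name, value) pairs in wire order.
    """
    digest = hashlib.sha256()
    digest.update(ip.encode("utf-8"))
    for name, value in headers:
        digest.update(b"\n")
        digest.update(name.lower().encode("utf-8"))
        digest.update(b":")
        digest.update(value.encode("utf-8"))
    return digest.hexdigest()
```

Calling it with `[("User-Agent", "X"), ("Accept-Language", "en")]` and with the same pairs reversed gives two different identifiers, so a client that spoofs a browser's user agent but sends headers in a different order does not blend in with real users of that browser.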

Different type of
Anonymous User

Rate Limit Every IP?

• Probably just per /24 block (“Class C”; only 16M of them)
• Maybe useful for just alerting
• Probably need whitelisting (e.g. AOL)
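
A sketch of per-/24 keying with a whitelist check, using Python's standard ipaddress module. The function names are hypothetical, and the addresses in the usage note come from the reserved documentation ranges, not from any real whitelist.

```python
import ipaddress


def class_c_key(ip):
    """Collapse an IPv4 address to its /24 block, e.g. 192.0.2.77 -> 192.0.2.0."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def is_whitelisted(ip, whitelist):
    """True if the address falls inside any whitelisted network (CIDR strings)."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in whitelist)
```

With 24 bits of prefix there are 2^24 (about 16.8 million) possible keys, which is where the "only 16M" figure comes from. A caller would skip the limit when `is_whitelisted(ip, ["198.51.100.0/24"])` is true and otherwise count against `class_c_key(ip)`.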

Rate Limit Datacenters
http://github.com/client9/ipcat

Datacenter / Rent-A-Slice / “hands not on
keyboard” / leaseable CPU and network

How much traffic is coming
from them?

http://github.com/client9/ipcat

No implication of wrongdoing if on the list
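
A sketch of checking an address against a list of datacenter ranges. The input format here, a list of (first IP, last IP, name) tuples, is an assumption made for this example and is not necessarily how the ipcat project lays out its data. The class name and the sample ranges (taken from reserved documentation address space) are made up.

```python
import bisect
import ipaddress


class DatacenterIndex:
    """Look up which (non-overlapping) IPv4 range an address falls into."""

    def __init__(self, ranges):
        # ranges: iterable of (first_ip, last_ip, name) tuples
        self.rows = sorted(
            (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), name)
            for first, last, name in ranges
        )
        self.starts = [row[0] for row in self.rows]

    def lookup(self, ip):
        """Return the range's name, or None if the address is not listed."""
        number = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(self.starts, number) - 1
        if i >= 0 and self.rows[i][0] <= number <= self.rows[i][1]:
            return self.rows[i][2]
        return None


index = DatacenterIndex([
    ("192.0.2.0", "192.0.2.255", "ExampleHost"),
    ("198.51.100.0", "198.51.100.127", "OtherHost"),
])
```

A hit from `index.lookup(ip)` could feed a separate counter or a stricter ladder for "hands not on keyboard" traffic, rather than an outright block.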

• Almost every action on Etsy has a laddered
rate limit
• We learn the hard way what is not limited
• Virtually no performance impact at scale
• Should we open source the driver?

Nick Galbreath nickg@etsy.com @ngalbreath
SANS AppSec Las Vegas 2012
