Rate-Limiting
  at Scale
        SANS AppSec Las Vegas 2012
 Nick Galbreath @ngalbreath nickg@etsy.com
Who is Etsy? Who is Nick?
• “Marketplace for Small Creative
  Businesses”
• Alexa says #51 for USA traffic
• > $500MM transaction volume last year
• Billions and Billions of page views
• Nick Galbreath Director of Engineering
  focusing on Security, Fraud, and other fun
  stuff
What’s a Rate Limit?

   Maximum number of events
     per (brief) period per user
after which the resource is denied.

e.g. “no more than 2 logins per minute”
Why?
Robots gone Wild
• Robots / Crawlers (not always an intended
  DDoS)
  • 20,000 items in shopping cart
  • spam attack!
• Can crush sites very quickly, at almost no
  cost. Especially when crawl generates load
  or writes to the database
Humans are Resources too

  • Rate limits needed for anything that gets
    reviewed by humans such as customer
    service requests.
  • CRMs are typically bad at dealing with
    spammy stuff
Anything Involving
         Money
• Without rate limits on credit card
  authorizations your site becomes a card
  skimmer site.
• Using a website is much easier than going
  to the gas station pump or other
  anonymous card reader
Other Behaviors

• Password Changes
• Password resets
• Credit card / email / bank info
Do Rate Limits Stop all
   Fraud? No, but...
• Eliminates false positives and punks
• Allows you to focus on more sophisticated
  attacks
• Protects against damaging bursts of activity
  (malicious or not)
Rate Limits are needed
   on anything that
depends on an external
       resource
     This is almost everything!
Implementation
Continuous Rate Limits

• Store user identifier, event-type, timestamp
• Allows easy rate-limits for multiple ranges
• Allows easy cross-event limits
• Easy to implement in SQL
[Diagram: timeline of checks, one labeled 25m and two labeled 10m]
Continuous RL Schema




 Check if your database timestamps store
  microseconds or not. You want ‘em.
Continuous RL Queries
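A minimal sketch of what a continuous rate-limit schema and query could look like, assuming SQLite through Python's sqlite3 module. The table name, column names, and helper functions are hypothetical and are not taken from Etsy's implementation. Timestamps are stored as fractional Unix seconds so that sub-second ordering survives.

```python
import sqlite3

# Hypothetical schema: one row per event (user identifier, event type, timestamp).
SCHEMA = """
CREATE TABLE IF NOT EXISTS rate_events (
    user_id    TEXT NOT NULL,
    event_type TEXT NOT NULL,
    ts         REAL NOT NULL   -- Unix time with fractional seconds
);
CREATE INDEX IF NOT EXISTS idx_rate_events
    ON rate_events (user_id, event_type, ts);
"""

def record_event(db, user_id, event_type, now):
    """Store one event."""
    db.execute("INSERT INTO rate_events VALUES (?, ?, ?)", (user_id, event_type, now))

def count_recent(db, user_id, event_type, period, now):
    """Count events in the sliding window (now - period, now]."""
    row = db.execute(
        "SELECT COUNT(*) FROM rate_events "
        "WHERE user_id = ? AND event_type = ? AND ts > ? AND ts <= ?",
        (user_id, event_type, now - period, now),
    ).fetchone()
    return row[0]

# In-memory database, in the spirit of the "Ouch!" slide's advice.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
for t in (0.0, 30.0, 700.0):
    record_event(db, "nickg", "login", t)
```

Because every event is kept, the same table answers any window length: `count_recent(db, "nickg", "login", 600, 900.0)` and `count_recent(db, "nickg", "login", 1500, 900.0)` query two different ranges without extra bookkeeping.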
Ouch!



• At scale, this is really painful for databases
  to handle.
• Constant B-tree index churn
• Use in-memory database (or run off
  ramdisk) if trying this out
Quantized Rate Limits
• Stores a count in a time-window or bucket.
• Map current time to a bucket
•   (int) (NOW()/period), e.g.
    NOW()/3600 gives the hour bucket.
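The time-to-bucket mapping above can be sketched in Python as follows. The function name `bucket_id` is an assumption made for illustration; the arithmetic is the integer division shown on the slide.

```python
import time

def bucket_id(period, now=None):
    """Map a Unix timestamp to the index of its fixed-width time bucket."""
    if now is None:
        now = time.time()
    return int(now // period)
```

All timestamps from 3600 to 7199 land in hour bucket 1, and 7200 starts hour bucket 2.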
Quantized time isn’t exact


[Diagram: consecutive 10m buckets (bucket-123, bucket-124, bucket-125, ...); checks landing in different buckets report counts of 2?, 4, and 0?]
Direct Lookup

• Everything is a primary key lookup.
  userid-event-period-bucketid
  60min: “nickg-login-3600-5589007547”
  10min: “nickg-login-600-33534045284”

• Multiple time-frames require multiple
  buckets, which means multiple inserting and
  checking.
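A small sketch of building the lookup keys in the `userid-event-period-bucketid` layout described above, one key per time-frame. The helper names are hypothetical.

```python
def bucket_key(user_id, event, period, now):
    """Build the primary key for one user, event, and period at time `now`."""
    bucket = int(now // period)
    return f"{user_id}-{event}-{period}-{bucket}"

def keys_for_periods(user_id, event, periods, now):
    """One key per time-frame; each must be incremented and checked separately."""
    return [bucket_key(user_id, event, p, now) for p in periods]
```

With two time-frames, a single login produces two keys, which is the "multiple inserting and checking" cost the slide mentions.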
Quantized RL Accuracy
 Not exact.
 If you set N per Period, quantized rate-limits
 may let through as many as:
      2 x (N-1) per Period.
 e.g. 10 per minute --> up to 18 in one minute

      Yikes. Maths!
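A short simulation of the worst case, under the assumption (consistent with the slide's numbers) that the limit trips on the N-th event in a bucket, so N-1 events per bucket are allowed. The `simulate` function and the timestamps are invented for illustration.

```python
from collections import defaultdict

def simulate(timestamps, period, limit):
    """Return the timestamps allowed by a quantized limit that trips at `limit`."""
    counts = defaultdict(int)
    allowed = []
    for t in timestamps:
        bucket = int(t // period)
        counts[bucket] += 1
        if counts[bucket] < limit:   # the limit-th event in a bucket is denied
            allowed.append(t)
    return allowed

# 9 events at the end of minute 0, then 9 at the start of minute 1.
burst = list(range(51, 60)) + list(range(60, 69))
allowed = simulate(burst, period=60, limit=10)
```

All 18 events pass even though they fall within 17 seconds of each other, because each bucket only ever sees 9.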
In Pictures
 Rate Limit is “10”


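[Diagram: 9 events at the end of one bucket (OK) and 9 at the start of the next (OK) add up to 18 events inside a single period. Ooops.]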
Rate-Limits at Scale
• We traded exact accuracy and flexibility for
  scaling.
• Implementation using Memcache or Redis
  (and perhaps SQL)
  set nickg-login-60-212331231 += 1

• Well known sharding techniques
• Auto-expiration of old buckets
• Each set/get takes 1/10 of a millisecond or
  less. Almost invisible.
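A toy sketch of two of the properties listed above: counters that expire on their own, and a deterministic key-to-shard mapping. `BucketStore` and `shard_for` are hypothetical stand-ins written for this example; they are not the Memcache or Redis API, and CRC32-modulo is only one of several common sharding schemes.

```python
import zlib

class BucketStore:
    """Toy stand-in for Memcache/Redis: counters that expire after a TTL."""

    def __init__(self):
        self._data = {}  # key -> (count, expires_at)

    def incr(self, key, ttl, now):
        """Increment a counter, starting from zero if it is missing or stale."""
        count, expires_at = self._data.get(key, (0, now + ttl))
        if expires_at <= now:            # stale bucket: start over
            count, expires_at = 0, now + ttl
        count += 1
        self._data[key] = (count, expires_at)
        return count

    def purge(self, now):
        """Drop expired buckets and return how many remain."""
        self._data = {k: v for k, v in self._data.items() if v[1] > now}
        return len(self._data)

def shard_for(key, shard_count):
    """Pick a shard index from the key alone, so every client agrees."""
    return zlib.crc32(key.encode("utf-8")) % shard_count
```

Since a bucket key is never touched again once its time window has passed, expiring it shortly after the window ends keeps memory bounded.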
Memory

• Say 256 bytes per bucket
• 10,000,000 buckets is a lot of buckets
• But that is only about 2.5 GB, and fixed
• This is easy on one machine.
Usage
Please write unit tests!

• Easy to get wrong, and consequences can
  be unpleasant
• Edge cases and race conditions
  • memcache doesn’t have an “insert or
    increment” operation. Need to do
    multiple steps and check error
    conditions.
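One way the multi-step "insert or increment" could look. `FakeMemcache` is a hypothetical in-memory stand-in written for this sketch; it mimics memcached's semantics, where `add` succeeds only if the key is absent and `incr` fails if the key is absent. A real client would also set an expiry on `add`, and its method signatures may differ.

```python
class FakeMemcache:
    """In-memory stand-in mimicking memcached's add/incr semantics."""

    def __init__(self):
        self.data = {}

    def add(self, key, value):
        """Store only if the key does not exist yet."""
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        """Increment an existing counter; None signals a missing key."""
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

def insert_or_increment(client, key, retries=3):
    """Try incr first, fall back to add, and retry if another writer wins."""
    for _ in range(retries):
        count = client.incr(key)
        if count is not None:
            return count
        if client.add(key, 1):
            return 1
        # Lost a race: someone else added the key first; loop and incr again.
    raise RuntimeError("could not update counter for " + key)
```

The retry loop is where the unit tests earn their keep: the interesting cases are a key that appears between `incr` and `add`, and a key that expires between `add` and the next `incr`.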
Please make an API
  • Make it simple for anyone to add rate
    limiting to their code.
  • Make it one line
// event, period, max events
if (rate_limit_exceed("signin", 60, 5)) {
    // do something
}
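A Python sketch of what could sit behind such a one-line call. The function name follows the slide; the `user_id` and `now` parameters, the in-process dictionary, and the choice that the call pushing the count past `max_events` is the first to return True are all assumptions made for illustration. Production code would use a shared store such as Memcache or Redis.

```python
import time

_counts = {}  # hypothetical in-process store, for illustration only

def rate_limit_exceed(event, period, max_events, user_id="anonymous", now=None):
    """Count one event and return True once the bucket holds more than max_events."""
    if now is None:
        now = time.time()
    key = f"{user_id}-{event}-{period}-{int(now // period)}"
    _counts[key] = _counts.get(key, 0) + 1
    return _counts[key] > max_events
```

Usage then stays a single line at the call site: `if rate_limit_exceed("signin", 60, 5, user_id="nickg"): ...`.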
Rollout
• Once in production, start with guesstimates
  for the rate limits
• If rate limit is triggered, take no action and
  only log/graph
• Does volume match expectations?
• Wash, Rinse, Repeat until tuned
  appropriately
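The log-only phase described above could be sketched with a simple enforcement flag. The function name, the logger name, and the `enforce` parameter are hypothetical.

```python
import logging

log = logging.getLogger("ratelimit")

def handle_limit(event, user_id, exceeded, enforce=False):
    """Log every triggered limit; block the request only when enforcing."""
    if not exceeded:
        return False
    log.warning("rate limit hit: event=%s user=%s enforce=%s", event, user_id, enforce)
    return enforce
```

Leaving `enforce=False` during rollout produces the log and graph data needed to compare volume against expectations before any user is actually blocked.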
oh yeah, don’t forget
  Put your
  rate-limit
 datastore
 behind the
   firewall
So a user hit a rate
         limit. Now what?
a dialog with product, customer service and engineering

     • Do you let them know? (visible indicator)
     • Do you start CAPTCHA-ing?
     • Do you black hole it? (silent)
     Also keep logging and graphing. You’ll need these
              to debug when things go awry.
Intermission
I feel bad if I don’t use a
 graph in a presentation
    CAPTCHA

              Etsy API
How we do it
• We use Graphite for real-time graphing
  http://graphite.wikidot.com/
• We use StatsD as our API
  http://etsy.me/dQwVXi
  https://github.com/etsy/statsd
• Our apps do this
  StatsD::increment('signins');
  UDP based -- can’t break the application
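A rough Python equivalent of that increment, assuming a StatsD daemon listening on its default UDP port 8125 on localhost. The function names are hypothetical; the `name:value|c` counter format is the StatsD wire format.

```python
import socket

def statsd_packet(metric, value=1):
    """Encode a StatsD counter update."""
    return f"{metric}:{value}|c".encode("ascii")

def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Fire-and-forget counter increment over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(metric), (host, port))
    except OSError:
        pass  # metrics must never break the application
    finally:
        sock.close()
```

Because UDP needs no connection and no reply, a slow or absent StatsD daemon costs the application nothing beyond a lost data point.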
Division Built-in!
       Combine, Mix and Match data in Graphite to
                 discover new insights.
 Seasonal data is
hard to alert on.

But the ratio of the two
 series is nearly constant,
 and easy to alert on.

           Who knew that “1 in 5 logins
           fails” is universal?!

  p.s. Holt-Winters exponential smoothing is also built in
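The same idea outside Graphite, as a small Python sketch (Graphite itself offers a `divideSeries` function for this). The helper names are hypothetical and the counts are made up purely for illustration: volume swings tenfold while the ratio stays put, until the last point.

```python
def ratio_series(failed, total):
    """Point-by-point ratio; None where the denominator is zero."""
    return [f / t if t else None for f, t in zip(failed, total)]

def out_of_band(ratios, low, high):
    """Indexes whose ratio falls outside [low, high]."""
    return [i for i, r in enumerate(ratios) if r is not None and not (low <= r <= high)]

# Made-up counts, purely illustrative.
failed = [20, 200, 40, 90]
total = [100, 1000, 200, 150]
alerts = out_of_band(ratio_series(failed, total), 0.1, 0.3)
```

A fixed band around the ratio catches the anomaly in the last sample, whereas a fixed threshold on the raw failure count would fire on the busy second sample instead.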
Ok back to
rate-limiting
Laddering

• Use laddering to do rate limits at different
  time scales for the same event.
• Set a short period and high rate to prevent
  bursts
• Then set a longer period with a lower rate
  to prevent slow-crawling robots.
Ladder longer periods
to have a smaller rate
Negative example:
2 per Minute ( ~0.033 events per sec )
 or 2x60 = 120 per Hour
   so laddering with

300 per Hour (~ 0.083 events per sec)
   does nothing, but
100 per Hour (~ 0.028) is good.
                         oh no! the maths again!
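The arithmetic above can be turned into a quick sanity check for any proposed ladder. The function names are hypothetical; the rule is that a longer-period rung only adds protection if its limit is lower than what the shorter rung would allow when sustained over the longer period.

```python
def events_per_second(max_events, period):
    """Average rate implied by a limit."""
    return max_events / period

def rung_is_effective(short_period, short_max, long_period, long_max):
    """True if the long rung is stricter than the short rung sustained."""
    sustained = short_max * (long_period / short_period)
    return long_max < sustained
```

For the slide's example, `rung_is_effective(60, 2, 3600, 300)` is False (2 per minute already caps the hour at 120), while `rung_is_effective(60, 2, 3600, 100)` is True.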
In Pictures...
    Rate limit of “3 per 1 box” - ok




    Rate Limit 5 per 3 boxes -- alert! (good)
but, say, rate limit 100 per 3 boxes does nothing
            and is impossible to trigger
Anonymous Identifiers
Anonymous Users
• hash of (IP + appropriate HTTP headers)
• order of headers matters
  different browsers order them differently
• Spoofed user agents don’t always get the
  order right
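A minimal sketch of such an identifier, assuming the request headers are available as an ordered list of (name, value) pairs. Which headers count as "appropriate" is a judgment call; this example uses the order of header names plus the User-Agent value, and both that choice and the function name are assumptions.

```python
import hashlib

def anonymous_id(ip, headers):
    """headers: list of (name, value) pairs in the order the client sent them."""
    names_in_order = ",".join(name.lower() for name, _ in headers)
    user_agent = next((v for n, v in headers if n.lower() == "user-agent"), "")
    material = "\n".join([ip, names_in_order, user_agent])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two clients on the same IP that claim the same User-Agent but send their headers in a different order end up with different identifiers, which is what makes header order useful against spoofed user agents.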

                Different type of
                Anonymous User
Rate Limit Every IP?

• Probably just Class C (only 16M of them)
• Maybe useful for just alerting
• Probably need whitelisting (e.g. AOL)
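A sketch of keying the limit on the Class C (/24) network rather than the individual address, with a whitelist for shared proxies. IPv4 only. The function names, the `net:` key prefix, and the whitelisted block (a documentation range) are hypothetical.

```python
import ipaddress

def class_c_key(ip):
    """Collapse an IPv4 address to the base address of its /24 network."""
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net.network_address)

WHITELIST = {"198.51.100.0"}  # hypothetical shared-proxy block

def network_rate_key(ip, event, period, now):
    """Bucket key for a /24 network, or None if the network is whitelisted."""
    prefix = class_c_key(ip)
    if prefix in WHITELIST:
        return None
    return f"net:{prefix}-{event}-{period}-{int(now // period)}"
```

Returning None for whitelisted networks lets the caller skip the limit entirely while still logging the traffic for alerting.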
Rate Limit Datacenters
      http://github.com/client9/ipcat

 Datacenter / Rent-A-Slice / “hands not on
 keyboard” / leasable CPU and network




       How much traffic is coming
            from them?
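One way to answer that question is to tag each request IP against a sorted list of datacenter ranges. This sketch assumes non-overlapping IPv4 ranges given as (start, end, name); the ranges below are documentation addresses with invented names, standing in for data that would come from a list such as ipcat.

```python
import bisect
import ipaddress

# Hypothetical ranges using documentation addresses, not real datacenter data.
RANGES = sorted(
    (int(ipaddress.ip_address(start)), int(ipaddress.ip_address(end)), name)
    for start, end, name in [
        ("192.0.2.0", "192.0.2.255", "Example Hosting A"),
        ("198.51.100.0", "198.51.100.127", "Example Hosting B"),
    ]
)
STARTS = [r[0] for r in RANGES]

def datacenter_for(ip):
    """Return the datacenter name owning this IPv4 address, or None."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1   # last range starting at or before n
    if i >= 0 and n <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

Counting requests per returned name (for example with a StatsD counter) gives the traffic share per datacenter, and the same lookup can select a stricter rate-limit ladder.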
http://github.com/client9/ipcat




  No implication of wrongdoing if on the list
• Almost every action on Etsy has a laddered
  rate limit
• We learn the hard way what is not limited
• Virtually no performance impact at scale
• Should we open source the driver?
Nick Galbreath nickg@etsy.com @ngalbreath
        SANS AppSec Las Vegas 2012

