ONE MAN OPS
      Reliability & Scale in AWS while letting you sleep through the night
                                                         Jos Boumans - @jiboumans
http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
ONE OF A KIND
   My own category
RIPE NCC
Engineering manager for RIPE Database
                                        http://www.ripe.net/db
CANONICAL
                    Engineering manager for Ubuntu Server 10.04 & 10.10

http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775          http://www.ubuntu.com/business/server/overview
KRUX
VP of Operations & Infrastructure

                                    http://www.krux.com/
GOOD GUYS OF DATA PRIVACY
LOTS OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
               AVERAGE REQUESTS* / SEC
[Chart: axis from 0 to 10,000]
                                                              *Twitter: New tweets
                                                              Wikipedia: Articles read
                                                              Krux: New data points
https://twitter.com/tps_watcher
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
                   MONTHLY UNIQUE USERS
[Chart: axis from 0 to 500,000,000]
http://www.mediabistro.com/alltwitter/twitter-active-total-users_b17655
http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/
WE CHOSE 'THE CLOUD'
http://previewnetworks.com/blog/
THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
FOCUS ON AWS
               http://aws.amazon.com/
APRIL 21, 2011
                                                                                                                    http://aws.amazon.com/message/65648/
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/   http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
... SOME OUTAGES ...
... SKIPPED FOR BREVITY ...
JUNE 14, 2012
http://www.laczik.org/BMW/repair/E38_wiring_harness/E38_wiring_harness.html   http://blog.pagerduty.com/2012/06/outage-post-mortem-june-14/
JUNE 29, 2012
http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper   http://aws.amazon.com/message/67457/
AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
THE RULES HAVE CHANGED
                                                        You're not in Kansas anymore

http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
NETWORK WILL PARTITION
                                                              And it will happen often

http://thevinylvillain.blogspot.com/2010_04_01_archive.html
DISK IO WILL FLUCTUATE
                                                     On a good day, it's mediocre

http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
IP ADDRESSES WILL CHANGE
                     IP lease is 8 hours
                    DNS TTL is 60 seconds
www.fantom-xp.com
INSTANCES WILL DIE
                                  And it will always be your Database Master

http://room57.deviantart.com/art/Hangman-188353196
HUMANS MAKE MISTAKES
     Including your humans
EMBRACE FAILURE
                                Hardware will fail. Humans will make errors.
                                   Nature will produce thunderstorms.
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
ADJUST YOUR STRATEGY
                                                      Don't bring a knife to a gun fight

http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
DATA STORES
                                                     Some work better than others

http://gustavhoiland.com/2010/03/10/stacked-boxes/
               CAP THEOREM
[Diagram placing data stores: RDBMS, CouchDB, BigTable Based, Dynamo Based, Master / Slave based]
       Your choice: sacrifice availability or consistency.
                       Orange is a lie.
MYSQL / ORACLE VS RDS
  See: Network partitioning & instances dying
BIGTABLE BASED STORES
            HBase, Accumulo, Hypertable
 Still suffer when network partitioning happens
                                                  http://www.cloudera.com/cdh4/
DYNAMO BASED STORES
                                                         Cassandra, Riak, DynamoDB

http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html   http://aws.amazon.com/dynamodb/faqs/
GO HOSTED?
                                 CouchDB, MongoDB, Riak, Cassandra, HBase
                                          Your Latency May Vary
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
CLIENT SIDE STORAGE
                                          Keep a copy of your users' data locally

http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/       http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
FILE STORES
                                                                   EBS vs Instance Store

http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
SIMPLE STORAGE SERVICE
                                                        S3: Arguably AWS' best feature

http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
TRAFFIC SHAPING
                                                Control every part of the request

http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
STAY LOCAL IF YOU CAN
                 Going off box exposes you to risks you need to mitigate

http://southshorewoman.com/issue/june-2010/article/local-character
CACHE WHAT YOU CAN
                                  HTTP Responses, DB Queries, User content
                                         Browsers have caches too!
http://theoatmeal.com/blog/charity_money
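
A minimal sketch of the caching idea on this slide, assuming a small in-process cache with a fixed time-to-live is acceptable for the data in question. The `TTLCache` class, its `get_or_compute` method and the `load_user` loader are hypothetical names for illustration and do not come from the talk.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def load_user():
    # Stand-in for an expensive DB query or HTTP call (hypothetical).
    calls.append(1)
    return {"id": 42, "name": "example"}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("user:42", load_user)
second = cache.get_or_compute("user:42", load_user)  # served from the cache
print(len(calls))
```

The second lookup never reaches `load_user`, so a slow or unreachable backend only hurts once per TTL window.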
USE ELASTIC LOAD BALANCERS
                                                They will save you more than once

http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
USE GLOBAL LOAD BALANCING
  Fail over to the closest data center on region failure
SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, etc
USE A CDN
                                        Critical items should always be available

http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
MEASURE EVERYTHING
                Find outliers, deviants & trends before they cause trouble

http://www.themoviedb.org/movie/629-the-usual-suspects
GRAPHITE, STATSD & COLLECTD
                       Use Statsd & Collectd for application/system metrics
                           Use graphite to store, aggregate & visualize
                                                                                                                    http://hostedgraphite.com/
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html   http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
GRAPH EVENTS
         Deployments, outages, CDN reconfigurations, failed builds, etc.
          Anything that's important to the health of your ecosystem
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
COMPARE WEEK TO WEEK
                          Overlay week to week graphs using timeShift()
                         Quickly identifies trends and deviations from trends
http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
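
As a rough illustration of the week-to-week overlay, the sketch below builds a Graphite render URL that plots a metric next to the same metric shifted back by seven days with `timeShift()`. The host `graphite.example.com` and the metric name `stats.timers.http.beacon.mean` are assumptions made up for the example.

```python
from urllib.parse import urlencode


def week_over_week_url(base_url, metric, days=7):
    """Build a Graphite render URL overlaying a metric with itself `days` earlier."""
    params = [
        ("target", metric),                           # current period
        ("target", f'timeShift({metric},"{days}d")'),  # same metric, shifted back
        ("from", f"-{days}d"),
        ("format", "png"),
    ]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


# Host and metric name are placeholders, not taken from the slides.
url = week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean")
print(url)
```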
FORECASTING
                                 Use Holt-Winters confidence bands
                        Verify that your metrics are within normal tolerance
https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
FIND INDIVIDUAL OUTLIERS
                                                      Absolute numbers mean very little
                                                       Use mean & standard deviation
http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
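
One way to apply "use mean & standard deviation" is to compare each host against the rest of the fleet rather than against a fixed number. This is a sketch under assumptions: the `find_outliers` helper, the host names and the latency figures are all invented for illustration, and the cutoff of two standard deviations is an arbitrary starting point.

```python
from statistics import mean, pstdev


def find_outliers(values_by_host, threshold=2.0):
    """Return {host: z_score} for hosts more than `threshold` std devs from the mean."""
    values = list(values_by_host.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # every host reports the same value, nothing stands out
    return {
        host: (value - mu) / sigma
        for host, value in values_by_host.items()
        if abs(value - mu) / sigma > threshold
    }


# Made-up response times in milliseconds per host.
latency_ms = {
    "web1": 100, "web2": 104, "web3": 98, "web4": 101,
    "web5": 99, "web6": 102, "web7": 97, "web8": 240,
}
outliers = find_outliers(latency_ms)
print(sorted(outliers))
```

A host at 240 ms only matters because its peers sit near 100 ms; the same absolute number on a fleet averaging 230 ms would not be flagged.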
ALERT ON TRENDS
                                Once you go over a threshold, it's too late
                              Alert on unwanted trends and fix them preemptively
http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html   http://aphyr.github.com/riemann/
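
A hedged sketch of trend-based alerting: fit a straight line through recent samples and estimate how long until the metric would cross its threshold, so the alert fires while there is still time to act. The `seconds_until_threshold` function, the disk-usage samples and the 90% threshold are hypothetical; a plain least-squares line is only one of several ways to model a trend.

```python
def seconds_until_threshold(samples, threshold):
    """Estimate seconds from the last sample until a rising metric hits threshold.

    samples is a list of (unix_timestamp, value) pairs. Returns None when the
    fitted trend is flat or falling, or when there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - samples[-1][0])


# Made-up disk usage in percent, sampled hourly.
disk_pct = [(0, 70.0), (3600, 72.0), (7200, 74.0), (10800, 76.0)]
eta = seconds_until_threshold(disk_pct, threshold=90.0)
if eta is not None and eta < 12 * 3600:
    print(f"disk projected to reach 90% in {eta / 3600:.1f} hours")
```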
MEASURE WITHOUT RETROFIT
                                          LogFormat "http.beacon:%D|ms" stats
                                         CustomLog "|nc -u localhost 8125" stats
http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html   http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
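
The two Apache directives above emit one statsd timer line per request and pipe it over UDP to port 8125. The sketch below shows the same wire format from Python. Note that Apache documents `%D` as microseconds, so this example converts to milliseconds before tagging the value `|ms`; the helper names `statsd_timer` and `send_metric` are assumptions for illustration.

```python
import socket


def statsd_timer(name, micros):
    """Format a statsd timer line, converting microseconds to milliseconds."""
    return f"{name}:{micros / 1000:g}|ms"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single statsd line over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# A request that took 12,500 microseconds to serve.
line = statsd_timer("http.beacon", 12500)
print(line)
```

Because UDP is connectionless, a missing or slow statsd daemon does not block the sender, which is what makes this safe to bolt onto an existing request path.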
SHOUT OUT: NEW RELIC
         Python, Ruby, .NET, Java, PHP support
In depth profiling of your app for performance & errors.
CONFIGURATION MANAGEMENT
                                                             Unique snowflakes are bad

http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
PUPPET VS CHEF
      Yes.

                         http://puppetlabs.com/
                 http://www.opscode.com/chef
INFRASTRUCTURE AS CODE
                                            Use different environments
                                            Measure and report on it
http://americansingercanary.com/green.htm
SHOUT OUT: UBUNTU
                                      Ubuntu + cloud-init + boto = awesome*
                                                                         *I am biased

http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html                  https://github.com/krux/ops-tools
DEV = PRODUCTION
                          "I dunno, it worked on my laptop"
                                 Instead, use vagrant
http://vagrantup.com/
ROLL YOUR OWN AMIS
                                                Instantly boot up new deployments
                                                     Reduce Time to Respond
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html   http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
CONFIDENT DEPLOYS
                                                   That human error could be yours

http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
CONTINUOUS INTEGRATION
      Ours: Github + Jenkins + FPM + apt::s3
   From commit to deployable in one command                         http://github.com/
                                                                 http://jenkins-ci.org/
                                                   https://github.com/thekad/apt-s3
                                          https://github.com/jordansissel/fpm/wiki/
ONE CLICK DEPLOYMENTS
                                        Deployments should not be exciting.
                                      Don't create a checklist; automate & track
http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php                    https://checkmarkable.com/
DARK LAUNCHES
               Exercise the code without impacting the user experience
                                                                          http://www.kissmetrics.com/
http://www.layoutsparks.com/pictures/moon-23                   https://github.com/yahoo/boomerang/
SHADOW TRAFFIC
                                                    Test new code against live traffic

http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his   https://gist.github.com/3125323
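
A minimal sketch of shadow traffic, assuming requests can be replayed safely against the new code path. Each request is answered by the primary handler while a copy is sent to the shadow handler in the background; the shadow's response and any errors are discarded. `handle_with_shadow` and the handlers here are hypothetical and are not taken from the linked gist.

```python
from concurrent.futures import ThreadPoolExecutor


def _run_quietly(handler, request):
    """Call the shadow handler and swallow any failure."""
    try:
        handler(request)
    except Exception:
        pass  # shadow failures must never reach the user


def handle_with_shadow(request, primary, shadow, pool):
    """Serve from primary; mirror the same request to shadow in the background."""
    pool.submit(_run_quietly, shadow, request)
    return primary(request)


seen_by_shadow = []


def old_code(request):
    return "200 OK"


def new_code(request):
    seen_by_shadow.append(request)
    raise RuntimeError("bug in the new code path")


pool = ThreadPoolExecutor(max_workers=2)
response = handle_with_shadow("GET /pixel", old_code, new_code, pool)
pool.shutdown(wait=True)  # only so the example finishes deterministically
print(response, seen_by_shadow)
```

The user still gets the old code's answer even though the new code raised, and the new code saw real traffic.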
SLEEP TIGHT
                                           Slides at: www.Slideshare.net/jiboumans
                                                 We're hiring: www.krux.com
http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html
