Reliability & Scale in AWS while letting you sleep through the night
- 1. ONE MAN OPS
Jos Boumans - @jiboumans
http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
- 4. CANONICAL
Engineering manager for Ubuntu Server 10.04 & 10.10
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
- 8. AVERAGE REQUESTS* / SEC
[Chart: axis from 0 to 10,000]
*Twitter: New tweets
Wikipedia: Articles read
Krux: New data points
https://twitter.com/tps_watcher
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
- 9. MONTHLY UNIQUE USERS
[Chart: axis from 0 to 500,000,000]
http://www.mediabistro.com/alltwitter/twitter-active-total-users_b17655
http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/
- 13. APRIL 21, 2011
http://aws.amazon.com/message/65648/
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/ http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
- 17. AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
- 18. THE RULES HAVE CHANGED
You're not in Kansas anymore
http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
- 19. NETWORK WILL PARTITION
And it will happen often
http://thevinylvillain.blogspot.com/2010_04_01_archive.html
- 20. DISK IO WILL FLUCTUATE
On a good day, it's mediocre
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
- 21. IP ADDRESSES WILL CHANGE
IP lease is 8 hours
DNS TTL is 60 seconds
www.fantom-xp.com
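Because addresses can change under a 60-second DNS TTL, a long-running client should re-resolve hostnames instead of holding on to an IP for the life of the process. The following is a minimal sketch of that idea, not a production resolver. `TTLResolver` is a hypothetical helper name, the 60-second default simply mirrors the TTL on the slide, and `resolve_fn` and `clock` are injected only so the behaviour can be tested without a network.

```python
import socket
import time


def _system_resolve(host):
    """Resolve a hostname to one address using the system resolver."""
    infos = socket.getaddrinfo(host, None)
    return infos[0][4][0]


class TTLResolver:
    """Hypothetical helper: reuse a lookup for at most `ttl` seconds."""

    def __init__(self, ttl=60, resolve_fn=_system_resolve, clock=time.monotonic):
        self.ttl = ttl
        self._resolve = resolve_fn
        self._clock = clock
        self._cache = {}  # host -> (address, expiry time)

    def lookup(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]
        # Entry is missing or expired: ask again so a changed IP is picked up.
        address = self._resolve(host)
        self._cache[host] = (address, now + self.ttl)
        return address
```

Calling `TTLResolver().lookup(hostname)` before each new connection keeps the client at most one TTL behind an address change.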
- 22. INSTANCES WILL DIE
And it will always be your Database Master
http://room57.deviantart.com/art/Hangman-188353196
- 24. EMBRACE FAILURE
Hardware will fail. Humans will make errors.
Nature will produce thunderstorms.
http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
- 25. ADJUST YOUR STRATEGY
Don't bring a knife to a gun fight
http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
- 26. DATA STORES
Some work better than others
http://gustavhoiland.com/2010/03/10/stacked-boxes/
- 27. CAP THEOREM
Your choice: sacrifice availability or consistency.
Orange is a lie.
[Diagram labels: RDBMS, CouchDB, BigTable based, Dynamo based, Master / Slave based]
- 29. BIGTABLE BASED STORES
HBase, Accumulo, Hypertable
Still suffer when network partitioning happens
http://www.cloudera.com/cdh4/
- 30. DYNAMO BASED STORES
Cassandra, Riak, DynamoDB
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
- 31. GO HOSTED?
CouchDB, MongoDB, Riak, Cassandra, HBase
Your Latency May Vary
http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
- 32. CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
- 33. FILE STORES
EBS vs Instance Store
http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
- 34. SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
- 35. TRAFFIC SHAPING
Control every part of the request
http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
- 36. STAY LOCAL IF YOU CAN
Going off box exposes you to risks you need to mitigate
http://southshorewoman.com/issue/june-2010/article/local-character
- 37. CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://theoatmeal.com/blog/charity_money
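For the "browsers have caches too" point, one possible sketch is a helper that produces the standard HTTP caching headers for a response. `cache_headers` is a hypothetical function name, and the one-hour lifetime in the usage line is an arbitrary illustration, not a recommendation from the talk.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now=None):
    """Build headers that let browsers and proxies reuse a response."""
    if now is None:
        now = time.time()
    return {
        # Modern caches honour max-age; Expires covers older HTTP/1.0 caches.
        "Cache-Control": "public, max-age=%d" % max_age_seconds,
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
```

For example, `cache_headers(3600)` could be merged into the headers of a static or rarely changing response so repeat requests never reach the origin.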
- 38. USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
- 39. USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
- 41. USE A CDN
Critical items should always be available
http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
- 42. MEASURE EVERYTHING
Find outliers, deviants & trends before they cause trouble
http://www.themoviedb.org/movie/629-the-usual-suspects
- 43. GRAPHITE, STATSD & COLLECTD
Use StatsD & collectd for application/system metrics
Use Graphite to store, aggregate & visualize
http://hostedgraphite.com/
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
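As a rough illustration of how little code an application needs in order to emit metrics, here is a sketch that speaks the StatsD line protocol (`<name>:<value>|<type>`) over UDP. The function names are hypothetical, and the `localhost:8125` default is an assumption matching the usual StatsD port.

```python
import socket


def statsd_packet(name, value, metric_type):
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return ("%s:%s|%s" % (name, value, metric_type)).encode("ascii")


def send_metric(name, value, metric_type, host="localhost", port=8125):
    """Send one metric over UDP; no connection is needed, so a missing
    daemon does not stall the caller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))
    finally:
        sock.close()
```

Typical calls would be `send_metric("signups", 1, "c")` for a counter or `send_metric("db.query", 42, "ms")` for a timer.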
- 44. GRAPH EVENTS
Deployments, outages, CDN reconfigurations, failed builds, etc
Anything that's important to the health of your ecosystem
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
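One way to get events such as deployments onto a graph is to record each one as a data point with value 1 and draw that series as vertical lines (Graphite's `drawAsInfinite()` does this). The sketch below formats and sends such a point using Graphite's plaintext protocol (`<path> <value> <timestamp>`). The `events.` prefix, the function names, and the `localhost:2003` default are assumptions for illustration.

```python
import socket
import time


def event_line(name, timestamp=None):
    """Format an event as one line of Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return "events.%s 1 %d\n" % (name, ts)


def send_event(name, host="localhost", port=2003):
    """Send the event to a carbon listener (assumed to be on host:port)."""
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(event_line(name).encode("ascii"))
```

A deploy script could call `send_event("deploy.web")` as its last step so every release shows up next to the metrics it may have affected.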
- 45. COMPARE WEEK TO WEEK
Overlay week to week graphs using timeShift()
Quickly identifies trends and deviations from trends
http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
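A week-over-week overlay can be requested from Graphite's render API by asking for the metric and a `timeShift()` copy of it in the same graph. The sketch below only builds the URL. The host name and metric path in the usage note are made up, and `week_over_week_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def week_over_week_url(base_url, metric, shift="7d", window="-7d"):
    """Build a render URL that overlays `metric` with itself `shift` ago."""
    targets = [metric, 'timeShift(%s,"%s")' % (metric, shift)]
    query = [("target", t) for t in targets]
    query.append(("from", window))
    return base_url.rstrip("/") + "/render?" + urlencode(query)
```

For instance, `week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean")` yields a graph where this week's line sits on top of last week's.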
- 46. FORECASTING
Use Holt-Winters confidence bands
Verify that your metrics are within normal tolerance
https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
- 47. FIND INDIVIDUAL OUTLIERS
Absolute numbers mean very little
Use mean & standard deviation
http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
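To make "use mean & standard deviation" concrete, here is a small sketch that flags the members of a group whose value is far from the group's mean, measured in standard deviations. The function name, the per-host input shape, and the threshold of 2 are illustrative assumptions.

```python
import statistics


def find_outliers(values_by_host, threshold=2.0):
    """Return hosts whose value is more than `threshold` standard
    deviations away from the mean of the whole group."""
    values = list(values_by_host.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # every host reports the same value
    return sorted(
        host
        for host, value in values_by_host.items()
        if abs(value - mean) / stdev > threshold
    )
```

The absolute latency of one host says little on its own; a host that sits well outside the spread of its peers is the one worth looking at.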
- 48. ALERT ON TRENDS
Once you go over a threshold, it's too late
Alert on unwanted trends and preemptively fix
http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
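One simple way to alert on a trend instead of a threshold is to fit a straight line through recent samples and estimate how long until it crosses the limit. The sketch below uses an ordinary least-squares fit. It is an illustration under the assumption of roughly linear growth, not the method used by any particular tool, and `seconds_until_threshold` is a hypothetical name.

```python
def seconds_until_threshold(samples, threshold):
    """samples: list of (timestamp, value) pairs, oldest first.

    Returns the projected number of seconds after the last sample at which
    the fitted line reaches `threshold`, or None if the values are not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of value over time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])
```

An alert could then fire when the projected time drops below, say, the time it takes to add capacity, which leaves room to fix the problem before the limit is hit.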
- 49. MEASURE WITHOUT RETROFIT
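# Note: %D is the request time in microseconds, so values are 1000x what the "ms" type implies
LogFormat "http.beacon:%D|ms" stats
# Pipe every log line to the local StatsD UDP port
CustomLog "|nc -u localhost 8125" stats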
http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
- 50. SHOUT OUT: NEW RELIC
Python, Ruby, .NET, Java, PHP support
In-depth profiling of your app for performance & errors.
- 51. CONFIGURATION MANAGEMENT
Unique snowflakes are bad
http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
- 52. PUPPET VS CHEF
Yes.
http://puppetlabs.com/
http://www.opscode.com/chef
- 53. INFRASTRUCTURE AS CODE
Use different environments
Measure and report on it
http://americansingercanary.com/green.htm
- 54. SHOUT OUT: UBUNTU
Ubuntu + cloud-init + boto = awesome*
*I am biased
http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
- 55. DEV = PRODUCTION
"I dunno, it worked on my laptop"
Instead, use Vagrant
http://vagrantup.com/
- 56. ROLL YOUR OWN AMIS
Instantly boot up new deployments
Reduce Time to Respond
http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
- 57. CONFIDENT DEPLOYS
That human error could be yours
http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
- 58. CONTINUOUS INTEGRATION
Ours: GitHub + Jenkins + FPM + apt::s3
From commit to deployable in one command
http://github.com/
http://jenkins-ci.org/
https://github.com/thekad/apt-s3
https://github.com/jordansissel/fpm/wiki/
- 59. ONE CLICK DEPLOYMENTS
Deployments should not be exciting.
Don't create a checklist; automate & track
http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://checkmarkable.com/
- 60. DARK LAUNCHES
Exercise the code without impacting the user experience
http://www.kissmetrics.com/
http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
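A dark launch can be sketched as a wrapper that always serves the existing code path and runs the new one on the side, only logging what it did. This is one possible in-process shape under assumed names (`dark_launch`, `current_fn`, `candidate_fn`); the talk does not prescribe an implementation.

```python
import logging

log = logging.getLogger("dark_launch")


def dark_launch(current_fn, candidate_fn, enabled=True):
    """Serve current_fn's result; run candidate_fn on the side and only log."""
    def wrapper(*args, **kwargs):
        result = current_fn(*args, **kwargs)
        if enabled:
            try:
                candidate = candidate_fn(*args, **kwargs)
                if candidate != result:
                    log.warning("dark launch mismatch: %r != %r", candidate, result)
            except Exception:
                # The new code path must never break the user-facing request.
                log.exception("dark-launched code raised")
        return result
    return wrapper
```

The `enabled` flag stands in for whatever feature switch is available, so the new path can be turned off without a deploy if it proves too expensive.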
- 61. SHADOW TRAFFIC
Test new code against live traffic
http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
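Shadow traffic can be approximated by copying a sample of live requests to the new backend in the background and discarding whatever it returns. The sketch below is a generic illustration and is not taken from the gist linked on the slide; `mirror`, `primary`, `shadow`, and `sample_rate` are hypothetical names, and the handlers are plain callables so the idea stays framework-neutral.

```python
import random
import threading


def _safe_call(fn, request):
    """Run the shadow handler and swallow any failure."""
    try:
        fn(request)
    except Exception:
        pass


def mirror(primary, shadow, sample_rate=1.0, rng=random.random):
    """Return a handler that answers from `primary` and copies a sample
    of requests to `shadow` on a background thread."""
    def handle(request):
        if rng() < sample_rate:
            threading.Thread(
                target=_safe_call, args=(shadow, request), daemon=True
            ).start()
        # The user only ever sees the primary's response.
        return primary(request)
    return handle
```

Starting with a low `sample_rate` and raising it gradually lets the new code see real request patterns before it serves a single user.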
- 62. SLEEP TIGHT
Slides at: www.Slideshare.net/jiboumans
We're hiring: www.krux.com
http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html