Architectural Patterns of Resilient Distributed Systems
- 4. Today’s Journey
Why care?
Resilience literature
Resilience in industry
Conclusions
@randommood
♥
- 5. OBLIGATORY DISCLAIMER SLIDE
All from a practitioner’s perspective!
Things you may see in this talk
Pugs
Fast talking
Life pondering
Un-tweetable moments
Rantifestos
What surprised me this year
Wedding factoids and trivia
- 13. @randommood
Yield: fraction of successfully answered queries
Close to uptime but more useful because it directly maps to user experience (uptime misses this)
Focus on yield rather than uptime
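To make the difference between uptime and yield concrete, here is a minimal sketch. The traffic numbers and the one-hour outage are invented for illustration, and the function names are hypothetical. Both scenarios have the same uptime, but the yield depends on whether the outage lands on the busy hour.

```python
def uptime(hours):
    """Fraction of hours in which the service was up."""
    return sum(1 for h in hours if h["up"]) / len(hours)


def query_yield(hours):
    """Fraction of offered queries that were answered successfully."""
    offered = sum(h["offered"] for h in hours)
    answered = sum(h["offered"] for h in hours if h["up"])
    return answered / offered


# Hypothetical load: three quiet hours and one peak hour.
traffic = [100, 100, 10_000, 100]

# Same outage length (one hour), placed off-peak vs. at peak.
off_peak_outage = [{"offered": q, "up": i != 0} for i, q in enumerate(traffic)]
peak_outage = [{"offered": q, "up": i != 2} for i, q in enumerate(traffic)]

for name, hours in [("off-peak", off_peak_outage), ("peak", peak_outage)]:
    print(f"{name}: uptime={uptime(hours):.2f} yield={query_yield(hours):.4f}")
```

Uptime cannot tell these two outages apart; yield can, because it weights each hour by the queries users actually sent.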
- 14. @randommood
From Coda Hale’s “You can’t sacrifice partition tolerance”
Server A Server B Server C
Cute Baby Animals
Harvest: fraction of the complete result
- 15. @randommood
From Coda Hale’s “You can’t sacrifice partition tolerance”
Server A Server B Server C
Cute Baby Animals
X (one server unavailable)
66% harvest
Harvest: fraction of the complete result
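The three-server example can be sketched as a query that returns whatever the reachable servers hold, together with the harvest it achieved. This is only an illustration: the shard contents, the `search` function, and the choice of which server is down are assumptions, and harvest is approximated here as the fraction of shards that answered.

```python
def search(shards, reachable):
    """Collect results from reachable shards and report the harvest."""
    results = []
    answered = 0
    for name, docs in shards.items():
        if name in reachable:
            results.extend(docs)
            answered += 1
    harvest = answered / len(shards)
    return results, harvest


# Hypothetical layout: each server holds one third of the index.
shards = {"A": ["cute"], "B": ["baby"], "C": ["animals"]}

# Assume server B is the one marked with an X.
results, harvest = search(shards, reachable={"A", "C"})
print(results, f"harvest={harvest:.0%}")
```

The query still succeeds, so yield is preserved, but the answer is built from two thirds of the data.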
- 17. @randommood
#2 Decomposition & Orthogonality
Decompose into subsystems that are independently
intolerant to harvest degradation, so that the
application can continue if one of them fails
Provide strong consistency only for the
subsystems that need it
Orthogonal mechanisms (state vs functionality)
♥
- 35. @randommood
Insights from Cook’s model
Engineering resilience requires a model of
safety based on: monitoring, responding,
adapting, and learning
System safety is about what can happen,
where the operating point actually is, and
what we do under pressure
Resilience is operator community focused
- 36. @randommood
Engineering system resilience
Build support for continuous maintenance
Reveal control of system to operators
Know it’s going to get moved, replaced, and
used in ways you did not intend
Think about configurations as interfaces
- 47. Strategies to build resilience
Classical engineering: Code standards, Programming patterns, Testing (full system!), Metrics & monitoring, Convergence to good state, Hazard inventories
Reactive Operations: Redundancies, Feature flags, Dark deploys, Runbooks & docs, Canaries
Unknown-Unknown: System verification, Formal methods, Fault injection
The goal is to build failure domain independence
- 52. @randommood
API inherently more vulnerable
to any system failures or
latencies in the stack
Without fault tolerance: 30
dependencies with 99.99% uptime each
could result in 2+ hours of
downtime per month!
Leveraged client libraries
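The 2+ hours figure can be checked with a quick calculation. This sketch assumes the 30 dependencies fail independently, that a request needs all of them, and that a month has 30 days; none of those assumptions is stated on the slide.

```python
dependencies = 30
per_dependency_uptime = 0.9999
hours_per_month = 30 * 24  # assumption: a 30-day month

# If every request needs all dependencies, availabilities multiply.
combined_uptime = per_dependency_uptime ** dependencies
downtime_hours = (1 - combined_uptime) * hours_per_month

print(f"combined uptime: {combined_uptime:.4%}")               # 99.7004%
print(f"expected downtime: {downtime_hours:.2f} hours/month")  # 2.16 hours/month
```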
- 53. @randommood
Netflix’s resilient patterns
Aggressive network timeouts &
retries. Use of Semaphores.
Separate threads on per-dependency
thread pools
Circuit-breakers to relieve
pressure in underlying systems
Exceptions cause app to shed
load until things are healthy
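As an illustration of the circuit-breaker idea, here is a minimal sketch. It is not Netflix's implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions made for this example. After a number of consecutive failures the breaker rejects calls immediately, which relieves pressure on the dependency, and after a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: shed load instead of hitting the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function that already enforces a network timeout.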
- 56. @randommood
Key insights from Chubby
Library vs service? Service and client library
control + storage of small data files with
restricted operations
Engineers don’t plan for: availability,
consensus, primary elections, failures, their
own bugs, operability, or the future. They also
don’t understand Distributed Systems
- 57. @randommood
Key insights from Chubby
Centralized services are hard to construct but
you can dedicate effort into architecting them
well and making them failure-tolerant
Restricting user behavior increased resilience
Consumers of your service are part of your
UNK-UNK scenarios
- 59. @randommood
Key insights from Truce
Evolution of our purging
system from v1 to v3
Used Bimodal Multicast
(Gossip protocol) to
provide extremely fast
purging speed
Design concerns & system
evolution
Tyler McMullen Bruce Spang
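To give a feel for why gossip spreads a message quickly, here is a small simulation of plain push gossip. It is not Bimodal Multicast and not the purging system described in the talk; the function name, fanout, and node count are assumptions chosen for illustration.

```python
import random


def gossip_rounds(n_nodes, fanout=3, seed=0):
    """Count rounds until a message starting at node 0 reaches every node."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        newly_informed = set()
        for _ in informed:
            # Each informed node pushes the message to `fanout` random peers.
            newly_informed.update(rng.sample(range(n_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


print(gossip_rounds(1000))
```

Because the informed set can roughly multiply each round, the number of rounds grows about logarithmically with the number of nodes, which is the property that makes epidemic protocols attractive for fast propagation.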
- 63. @randommood
So we have a myriad of systems with
different stages of evolution
Resilient systems like Varnish, Powderhorn,
and Faild have taught us many lessons but
some applications have availability
problems, why?
But wait a minute! ♥
- 66. @randommood
Redundancies are key
Redundancies of resources,
execution paths, checks,
replication of data, replay of
messages, anti-entropy build
resilience
Gossip / epidemic protocols too
Capacity planning matters
Optimizations
can make your
system less
resilient!
- 67. @randommood
Unawareness of proximity to
error boundary means we are
always guessing
Complex operations make
systems less resilient & more
incident-prone
You design operability too!
Operations matter
- 69. @randommood
Leverage Engineering best practices
Resiliency and testing are correlated. TEST!
Versioning from the start - provide an upgrade
path from day 1
Upgrades & evolvability of systems are still tricky.
Mixed-mode operations need to be common
Re-examine the way we prototype systems
- 71. tl;dr
WHILE IN DESIGN
Are we favoring harvest or yield?
Orthogonality & decomposition FTW
Do we have enough redundancies in place?
Are we resilient to our dependencies?
OPERABILITY
Am I providing enough control to my operators?
Would I want to be on call for this?
Rank your services: what can be dropped, killed, deferred?
Monitoring and alerting in place?
UNK-UNK
The existence of this stresses diligence on the other two areas
Have we done everything we can?
Abandon hope and resort to human sacrifices
Theory matters!
- 72. tl;dr
WHILE IN DESIGN
Test dependency failures
Code reviews != tests. Have both
Distrust client behavior, even if the clients are internal
Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations.
Checksum all the things
Error handling, circuit breakers, backpressure, leases, timeouts
IMPROVING OPERABILITY
Automation shortcuts taken while in a rush will come back to haunt you
Release stability is often tied to system stability. Iron out your deploy process
Link alerts to playbooks
Consolidate system configuration (data bags, config file, etc.)
Operators determine resilience
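The advice to version formats from the start and to checksum everything can be sketched as a tiny record encoding. The header layout, the version number, and the function names are assumptions made for this example, not a format from the talk.

```python
import json
import struct
import zlib

FORMAT_VERSION = 1             # hypothetical version for this sketch
HEADER = struct.Struct(">BI")  # 1-byte version, 4-byte CRC32 of the payload


def encode(record: dict) -> bytes:
    """Serialize a record with a version byte and a payload checksum."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return HEADER.pack(FORMAT_VERSION, zlib.crc32(payload)) + payload


def decode(blob: bytes) -> dict:
    """Reject unknown versions and corrupt payloads before parsing."""
    version, checksum = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: payload is corrupt")
    return json.loads(payload)
```

To support mixed-mode operations during an upgrade, a reader would typically accept a range of versions rather than exactly one, dispatching on the version byte.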
- 73. @randommood
TODAY’S RANTIFESTO
We can’t recover from lack of
design. Not minding harvest/yield
means we sign up for a redesign
the moment we finish coding.
- 74. Thank you!
github.com/Randommood/Strangeloop2015
Special thanks to
Paul Borrill, Jordan West, Caitie
McCaffrey, Camille Fournier, Mike
O'Neill, Neha Narula, Joao Taveira,
Tyler McMullen, Zac Duncan,
Nathan Taylor, Ian Fung, Armon
Dadgar, Peter Alvaro, Peter Bailis,
Bruce Spang, Matt Whiteley, Alex
Rasmussen, Aysylu Greenberg,
Elaine Greenberg, and Greg Bako.