Architectural Patterns of Resilient Distributed Systems

OMG Strangeloop 2015!!

Ines Sombra
@Randommood
ines@fastly.com

Globally distributed & highly available

Today’s Journey
Why care?
Resilience literature
Resilience in industry
Conclusions

OBLIGATORY DISCLAIMER SLIDE
All from a practitioner’s perspective!
Things you may see in this talk
Pugs
Fast talking
Life pondering
Un-tweetable moments
Rantifestos
What surprised me this year
Wedding factoids and trivia

How can I make a system more resilient?

Resilience is the ability of a system to adapt or keep working when challenges occur

Defining Resilience
Fault-tolerance
Evolvability
Scalability
Failure isolation
Complexity management

It’s what really matters

Yield
Fraction of successfully answered queries
Close to uptime but more useful because it directly maps to user experience (uptime misses this)
Focus on yield rather than uptime
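A minimal sketch of why yield can say more than uptime. The hourly traffic figures and function names are hypothetical, chosen only so that two outages of equal length hit different amounts of traffic.

```python
def uptime(hours_up):
    """Fraction of hours in which the service was up."""
    return sum(hours_up) / len(hours_up)

def query_yield(offered, hours_up):
    """Fraction of offered queries answered; queries in a down hour are lost."""
    answered = sum(q for q, up in zip(offered, hours_up) if up)
    return answered / sum(offered)

# Hypothetical day: 20 quiet hours (100 queries each), then 4 peak hours (1000 each).
offered = [100] * 20 + [1000] * 4

off_peak_outage = [False] + [True] * 23  # down during a quiet hour
peak_outage = [True] * 23 + [False]      # down during a peak hour

for name, hours in [("off-peak", off_peak_outage), ("peak", peak_outage)]:
    print(name, f"uptime={uptime(hours):.3f}", f"yield={query_yield(offered, hours):.3f}")
```

Both days have the same uptime, but the peak-hour outage loses far more queries, which is the difference the slide points at.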

Harvest
Fraction of the complete result
[Figure, from Coda Hale’s “You can’t sacrifice partition tolerance”: three servers, A, B, and C, each holding part of the data (labels: Baby, Animals, Cute). In the second build one server is marked X and the result is labeled 66% harvest.]
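The three-server picture can be sketched as a scatter-gather query that returns whatever it could collect. This is an illustration only: the server functions and their results are made up, and harvest is approximated as the fraction of servers reached, which assumes the data is spread evenly across them.

```python
def scatter_gather(servers, query):
    """Query every server; return the partial results and the harvest fraction."""
    results, reached = [], 0
    for server in servers:
        try:
            results.extend(server(query))
            reached += 1
        except ConnectionError:
            continue  # degrade the answer instead of failing the whole query
    return results, reached / len(servers)

# Hypothetical servers, each holding one part of the data.
def server_a(query):
    return ["result from A"]

def server_b(query):
    return ["result from B"]

def server_c(query):
    raise ConnectionError("server C is unreachable")

results, harvest = scatter_gather([server_a, server_b, server_c], "cute baby animals")
print(results, harvest)
```

The query still succeeds, so yield is unaffected; only the completeness of the answer drops to about two thirds.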

#1: Probabilistic Availability
Graceful harvest degradation under faults
Randomness to make the worst-case & average-case the same
Replication of high-priority data for greater harvest control
Degrading results based on client capability
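One way to picture the placement and replication points is the sketch below. It uses rendezvous hashing, which is a technique picked here for illustration and is not named in the talk, to spread keys pseudo-randomly across nodes, and it gives high-priority keys more copies. The node names, keys, and replication policy are all hypothetical.

```python
import hashlib

def replicas_for(key, nodes, copies):
    """Rank nodes by a per-key hash (rendezvous hashing) and keep the first `copies`."""
    def score(node):
        return hashlib.sha256(f"{key}|{node}".encode()).hexdigest()
    return sorted(nodes, key=score)[:copies]

def available_keys(keys, nodes, failed, copies_by_priority):
    """Return the keys that still have at least one replica on a healthy node."""
    alive = []
    for key, priority in keys.items():
        holders = replicas_for(key, nodes, copies_by_priority[priority])
        if any(node not in failed for node in holders):
            alive.append(key)
    return alive

nodes = ["a", "b", "c", "d"]
keys = {"checkout": "high", "cart": "high", "recommendations": "low", "badges": "low"}
copies = {"high": 3, "low": 1}  # hypothetical policy: more copies for important data

print(available_keys(keys, nodes, failed={"a", "b"}, copies_by_priority=copies))
```

With three copies on distinct nodes, the high-priority keys survive any two node failures, while low-priority keys may drop out of the result: harvest degrades, but in a controlled order.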

#2: Decomposition & Orthogonality
Decompose into subsystems that are each intolerant of harvest degradation, while the application as a whole can continue if one of them fails
You can provide strong consistency only for the subsystems that need it
Orthogonal mechanisms (state vs functionality)

“If your system favors yield or harvest is an outcome of its design”
Fox & Brewer

[Figure, after Cook & Rasmussen: the operating point moves inside a space bounded by the economic failure boundary, the unacceptable workload boundary, and the accident boundary. Pressure towards efficiency and reduction of effort push the operating point towards the accident boundary; crossing it is an incident. Safety campaigns push back, and a marginal boundary placed ahead of the accident boundary leaves an error margin.]

Flirting with the margin (R. I. Cook, 2004)
[Figure labels: acceptable operating point, original marginal boundary, error margin, accident boundary. A second build adds a new marginal boundary.]

Insights from Cook’s model
Engineering resilience requires a model of safety based on: monitoring, responding, adapting, and learning
System safety is about what can happen, where the operating point actually is, and what we do under pressure
Resilience is operator community focused

Engineering system resilience
Build support for continuous maintenance
Reveal control of system to operators
Know it’s going to get moved, replaced, and used in ways you did not intend
Think about configurations as interfaces


A system’s complexity
[Figure: probability of failure plotted against rank, divided into three areas: traditional engineering, reactive ops, and unk-unk.]
Unk-unk: cascading or catastrophic failures & you don’t know where they will come from! Same area as the other 2 combined
Failure areas need different strategies
[Figure: the same three areas (traditional engineering, reactive ops, unk-unk) on the probability of failure vs rank plot. Later builds add the labels Kingsbury and Alvaro, set against each other (VS).]
Strategies to build resilience
Classical engineering
Code standards
Programming patterns
Testing (full system!)
Metrics & monitoring
Convergence to good state
Hazard inventories
Reactive Operations
Redundancies
Feature flags
Dark deploys
Runbooks & docs
Canaries
Unknown-Unknown
System verification
Formal methods
Fault injection
The goal is to build failure domain independence
“Thinking about building system resilience using a single discipline is insufficient. We need different strategies”
Borrill

Now with sparkles! ✨✨

API inherently more vulnerable to any system failures or latencies in the stack
Without fault tolerance: 30 dependencies with 99.99% uptime could result in 2+ hours of downtime per month!
Leveraged client libraries
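The downtime figure can be checked with a few lines of arithmetic. The sketch assumes that the 30 dependencies fail independently, that any one of them being down takes the API down, and that a month has 30 days; none of these assumptions is stated on the slide.

```python
per_dependency_uptime = 0.9999  # 99.99%
dependencies = 30

# All dependencies must be up at once, so the probabilities multiply.
combined_uptime = per_dependency_uptime ** dependencies

hours_per_month = 30 * 24  # assumption: 30-day month
downtime_hours = (1 - combined_uptime) * hours_per_month

print(f"combined uptime: {combined_uptime:.4%}")
print(f"expected downtime: {downtime_hours:.2f} hours per month")
```

Under these assumptions the combined uptime falls to roughly 99.7%, which works out to a little over two hours of downtime per month.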

Netflix’s resilient patterns
Aggressive network timeouts & retries. Use of Semaphores.
Separate threads on per-dependency thread pools
Circuit-breakers to relieve pressure in underlying systems
Exceptions cause app to shed load until things are healthy
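To make the circuit-breaker idea concrete, here is a generic sketch. It is not Netflix's implementation or any library's API; the class name, parameters, and thresholds are hypothetical.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Shed load: fail fast instead of piling onto a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker, for example `breaker.call(fetch_profile, user_id)`, and treat `CircuitOpenError` as a signal to serve a fallback. The timeouts, retries, and per-dependency thread pools from the slide are separate mechanisms and are not shown here.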

We went on a diet just like you!

Key insights from Chubby
Library vs service? Service and client library control + storage of small data files with restricted operations
Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

Key insights from Chubby
Centralized services are hard to construct, but you can dedicate effort to architecting them well and making them failure-tolerant
Restricting user behavior increased resilience
Consumers of your service are part of your UNK-UNK scenarios

And the family arrives!

Key insights from Truce
Tyler McMullen, Bruce Spang
Evolution of our purging system from v1 to v3
Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed
Design concerns & system evolution
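As a rough feel for why gossip spreads a message quickly, the simulation below runs a generic push-style gossip. It is not Bimodal Multicast itself and not the purging system described in the talk; the node count, fanout, and seed are arbitrary.

```python
import random

def gossip_rounds(node_count, fanout, seed=0, max_rounds=100):
    """Simulate push gossip; return the number of rounds until every node is informed."""
    rng = random.Random(seed)
    nodes = list(range(node_count))
    informed = {0}  # node 0 originates the message (for example, a purge request)
    for round_number in range(1, max_rounds + 1):
        for _ in list(informed):
            # Each informed node forwards the message to `fanout` random peers.
            informed.update(rng.sample(nodes, fanout))
        if len(informed) == node_count:
            return round_number
    raise RuntimeError("message did not reach every node within max_rounds")

print(gossip_rounds(node_count=100, fanout=3))
```

The number of informed nodes can grow by up to a factor of `fanout + 1` per round, so the round count grows slowly with cluster size.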

Key insights from NetSys
João Taveira Araújo (looking suave)
Existing best practices won’t save you
Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches
Do immediate or gradual host failure & recovery
Watch João’s talk

But wait a minute!
So we have a myriad of systems at different stages of evolution
Resilient systems like Varnish, Powderhorn, and Faild have taught us many lessons, but some applications have availability problems. Why?

Resilient architectural patterns

Redundancies are key
Redundancies of resources, execution paths, checks, replication of data, replay of messages, and anti-entropy build resilience
Gossip / epidemic protocols too
Capacity planning matters
Optimizations can make your system less resilient!

Operations matter
Unawareness of proximity to error boundary means we are always guessing
Complex operations make systems less resilient & more incident-prone
You design operability too!

Not all complexity is bad
Complexity, if it increases safety, is actually good
Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost, etc.)

Leverage Engineering best practices
Resiliency and testing are correlated. TEST!
Versioning from the start - provide an upgrade path from day 1
Upgrades & evolvability of systems are still tricky. Mixed-mode operations need to be common
Re-examine the way we prototype systems

tl;dr
WHILE IN DESIGN
Are we favoring harvest or yield?
Orthogonality & decomposition FTW
Do we have enough redundancies in place?
Are we resilient to our dependencies?
OPERABILITY
Am I providing enough control to my operators?
Would I want to be on call for this?
Rank your services: what can be dropped, killed, deferred?
Monitoring and alerting in place?
UNK-UNK
The existence of this stresses diligence on the other two areas
Have we done everything we can?
Abandon hope and resort to human sacrifices
Theory matters!

tl;dr
WHILE IN DESIGN
Test dependency failures
Code reviews != tests. Have both
Distrust client behavior, even if they are internal
Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations.
Checksum all the things
Error handling, circuit breakers, backpressure, leases, timeouts
IMPROVING OPERABILITY
Automation shortcuts taken while in a rush will come back to haunt you
Release stability is often tied to system stability. Iron out your deploy process
Link alerts to playbooks
Consolidate system configuration (data bags, config file, etc.)
Operators determine resilience
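The "version from the start" and "checksum all the things" items can be combined in a small framing sketch. The header layout, the version numbers, and the function names are hypothetical; the point is only that a reader rejects unknown versions and corrupted payloads explicitly while accepting both old and new formats during an upgrade.

```python
import hashlib
import struct

SUPPORTED_VERSIONS = {1, 2}      # hypothetical: old and new formats during a rolling upgrade
HEADER = struct.Struct("!B32s")  # 1-byte version, then a 32-byte SHA-256 digest

def encode(payload: bytes, version: int = 2) -> bytes:
    """Prefix the payload with its format version and checksum."""
    return HEADER.pack(version, hashlib.sha256(payload).digest()) + payload

def decode(frame: bytes):
    """Return (version, payload), or raise ValueError if the frame cannot be trusted."""
    version, digest = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported format version {version}")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: payload corrupted")
    return version, payload

frame = encode(b"purge /index.html")
print(decode(frame))
```

Accepting version 1 alongside version 2 is what lets old and new nodes run side by side in mixed mode.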

TODAY’S RANTIFESTO
We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign the moment we finish coding.

Thank you!
github.com/Randommood/Strangeloop2015
Special thanks to Paul Borrill, Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgar, Peter Alvaro, Peter Bailis, Bruce Spang, Matt Whiteley, Alex Rasmussen, Aysulu Greenberg, Elaine Greenberg, and Greg Bako.
