From Resilient to Antifragile
Chaos Engineering Primer
By @Sergiu_Bodiu
Solution Architect
DevOpsDays
Singapore

Conference
Singapore Spring
User Group
What is an ARCHITECT
https://www.thekua.com/atwork/2016/11/the-well-rounded-architect/ @patkua
Risk management
The new normal:
from RESILIENT
to ANTIFRAGILE
A new way to look at organizations
Fragile: At risk of total failure / financial ruin
Resilient: Takes damage, avoids total failure, recovers
Robust: Absorbs uncertainty, repels blows, avoids damage
Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.
Blueprint for living in a Black Swan world.
The Antifragile, and only the Antifragile, will make it.
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Software is a Single Point of Failure
Root Cause Analysis: Component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER
failures are anticipated and thus guarded with extra redundancies.
Distributed Systems Complexity
Complexity is like Addiction…
Case study: How complexity creeps in - @jasonfried
https://m.signalvnoise.com/case-study-how-complexity-creeps-in-cba48023e6a1
Chaos Engineering
Discipline of experimenting on a distributed system in order to build confidence in the
system’s capability to withstand turbulent conditions in production.
NETFLIX
http://principlesofchaos.org
Some outages in the Region
• SingTel fined a record $6m for Bukit Panjang exchange fire
• Telstra goes down again, people can't drink beer or catch Ubers
• Amazon Web Services outage causes Australian website chaos
Backups
"Backups always succeed. It's the restores that fail.
Test your backups by practicing restores!"
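A restore drill can be sketched in a few lines. This is a minimal illustration, not a tool from the talk: the function names (`checksums`, `backup`, `restore_matches`) are made up for this example, and it assumes the data is a plain directory archived with Python's standard `tarfile` module.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_matches(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(Path(scratch)) == checksums(source)
```

Running `restore_matches` on a schedule turns "the backup job succeeded" into "the restore was actually verified", which is the point of the quote.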
Using Chaos Monkey
Netflix Simian Army
Suite of tools for keeping your cloud operating in top form.
https://github.com/Netflix/SimianArmy
Chaos Monkey
1. Active during normal working hours
2. Break things in production
3. Design better software services
4. Embracing failure
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
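The first two points can be illustrated with a toy sketch. This is not Chaos Monkey's code or API: `in_working_hours`, `unleash_monkey`, and the `terminate` callback are hypothetical names, and the 9:00 to 17:00 weekday window is an assumption.

```python
import random
from datetime import datetime
from typing import Callable, Optional, Sequence


def in_working_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on Monday to Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def unleash_monkey(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate one randomly chosen instance, but only while engineers are at work."""
    if not instances or not in_working_hours(now):
        return None
    victim = (rng or random).choice(list(instances))
    terminate(victim)  # hypothetical callback, e.g. a wrapper around a cloud API
    return victim
```

Restricting terminations to working hours means the people who can fix and learn from a failure are present when it happens.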
https://github.com/Netflix/security_monkey
Monitors AWS and GCP accounts for policy changes and alerts on insecure configurations.
Security Monkey can be extended with custom account types, custom watchers, custom auditors,
and custom alerters.
Other Monkeys
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Doctor Monkey
PRINCIPLES > TOOLS
Why do we do > What we do
Dejirafication
Alexey Krivitsky https://www.slideshare.net/krivitsky/dejirafication-clean-your-process
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
Chaos Engineering Whitepaper 2016
Hypothesize
> sudo watch
• Start with steady state behavior.
• Monitor metrics that are visible
• Capture an interaction between the users and the system.
TIP: Utilisation is Virtually Useless as a Metric!
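One way to make "steady state" concrete is a tolerance band around a user-facing metric. The sketch below is an assumption, not something from the slides: the metric (successful requests per second), the three-sigma band, and the function names are all illustrative.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def steady_state_band(baseline: Sequence[float], sigmas: float = 3.0) -> Tuple[float, float]:
    """Return (low, high): the baseline mean plus or minus `sigmas` sample standard deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def within_steady_state(observed: float, band: Tuple[float, float]) -> bool:
    """True when the observed metric stays inside the steady-state band."""
    low, high = band
    return low <= observed <= high


# Hypothetical baseline: successful requests per second sampled during normal operation.
band = steady_state_band([100, 102, 98, 101, 99])
print(within_steady_state(103, band), within_steady_state(60, band))
```

The hypothesis for an experiment then reads: "while the fault is injected, the metric stays inside the band".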
Vary Events
> sudo halt
• Terminate virtual machine instances
• Inject latency into requests between services
• Fail requests between services
• Fail an internal microservice
• Make an entire region unavailable
TIP: Select only a subset of users
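Latency and failure injection for a subset of users can be sketched as a wrapper around a service call. Everything here is hypothetical (`in_experiment`, `with_chaos`, `InjectedFailure`); hashing the user id into 100 buckets is one assumed way of selecting a stable subset.

```python
import hashlib
import time
from typing import Callable


class InjectedFailure(Exception):
    """Raised in place of a real response when a fault is injected."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent` percent of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def with_chaos(
    call: Callable[[], object],
    user_id: str,
    percent: int,
    delay_s: float = 0.0,
    fail: bool = False,
    sleep: Callable[[float], None] = time.sleep,
):
    """Run call(), adding latency or a failure only for users in the experiment."""
    if in_experiment(user_id, percent):
        if delay_s > 0:
            sleep(delay_s)  # inject latency before the real call
        if fail:
            raise InjectedFailure(f"injected failure for {user_id}")
    return call()
```

Because the bucket is derived from the user id, the same users stay in the experiment across requests, which limits the blast radius and keeps the results comparable.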
Experiment
• End to end TESTING (Expensive)
• Process is slow
• Configuration Drift from Production
• 92% of catastrophic failures come from incorrect handling of non-fatal errors; simple testing could prevent most of them
TIP: Customers don't behave like your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf
Automate
> sudo while (1)
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
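Continuous execution with a different rate per experiment can be sketched as a small scheduler. The names (`ScheduledExperiment`, `run_due`) and the example intervals are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScheduledExperiment:
    name: str
    run: Callable[[], None]
    interval_s: float                  # how often this experiment should run
    last_run_s: float = float("-inf")  # never run yet


def run_due(experiments: List[ScheduledExperiment], now_s: float) -> List[str]:
    """Run every experiment whose interval has elapsed and return the names that ran."""
    ran = []
    for exp in experiments:
        if now_s - exp.last_run_s >= exp.interval_s:
            exp.run()
            exp.last_run_s = now_s
            ran.append(exp.name)
    return ran
```

A driver loop would call `run_due(experiments, time.monotonic())` periodically; a cheap experiment such as killing one instance can run every minute, while a region failover might run far less often.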
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare measured with expected impact, and correct any
problems uncovered this way.
Chaos Engineering Whitepaper 2016
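The "compare measured with expected impact" step can be sketched as a tiny experiment runner. This is an illustrative assumption, not the whitepaper's method: `run_experiment`, the callbacks, and the relative-drop criterion are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    measured: float
    expected_drop: float  # hypothesised maximum relative drop, e.g. 0.05 for 5%
    actual_drop: float
    hypothesis_holds: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    expected_drop: float,
) -> ExperimentResult:
    """Measure the steady state, inject a fault, measure again, and always roll back."""
    baseline = measure()
    inject()
    try:
        measured = measure()
    finally:
        rollback()  # undo the fault even if measuring fails
    actual_drop = (baseline - measured) / baseline if baseline else 0.0
    return ExperimentResult(
        baseline, measured, expected_drop, actual_drop, actual_drop <= expected_drop
    )
```

When `hypothesis_holds` is false, the gap between `expected_drop` and `actual_drop` is the finding to investigate and correct.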
Reference Architecture for Cloud Native Platform
https://content.pivotal.io/white-papers/the-upside-down-economics-of-building-your-own-platform
Pivotal Cloud Foundry
Chaos Lemur demo
Chaos Lemur = Chaos Monkey + PCF
Locust demo
Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user tests with sessions and cookies.
• Expands to multiple slaves to increase load capacity
• Allows for distributed user paths based on percentages
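The idea of user paths distributed by percentage can be shown with the standard library alone. This sketch deliberately does not use Locust's API (in Locust itself, weights are given through the `@task` decorator); the path names and the 70/25/5 split are invented for illustration.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple


# Hypothetical user paths; each one mutates a per-user session, much like cookies would.
def browse_catalogue(session: Dict[str, int]) -> None:
    session["pages"] = session.get("pages", 0) + 1


def search(session: Dict[str, int]) -> None:
    session["searches"] = session.get("searches", 0) + 1


def checkout(session: Dict[str, int]) -> None:
    session["orders"] = session.get("orders", 0) + 1


# (path, weight in percent): most simulated users browse, few check out.
USER_PATHS: List[Tuple[Callable[[Dict[str, int]], None], int]] = [
    (browse_catalogue, 70),
    (search, 25),
    (checkout, 5),
]


def simulate_user(paths, steps: int, rng: random.Random) -> Counter:
    """Simulate one user session that picks a path by weight at every step."""
    session: Dict[str, int] = {}
    actions = [action for action, _ in paths]
    weights = [weight for _, weight in paths]
    taken: Counter = Counter()
    for _ in range(steps):
        action = rng.choices(actions, weights=weights, k=1)[0]
        action(session)
        taken[action.__name__] += 1
    return taken
```

Encoding the mix in code keeps the load profile reviewable and versioned alongside the application.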
		
Gatling is an open-source Scala load testing framework
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
Lessons Learned
• Systematic approach to Chaos Testing
• This is incredibly hard under pressure.
• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture at the last minute is extremely dangerous.
• Join the community
• Build relationships with the Networking Team, Database Team, Third Party Partners, Vendors, etc.
• Make everything Asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence)
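Retry and idempotence belong together: a retry is only safe if repeating the request cannot apply it twice. The sketch below is a hypothetical illustration; `retry`, `TransientError`, and `PaymentService` are invented names, and the exponential backoff values are assumptions.

```python
import time
from typing import Callable, Dict


class TransientError(Exception):
    """A failure that may succeed on retry, such as a timeout or a lost response."""


def retry(operation: Callable[[], object], attempts: int = 3,
          base_delay_s: float = 0.1, sleep: Callable[[float], None] = time.sleep):
    """Call operation() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay_s * 2 ** attempt)


class PaymentService:
    """Hypothetical service that applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.processed: Dict[str, int] = {}
        self.balance = 0

    def charge(self, key: str, amount: int) -> int:
        if key not in self.processed:  # first time this key is seen
            self.balance += amount
            self.processed[key] = self.balance
        return self.processed[key]     # replays return the stored result
```

If the response to `charge("order-42", 10)` is lost and the client retries with the same key, the balance still changes only once.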
The importance of reliability
Don't trust claims systems make about themselves & their dependencies.
Verify by breaking.
Clean your process
Culture > Principles > Tools
> Post Mortem
> sudo halt

Incident Start

Impact
Testing Pyramid
https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/
Further Reading
https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service
https://www.infoq.com/articles/chaos-engineering
@Ops_Engineering https://www.youtube.com/watch?v=CZ3wIuvmHeM
@caseyrosenthal https://www.youtube.com/watch?v=Q4nniyAarbs
Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures
Thank You
@sergiu_bodiu
Questions
Principles
https://12factor.net: Any developer building applications which run as a service. Ops engineers who
deploy or manage such applications.
http://www.10factor.ci: Anyone working in software who writes tests or maintains continuous
integration pipelines.
