From Resilient to Antifragile Chaos Engineering Primer

By @Sergiu_Bodiu
Solution Architect

DevOpsDays Singapore Conference
Singapore Spring User Group

What is an ARCHITECT?
@patkua https://www.thekua.com/atwork/2016/11/the-well-rounded-architect/

Risk management
The new normal: from RESILIENT to ANTIFRAGILE

A new way to look at organizations
Fragile: At risk of total failure / financial ruin.
Resilient: Takes damage, avoids total failure, recovers.
Robust: Absorbs uncertainty, repels blows, avoids damage.
Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.

Blueprint for living in a Black Swan world.
Antifragile, and only the Antifragile, will make it.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

Software is a Single Point of Failure
Root Cause Analysis: Component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and thus guarded with extra redundancies. Software is not, which makes it the single point of failure.

Distributed Systems Complexity
Complexity is like Addiction…
Case study: How complexity creeps in, by @jasonfried
https://m.signalvnoise.com/case-study-how-complexity-creeps-in-cba48023e6a1

Chaos Engineering
The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
NETFLIX
http://principlesofchaos.org

Some outages in the Region
• SingTel fined a record $6m for Bukit Panjang exchange fire
• Telstra goes down again, people can't drink beer or catch Ubers
• Amazon Web Services outage causes Australian website chaos

Backups
"Backups always succeed. It's the restores that fail. Test your backups by practicing restores!"
Using Chaos Monkey
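
A restore drill along the lines of the quote above could look like the sketch below. It is a minimal illustration using only the Python standard library, not a tool from the talk: the helper names (`backup`, `restore_matches`, `run_restore_drill`) and the sample file `orders.csv` are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source_dir: Path, archive_base: Path) -> Path:
    """Archive source_dir as a gzipped tarball and return the archive path."""
    archive = shutil.make_archive(str(archive_base), "gztar", root_dir=str(source_dir))
    return Path(archive)


def restore_matches(source_dir: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare every file."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        for original in source_dir.rglob("*"):
            if not original.is_file():
                continue
            restored = Path(scratch) / original.relative_to(source_dir)
            # A missing or different file means the restore failed the drill.
            if not restored.is_file() or sha256(restored) != sha256(original):
                return False
    return True


def run_restore_drill() -> bool:
    """Create sample data (hypothetical), back it up, and verify the restore."""
    with tempfile.TemporaryDirectory() as work:
        data = Path(work) / "data"
        data.mkdir()
        (data / "orders.csv").write_text("id,total\n1,9.90\n")
        archive = backup(data, Path(work) / "backup")
        return restore_matches(data, archive)


if __name__ == "__main__":
    print("restore drill passed:", run_restore_drill())
```

The drill only counts as passed when the restored files are byte-identical to the originals. A backup job that merely exits successfully proves nothing about the restore.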

Netflix Simian Army
Suite of tools for keeping your cloud operating in top form.
https://github.com/Netflix/SimianArmy

Chaos Monkey
1. Active during normal working hours
2. Break things in production
3. Design better software services
4. Embracing failure
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
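
The first two points can be sketched in a few lines. This is not Netflix's implementation and does not call any cloud API: the working-hours window, the instance names, and the `terminate` callback are all assumptions, and the callback stands in for whatever actually stops an instance on your platform.

```python
import random
from datetime import datetime, time
from typing import Callable, Optional, Sequence


def in_working_hours(now: datetime,
                     start: time = time(9, 0),
                     end: time = time(15, 0)) -> bool:
    """True on weekdays inside the (hypothetical) window when engineers are around."""
    return now.weekday() < 5 and start <= now.time() < end


def unleash(instances: Sequence[str],
            terminate: Callable[[str], None],
            now: datetime,
            rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen instance, but only during working hours.

    Returns the victim's name, or None when nothing was terminated.
    """
    if not instances or not in_working_hours(now):
        return None
    victim = rng.choice(list(instances))
    terminate(victim)
    return victim


if __name__ == "__main__":
    killed = []
    victim = unleash(["web-1", "web-2", "web-3"], killed.append,
                     datetime.now(), random.Random())
    print("terminated:", victim)
```

Restricting terminations to working hours means failures surface when people are available to respond and learn from them.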

Security Monkey
https://github.com/Netflix/security_monkey
Monitors AWS and GCP accounts for policy changes and alerts on insecure configurations.
Security Monkey can be extended with custom account types, custom watchers, custom auditors, and custom alerters.

Other Monkeys
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Doctor Monkey

PRINCIPLES > TOOLS
Why do we do > What we do

Dejirafication
Alexey Krivitsky https://www.slideshare.net/krivitsky/dejirafication-clean-your-process

Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
Chaos Engineering Whitepaper 2016

Hypothesize
> sudo watch
• Start with steady state behavior.
• Monitor metrics that are visible.
• Capture an interaction between the users and the system.
TIP: Utilisation is virtually useless as a metric!
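
One way to write a steady-state hypothesis down is as a check on user-visible metrics instead of utilisation. The sketch below assumes two such metrics, request success rate and 95th-percentile latency. The thresholds, the `SteadyState` type, and the sample data are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """Hypothesis: what 'normal' looks like from the user's point of view."""
    min_success_rate: float      # fraction of requests that must not be 5xx
    max_p95_latency_ms: float    # 95th-percentile latency ceiling


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def holds(hypothesis: SteadyState,
          statuses: Sequence[int],
          latencies_ms: Sequence[float]) -> bool:
    """Return True when the observed traffic is consistent with the hypothesis."""
    if not statuses or not latencies_ms:
        raise ValueError("need at least one observed request")
    success_rate = sum(1 for s in statuses if s < 500) / len(statuses)
    return (success_rate >= hypothesis.min_success_rate
            and p95(latencies_ms) <= hypothesis.max_p95_latency_ms)


if __name__ == "__main__":
    hypothesis = SteadyState(min_success_rate=0.9, max_p95_latency_ms=200.0)
    statuses = [200] * 19 + [503]
    latencies = [10.0 * i for i in range(1, 21)]
    print("steady state holds:", holds(hypothesis, statuses, latencies))
```

The same check can be run before, during, and after an experiment, so a deviation is attributable to the injected event.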

Vary Events
> sudo halt
• Terminate virtual machine instances
• Inject latency into requests between services
• Fail requests between services
• Fail an internal microservice
• Make an entire region unavailable
TIP: Select only a subset of users
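
Latency and failure injection for a subset of users can be sketched as a wrapper around a service call. Everything named here (`in_experiment`, `with_chaos`, `InjectedFailure`, the hashing scheme for picking users) is a hypothetical illustration, not an API of any tool mentioned in the talk.

```python
import hashlib
import time
from typing import Any, Callable


class InjectedFailure(RuntimeError):
    """Raised on purpose by the experiment, never by the real service."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def with_chaos(call: Callable[..., Any],
               user_id: str,
               percent: int,
               latency_s: float = 0.0,
               fail: bool = False,
               sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap `call` so that selected users see extra latency and/or a failure."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if in_experiment(user_id, percent):
            if latency_s > 0:
                sleep(latency_s)          # inject latency before the real call
            if fail:
                raise InjectedFailure(f"injected failure for user {user_id}")
        return call(*args, **kwargs)
    return wrapped


if __name__ == "__main__":
    def get_price(sku: str) -> float:
        return 9.90

    chaotic = with_chaos(get_price, user_id="alice", percent=5, latency_s=0.2)
    print(chaotic("sku-1"))
```

Hashing the user id keeps a given user consistently inside or outside the experiment, which limits the blast radius and makes the affected cohort easy to compare against everyone else.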

Experiment
• End-to-end TESTING (Expensive)
• Process is slow
• Configuration drift from Production
• 92% of ERRORS could be prevented (Simple)
TIP: Customers don't behave like your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf

Automate
> sudo while (1)
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
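
The tip about per-experiment rates can be illustrated with a small scheduler in which every experiment carries its own interval. This is a sketch under assumptions: `ScheduledExperiment`, `tick`, the experiment names, and the intervals are invented for the example, and a real setup would call `tick` from a loop or a cron-like job.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScheduledExperiment:
    name: str
    interval_s: float                 # how often this experiment should run
    run: Callable[[], bool]           # returns True when steady state held
    next_due_s: float = 0.0           # timestamp of the next scheduled run


def tick(experiments: List[ScheduledExperiment], now_s: float) -> List[str]:
    """Run every experiment that is due and return the names of those that failed."""
    failed = []
    for experiment in experiments:
        if now_s >= experiment.next_due_s:
            if not experiment.run():
                failed.append(experiment.name)
            experiment.next_due_s = now_s + experiment.interval_s
    return failed


if __name__ == "__main__":
    experiments = [
        # Cheap experiment, run every minute (hypothetical rate).
        ScheduledExperiment("kill-one-instance", 60.0, lambda: True),
        # Expensive experiment, run once an hour (hypothetical rate).
        ScheduledExperiment("region-failover", 3600.0, lambda: True),
    ]
    for now in (0.0, 60.0, 120.0):
        print(now, tick(experiments, now))
```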

Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
Chaos Engineering Whitepaper 2016

Reference Architecture for Cloud Native Platform
https://content.pivotal.io/white-papers/the-upside-down-economics-of-building-your-own-platform

Pivotal Cloud Foundry

Chaos Lemur demo
Chaos Lemur = Chaos Monkey + PCF

Locust demo
Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user tests with sessions and cookies
• Scales out to multiple worker nodes to increase load capacity
• Allows for distributed user paths based on percentages
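
The last point, splitting simulated users across paths by percentage, is what Locust expresses through task weights. The sketch below deliberately does not use the Locust API. It shows only the underlying idea with the standard library, and the path names and weights are made up for the example.

```python
import random
from collections import Counter
from typing import Dict


def choose_path(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick one user path with probability proportional to its weight."""
    paths = list(weights)
    return rng.choices(paths, weights=[weights[p] for p in paths], k=1)[0]


def simulate(weights: Dict[str, int], users: int, seed: int = 0) -> Counter:
    """Assign `users` simulated users to paths and count how many took each."""
    rng = random.Random(seed)
    return Counter(choose_path(weights, rng) for _ in range(users))


if __name__ == "__main__":
    # Hypothetical traffic mix: most users browse, few check out.
    mix = {"browse": 70, "search": 20, "checkout": 10}
    print(simulate(mix, users=10_000))
```

Defining the mix in code keeps the load profile reviewable and easy to adjust as real traffic patterns change.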

Gatling is an open-source Scala load testing framework
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL

Lessons Learned
• Systematic approach to Chaos Testing
• This is incredibly hard under pressure.
• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture at the last minute is extremely dangerous.
• Join the community
• Build relationships with the Networking Team, Database Team, Third Party Partners, Vendors, etc.
• Make everything Asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence)
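
Retry and idempotence belong together: a retry is only safe if repeating the operation has no extra effect. The sketch below assumes a hypothetical `PaymentService` that deduplicates by idempotency key and a `retry` helper with exponential backoff. Neither comes from the talk.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure worth retrying, e.g. a timeout or a lost response."""


def retry(operation: Callable[[], T],
          attempts: int = 3,
          base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on TransientError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}

    def charge(self, idempotency_key: str, amount: int) -> int:
        # A repeated key returns the first recorded charge instead of charging again.
        return self.charges.setdefault(idempotency_key, amount)


if __name__ == "__main__":
    service = PaymentService()
    print(retry(lambda: service.charge("order-42", 100)))
```

Without the idempotency key, a response lost after a successful charge would turn every retry into a duplicate charge.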

The importance of reliability
Don't trust claims systems make about themselves & their dependencies.
Verify by breaking.

Clean your process
Culture > Principles > Tools
> Post Mortem
> sudo halt
[Figure labels: Incident Start, Impact]

Testing Pyramid
https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/

Further Reading
https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service
https://www.infoq.com/articles/chaos-engineering
@Ops_Engineering https://www.youtube.com/watch?v=CZ3wIuvmHeM
@caseyrosenthal https://www.youtube.com/watch?v=Q4nniyAarbs
Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures

Thank You
@sergiu_bodiu
Questions

Principles
Any developer building applications which run as a service. Ops engineers who deploy or manage such applications.
https://12factor.net
Anyone working in software who writes tests or maintains continuous integration pipelines.
http://www.10factor.ci
