From Resilient to Antifragile Chaos Engineering Primer
- 22. @Sergiu_Bodiu
Hypothesize
> sudo watch
• Start with steady state behavior.
• Monitor metrics that are visible
• Capture an interaction between the users and the system.
TIP: Utilisation is Virtually Useless as a Metric!
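A steady-state hypothesis can be written as a simple check on a user-facing metric. The sketch below is only an illustration: the metric (successful checkouts per second), the sample values, and the 5% tolerance are assumptions, not figures from the talk.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.05):
    """Return True if the observed mean stays within `tolerance`
    (a fraction of the baseline mean) of the baseline mean."""
    base = mean(baseline)
    return abs(mean(observed) - base) <= tolerance * base


# Hypothetical samples of a user-facing metric (checkouts per second).
baseline = [100, 102, 98, 101, 99]
during_experiment = [97, 99, 96, 98, 100]

if steady_state_holds(baseline, during_experiment):
    print("hypothesis holds: steady state preserved")
else:
    print("hypothesis rejected: investigate")
```

The check compares a business-level signal rather than CPU or memory utilisation, which is the point of the tip above.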
- 23. @Sergiu_Bodiu
Vary Events
> sudo halt
• Terminate virtual machine instances
• Inject latency into requests between services
• Fail requests between services
• Fail an internal microservice
• Make an entire region unavailable
TIP: Select only a subset of users
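One way to limit an experiment to a subset of users is to hash the user ID into a bucket and inject latency or failures only for the chosen buckets. This is a minimal sketch; `in_experiment`, `call_with_chaos`, and `InjectedFailure` are hypothetical names introduced here, not part of any chaos tool.

```python
import hashlib
import time


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


def in_experiment(user_id, percent):
    """Deterministically place roughly `percent`% of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def call_with_chaos(user_id, func, percent=5, latency_s=0.0, fail=False):
    """Call `func`, adding latency or a failure only for users in the experiment."""
    if in_experiment(user_id, percent):
        if latency_s > 0:
            time.sleep(latency_s)  # simulate a slow dependency
        if fail:
            raise InjectedFailure("injected failure for " + user_id)
    return func()
```

Hashing keeps the assignment stable, so the same user sees consistent behavior for the whole experiment.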
- 24. @Sergiu_Bodiu
Experiment
• End-to-end testing (expensive)
• Process is slow
• Configuration drift from production
• 92% of catastrophic failures stem from mishandled non-fatal errors (simple testing can catch many of them)
TIP: Customers don't behave like your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf
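The "simple testing" idea can be illustrated by forcing an error path and checking that the handler does something sensible. The function and the failure below are hypothetical examples, not taken from the cited paper.

```python
def read_config(load, default):
    """Return the loaded config, falling back to `default` if loading fails."""
    try:
        return load()
    except OSError:
        # The error-handling branch is the code that most needs a test.
        return default


def failing_load():
    """Stand-in for a loader whose disk or network call fails."""
    raise OSError("disk unavailable")


# Exercise the error path directly instead of waiting for it in production.
print(read_config(failing_load, {"timeout_s": 5}))
```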
- 25. @Sergiu_Bodiu
Automate
> sudo while (1)
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
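Running experiments continuously at different rates can be sketched as a small scheduler that tracks when each experiment last ran. The experiment names and intervals here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    interval_s: float                  # how often this experiment should run
    last_run: float = float("-inf")    # never run yet


def due_experiments(experiments, now):
    """Return the experiments whose interval has elapsed and mark them as run."""
    due = []
    for exp in experiments:
        if now - exp.last_run >= exp.interval_s:
            exp.last_run = now
            due.append(exp)
    return due


# Hypothetical rates: cheap experiments run often, disruptive ones rarely.
experiments = [
    Experiment("terminate-instance", interval_s=60),
    Experiment("region-failover", interval_s=3600),
]
```

A driver loop would call `due_experiments(experiments, time.time())` periodically and run whatever it returns.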
- 26. @Sergiu_Bodiu
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare the measured impact with the expected impact, and correct any problems uncovered this way.
Chaos Engineering Whitepaper 2016
- 31. @Sergiu_Bodiu
Lessons Learned
• Systematic approach to Chaos Testing
• This is incredibly hard under pressure.
• Don’t wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture last minute is extremely
dangerous.
• Join the community
• Build relationships with the networking team, database team, third-party partners, vendors, etc.
• Make everything Asynchronous (Embrace Failure,
Background Tasks, Retry, Idempotence)
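Retry and idempotence work together: a retried request is only safe if repeating it has no extra effect. The sketch below assumes a hypothetical `PaymentService` that deduplicates requests by an idempotency key; the names and the backoff values are illustrative.

```python
import time


def retry(func, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call `func` until it succeeds, with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)


class PaymentService:
    """Hypothetical service that applies each request at most once."""

    def __init__(self):
        self.processed = {}

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key not in self.processed:
            self.processed[idempotency_key] = amount
        return self.processed[idempotency_key]
```

If the response to the first `charge` call is lost and the caller retries with the same key, the customer is still charged only once.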