Designing Apps for Resiliency
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
Agenda
• What is ’resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist
What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to
function. It's not about avoiding failures, but responding to failures in
a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a
healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents:
Non-transient, wide-scale failures, such as service disruption that affects an
entire region.
Why is it so important?
• More transient faults in the cloud
• Dependent services may go down
• SLA < 100% means something could go wrong at some point
• More focus on MTTR than on MTBF
Process to improve resiliency
Plan → Define requirements
Design → Identify failures
Implement → Implement recovery strategies
Test → Inject failures, simulate failover
Deploy → Deploy apps in a reliable manner
Monitor → Monitor failures
Respond → Take actions to fix issues
Defining resiliency requirements
Timeline: Data backup … Data backup … Data backup → Major incident occurs → Service recovered → Business recovered
RPO (Recovery Point Objective): the maximum time period in which data might be lost, i.e. from the last data backup to the incident
RTO (Recovery Time Objective): the duration of time in which the service must be restored after an incident
MTO (Maximum Tolerable Outage): how long the business process can be down, i.e. from the incident until the business is recovered
SLA (Service Level Agreement)
Composite SLA
Web app (99.95%) calling a database (99.99%): 99.95% × 99.99% = 99.94%
Fallback action: return data from a local cache when the database is unavailable
1.0 − (0.0001 × 0.001) = 99.99999% (the database and the cache must both be down at the same time)
Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA
1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
(1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 = 0.999899
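The composite numbers above can be checked with a few lines of arithmetic. A minimal sketch in Python (the helper names are just for illustration):

```python
# Composite SLA arithmetic: treat each SLA as an independent availability probability.
def series(*slas):
    """All services must be up: multiply the availabilities."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(*slas):
    """At least one path must be up: 1 minus the product of the failure probabilities."""
    failure = 1.0
    for s in slas:
        failure *= (1.0 - s)
    return 1.0 - failure

print(series(0.9995, 0.9999))                    # ~0.9994    web app + database
print(1.0 - (0.0001 * 0.001))                    # ~0.9999999 database with cache fallback
print(series(parallel(0.9995, 0.9995), 0.9999))  # ~0.9999    two regions behind Traffic Manager
```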
Designing for resiliency
Example failures: reading data from SQL Server fails; a web server goes down; an NVA (network virtual appliance) goes down
1. Identify possible failures
2. Rate the risk of each failure (impact × likelihood)
3. Design a resiliency strategy:
- Detection
- Recovery
- Diagnostics
Failure mode analysis
https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-failure-mode-analysis/
Rack awareness
[Diagram: web tier, middle tier, and data tier, each in its own availability set; the VMs in each availability set are spread across fault domains 1–3. In the data tier, Replica #1 and Replica #2 of Shard #1 and Shard #2 are placed in different fault domains.]
Load balance multiple instances
Application gateway for
- L7 routing
- SSL termination
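The speaker notes for this slide mention exposing a health endpoint so the load balancer's health probe can check all critical components. A minimal sketch of such an endpoint (the /health path, port, and checks are illustrative assumptions, not from the deck):

```python
# Minimal health-endpoint sketch for a load balancer probe.
# The /health path, port, and dependency checks below are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_healthy():
    # Check critical components here: database connection, cache, disk space, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and dependencies_healthy():
            self.send_response(200)   # probe succeeds, instance stays in rotation
        else:
            self.send_response(503)   # probe fails, load balancer removes the instance
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```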
Failover / Failback
Traffic Manager, priority routing method
Primary region: Web → Application → Data
Secondary region (regional pair): Web → Application → Data
Automated failover / manual failback
[Diagram: each region contains multiple web and application instances and a data tier]
Data replication Azure storage
Geo replica (RA-GRS)
LocationMode = PrimaryThenSecondary / LocationMode = SecondaryOnly
Periodically check whether the primary is back online
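The speaker notes describe switching reads to SecondaryOnly when primary failures are frequent, then probing the primary and switching back when it recovers. A rough illustration of that idea (this is not the Azure Storage SDK; the client object and its methods are hypothetical stand-ins):

```python
# Illustrative sketch only: switch the read location after repeated primary failures,
# then periodically probe the primary and switch back when it recovers.
# `client` and its methods are hypothetical stand-ins, not the Azure Storage SDK.
import time

PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
SECONDARY_ONLY = "SecondaryOnly"

class GeoRedundantReader:
    def __init__(self, client, failure_threshold=3, probe_interval=60.0):
        self.client = client
        self.mode = PRIMARY_THEN_SECONDARY
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.last_probe = 0.0

    def read(self, key):
        if self.mode == SECONDARY_ONLY:
            self._maybe_probe_primary()
        try:
            value = self.client.read(key, location_mode=self.mode)
            self.failures = 0
            return value
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.mode = SECONDARY_ONLY   # stop hitting the failing primary
            raise

    def _maybe_probe_primary(self):
        now = time.time()
        if now - self.last_probe >= self.probe_interval:
            self.last_probe = now
            if self.client.primary_is_reachable():
                self.mode = PRIMARY_THEN_SECONDARY
                self.failures = 0
```

As the notes point out, this strategy applies to reads, not writes.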
Retry transient failures
See ‘Azure retry guidance’ for more details
Keep the total retry time within the end-to-end (E2E) latency requirement
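The speaker notes recommend exponential back-off with randomized intervals for non-interactive work, and warn against cascading retries and fixed intervals. A minimal sketch, with arbitrary delays and a hypothetical exception type:

```python
# Minimal retry sketch: exponential backoff with jitter for transient failures.
# The operation, delay values, and exception type are illustrative assumptions.
import random
import time

class TransientError(Exception):
    pass

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                          # give up and surface the failure
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay * random.uniform(0.5, 1.5))       # randomize to avoid retry storms
```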
Circuit Breaker
[Diagram: user → your application → remote service; the remote service has failed]
Retrying the failed operation holds resources while the retry is in progress, which can lead to cascading failures.
https://github.com/App-vNext/Polly
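Polly (linked above) provides a production-ready circuit breaker for .NET. As a language-neutral illustration of the pattern, here is a minimal sketch with arbitrary thresholds:

```python
# Minimal circuit-breaker sketch: fail fast after repeated failures, then allow a
# single trial call after a cool-down period. Thresholds are arbitrary assumptions.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; remote service marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call (half-open state).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # open the circuit
            raise
        self.failures = 0
        self.opened_at = None                  # close the circuit on success
        return result
```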
Bulkhead
[Diagram: Service A, Service B, and Service C each called through its own thread pool; Workload 1 and Workload 2 isolated with separate thread pools instead of sharing one]
Resources to isolate: memory, CPU, disk, thread pools, connection pools, network connections
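A minimal bulkhead sketch in Python, giving each downstream service its own bounded thread pool so one slow dependency cannot exhaust the threads used by the others (pool sizes, service names, and the timeout are illustrative assumptions):

```python
# Minimal bulkhead sketch: a dedicated, bounded thread pool per downstream service.
# Pool sizes, service names, and the timeout are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

pools = {
    "service_a": ThreadPoolExecutor(max_workers=10),
    "service_b": ThreadPoolExecutor(max_workers=10),
    "service_c": ThreadPoolExecutor(max_workers=5),
}

def call_isolated(service_name, operation, *args):
    """Run the call on the pool dedicated to that service, bounding how long we wait."""
    future = pools[service_name].submit(operation, *args)
    return future.result(timeout=5)
```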
Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election
See ‘Cloud design patterns’
Principles of chaos engineering
• Build hypothesis around steady state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run consistently
http://principlesofchaos.org/
[Diagram: production traffic is fed to both a control group and an experimental group; HW/SW failures and spikes in traffic are injected into the experimental group, and the difference between the groups is verified in terms of steady state]
Testing for resiliency
• Fault injection testing
• Shut down VM instances
• Crash processes
• Expire certificates
• Change access keys
• Shut down the DNS service on domain controllers
• Limit available system resources, such as RAM or number of threads
• Unmount disks
• Redeploy a VM
• Load testing
• Use production data as much as you can
• VSTS, JMeter
• Soak testing
• Longer period under normal production load
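As one concrete example of the "crash processes" item above, a fault-injection step might abruptly kill a named process and let monitoring show whether the system recovers. A hedged sketch; the target process name is an assumption:

```python
# Minimal fault-injection sketch: abruptly kill a named process ("crash processes")
# and observe whether the system recovers. The target name is an illustrative assumption.
import subprocess

def crash_process(name):
    # Find matching PIDs with pgrep, then send SIGKILL to simulate a hard crash.
    result = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
    for pid in result.stdout.split():
        subprocess.run(["kill", "-9", pid])

crash_process("my-worker-service")   # hypothetical service name
```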
Blue/Green and Canary release
Blue/Green deployment: the current and new versions (Web → App → DB) run in two identical environments; a load balancer switches traffic from one to the other.
Canary release: a load balancer or reverse proxy sends 90% of traffic to the current version and 10% to the new version, then shifts the split incrementally.
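A toy sketch of the canary traffic split (the 90/10 weights and backend addresses are illustrative; in practice the split is configured in the load balancer or reverse proxy):

```python
# Toy canary-routing sketch: pick a backend per request with a 90/10 weighting.
# Weights and backend addresses are illustrative assumptions.
import random

BACKENDS = [
    ("http://current-version.internal", 0.9),   # current version: ~90% of traffic
    ("http://new-version.internal", 0.1),       # canary: ~10% of traffic
]

def pick_backend():
    urls, weights = zip(*BACKENDS)
    return random.choices(urls, weights=weights, k=1)[0]
```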
Deployment slots at App Service
Dark launching
Deploy the new feature to the production environment, with a toggle to enable/disable it in the user interface.
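A minimal feature-toggle sketch for dark launching (flag storage, the flag name, and the page functions are illustrative assumptions):

```python
# Minimal dark-launch sketch: the new code path is deployed to production but stays
# behind a flag until it is enabled for users. Names here are illustrative assumptions.
FLAGS = {"new_checkout_flow": False}   # disabled by default

def is_enabled(flag_name):
    return FLAGS.get(flag_name, False)

def current_checkout_page(user):
    return f"classic checkout for {user}"

def new_checkout_page(user):
    return f"new checkout for {user}"

def render_checkout(user):
    if is_enabled("new_checkout_flow"):
        return new_checkout_page(user)   # dark-launched path
    return current_checkout_page(user)   # existing behavior for everyone else
```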
Resiliency checklist
• https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-checklist/
Other resources
http://docs.microsoft.com/Azure
Resiliency / High Availability / Disaster Recovery
Throttling
Circuit breaker
Zero downtime deployment
Eventual consistency
Data restore
Retry
Graceful degradation
Geo-replica
Multi-region deployment

Editor's Notes

  1. Everybody is talking about it, but its definition is not clear. I'll clarify what it means. Why is everybody talking about it? There are a number of reasons. The main part of this topic is how to make your app resilient. I'll show you some examples from the checklist.
  2. DR? Data backup? These are all true statements, but none of them clearly defines what resiliency means. In order to be HA, an app doesn't need to go down and come back online. If your app is running with 100% uptime without any failures, it's HA, but you never know if it's resilient. Once something bad happens, it may take days to come back online, which is not really resilient at all. DR needs to address a catastrophic failure, such as something that could take down an entire DC. For example…
  3. Why is it so important? Why is everybody talking about resiliency? Transient faults happen because of commodity HW, networking, and the multi-tenant shared model. Remote services could go down at any time. 99.99% means 4 minutes of downtime a month. Do you want to sit down and wait for 4 minutes, or do something else? I'd rather do something, because you never know whether it's going to be 4 minutes or 4 hours. Based on the assumption that anything can go wrong at some point, the focus has been shifting from MTBF to MTTR.
  4. We're getting into the more interesting part. We discussed what resiliency means and why it's so important. Now we're getting into the 'how' part. This is the process to improve resiliency in your system, in 7 steps from plan to respond. Let's talk about each step. Clearly define your requirements, otherwise you don't know what you're aiming for. Identify all possible failures you may see, and implement recovery strategies to bounce back from these failures. To make sure these strategies work, you need to test them by injecting failures. Deployment needs to be resilient too, because deploying a new version is the most common cause of failures. Monitoring is key to QoS: monitor errors, latency, throughput etc. in percentiles. You need to take action quickly to mitigate the downtime.
  5. There are two common requirements when it comes to resiliency. RPO: defines the interval of data backups. RTO: defines the requirements for hot/warm/cold stand-by. MTO: how long a particular business process can be down.
  6. If you look at well-experienced customers, they define availability requirements per use case. Decompose your workload and define availability requirements (uptime, latency etc.) for each. A higher SLA comes with cost because of redundant services/components. Measuring downtime will become an issue when you target five nines.
  7. The fact that App Service offers 99.95% doesn't mean that the entire system has 99.95%. Another important fact is that an SLA doesn't guarantee that the service is always up 99.95% of the time. You'll get money back when it violates the SLA. It's not just a numbers game; this is where resiliency comes into play. The SLA is not guaranteed: if we don't meet the SLA, you get money back. The definition of the SLA varies depending on the service.
  8. In order to design your app to be resilient, you need to identify all possible failures first, then implement resiliency strategies against them.
  9. To help you identify all possible failures, we published a list of the most common failures on Azure. It has a few items per service, 30 to 40 items in total. Let's take a look. In the case of DocumentDB, when you fail to read data from it, the client SDK retries the operation for you. The only transient fault it retries against is throttling (429). If you constantly get 429, consider increasing its scale (RU). DocumentDB now supports geo-replicas: if the primary region fails, it will switch traffic to the other regions in the list you configure. For diagnostics, you need to log all errors on the client side.
  10. You can think of a rack as a power module. If it goes down, everything belonging to it goes down all together. So it's better to distribute VMs across different racks for redundancy's sake. This is where availability sets come into play. Each machine in the same AS belongs to a different rack. VMSS automatically puts VMs in 5 FDs and 5 UDs, but it doesn't support data disks yet.
  11. Avoiding SPOF is critical for resiliency. Many customers still don't know these basics; they deploy critical workloads on a single machine. To avoid that, you need redundant components: one goes down, but the others are still running. In this case, put the VMs in the same tier into the same availability set behind an LB. The LB distributes requests to the VMs in the backend address pool. The health probe can be either HTTP or TCP depending on the workload; by default it pings the root path '/'. You may want to expose a health endpoint to monitor all critical components.
  12. There's a risk of data loss during failover, so take a snapshot and ensure data integrity.
  13. If the transient faults are less frequent, set the property to PrimaryThenSecondary; it'll switch to the secondary region for you. If the faults are more frequent or non-transient, set the property to SecondaryOnly, otherwise it keeps hitting and getting errors from the primary. You need to monitor the primary region; when it comes back, set the property back to PrimaryOnly or PrimaryThenSecondary. One thing to notice is that Azure Storage won't fail over to the secondary until a region-wide disaster happens, which I don't think we have had yet. This strategy is applicable for reads, not writes.
  14. Let's take a look at a few resiliency strategies to recover from the failures you identified above. Exponential back-off for non-interactive transactions; quick linear retry for interactive transactions. Anti-patterns: cascading retries (5x5 = 25), more than one immediate retry, many attempts with a regular interval (randomize the interval).
  15. People often say "don't waste your time, let's circuit-break and fail fast." That is only part of the problem. The real issue is the cascading failures. Also, by continuing to retry failed operations, the remote service can't recover from its failed state.
  16. The types of resources to isolate are not limited to these, but they are the most common ones.
  17. Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere, so it makes sense to follow chaos engineering principles. Define the steady state as the measurable output of a system, rather than internal attributes of the system. Introduce real-world chaotic events such as HW failures, SW failures, spikes in traffic etc. The best way to validate the system at production scale is to run the experiment in production. Netflix, at least once a month, injects faults in one of their regions to see if their system can keep up and running. Since it's such a time-consuming task, you should automate the experiments and run them continuously. Chaos engineering is not testing, it's validation of the system. https://www.youtube.com/watch?v=Q4nniyAarbs
  18. Tools: Chaos Monkey/Kong, Toxiproxy. https://en.wikipedia.org/wiki/Soak_testing
  19. Deploy the current and new versions into two identical environments (blue, green). Do a smoke test on the new version, then switch traffic to it. Canary release incrementally switches from the current version to the new one using an LB. Use Akamai or an equivalent to do canary releases. The unique name for this environment comes from a tactic used by coal miners: they'd bring canaries with them into the coal mines to monitor the levels of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas in the air was high, and they'd leave the mines. In either case you should be able to roll back if the new version doesn't work. Graceful shutdown and switching the DB/storage are the challenges. GitHub routes requests to blue and green and compares the results from both, making sure they are identical. Dark launch: deploy new features without enabling them for users; make sure they won't cause any issues in production, then enable them.
  20. This is how it works in App Service. You can have up to 15 deployment slots
  21. Deploy a new feature to the prod environment without enabling it for users. Make sure it works within the prod infrastructure: no memory leaks, nothing. Then enable it for users in the UI. If something bad happens, disable it in the UI. Facebook does this.
  22. All the other proven practices are in this doc. You can use this list when you have an ADR with your customers. Give us feedback.