Designing Apps for Resiliency
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
Agenda
• What is ’resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist
What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to
function. It's not about avoiding failures, but responding to failures in
a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a
healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents:
Non-transient, wide-scale failures, such as service disruption that affects an
entire region.
Why is it so important?
• More transient faults in the cloud
• Dependent services may go down
• SLA < 100% means something could go wrong at some point
• More focus on MTTR than on MTBF
Process to improve resiliency
Plan → Define requirements
Design → Identify failures
Implement → Implement recovery strategies
Test → Inject failures, simulate failover
Deploy → Deploy apps in a reliable manner
Monitor → Monitor failures
Respond → Take actions to fix issues
Defining resiliency requirements
Timeline: Data backup … Data backup … Data backup → Major incident occurs → Service recovered → Business recovered
RPO (Recovery Point Objective): the maximum time period in which data might be lost, i.e. from the last data backup to the incident
RTO (Recovery Time Objective): the duration of time in which the service must be restored after an incident
MTO (Maximum Tolerable Outage): how long the business process can be down, i.e. from the incident until the business is recovered
SLA (Service Level Agreement)
Composite SLA
Web app (99.95%) calling a database (99.99%): 99.95% × 99.99% = 99.94%
Fallback action: return data from a local cache when the database is unavailable
1.0 − (0.0001 × 0.001) = 99.99999% (the database and the cache must both be down at the same time)
Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA
1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
(1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 = 0.999899
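The composite numbers above can be checked with a few lines of arithmetic. A minimal sketch in Python (the helper names are just for illustration):

```python
# Composite SLA arithmetic: treat each SLA as an independent availability probability.
def series(*slas):
    """All services must be up: multiply the availabilities."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(*slas):
    """At least one path must be up: 1 minus the product of the failure probabilities."""
    failure = 1.0
    for s in slas:
        failure *= (1.0 - s)
    return 1.0 - failure

print(series(0.9995, 0.9999))                    # ~0.9994    web app + database
print(1.0 - (0.0001 * 0.001))                    # ~0.9999999 database with cache fallback
print(series(parallel(0.9995, 0.9995), 0.9999))  # ~0.9999    two regions behind Traffic Manager
```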
Designing for resiliency
Example failures: reading data from SQL Server fails; a web server goes down; an NVA (network virtual appliance) goes down
1. Identify possible failures
2. Rate the risk of each failure (impact × likelihood)
3. Design a resiliency strategy:
- Detection
- Recovery
- Diagnostics
Failure mode analysis
https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-failure-mode-analysis/
Rack awareness
[Diagram: web tier, middle tier, and data tier, each in its own availability set; the VMs in each availability set are spread across fault domains 1–3. In the data tier, Replica #1 and Replica #2 of Shard #1 and Shard #2 are placed in different fault domains.]
Load balance multiple instances
Application gateway for
- L7 routing
- SSL termination
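The speaker notes for this slide mention exposing a health endpoint so the load balancer's health probe can check all critical components. A minimal sketch of such an endpoint (the /health path, port, and checks are illustrative assumptions, not from the deck):

```python
# Minimal health-endpoint sketch for a load balancer probe.
# The /health path, port, and dependency checks below are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_healthy():
    # Check critical components here: database connection, cache, disk space, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and dependencies_healthy():
            self.send_response(200)   # probe succeeds, instance stays in rotation
        else:
            self.send_response(503)   # probe fails, load balancer removes the instance
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```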
Failover / Failback
Traffic Manager, priority routing method
Primary region: Web → Application → Data
Secondary region (regional pair): Web → Application → Data
Automated failover / manual failback
[Diagram: each region contains multiple web and application instances and a data tier]
Data replication Azure storage
Geo replica (RA-GRS)
LocationMode = PrimaryThenSecondary / LocationMode = SecondaryOnly
Periodically check whether the primary is back online
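The speaker notes describe switching reads to SecondaryOnly when primary failures are frequent, then probing the primary and switching back when it recovers. A rough illustration of that idea (this is not the Azure Storage SDK; the client object and its methods are hypothetical stand-ins):

```python
# Illustrative sketch only: switch the read location after repeated primary failures,
# then periodically probe the primary and switch back when it recovers.
# `client` and its methods are hypothetical stand-ins, not the Azure Storage SDK.
import time

PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
SECONDARY_ONLY = "SecondaryOnly"

class GeoRedundantReader:
    def __init__(self, client, failure_threshold=3, probe_interval=60.0):
        self.client = client
        self.mode = PRIMARY_THEN_SECONDARY
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.last_probe = 0.0

    def read(self, key):
        if self.mode == SECONDARY_ONLY:
            self._maybe_probe_primary()
        try:
            value = self.client.read(key, location_mode=self.mode)
            self.failures = 0
            return value
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.mode = SECONDARY_ONLY   # stop hitting the failing primary
            raise

    def _maybe_probe_primary(self):
        now = time.time()
        if now - self.last_probe >= self.probe_interval:
            self.last_probe = now
            if self.client.primary_is_reachable():
                self.mode = PRIMARY_THEN_SECONDARY
                self.failures = 0
```

As the notes point out, this strategy applies to reads, not writes.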
Retry transient failures
See ‘Azure retry guidance’ for more details
Keep the total retry time within the end-to-end (E2E) latency requirement
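The speaker notes recommend exponential back-off with randomized intervals for non-interactive work, and warn against cascading retries and fixed intervals. A minimal sketch, with arbitrary delays and a hypothetical exception type:

```python
# Minimal retry sketch: exponential backoff with jitter for transient failures.
# The operation, delay values, and exception type are illustrative assumptions.
import random
import time

class TransientError(Exception):
    pass

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                          # give up and surface the failure
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay * random.uniform(0.5, 1.5))       # randomize to avoid retry storms
```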
Circuit Breaker
[Diagram: user → your application → remote service; the remote service has failed]
Retrying the failed operation holds resources while the retry is in progress, which can lead to cascading failures.
https://github.com/App-vNext/Polly
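Polly (linked above) provides a production-ready circuit breaker for .NET. As a language-neutral illustration of the pattern, here is a minimal sketch with arbitrary thresholds:

```python
# Minimal circuit-breaker sketch: fail fast after repeated failures, then allow a
# single trial call after a cool-down period. Thresholds are arbitrary assumptions.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; remote service marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call (half-open state).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # open the circuit
            raise
        self.failures = 0
        self.opened_at = None                  # close the circuit on success
        return result
```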
Bulkhead
[Diagram: Service A, Service B, and Service C each called through its own thread pool; Workload 1 and Workload 2 isolated with separate thread pools instead of sharing one]
Resources to isolate: memory, CPU, disk, thread pools, connection pools, network connections
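A minimal bulkhead sketch in Python, giving each downstream service its own bounded thread pool so one slow dependency cannot exhaust the threads used by the others (pool sizes, service names, and the timeout are illustrative assumptions):

```python
# Minimal bulkhead sketch: a dedicated, bounded thread pool per downstream service.
# Pool sizes, service names, and the timeout are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

pools = {
    "service_a": ThreadPoolExecutor(max_workers=10),
    "service_b": ThreadPoolExecutor(max_workers=10),
    "service_c": ThreadPoolExecutor(max_workers=5),
}

def call_isolated(service_name, operation, *args):
    """Run the call on the pool dedicated to that service, bounding how long we wait."""
    future = pools[service_name].submit(operation, *args)
    return future.result(timeout=5)
```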
Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election
See ‘Cloud design patterns’
Principles of chaos engineering
• Build hypothesis around steady state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run consistently
http://principlesofchaos.org/
[Diagram: production traffic is fed to both a control group and an experimental group; HW/SW failures and spikes in traffic are injected into the experimental group, and the difference between the groups is verified in terms of steady state]
Testing for resiliency
• Fault injection testing
• Shut down VM instances
• Crash processes
• Expire certificates
• Change access keys
• Shut down the DNS service on domain controllers
• Limit available system resources, such as RAM or number of threads
• Unmount disks
• Redeploy a VM
• Load testing
• Use production data as much as you can
• VSTS, JMeter
• Soak testing
• Longer period under normal production load
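As one concrete example of the "crash processes" item above, a fault-injection step might abruptly kill a named process and let monitoring show whether the system recovers. A hedged sketch; the target process name is an assumption:

```python
# Minimal fault-injection sketch: abruptly kill a named process ("crash processes")
# and observe whether the system recovers. The target name is an illustrative assumption.
import subprocess

def crash_process(name):
    # Find matching PIDs with pgrep, then send SIGKILL to simulate a hard crash.
    result = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
    for pid in result.stdout.split():
        subprocess.run(["kill", "-9", pid])

crash_process("my-worker-service")   # hypothetical service name
```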
Blue/Green and Canary release
Blue/Green deployment: the current and new versions (Web → App → DB) run in two identical environments; a load balancer switches traffic from one to the other.
Canary release: a load balancer or reverse proxy sends 90% of traffic to the current version and 10% to the new version, then shifts the split incrementally.
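A toy sketch of the canary traffic split (the 90/10 weights and backend addresses are illustrative; in practice the split is configured in the load balancer or reverse proxy):

```python
# Toy canary-routing sketch: pick a backend per request with a 90/10 weighting.
# Weights and backend addresses are illustrative assumptions.
import random

BACKENDS = [
    ("http://current-version.internal", 0.9),   # current version: ~90% of traffic
    ("http://new-version.internal", 0.1),       # canary: ~10% of traffic
]

def pick_backend():
    urls, weights = zip(*BACKENDS)
    return random.choices(urls, weights=weights, k=1)[0]
```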
Deployment slots at App Service
Dark launching
Deploy the new feature to the production environment, with a toggle to enable/disable it in the user interface.
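A minimal feature-toggle sketch for dark launching (flag storage, the flag name, and the page functions are illustrative assumptions):

```python
# Minimal dark-launch sketch: the new code path is deployed to production but stays
# behind a flag until it is enabled for users. Names here are illustrative assumptions.
FLAGS = {"new_checkout_flow": False}   # disabled by default

def is_enabled(flag_name):
    return FLAGS.get(flag_name, False)

def current_checkout_page(user):
    return f"classic checkout for {user}"

def new_checkout_page(user):
    return f"new checkout for {user}"

def render_checkout(user):
    if is_enabled("new_checkout_flow"):
        return new_checkout_page(user)   # dark-launched path
    return current_checkout_page(user)   # existing behavior for everyone else
```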
Resiliency checklist
• https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-checklist/
Other resources
http://docs.microsoft.com/Azure
Resiliency / High Availability / Disaster Recovery
Throttling
Circuit breaker
Zero downtime deployment
Eventual consistency
Data restore
Retry
Graceful degradation
Geo-replica
Multi-region deployment

Editor's Notes

  1. Everybody is talking about it, but its definition is not clear. I'll clarify what it means. Why is everybody talking about it? There are a number of reasons. The main part of this topic is how to make your app resilient. I'll show you some examples from the checklist.
  2. DR? Data backup? These are all true statements, but none of them clearly defines what resiliency means. In order to be HA, an app doesn't need to go down and come back online. If your app is running with 100% uptime without any failures, it's HA, but you never know if it's resilient. Once something bad happens, it may take days to come back online, which is not really resilient at all. DR needs to address a catastrophic failure, such as something that could take down an entire DC. For example…
  3. Why is it so important? Why is everybody talking about resiliency? Transient faults happen because of commodity HW, networking, and the multi-tenant shared model. Remote services could go down at any time. 99.99% means 4 minutes of downtime a month. Do you want to sit down and wait for 4 minutes, or do something else? I'd rather do something, because you never know whether it's going to be 4 minutes or 4 hours. Based on the assumption that anything can go wrong at some point, the focus has been shifting from MTBF to MTTR.
  4. We're getting into the more interesting part. We discussed what resiliency means and why it's so important. Now we're getting into the 'how' part. This is the process to improve resiliency in your system, in 7 steps from plan to respond. Let's talk about each step. Clearly define your requirements, otherwise you don't know what you're aiming for. Identify all possible failures you may see, and implement recovery strategies to bounce back from these failures. To make sure these strategies work, you need to test them by injecting failures. Deployment needs to be resilient too, because deploying a new version is the most common cause of failures. Monitoring is key to QoS: monitor errors, latency, throughput etc. in percentiles. You need to take action quickly to mitigate the downtime.
  5. There are two common requirements when it comes to resiliency. RPO: defines the interval of data backups. RTO: defines the requirements for hot/warm/cold stand-by. MTO: how long a particular business process can be down.
  6. If you look at well-experienced customers, they define availability requirements per use case. Decompose your workload and define availability requirements (uptime, latency etc.) for each. A higher SLA comes with cost because of redundant services/components. Measuring downtime will become an issue when you target five nines.
  7. The fact that App Service offers 99.95% doesn't mean that the entire system has 99.95%. Another important fact is that an SLA doesn't guarantee that the service is always up 99.95% of the time. You'll get money back when it violates the SLA. It's not just a numbers game; this is where resiliency comes into play. The SLA is not guaranteed: if we don't meet the SLA, you get money back. The definition of the SLA varies depending on the service.
  8. In order to design your app to be resilient, you need to identify all possible failures first, then implement resiliency strategies against them.
  9. To help you identify all possible failures, we published a list of the most common failures on Azure. It has a few items per service, 30 to 40 items in total. Let's take a look. In the case of DocumentDB, when you fail to read data from it, the client SDK retries the operation for you. The only transient fault it retries against is throttling (429). If you constantly get 429, consider increasing its scale (RU). DocumentDB now supports geo-replicas: if the primary region fails, it will switch traffic to the other regions in the list you configure. For diagnostics, you need to log all errors on the client side.
  10. You can think of a rack as a power module. If it goes down, everything belonging to it goes down all together. So it's better to distribute VMs across different racks for redundancy's sake. This is where availability sets come into play. Each machine in the same AS belongs to a different rack. VMSS automatically puts VMs in 5 FDs and 5 UDs, but it doesn't support data disks yet.
  11. Avoiding SPOF is critical for resiliency. Many customers still don't know these basics; they deploy critical workloads on a single machine. To avoid that, you need redundant components: one goes down, but the others are still running. In this case, put the VMs in the same tier into the same availability set behind an LB. The LB distributes requests to the VMs in the backend address pool. The health probe can be either HTTP or TCP depending on the workload; by default it pings the root path '/'. You may want to expose a health endpoint to monitor all critical components.
  12. There's a risk of data loss during failover, so take a snapshot and ensure data integrity.
  13. If the transient faults are less frequent, set the property to PrimaryThenSecondary; it'll switch to the secondary region for you. If the faults are more frequent or non-transient, set the property to SecondaryOnly, otherwise it keeps hitting and getting errors from the primary. You need to monitor the primary region; when it comes back, set the property back to PrimaryOnly or PrimaryThenSecondary. One thing to notice is that Azure Storage won't fail over to the secondary until a region-wide disaster happens, which I don't think we have had yet. This strategy is applicable for reads, not writes.
  14. Let's take a look at a few resiliency strategies to recover from the failures you identified above. Exponential back-off for non-interactive transactions; quick linear retry for interactive transactions. Anti-patterns: cascading retries (5x5 = 25), more than one immediate retry, many attempts with a regular interval (randomize the interval).
  15. People often say "don't waste your time, let's circuit-break and fail fast." That is only part of the problem. The real issue is the cascading failures. Also, by continuing to retry failed operations, the remote service can't recover from its failed state.
  16. The types of resources to isolate are not limited to these, but they are the most common ones.
  17. Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere, so it makes sense to follow chaos engineering principles. Define the steady state as the measurable output of a system, rather than internal attributes of the system. Introduce real-world chaotic events such as HW failures, SW failures, spikes in traffic etc. The best way to validate the system at production scale is to run the experiment in production. Netflix, at least once a month, injects faults in one of their regions to see if their system can keep up and running. Since it's such a time-consuming task, you should automate the experiments and run them continuously. Chaos engineering is not testing, it's validation of the system. https://www.youtube.com/watch?v=Q4nniyAarbs
  18. Tools: Chaos Monkey/Kong, Toxiproxy. https://en.wikipedia.org/wiki/Soak_testing
  19. Deploy the current and new versions into two identical environments (blue, green). Do a smoke test on the new version, then switch traffic to it. Canary release incrementally switches from the current version to the new one using an LB. Use Akamai or an equivalent to do canary releases. The unique name for this environment comes from a tactic used by coal miners: they'd bring canaries with them into the coal mines to monitor the levels of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas in the air was high, and they'd leave the mines. In either case you should be able to roll back if the new version doesn't work. Graceful shutdown and switching the DB/storage are the challenges. GitHub routes requests to blue and green and compares the results from both, making sure they are identical. Dark launch: deploy new features without enabling them for users; make sure they won't cause any issues in production, then enable them.
  20. This is how it works in App Service. You can have up to 15 deployment slots
  21. Deploy a new feature to the prod environment without enabling it for users. Make sure it works within the prod infrastructure: no memory leaks, nothing. Then enable it for users in the UI. If something bad happens, disable it in the UI. Facebook does this.
  22. All the other proven practices are in this doc. You can use this list when you have an ADR with your customers. Give us feedback.