Cloud design principles
- 1. Design principles for Azure applications
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
- 2. Traditional vs. modern applications
Traditional on-premises | Modern cloud
Relational database | Polyglot persistence
Strong consistency | Eventual consistency
Design for predictable scalability | Design for unbounded scalability
Serial and synchronized processing | Parallel and asynchronous processing
Monolithic, centralized | Decomposed, decentralized
Snowflake servers | Immutable infrastructure
Integrated authentication | Federated authentication
Design to keep app running (MTBF) | Design for failure (MTTR)
One-time big update | Frequent small updates
Manual management | Automated self-management
- 5. Design principles for Azure applications
• Use managed services
• Minimize coordination
• Partition around limits
• Design to scale out
• Design for self-healing
• Make all things redundant
• Use the best data store for the job
• Design for evolution
• Design for operations
• Build for the needs of business
- 7. Use managed services
• Managed services reduce management tasks significantly
• Patch, Version, Resource tuning, Cluster management
• Setting up Elasticsearch yourself vs. using Azure Search
• Managed services can be used even in IaaS workloads
• Cache, Messaging, Storage etc.
• If the version, scalability limits, cost, or portability don’t meet your requirements, consider a pure IaaS approach
- 15. Design to scale out
• Avoid instance stickiness
• Find the bottleneck and resolve it instead of blindly scaling up/out
• The stateful part of the system is most likely to become the bottleneck
• Use the built-in auto-scaling feature
• Schedule-based for predictable load, parameter-based for unpredictable load
• Design for scale-in so that in-flight work is not dropped
• Consider aggressive auto-scaling for critical workloads
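The split between schedule-based and parameter-based scaling can be sketched as a small decision function. This is only an illustration, not an Azure API: `SCHEDULE`, `desired_instances`, the CPU thresholds, and the instance limits are all hypothetical values chosen for the example.

```python
from datetime import datetime

# Hypothetical schedule: (start_hour, end_hour) -> minimum instance count.
SCHEDULE = {(8, 18): 4}   # assumed business hours need at least 4 instances
DEFAULT_MIN = 2           # assumed floor outside scheduled windows
MAX_INSTANCES = 10        # assumed upper limit

def scheduled_minimum(now: datetime) -> int:
    """Schedule-based rule: predictable load sets a floor."""
    for (start, end), count in SCHEDULE.items():
        if start <= now.hour < end:
            return count
    return DEFAULT_MIN

def desired_instances(current: int, cpu_percent: float, now: datetime) -> int:
    """Parameter-based rule applied on top of the scheduled floor."""
    target = current
    if cpu_percent > 70:      # scale out one instance under pressure
        target = current + 1
    elif cpu_percent < 30:    # scale in one instance when idle
        target = current - 1
    return max(scheduled_minimum(now), min(MAX_INSTANCES, target))
```

For example, `desired_instances(4, 10.0, datetime(2024, 1, 1, 22))` asks for 3 instances, while the same call at 10:00 stays at the scheduled floor of 4.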
- 18. Design for self-healing
• Retry operations on transient faults
• Protect failing remote services (Circuit breaker)
• Compensate failed transactions
• Bulkhead
• Throttling
• Fallback operations
• Service degradation
• Load leveling
• Leader election
• Fault injection
• Chaos engineering
• Checkpoint long-running transactions so they can restart from where they failed
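As one example from this list, retrying on transient faults might look like the sketch below. It assumes a hypothetical `TransientError` type that marks retryable failures; the attempt count and delays are arbitrary, and a real application would map its own client exceptions onto such a type.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""

def retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so the caller can fall back or compensate
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

Non-transient errors are deliberately not caught, so they surface immediately instead of being retried.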
- 20. Make all things redundant
• Load balancing
• Availability set
• Paired region
• Automatic failover / manual failback
• Synchronize front and backend
• Redundant Traffic Manager
• Geo-replica
• Partition for availability
• Active/active vs. active/passive topology
• Point-in-time backup/restore
• RTO/RPO
- 22. Use the best data store for the job
• Don’t use SQL for everything (monolithic persistence)
• Logging, Blob, Documents
• How to choose the right storage
• Data type, Use case, Others
• Microservices architecture encourages use of polyglot storage
• Each service owns its private data in the best format
• Shift from ACID to BASE transactions
• Eventual consistency
• Compensating transaction
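A compensating transaction can be sketched as a list of steps, each paired with an action that undoes it. The helper name `run_with_compensation` and the step names in the usage note are hypothetical; a production version would also persist progress so that compensation survives a crash.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; undo completed steps if one fails."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo in reverse order; each compensation should be idempotent.
        for compensate in reversed(completed):
            compensate()
        return False
    return True
```

If the third of three steps raises, the compensations for the second and first steps run in that order, and the failed step's own compensation is never invoked because the step did not complete.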
- 24. Design for evolution
• Key for continuous innovation (independent deployment)
• Keep high cohesion and loose coupling
• Capture domain knowledge in one place
• Compose tightly coupled features together
• Use asynchronous messaging to avoid waiting
• Avoid a fat gateway; it should be a dumb pipe
• Expose open, standard interfaces
• Design and test against service contracts
• Abstract infrastructure away from domain logic
• Offload common tasks to a separate service
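Abstracting infrastructure away from domain logic, and testing against a contract, can be illustrated with a structural interface. The names `DeliveryRepository`, `InMemoryDeliveryRepository`, and `complete_delivery` are hypothetical, loosely borrowed from the shipping example later in this deck.

```python
from typing import Protocol

class DeliveryRepository(Protocol):
    """Contract the domain logic depends on (hypothetical)."""
    def save(self, delivery_id: str, status: str) -> None: ...
    def status_of(self, delivery_id: str) -> str: ...

class InMemoryDeliveryRepository:
    """Test double; a real one might wrap a document store or a cache."""
    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, delivery_id: str, status: str) -> None:
        self._rows[delivery_id] = status

    def status_of(self, delivery_id: str) -> str:
        return self._rows[delivery_id]

def complete_delivery(repo: DeliveryRepository, delivery_id: str) -> str:
    """Domain logic knows only the contract, not the storage technology."""
    repo.save(delivery_id, "completed")
    return repo.status_of(delivery_id)
```

Swapping the storage technology then means writing another class that satisfies the same contract, without touching `complete_delivery`.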
- 26. Design for operations
• Make things observable
• Instrument for both monitoring and root cause analysis
• Use distributed tracing and correlation
• Automate management tasks
• Track and version configuration
• (Aggregate logs and metrics)
• Standardize logs and metrics
• Involve operations teams in design and planning
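Correlation can be sketched with the Python standard library alone: a context variable carries the ID for the current request, and a logging filter stamps it on every record. The names `correlation_id`, `CorrelationFilter`, `make_logger`, and `handle_request` are hypothetical, and the log format is only one possible standard.

```python
import contextvars
import logging
import uuid

# Correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

def make_logger(name, stream=None):
    """Build a logger whose lines all start with the correlation ID."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if present, otherwise start a new trace."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        return correlation_id.get()  # forward this ID on outgoing calls
    finally:
        correlation_id.reset(token)
```

Each service that receives the forwarded ID and logs with it makes a single transaction searchable across the aggregated logs.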
- 28. Build for the needs of business
• Functional – DDD, DCA
• Bounded context leads to service boundary
• Context map leads to service dependency
• Aggregates and domain services/events lead to microservices and inter-service communication
• Non-functional - RTO/RPO/MTO, SLO/SLA
• RTO leads to failover period
• RPO leads to backup interval
• SLA leads to choice of services with a level of redundancy
• Throughput/latency leads to choice of SKU with partitioning
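How an SLA target drives the level of redundancy can be worked through with simple availability arithmetic. The sketch below assumes component failures are independent, and the percentages in the usage note are made-up inputs, not published Azure SLAs.

```python
from math import prod

HOURS_PER_MONTH = 730  # assumed average month length

def serial(*availabilities):
    """All components must be up, so availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """At least one of N independent copies must be up."""
    return 1 - (1 - availability) ** copies

def downtime_hours(availability):
    """Expected downtime per month for a given availability."""
    return (1 - availability) * HOURS_PER_MONTH
```

For instance, a 99.95% front end in series with a 99.99% database gives `serial(0.9995, 0.9999)`, about 99.94%, which is lower than either component alone. Deploying that whole stack in two regions, `redundant(serial(0.9995, 0.9999), 2)`, raises the composite figure again, at the cost of running everything twice.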
- 29. Traceability from business to software
Business domain
Core domain
Bounded context & context map
Further breakdown per service characteristics
Business modeling
Group of highly cohesive services talking to each other via loosely coupled APIs
- 31. Shipping domain with aggregates
Shipping
Drone Package
Delivery DeliveryScheduler
DeliverySupervisor
Editor's Notes
- I’m trying to compare the common characteristics of each
These common characteristics raise questions that you need to answer.
How to choose the right storage? (Polyglot cheat sheet)
How to deal with eventual consistency issues? (Data consistency primer)
How to make apps scalable? (Auto-scaling guidance)
How to control concurrent access? (Concurrent access guidance, WIP)
How to decompose a monolith into distributed components? (Data/Compute partitioning guidance)
How to make apps immutable?
How to choose the right authentication model? (Identity guidance)
How to design multi-tenant apps? (Multi-tenant guidance)
How to deal with transient/non-transient faults? (Retry guidance)
https://dzone.com/articles/martin-fowler-snowflake
- Add practical examples per each bullet
Minimize coordination
- concurrency control
HCLC
- encapsulate domain knowledge, contact,
Scale-out/in
- avoid instance stickiness, deal with scale-in
Decomposition
- Decompose per functional / Non-functional reqs
API
- REST vs. RPC
Redundancy
- Different level of redundancy
Self-healing
- CB, Retry, compensation, throttling, fallback,
Polyglot
-
Observable
- correlating transactions
- Don’t write your own OS!!
Master/Client node, Avoid split brain issue, Perf tuning, patch/version up etc.
SQL DB, Azure Redis, DocumentDB, AAD, Azure Search, HDI
- https://www.youtube.com/watch?v=EYJnWttrC9k
CouchDB supports MVCC
Optimistic vs. Pessimistic concurrency control
MVCC
Data partitioning
Event sourcing
Exactly once operation (causes coordination)
MapReduce
Idempotent operations
Leader election
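Optimistic concurrency control, one of the techniques listed above, avoids locks by checking a version number at write time. The `VersionedStore` below is a hypothetical single-process stand-in; in a real store the version check and the write must be one atomic operation on the server side (for example, a conditional update).

```python
class ConflictError(Exception):
    """Raised when the record changed since it was read."""

class VersionedStore:
    """In-memory stand-in for a store with version (ETag-style) checks."""
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(key)  # someone else wrote first
        self._data[key] = (current_version + 1, value)
        return current_version + 1

def increment(store, key, attempts=3):
    """Read-modify-write loop that retries on conflict instead of locking."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            return store.write(key, (value or 0) + 1, version)
        except ConflictError:
            continue  # re-read the latest version and try again
    raise ConflictError(key)
```

Writers never wait for each other; a writer that loses the race simply re-reads and retries, which keeps coordination to a minimum when conflicts are rare.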
- Partition for scalability/query-performance/size limits
Three different partitioning strategies (vertical, horizontal, functional)
Hybrid approach (vertical & horizontal)
Design the shard key to avoid hot spots
Partition at different levels of the envelope (DB, Node, Account, Subscription)
Partition different parts of the application (DB, Storage, Cache, Queue, Cluster, LB)
- This is often referred to as sharding.
Store different sets of rows in different partitions.
Each partition has the same schema.
Choose a shard key that distributes data evenly to avoid hot spots.
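One common way to get an even distribution is to hash the shard key. The sketch below is an assumption-laden illustration: `shard_for`, the key format, and the shard count of 4 are hypothetical, and the simple modulo means that changing the shard count moves most keys.

```python
import hashlib
from collections import Counter

def shard_for(key: str, shard_count: int) -> int:
    """Map a shard key to a shard using a stable hash.

    Python's built-in hash() is avoided because it is randomized per
    process for strings, so it would not route consistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Sequential IDs spread across all shards instead of clustering on one.
counts = Counter(shard_for(f"customer-{i}", 4) for i in range(10_000))
```

Hashing evens out load across many keys, but it does not help when a single key is hot on its own; that case needs a different key design or further splitting.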
- Store different columns in different partitions.
Group the columns that are commonly used together so you don’t need to join.
Separate critical from non-critical, or sensitive from non-sensitive, data so you can manage them separately.
- More often than not, you take the hybrid approach.
Store structured data in an RDBMS and binary files in a NoSQL store. Then horizontally partition the RDBMS.
- A tax accounting app has a huge spike in March/April.
A single rock star can turn a partition into a hot spot.
Use load testing and monitoring to find the bottleneck!!
Auto-scaling guidance
- Consider aggressive auto-scaling for critical workloads
Service Fabric doesn't support auto-scale-in
- Resiliency guidance
- Resiliency guidance
- Average, count etc.
Choosing storage guidance
When you need different storage
Design considerations
Transactions and consistency/integrity across multiple stores
CAP theorem
Compensating transaction using queue, supervisors
High level Selection criteria (data type, skillset, other trade-off)
CQRS and Event sourcing with microservices
Polyglot persistence is becoming a natural solution for microservices
- Microservices guidance
- Make things observable
Automate management tasks
Secret management
Expose health endpoint to check system internals
Make all things traceable
Logging, tracing
Instrument your app
Correlate service interactions within a transaction
Collect five key metrics
business, client, app, system, service
Use APM tools
Look for outliers
- Capture business intent and trace it to the software design so that when the intent changes, you can identify what to modify
- Domain represents problem space (business)
BC represents solution space (software)
One BC can have multiple different architecture styles, infrastructures etc.?
- How does the delivery service know its status? Does it come from the delivery management service? (pull or push)
Do we want to merge requestHandler and the gateway?
The gateway does only token checking and delegates auth to the auth service in the account BC
Why does it have Package, Drone, and Delivery as services but no service for account and third party? Do we need them?
Why doesn’t the delivery service contain the drone and package aggregates?
Does the drone service need persistent storage or a cache?
What is the best API style?
Depending on the responsibility and latency requirements of the drone service in this context, it can just cache status
Does every event from a drone come via EventHub to DroneMgmt only, or also to the Delivery service?
The account service subscribes to delivery events and does the following once a delivery is completed:
Collect ratings, send emails, schedule payment