Designing a platform-agnostic High Availability (HA) system
Runcy Oommen
Mar 01 | Worldwide Software Architecture Summit '22
| The HA Agenda |
• Definition and necessity for HA
• Types of HA system
• Channel establishment (also PUB-SUB style)
• Crossover - Heartbeat implementation
• Critical service monitoring
• Probing and networking overview
• Pre-emption concept in an HA system
• Floating IP to determine "active" node
• Design gotchas & pitfalls
Career
• Principal SDE, SonicWall, 18+ years industry exp primarily
in systems, cloud (private & public), security, networking
• 10x multi-cloud certified
• Special interest in serverless, containers and cloud-native
offerings. Firm believer in multi-hybrid cloud
Community
• Organizer of GDG Cloud and Cloud Native meetup
groups in Bangalore
• Regular speaker at domestic & international cloud, tech
& security conferences
• Multiple hackathon wins in cloud/security topics
• Recognized by Google as a community influencer
What does "High Availability" mean?
A characteristic of a system that aims to ensure an
agreed level of operational performance, usually
uptime, for a higher-than-normal period
Principles of systems design to help achieve HA:
1. Eliminating single points of failure - building redundancy
2. Reliable crossover - continues to deliver the functionality
3. Detection of failures as they occur
Reference:
https://en.wikipedia.org/wiki/High_availability
Necessity for HA design
• A critical point of any infrastructure/platform
• With increased adoption of remote work,
it's imperative to provide enhanced availability
• It's architecture independent – monolith
or micro-services
• It's a trendy topic in modern software
design/development
Types of HA system
• ACTIVE-STANDBY IMPLEMENTATION
• ACTIVE-ACTIVE IMPLEMENTATION
@runcyoommen
Control channel establishment
Importance of the HA channel
• Acts as THE communication link between the nodes
• Carries the heartbeat used for monitoring
• Main control plane for data, status and command exchange
• Link to be established via a reliable stack (e.g. TCP/IP)
Design definition of the channel
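One way such a channel could carry data, status and command messages over a TCP stream is length-prefixed framing. The sketch below is an illustration only: the JSON body, the 4-byte length header and the names `encode_message` and `decode_messages` are assumptions, not part of the original design.

```python
import json
import struct

# Assumed wire format: 4-byte big-endian length, then a UTF-8 JSON body.
HEADER = struct.Struct("!I")


def encode_message(kind: str, payload: dict) -> bytes:
    """Frame one control message (e.g. kind='heartbeat', 'status', 'command')."""
    body = json.dumps({"kind": kind, "payload": payload}).encode("utf-8")
    return HEADER.pack(len(body)) + body


def decode_messages(buffer: bytes):
    """Decode every complete frame in buffer.

    Returns (messages, remaining_bytes); an incomplete trailing frame is
    returned untouched so the caller can prepend it to the next TCP read.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if len(buffer) < end:
            break  # frame not fully received yet
        body = buffer[offset + HEADER.size:end]
        messages.append(json.loads(body.decode("utf-8")))
        offset = end
    return messages, buffer[offset:]
```

TCP delivers a byte stream rather than discrete messages, which is why the decoder has to tolerate partial frames.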
PUBlisher <> SUBscriber model
• Fast and easy to implement, if custom logic is
not a hard requirement
• A publisher and subscriber mechanism to be
present at each node
• Essentially the same role as the control channel –
events handled by appropriate topics
• Many established language-specific libraries
and message queues exist, e.g. RabbitMQ,
Centrifugo, Google Pub/Sub etc.
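To make the topic idea concrete, here is a minimal in-process sketch of publishers and subscribers exchanging HA events by topic. `LocalBroker` and the topic names are hypothetical; a real deployment would put one of the message queues named above in its place.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class LocalBroker:
    """Toy stand-in for a message queue: routes events to handlers by topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver event to every handler on topic; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Usage: each node subscribes to the topics it cares about.
broker = LocalBroker()
received = []
broker.subscribe("ha/heartbeat", received.append)
broker.publish("ha/heartbeat", {"node": "node1", "seq": 7})
```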
Crossover - Heartbeat mechanism
• Determines the overall state of the
HA cluster
• Handles crossover logic in cases
like network loss, power outage, system
crash, host corruption etc.
• Provides certain threshold checks and
failure buffers against false alarms
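The threshold check can be sketched as a counter of missed heartbeats: crossover is signalled only after several consecutive intervals pass without one, so a single delayed packet does not raise a false alarm. The class name and the default interval and threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    interval_s: float = 1.0      # assumed heartbeat period
    missed_threshold: int = 3    # assumed failure buffer before crossover
    last_seen: float = 0.0       # timestamp of the last heartbeat received

    def record_heartbeat(self, now: float) -> None:
        self.last_seen = now

    def missed_beats(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last beat."""
        return max(0, int((now - self.last_seen) // self.interval_s))

    def should_cross_over(self, now: float) -> bool:
        """True once the peer has missed at least missed_threshold beats."""
        return self.missed_beats(now) >= self.missed_threshold
```

Timestamps are passed in explicitly so the logic can be tested without waiting; a running node would feed it `time.monotonic()`.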
Critical service monitoring
• Maintain a roster of important processes
• Assign an appropriate weight to each service
• Perform failover depending on overall service status
• Threshold mechanism for self-recovery
• Implement a watchdog daemon for retries
Design definition - Service monitoring
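A rough sketch of how a weighted roster, watchdog retries and a failover threshold might fit together. The `Service` fields, the retry count and the rule "fail over when the failed weight reaches a threshold" are assumptions for illustration, not the design from the original slide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    name: str
    weight: int                      # importance of this service to the node
    is_healthy: Callable[[], bool]   # hypothetical health check
    restart: Callable[[], None]      # hypothetical restart hook
    max_retries: int = 2             # watchdog attempts before giving up


def check_with_retries(service: Service) -> bool:
    """Watchdog step: try to self-recover before declaring the service failed."""
    if service.is_healthy():
        return True
    for _ in range(service.max_retries):
        service.restart()
        if service.is_healthy():
            return True
    return False


def should_failover(roster: List[Service], failover_threshold: int) -> bool:
    """Fail over when the summed weight of failed services reaches the threshold."""
    failed_weight = sum(s.weight for s in roster if not check_with_retries(s))
    return failed_weight >= failover_threshold
```

With this rule a single low-weight service failing is tolerated, while one heavy service or several light ones together can trigger failover.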
Probing and networking overview in HA
• Need for a continuous probe to a hostname or IP address
• Specified at a certain interval, lasting for a certain duration
• Check against failure buffers to avert false positives
Probing functionality
• External IP/hostname – Determines internet access
• Internal host – For implementations on a closed network
• 3rd party software dependency – For a library or SDK the system relies on
Design definition – Probing params
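The probing parameters could be grouped roughly as below. This is a sketch under assumptions: the field names, the defaults, the use of a TCP connect as the probe, and reading "duration" as a per-probe timeout are all illustrative choices.

```python
import socket
from dataclasses import dataclass


@dataclass
class ProbeParams:
    host: str                  # hostname or IP address to probe
    port: int = 443            # assumed TCP port
    interval_s: float = 5.0    # time between probes
    timeout_s: float = 2.0     # how long a single probe may last
    failure_buffer: int = 3    # consecutive failures tolerated before "down"


def tcp_probe(params: ProbeParams) -> bool:
    """Return True if a TCP connection to the target succeeds within the timeout."""
    try:
        with socket.create_connection((params.host, params.port),
                                      timeout=params.timeout_s):
            return True
    except OSError:
        return False


class ProbeTracker:
    """Applies the failure buffer so one lost probe is not a false positive."""

    def __init__(self, params: ProbeParams) -> None:
        self.params = params
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True while the target counts as reachable."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        return self.consecutive_failures < self.params.failure_buffer
```

A probing loop would call `tcp_probe` every `interval_s` seconds and feed each result to `ProbeTracker.record`.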
Floating IP - Concept
• A virtual addressing mechanism used in an HA cluster that moves
between devices - also known as Virtual IP
• Determines the Active node in the event of a link or device failure
• Usually configured in the same subnet space as the physical interface
Floating IP - Implementation
• Logic determined by the highest node priority – there could be
other criteria as well
• Node 1: Priority 200 (Active)
• Node 2: Priority 100 (Standby)
• The virtual address binds to the MAC address of the physical
interface on the node with the highest priority
• Example packages/software: Keepalived (which manages the virtual IP
via VRRP), commonly paired with HAProxy etc.
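The priority rule can be sketched as a small election function: among the healthy nodes, the one with the highest priority owns the floating IP. The `Node` type and the tie-break on node name are assumptions; packages such as Keepalived implement this kind of selection for you.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    priority: int
    healthy: bool = True


def elect_active(nodes: List[Node]) -> Optional[Node]:
    """Pick the healthy node with the highest priority, or None if all are down.

    Ties are broken by node name (an arbitrary, illustrative choice).
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (n.priority, n.name))


# Usage with the priorities from the slide.
cluster = [Node("node1", 200), Node("node2", 100)]
owner = elect_active(cluster)  # node1 holds the floating IP
```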
HA concept - Pre-emption
• Essentially means that the Primary
takes back the 'Active' role on
recovery
• Always restores the initial status quo
• Caution: Can behave aggressively,
for example switching roles repeatedly
if the Primary is unstable, and leave the
cluster in an undesirable overall state
Pre-emption (example walkthrough)
1. Initial HA setup status: Primary (Active), Secondary (Standby)
2. Failover has taken place: Primary (Down), Secondary (Active)
3. Primary comes back up: Primary (Up), Secondary (Active)
4. Previous status quo restored: Primary (Active), Secondary (Standby)
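The walkthrough can be replayed with a tiny state model. `HAPair` and its method names are hypothetical; the `preempt` flag shows the difference between restoring the initial roles and leaving the Secondary active.

```python
class HAPair:
    """Two-node Active-Standby pair with optional pre-emption (illustrative)."""

    def __init__(self, preempt: bool = True) -> None:
        self.preempt = preempt
        self.primary_up = True
        self.active = "primary"

    def primary_fails(self) -> None:
        self.primary_up = False
        self.active = "secondary"   # failover to the Secondary

    def primary_recovers(self) -> None:
        self.primary_up = True      # Secondary stays Active for now

    def reconcile(self) -> None:
        """With pre-emption enabled, a recovered Primary takes back the Active role."""
        if self.preempt and self.primary_up:
            self.active = "primary"

    def status(self) -> dict:
        if self.active == "primary":
            primary = "Active"
        else:
            primary = "Up" if self.primary_up else "Down"
        secondary = "Active" if self.active == "secondary" else "Standby"
        return {"primary": primary, "secondary": secondary}
```

Calling `primary_fails()`, `primary_recovers()` and `reconcile()` in order walks through the four states above; with `preempt=False` the pair stays in the third state.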
HA design gotchas & pitfalls to avoid
• HA config storage
  o Never store data about the config in a DB
  o Status should be node dependent
• Split-brain condition
  o Each node believes it is the Active one
  o Leads to channel failure and config
    sync issues
• Huge design difference between Active-Standby &
Active-Active deployments
High Availability – Key takeaways
• The right architecture – Huge difference between
Active-Standby & Active-Active implementation
• Robust channel – Paramount for the control and
overall state of the cluster
• Failover & Heartbeat – Test deeply before finalizing
the code and design. Never stop iterating/fine-tuning
• Reliable 3rd party package – Helps offload the
priority determination that selects the 'Active' node
https://runcy.me Runcy Oommen
