Designing A Platform Agnostic HA System

Designing a platform agnostic
High Availability (HA) system
Runcy Oommen
Mar 01 | Worldwide Software Architecture Summit '22

The HA Agenda
• Definition and necessity for HA
• Types of HA system
• Channel establishment (also PUB-SUB style)
• Crossover - Heartbeat implementation
• Critical service monitoring
• Probing and networking overview
• Pre-emption concept in an HA system
• Floating IP to determine "active" node
• Design gotchas & pitfalls

Career
• Principal SDE, SonicWall, 18+ years of industry experience, primarily in systems, cloud (private & public), security, networking
• 10x multi-cloud certified
• Special interest in serverless, containers and cloud-native offerings. Firm believer in multi-hybrid cloud
Community
• Organizer of GDG Cloud and Cloud Native meetup groups in Bangalore
• Regular speaker at domestic & international cloud, tech & security conferences
• Multiple hackathon wins in cloud/security topics
• Recognized by Google as a community influencer

What does "High Availability" mean?
A system characteristic that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period
Principles of systems design to help achieve HA:
1. Eliminating single point of failure - building redundancy
2. Reliable crossover - continues to deliver the functionality
3. Detection of failures as they occur
Reference:
https://en.wikipedia.org/wiki/High_availability

Necessity for HA design
• A critical point of any infrastructure/platform
• With the increased adoption of remote work, it's imperative to provide enhanced availability
• It's architecture independent – monolith or micro-services
• It's a trendy topic in modern software design/development

Types of HA system
ACTIVE-STANDBY IMPLEMENTATION
ACTIVE-ACTIVE IMPLEMENTATION
@runcyoommen

Importance of the HA channel
• Acts as THE communication link between the nodes
• Provides an important heartbeat monitoring mechanism
• Main control plane for data, status and command exchange
• Link to be established via a reliable stack (e.g. TCP/IP)
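
A minimal sketch of how messages on such a channel could be framed over a TCP byte stream. The message kinds, the JSON body and the 4-byte length prefix are assumptions made for illustration; the slides only state that data, status and commands travel over a reliable stack.

```python
import json
import struct
from typing import List, Tuple

# Hypothetical message kinds, loosely based on "data, status and command exchange".
KINDS = {"heartbeat", "status", "command", "data"}


def encode_message(kind: str, node_id: str, payload: dict) -> bytes:
    """Serialize one message as a 4-byte big-endian length followed by a JSON body."""
    if kind not in KINDS:
        raise ValueError(f"unknown message kind: {kind}")
    body = json.dumps({"kind": kind, "node": node_id, "payload": payload}).encode("utf-8")
    return struct.pack("!I", len(body)) + body


def decode_messages(buffer: bytes) -> Tuple[List[dict], bytes]:
    """Extract every complete message from buffer and return (messages, leftover bytes)."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack("!I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # partial message: keep the bytes until more data arrives
        messages.append(json.loads(buffer[4:4 + length].decode("utf-8")))
        buffer = buffer[4 + length:]
    return messages, buffer
```

The length prefix matters because TCP delivers a stream, not discrete messages: a receiver may see half a heartbeat or two status updates in one read, so the leftover bytes are carried over to the next read.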

Design definition of the channel

PUBlisher <> SUBscriber model
• Fast and easy to implement, if custom logic is not a hard requirement
• A publisher and subscriber mechanism to be present at each node
• Essentially the same thing as a control channel – events handled by appropriate topics
• Many established language-specific libraries and message queues exist
• RabbitMQ, Centrifugo, Google Pub/Sub etc.
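
The topic-based idea can be illustrated with a tiny in-process bus. This is only a sketch: `TopicBus` and the topic names are hypothetical, and a real deployment would use one of the brokers or libraries listed above rather than this class.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class TopicBus:
    """In-memory stand-in for a message broker: handlers are grouped by topic."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler that receives every event published on topic."""
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver event to the topic's handlers and return how many were called."""
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

For example, each node could subscribe one handler to a heartbeat topic and another to a command topic, so that events are routed by topic instead of by custom dispatch logic.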

Crossover - Heartbeat mechanism
• Determines the overall state of the HA cluster
• Handles crossover logic in cases like network loss, power outage, system crash, host corruption etc.
• Provides certain threshold checks and failure buffers against false alarms
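
One way to picture the threshold check is a monitor that declares the peer failed only after several consecutive heartbeats are missed. The class name, the one-second interval and the threshold of three are assumptions for illustration, not values from the talk.

```python
class HeartbeatMonitor:
    """Tracks the peer's last heartbeat and applies a missed-beat threshold."""

    def __init__(self, started_at: float, interval_s: float = 1.0, missed_threshold: int = 3) -> None:
        self.interval_s = interval_s
        self.missed_threshold = missed_threshold
        self.last_beat = started_at  # treat start-up as the first beat

    def record_beat(self, now: float) -> None:
        """Call whenever a heartbeat arrives from the peer."""
        self.last_beat = now

    def missed_beats(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last beat."""
        return int((now - self.last_beat) // self.interval_s)

    def peer_failed(self, now: float) -> bool:
        """True only once the failure buffer (missed_threshold) is exhausted."""
        return self.missed_beats(now) >= self.missed_threshold
```

Timestamps are passed in explicitly (for instance from `time.monotonic()`), which keeps the logic easy to test. A single delayed heartbeat does not trigger crossover; only a sustained gap does.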

Critical service monitoring
• Maintain a roster of important processes
• Assign an appropriate weight to each service
• Perform failover depending on overall service status
• Threshold mechanism for self-recovery
• Implement a watchdog daemon for retries
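
A rough sketch of how weights, retries and the failover decision could fit together. The service names, weights, retry limit and failover threshold below are all hypothetical; the slides do not specify concrete values or a scoring formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    weight: int                # relative importance (illustrative value)
    healthy: bool = True
    restart_attempts: int = 0  # restarts already tried by the watchdog


def watchdog_action(service: Service, max_retries: int = 3) -> str:
    """Decide what the watchdog does next for one service."""
    if service.healthy:
        return "none"
    if service.restart_attempts < max_retries:
        return "restart"       # self-recovery is still allowed
    return "give_up"           # retries exhausted, count it as failed


def failed_weight(services: List[Service], max_retries: int = 3) -> int:
    """Total weight of services that could not be recovered locally."""
    return sum(s.weight for s in services if watchdog_action(s, max_retries) == "give_up")


def should_failover(services: List[Service], failover_weight: int = 50, max_retries: int = 3) -> bool:
    """Fail over once the unrecoverable services outweigh the threshold."""
    return failed_weight(services, max_retries) >= failover_weight
```

With this shape, a low-weight service that stays down does not move the Active role, while the loss of a heavily weighted one does, but only after the watchdog has used up its retries.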

Design definition - Service monitoring

Probing and networking overview in HA
• Need for a continuous probe to a hostname or IP address
• Specified at a certain interval, lasting for a certain duration
• Check for failure buffers to avert false positives
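
The interval, duration and failure-buffer parameters could be combined roughly as follows. This is a sketch under assumptions: `probe_target` and `tcp_check` are hypothetical helper names, and a TCP connect is only one possible kind of probe.

```python
import socket
import time
from typing import Callable


def tcp_check(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Single probe: can a TCP connection to host:port be opened in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def probe_target(
    check: Callable[[], bool],
    interval_s: float,
    duration_s: float,
    failure_buffer: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe every interval_s for duration_s.

    Returns False (target considered down) as soon as failure_buffer
    consecutive checks fail; isolated failures are tolerated.
    """
    attempts = max(1, int(duration_s / interval_s))
    consecutive_failures = 0
    for i in range(attempts):
        if check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_buffer:
                return False
        if i < attempts - 1:
            sleep(interval_s)
    return True
```

A call such as `probe_target(lambda: tcp_check("192.0.2.10", 443), interval_s=1, duration_s=10, failure_buffer=3)` would probe a placeholder address for ten seconds and report it down only after three failures in a row.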

Probing functionality
• External IP/hostname – Determines internet access
• Internal host – Implementation of a closed network
• 3rd party software dependency for library or SDK

Design definition – Probing params

Floating IP - Concept
• A virtual addressing mechanism used in an HA cluster that moves between devices - also known as Virtual IP
• Determines the Active node in the event of a link or device failure
• Usually configured in the same subnet space as the physical interface

Floating IP - Implementation
• Logic determined by the highest node priority – there could be other criteria as well
Node 1, Priority: 200 (Active)
Node 2, Priority: 100 (Standby)
• The virtual address binds to the MAC address associated with the physical interface of the node with the highest priority
• Example package/software - Keepalived, HAProxy etc.
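
The priority rule on its own can be sketched in a few lines. This shows only the decision of which node should hold the floating IP, not the actual address move, which a package such as Keepalived would handle. The `Node` type and the tie-break on node name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    name: str
    priority: int
    healthy: bool = True


def elect_active(nodes: Iterable[Node]) -> Optional[Node]:
    """Pick the healthy node with the highest priority, or None if all are down.

    Ties are broken by node name, which is an arbitrary illustrative choice.
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (n.priority, n.name))
```

With the two nodes from the slide, Node 1 (priority 200) is elected while it is healthy, and Node 2 (priority 100) takes over the virtual address when Node 1 is marked unhealthy.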

HA concept - Pre-emption
• Essentially means that the Primary takes back the 'Active' role on recovery
• Always restores the initial status quo
• Caution: Might exhibit aggressive behavior, leading to an undesirable overall HA status

Pre-emption (example walkthrough)
1. Initial HA setup status: Primary (Active), Secondary (Standby)
2. Failover taken place: Secondary (Active), Primary (Down)
3. Primary comes back up: Secondary (Active), Primary (Up)
4. Previous status-quo restored: Primary (Active), Secondary (Standby)
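
The walkthrough can be expressed as a small decision function, sketched here under the assumption of exactly two roles and a boolean pre-emption setting. The function and constant names are hypothetical.

```python
from typing import Dict, Optional

PRIMARY, SECONDARY = "primary", "secondary"


def next_active(current: Optional[str], up: Dict[str, bool], preempt: bool) -> Optional[str]:
    """Return which role should be Active given node liveness.

    current: role that is Active right now (None if nobody is).
    up:      liveness per role, e.g. {"primary": True, "secondary": True}.
    preempt: if True, a healthy Primary always reclaims the Active role.
    """
    if preempt and up[PRIMARY]:
        return PRIMARY
    if current is not None and up[current]:
        return current  # no pre-emption: the current Active node stays Active
    for role in (PRIMARY, SECONDARY):
        if up[role]:
            return role  # failover to whichever node is still up
    return None
```

With pre-emption enabled, the four states of the walkthrough are reproduced in order. With it disabled, the Secondary keeps the Active role after the Primary returns, which avoids an extra role switch at the cost of not restoring the original layout.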

HA design gotchas & pitfalls to avoid
• HA config storage
o Never store data about the config in a DB
o Status should be node dependent
• Split-brain condition
o Each node believes it is the Active one
o Leads to channel failure and config sync issues
• Huge design difference between Active-Standby & Active-Active deployments
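
As an illustration of the split-brain condition, the sketch below detects more than one node claiming the Active role and demotes all but the highest-priority claimant. The talk does not prescribe a remedy, so resolving by priority is an assumption, as is the premise that the role claims can be collected at all (for example once the channel recovers).

```python
from typing import Dict


def is_split_brain(claims: Dict[str, str]) -> bool:
    """True when more than one node claims the 'active' role."""
    return sum(1 for role in claims.values() if role == "active") > 1


def resolve_by_priority(claims: Dict[str, str], priorities: Dict[str, int]) -> Dict[str, str]:
    """Keep the highest-priority 'active' claimant and demote the others to 'standby'."""
    claimants = [node for node, role in claims.items() if role == "active"]
    if len(claimants) <= 1:
        return dict(claims)  # nothing to resolve
    winner = max(claimants, key=lambda node: (priorities[node], node))
    resolved = dict(claims)
    for node in claimants:
        if node != winner:
            resolved[node] = "standby"
    return resolved
```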

High Availability – Key takeaways
• The right architecture – Huge difference between Active-Standby & Active-Active implementation
• Robust channel – Paramount for the control and overall state of the cluster
• Failover & Heartbeat – Ensure deep tests to finalize the code and design. Never stop iterating/fine-tuning
• Reliable 3rd party package – Helps offload your priority determination to get the 'Active' node

@runcyoommen
https://runcy.me Runcy Oommen
