Coordination in distributed systems

Andrea Monacchi
Consensus, Configuration, Reliability

Agenda
1. Consensus
2. Apache Zookeeper
3. ETCD

Consensus
● Multiple processes must agree on a value (e.g. time, state, price)
○ https://www.confluent.io/blog/distributed-consensus-reloaded-apache-zookeeper-and-replication-in-kafka/
● Synchronization in systems - Very general topic - many applications
○ clock synchronization - e.g. firefly synchronization
○ smart power grids - e.g. phase-locked loop (PLL)
○ load balancing
○ state-machine replication and distributed replicas
○ distributed-lock and distributed-transaction algorithms
● In practice
○ environment is noisy - i.e. faults may arise endogenously/exogenously
■ Crash failures (process stops) - process or network caused
■ Byzantine failures (most general failure) - arbitrary, possibly malicious behaviour; treated in probabilistic terms, identification is strategy/protocol driven
○ consensus protocols must be fault tolerant (accept failure)

Consensus
Protocol requirements:
● Agreement - consensus should result in one common value
● Termination - consensus process eventually converges
● Integrity - if all correct processes proposed the same value v, then v must be the value selected by the consensus process
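The three requirements can be read as checks on the outcome of one consensus round. The following is a minimal sketch, not part of any protocol: the function name `check_consensus` and the dictionary layout are assumptions made for illustration, and integrity is interpreted as "if every correct process proposed the same v, v must be decided".

```python
def check_consensus(proposed, decided):
    """Check one finished round (hypothetical helper, for illustration only).

    proposed: dict mapping each correct process id to the value it proposed
    decided:  dict mapping each correct process id to the value it decided
    """
    # Termination: every correct process has reached a decision.
    termination = set(decided) == set(proposed)

    # Agreement: no two correct processes decided different values.
    agreement = len(set(decided.values())) <= 1

    # Integrity: if all correct processes proposed the same value v,
    # then v is the only value that may be decided.
    integrity = True
    if len(set(proposed.values())) == 1:
        v = next(iter(proposed.values()))
        integrity = all(d == v for d in decided.values())

    return {"agreement": agreement,
            "termination": termination,
            "integrity": integrity}
```

For example, two processes that both propose 1 and both decide 1 satisfy all three checks, while two processes deciding different values violate agreement.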

Consensus
Single-valued
● agreeing on single integer value
● agreeing on binary value (binary consensus)
Reference algorithm: Paxos (by Leslie Lamport)
● difficult to understand and implement
● even worse in multi-paxos variant
● implementations far from theory (multiple flavours derived)
Multi-valued
● agreement of a sequence of values
● can be decomposed into multiple single-valued instances
Reference algorithm: Multi-Paxos, Raft
● Raft has understandability at its core
● consensus decomposed to 3 sub-problems

Raft
Consensus process decomposed to 3 sub-problems:
● leader election - upon failure of the current leader (only 1 leader at a time)
● log replication - the leader keeps the logs of all other servers in sync with its own, via replication (identical append-only logs)
● safety - if any server has committed a log entry at a certain index, no other server can apply a different log entry for that index (append-only log)
Peculiarities:
● strong leadership - log entries are replicated only from leader to followers; entries are received from clients on the leader, then propagated by the leader to followers, who apply them when considered safe.
● leader election - using randomized timers and heartbeats from leader to followers;
● Membership changes - using a new joint consensus approach to keep the cluster operating during configuration changes;
https://raft.github.io/raft.pdf

Raft: leader election
Each node is a state machine in 1 of these states: follower, candidate, leader
● Each follower runs a randomized election timer that is reset by heartbeats from the current leader;
● Upon timer expiration, the node becomes candidate and asks all nodes for votes; votes are granted first-come-first-served on requests;
● If a majority of votes arrives, the node confirms its leadership by sending a heartbeat to all nodes to establish authority;
● No global clock: time is divided into terms of arbitrary length, each with at most 1 leader
● Each term is identified by a monotonically increasing ID, which is used to mark each communication;
● If a candidate receives AppendEntries (from another node claiming leadership) with the same or a higher Term ID, it becomes follower and updates its local Term ID.
● If several candidates exist at the same time, the vote may lead to no majority (split vote), and a new election begins;
● Random election timeouts are used by each node to prevent split votes; the first node whose timer expires asks for a new election;
2 remote-procedure calls (RPCs):
● RequestVote - from candidate to all
● AppendEntries - from leader to followers to replicate log entries and/or send heartbeat
source: original paper
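The election steps above can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions: the class `Node` and its method names are hypothetical, RPCs are replaced by direct method calls, and the log comparison that Raft adds to RequestVote is left out. The 150-300 ms timeout range is only an illustrative choice.

```python
import random


class Node:
    """Toy Raft node covering only terms, votes and heartbeats (assumed names)."""

    def __init__(self, node_id):
        self.id = node_id
        self.peers = []          # other Node objects, set after construction
        self.state = "follower"
        self.term = 0
        self.voted_for = None

    def new_election_timeout(self, low=0.15, high=0.30):
        # Randomized timeout (seconds) so that nodes rarely expire together.
        return random.uniform(low, high)

    def on_request_vote(self, candidate_id, candidate_term):
        if candidate_term > self.term:
            # A newer term resets the vote and forces follower state.
            self.term = candidate_term
            self.voted_for = None
            self.state = "follower"
        if candidate_term < self.term:
            return False
        # First-come-first-served: one vote per term.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def on_heartbeat(self, leader_term):
        # An empty AppendEntries with same or higher term establishes authority.
        if leader_term >= self.term:
            self.term = leader_term
            self.state = "follower"
            return True
        return False

    def start_election(self):
        # Called when the election timer expires.
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.id
        votes = 1  # the candidate votes for itself
        for peer in self.peers:
            if peer.on_request_vote(self.id, self.term):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"
            for peer in self.peers:
                peer.on_heartbeat(self.term)
        return self.state == "leader"
```

In a real deployment each node would run its timer concurrently and exchange messages over the network; the direct calls here only make the term and voting rules easy to follow.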

Raft: log replication
● Leader receives command for its FSM from client
● Leader appends command to its log
● Leader calls AppendEntries in parallel on all nodes to replicate the entry;
● If the log entry was safely replicated on a majority of servers, then it is safe to commit it: the leader runs the command and returns the result;
● if not (e.g. follower crashed or too slow), the leader retries AppendEntries indefinitely until all followers have their logs in sync;
● Term ID stored with command as Log entry to detect
inconsistencies;
● Followers can locally apply committed command;
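The commit rule in the steps above can be sketched as follows. This is a minimal in-memory illustration, not a Raft implementation: the names `Entry`, `Server` and `leader_replicate` are assumptions, the calls are sequential rather than parallel, and both the consistency check and the retry loop are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: int       # Term ID stored with the command
    command: str


@dataclass
class Server:
    name: str
    reachable: bool = True            # False simulates a crashed or slow follower
    log: list = field(default_factory=list)

    def append_entries(self, entries):
        # A reachable follower stores the entries and acknowledges.
        if not self.reachable:
            return False
        self.log.extend(entries)
        return True


def leader_replicate(leader, followers, term, command):
    """Append on the leader, replicate, and report whether the entry is committed."""
    entry = Entry(term, command)
    leader.log.append(entry)
    acks = 1  # the leader's own copy counts towards the majority
    for follower in followers:
        if follower.append_entries([entry]):
            acks += 1
    cluster_size = len(followers) + 1
    return acks > cluster_size // 2
```

With one leader and two followers, the entry is committed as soon as one follower acknowledges, because two of three servers then hold it.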
Notes:
● Followers’ crashes can be easily recovered
● Leaders’ crashes may leave log inconsistent
○ not all log entries may have been replicated
Solution - Log Matching Property:
● AppendEntries includes a consistency check, i.e. a reference to the index and Term ID of the previous Log Entry;
● A Log Entry is refused if the previous Log Entry does not match that of the leader;
● Conflicting entries are overwritten with those of the leader (appended after the last entry common to both);
● Leaders never overwrite their own log;
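A follower-side sketch of the consistency check may help here. It assumes a log stored as a Python list of `(term, command)` tuples with 1-based indices as in the Raft paper, where `prev_index == 0` means "no previous entry"; the function name `append_entries` and this layout are illustrative assumptions.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply a leader's AppendEntries to a follower log (modified in place).

    Returns False when the consistency check fails, True otherwise.
    """
    # Refuse if the follower has no entry at prev_index ...
    if prev_index > len(log):
        return False
    # ... or if its term there differs from the leader's.
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False

    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based position of the new entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                # Conflict: drop this entry and everything after it,
                # then take the leader's version.
                del log[pos:]
                log.append(entry)
        else:
            log.append(entry)
    return True
```

When the check fails, the leader would retry with an earlier `prev_index` until it finds the last entry common to both logs.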

Raft: Safety
Problem:
A follower may be unavailable while the leader commits certain entries and then become leader itself, thus overwriting the entries committed by the previous leader;
Solution:
Restrict access to leadership by ensuring that leader for a term
contains all entries committed in previous terms (completeness);
Implementation:
● RequestVote RPC includes info on the candidate log, so that the voter can decline if its local log is more up to date than the candidate's;
● Logs of candidate and voter are compared by i) the Term ID of their last entries and ii) log length, when the last Term IDs are equal;
● A leader may crash after only partially replicating an entry, before committing it; to handle this, a leader only commits entries that carry its current Term ID. By the properties of log replication, additional entries can be added only if the consistency check passes (last terms match), so earlier entries are committed along with them, resulting in correct log merging;
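The voter-side comparison can be written as a small predicate. This is a sketch following the rule in the Raft paper (later last term wins; with equal last terms, the longer log wins); the function and parameter names are assumptions.

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                voter_last_term, voter_last_index):
    """Return True if the voter may grant its vote with respect to the logs."""
    # Different last terms: the log with the later term is more up to date.
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    # Same last term: the longer log (higher last index) is more up to date.
    return cand_last_index >= voter_last_index
```

A voter whose own log is more up to date returns False here and declines the RequestVote, which keeps nodes with missing committed entries from becoming leader.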

Apache Zookeeper
● Coordination using a shared file system
○ CRUD operations on tiny files called ZNodes, having stat-like versioning and ACLs
○ ZNodes can have associated data and children nodes
○ Event-driven communication - Clients can watch for changes on ZNodes
○ Ephemeral Nodes - can’t have children and expire when the creating process dies
○ TTL Nodes - automatic removal upon timer expiration
● Used for Service Discovery, Metadata Management, Synchronization, Leader Election
○ e.g. Hadoop/Yarn and Kafka, many more https://zookeeper.apache.org/doc/r3.6.2/zookeeperUseCases.html
○ Using a barrier to synchronize distributed processes (fork/join model)
■ use node b1/ as barrier, each process adds a child node in b1, of kind b1/p1, b1/p2, and so on
■ when enough processes have created their nodes (as provided to the barrier), the join can start
■ example can be generalized to other synch. models, e.g. locks, 2-phase Commit, Leader election
○ Producer-consumer model: producers simply add child nodes (with sequential IDs) under a queue ZNode, and consumers take them in order
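The barrier recipe can be illustrated without a running ensemble. The sketch below uses a toy in-memory tree; `ToyZNodeTree` and `enter_barrier` are hypothetical names and not the ZooKeeper client API. A real client would create ephemeral child nodes and set a watch on b1 instead of checking the count directly.

```python
class ToyZNodeTree:
    """In-memory stand-in for a ZNode hierarchy (illustration only)."""

    def __init__(self):
        self.children = {}  # parent path -> set of child names

    def create(self, path):
        parent, _, name = path.rpartition("/")
        self.children.setdefault(parent, set()).add(name)
        self.children.setdefault(path, set())

    def get_children(self, path):
        return sorted(self.children.get(path, set()))


def enter_barrier(tree, barrier_path, process_name, expected):
    """Register a process under the barrier node; True once enough have joined."""
    tree.create(f"{barrier_path}/{process_name}")
    return len(tree.get_children(barrier_path)) >= expected


tree = ToyZNodeTree()
tree.create("/b1")
for name in ("p1", "p2", "p3"):
    ready = enter_barrier(tree, "/b1", name, expected=3)
    print(name, "joined, barrier open:", ready)
```

The barrier opens only when the third process has created its child node, which is the point at which the join can start.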

Apache Zookeeper: Example
Apache Kafka Cluster
git clone https://github.com/simplesteph/kafka-stack-docker-compose.git
docker-compose -f zk-single-kafka-single.yml up -d
docker-compose -f zk-multiple-kafka-multiple.yml up -d
ZooNavigator
docker run \
-d --network host \
-e HTTP_PORT=9000 \
--name zoonavigator \
--restart unless-stopped \
elkozmon/zoonavigator:latest

Apache Zookeeper: Example
$ kaf topic create example_topic_folks
✅ Created topic!
Topic Name: example_topic_folks
Partitions: 1
Replication Factor: 1
Cleanup Policy: delete

ETCD
● Very similar to Zookeeper
○ file system-like structure with similar functionalities for Nodes (e.g. event-reaction on changes, TTL)
● Raft protocol to distribute replicas across ETCD cluster
● Either addressed via REST/JSON (e.g. by UIs) or gRPC (by other services) - See API v2, v3
● Very lightweight and performant (written in Go) - core component of Kubernetes
