Square's Lessons Learned from Implementing a Key-Value Store with Raft

HOSTED BY
Lessons Learnt from Implementing
a Key-Value Store with Raft
Omar Elgabry
Software Engineer at Square

Omar Elgabry
Software Engineer at Square
■ I am a software engineer at Square
■ This is my second time at the P99 conf
■ Blog: https://medium.com/@OmarElgabry

Intro
Motivation
■ To go through the journey of implementing a KV store, and the micro-lessons one can learn from
building a fault-tolerant, strongly consistent distributed DB using Raft.
Goals
■ To build a KV store that achieves the following:
● fault tolerance: provide availability despite server and network failures using replication
● consistency guarantees (linearizability): behave as if all operations are executed on a single machine
● performance guarantees: should be equivalent to other consensus algorithms such as Paxos
● recoverability: recovers from failures

ROWA (read one, write all)
Replica Replica
Leader
Read
Write
⌛
Config
read & writes

ROWA (read one, write all)
Follower Follower
Leader
💀
💀
Leader
Config
failures
🤔

Majority
Follower Follower
Leader
⌛ … Majority (⅔)
read & writes

Majority
Replica
Replica
Leader
💀
🤔
💀
Leader
failures

ROWA vs Majority
■ Raft is more complex vs ROWA simpler
■ Raft does not rely on a single master to make critical decisions
■ Majority quorum systems has been gaining ground
● due to their toleration of temporarily slow/broken replicas

■ log replication is the backbone of Raft.
Log Replication
x=4 x=5 y=2 z=1
x=4
y=2
z=1
x=5
Log KV store
·Disk
·In-memory

Log Replication
■ order the log entries, client requests and commands
■ help nodes agree on a single execution order
● same order → same final state
■ help the leader ensure followers have identical logs
■ help the leader re-send logs to lagging followers
■ help with electing a new leader with the most up-to-date logs
■ allow logs replay after server crashes and reboots
… but why?

Log Replication
Follower
Follower
Leader
❗
applying log entries
x=4
x=4
x=4

Log Replication
log properties
■ once a command in a log entry is applied by any server, no other server
executes a different command for that log entry
■ applied log entries can’t be rolled back
■ log entries might be overridden, iff it hasn’t been replicated & applied on
majority followers

Network Partitioning
Follower
Leader
follower
Follower
Follower Follower
X
❗
Follower

Network Partitioning
Follower
Leader
leader
Follower
Follower Follower
❗
Leader
Split Brain Syndrome
🤔

Leader Elections
Follower
Leader
voting
Follower
Follower Follower
❗ X
Candidate
Leader
🤔
👍 👎
3 1
Follower

Leader Elections
split votes
Follower
Follower Follower
Candidate
👍 👎
2 2
Candidate
👍 👎
2 2
Leader
❗
Follower Follower
Candidate
Leader
🕐 🕥
Follower
https://www.p99conf.io/session/square-engineerings-fail-fast-retry-soon-performance-optimization-technique

Recoverability
snapshots
Raft
·Disk

Recoverability
snapshot performance issues
■ when to take snapshot
● based on log size, fixed timer, etc
● schedule should balance between resources and performance
■ how to unblock processing of write requests while taking snapshots
● copy-on-write (e.g., fork() in Linux)
■ client read reqs and snapshot access the same underlying KV store data
■ only the parts of state being updated are copied, and thus, not affecting the snapshot

Recoverability
snapshot size considerations
■ raft's snapshot logic is reasonable if the KV store size is small
● it simplifies the logic by always copying the entire KV store state
■ if a follower was offline for some time and leader discarded its log while
taking a snapshot
● leader can't repair that follower by sending logs
● leader sends a copy of its snapshot in chunks over the network
■ copying large amounts of data (for snapshots and lagging followers) is not
scalable
● incremental snapshots (e.g., LSM trees)
● persist KV store data on disk

Scaling Raft
scale reads
■ consistency guarantees: no stale reads
● KV store must reflect the behavior expected of a single server
● e.g., reads reqs started after a write op is completed should see the result from the write op
■ can followers respond to read requests?
● no, unless we avoid stale read.
https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf

Scaling Raft
scale reads
■ can leader respond to read-only requests by reading local copy, without replicating
log entries? yes, under two conditions:
1. no split-brain
● one leader might return a stale value that has been updated by another recent leader.
● one solution is to use leases on successful heartbeats from majority
■ leader is then allowed to respond to read-only requests for a lease period, without replicating
● new leader can’t execute write reqs until this lease period has elapsed.
2. no-op operation
● new leader must append and apply a no-op operation into the log at the start of its election term.
● ensures all prev log entries before the election term are consistent with the majority and applied
before new leader starts to serve client requests.
Noop
Noop

Scaling Raft
scale writes
■ consistency guarantees: ordering & concurrent writes
● the state at any point in time must reflect the result of sequential execution of ordered
operations, in the same order they were sent (or received by the KV store)
● concurrent ops can be executed in any order as long as all clients see the same order
■ batching
● to batch multiple client requests in order to send them efficiently over the network, and to
write them to disk.
● optimizes throughput
● leader broadcast the batch when it exceeds the size or timeout threshold.

■ pipelining
● to send next batch to followers without having to wait for previous batch to finish
● optimizes latency
● leader can send the next batch halfway through the processing time of the first batch
■ parallelizing
● to parallelize and independent operations, e.g., disk writes from sending batches over the
network
Scaling Raft
scale writes

Scaling Raft
scale writes
Follower
Follower
Leader
Batch 1
·Disk
Batch 2

References
■ Diego O. and John O. Stanford University, In Search of an Understandable Consensus
Algorithm, https://raft.github.io/raft.pdf
■ Massachusetts Institute of Technology (MIT), MIT 6.824 distributed systems class,
https://pdos.csail.mit.edu/6.824/index.html
■ Diego O., Consensus: Bridging Theory and Practice, published by Stanford University,
https://github.com/ongardie/dissertation/blob/master/stanford.pdf
■ Unmesh Joshi, Patterns of Distributed Systems,
https://martinfowler.com/articles/patterns-of-distributed-systems

Omar Elgabry
omar.elgabry.93@gmail.com
LinkedIn/omarelgabry
Thank you! Let’s connect.

Square's Lessons Learned from Implementing a Key-Value Store with Raft

Recommended

Recommended

More Related Content

Similar to Square's Lessons Learned from Implementing a Key-Value Store with Raft

Similar to Square's Lessons Learned from Implementing a Key-Value Store with Raft (20)

More from ScyllaDB

More from ScyllaDB (20)

Recently uploaded

Recently uploaded (20)

Square's Lessons Learned from Implementing a Key-Value Store with Raft