SlideShare a Scribd company logo
HOSTED BY
Lessons Learnt from Implementing
a Key-Value Store with Raft
Omar Elgabry
Software Engineer at Square
Omar Elgabry
Software Engineer at Square
■ I am a software engineer at Square
■ This is my second time at the P99 conf
■ Blog: https://medium.com/@OmarElgabry
Intro
Motivation
■ To go through the journey of implementing a KV store, and the micro-lessons one can learn from
building a fault-tolerant, strongly consistent distributed DB using Raft.
Goals
■ To build a KV store that achieves the following:
● fault tolerance: provide availability despite server and network failures using replication
● consistency guarantees (linearizability): behave as if all operations are executed on a single machine
● performance guarantees: should be equivalent to other consensus algorithms such as Paxos
● recoverability: recovers from failures
ROWA (read one, write all)
Replica Replica
Leader
Read
Write
⌛
Config
read & writes
ROWA (read one, write all)
Follower Follower
Leader
💀
💀
Leader
Config
failures
🤔
Majority
Follower Follower
Leader
⌛ … Majority (⅔)
read & writes
Majority
Replica
Replica
Leader
💀
🤔
💀
Leader
failures
ROWA vs Majority
■ Raft is more complex vs ROWA simpler
■ Raft does not rely on a single master to make critical decisions
■ Majority quorum systems has been gaining ground
● due to their toleration of temporarily slow/broken replicas
Raft
■ log replication is the backbone of Raft.
Log Replication
x=4 x=5 y=2 z=1
x=4
y=2
z=1
x=5
Log KV store
·Disk
·In-memory
Log Replication
■ order the log entries, client requests and commands
■ help nodes agree on a single execution order
● same order → same final state
■ help the leader ensure followers have identical logs
■ help the leader re-send logs to lagging followers
■ help with electing a new leader with the most up-to-date logs
■ allow logs replay after server crashes and reboots
… but why?
Log Replication
Follower
Follower
Leader
❗
applying log entries
x=4
x=4
x=4
Log Replication
log properties
■ once a command in a log entry is applied by any server, no other server
executes a different command for that log entry
■ applied log entries can’t be rolled back
■ log entries might be overridden, iff it hasn’t been replicated & applied on
majority followers
Network Partitioning
Follower
Leader
follower
Follower
Follower Follower
X
❗
Follower
Network Partitioning
Follower
Leader
leader
Follower
Follower Follower
❗
Leader
Split Brain Syndrome
🤔
Leader Elections
Follower
Leader
voting
Follower
Follower Follower
❗ X
Candidate
Leader
🤔
👍 👎
3 1
Follower
Leader Elections
split votes
Follower
Follower Follower
Candidate
👍 👎
2 2
Candidate
👍 👎
2 2
Leader
❗
Follower Follower
Candidate
Leader
🕐 🕥
Follower
https://www.p99conf.io/session/square-engineerings-fail-fast-retry-soon-performance-optimization-technique
Recoverability
snapshots
Raft
·Disk
Recoverability
snapshots
Raft
·Disk
Recoverability
snapshot performance issues
■ when to take snapshot
● based on log size, fixed timer, etc
● schedule should balance between resources and performance
■ how to unblock processing of write requests while taking snapshots
● copy-on-write (e.g., fork() in Linux)
■ client read reqs and snapshot access the same underlying KV store data
■ only the parts of state being updated are copied, and thus, not affecting the snapshot
Recoverability
snapshot size considerations
■ raft's snapshot logic is reasonable if the KV store size is small
● it simplifies the logic by always copying the entire KV store state
■ if a follower was offline for some time and leader discarded its log while
taking a snapshot
● leader can't repair that follower by sending logs
● leader sends a copy of its snapshot in chunks over the network
■ copying large amounts of data (for snapshots and lagging followers) is not
scalable
● incremental snapshots (e.g., LSM trees)
● persist KV store data on disk
Scaling Raft
scale reads
■ consistency guarantees: no stale reads
● KV store must reflect the behavior expected of a single server
● e.g., reads reqs started after a write op is completed should see the result from the write op
■ can followers respond to read requests?
● no, unless we avoid stale read.
https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf
Scaling Raft
scale reads
■ can leader respond to read-only requests by reading local copy, without replicating
log entries? yes, under two conditions:
1. no split-brain
● one leader might return a stale value that has been updated by another recent leader.
● one solution is to use leases on successful heartbeats from majority
■ leader is then allowed to respond to read-only requests for a lease period, without replicating
● new leader can’t execute write reqs until this lease period has elapsed.
2. no-op operation
● new leader must append and apply a no-op operation into the log at the start of its election term.
● ensures all prev log entries before the election term are consistent with the majority and applied
before new leader starts to serve client requests.
Noop
Noop
Scaling Raft
scale writes
■ consistency guarantees: ordering & concurrent writes
● the state at any point in time must reflect the result of sequential execution of ordered
operations, in the same order they were sent (or received by the KV store)
● concurrent ops can be executed in any order as long as all clients see the same order
■ batching
● to batch multiple client requests in order to send them efficiently over the network, and to
write them to disk.
● optimizes throughput
● leader broadcast the batch when it exceeds the size or timeout threshold.
■ pipelining
● to send next batch to followers without having to wait for previous batch to finish
● optimizes latency
● leader can send the next batch halfway through the processing time of the first batch
■ parallelizing
● to parallelize and independent operations, e.g., disk writes from sending batches over the
network
Scaling Raft
scale writes
Scaling Raft
scale writes
Follower
Follower
Leader
Batch 1
·Disk
Batch 2
References
■ Diego O. and John O. Stanford University, In Search of an Understandable Consensus
Algorithm, https://raft.github.io/raft.pdf
■ Massachusetts Institute of Technology (MIT), MIT 6.824 distributed systems class,
https://pdos.csail.mit.edu/6.824/index.html
■ Diego O., Consensus: Bridging Theory and Practice, published by Stanford University,
https://github.com/ongardie/dissertation/blob/master/stanford.pdf
■ Unmesh Joshi, Patterns of Distributed Systems,
https://martinfowler.com/articles/patterns-of-distributed-systems
Omar Elgabry
omar.elgabry.93@gmail.com
LinkedIn/omarelgabry
Thank you! Let’s connect.

More Related Content

Similar to Square's Lessons Learned from Implementing a Key-Value Store with Raft

Evolution Of MongoDB Replicaset
Evolution Of MongoDB ReplicasetEvolution Of MongoDB Replicaset
Evolution Of MongoDB Replicaset
M Malai
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
javier ramirez
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Continuent
 
Best Practice for Achieving High Availability in MariaDB
Best Practice for Achieving High Availability in MariaDBBest Practice for Achieving High Availability in MariaDB
Best Practice for Achieving High Availability in MariaDB
MariaDB plc
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
Mydbops
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
Ganesan Narayanasamy
 
MariaDB High Availability
MariaDB High AvailabilityMariaDB High Availability
MariaDB High Availability
MariaDB plc
 
Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemaker
kuchinskaya
 
LCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at LinaroLCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at Linaro
Linaro
 
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
Linaro
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
Piotr Pelczar
 
Manging scalability of distributed system
Manging scalability of distributed systemManging scalability of distributed system
Manging scalability of distributed system
Atin Mukherjee
 
kafka
kafkakafka
02 2017 emea_roadshow_milan_ha
02 2017 emea_roadshow_milan_ha02 2017 emea_roadshow_milan_ha
02 2017 emea_roadshow_milan_ha
mlraviol
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
C4Media
 
Perforce Administration: Optimization, Scalability, Availability and Reliability
Perforce Administration: Optimization, Scalability, Availability and ReliabilityPerforce Administration: Optimization, Scalability, Availability and Reliability
Perforce Administration: Optimization, Scalability, Availability and Reliability
Perforce
 
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE PlatformsFIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE
 
Functional? Reactive? Why?
Functional? Reactive? Why?Functional? Reactive? Why?
Functional? Reactive? Why?
Aleksandr Tavgen
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Experts, Inc.
 

Similar to Square's Lessons Learned from Implementing a Key-Value Store with Raft (20)

Evolution Of MongoDB Replicaset
Evolution Of MongoDB ReplicasetEvolution Of MongoDB Replicaset
Evolution Of MongoDB Replicaset
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
 
Best Practice for Achieving High Availability in MariaDB
Best Practice for Achieving High Availability in MariaDBBest Practice for Achieving High Availability in MariaDB
Best Practice for Achieving High Availability in MariaDB
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
MariaDB High Availability
MariaDB High AvailabilityMariaDB High Availability
MariaDB High Availability
 
Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemaker
 
LCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at LinaroLCE13: Test and Validation Summit: The future of testing at Linaro
LCE13: Test and Validation Summit: The future of testing at Linaro
 
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
LCE13: Test and Validation Mini-Summit: Review Current Linaro Engineering Pro...
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
Manging scalability of distributed system
Manging scalability of distributed systemManging scalability of distributed system
Manging scalability of distributed system
 
kafka
kafkakafka
kafka
 
02 2017 emea_roadshow_milan_ha
02 2017 emea_roadshow_milan_ha02 2017 emea_roadshow_milan_ha
02 2017 emea_roadshow_milan_ha
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
Perforce Administration: Optimization, Scalability, Availability and Reliability
Perforce Administration: Optimization, Scalability, Availability and ReliabilityPerforce Administration: Optimization, Scalability, Availability and Reliability
Perforce Administration: Optimization, Scalability, Availability and Reliability
 
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE PlatformsFIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
 
Functional? Reactive? Why?
Functional? Reactive? Why?Functional? Reactive? Why?
Functional? Reactive? Why?
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
 

More from ScyllaDB

Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
ScyllaDB
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
ScyllaDB
 
Noise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, AkamaiNoise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, Akamai
ScyllaDB
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
ScyllaDB
 
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
ScyllaDB
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
ScyllaDB
 
Using Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance TroublesUsing Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance Troubles
ScyllaDB
 
Reducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGCReducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGC
ScyllaDB
 
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
ScyllaDB
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
ScyllaDB
 
Conquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB DriversConquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB Drivers
ScyllaDB
 
Interaction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance MetricInteraction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance Metric
ScyllaDB
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz
ScyllaDB
 
Making Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of RustMaking Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of Rust
ScyllaDB
 
A Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus AlbuquerqueA Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus Albuquerque
ScyllaDB
 
The Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of LatencyThe Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of Latency
ScyllaDB
 
eBPF vs Sidecars by Liz Rice at Isovalent
eBPF vs Sidecars by Liz Rice at IsovalenteBPF vs Sidecars by Liz Rice at Isovalent
eBPF vs Sidecars by Liz Rice at Isovalent
ScyllaDB
 

More from ScyllaDB (20)

Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
 
Noise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, AkamaiNoise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, Akamai
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
 
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
 
Using Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance TroublesUsing Libtracecmd to Analyze Your Latency and Performance Troubles
Using Libtracecmd to Analyze Your Latency and Performance Troubles
 
Reducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGCReducing P99 Latencies with Generational ZGC
Reducing P99 Latencies with Generational ZGC
 
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
 
Conquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB DriversConquering Load Balancing: Experiences from ScyllaDB Drivers
Conquering Load Balancing: Experiences from ScyllaDB Drivers
 
Interaction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance MetricInteraction Latency: Square's User-Centric Mobile Performance Metric
Interaction Latency: Square's User-Centric Mobile Performance Metric
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
 
99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz99.99% of Your Traces are Trash by Paige Cruz
99.99% of Your Traces are Trash by Paige Cruz
 
Making Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of RustMaking Python 100x Faster with Less Than 100 Lines of Rust
Making Python 100x Faster with Less Than 100 Lines of Rust
 
A Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus AlbuquerqueA Deep Dive Into Concurrent React by Matheus Albuquerque
A Deep Dive Into Concurrent React by Matheus Albuquerque
 
The Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of LatencyThe Latency Stack: Discovering Surprising Sources of Latency
The Latency Stack: Discovering Surprising Sources of Latency
 
eBPF vs Sidecars by Liz Rice at Isovalent
eBPF vs Sidecars by Liz Rice at IsovalenteBPF vs Sidecars by Liz Rice at Isovalent
eBPF vs Sidecars by Liz Rice at Isovalent
 

Recently uploaded

Leveraging AI for Software Developer Productivity.pptx
Leveraging AI for Software Developer Productivity.pptxLeveraging AI for Software Developer Productivity.pptx
Leveraging AI for Software Developer Productivity.pptx
petabridge
 
Metadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - DatastratoMetadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - Datastrato
Zilliz
 
STKI Israeli Market Study 2024 final v1
STKI Israeli Market Study 2024 final  v1STKI Israeli Market Study 2024 final  v1
STKI Israeli Market Study 2024 final v1
Dr. Jimmy Schwarzkopf
 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
Neeraj Kumar Singh
 
Artificial Intelligence and Its Different Domains.pptx
Artificial Intelligence and Its Different Domains.pptxArtificial Intelligence and Its Different Domains.pptx
Artificial Intelligence and Its Different Domains.pptx
officialnavya2010
 
Database Management Myths for Developers
Database Management Myths for DevelopersDatabase Management Myths for Developers
Database Management Myths for Developers
John Sterrett
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
Supercomputing from the Desktop Workstation
Supercomputingfrom the Desktop WorkstationSupercomputingfrom the Desktop Workstation
Supercomputing from the Desktop Workstation
Larry Smarr
 
ASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLCASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLC
Zilliz
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
UiPathCommunity
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
UiPathCommunity
 
HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)
Alpen-Adria-Universität
 
Data Protection in a Connected World: Sovereignty and Cyber Security
Data Protection in a Connected World: Sovereignty and Cyber SecurityData Protection in a Connected World: Sovereignty and Cyber Security
Data Protection in a Connected World: Sovereignty and Cyber Security
anupriti
 
this resume for sadika shaikh bca student
this resume for sadika shaikh bca studentthis resume for sadika shaikh bca student
this resume for sadika shaikh bca student
SadikaShaikh7
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
 
Blockchain and Cyber Defense Strategies in new genre times
Blockchain and Cyber Defense Strategies in new genre timesBlockchain and Cyber Defense Strategies in new genre times
Blockchain and Cyber Defense Strategies in new genre times
anupriti
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
Linda Zhang
 
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamicsArtificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
Chintan Kalsariya
 

Recently uploaded (20)

Leveraging AI for Software Developer Productivity.pptx
Leveraging AI for Software Developer Productivity.pptxLeveraging AI for Software Developer Productivity.pptx
Leveraging AI for Software Developer Productivity.pptx
 
Metadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - DatastratoMetadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - Datastrato
 
STKI Israeli Market Study 2024 final v1
STKI Israeli Market Study 2024 final  v1STKI Israeli Market Study 2024 final  v1
STKI Israeli Market Study 2024 final v1
 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
 
Artificial Intelligence and Its Different Domains.pptx
Artificial Intelligence and Its Different Domains.pptxArtificial Intelligence and Its Different Domains.pptx
Artificial Intelligence and Its Different Domains.pptx
 
Database Management Myths for Developers
Database Management Myths for DevelopersDatabase Management Myths for Developers
Database Management Myths for Developers
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
Supercomputing from the Desktop Workstation
Supercomputingfrom the Desktop WorkstationSupercomputingfrom the Desktop Workstation
Supercomputing from the Desktop Workstation
 
ASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLCASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLC
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)
 
Data Protection in a Connected World: Sovereignty and Cyber Security
Data Protection in a Connected World: Sovereignty and Cyber SecurityData Protection in a Connected World: Sovereignty and Cyber Security
Data Protection in a Connected World: Sovereignty and Cyber Security
 
this resume for sadika shaikh bca student
this resume for sadika shaikh bca studentthis resume for sadika shaikh bca student
this resume for sadika shaikh bca student
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 
Blockchain and Cyber Defense Strategies in new genre times
Blockchain and Cyber Defense Strategies in new genre timesBlockchain and Cyber Defense Strategies in new genre times
Blockchain and Cyber Defense Strategies in new genre times
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
 
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamicsArtificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
 

Square's Lessons Learned from Implementing a Key-Value Store with Raft

  • 1. HOSTED BY Lessons Learnt from Implementing a Key-Value Store with Raft Omar Elgabry Software Engineer at Square
  • 2. Omar Elgabry Software Engineer at Square ■ I am a software engineer at Square ■ This is my second time at the P99 conf ■ Blog: https://medium.com/@OmarElgabry
  • 3. Intro Motivation ■ To go through the journey of implementing a KV store, and the micro-lessons one can learn from building a fault-tolerant, strongly consistent distributed DB using Raft. Goals ■ To build a KV store that achieves the following: ● fault tolerance: provide availability despite server and network failures using replication ● consistency guarantees (linearizability): behave as if all operations are executed on a single machine ● performance guarantees: should be equivalent to other consensus algorithms such as Paxos ● recoverability: recovers from failures
  • 4. ROWA (read one, write all) Replica Replica Leader Read Write ⌛ Config read & writes
  • 5. ROWA (read one, write all) Follower Follower Leader 💀 💀 Leader Config failures 🤔
  • 6. Majority Follower Follower Leader ⌛ … Majority (⅔) read & writes
  • 8. ROWA vs Majority ■ Raft is more complex vs ROWA simpler ■ Raft does not rely on a single master to make critical decisions ■ Majority quorum systems has been gaining ground ● due to their toleration of temporarily slow/broken replicas
  • 10. ■ log replication is the backbone of Raft. Log Replication x=4 x=5 y=2 z=1 x=4 y=2 z=1 x=5 Log KV store ·Disk ·In-memory
  • 11. Log Replication ■ order the log entries, client requests and commands ■ help nodes agree on a single execution order ● same order → same final state ■ help the leader ensure followers have identical logs ■ help the leader re-send logs to lagging followers ■ help with electing a new leader with the most up-to-date logs ■ allow logs replay after server crashes and reboots … but why?
  • 13. Log Replication log properties ■ once a command in a log entry is applied by any server, no other server executes a different command for that log entry ■ applied log entries can’t be rolled back ■ log entries might be overridden, iff it hasn’t been replicated & applied on majority followers
  • 16. Leader Elections Follower Leader voting Follower Follower Follower ❗ X Candidate Leader 🤔 👍 👎 3 1 Follower
  • 17. Leader Elections split votes Follower Follower Follower Candidate 👍 👎 2 2 Candidate 👍 👎 2 2 Leader ❗ Follower Follower Candidate Leader 🕐 🕥 Follower https://www.p99conf.io/session/square-engineerings-fail-fast-retry-soon-performance-optimization-technique
  • 20. Recoverability snapshot performance issues ■ when to take snapshot ● based on log size, fixed timer, etc ● schedule should balance between resources and performance ■ how to unblock processing of write requests while taking snapshots ● copy-on-write (e.g., fork() in Linux) ■ client read reqs and snapshot access the same underlying KV store data ■ only the parts of state being updated are copied, and thus, not affecting the snapshot
  • 21. Recoverability snapshot size considerations ■ raft's snapshot logic is reasonable if the KV store size is small ● it simplifies the logic by always copying the entire KV store state ■ if a follower was offline for some time and leader discarded its log while taking a snapshot ● leader can't repair that follower by sending logs ● leader sends a copy of its snapshot in chunks over the network ■ copying large amounts of data (for snapshots and lagging followers) is not scalable ● incremental snapshots (e.g., LSM trees) ● persist KV store data on disk
  • 22. Scaling Raft scale reads ■ consistency guarantees: no stale reads ● KV store must reflect the behavior expected of a single server ● e.g., reads reqs started after a write op is completed should see the result from the write op ■ can followers respond to read requests? ● no, unless we avoid stale read. https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf
  • 23. Scaling Raft scale reads ■ can leader respond to read-only requests by reading local copy, without replicating log entries? yes, under two conditions: 1. no split-brain ● one leader might return a stale value that has been updated by another recent leader. ● one solution is to use leases on successful heartbeats from majority ■ leader is then allowed to respond to read-only requests for a lease period, without replicating ● new leader can’t execute write reqs until this lease period has elapsed. 2. no-op operation ● new leader must append and apply a no-op operation into the log at the start of its election term. ● ensures all prev log entries before the election term are consistent with the majority and applied before new leader starts to serve client requests. Noop Noop
  • 24. Scaling Raft scale writes ■ consistency guarantees: ordering & concurrent writes ● the state at any point in time must reflect the result of sequential execution of ordered operations, in the same order they were sent (or received by the KV store) ● concurrent ops can be executed in any order as long as all clients see the same order ■ batching ● to batch multiple client requests in order to send them efficiently over the network, and to write them to disk. ● optimizes throughput ● leader broadcast the batch when it exceeds the size or timeout threshold.
  • 25. ■ pipelining ● to send next batch to followers without having to wait for previous batch to finish ● optimizes latency ● leader can send the next batch halfway through the processing time of the first batch ■ parallelizing ● to parallelize and independent operations, e.g., disk writes from sending batches over the network Scaling Raft scale writes
  • 27. References ■ Diego O. and John O. Stanford University, In Search of an Understandable Consensus Algorithm, https://raft.github.io/raft.pdf ■ Massachusetts Institute of Technology (MIT), MIT 6.824 distributed systems class, https://pdos.csail.mit.edu/6.824/index.html ■ Diego O., Consensus: Bridging Theory and Practice, published by Stanford University, https://github.com/ongardie/dissertation/blob/master/stanford.pdf ■ Unmesh Joshi, Patterns of Distributed Systems, https://martinfowler.com/articles/patterns-of-distributed-systems