Using consensus algorithm and
distributed store in designing
distributed system
Atin Mukherjee
GlusterFS Hacker
@mukherjee_atin
Topics
● What is consensus in a distributed system?
● What is the CAP theorem?
● Different distributed system design approaches
● Challenges in designing a distributed system
● What is the RAFT algorithm and how does it work?
● Distributed store
● Combining RAFT and a distributed store, in the form of
technologies like consul, etcd and zookeeper
● Q & A
What is consensus in a distributed
system?
● Consensus – An agreement, but on what and
between whom?
● On what → whether the op/transaction is to be committed
or not
● Between whom → The answer is pretty simple: the
nodes forming the distributed system
● Quorum – floor(n/2) + 1, i.e. a strict majority of the n nodes
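The quorum formula can be checked with a few lines of Python. This is only an illustrative sketch: the function names `quorum` and `tolerated_failures` are made up for this example, and `n // 2` is integer division, which gives the floor.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a quorum is still reachable."""
    return n - quorum(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"n={n} quorum={quorum(n)} tolerated failures={tolerated_failures(n)}")
```

Note that going from 3 to 4 nodes raises the quorum without tolerating any extra failure, which is why odd cluster sizes are the usual choice.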
CAP theorem
● A distributed system can provide only two of the following three guarantees at once
– Consistency (all nodes see the same data at the
same time)
– Availability (a guarantee that every request
receives a response about whether it succeeded or
failed)
– Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)

Design approaches of distributed
system
● No meta data – all nodes exchange their
data with each other
● Meta data server – one node holds the data and
the others fetch it from that node
So which one is better?
Probably neither of them? Think about it for a
minute...
Challenges in design of a distributed
system
● No meta data
– N * N exchange of network messages
– Not scalable when N is in the hundreds or
thousands
– Initialization time can be very high
– Can end up in a situation like “whom to believe,
whom not to” - popularly known as split brain
– How to undo a transaction locally
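To get a feel for the N * N growth, here is a small arithmetic sketch. It assumes one round in which every node sends its state to every other node; the function name `full_mesh_messages` is hypothetical.

```python
def full_mesh_messages(n: int) -> int:
    """Messages in one round where each of n nodes sends to the other n - 1 nodes."""
    return n * (n - 1)


if __name__ == "__main__":
    for n in (3, 10, 100, 1000):
        print(f"{n} nodes -> {full_mesh_messages(n)} messages per round")
```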
Challenges in design of a distributed
system contd...
● MDS (Meta data server)
– SPOF (single point of failure)
Ahh! So is this the only drawback?
– How about having replicas? And then what should the replica count be?
– Additional network hop, lower performance
RAFT – A consensus algorithm
● Key features
– Leader/follower based model
– Leader election
– Normal operation
– Safety and consistency after leader changes
– Neutralizing old leaders
– Client interactions
– Configuration changes

RAFT : Server states
● Server state transitions: follower, candidate, leader
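The state transitions can be written down as a small lookup table. This is a sketch based on the transitions described in the Raft paper, not code from any Raft implementation; the event names (`election_timeout`, `won_majority`, ...) are invented for this example.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.FOLLOWER, "election_timeout"): State.CANDIDATE,   # start an election
    (State.CANDIDATE, "election_timeout"): State.CANDIDATE,  # retry in a new term
    (State.CANDIDATE, "won_majority"): State.LEADER,
    (State.CANDIDATE, "saw_leader_or_higher_term"): State.FOLLOWER,
    (State.LEADER, "saw_higher_term"): State.FOLLOWER,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    s = State.FOLLOWER
    for event in ("election_timeout", "won_majority", "saw_higher_term"):
        s = next_state(s, event)
        print(event, "->", s.value)
```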
RAFT : Terms
● Time is divided into terms; each term has two parts
– Election
– Normal operation
● At most 1 leader per term
● A term can end without a leader (failed election),
e.g. after a split vote
● Each server maintains its current term value
RAFT : Replicated state machine
● A picture is worth a thousand words...
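The idea of a replicated state machine can also be sketched in code: if every replica applies the same log of commands in the same order, every replica ends up in the same state. The command format (`"set"`/`"delete"` tuples) and the function name `apply_log` are assumptions made for this example.

```python
def apply_log(log):
    """Apply ("set", key, value) and ("delete", key) commands in log order."""
    state = {}
    for entry in log:
        op = entry[0]
        if op == "set":
            _, key, value = entry
            state[key] = value
        elif op == "delete":
            state.pop(entry[1], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return state


if __name__ == "__main__":
    log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3), ("delete", "y")]
    # Three replicas that received the same log converge to the same state.
    replicas = [apply_log(log) for _ in range(3)]
    print(replicas)
```

The job of the consensus algorithm is to make sure all servers agree on that log; applying it is then deterministic.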
RAFT : Different RPCs
● RequestVote RPCs – a candidate sends these to the other
nodes to get itself elected as leader
● AppendEntries RPCs – normal operation
workload (log replication)
● AppendEntries RPCs with no entries – heartbeat
messages – the leader sends these to all followers
to assert its presence
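A possible shape for the two RPC messages is sketched below. The slide only names the RPCs; the field names here follow the argument lists in the Raft paper (translated to snake_case), and the `is_heartbeat` helper is an addition for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVote:
    term: int             # candidate's term
    candidate_id: str
    last_log_index: int   # index of candidate's last log entry
    last_log_term: int    # term of candidate's last log entry


@dataclass
class AppendEntries:
    term: int             # leader's term
    leader_id: str
    prev_log_index: int   # index of the entry preceding the new ones
    prev_log_term: int    # term of that entry
    leader_commit: int    # leader's commit index
    entries: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def is_heartbeat(self) -> bool:
        """An AppendEntries RPC that carries no entries acts as a heartbeat."""
        return not self.entries


if __name__ == "__main__":
    hb = AppendEntries(term=2, leader_id="n1", prev_log_index=4,
                       prev_log_term=2, leader_commit=4)
    print(hb.is_heartbeat())
```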

RAFT : Leader Election
● current_term++
● Follower → Candidate
● Vote for self
● Send RequestVote RPCs to all other servers, retry until either:
– Votes are received from a majority of servers (become leader)
– An RPC is received from a valid leader (return to follower)
– The election timeout elapses (increment term, start a new election)
● Election properties
– Safety – allow at most one winner per term
– Liveness – some candidate must eventually win
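The election steps can be sketched as a single-process simulation. This is a simplified illustration, not a Raft implementation: there is no networking, no timeouts and no log comparison, and the names `Server`, `handle_request_vote` and `run_election` are made up. The one rule it does show is that a server grants at most one vote per term, which is what gives at most one winner per term.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.current_term = 0
        self.voted_for = None  # candidate voted for in current_term

    def handle_request_vote(self, term, candidate):
        """Grant the vote if the term is current and no other vote was cast in it."""
        if term < self.current_term:
            return False                  # stale candidate
        if term > self.current_term:
            self.current_term = term      # newer term: forget the old vote
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False                      # already voted for someone else


def run_election(candidate, peers):
    """Return True if the candidate collects votes from a majority of the cluster."""
    candidate.current_term += 1                 # current_term++
    candidate.voted_for = candidate.name        # self vote
    votes = 1
    for peer in peers:
        if peer.handle_request_vote(candidate.current_term, candidate.name):
            votes += 1
    cluster_size = len(peers) + 1
    return votes >= cluster_size // 2 + 1


if __name__ == "__main__":
    a = Server("a")
    others = [Server(n) for n in "bcde"]
    print("a elected:", run_election(a, others))
```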
RAFT : Picking the best leader
● Candidates include log info in RequestVote
RPCs: the index & term of their last log entry
● Voting server V denies its vote if its own log is more
complete, i.e. if
(votingServerLastTerm > candidateLastTerm) ||
((votingServerLastTerm == candidateLastTerm) &&
(votingServerLastIndex > candidateLastIndex))
● But is this enough to have crash consistency?
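The vote-denial condition translates directly into a small function. The parameter names mirror the slide's variables in snake_case; the function name `should_deny_vote` is hypothetical.

```python
def should_deny_vote(voter_last_term, voter_last_index,
                     candidate_last_term, candidate_last_index):
    """True if the voter's log is more complete than the candidate's."""
    return (voter_last_term > candidate_last_term) or (
        voter_last_term == candidate_last_term
        and voter_last_index > candidate_last_index
    )


if __name__ == "__main__":
    # Voter's last entry is from a later term: deny.
    print(should_deny_vote(3, 5, 2, 9))
    # Same last term, candidate's log is at least as long: grant.
    print(should_deny_vote(3, 5, 3, 5))
```

The same check can be written as a tuple comparison, `(voter_last_term, voter_last_index) > (candidate_last_term, candidate_last_index)`, since Python compares tuples element by element.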
RAFT : New commitment rules
● For a leader to decide an entry is committed:
– The entry must be stored on a majority of servers, and
– At least one new entry from the leader's term must also
be stored on a majority of servers
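A sketch of the commitment rule, under some assumptions: log indices are 1-based, `leader_log_terms[i - 1]` is the term of entry `i` in the leader's log, and `match_index` holds, for every server including the leader, the highest index known to be replicated there. All of these names are chosen for this example. Because logs that agree at an index also agree on everything before it, an entry at index `j >= index` being on a majority implies the entry at `index` is on that majority too.

```python
def is_committed(index, leader_log_terms, match_index, current_term):
    """Apply the two commitment rules to the entry at `index` (1-based)."""
    majority = len(match_index) // 2 + 1
    for j in range(index, len(leader_log_terms) + 1):
        if leader_log_terms[j - 1] != current_term:
            continue  # only entries from the leader's own term count
        replicated = sum(1 for m in match_index if m >= j)
        if replicated >= majority:
            return True
    return False


if __name__ == "__main__":
    terms = [1, 2, 4]  # leader is in term 4, its log has entries 1..3
    # Entry 2 (term 2) is on 3 of 5 servers, but the term-4 entry is on 1 only.
    print(is_committed(2, terms, [3, 2, 2, 1, 1], current_term=4))
    # Once entry 3 (term 4) reaches a majority, entry 2 is committed as well.
    print(is_committed(2, terms, [3, 3, 3, 1, 1], current_term=4))
```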
RAFT : Log inconsistency
● Leader repairs follower log entries by
– Deleting extraneous entries
– Filling in missing entries from the leader's log
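The two repair steps can be sketched as follows. This is a simplification: a real leader finds the point of agreement through the AppendEntries consistency check, one RPC at a time, whereas this example compares two in-memory lists of `(term, command)` tuples directly. It relies on the property that entries with the same index and term are identical. The function name `repair_follower_log` is hypothetical.

```python
def repair_follower_log(leader_log, follower_log):
    """Return (repaired_log, entries_deleted, entries_appended)."""
    # Length of the prefix on which both logs agree (same term at same index).
    match = 0
    while (match < len(leader_log) and match < len(follower_log)
           and leader_log[match][0] == follower_log[match][0]):
        match += 1
    deleted = len(follower_log) - match    # extraneous entries to drop
    missing = leader_log[match:]           # entries to fill in from the leader
    return follower_log[:match] + missing, deleted, len(missing)


if __name__ == "__main__":
    leader = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
    follower = [(1, "a"), (1, "b"), (1, "z"), (1, "y")]
    print(repair_follower_log(leader, follower))
```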

RAFT : Neutralizing old leaders
● The sender includes its term in every RPC
● If the sender's term is older than the receiver's term, the
RPC is rejected; if the receiver's term is older, the receiver steps down to
follower, updates its term and processes the RPC
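The term check can be sketched as a pure function. It returns the receiver's new term, its new state, and whether the RPC is accepted; the function name `on_rpc` and the plain-string states are assumptions for this example.

```python
def on_rpc(receiver_term, receiver_state, sender_term):
    """Return (new_term, new_state, accepted) for the receiver of an RPC."""
    if sender_term < receiver_term:
        # Stale sender (e.g. an old leader): reject, receiver is unchanged.
        return receiver_term, receiver_state, False
    if sender_term > receiver_term:
        # Receiver is behind: step down, adopt the newer term, process the RPC.
        return sender_term, "follower", True
    return receiver_term, receiver_state, True


if __name__ == "__main__":
    # An old leader in term 2 contacts a server that is already in term 3.
    print(on_rpc(3, "follower", 2))
    # A leader in term 2 hears from a sender in term 3 and steps down.
    print(on_rpc(2, "leader", 3))
```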
RAFT : Client protocol
● Client sends commands to the leader
– If the leader is unknown, send to any server
– If the contacted server is not the leader, it redirects to the leader
● Client gets the response only after the full cycle at the leader (command logged, committed and executed)
● If the request times out
– Client re-issues the command to another server
– Client attaches a unique id to each command to avoid duplicate
execution
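Duplicate execution can be avoided by remembering the response for each command id. The sketch below is a toy state machine with a counter-style `apply`; the class name `DedupStateMachine` and the command shape are invented for illustration.

```python
class DedupStateMachine:
    def __init__(self):
        self.state = {}
        self.responses = {}  # command id -> response already returned

    def apply(self, command_id, key, delta):
        """Add delta to key, unless this command id was already executed."""
        if command_id in self.responses:
            return self.responses[command_id]  # retry: replay the old response
        self.state[key] = self.state.get(key, 0) + delta
        self.responses[command_id] = self.state[key]
        return self.state[key]


if __name__ == "__main__":
    sm = DedupStateMachine()
    print(sm.apply("client1-cmd1", "x", 5))
    # The client timed out and re-issued the same command: no double increment.
    print(sm.apply("client1-cmd1", "x", 5))
    print(sm.state)
```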
Joint consensus phase
● 2 phase approach
● Need majority of both old and new
configurations for election and commitment
● Configuration change is just a log entry, applied
immediately on receipt (committed or not)
● Once the joint consensus entry is committed, begin
replicating the log entry for the final configuration
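During the joint phase a decision needs a majority in the old configuration and, separately, a majority in the new one. A minimal sketch, with configurations and acknowledgements modeled as Python sets of server names (the function names are hypothetical):

```python
def has_majority(acks, config):
    """True if the acknowledging servers form a majority of the configuration."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_majority(acks, old_config, new_config):
    """Joint consensus: separate majorities of both old and new configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


if __name__ == "__main__":
    old = {"a", "b", "c"}
    new = {"c", "d", "e"}
    print(joint_majority({"a", "b"}, old, new))       # majority of old only
    print(joint_majority({"a", "c", "d"}, old, new))  # majority of both
```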
Distributed store
● A common store which can be shared by
different nodes
● Data is kept as key-value pairs for ease of use
● Several distributed key-value store
implementations are available

etcd
● The name comes from "/etc distributed"
● Open source distributed consistent key value store
● Highly available and reliable
● Sequentially consistent
● Watchable
● Exposed via HTTP
● Runtime reconfigurable (scaling feature)
● Durable (snapshot backup/restore)
● Time-to-live keys (keys that expire after a timeout)
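To make "watchable" and "time to live" concrete, here is a toy single-process key-value store. It is not etcd's API and has no replication or HTTP layer; the class `ToyStore`, its methods and the injectable `clock` are all assumptions made for this sketch.

```python
import time


class ToyStore:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}      # key -> (value, expires_at or None)
        self._watchers = {}  # key -> list of callbacks

    def watch(self, key, callback):
        """Call callback(key, value) on every later put to this key."""
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key, value, ttl=None):
        """Store a value; with ttl (seconds) the key expires after that time."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)
        for callback in self._watchers.get(key, []):
            callback(key, value)

    def get(self, key):
        """Return the value, or None if the key is missing or has expired."""
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]
            return None
        return value


if __name__ == "__main__":
    store = ToyStore()
    store.watch("/config/mode", lambda k, v: print("changed:", k, v))
    store.put("/config/mode", "rw", ttl=30)
    print(store.get("/config/mode"))
```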
etcd contd...
● Bootstrapping using RAFT
● Proxy mode for a node
● Cluster configuration – etcdctl member
add/remove/list
● Similar projects like consul, zookeeper are also
available.
Why etcd
● Vibrant community
● 500+ applications, such as Kubernetes and Cloud
Foundry, use it
● 150+ developers
● Stable releases
References
● https://raftconsensus.github.io/
● https://www.youtube.com/watch?v=YbZ3zDzDnrw
● https://github.com/coreos/etcd#etcd
● https://consul.io/

Q & A
THANK YOU

More Related Content

What's hot

Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)
Sneeker Yeh
 
OpenZFS send and receive
OpenZFS send and receiveOpenZFS send and receive
OpenZFS send and receive
Matthew Ahrens
 
Rtai
RtaiRtai
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxingKernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
Anne Nicolas
 
An Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux KernelAn Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux Kernel
SeongJae Park
 
RTAI - Earliest Deadline First
RTAI - Earliest Deadline FirstRTAI - Earliest Deadline First
RTAI - Earliest Deadline First
Stefano Bragaglia
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
Linaro
 
Sweetening Systems Management with Salt
Sweetening Systems Management with SaltSweetening Systems Management with Salt
Sweetening Systems Management with Salt
mchesnut
 
Sgnog openflow demo-v1.0
Sgnog openflow demo-v1.0Sgnog openflow demo-v1.0
Sgnog openflow demo-v1.0
Jason Kalai Arasu
 
Transactional Memory
Transactional MemoryTransactional Memory
Transactional Memory
Smruti Sarangi
 
The Silence of the Canaries
The Silence of the CanariesThe Silence of the Canaries
The Silence of the Canaries
Kernel TLV
 
Free FreeRTOS Course-Task Management
Free FreeRTOS Course-Task ManagementFree FreeRTOS Course-Task Management
Free FreeRTOS Course-Task Management
Amr Ali (ISTQB CTAL Full, CSM, ITIL Foundation)
 
Lac2006 Lee Revell Slides
Lac2006 Lee Revell SlidesLac2006 Lee Revell Slides
Lac2006 Lee Revell Slides
rlrevell
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutions
Han Zhou
 
Process synchronization in Operating Systems
Process synchronization in Operating SystemsProcess synchronization in Operating Systems
Process synchronization in Operating Systems
Ritu Ranjan Shrivastwa
 
Libckpt transparent checkpointing under unix
Libckpt transparent checkpointing under unixLibckpt transparent checkpointing under unix
Libckpt transparent checkpointing under unix
ZongYing Lyu
 
Is That A Penguin In My Windows?
Is That A Penguin In My Windows?Is That A Penguin In My Windows?
Is That A Penguin In My Windows?
zeroSteiner
 
Kernel Recipes 2016 - New hwmon device registration API - Jean Delvare
Kernel Recipes 2016 -  New hwmon device registration API - Jean DelvareKernel Recipes 2016 -  New hwmon device registration API - Jean Delvare
Kernel Recipes 2016 - New hwmon device registration API - Jean Delvare
Anne Nicolas
 
Real Time Operating System Concepts
Real Time Operating System ConceptsReal Time Operating System Concepts
Real Time Operating System Concepts
Sanjiv Malik
 
Continuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnitContinuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnit
ScyllaDB
 

What's hot (20)

Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)Concurrency bug identification through kernel panic log (english)
Concurrency bug identification through kernel panic log (english)
 
OpenZFS send and receive
OpenZFS send and receiveOpenZFS send and receive
OpenZFS send and receive
 
Rtai
RtaiRtai
Rtai
 
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxingKernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
Kernel Recipes 2016 - Landlock LSM: Unprivileged sandboxing
 
An Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux KernelAn Introduction to the Formalised Memory Model for Linux Kernel
An Introduction to the Formalised Memory Model for Linux Kernel
 
RTAI - Earliest Deadline First
RTAI - Earliest Deadline FirstRTAI - Earliest Deadline First
RTAI - Earliest Deadline First
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
Sweetening Systems Management with Salt
Sweetening Systems Management with SaltSweetening Systems Management with Salt
Sweetening Systems Management with Salt
 
Sgnog openflow demo-v1.0
Sgnog openflow demo-v1.0Sgnog openflow demo-v1.0
Sgnog openflow demo-v1.0
 
Transactional Memory
Transactional MemoryTransactional Memory
Transactional Memory
 
The Silence of the Canaries
The Silence of the CanariesThe Silence of the Canaries
The Silence of the Canaries
 
Free FreeRTOS Course-Task Management
Free FreeRTOS Course-Task ManagementFree FreeRTOS Course-Task Management
Free FreeRTOS Course-Task Management
 
Lac2006 Lee Revell Slides
Lac2006 Lee Revell SlidesLac2006 Lee Revell Slides
Lac2006 Lee Revell Slides
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutions
 
Process synchronization in Operating Systems
Process synchronization in Operating SystemsProcess synchronization in Operating Systems
Process synchronization in Operating Systems
 
Libckpt transparent checkpointing under unix
Libckpt transparent checkpointing under unixLibckpt transparent checkpointing under unix
Libckpt transparent checkpointing under unix
 
Is That A Penguin In My Windows?
Is That A Penguin In My Windows?Is That A Penguin In My Windows?
Is That A Penguin In My Windows?
 
Kernel Recipes 2016 - New hwmon device registration API - Jean Delvare
Kernel Recipes 2016 -  New hwmon device registration API - Jean DelvareKernel Recipes 2016 -  New hwmon device registration API - Jean Delvare
Kernel Recipes 2016 - New hwmon device registration API - Jean Delvare
 
Real Time Operating System Concepts
Real Time Operating System ConceptsReal Time Operating System Concepts
Real Time Operating System Concepts
 
Continuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnitContinuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnit
 

Viewers also liked

Etcd terraform by Alex Somesan
Etcd terraform by Alex SomesanEtcd terraform by Alex Somesan
Etcd terraform by Alex Somesan
Maarten van der Hoef
 
Distributed Consensus: Making Impossible Possible [Revised]
Distributed Consensus: Making Impossible Possible [Revised]Distributed Consensus: Making Impossible Possible [Revised]
Distributed Consensus: Making Impossible Possible [Revised]
Heidi Howard
 
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
distributed matters
 
Distributed Consensus A.K.A. "What do we eat for lunch?"
Distributed Consensus A.K.A. "What do we eat for lunch?"Distributed Consensus A.K.A. "What do we eat for lunch?"
Distributed Consensus A.K.A. "What do we eat for lunch?"
Konrad Malawski
 
We don't need consensus: All agreed?
We don't need consensus: All agreed?We don't need consensus: All agreed?
We don't need consensus: All agreed?
Weaveworks
 
Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
Amir Payberah
 
MapReduce
MapReduceMapReduce
MapReduce
Amir Payberah
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
Amir Payberah
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
Amir Payberah
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
Amir Payberah
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
Amir Payberah
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
Amir Payberah
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Amir Payberah
 
Protection
ProtectionProtection
Protection
Amir Payberah
 
IO Systems
IO SystemsIO Systems
IO Systems
Amir Payberah
 
Security
SecuritySecurity
Security
Amir Payberah
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
Amir Payberah
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
Amir Payberah
 
Storage
StorageStorage
Storage
Amir Payberah
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
Amir Payberah
 

Viewers also liked (20)

Etcd terraform by Alex Somesan
Etcd terraform by Alex SomesanEtcd terraform by Alex Somesan
Etcd terraform by Alex Somesan
 
Distributed Consensus: Making Impossible Possible [Revised]
Distributed Consensus: Making Impossible Possible [Revised]Distributed Consensus: Making Impossible Possible [Revised]
Distributed Consensus: Making Impossible Possible [Revised]
 
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...Replication and Synchronization Algorithms for Distributed Databases - Lena W...
Replication and Synchronization Algorithms for Distributed Databases - Lena W...
 
Distributed Consensus A.K.A. "What do we eat for lunch?"
Distributed Consensus A.K.A. "What do we eat for lunch?"Distributed Consensus A.K.A. "What do we eat for lunch?"
Distributed Consensus A.K.A. "What do we eat for lunch?"
 
We don't need consensus: All agreed?
We don't need consensus: All agreed?We don't need consensus: All agreed?
We don't need consensus: All agreed?
 
Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
 
MapReduce
MapReduceMapReduce
MapReduce
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Protection
ProtectionProtection
Protection
 
IO Systems
IO SystemsIO Systems
IO Systems
 
Security
SecuritySecurity
Security
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 
Storage
StorageStorage
Storage
 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
 

Similar to Consensus algo with_distributed_key_value_store_in_distributed_system

Manging scalability of distributed system
Manging scalability of distributed systemManging scalability of distributed system
Manging scalability of distributed system
Atin Mukherjee
 
Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systems
Andrea Monacchi
 
Square's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with RaftSquare's Lessons Learned from Implementing a Key-Value Store with Raft
Square's Lessons Learned from Implementing a Key-Value Store with Raft
ScyllaDB
 
Raft presentation
Raft presentationRaft presentation
Raft presentation
Patroclos Christou
 
Storing the real world data
Storing the real world dataStoring the real world data
Storing the real world data
Athira Mukundan
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
MariaDB plc
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
MariaDB plc
 
M|18 Choosing the Right High Availability Strategy for You
Consensus algo with_distributed_key_value_store_in_distributed_system

  • 1. Using consensus algorithm and distributed store in designing distributed system
    Atin Mukherjee, GlusterFS Hacker, @mukherjee_atin
  • 2. Topics
    ● What is consensus in a distributed system?
    ● What is the CAP theorem in a distributed system?
    ● Different distributed system design approaches
    ● Challenges in the design of a distributed system
    ● What is the RAFT algorithm and how does it work?
    ● Distributed store
    ● Combining RAFT & distributed store, in the form of technologies like consul/etcd/zookeeper etc.
    ● Q & A
  • 3. What is consensus in distributed system
    ● Consensus – An agreement, but for what and between whom?
    ● For what → whether the op/transaction is to be committed or not
    ● Between whom → the nodes forming the distributed system
    ● Quorum – floor(n/2) + 1 nodes, i.e. a strict majority of the n nodes
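As a quick check of the quorum formula, the sketch below computes the majority size for a cluster of n nodes using integer division, together with the number of node failures the cluster can absorb while a quorum is still reachable. The function names are illustrative and are not taken from the talk.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that may fail while the remaining ones can still form a quorum."""
    return n - quorum(n)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum(size), tolerated_failures(size))
```

Note that a 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is one reason odd cluster sizes are commonly preferred.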
  • 4. CAP theorem
    ● A distributed system can provide at most two of the following three guarantees
      – Consistency (all nodes see the same data at the same time)
      – Availability (a guarantee that every request receives a response about whether it succeeded or failed)
      – Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  • 5. Design approaches of distributed system
    ● No meta data – all nodes share their data with each other
    ● Meta data server – one node holds the data and the others fetch it from that node
    So which one is better? Probably neither of them? Ask yourself for a minute...
  • 6. Challenges in design of a distributed system
    ● No meta data
      – N * N exchange of network messages
      – Not scalable when N is in the hundreds or thousands
      – Initialization time can be very high
      – Can end up in a “whom to believe, whom not to” situation, popularly known as split brain
      – How to undo a transaction locally
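To make the N * N growth concrete, here is a small sketch that counts messages for one round of state exchange. It assumes a simplified model that is not from the talk: in the no-metadata design every node sends one message to every other node, while in a leader-based design the leader sends one message to each follower and gets one reply back.

```python
def full_mesh_messages(n: int) -> int:
    """Every node sends one message to each of the other n - 1 nodes."""
    return n * (n - 1)


def leader_based_messages(n: int) -> int:
    """Leader sends to n - 1 followers and each follower replies once."""
    return 2 * (n - 1)


if __name__ == "__main__":
    for n in (3, 100, 1000):
        print(n, full_mesh_messages(n), leader_based_messages(n))
```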
  • 7. Challenges in design of a distributed system contd...
    ● MDS (Meta data server)
      – SPOF (single point of failure). So is this the only drawback?
      – How about having replicas, and then what replica count?
      – Additional network hop, lower performance
  • 8. RAFT – A consensus algorithm
    ● Key features
      – Leader/follower based model
      – Leader election
      – Normal operation
      – Safety and consistency after leader changes
      – Neutralizing old leaders
      – Client interactions
      – Configuration changes
  • 9. RAFT : Server states ● Server state transitions (follower, candidate, leader)
  • 10. RAFT : Terms
    ● Each term is divided into two parts
      – Election
      – Normal operation
    ● At most 1 leader per term
    ● Failed election: a term may end with no leader
    ● Split vote: no candidate gets a majority
    ● Each server maintains its current term value
  • 11. RAFT : Replicated state machine ● A picture is worth a thousand words...
  • 12. RAFT : Different RPCs
    ● RequestVote RPCs
      – A candidate sends these to the other nodes to get itself elected as leader
    ● AppendEntries RPCs
      – Normal operation workload
    ● AppendEntries RPCs with no entries: heartbeat messages
      – The leader sends these to all followers to make its presence known
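The two RPC types can be sketched as plain data classes. The field names follow the arguments listed in the Raft paper; the Python class layout and the `is_heartbeat` helper are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVote:
    term: int            # candidate's current term
    candidate_id: str    # who is asking for the vote
    last_log_index: int  # index of the candidate's last log entry
    last_log_term: int   # term of the candidate's last log entry


@dataclass
class AppendEntries:
    term: int            # leader's current term
    leader_id: str       # lets followers redirect clients
    prev_log_index: int  # index of the entry just before the new ones
    prev_log_term: int   # term of that preceding entry
    leader_commit: int   # leader's commit index
    # Each entry is (term, command); an empty list makes this a heartbeat.
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def is_heartbeat(self) -> bool:
        """A heartbeat is an AppendEntries RPC that carries no entries."""
        return not self.entries
```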
  • 13. RAFT : Leader Election
    ● current_term++
    ● Follower->Candidate
    ● Self vote
    ● Send RequestVote RPCs to all other servers, retry until either:
      – Receive votes from a majority of servers: become leader
      – Receive an RPC from a valid leader: step down to follower
      – Election timeout elapses: increment term and start a new election
    ● Election properties
      – Safety – allow at most one winner per term
      – Liveness – some candidate must eventually win
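The election steps above can be sketched as a single in-memory round. This is a simplified model under stated assumptions: there is no network, no timeout or retry, and the log comparison from the next slide is left out. The `Server` class and `run_election` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    current_term: int = 0
    voted_for: Optional[str] = None
    state: str = "follower"

    def handle_request_vote(self, term: int, candidate: str) -> bool:
        """Grant at most one vote per term; reject requests from stale terms."""
        if term < self.current_term:
            return False
        if term > self.current_term:
            # A newer term resets the vote and forces follower state.
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Server, peers: List[Server]) -> bool:
    """One election round: bump term, self vote, ask peers, count votes."""
    candidate.current_term += 1
    candidate.state = "candidate"
    candidate.voted_for = candidate.name
    votes = 1  # the self vote
    for peer in peers:
        if peer.handle_request_vote(candidate.current_term, candidate.name):
            votes += 1
    cluster_size = len(peers) + 1
    if votes >= cluster_size // 2 + 1:
        candidate.state = "leader"
        return True
    return False  # stays a candidate; a real server would time out and retry
```

Because each server grants only one vote per term, two candidates cannot both collect a majority in the same term, which is the safety property named on the slide.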
  • 14. RAFT : Picking the best leader
    ● A candidate includes log info in RequestVote RPCs: the index & term of its last log entry
    ● Voting server V denies its vote if its own log is more complete:
      (votingServerLastTerm > candidateLastTerm) || ((votingServerLastTerm == candidateLastTerm) && (votingServerLastIndex > candidateLastIndex))
    ● But is this enough to have crash consistency?
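The vote-denial condition translates directly into code. The sketch below mirrors the expression from the slide; the function names are illustrative.

```python
def voter_log_is_more_complete(
    voter_last_term: int,
    voter_last_index: int,
    candidate_last_term: int,
    candidate_last_index: int,
) -> bool:
    """True when the voter's log is strictly more complete than the candidate's."""
    return (voter_last_term > candidate_last_term) or (
        voter_last_term == candidate_last_term
        and voter_last_index > candidate_last_index
    )


def grant_vote(
    voter_last_term: int,
    voter_last_index: int,
    candidate_last_term: int,
    candidate_last_index: int,
) -> bool:
    """The voter grants its vote unless its own log is more complete."""
    return not voter_log_is_more_complete(
        voter_last_term, voter_last_index, candidate_last_term, candidate_last_index
    )
```

The same test can be written as a tuple comparison, `(voter_last_term, voter_last_index) > (candidate_last_term, candidate_last_index)`, since Python compares tuples element by element.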
  • 15. RAFT : New commitment rules
    ● For a leader to decide an entry is committed:
      – It must be stored on a majority of servers, and
      – At least one new entry from the leader's term must also be stored on a majority of servers
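A minimal sketch of how a leader could apply these two rules when advancing its commit index. It assumes 1-based log indexes and a `match_index` list holding, for every server including the leader, the highest log index known to be stored there; these names are borrowed from the Raft paper, and the function itself is only an illustration.

```python
from typing import List


def advance_commit_index(
    match_index: List[int],
    log_terms: List[int],
    current_term: int,
    committed: int,
) -> int:
    """Return the new commit index.

    log_terms[i - 1] is the term of the leader's entry at 1-based index i.
    An index n is committed only if a majority stores it AND the entry at n
    was created in the leader's current term; earlier entries then become
    committed along with it.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log_terms), committed, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n
    return committed
```

For example, with a leader in term 4 whose log has terms `[1, 2, 4]`, an entry from term 2 stored on three of five servers is not committed on its own; it becomes committed once the term 4 entry at index 3 also reaches a majority.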
  • 16. RAFT : Log inconsistency
    ● The leader repairs follower logs by
      – Deleting extraneous entries
      – Filling in missing entries from the leader's log
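The repair can be sketched with in-memory lists. This assumes the AppendEntries consistency check described in the Raft paper: the follower accepts new entries only if its log matches the leader's at the preceding index and term, and the leader backs up one entry at a time until that check passes. Function names and the direct list access (instead of RPCs) are simplifications for illustration.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(
    log: List[Entry], prev_index: int, prev_term: int, entries: List[Entry]
) -> bool:
    """Follower side. prev_index is 1-based; 0 means 'before the first entry'."""
    if prev_index > len(log):
        return False  # follower is missing the preceding entry
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False  # preceding entry conflicts with the leader's
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot for this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]  # delete the conflicting entry and all that follow
        if pos >= len(log):
            log.append(entry)  # fill in the missing entry
    return True


def repair(leader_log: List[Entry], follower_log: List[Entry]) -> None:
    """Leader side: back up next_index until the follower accepts."""
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(
            follower_log, prev_index, prev_term, leader_log[prev_index:]
        ):
            return
        next_index -= 1  # prev_index 0 always succeeds, so this terminates
```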
  • 17. RAFT : Neutralizing old leaders
    ● The sender includes its term in every RPC
    ● If the sender's term is older than the receiver's term, the RPC is rejected
    ● If the receiver's term is older, the receiver steps down to follower, updates its term, and then processes the RPC
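The term comparison can be sketched as a small receiver-side check. The `Node` class and `accept_rpc` function are hypothetical names; the sketch covers only the term handling and leaves out the actual RPC processing.

```python
from dataclasses import dataclass


@dataclass
class Node:
    term: int
    state: str  # "follower", "candidate" or "leader"


def accept_rpc(receiver: Node, sender_term: int) -> bool:
    """Return True if the RPC should be processed, False if it is rejected."""
    if sender_term < receiver.term:
        # Stale sender, for example a leader that was partitioned away.
        return False
    if sender_term > receiver.term:
        # The receiver is the stale one: adopt the newer term and step down.
        receiver.term = sender_term
        receiver.state = "follower"
    return True
```

In the rejection case a real implementation would also return the receiver's term in the reply, so that the stale sender learns the newer term and steps down itself.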
  • 18. RAFT : Client protocol
    ● Send commands to the leader
      – If the leader is unknown, send to any server
      – If the contacted server is not the leader, it redirects to the leader
    ● The client gets the response only after the full cycle at the leader (command logged, committed, and executed)
    ● Request timeout
      – The client re-issues the command to another server
      – A unique id for each command at the client avoids duplicate execution
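The unique command id idea can be sketched as a state machine that remembers the result of every command it has applied. The class and the counter-style command are illustrative assumptions; a real system would also bound the size of the result cache.

```python
from typing import Dict


class DedupStateMachine:
    """Applies 'add delta to key' commands at most once per command id."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.results: Dict[str, int] = {}  # command id -> cached result

    def apply(self, command_id: str, key: str, delta: int) -> int:
        if command_id in self.results:
            # A retried command: return the earlier result, do not re-execute.
            return self.results[command_id]
        self.data[key] = self.data.get(key, 0) + delta
        self.results[command_id] = self.data[key]
        return self.data[key]
```

If a client times out and re-sends command `c1`, the second `apply` returns the cached result and the counter is not incremented twice.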
  • 19. Joint consensus phase
    ● 2 phase approach
    ● Needs a majority of both the old and new configurations for election and commitment
    ● A configuration change is just a log entry, applied immediately on receipt (committed or not)
    ● Once joint consensus is committed, begin replicating the log entry for the final configuration
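The double-majority requirement during the joint phase can be sketched with sets of server names. The function names are illustrative, and the sketch only covers the counting rule, not the log replication around it.

```python
from typing import Set


def has_majority(acks: Set[str], config: Set[str]) -> bool:
    """True if the acknowledging servers include a majority of config."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_majority(acks: Set[str], old: Set[str], new: Set[str]) -> bool:
    """During joint consensus, both configurations must agree separately."""
    return has_majority(acks, old) and has_majority(acks, new)
```

For example, with old = {a, b, c} and new = {c, d, e}, acknowledgements from {a, b} satisfy the old configuration only, so nothing can be committed until servers from the new configuration also respond.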
  • 20. Distributed store
    ● A common store which can be shared by different nodes
    ● In the form of key value pairs for ease of use
    ● Several such distributed key value store implementations are available
  • 21. etcd
    ● The name stands for "/etc distributed"
    ● Open source distributed consistent key value store
    ● Highly available and reliable
    ● Sequentially consistent
    ● Watchable
    ● Exposed via HTTP
    ● Runtime reconfigurable (scaling feature)
    ● Durable (snapshot backup/restore)
    ● Time to live keys (keys with a timeout)
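To illustrate two of the listed features, watchable keys and time to live keys, here is a toy single-process store. It is not the etcd API and has no replication or persistence; the `ToyStore` class, its method names, and the injected clock are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Watcher = Callable[[str, str], None]


class ToyStore:
    """In-memory key value store with per-key TTL and change callbacks."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock  # injected so expiry can be tested deterministically
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}
        self._watchers: Dict[str, List[Watcher]] = {}

    def put(self, key: str, value: str, ttl: Optional[float] = None) -> None:
        expires = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires)
        for callback in self._watchers.get(key, []):
            callback(key, value)  # notify everyone watching this key

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # lazily drop the expired key
            return None
        return value

    def watch(self, key: str, callback: Watcher) -> None:
        self._watchers.setdefault(key, []).append(callback)
```

In a real deployment the application would use `time.monotonic` (or similar) as the clock; the tests below substitute a fake clock to step time forward.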
  • 22. etcd contd...
    ● Bootstrapping using RAFT
    ● Proxy mode in node
    ● Cluster configuration – etcdctl member add/remove/list
    ● Similar projects like consul and zookeeper are also available
  • 23. Why etcd
    ● Vibrant community
    ● 500+ applications, such as Kubernetes and Cloud Foundry, use it
    ● 150+ developers
    ● Stable releases
  • 25. Q & A