Elasticsearch
Cluster deep dive
NoSQL: Text Search and Document
Elasticsearch cluster
[Diagram from the cluster documentation: a distributed cluster made of client nodes, data nodes, master nodes and ingest nodes]
Today's view of the cluster
[Diagram: master nodes and other nodes]
What happens when a node starts?
[Diagram: a starting node pings the existing cluster of nodes A, B, C, D, E; C is the current master]
1. Get a list of nodes to ping from the config.
2. Each response contains:
   a. cluster name
   b. node details
   c. master node details
   d. cluster state version
3. Only master-eligible responses are kept, based on discovery.zen.master_election.ignore_non_master_pings.
From these responses the starting node builds:
● List of master nodes (as reported): [C, C]
● List of master-eligible nodes: [A, B, C]
It then joins the cluster:
1. Join the master node (C) by sending internal:discovery/zen/join.
2. The master validates the join by sending internal:discovery/zen/join/validate.
3. The master updates the cluster state with the new node and publishes the cluster state update.
4. The master waits for discovery.zen.minimum_master_nodes master-eligible nodes to respond.
5. The change is committed and a confirmation is sent.
Finally, the new node checks the received state for:
   a. a new master node
   b. no master node in the state
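The pieces of information listed above (cluster name, node details, elected master, cluster state version) are the same fields exposed by the cluster state API, so they can be inspected from the outside. A minimal sketch in Python, assuming a node reachable on http://localhost:9200 and the requests library:

import requests

# Fetch only the parts of the cluster state relevant here:
# the node list, the elected master and the state version.
state = requests.get(
    "http://localhost:9200/_cluster/state/nodes,master_node,version"
).json()

print("cluster name: ", state["cluster_name"])
print("state version:", state["version"])
print("master node:  ", state["master_node"])
for node_id, node in state["nodes"].items():
    print(f"  {node_id}: {node['name']} @ {node['transport_address']}")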
Master fault detection
[Diagram: started cluster of nodes A–F; every node pings the master]
● Every discovery.zen.fd.ping_interval, each node pings the master (default 1s)
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Retries are discovery.zen.fd.ping_retries (default 3)
Node fault detection
[Diagram: started cluster of nodes A–F; the master pings every node]
● Every discovery.zen.fd.ping_interval, the master pings each node (default 1s)
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Retries are discovery.zen.fd.ping_retries (default 3)
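The same interval/timeout/retry discipline applies in both directions (nodes pinging the master, and the master pinging the nodes). An illustrative sketch of that discipline in Python, not the actual Java implementation; the ping callable is hypothetical:

import time

PING_INTERVAL = 1.0   # discovery.zen.fd.ping_interval (default 1s)
PING_TIMEOUT = 30.0   # discovery.zen.fd.ping_timeout (default 30s)
PING_RETRIES = 3      # discovery.zen.fd.ping_retries (default 3)

def monitor(ping):
    # `ping` is a hypothetical callable returning True if the peer
    # answered within the given timeout.
    failures = 0
    while True:
        time.sleep(PING_INTERVAL)
        if ping(timeout=PING_TIMEOUT):
            failures = 0          # a healthy response resets the counter
        else:
            failures += 1
            if failures >= PING_RETRIES:
                return "peer considered failed, fault handling kicks in"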
Master election
[Diagram: nodes A–F; a minimum number of master-eligible candidates is required for an election]
Network Partition
[Diagram: nodes A–F split by a network partition]
● On the side without enough master-eligible candidates, a master election cannot happen and the master steps down.
● On the other side, master fault detection triggers a new master election.
Master election
1. Based on the list of master-eligible nodes, each node chooses, in order of priority:
   a. the node with the highest cluster state version (part of the ping response)
   b. a master-eligible node
   c. the alphabetically first node id among the remaining candidates
2. It sends a join request to this new master; in the meantime the new master accumulates join requests.
If the current node elected itself as master, it waits for the minimum number of join requests (discovery.zen.minimum_master_nodes) before declaring itself master.
In case of master failure detection, each node removes the failed master from the candidates.
[Diagram: the candidate holding the latest cluster state version is preferred]
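A minimal sketch of that ordering in Python, assuming each ping response is a dict with hypothetical id, master_eligible and cluster_state_version keys:

def pick_master(ping_responses):
    # Keep only master-eligible candidates (rule 1b above).
    eligible = [r for r in ping_responses if r["master_eligible"]]
    if not eligible:
        return None
    # Prefer the highest cluster state version (rule 1a), then break ties
    # on the alphabetically first node id (rule 1c).
    return min(eligible,
               key=lambda r: (-r["cluster_state_version"], r["id"]))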
Lost update, partially fixed in 5.0, found by a Jepsen test
[Diagram sequence: nodes A–F; master C is cut off at cluster state v18 while the other side of the partition advances to v19 and then v20; C later reaches v19 on its own, but when the partition heals it cannot become the master, so the update behind its v19 state is lost]
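One way to observe this divergence from the outside is to compare the cluster state version each node currently holds. A minimal sketch, assuming the nodes are reachable at the hypothetical addresses below:

import requests

# local=true asks each node for the cluster state it holds locally
# instead of forwarding the request to the elected master.
hosts = ["http://node-a:9200", "http://node-c:9200"]  # hypothetical addresses
for host in hosts:
    state = requests.get(
        f"{host}/_cluster/state/master_node,version?local=true"
    ).json()
    print(host, "sees master", state["master_node"], "at version", state["version"])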
Shard allocation
Shard assigned to a new node
1. The master rebalances shard allocation to have:
   a. the same average number of shards per node
   b. the same average number of shards per index per node, avoiding two copies of the same shard on the same node
2. It uses deciders to decide which shard goes where, based on:
   a. hot/warm setup (time-based indices)
   b. disk usage allocation (low watermark and high watermark)
   c. throttling (if the node is already recovering, the master may try again later)
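The disk-usage watermarks mentioned in 2b are dynamic cluster settings. A minimal sketch of adjusting them over the REST API, assuming a local node; the percentages shown are simply the defaults, not a recommendation:

import requests

requests.put("http://localhost:9200/_cluster/settings", json={
    "transient": {
        # Stop allocating new shards to a node above the low watermark.
        "cluster.routing.allocation.disk.watermark.low": "85%",
        # Start moving shards off a node above the high watermark.
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
})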
Shard initialization (Primary)
1. The master communicates a new shard assignment through the cluster state.
2. The node initializes an empty shard.
3. The node notifies the master.
4. The master marks the shard as started.
5. If this is the first shard with a specific id, it is marked as primary and receives requests.
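The INITIALIZING to STARTED transition described here can be watched through the cat shards API. A minimal sketch, assuming a local node:

import requests

# One line per shard copy: index, shard id, p/r (primary or replica),
# state (INITIALIZING, STARTED, RELOCATING, UNASSIGNED) and node.
print(requests.get("http://localhost:9200/_cat/shards?v").text)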
Shard initialization (Replica)
1. The master communicates a new shard assignment through the cluster state.
2. The node initializes recovery from the primary.
3. The node notifies the master.
4. The master marks the replica as started.
5. The node activates the replica.
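A replica recovering from its primary (step 2) can be followed with the cat recovery API. A minimal sketch, assuming a local node:

import requests

# Shows, per shard, the recovery type and stage, and how much of the
# segment files and translog operations has been transferred so far.
print(requests.get("http://localhost:9200/_cat/recovery?v").text)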
Shard recovery
[Diagram: anatomy of a shard — segments S1, S2, S3 and a commit point on disk, an in-memory indexing buffer, and the translog]
Recovery from primary
[Diagram: node with the primary on one side, node with the replica on the other]
1. The replica sends a Start Recovery request; the primary:
   a. validates the request
   b. prevents the translog from being deleted
   c. snapshots Lucene
2. The primary sends the segments to the replica.
3. The primary sends the translog to the replica.
4. The node holding the replica notifies the master.
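The two transfers above (segment files first, then the translog) show up as separate progress counters in the indices recovery API. A minimal sketch, assuming a local node and a hypothetical index called my-index; field names follow the 5.x response layout:

import requests

recovery = requests.get("http://localhost:9200/my-index/_recovery").json()
for shard in recovery.get("my-index", {}).get("shards", []):
    print("shard", shard["id"], shard["type"], shard["stage"],
          "files:", shard["index"]["files"]["percent"],
          "translog:", shard["translog"]["percent"])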
Thank you!
