This is a 10-minute talk about how Elasticsearch manages its cluster. It covers master election, fault detection, the cluster state update protocol, network partitioning, shard allocation and shard recovery.
10. What happens when a node starts?
[Diagram: nodes A, B, C, D and E, and a node labeled "Starting"]
1. Get a list of nodes to ping from the config
2. Each response contains:
a. cluster name
b. node details
c. master node details
d. cluster state version
3. Keep only responses from master-eligible nodes, based on discovery.zen.master_election.ignore_non_master_pings
11. What happens when a node starts?
[Diagram: nodes A, B, C, D and E, and a node labeled "Starting"]
● List of master nodes: [C, C]
● List of eligible master nodes: [A, B, C]
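A minimal sketch of how the two lists above could be derived from ping responses. The `PingResponse` type and `summarize_pings` function are hypothetical names for illustration, not Elasticsearch classes. Dropping responses from a different cluster name is an assumption based on the cluster name being part of each response.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PingResponse:
    cluster_name: str
    node_id: str
    master_eligible: bool
    master_id: Optional[str]      # master the responder currently follows, if any
    cluster_state_version: int


def summarize_pings(
    responses: List[PingResponse],
    cluster_name: str,
    ignore_non_master_pings: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (masters reported by responders, master-eligible node ids)."""
    # Assumption: responses from another cluster are ignored.
    kept = [r for r in responses if r.cluster_name == cluster_name]
    if ignore_non_master_pings:
        # Mirrors discovery.zen.master_election.ignore_non_master_pings.
        kept = [r for r in kept if r.master_eligible]
    masters = [r.master_id for r in kept if r.master_id is not None]
    eligible = sorted(r.node_id for r in kept if r.master_eligible)
    return masters, eligible


responses = [
    PingResponse("logs", "A", True, "C", 12),
    PingResponse("logs", "B", True, "C", 12),
    PingResponse("logs", "C", True, "C", 12),
    PingResponse("logs", "D", False, "C", 12),
    PingResponse("other", "X", True, "X", 3),   # different cluster, dropped
]
masters, eligible = summarize_pings(responses, "logs", ignore_non_master_pings=True)
```

Here every retained responder names C as its master, so the starting node knows whom to join without running an election.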
18. What happens when a node starts?
[Diagram: nodes A, B, C, D and E, and a node labeled "Starting"]
1. Join the master node (C) by sending: internal:discovery/zen/join
2. The master validates the join by sending: internal:discovery/zen/join/validate
3. The master updates the cluster state with the new node
4. The master waits for discovery.zen.minimum_master_nodes master-eligible nodes to respond
5. The change is committed and a confirmation is sent
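Steps 3 to 5 amount to a two-phase update. The sketch below models them in memory under stated assumptions: `Node` and `publish` are hypothetical names, unreachable nodes simply never acknowledge, and the master is assumed to count itself toward the minimum_master_nodes quorum.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    master_eligible: bool
    reachable: bool = True
    pending_version: Optional[int] = None   # received but not yet committed
    committed_version: int = 0


def publish(master: Node, others: List[Node], new_version: int,
            minimum_master_nodes: int) -> bool:
    """Two-phase publish: send, wait for enough master-eligible acks, commit."""
    # Phase 1: send the new state; unreachable nodes never acknowledge.
    receivers = [n for n in others if n.reachable]
    for node in receivers:
        node.pending_version = new_version
    # Assumption: the master counts itself toward the quorum.
    acks = 1 + sum(1 for n in receivers if n.master_eligible)
    if acks < minimum_master_nodes:
        for node in receivers:
            node.pending_version = None      # the change is dropped
        return False
    # Phase 2: commit on the master and on every node that received the state.
    for node in [master] + receivers:
        node.pending_version = None
        node.committed_version = new_version
    return True
```

With three master-eligible nodes and minimum_master_nodes set to 2, a master that cannot reach either of the other two gets only its own acknowledgement, so the change is never committed.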
19. What happens when a node starts?
[Diagram: nodes A, B, C, D and E, and a node labeled "Starting"]
1. The new node checks the received state for:
a. a new master node
b. no master node in the state
20. Master fault detection
[Diagram: nodes A, B, C, D, E and F, labeled "Started"]
● Every discovery.zen.fd.ping_interval (default 1s), each node pings the master
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Number of retries is discovery.zen.fd.ping_retries (default 3)
21. Node fault detection
[Diagram: nodes A, B, C, D, E and F, labeled "Started"]
● Every discovery.zen.fd.ping_interval (default 1s), the master pings every other node
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Number of retries is discovery.zen.fd.ping_retries (default 3)
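Both fault detectors follow the same ping, timeout and retry pattern. The sketch below shows that pattern only. `is_peer_failed` and the `ping` callable are hypothetical, and the worst-case figure assumes retries run back to back against a peer that never answers.

```python
from typing import Callable


def is_peer_failed(ping: Callable[[float], bool],
                   ping_timeout: float = 30.0,
                   ping_retries: int = 3) -> bool:
    """Declare a peer failed only after ping_retries consecutive failed pings.

    `ping(timeout)` is a hypothetical callable that returns True when the
    peer answered within `timeout` seconds.
    """
    for _ in range(ping_retries):
        if ping(ping_timeout):
            return False          # one answer is enough to keep the peer
    return True


def worst_case_detection_seconds(ping_timeout: float = 30.0,
                                 ping_retries: int = 3) -> float:
    """Rough upper bound for a silent peer, assuming back-to-back retries."""
    return ping_timeout * ping_retries
```

Under these assumptions the default settings give 3 x 30s = 90s before a silent peer is declared failed, which is why lowering ping_timeout trades faster detection against more false positives on a slow network.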
27. Master election
1. Based on the list of master-eligible nodes, a node picks its candidate in this order of priority:
a. The node with the highest cluster state version (part of the ping response)
b. A master-eligible node
c. Sort the remaining candidates alphabetically by id and take the first
2. It sends a join to this new master. In the meantime it accumulates join requests.
If the current node elected itself as master, it waits for the minimum number of join requests (discovery.zen.minimum_master_nodes) before declaring itself master.
In case of master failure detection, each node removes the failed master from the candidates.
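The selection rule above can be sketched as a filter followed by a sort. `Candidate` and `elect_master` are hypothetical names. The master-eligibility criterion is applied as a filter because the input is described as the list of master-eligible nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    node_id: str
    master_eligible: bool
    cluster_state_version: int


def elect_master(candidates: Iterable[Candidate],
                 failed_master_id: Optional[str] = None) -> Optional[Candidate]:
    """Pick the master-eligible node with the highest state version, then lowest id."""
    pool = [
        c for c in candidates
        if c.master_eligible and c.node_id != failed_master_id
    ]
    if not pool:
        return None
    # Highest cluster state version first, ties broken alphabetically by id.
    return min(pool, key=lambda c: (-c.cluster_state_version, c.node_id))
```

Because every node applies the same deterministic ordering to the same ping results, nodes that see the same candidates converge on the same choice without extra coordination.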
37. Shard assigned to new node
1. The master rebalances shard allocation to have:
a. the same average number of shards per node
b. the same average number of shards per index per node, avoiding two shards with the same id on the same node
2. It uses deciders to decide which shard goes where, based on:
a. Hot/Warm setup (time-based indices)
b. Disk usage allocation (low watermark and high watermark)
c. Throttling (the node is already recovering, the master might try again later)
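A rough sketch of how a chain of deciders could answer "can this shard go on this node?". All names here are hypothetical, including the `box_type` attribute used to model a hot/warm setup, and the threshold defaults are illustrative values rather than a statement of what a given cluster uses.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

ShardId = Tuple[str, int]            # (index name, shard number)


@dataclass
class NodeInfo:
    name: str
    box_type: str                    # hypothetical node attribute: "hot" or "warm"
    disk_used_pct: float
    ongoing_recoveries: int = 0
    shards: Set[ShardId] = field(default_factory=set)


def can_allocate(shard: ShardId, wanted_box_type: str, node: NodeInfo,
                 low_watermark_pct: float = 85.0,
                 max_recoveries: int = 2) -> str:
    """Return "NO", "THROTTLE" or "YES" for placing `shard` on `node`."""
    if shard in node.shards:
        return "NO"                  # a copy of this shard already lives here
    if node.box_type != wanted_box_type:
        return "NO"                  # hot/warm filter does not match
    if node.disk_used_pct >= low_watermark_pct:
        return "NO"                  # above the low watermark: no new shards
    if node.ongoing_recoveries >= max_recoveries:
        return "THROTTLE"            # busy recovering, the master may retry later
    return "YES"
```

The three-valued answer matters: "NO" rules the node out for this shard, while "THROTTLE" only postpones the decision until the node has finished some of its current recoveries.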
38. Shard initialization (Primary)
1. The master communicates a new shard assignment through the cluster state
2. The node initializes an empty shard
3. The node notifies the master
4. The master marks the shard as started
5. If this is the first shard with a specific id, it is marked as primary and receives requests
39. Shard initialization (Replica)
1. The master communicates a new shard assignment through the cluster state
2. The node initializes recovery from the primary
3. The node notifies the master
4. The master marks the replica as started
5. The node activates the replica
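The primary and replica sequences differ only in step 2. The sketch below models both as a small state machine. `ShardCopy` and the three functions are hypothetical names, and "recovery" is reduced to copying a list of documents from a started primary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ShardState(Enum):
    UNASSIGNED = "UNASSIGNED"
    INITIALIZING = "INITIALIZING"
    STARTED = "STARTED"


@dataclass
class ShardCopy:
    shard_id: int
    primary: bool
    state: ShardState = ShardState.UNASSIGNED
    docs: List[str] = field(default_factory=list)


def master_assign(copy: ShardCopy) -> None:
    """Step 1: the assignment is published through the cluster state."""
    copy.state = ShardState.INITIALIZING


def node_initialize(copy: ShardCopy, source: Optional[ShardCopy] = None) -> None:
    """Step 2: a primary starts empty, a replica recovers from the primary."""
    if copy.state is not ShardState.INITIALIZING:
        raise ValueError("shard must be assigned before it is initialized")
    if copy.primary:
        copy.docs = []
    else:
        if source is None or source.state is not ShardState.STARTED:
            raise ValueError("replica recovery needs a started primary")
        copy.docs = list(source.docs)


def master_mark_started(copy: ShardCopy) -> None:
    """Steps 3 and 4: the node reports back and the master marks it started."""
    copy.state = ShardState.STARTED
```

A replica can only be initialized once its primary is started, which is why a new index shows its primaries turning active before its replicas.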