Elasticsearch
Cluster deep dive
NoSQL: Text Search and Document
Elasticsearch cluster
[Diagram from the cluster documentation: a distributed cluster made of client nodes, data nodes, master nodes and ingest nodes]
Today's view of the cluster
[Diagram: master nodes and other nodes]
What happens when a node starts?
[Diagram: a starting node pings the existing cluster of nodes A, B, C, D, E; C is the current master]
1. Get a list of nodes to ping from the config.
2. Each response contains:
   a. cluster name
   b. node details
   c. master node details
   d. cluster state version
3. Only master-eligible responses are kept, based on discovery.zen.master_election.ignore_non_master_pings.
From these responses the starting node builds:
● List of master nodes (as reported): [C, C]
● List of master-eligible nodes: [A, B, C]
It then joins the cluster:
1. Join the master node (C) by sending internal:discovery/zen/join.
2. The master validates the join by sending internal:discovery/zen/join/validate.
3. The master updates the cluster state with the new node and publishes the cluster state update.
4. The master waits for discovery.zen.minimum_master_nodes master-eligible nodes to respond.
5. The change is committed and a confirmation is sent.
Finally, the new node checks the received state for:
   a. a new master node
   b. no master node in the state
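The pieces of information listed above (cluster name, node details, elected master, cluster state version) are the same fields exposed by the cluster state API, so they can be inspected from the outside. A minimal sketch in Python, assuming a node reachable on http://localhost:9200 and the requests library:

import requests

# Fetch only the parts of the cluster state relevant here:
# the node list, the elected master and the state version.
state = requests.get(
    "http://localhost:9200/_cluster/state/nodes,master_node,version"
).json()

print("cluster name: ", state["cluster_name"])
print("state version:", state["version"])
print("master node:  ", state["master_node"])
for node_id, node in state["nodes"].items():
    print(f"  {node_id}: {node['name']} @ {node['transport_address']}")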
Master fault detection
[Diagram: started cluster of nodes A–F; every node pings the master]
● Every discovery.zen.fd.ping_interval, each node pings the master (default 1s)
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Retries are discovery.zen.fd.ping_retries (default 3)
Node fault detection
[Diagram: started cluster of nodes A–F; the master pings every node]
● Every discovery.zen.fd.ping_interval, the master pings each node (default 1s)
● Timeout is discovery.zen.fd.ping_timeout (default 30s)
● Retries are discovery.zen.fd.ping_retries (default 3)
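The same interval/timeout/retry discipline applies in both directions (nodes pinging the master, and the master pinging the nodes). An illustrative sketch of that discipline in Python, not the actual Java implementation; the ping callable is hypothetical:

import time

PING_INTERVAL = 1.0   # discovery.zen.fd.ping_interval (default 1s)
PING_TIMEOUT = 30.0   # discovery.zen.fd.ping_timeout (default 30s)
PING_RETRIES = 3      # discovery.zen.fd.ping_retries (default 3)

def monitor(ping):
    # `ping` is a hypothetical callable returning True if the peer
    # answered within the given timeout.
    failures = 0
    while True:
        time.sleep(PING_INTERVAL)
        if ping(timeout=PING_TIMEOUT):
            failures = 0          # a healthy response resets the counter
        else:
            failures += 1
            if failures >= PING_RETRIES:
                return "peer considered failed, fault handling kicks in"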
Master election
[Diagram: nodes A–F; a minimum number of master-eligible candidates is required for an election]
Network Partition
[Diagram: nodes A–F split by a network partition]
● On the side without enough master-eligible candidates, a master election cannot happen and the master steps down.
● On the other side, master fault detection triggers a new master election.
Master election
1. Based on the list of master-eligible nodes, each node chooses, in order of priority:
   a. the node with the highest cluster state version (part of the ping response)
   b. a master-eligible node
   c. the alphabetically first node id among the remaining candidates
2. It sends a join request to this new master; in the meantime the new master accumulates join requests.
If the current node elected itself as master, it waits for the minimum number of join requests (discovery.zen.minimum_master_nodes) before declaring itself master.
In case of master failure detection, each node removes the failed master from the candidates.
[Diagram: the candidate holding the latest cluster state version is preferred]
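A minimal sketch of that ordering in Python, assuming each ping response is a dict with hypothetical id, master_eligible and cluster_state_version keys:

def pick_master(ping_responses):
    # Keep only master-eligible candidates (rule 1b above).
    eligible = [r for r in ping_responses if r["master_eligible"]]
    if not eligible:
        return None
    # Prefer the highest cluster state version (rule 1a), then break ties
    # on the alphabetically first node id (rule 1c).
    return min(eligible,
               key=lambda r: (-r["cluster_state_version"], r["id"]))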
Lost update, partially fixed in 5.0, found by a Jepsen test
[Diagram sequence: nodes A–F; master C is cut off at cluster state v18 while the other side of the partition advances to v19 and then v20; C later reaches v19 on its own, but when the partition heals it cannot become the master, so the update behind its v19 state is lost]
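One way to observe this divergence from the outside is to compare the cluster state version each node currently holds. A minimal sketch, assuming the nodes are reachable at the hypothetical addresses below:

import requests

# local=true asks each node for the cluster state it holds locally
# instead of forwarding the request to the elected master.
hosts = ["http://node-a:9200", "http://node-c:9200"]  # hypothetical addresses
for host in hosts:
    state = requests.get(
        f"{host}/_cluster/state/master_node,version?local=true"
    ).json()
    print(host, "sees master", state["master_node"], "at version", state["version"])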
Shard allocation
Shard assigned to a new node
1. The master rebalances shard allocation to have:
   a. the same average number of shards per node
   b. the same average number of shards per index per node, avoiding two copies of the same shard on the same node
2. It uses deciders to decide which shard goes where, based on:
   a. hot/warm setup (time-based indices)
   b. disk usage allocation (low watermark and high watermark)
   c. throttling (if the node is already recovering, the master may try again later)
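The disk-usage watermarks mentioned in 2b are dynamic cluster settings. A minimal sketch of adjusting them over the REST API, assuming a local node; the percentages shown are simply the defaults, not a recommendation:

import requests

requests.put("http://localhost:9200/_cluster/settings", json={
    "transient": {
        # Stop allocating new shards to a node above the low watermark.
        "cluster.routing.allocation.disk.watermark.low": "85%",
        # Start moving shards off a node above the high watermark.
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
})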
Shard initialization (Primary)
1. The master communicates a new shard assignment through the cluster state.
2. The node initializes an empty shard.
3. The node notifies the master.
4. The master marks the shard as started.
5. If this is the first shard with a specific id, it is marked as primary and receives requests.
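The INITIALIZING to STARTED transition described here can be watched through the cat shards API. A minimal sketch, assuming a local node:

import requests

# One line per shard copy: index, shard id, p/r (primary or replica),
# state (INITIALIZING, STARTED, RELOCATING, UNASSIGNED) and node.
print(requests.get("http://localhost:9200/_cat/shards?v").text)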
Shard initialization (Replica)
1. The master communicates a new shard assignment through the cluster state.
2. The node initializes recovery from the primary.
3. The node notifies the master.
4. The master marks the replica as started.
5. The node activates the replica.
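A replica recovering from its primary (step 2) can be followed with the cat recovery API. A minimal sketch, assuming a local node:

import requests

# Shows, per shard, the recovery type and stage, and how much of the
# segment files and translog operations has been transferred so far.
print(requests.get("http://localhost:9200/_cat/recovery?v").text)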
Shard recovery
[Diagram: anatomy of a shard — segments S1, S2, S3 and a commit point on disk, an in-memory indexing buffer, and the translog]
Recovery from primary
[Diagram: node with the primary on one side, node with the replica on the other]
1. The replica sends a Start Recovery request; the primary:
   a. validates the request
   b. prevents the translog from being deleted
   c. snapshots Lucene
2. The primary sends the segments to the replica.
3. The primary sends the translog to the replica.
4. The node holding the replica notifies the master.
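The two transfers above (segment files first, then the translog) show up as separate progress counters in the indices recovery API. A minimal sketch, assuming a local node and a hypothetical index called my-index; field names follow the 5.x response layout:

import requests

recovery = requests.get("http://localhost:9200/my-index/_recovery").json()
for shard in recovery.get("my-index", {}).get("shards", []):
    print("shard", shard["id"], shard["type"], shard["stage"],
          "files:", shard["index"]["files"]["percent"],
          "translog:", shard["translog"]["percent"])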
Thank you!
