Topology on Raft:
An Inside Look
Konstantin Osipov, Director of Engineering @ ScyllaDB
Konstantin Osipov
■ Seasoned database geek
■ Certified Buteyko breather
■ Muscovite and a father of three
Presentation Agenda
■ Raft recap
■ ScyllaDB path to consistency:
■ Schema
■ Topology
■ Manageability
Previous Episodes
Problem Overview
Strong vs Eventual Consistency
Strong consistency (Node 1, Node 2)
1. Write from client
2. Write propagated through cluster
3. Internal acknowledgement
4. Acknowledged to client
● requires a live majority
● always returns latest write
Eventual consistency (Node 1, Node 2)
1. Write from client
2. Acknowledged to client
3. Eventual write propagation
● highly available
● writes must commute
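The "requires a live majority" point can be made concrete with a little arithmetic. The following is a minimal sketch, not ScyllaDB code; the function names are hypothetical and chosen only for illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def can_commit(live_nodes: int, cluster_size: int) -> bool:
    """A strongly consistent write needs a live majority to acknowledge it."""
    return live_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes may be down while a majority is still reachable."""
    return cluster_size - majority(cluster_size)
```

Under this model a 3-node cluster keeps accepting strongly consistent writes with one node down, while a 4-node cluster with two nodes down does not.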
Data vs metadata
Metadata:
■ Schema information: table, view, type definitions
■ Topology information: nodes, tokens
■ Replicated everywhere
■ Not commutative
■ Changes rarely
Data:
■ Static and regular rows, counters
■ Partitioned
■ Commutative
■ Changes frequently
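To illustrate the commutative vs not commutative distinction, here is a small sketch. It assumes data writes are reconciled by timestamp (last write wins) and models metadata changes as column add/drop operations. All names are hypothetical and the model is deliberately simplified.

```python
from itertools import permutations
from typing import FrozenSet, Iterable, Optional, Tuple

Write = Tuple[int, str]      # (timestamp, value)
SchemaOp = Tuple[str, str]   # ("add" or "drop", column name)


def apply_writes(writes: Iterable[Write]) -> Optional[Write]:
    """Last-write-wins merge: the write with the highest timestamp survives."""
    cell: Optional[Write] = None
    for write in writes:
        cell = write if cell is None else max(cell, write)
    return cell


def apply_schema_ops(ops: Iterable[SchemaOp]) -> FrozenSet[str]:
    """Apply column add/drop operations in the order given."""
    columns = set()
    for kind, name in ops:
        if kind == "add":
            columns.add(name)
        else:
            columns.discard(name)
    return frozenset(columns)


def distinct_outcomes(apply, items):
    """Set of results over every possible ordering of the items."""
    return {apply(order) for order in permutations(items)}
```

In this sketch, every ordering of timestamped writes converges on the same cell, so replicas can apply them independently. Adding and dropping the same column gives different results depending on order, which is why metadata changes need a single agreed order.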
Consistency of Metadata
[Figure: ScyllaDB cluster with replication_factor=2]
Elements of the Raft State
[Figure: system tables that make up the Raft state, grouped under Topology, Schema, and Backward compatibility, each labeled 3.0, 5.2, or 6.0: keyspaces, columns, tables, tablets, topology, topology_requests, cdc_generations, auth, service_levels, peers, local, scylla_local]
The Centralized Topology Coordinator
■ Runs alongside Raft leader
■ Highly available
■ Drives the progress
■ Performs linearizable reads and writes of the topology
■ Request coordinators still use the local view on topology
■ No extra coordination when executing user requests
Linearizable topology changes
[Figure: operations shown: bootstrap, bootstrap, tablet migration, backup, repair]
+ Simplicity
+ Safety
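One way to picture linearizable topology changes is a single coordinator that applies queued requests one at a time, so every change sees the result of the previous one. The toy model below is only a sketch under that assumption. It does not model Raft, failover, or the real ScyllaDB data structures, and the class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class TopologyCoordinator:
    """Toy model: one coordinator applies queued topology requests in order."""
    version: int = 0
    pending: Deque[str] = field(default_factory=deque)
    history: List[Tuple[int, str]] = field(default_factory=list)

    def submit(self, request: str) -> None:
        """Queue a topology request; nothing is applied yet."""
        self.pending.append(request)

    def step(self) -> bool:
        """Apply the oldest pending request; return False if the queue is empty."""
        if not self.pending:
            return False
        request = self.pending.popleft()
        self.version += 1  # each applied change yields a new topology version
        self.history.append((self.version, request))
        return True

    def run(self) -> None:
        """Drain the queue, one request at a time."""
        while self.step():
            pass
```

Because only one request is in flight at any moment, two bootstraps submitted at the same time still end up with distinct, ordered topology versions.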
Automatic Coordinator Failover
Further improvements in schema changes
Dedicated commit log on shard 0
No need to FLUSH entire schema after changing it
10x less IO with large schemas!
[Figure: Node 1 and Node 2, each with nine shards and a schema commit log]
Linearizable schema version
No re-hash of the entire schema on change
10x less CPU with large schemas.
5.x: Hash-based schema version
6.x: TimeUUID-based schema version
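The CPU saving can be illustrated by comparing the two ways of producing a version. This is a rough sketch rather than ScyllaDB's actual algorithm: the choice of SHA-256, the dictionary schema model, and both function names are assumptions made for the example.

```python
import hashlib
import uuid
from typing import Dict


def hash_based_version(schema: Dict[str, str]) -> uuid.UUID:
    """Digest every table definition; the cost grows with the size of the schema."""
    digest = hashlib.sha256()
    for name in sorted(schema):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(schema[name].encode())
        digest.update(b"\0")
    # Truncate to 16 bytes so the result fits in a UUID.
    return uuid.UUID(bytes=digest.digest()[:16])


def timeuuid_based_version() -> uuid.UUID:
    """Mint a fresh time-based (version 1) UUID; the schema itself is not read."""
    return uuid.uuid1()
```

The hash-based variant has to walk the whole schema on every change, while the time-based variant does constant work regardless of how many tables exist.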
Authentication and service levels on Raft
ScyllaDB 5.x Manual:
Set the system_auth keyspace replication factor to the number of nodes in the datacenter.
For production environments use only NetworkTopologyStrategy.
ScyllaDB 6.x:
■ Automatically replicated on every node
■ Linearizable with CREATE/DROP
■ No denial of service if a node is down
Systems we moved
Features on Raft
[Figure: two nodes ask "Can I join?"; the reply is "ok"]
CDC generations on Raft
■ Quick & reliable propagation of CDC data at boot
■ The topology coordinator is responsible for changing the ring
■ Prerequisite for quick and concurrent boot
Automated cleanup
■ No need to run nodetool cleanup: it runs automatically after a topology operation
■ Automatic repair is planned with tablets
Previous guidance in the manual: "You should run nodetool cleanup whenever you scale out (expand) your cluster, and new nodes are added to the same DC."
UUID based host identification
■ Token metadata
■ Hints
Increased safety:
■ Removed nodes are banned from the cluster
■ Live nodes can’t be removed, only decommissioned
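The two safety rules can be sketched as a small membership model keyed by host UUID. This is an illustrative assumption about how such checks might look, not the ScyllaDB implementation. The `Membership` class and its methods are hypothetical.

```python
import uuid
from typing import Dict, Set


class Membership:
    """Toy cluster membership keyed by host ID (a UUID)."""

    def __init__(self) -> None:
        self.members: Dict[uuid.UUID, bool] = {}  # host ID -> is the node alive?
        self.banned: Set[uuid.UUID] = set()

    def join(self, host_id: uuid.UUID) -> bool:
        """Admit a node unless its host ID was removed earlier."""
        if host_id in self.banned:
            return False
        self.members[host_id] = True
        return True

    def mark_down(self, host_id: uuid.UUID) -> None:
        """Record that a member is no longer alive."""
        self.members[host_id] = False

    def remove(self, host_id: uuid.UUID) -> None:
        """Remove a dead member and ban its host ID; live nodes are refused."""
        if host_id not in self.members:
            raise KeyError(host_id)
        if self.members[host_id]:
            raise ValueError("node is live: decommission it instead")
        del self.members[host_id]
        self.banned.add(host_id)
```

Keying on a UUID rather than an address means a removed node cannot slip back in, in this model, simply by reusing its old identity.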
Fast and concurrent bootstrap
■ Bootstrap as many nodes as you want, simultaneously
■ New cluster assembly takes seconds, not minutes/hours
# DEPRECATED/IGNORED
skip_wait_for_gossip_to_settle: 30
Manageability improvements
New system table for Raft state
cqlsh> select * from system.raft_state;
group_id | disposition | server_id | can_vote
--------------------------------------+-------------+--------------------------------------+----------
7b818380-e9f8-11ed-9316-7c72c96b4bfa | CURRENT | c3b8f01d-e87f-487f-8e6c-e2c86f8b898b | True
New REST APIs
■ localhost:9000/storage_service/cleanup_all
■ localhost:9000/raft/trigger_snapshot/{group_id}
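As a small convenience, the two endpoint URLs can be assembled programmatically. The sketch below only builds the URLs. The `http://` scheme and the helper names are assumptions, and the HTTP method is not given on the slide, so no request is sent.

```python
import uuid
from urllib.parse import quote

# Host and port as shown on the slide; the scheme is an assumption.
BASE_URL = "http://localhost:9000"


def cleanup_all_url(base: str = BASE_URL) -> str:
    """URL of the cluster-wide cleanup endpoint."""
    return f"{base}/storage_service/cleanup_all"


def trigger_snapshot_url(group_id: uuid.UUID, base: str = BASE_URL) -> str:
    """URL of the snapshot trigger endpoint for one Raft group."""
    return f"{base}/raft/trigger_snapshot/{quote(str(group_id))}"
```

A `group_id` for the second helper can be taken from the `group_id` column of `system.raft_state`.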
Maintenance mode
./scylla --maintenance-mode=true --maintenance-socket=workdir
kostja@hulk:~/work/scylla/db$ cqlsh ./cql.m
Connected to at ./cql.m:9042
[cqlsh 6.2.0 | Scylla 5.5.0~dev-0.20240130.0cbf8f75f016 | CQL spec 3.3.1 |
Native protocol v4]
Use HELP for help.
cqlsh>
Enabling Raft
■ In 6.0 and up Raft is ALWAYS ON
# DEPRECATED/IGNORED
consistent_cluster_management: true
Stay in Touch
Konstantin Osipov
kostja@scylladb.com
@kostja_osipov
@kostja
https://www.linkedin.com/in/kostja/