Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constantly Try to Bring Scylla to its Knees

Cry in the dojo, laugh in the
battlefield: how we constantly
try to bring Scylla to its knees so
you don't have to.
QA Manager, Scylla
Roy Dahan

Roy Dahan
Roy has over 10 years of experience testing
large-scale distributed systems, with a focus on
storage/data systems, and managing small to large
teams responsible for all testing aspects using a
highly automated approach.

Our Goal
▪ Achieve the Highest Levels of System Stability & Availability
▪ Maintain Data Integrity
▪ Prevent Performance Degradation Over Time
▪ Increase User Confidence
All of the above, even when BAD THINGS happen on
“Production-like Environments”

How We Test Scylla
Scylla Testing:
▪ Unit: scylla-unittest
▪ Functional: dtest
▪ Compatibility: dtest, Driver Tests
▪ Integration: Janus-Graph Tests, Titan-test, Spark
▪ Scale / Performance: S-C-T
▪ Stress / Load: S-C-T, Cassandra Stress
▪ System / Longevity: S-C-T, Jepsen

Distributed Tests (dtest)
▪ Functional “Black Box” Tests
▪ Verify our Compatibility with Cassandra
▪ Enhanced & Extended to Catch Scylla Regressions
▪ Around 10% (208) of the Reported Issues on the Scylla Project
Reference a dtest (Detected/Reproduced by dtest)
▪ About 675 Tests Run Regularly as Part of the “Regression Suite”

Scylla-Cluster-Tests (SCT)
▪ Automation Library and Test Collection for Scylla & Cassandra
Clusters
▪ Supports Multiple Backends such as: AWS / GCE / OpenStack /
Libvirt
▪ Tests are Based on Chaos Engineering Principles:
o Build a Hypothesis around Steady State Behavior
o Vary Real-world Events
o Automate Experiments to Run Continuously
▪ Around 4% (105) of the Reported Issues on the Scylla Project
Reference an SCT Test (Detected/Reproduced by an SCT Test)

SCT Longevity Testing
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=6)
▪ Set of X Loader Nodes (X=2)
▪ Scylla Monitoring Server
[Diagram: two clients and the cluster of nodes]

Test Setup - Example on GCE:

The Test Flow:
▪ Client-Side Loaders Run Workloads
(a Set of Cassandra-Stress Loads: Write, Mixed, Counters, User Profiles)
▪ For X Hours / Days / Weeks
▪ A “Nemesis” from the Predefined List is Randomly Selected
o Some Nemeses Disrupt Nodes in the Cluster
o Some Run Standard Cluster Operations
Current Nemesis types:
StopStartService
StopWaitStartService
Drainer
Decommission
CorruptThenRepair
CorruptThenRebuild
NoCorruptRepair
Refresh
MajorCompaction
ModifyTableProperties
Enospc
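The flow above can be sketched as a small loop: pick a random disruption, run it, record how long it took, then wait for the interval. This is a minimal illustration, not the SCT implementation; `NemesisRunner`, its parameters, and the two placeholder disruptions are hypothetical names introduced for the example.

```python
import random
import time


class NemesisRunner:
    """Picks a random disruption at a fixed interval (hypothetical helper)."""

    def __init__(self, disruptions, interval_s, sleep=time.sleep, rng=None):
        self.disruptions = list(disruptions)  # callables, one per nemesis type
        self.interval_s = interval_s
        self.sleep = sleep                    # injectable so a dry run need not wait
        self.rng = rng or random.Random()
        self.history = []                     # (name, duration_s) per execution

    def run(self, rounds):
        for _ in range(rounds):
            disrupt = self.rng.choice(self.disruptions)
            start = time.monotonic()
            disrupt()
            self.history.append((disrupt.__name__, time.monotonic() - start))
            self.sleep(self.interval_s)


def stop_start_service():
    pass  # placeholder: a real nemesis would restart the database service here


def major_compaction():
    pass  # placeholder: a real nemesis would trigger a major compaction here


# Dry run: sleeping is replaced by a no-op and the RNG is seeded.
runner = NemesisRunner([stop_start_service, major_compaction],
                       interval_s=300, sleep=lambda seconds: None,
                       rng=random.Random(0))
runner.run(rounds=5)
print([name for name, _ in runner.history])
```

Recording the duration of every execution is what makes a per-nemesis summary and cross-build comparison possible later on.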

Test Fixture Example:
test_duration: 5760
stress_cmd:
["cassandra-stress write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..100000000 -log interval=5",
"cassandra-stress counter_write cl=QUORUM duration=5760m -schema 'replication(factor=3)
compaction(strategy=DateTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=1000
-pop seq=1..1000000",
"cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)'
cl=QUORUM duration=5760m -port jmx=6868 -mode cql3 native -rate threads=100"]
n_db_nodes: 6
n_loaders: 2
n_monitor_nodes: 1
nemesis_class_name: 'ChaosMonkey'
nemesis_interval: 5
failure_post_behavior: keep
space_node_threshold: 644245094
ip_ssh_connections: 'private'
experimental: 'true'
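A fixture like this is easy to get subtly wrong, for example when `test_duration` and the `duration=` option of a stress command drift apart. The sketch below checks that they agree. It assumes `test_duration` is expressed in minutes (5760 minutes is 4 days), uses a hand-copied, shortened subset of the fixture as a Python dict, and the helper names `stress_minutes` and `check_fixture` are hypothetical.

```python
import re

# Hand-copied, shortened subset of the fixture; the dict form is an assumption.
fixture = {
    "test_duration": 5760,  # assumed to be minutes
    "stress_cmd": [
        "cassandra-stress write cl=QUORUM duration=5760m -mode cql3 native",
        "cassandra-stress counter_write cl=QUORUM duration=5760m -mode cql3 native",
    ],
}

UNIT_TO_MINUTES = {"s": 1 / 60, "m": 1, "h": 60}


def stress_minutes(cmd):
    """Return the duration= value of a cassandra-stress command, in minutes."""
    match = re.search(r"\bduration=(\d+)([smh])\b", cmd)
    if match is None:
        raise ValueError("no duration= option in: " + cmd)
    return int(match.group(1)) * UNIT_TO_MINUTES[match.group(2)]


def check_fixture(cfg):
    """List the stress commands whose duration differs from test_duration."""
    return [cmd for cmd in cfg["stress_cmd"]
            if stress_minutes(cmd) != cfg["test_duration"]]


mismatches = check_fixture(fixture)
days = fixture["test_duration"] / (60 * 24)
print(mismatches, days)  # → [] 4.0
```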


Nemesis Code Examples:
import time

def disrupt_destroy_data_then_repair(self):
    self._set_current_disruption('CorruptThenRepair %s' % self.target_node)
    # Delete set of sstables from data directory
    self._destroy_data()
    # Try to save the node
    self.repair_nodetool_repair()

def disrupt_stop_wait_start_scylla_server(self, sleep_time=300):
    self._set_current_disruption('StopWaitStartService %s' % self.target_node)
    # Stop the service and wait until the node is reported down
    self.target_node.remoter.run('sudo systemctl stop scylla-server.service')
    self.target_node.wait_db_down()
    self.log.info("Sleep for %s seconds", sleep_time)
    time.sleep(sleep_time)
    # Start the service again and wait until the node is back up
    self.target_node.remoter.run('sudo systemctl start scylla-server.service')
    self.target_node.wait_db_up()

Test Verification & Analysis:
▪ Application Load (cassandra-stress) Doesn’t Stop
▪ Auto Detection of:
• Coredumps
• Errors
• Exceptions
• Operations failures (repair, add node, refresh, compaction, etc.)
▪ Auto Detection of Performance Degradations (unexpected lower throughput
/ higher latencies due to operations)
▪ Compare Nemesis Execution Durations Across Builds to Detect Possible
Regressions
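Comparing nemesis execution durations across builds can be sketched as below. This is only an illustration of the idea: the function name `slower_nemeses`, the 20% tolerance, and all the numbers are assumptions made for the example, not values from the talk.

```python
def slower_nemeses(baseline, candidate, tolerance=0.20):
    """Return {nemesis: relative slowdown} for nemeses slower than tolerance."""
    flagged = {}
    for name, base_avg in baseline.items():
        new_avg = candidate.get(name)
        if new_avg is None or base_avg <= 0:
            continue  # nothing to compare against
        change = (new_avg - base_avg) / base_avg
        if change > tolerance:
            flagged[name] = round(change, 2)
    return flagged


# Hypothetical average durations in seconds for two builds.
baseline = {"Decommission": 200.0, "Drainer": 50.0, "Refresh": 8.0}
candidate = {"Decommission": 210.0, "Drainer": 75.0, "Refresh": 7.5}

print(slower_nemeses(baseline, candidate))  # → {'Drainer': 0.5}
```

A relative threshold is used here so that short operations (a few seconds) and long ones (minutes) are judged on the same scale.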

Longevity monitoring example:
“Total Requests Served” (op/s) correlated with Nemesis executions.

“Requests Rate Served” (op/s per instance) correlated with Nemesis executions.

“CPU utilization” (% per instance) correlated with Nemesis executions.

Test Summary Output - Nemesis Execution:
50GB DataSet Test: (Nemesis every 5 minutes, 4 days)
----------------------------------------------
| Nemesis Type         | Count | Avg Time(s) |
----------------------------------------------
| CorruptThenRebuild   |   103 |       93.79 |
| Decommission         |   111 |      231.89 |
| Drainer              |   109 |       48.27 |
| CorruptThenRepair    |   113 |      285.71 |
| Refresh              |    95 |        7.72 |
| NoCorruptRepair      |    97 |      331.73 |
| StopStartService     |   133 |       26.92 |
| MajorCompaction      |   134 |       20.63 |
| ModifyTable          |   197 |        1.50 |
| Enospc               |   114 |       26.33 |
| StopWaitStartService |    98 |       66.30 |
----------------------------------------------
1TB DataSet Test: (Nemesis every 30 minutes, 6 days)
----------------------------------------------
| Nemesis Type         | Count | Avg Time(s) |
----------------------------------------------
| CorruptThenRebuild   |     2 |      732.50 |
| Decommission         |     7 |     2913.86 |
| Drainer              |     6 |      213.00 |
| CorruptThenRepair    |     5 |     4942.60 |
| Refresh              |     6 |       10.50 |
| NoCorruptRepair      |     3 |     2835.33 |
| StopStartService     |     2 |      195.00 |
| MajorCompaction      |     3 |      663.33 |
| ModifyTable          |     6 |        4.67 |
| Enospc               |     6 |      221.00 |
| StopWaitStartService |     6 |      492.17 |
----------------------------------------------
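A summary like the ones above can be produced from raw (nemesis type, duration) records with a simple group-and-average step. The sketch below is an assumption about how such a report could be built, not the SCT code; `summarize`, `render`, and the sample records are hypothetical.

```python
from collections import defaultdict


def summarize(executions):
    """Group (nemesis_type, duration_s) records into (count, average)."""
    durations = defaultdict(list)
    for name, seconds in executions:
        durations[name].append(seconds)
    return {name: (len(values), sum(values) / len(values))
            for name, values in durations.items()}


def render(summary):
    """Format the summary as a fixed-width text table."""
    rule = "-" * 46
    lines = [rule,
             "| {:<20} | {:>5} | {:>11} |".format(
                 "Nemesis Type", "Count", "Avg Time(s)"),
             rule]
    for name, (count, avg) in sorted(summary.items()):
        lines.append("| {:<20} | {:>5} | {:>11.2f} |".format(name, count, avg))
    lines.append(rule)
    return "\n".join(lines)


# Hypothetical raw records; real data would come from the test statistics.
records = [("Refresh", 7.0), ("Refresh", 9.0), ("Drainer", 48.0)]
summary = summarize(records)
print(render(summary))
```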

Nemesis Execution Analysis:
Auto-analysis and reports based on test
statistics stored automatically in ElasticSearch

Example of Issue detected by Longevity

Example of Nemesis Added due to Issue

import random
import string

def disrupt_modify_table_comment(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Set a random 24-letter comment on the table
    comment = ''.join(random.choice(string.ascii_letters) for _ in range(24))
    cmd = "ALTER TABLE keyspace1.standard1 with comment = '{}';".format(comment)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                 verbose=True)

def disrupt_modify_table_gc_grace_time(self):
    self._set_current_disruption('ModifyTableProperties %s' % self.target_node)
    # Pick a random gc_grace_seconds value in [216000, 864000)
    gc_grace_seconds = random.randrange(216000, 864000)
    # Parentheses join both string parts into one statement before formatting
    cmd = ("ALTER TABLE keyspace1.standard1 with comment = 'gc_grace_seconds changed' AND"
           " gc_grace_seconds = {};").format(gc_grace_seconds)
    self.target_node.remoter.run('cqlsh -e "{}" {}'.format(cmd, self.target_node.private_ip_address),
                                 verbose=True)

Multi DC Longevity - The plot thickens
Test Setup (Our Defaults):
▪ Cluster of N Scylla DB nodes (N=15)
▪ Across M “Data Centers” (M=3)
▪ Set of X Loader Nodes (X=3)
▪ Scylla Monitoring Server
▪ Set of Cassandra-Stress Commands Running on the Loaders (Write, Mixed, Counters, User Profiles)
The tc utility is used to impose random network delays, packet drops, and packet reordering between Data Centers.
[Diagram: DC1, DC2, and DC3, each with a client]
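One way to impose such impairments with tc is its netem queueing discipline. The sketch below only builds the command string and does not execute anything. The device name, the value ranges, and the helper `netem_command` are assumptions for illustration; the talk does not say which tc options are used, and a rule on the root qdisc affects all egress traffic on the device, so limiting it to inter-DC traffic would need additional filtering.

```python
import random


def netem_command(device, rng):
    """Build a `tc qdisc ... netem` command with randomized impairments."""
    delay_ms = rng.randint(20, 200)        # illustrative ranges, not from the talk
    loss_pct = rng.choice([0.1, 0.5, 1])
    reorder_pct = rng.choice([5, 10, 25])  # netem reordering requires a delay
    return ("tc qdisc add dev {dev} root netem "
            "delay {delay}ms loss {loss}% reorder {reorder}%").format(
                dev=device, delay=delay_ms, loss=loss_pct, reorder=reorder_pct)


# "eth0" is a placeholder device name.
cmd = netem_command("eth0", random.Random(1))
print(cmd)
```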

Performance Regression
▪ Set of Predefined Workloads & Setups
○ Write
○ Read
○ Mixed
○ Customer Workloads
▪ Storing Results (Op/s, Throughput, Latency) in ElasticSearch
▪ Master Daily Regression Suite - Automatically Compare Results
with a Previous Build & “Best” Build
▪ Release Regression Suite - Automatically Compare Results with
Previous Releases (including RCs)
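The comparison against a previous build and a “best” build can be sketched as follows. Storage in ElasticSearch is left out; the metric names, the 5% threshold, the sample numbers, and the function `find_regressions` are all hypothetical and only illustrate the comparison step.

```python
def find_regressions(current, previous, best, threshold=0.05):
    """Compare one build's metrics with the previous and the best build."""
    # For op rate higher is better; for latency lower is better.
    higher_is_better = {"op_rate": True, "latency_99th_ms": False}
    report = []
    for ref_name, ref in (("previous", previous), ("best", best)):
        for metric, higher_better in higher_is_better.items():
            change = (current[metric] - ref[metric]) / ref[metric]
            worse = -change if higher_better else change
            if worse > threshold:
                report.append((ref_name, metric, round(change * 100, 1)))
    return report


# Hypothetical results for three builds of the same write workload.
current = {"op_rate": 90000.0, "latency_99th_ms": 4.0}
previous = {"op_rate": 92000.0, "latency_99th_ms": 3.9}
best = {"op_rate": 100000.0, "latency_99th_ms": 3.5}

print(find_regressions(current, previous, best))
# → [('best', 'op_rate', -10.0), ('best', 'latency_99th_ms', 14.3)]
```

Comparing against the best build as well as the previous one catches slow drift, where each build is only slightly worse than the one before it.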

Test-Write - Total Op rate (op/s) by Release:

Test-Write - 99th Percentile Latency (ms) by Release:

Large Scale Tests
▪ Clusters of 100s of Nodes
▪ DataSets of 10s of TB
▪ Multi-Core Scylla Nodes
▪ Many sstables
Sample of a 101-node Scylla cluster running on AWS.

On QA Roadmap
Longevity:
▪ Embed CharybdeFS (fault injection FS) in Longevity
▪ Extend workload types
▪ Two+ Nemesis in Parallel
▪ Adding more “Sudden Death” Types of Nemesis
▪ Enable “sstables integrity checker”
Load & Scale
▪ XXL Cluster Sizes (1000+ nodes)
▪ Enhance Load Testing to More Server Dimensions (Network, Disk)

On QA Roadmap
Performance:
▪ Add more “Real World Workloads” to Daily Regressions
▪ Performance Impact Per Operation (e.g. repair, majorCompaction)
▪ Collecting Latency Histograms for Various Load Types
3rd Party Integration:
▪ Spark & Titan Integration Suites
▪ Java & Golang Driver Integration Suites
Tools & Infrastructure:
▪ Enhance auto analysis based on Statistics in ElasticSearch
▪ Running SCT using an Existing Env

THANK YOU
Roy@scylladb.com
Please stay in touch
Any questions?
