The Right Compaction
Strategy Can Boost
Your Performance
Raphael Carvalho, Software Engineer at ScyllaDB
Raphael Carvalho
■ Syslinux (bootloader)
■ OSv (unikernel)
■ Seastar (ScyllaDB’s heart)
■ ScyllaDB (the best db in the world)
■ LSM tree
■ What is Compaction? Why is it Needed?
■ Read, Write and Space Amplification
■ Different Compaction Strategies
■ When to Use Each One
■ SAG in ICS
■ Time Window Compaction Strategy (TWCS) for Time Series
Presentation Agenda
What is LSM-tree Compaction?
LSM storage engine’s write path:
(diagram: incoming Writes go to the commit log and to memory, then are flushed to sorted sstables on disk, which are merged by compaction)
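The write path above can be sketched with a toy model (a hypothetical illustration, not ScyllaDB's actual classes): writes land in an in-memory memtable and, once it fills up, are flushed to an immutable, sorted sstable on disk; reads consult the memtable first, then sstables newest-first.

```python
import bisect

class ToyLSM:
    """Minimal sketch of an LSM write path: memtable -> flushed sorted runs."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}       # in-memory, mutable
        self.sstables = []       # on-disk, immutable sorted runs (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # A real engine also appends to the commit log here, for durability.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flushing produces an immutable, sorted sstable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then sstables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            i = bisect.bisect_left(sstable, (key,))
            if i < len(sstable) and sstable[i][0] == key:
                return sstable[i][1]
        return None
```

Note that without compaction, the list of sstables only grows, which is exactly the problem the rest of the talk addresses.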
What is Compaction? (cont.)
■ This technique of keeping sorted files and merging them is well-known
and often called Log-Structured Merge (LSM) Tree
■ Published in 1996, earliest popular application that I know of is the
Lucene search engine, 1999
■ High write performance.
■ Immediately readable.
■ Reasonable read performance.
Compaction Strategy
(a.k.a. File picking policy)
■ Which files to Compact, and When?
■ This is called the Compaction Strategy
■ The Goal of the Strategy is Low Amplification:
■ Avoid read requests needing many sstables
■ Read Amplification
■ Avoid overwritten/deleted/expired data staying on disk
■ Avoid excessive temporary disk space needs (scary!)
■ Space Amplification
■ Avoid compacting the same data again and again
■ Write Amplification
Which compaction
strategy shall I
choose?
Read, Write and Space Amplification
Make a choice!
■ This kind of trade-off is well known in distributed databases, e.g. the CAP theorem
■ The RUM Conjecture states:
■ We cannot design an access method for a storage system that is
optimal in all the following three aspects - Reads, Updates, and
Memory.
■ Impossible to decrease Read, Write & Space Amplification, all at once
■ A strategy can e.g. optimize for Write, while sacrificing Read & Space
■ Whereas another can optimize for Space and Read, while sacrificing
Writes
(diagram: RUM triangle with READ, WRITE, and SPACE corners; Tiered strategies (STCS, ICS) and Leveled occupy different regions)
Compaction Strategies History
Cassandra and ScyllaDB
■ Starts with Size Tiered Compaction Strategy
■ Efficient Write performance
■ Inconsistent Read performance
■ Substantial waste of disk space = bad space amplification (due
to slow GC)
■ To fix Read / Space issues in Tiered Compaction, Leveled
Compaction is introduced
■ Fixes Read & Space issues
■ BUT it introduces a new problem - Write Amplification
Strategy #1: Size-Tiered Compaction
■ Cassandra’s oldest and still default Compaction Strategy
■ Dates back to Google’s BigTable paper (2006)
■ Idea used even earlier (e.g., Lucene, 1999)
Size-Tiered Compaction Strategy
Compact N similar-sized files together, with the result placed into the next tier.
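The "N similar-sized files" rule can be sketched as a bucketing function, a simplified illustration of STCS's file-picking (the thresholds are hypothetical defaults, loosely modeled on Cassandra's bucket_low/bucket_high/min_threshold options):

```python
def size_tiered_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group sstables into buckets of similar size; return buckets ready to compact.

    An sstable joins a bucket if its size falls within [bucket_low, bucket_high]
    of the bucket's average size -- a simplified version of STCS's rule.
    """
    buckets = []  # each entry: [average_size, [member sizes...]]
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = bucket[0]
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket[1].append(size)
                bucket[0] = sum(bucket[1]) / len(bucket[1])  # update the average
                break
        else:
            buckets.append([size, [size]])
    # Only buckets with enough similar-sized sstables are worth compacting.
    return [b[1] for b in buckets if len(b[1]) >= min_threshold]
```

For example, `size_tiered_buckets([10, 11, 9, 10, 100])` picks the four ~10 MB sstables as one compaction candidate and leaves the 100 MB one alone.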
Size-Tiered Compaction - Amplification
■ Write Amplification: O(logN)
■ Where “N” is (data size) / (flushed sstable size).
■ Most data is in the highest tier, and it had to pass through O(logN) tiers to get there
■ This is asymptotically optimal
Size-Tiered Compaction - Amplification
Read Amplification: O(logN) sstables, but:
■ If workload writes a partition once and never modifies it:
■ Eventually each partition’s data will be compacted into one sstable
■ In-memory bloom filter will usually allow reading only one sstable
■ Optimal
■ But if workload continues to update a partition:
■ All sstables will contain updates to the same partition
■ O(logN) reads per read request
■ Reasonable, but not great
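The bloom-filter shortcut mentioned above can be illustrated with a toy filter (a sketch, not ScyllaDB's implementation): each sstable keeps a small bit array, and a read first consults it to skip sstables that definitely do not contain the partition.

```python
import hashlib

class ToyBloomFilter:
    """Tiny Bloom filter: may answer 'maybe present', but never misses a real member."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = 0
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def sstables_to_read(filters, key):
    """A read over many sstables only touches those whose filter says 'maybe'."""
    return [i for i, f in enumerate(filters) if f.might_contain(key)]
```

This is why a write-once workload usually reads only one sstable even when many exist on disk.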
Size-Tiered Compaction - Amplification
■ Space Amplification: compaction temporarily needs both its input and output on disk, so peak usage can reach about twice the data size (see Example 1)
Strategy #2: Leveled Compaction
■ Introduced in Cassandra 1.0, in 2011
■ Based on Google’s LevelDB (itself based on Google’s BigTable)
■ No longer has size-tiered's huge sstables
■ Instead have runs:
■ A run is a collection of small (160 MB by default) SSTables
■ Have non-overlapping key ranges
■ A huge SSTable must be rewritten as a whole, but in a run we can modify only parts
of it (individual sstables) while keeping the disjoint key requirement
Leveled Compaction Strategy
(diagram: Level 0; Level 1, a run of 10 sstables; Level 2, a run of 100 sstables; ...)
■ Level 0 is compacted into all sstables of Level 1, due to key range overlapping
■ The output is placed into Level 1, which may have exceeded its capacity… we may need to compact Level 1 into Level 2
■ One exceeding sstable is picked from Level 1 and compacted with the overlapping sstables in Level 2 (about ~10, due to the default fan-out of 10)
■ The input is removed from Level 1 and the output placed into Level 2, without breaking key disjointness in Level 2
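The picking step can be sketched as a small helper (key ranges are illustrative): given one sstable from Level 1, select the Level 2 sstables whose key ranges overlap it. Because ranges within a level are disjoint, roughly one fan-out's worth of sstables overlap.

```python
def overlapping(candidate, next_level):
    """Return the sstables in the next level whose key range overlaps the candidate.

    Each sstable is modeled as an inclusive (first_key, last_key) range; within
    a level these ranges are disjoint, so only a contiguous slice can overlap.
    """
    lo, hi = candidate
    return [(a, b) for (a, b) in next_level if not (b < lo or a > hi)]
```

With a fan-out of 10, a Level 1 sstable spanning keys 12..25 typically overlaps only a handful of the Level 2 sstables, e.g. `overlapping((12, 25), [(0, 9), (10, 19), (20, 29), (30, 39)])` selects just two of them.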
Leveled Compaction - Amplification
■ Space Amplification:
■ Because of the exponential level sizing, 90% of the data is in the deepest level (if full!)
■ These sstables do not overlap, so it can’t have duplicate data!
■ So at most, 10% of the space is wasted
■ Also, each compaction needs a constant (~12*160MB) temporary space
■ Nearly optimal
Leveled Compaction - Amplification
■ Read Amplification:
■ We have O(N) tables!
■ But in each level sstables have disjoint ranges (cached in memory)
■ Worst-case, O(logN) sstables relevant to a partition - plus L0 size.
■ Under some assumptions (updates of complete rows, of similar sizes),
the space-amplification argument implies: 90% of the reads will need just one sstable!
■ Nearly optimal
Leveled Compaction - Amplification
■ Write Amplification: data can be rewritten roughly fan-out (~10) times per level, significantly higher than Size-Tiered (see Example 1)
Example 1 - Write-Heavy Workload
■ Size-tiered compaction:
At some points needs twice the disk space
■ In ScyllaDB with many shards, “usually” maximum space
use is not concurrent
■ Leveled compaction:
More than double the amount of disk I/O
■ Test used smaller-than-default sstables (10 MB) to
illustrate the problem
■ Same problem occurs with the default sstable size (160 MB) on
larger workloads
Example 1 (Space Amplification)
(chart: LCS temporary space stays a constant multiple of the flushed memtable & sstable size, while STCS shows x2 space amplification peaks)
Example 2 - Overwrite Workload
■ Write 15 times the same 4 million partitions
■ cassandra-stress write n=4000000 -pop seq=1..4000000 -schema
"replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)"
■ In this test cassandra-stress is not rate limited
■ Again, small (10MB) LCS tables
■ Necessary amount of sstable data: 1.2 GB
■ STCS space amplification: x7.7 !
■ LCS space amplification lower, constant multiple of sstable size
■ Incremental will be around x2 (if it decides to compact fewer files)
Example 2 (Space Amplification)
(chart: STCS disk usage peaks at x7.7 space amplification)
Strategy #3: Incremental Compaction
■ Size-tiered Compaction needs temporary space because we only remove
a huge SSTable after we fully compact it.
■ Let’s split each huge sstable into a run (a la LCS) of “fragments”:
■ Treat the entire run (not individual SSTables) as a file for STCS
■ Remove individual sstables as compacted. Low temporary space.
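The fragment idea can be sketched as a merge that seals output fragments as it goes and releases each input fragment the moment it is fully consumed (a toy model with integer keys; real ICS works on sstable runs):

```python
import heapq

def merge_runs_incrementally(runs, fragment_size=3):
    """Merge sorted runs (each a list of sorted key-fragments) into one new run,
    sealing an output fragment every `fragment_size` keys and releasing
    (deleting) each input fragment as soon as it is fully consumed.
    """
    released = []   # input fragments freed before the whole merge finishes
    iters = []
    for run_id, run in enumerate(runs):
        flat = [(key, run_id, frag_id)
                for frag_id, frag in enumerate(run) for key in frag]
        iters.append(iter(flat))
    remaining = {(rid, fid): len(frag)
                 for rid, run in enumerate(runs) for fid, frag in enumerate(run)}
    out_run, current = [], []
    for key, rid, fid in heapq.merge(*iters):
        current.append(key)
        remaining[(rid, fid)] -= 1
        if remaining[(rid, fid)] == 0:
            released.append((rid, fid))   # this fragment can be deleted now
        if len(current) == fragment_size:
            out_run.append(current)
            current = []
    if current:
        out_run.append(current)
    return out_run, released
```

Because inputs are released mid-merge, the temporary space held at any moment is a few fragments, not the whole input data set.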
Incremental Compaction - Amplification
■ Space Amplification:
■ Small constant temporary space needs, even smaller than LCS
(M*S per parallel compaction, e.g., M=4, S=160 MB)
■ Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold
■ Optimal.
■ Write Amplification:
■ O(logN), small constant — same as Size-Tiered compaction
■ Read Amplification:
■ Like Size-Tiered, at worst O(logN) if updating the same partitions
Example 1 - Size Tiered vs Incremental
(chart: Incremental compaction avoids Size-Tiered's disk-usage peaks)
Is it Enough?
■ The space overhead problem was effectively fixed in Incremental (ICS), however…
■ Incremental (ICS) and Size-Tiered (STCS) strategies share the same space
amplification (~2-4x) when facing overwrite-intensive workloads
■ As a result, they cover a similar region in the three-dimensional efficiency space (RUM
trade-offs):
(diagram: RUM triangle with READ, WRITE, and SPACE corners; STCS and ICS occupy a similar region)
Space Amplification Goal (SAG)
■ Leveled and Size-Tiered (or ICS) cover different regions
■ Interesting regions cannot be reached with either strategy alone
■ But interesting regions can be reached by combining the data layout of both strategies
■ i.e. a Hybrid (Tiered+Leveled) approach:
(diagram: RUM triangle with READ, WRITE, and SPACE corners; the hybrid approach sits between STCS/ICS and LCS)
■ Space Amplification Goal (SAG) is a property controlling the size ratio of
the largest and the second-largest tier
■ It’s a value between 1 and 2 (defined in the table’s schema). A value of 1.5
triggers Cross-Tier Compaction when the second-largest tier reaches half the size of
the largest.
■ Effectively, it helps control Space Amplification. Not an upper bound,
but results show that compaction works towards reducing the
actual SA to below the configured value.
■ The lower the SAG value, the lower the SA but the higher the WA. A good
initial value is 1.5; decrease it conservatively from there.
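The trigger described above can be expressed as a simple predicate (an interpretation of the rule, not ScyllaDB's actual code): with SAG = 1.5, cross-tier compaction starts once the second-largest tier reaches half the size of the largest.

```python
def should_cross_tier_compact(largest, second_largest, sag=1.5):
    """Trigger cross-tier compaction when largest + second_largest would exceed
    sag * largest, i.e. when second_largest >= (sag - 1) * largest.

    Sizes are in any consistent unit (e.g. MB); sag is between 1 and 2.
    """
    return second_largest >= (sag - 1.0) * largest
```

A lower SAG makes this fire earlier (lower space amplification, more rewriting); a SAG close to 2 makes the layout behave like plain tiered compaction.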
Further on ICS + SAG
Schema for ICS’ SAG:
ALTER TABLE foo.bar
WITH compaction = {
  'class': 'IncrementalCompactionStrategy',
  'space_amplification_goal': '1.5'
};
ICS with Space Amplification Goal (SAG)
■ Accumulation of tombstone records is a known problem in LSM trees
■ Makes queries slower
■ Read amplification
■ More CPU work (to keep only the latest version)
■ ICS employs a SAG-like mechanism, but focused on expired data rather than space
■ Enabled by default; can be controlled with the usual parameters
■ tombstone_compaction_interval (defaults to 864000 seconds, i.e. 10 days)
■ tombstone_threshold (defaults to 0.2)
ICS has more efficient GC
Strategy #4: Time Window (TWCS)
■ Designed for time series workloads
■ Groups data of similar age together
■ Helps with:
■ Garbage collecting expired data
■ as data with similar age will be expired roughly at the same time
■ Read performance
■ Queries over a time range will find the data in a small number of files
■ Common anti-patterns:
■ Not having every cell TTL’d (recommendation is to use default_time_to_live)
■ Deletions and overwrites (not well supported; a major compaction is usually needed afterwards)
■ Keep the number of windows to a small constant. Recommendation: 20.
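The grouping can be sketched as bucketing sstables by a time window derived from their maximum write timestamp (window size and timestamps are illustrative); only sstables in the same window are compacted together, so a whole window's data can expire, and be dropped, at once.

```python
from collections import defaultdict

def group_by_time_window(sstables, window_seconds=86400):
    """Bucket sstables by time window; TWCS only compacts within a window,
    so a whole window's worth of TTL'd data can later be dropped together.

    sstables: list of (name, max_write_timestamp_seconds) pairs.
    Returns {window_index: [sstable names]}.
    """
    windows = defaultdict(list)
    for name, timestamp in sstables:
        windows[timestamp // window_seconds].append(name)
    return dict(windows)
```

With a 1-day window, sstables written on the same day land in the same bucket and never get merged with older or newer data.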
Compaction Strategies Summary
Workload                      | Size-Tiered                  | Leveled                           | Incremental                         | Time-Window
Write-only                    | 2x peak space                | 2x writes                         | Best                                | -
Overwrite                     | Huge peak space              | write amplification               | SAG helps                           | -
Read-mostly, few updates      | read amplification           | Best                              | read amplification                  | -
Read-mostly, a lot of updates | read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
Time series                   | write, read, and space ampl. | write and space amplification     | write and read amplification        | Best
Stay in Touch
Raphael S. Carvalho
raphaelsc@scylladb.com
@raphael_scarv
@raphaelsc
https://www.linkedin.com/in/raphaelscarvalho/
Multimodal Retrieval Augmented Generation (RAG) with Milvus
Zilliz
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
ThousandEyes
 

Recently uploaded (20)

Supplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdfSupplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdf
 
Churchgate Call Girls 👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
Churchgate Call Girls  👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...Churchgate Call Girls  👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
Churchgate Call Girls 👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
 
The "Zen" of Python Exemplars - OTel Community Day
The "Zen" of Python Exemplars - OTel Community DayThe "Zen" of Python Exemplars - OTel Community Day
The "Zen" of Python Exemplars - OTel Community Day
 
Product Listing Optimization Presentation - Gay De La Cruz.pdf
Product Listing Optimization Presentation - Gay De La Cruz.pdfProduct Listing Optimization Presentation - Gay De La Cruz.pdf
Product Listing Optimization Presentation - Gay De La Cruz.pdf
 
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdfSummer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
 
ASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLCASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLC
 
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
 
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My IdentityCNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
 
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
 
Metadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - DatastratoMetadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - Datastrato
 
Database Management Myths for Developers
Database Management Myths for DevelopersDatabase Management Myths for Developers
Database Management Myths for Developers
 
Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024
 
intra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_Enintra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_En
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
 
Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0
 
Multimodal Retrieval Augmented Generation (RAG) with Milvus
Multimodal Retrieval Augmented Generation (RAG) with MilvusMultimodal Retrieval Augmented Generation (RAG) with Milvus
Multimodal Retrieval Augmented Generation (RAG) with Milvus
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
 

Balancing Compaction Principles and Practices

  • 1. The Right Compaction Strategy Can Boost Your Performance Raphael Carvalho, Software Engineer at ScyllaDB
  • 2. Raphael Carvalho ■ Syslinux (bootloader) ■ OSv (unikernel) ■ Seastar (ScyllaDB’s heart) ■ ScyllaDB (the best db in the world)
  • 3. ■ LSM tree ■ What is Compaction? Why is it Needed? ■ Read, Write and Space Amplification ■ Different Compaction Strategies ■ When to Use Each One ■ SAG in ICS ■ Time Series Compaction Strategy (TWCS) Presentation Agenda
  • 4. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log
  • 5. What is LSM-tree Compaction? LSM storage engine’s write path: commit log Writes
  • 6. What is LSM-tree Compaction? LSM storage engine’s write path: commit log Writes
  • 7. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log compaction
  • 8. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log compaction
  • 9. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log
  • 10. What is Compaction? (cont.) ■ This technique of keeping sorted files and merging them is well known and often called a Log-Structured Merge (LSM) Tree ■ Published in 1996; the earliest popular application I know of is the Lucene search engine, 1999 ■ High write performance ■ Immediately readable ■ Reasonable read performance
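The write path on the previous slides can be reduced to a toy model: memtables flush into immutable sorted runs (sstables), and compaction merges runs with last-write-wins semantics. This is an illustrative Python sketch, not ScyllaDB's actual Seastar/C++ code; `flush` and `compact` are hypothetical names.

```python
def flush(memtable):
    """Flush an in-memory write buffer into an immutable sorted run (sstable)."""
    return sorted(memtable.items())

def compact(*runs):
    """Merge sorted runs; for duplicate keys, the later (newer) run wins."""
    merged = {}
    for run in runs:              # later runs overwrite earlier entries
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())

# Two flushed memtables with an overlapping key "c"
run1 = flush({"a": 1, "c": 3})
run2 = flush({"b": 2, "c": 30})   # newer write to "c"
print(compact(run1, run2))        # [('a', 1), ('b', 2), ('c', 30)]
```

Note how the duplicate entry for `"c"` disappears after the merge: that reclaimed space is exactly what compaction buys you.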
  • 11. Compaction Strategy (a.k.a. File picking policy) ■ Which files to Compact, and When? ■ This is called the Compaction Strategy ■ The Goal of the Strategy is Low Amplification: ■ Avoid read requests needing many sstables ■ Read Amplification ■ Avoid overwritten/deleted/expired data staying on disk ■ Avoid excessive temporary disk space needs (scary!) ■ Space Amplification ■ Avoid compacting the same data again and again ■ Write Amplification Which compaction strategy shall I choose?
  • 12. Read, Write and Space Amplification Make a choice! ■ This kind of trade-off is well known in distributed databases (think of CAP) ■ The RUM Conjecture states: ■ We cannot design an access method for a storage system that is optimal in all of the following three aspects - Reads, Updates, and Memory ■ It is impossible to decrease Read, Write & Space Amplification all at once ■ One strategy can, e.g., optimize for Write while sacrificing Read & Space ■ Another can optimize for Space and Read while sacrificing Writes
  • 13. Read, Write and Space Amplification Make a choice! [RUM triangle diagram: READ, WRITE and SPACE at the corners; TIERED (STCS, ICS) and LEVELED occupy different regions of the triangle]
  • 14. Compaction Strategies History Cassandra and ScyllaDB ■ Starts with Size Tiered Compaction Strategy ■ Efficient Write performance ■ Inconsistent Read performance ■ Substantial waste of disk space = bad space amplification (due to slow GC) ■ To fix Read / Space issues in Tiered Compaction, Leveled Compaction is introduced ■ Fixes Read & Space issues ■ BUT it introduces a new problem - Write Amplification
  • 15. Strategy #1: Size-Tiered Compaction ■ Cassandra’s oldest and still default Compaction Strategy ■ Dates back to Google’s BigTable paper (2006) ■ Idea used even earlier (e.g., Lucene, 1999)
  • 17. Size-Tiered Compaction Strategy Compact N similar-sized files together, with result being placed into next tier
  • 18. Size-Tiered Compaction Strategy Compacted N similar-sized files together, with result placed into next tier OUTPUT
  • 19. Size-Tiered Compaction Strategy Compacting N similar-sized files together, with result placed into next tier
  • 20. Size-Tiered Compaction Strategy Compacted N similar-sized files together, with result placed into next tier OUTPUT
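The "compact N similar-sized files" policy shown above can be sketched as a bucketing function. The option names (`bucket_low`, `bucket_high`, `min_threshold`) follow Cassandra's STCS options, but this is a simplified illustration, not the real picking code.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group sstables whose size falls within [avg*bucket_low, avg*bucket_high]
    of a bucket's running average -- STCS's notion of 'similar-sized'."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Only buckets with at least min_threshold files become compaction candidates
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~10 MB sstables qualify; the two ~100 MB ones are too few to compact yet
print(size_tiered_buckets([10, 11, 9, 10, 100, 110]))  # [[9, 10, 10, 11]]
```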
  • 21. Size-Tiered Compaction - Amplification ■ Write Amplification: O(logN) ■ Where “N” is (data size) / (flushed sstable size). ■ Most data is in highest tier - needed to pass through O(logN) tiers ■ This is asymptotically optimal
  • 22. Size-Tiered Compaction - Amplification What is Read Amplification? O(logN) sstables, but: ■ If workload writes a partition once and never modifies it: ■ Eventually each partition’s data will be compacted into one sstable ■ In-memory bloom filter will usually allow reading only one sstable ■ Optimal ■ But if workload continues to update a partition: ■ All sstables will contain updates to the same partition ■ O(logN) reads per read request ■ Reasonable, but not great
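The "in-memory bloom filter will usually allow reading only one sstable" point works because a Bloom filter can say "definitely not here" cheaply. A tiny sketch of the idea (illustrative only; real sstable filters are sized per-table and use faster hashes):

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: lets a read skip sstables that definitely do not
    contain the partition key (false positives possible, false negatives not)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

bf = Bloom()
bf.add("partition-42")
print(bf.might_contain("partition-42"))  # True -- must read this sstable
print(bf.might_contain("partition-99"))  # almost certainly False -- skip it
```

With one such filter per sstable, a point read touches O(log N) filters in memory but usually only one sstable on disk.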
  • 23. Size-Tiered Compaction - Amplification ■ Space amplification
  • 24. Strategy #2: Leveled Compaction ■ Introduced in Cassandra 1.0, in 2011 ■ Based on Google’s LevelDB (itself based on Google’s BigTable) ■ No longer has size-tiered's huge sstables ■ Instead have runs: ■ A run is a collection of small (160 MB by default) SSTables ■ Have non-overlapping key ranges ■ A huge SSTable must be rewritten as a whole, but in a run we can modify only parts of it (individual sstables) while keeping the disjoint key requirement
  • 25. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ...
  • 26. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... COMPACTING LEVEL 0 INTO ALL SSTABLES FROM LEVEL 1, DUE TO KEY RANGE OVERLAPPING
  • 27. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... OUTPUT IS PLACED INTO LEVEL 1, WHICH MAY HAVE EXCEEDED ITS CAPACITY… MAY NEED TO COMPACT LEVEL 1 INTO 2
  • 28. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... PICKS ONE EXCEEDING SSTABLE FROM LEVEL 1 AND COMPACTS IT WITH THE OVERLAPPING SSTABLES IN LEVEL 2 (ABOUT ~10, DUE TO THE DEFAULT FAN-OUT OF 10)
  • 29. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... INPUT IS REMOVED FROM LEVEL 1 AND OUTPUT PLACED INTO LEVEL 2, WITHOUT BREAKING KEY DISJOINTNESS IN LEVEL 2
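The promotion step in the slides above hinges on finding which sstables in the next level overlap the candidate's key range. Since each level's sstables have disjoint ranges, that set is small. A minimal sketch (hypothetical `overlapping` helper, ranges modeled as `(first_key, last_key)` pairs):

```python
def overlapping(candidate, level):
    """Return sstables in `level` whose [first, last] key range overlaps the
    candidate's range -- the set LCS must rewrite when promoting the candidate."""
    lo, hi = candidate
    return [(a, b) for (a, b) in level if not (b < lo or a > hi)]

# Level 2 holds disjoint key ranges; promoting (25, 47) from level 1 touches two
level2 = [(0, 9), (10, 29), (30, 49), (50, 69)]
print(overlapping((25, 47), level2))  # [(10, 29), (30, 49)]
```

With the default fan-out of 10, a promoted sstable typically overlaps roughly 10 sstables in the next level, which is where LCS's write amplification comes from.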
  • 30. Leveled Compaction - Amplification ■ Space Amplification: ■ Because of sstable counts, 90% of the data is in the deepest level (if full!) ■ These sstables do not overlap, so it can’t have duplicate data! ■ So at most, 10% of the space is wasted ■ Also, each compaction needs a constant (~12*160MB) temporary space ■ Nearly optimal
  • 31. Leveled Compaction - Amplification ■ Read Amplification: ■ We have O(N) tables! ■ But in each level sstables have disjoint ranges (cached in memory) ■ Worst-case, O(logN) sstables relevant to a partition - plus L0 size. ■ Under some assumptions (update complete rows, of similar sizes) space amplification implies: 90% of the reads will need just one sstable! ■ Nearly optimal
  • 32. Leveled Compaction - Amplification ■ Write Amplification:
  • 33. Example 1 - Write-Heavy Workload ■ Size-tiered compaction: At some points needs twice the disk space ■ In ScyllaDB with many shards, “usually” maximum space use is not concurrent ■ Leveled compaction: More than double the amount of disk I/O ■ Test used smaller-than-default sstables (10 MB) to illustrate the problem ■ Same problem with default sstable size (160 MB) - with larger workloads
  • 34. Example 1 (Space Amplification) constant multiple of flushed memtable & sstable size x2 space amplification
  • 35. Example 2 - Overwrite Workload ■ Write 15 times the same 4 million partitions ■ cassandra-stress write n=4000000 -pop seq=1..4000000 -schema "replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)" ■ In this test cassandra-stress is not rate limited ■ Again, small (10MB) LCS tables ■ Necessary amount of sstable data: 1.2 GB ■ STCS space amplification: x7.7 ! ■ LCS space amplification lower, constant multiple of sstable size ■ Incremental will be around x2 (if it decides to compact fewer files)
  • 36. Example 2 (Space Amplification) x7.7 space amplification
  • 37. Strategy #3: Incremental Compaction ■ Size-tiered Compaction needs temporary space because we only remove a huge SSTable after we fully compact it. ■ Let’s split each huge sstable into a run (a la LCS) of “fragments”: ■ Treat the entire run (not individual SSTables) as a file for STCS ■ Remove individual sstables as compacted. Low temporary space.
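The temporary-space difference can be made concrete with a toy accounting model: STCS keeps all inputs on disk until the whole output is written (so usage peaks near 2x), while the incremental approach deletes each input fragment as soon as the corresponding output fragment is sealed. This is an illustrative model with a hypothetical `peak_space` function, not ScyllaDB code.

```python
def peak_space(fragment_sizes, incremental):
    """Peak disk usage (in MB) while compacting one run of fragments."""
    total = sum(fragment_sizes)
    used = peak = total                  # inputs are already on disk
    for size in fragment_sizes:
        used += size                     # seal one output fragment on disk
        peak = max(peak, used)
        if incremental:
            used -= size                 # ICS idea: drop the consumed input now
    if not incremental:
        used -= total                    # STCS: inputs removed only at the end
    return peak

sizes = [160] * 10                       # a 1.6 GB run in 160 MB fragments
print(peak_space(sizes, incremental=False))  # 3200 -> ~2x temporary space
print(peak_space(sizes, incremental=True))   # 1760 -> ~one extra fragment
```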
  • 38. Incremental Compaction - Amplification ■ Space Amplification: ■ Small constant temporary space needs, even smaller than LCS (M*S per parallel compaction, e.g., M=4, S=160 MB) ■ Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold ■ Optimal. ■ Write Amplification: ■ O(logN), small constant — same as Size-Tiered compaction ■ Read Amplification: ■ Like Size-Tiered, at worst O(logN) if updating the same partitions
  • 39. Example 1 - Size Tiered vs Incremental Incremental compaction
  • 40. Is it Enough? ■ Space overhead problem was efficiently fixed in Incremental (ICS), however… ■ Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x) when facing overwrite-intensive workloads, where: ■ They cover a similar region in the three-dimensional efficiency space (RUM trade-offs): READ WRITE SPACE STCS ICS
  • 41. Space Amplification Goal (SAG) ■ Leveled and Size-Tiered (or ICS) cover different regions ■ Interesting regions cannot be reached with either strategy ■ But interesting regions can be reached by combining the data layout of both strategies ■ i.e. a Hybrid (Tiered+Leveled) approach: READ WRITE SPACE STCS ICS LCS
  • 42. ■ Space Amplification Goal (SAG) is a property that controls the size ratio between the largest and the second-largest tier ■ It’s a value between 1 and 2 (defined in the table’s schema). A value of 1.5 implies cross-tier compaction when the second-largest tier is half the size of the largest. ■ Effectively, it helps control Space Amplification. Not an upper bound, but results show that compaction will work towards reducing the actual SA to below the configured value. ■ The lower the SAG value, the lower the SA but the higher the WA. A good initial value is 1.5; then decrease conservatively. Further on ICS + SAG
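One plausible reading of the trigger described above (with SAG = 1.5, cross-tier compaction fires once the second-largest tier reaches half the size of the largest) can be written as a one-line predicate. This is a hypothetical condition for illustration, not ScyllaDB's exact internal check.

```python
def sag_triggers(largest, second_largest, sag=1.5):
    """With sag = 1.5, compacting the two largest tiers together is triggered
    once the second-largest reaches (sag - 1) = half the size of the largest."""
    return second_largest >= (sag - 1) * largest

print(sag_triggers(1000, 400))  # False: second tier still under half
print(sag_triggers(1000, 500))  # True: cross-tier compaction would fire
```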
  • 43. ALTER TABLE foo.bar WITH compaction = { 'class': 'IncrementalCompactionStrategy', 'space_amplification_goal': '1.5' }; Schema for ICS’ SAG
  • 44. ICS with Space Amplification Goal (SAG)
  • 45. ■ Accumulation of tombstone records is a known problem in LSM trees ■ Makes queries slower ■ Read amplification ■ More CPU work (preserve latest) ■ Employs a SAG-like, but with focus on expired data, rather than space. ■ Enabled by default, can be controlled with usual params ■ tombstone_compaction_interval (defaults to 864000 (10 days)) ■ tombstone_threshold (defaults to 0.2) ICS has more efficient GC
  • 46. Strategy #4: Time Window (TWCS) ■ Designed for time series workloads ■ Groups data of similar age together ■ Helps with: ■ Garbage collecting expired data ■ as data of similar age will expire at roughly the same time ■ Read performance ■ Queries using a time range will find the data in a small number of files ■ Common anti-patterns: ■ Not having every cell TTL’d (recommendation is to use default_time_to_live) ■ Deletions, overwrites (not well supported; a major compaction is usually needed after) ■ Keep the number of windows to a small constant. Recommendation: 20.
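The "groups data of similar age" idea boils down to bucketing sstables by the time window their data falls into; expired windows can then be dropped wholesale. A minimal sketch (hypothetical `time_windows` helper; real TWCS has considerably more machinery):

```python
from collections import defaultdict

def time_windows(sstables, window_seconds=86_400):
    """Bucket sstables by the window of their max data timestamp, the way
    TWCS groups data of similar age (here: one-day windows)."""
    windows = defaultdict(list)
    for name, max_ts in sstables:
        windows[max_ts // window_seconds].append(name)
    return dict(windows)

day = 86_400
tables = [("a", 10), ("b", day + 5), ("c", day + 9_000), ("d", 2 * day + 1)]
print(time_windows(tables))  # {0: ['a'], 1: ['b', 'c'], 2: ['d']}
```

Once every cell in window 0 has passed its TTL, the whole window's sstables can be deleted without any compaction work at all.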
  • 47. Compaction Strategies Summary
  Workload                      | Size-Tiered                  | Leveled                           | Incremental                         | Time-Window
  Write-only                    | 2x peak space                | 2x writes                         | Best                                | -
  Overwrite                     | Huge peak space              | write amplification               | SAG helps                           | -
  Read-mostly, few updates      | read amplification           | Best                              | read amplification                  | -
  Read-mostly, a lot of updates | read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
  Time series                   | write, read, and space ampl. | write and space amplification     | write and read amplification        | Best
  • 48. Stay in Touch Raphael S. Carvalho raphaelsc@scylladb.com @raphael_scarv @raphaelsc https://www.linkedin.com/in/raphaelscarvalho /