The Right Compaction
Strategy Can Boost
Your Performance
Raphael Carvalho, Software Engineer at ScyllaDB
Raphael Carvalho
■ Syslinux (bootloader)
■ OSv (unikernel)
■ Seastar (ScyllaDB’s heart)
■ ScyllaDB (the best db in the world)
■ LSM tree
■ What is Compaction? Why is it Needed?
■ Read, Write and Space Amplification
■ Different Compaction Strategies
■ When to Use Each One
■ SAG in ICS
■ Time Window Compaction Strategy (TWCS) for Time Series
Presentation Agenda
What is LSM-tree Compaction?
LSM storage engine’s write path:
(diagram: incoming Writes go to the commit log and to memory, then are flushed to sorted sstables on disk, which are merged by compaction)
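The write path above can be sketched with a toy model (a hypothetical illustration, not ScyllaDB's actual classes): writes land in an in-memory memtable and, once it fills up, are flushed to an immutable, sorted sstable on disk; reads consult the memtable first, then sstables newest-first.

```python
import bisect

class ToyLSM:
    """Minimal sketch of an LSM write path: memtable -> flushed sorted runs."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}       # in-memory, mutable
        self.sstables = []       # on-disk, immutable sorted runs (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # A real engine also appends to the commit log here, for durability.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flushing produces an immutable, sorted sstable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then sstables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            i = bisect.bisect_left(sstable, (key,))
            if i < len(sstable) and sstable[i][0] == key:
                return sstable[i][1]
        return None
```

Note that without compaction, the list of sstables only grows, which is exactly the problem the rest of the talk addresses.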
What is Compaction? (cont.)
■ This technique of keeping sorted files and merging them is well-known
and often called Log-Structured Merge (LSM) Tree
■ Published in 1996, earliest popular application that I know of is the
Lucene search engine, 1999
■ High write performance.
■ Immediately readable.
■ Reasonable read performance.
Compaction Strategy
(a.k.a. File picking policy)
■ Which files to Compact, and When?
■ This is called the Compaction Strategy
■ The Goal of the Strategy is Low Amplification:
■ Avoid read requests needing many sstables
■ Read Amplification
■ Avoid overwritten/deleted/expired data staying on disk
■ Avoid excessive temporary disk space needs (scary!)
■ Space Amplification
■ Avoid compacting the same data again and again
■ Write Amplification
Which compaction
strategy shall I
choose?
Read, Write and Space Amplification
Make a choice!
■ This kind of trade-off is well known in distributed databases, e.g. the CAP theorem
■ The RUM Conjecture states:
■ We cannot design an access method for a storage system that is
optimal in all the following three aspects - Reads, Updates, and
Memory.
■ Impossible to decrease Read, Write & Space Amplification, all at once
■ A strategy can e.g. optimize for Write, while sacrificing Read & Space
■ Whereas another can optimize for Space and Read, while sacrificing
Writes
(diagram: RUM triangle with READ, WRITE, and SPACE corners; Tiered strategies (STCS, ICS) and Leveled occupy different regions)
Compaction Strategies History
Cassandra and ScyllaDB
■ Starts with Size Tiered Compaction Strategy
■ Efficient Write performance
■ Inconsistent Read performance
■ Substantial waste of disk space = bad space amplification (due
to slow GC)
■ To fix Read / Space issues in Tiered Compaction, Leveled
Compaction is introduced
■ Fixes Read & Space issues
■ BUT it introduces a new problem - Write Amplification
Strategy #1: Size-Tiered Compaction
■ Cassandra’s oldest and still default Compaction Strategy
■ Dates back to Google’s BigTable paper (2006)
■ Idea used even earlier (e.g., Lucene, 1999)
Size-Tiered Compaction Strategy
Compact N similar-sized files together, with the result placed into the next tier.
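The "N similar-sized files" rule can be sketched as a bucketing function, a simplified illustration of STCS's file-picking (the thresholds are hypothetical defaults, loosely modeled on Cassandra's bucket_low/bucket_high/min_threshold options):

```python
def size_tiered_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group sstables into buckets of similar size; return buckets ready to compact.

    An sstable joins a bucket if its size falls within [bucket_low, bucket_high]
    of the bucket's average size -- a simplified version of STCS's rule.
    """
    buckets = []  # each entry: [average_size, [member sizes...]]
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = bucket[0]
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket[1].append(size)
                bucket[0] = sum(bucket[1]) / len(bucket[1])  # update the average
                break
        else:
            buckets.append([size, [size]])
    # Only buckets with enough similar-sized sstables are worth compacting.
    return [b[1] for b in buckets if len(b[1]) >= min_threshold]
```

For example, `size_tiered_buckets([10, 11, 9, 10, 100])` picks the four ~10 MB sstables as one compaction candidate and leaves the 100 MB one alone.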
Size-Tiered Compaction - Amplification
■ Write Amplification: O(logN)
■ Where “N” is (data size) / (flushed sstable size).
■ Most data is in the highest tier, and it had to pass through O(logN) tiers to get there
■ This is asymptotically optimal
Size-Tiered Compaction - Amplification
Read Amplification: O(logN) sstables, but:
■ If workload writes a partition once and never modifies it:
■ Eventually each partition’s data will be compacted into one sstable
■ In-memory bloom filter will usually allow reading only one sstable
■ Optimal
■ But if workload continues to update a partition:
■ All sstables will contain updates to the same partition
■ O(logN) reads per read request
■ Reasonable, but not great
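The bloom-filter shortcut mentioned above can be illustrated with a toy filter (a sketch, not ScyllaDB's implementation): each sstable keeps a small bit array, and a read first consults it to skip sstables that definitely do not contain the partition.

```python
import hashlib

class ToyBloomFilter:
    """Tiny Bloom filter: may answer 'maybe present', but never misses a real member."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = 0
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def sstables_to_read(filters, key):
    """A read over many sstables only touches those whose filter says 'maybe'."""
    return [i for i, f in enumerate(filters) if f.might_contain(key)]
```

This is why a write-once workload usually reads only one sstable even when many exist on disk.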
Size-Tiered Compaction - Amplification
■ Space Amplification: compaction temporarily needs both its input and output on disk, so peak usage can reach about twice the data size (see Example 1)
Strategy #2: Leveled Compaction
■ Introduced in Cassandra 1.0, in 2011
■ Based on Google’s LevelDB (itself based on Google’s BigTable)
■ No longer has size-tiered's huge sstables
■ Instead have runs:
■ A run is a collection of small (160 MB by default) SSTables
■ Have non-overlapping key ranges
■ A huge SSTable must be rewritten as a whole, but in a run we can modify only parts
of it (individual sstables) while keeping the disjoint key requirement
Leveled Compaction Strategy
(diagram: Level 0; Level 1, a run of 10 sstables; Level 2, a run of 100 sstables; ...)
■ Level 0 is compacted into all sstables of Level 1, due to key range overlapping
■ The output is placed into Level 1, which may have exceeded its capacity… we may need to compact Level 1 into Level 2
■ One exceeding sstable is picked from Level 1 and compacted with the overlapping sstables in Level 2 (about ~10, due to the default fan-out of 10)
■ The input is removed from Level 1 and the output placed into Level 2, without breaking key disjointness in Level 2
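The picking step can be sketched as a small helper (key ranges are illustrative): given one sstable from Level 1, select the Level 2 sstables whose key ranges overlap it. Because ranges within a level are disjoint, roughly one fan-out's worth of sstables overlap.

```python
def overlapping(candidate, next_level):
    """Return the sstables in the next level whose key range overlaps the candidate.

    Each sstable is modeled as an inclusive (first_key, last_key) range; within
    a level these ranges are disjoint, so only a contiguous slice can overlap.
    """
    lo, hi = candidate
    return [(a, b) for (a, b) in next_level if not (b < lo or a > hi)]
```

With a fan-out of 10, a Level 1 sstable spanning keys 12..25 typically overlaps only a handful of the Level 2 sstables, e.g. `overlapping((12, 25), [(0, 9), (10, 19), (20, 29), (30, 39)])` selects just two of them.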
Leveled Compaction - Amplification
■ Space Amplification:
■ Because of the exponential level sizing, 90% of the data is in the deepest level (if full!)
■ These sstables do not overlap, so it can’t have duplicate data!
■ So at most, 10% of the space is wasted
■ Also, each compaction needs a constant (~12*160MB) temporary space
■ Nearly optimal
Leveled Compaction - Amplification
■ Read Amplification:
■ We have O(N) tables!
■ But in each level sstables have disjoint ranges (cached in memory)
■ Worst-case, O(logN) sstables relevant to a partition - plus L0 size.
■ Under some assumptions (updates of complete rows, of similar sizes),
the space-amplification argument implies: 90% of the reads will need just one sstable!
■ Nearly optimal
Leveled Compaction - Amplification
■ Write Amplification: data can be rewritten roughly fan-out (~10) times per level, significantly higher than Size-Tiered (see Example 1)
Example 1 - Write-Heavy Workload
■ Size-tiered compaction:
At some points needs twice the disk space
■ In ScyllaDB with many shards, “usually” maximum space
use is not concurrent
■ Leveled compaction:
More than double the amount of disk I/O
■ Test used smaller-than-default sstables (10 MB) to
illustrate the problem
■ Same problem occurs with the default sstable size (160 MB) on
larger workloads
Example 1 (Space Amplification)
(chart: LCS temporary space stays a constant multiple of the flushed memtable & sstable size, while STCS shows x2 space amplification peaks)
Example 2 - Overwrite Workload
■ Write 15 times the same 4 million partitions
■ cassandra-stress write n=4000000 -pop seq=1..4000000 -schema
"replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)"
■ In this test cassandra-stress is not rate limited
■ Again, small (10MB) LCS tables
■ Necessary amount of sstable data: 1.2 GB
■ STCS space amplification: x7.7 !
■ LCS space amplification lower, constant multiple of sstable size
■ Incremental will be around x2 (if it decides to compact fewer files)
Example 2 (Space Amplification)
(chart: STCS disk usage peaks at x7.7 space amplification)
Strategy #3: Incremental Compaction
■ Size-tiered Compaction needs temporary space because we only remove
a huge SSTable after we fully compact it.
■ Let’s split each huge sstable into a run (a la LCS) of “fragments”:
■ Treat the entire run (not individual SSTables) as a file for STCS
■ Remove individual sstables as compacted. Low temporary space.
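The fragment idea can be sketched as a merge that seals output fragments as it goes and releases each input fragment the moment it is fully consumed (a toy model with integer keys; real ICS works on sstable runs):

```python
import heapq

def merge_runs_incrementally(runs, fragment_size=3):
    """Merge sorted runs (each a list of sorted key-fragments) into one new run,
    sealing an output fragment every `fragment_size` keys and releasing
    (deleting) each input fragment as soon as it is fully consumed.
    """
    released = []   # input fragments freed before the whole merge finishes
    iters = []
    for run_id, run in enumerate(runs):
        flat = [(key, run_id, frag_id)
                for frag_id, frag in enumerate(run) for key in frag]
        iters.append(iter(flat))
    remaining = {(rid, fid): len(frag)
                 for rid, run in enumerate(runs) for fid, frag in enumerate(run)}
    out_run, current = [], []
    for key, rid, fid in heapq.merge(*iters):
        current.append(key)
        remaining[(rid, fid)] -= 1
        if remaining[(rid, fid)] == 0:
            released.append((rid, fid))   # this fragment can be deleted now
        if len(current) == fragment_size:
            out_run.append(current)
            current = []
    if current:
        out_run.append(current)
    return out_run, released
```

Because inputs are released mid-merge, the temporary space held at any moment is a few fragments, not the whole input data set.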
Incremental Compaction - Amplification
■ Space Amplification:
■ Small constant temporary space needs, even smaller than LCS
(M*S per parallel compaction, e.g., M=4, S=160 MB)
■ Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold
■ Optimal.
■ Write Amplification:
■ O(logN), small constant — same as Size-Tiered compaction
■ Read Amplification:
■ Like Size-Tiered, at worst O(logN) if updating the same partitions
Example 1 - Size Tiered vs Incremental
(chart: Incremental compaction avoids Size-Tiered's disk-usage peaks)
Is it Enough?
■ The space overhead problem was effectively fixed in Incremental (ICS), however…
■ Incremental (ICS) and Size-Tiered (STCS) strategies share the same space
amplification (~2-4x) when facing overwrite-intensive workloads
■ As a result, they cover a similar region in the three-dimensional efficiency space (RUM
trade-offs):
(diagram: RUM triangle with READ, WRITE, and SPACE corners; STCS and ICS occupy a similar region)
Space Amplification Goal (SAG)
■ Leveled and Size-Tiered (or ICS) cover different regions
■ Interesting regions cannot be reached with either strategy alone
■ But interesting regions can be reached by combining the data layout of both strategies
■ i.e. a Hybrid (Tiered+Leveled) approach:
(diagram: RUM triangle with READ, WRITE, and SPACE corners; the hybrid approach sits between STCS/ICS and LCS)
■ Space Amplification Goal (SAG) is a property controlling the size ratio of
the largest and the second-largest tier
■ It’s a value between 1 and 2 (defined in the table’s schema). A value of 1.5
triggers Cross-Tier Compaction when the second-largest tier reaches half the size of
the largest.
■ Effectively, it helps control Space Amplification. Not an upper bound,
but results show that compaction works towards reducing the
actual SA to below the configured value.
■ The lower the SAG value, the lower the SA but the higher the WA. A good
initial value is 1.5; decrease it conservatively from there.
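The trigger described above can be expressed as a simple predicate (an interpretation of the rule, not ScyllaDB's actual code): with SAG = 1.5, cross-tier compaction starts once the second-largest tier reaches half the size of the largest.

```python
def should_cross_tier_compact(largest, second_largest, sag=1.5):
    """Trigger cross-tier compaction when largest + second_largest would exceed
    sag * largest, i.e. when second_largest >= (sag - 1) * largest.

    Sizes are in any consistent unit (e.g. MB); sag is between 1 and 2.
    """
    return second_largest >= (sag - 1.0) * largest
```

A lower SAG makes this fire earlier (lower space amplification, more rewriting); a SAG close to 2 makes the layout behave like plain tiered compaction.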
Further on ICS + SAG
Schema for ICS’ SAG:
ALTER TABLE foo.bar
WITH compaction = {
  'class': 'IncrementalCompactionStrategy',
  'space_amplification_goal': '1.5'
};
ICS with Space Amplification Goal (SAG)
■ Accumulation of tombstone records is a known problem in LSM trees
■ Makes queries slower
■ Read amplification
■ More CPU work (to keep only the latest version)
■ ICS employs a SAG-like mechanism, but focused on expired data rather than space
■ Enabled by default; can be controlled with the usual parameters
■ tombstone_compaction_interval (defaults to 864000 seconds, i.e. 10 days)
■ tombstone_threshold (defaults to 0.2)
ICS has more efficient GC
Strategy #4: Time Window (TWCS)
■ Designed for time series workloads
■ Groups data of similar age together
■ Helps with:
■ Garbage collecting expired data
■ as data with similar age will be expired roughly at the same time
■ Read performance
■ Queries over a time range will find the data in a small number of files
■ Common anti-patterns:
■ Not having every cell TTL’d (recommendation is to use default_time_to_live)
■ Deletions and overwrites (not well supported; a major compaction is usually needed afterwards)
■ Keep the number of windows to a small constant. Recommendation: 20.
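The grouping can be sketched as bucketing sstables by a time window derived from their maximum write timestamp (window size and timestamps are illustrative); only sstables in the same window are compacted together, so a whole window's data can expire, and be dropped, at once.

```python
from collections import defaultdict

def group_by_time_window(sstables, window_seconds=86400):
    """Bucket sstables by time window; TWCS only compacts within a window,
    so a whole window's worth of TTL'd data can later be dropped together.

    sstables: list of (name, max_write_timestamp_seconds) pairs.
    Returns {window_index: [sstable names]}.
    """
    windows = defaultdict(list)
    for name, timestamp in sstables:
        windows[timestamp // window_seconds].append(name)
    return dict(windows)
```

With a 1-day window, sstables written on the same day land in the same bucket and never get merged with older or newer data.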
Compaction Strategies Summary
Workload                      | Size-Tiered                  | Leveled                           | Incremental                         | Time-Window
Write-only                    | 2x peak space                | 2x writes                         | Best                                | -
Overwrite                     | Huge peak space              | write amplification               | SAG helps                           | -
Read-mostly, few updates      | read amplification           | Best                              | read amplification                  | -
Read-mostly, a lot of updates | read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
Time series                   | write, read, and space ampl. | write and space amplification     | write and read amplification        | Best
Stay in Touch
Raphael S. Carvalho
raphaelsc@scylladb.com
@raphael_scarv
@raphaelsc
https://www.linkedin.com/in/raphaelscarvalho/
Multimodal Retrieval Augmented Generation (RAG) with Milvus
Zilliz
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
ThousandEyes
 

Recently uploaded (20)

Supplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdfSupplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdf
 
Churchgate Call Girls 👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
Churchgate Call Girls  👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...Churchgate Call Girls  👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
Churchgate Call Girls 👑VIP — Mumbai ☎️ 9910780858 🎀Niamh@ Churchgate Call Gi...
 
The "Zen" of Python Exemplars - OTel Community Day
The "Zen" of Python Exemplars - OTel Community DayThe "Zen" of Python Exemplars - OTel Community Day
The "Zen" of Python Exemplars - OTel Community Day
 
Product Listing Optimization Presentation - Gay De La Cruz.pdf
Product Listing Optimization Presentation - Gay De La Cruz.pdfProduct Listing Optimization Presentation - Gay De La Cruz.pdf
Product Listing Optimization Presentation - Gay De La Cruz.pdf
 
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdfSummer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
Summer24-ReleaseOverviewDeck - Stephen Stanley 27 June 2024.pdf
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
 
ASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLCASIMOV: Enterprise RAG at Dialog Axiata PLC
ASIMOV: Enterprise RAG at Dialog Axiata PLC
 
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
“Efficiency Unleashed: The Next-gen NXP i.MX 95 Applications Processor for Em...
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
 
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My IdentityCNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My Identity
 
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
Call Girls Firozabad ☎️ +91-7426014248 😍 Firozabad Call Girl Beauty Girls Fir...
 
Metadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - DatastratoMetadata Lakes for Next-Gen AI/ML - Datastrato
Metadata Lakes for Next-Gen AI/ML - Datastrato
 
Database Management Myths for Developers
Database Management Myths for DevelopersDatabase Management Myths for Developers
Database Management Myths for Developers
 
Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024
 
intra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_Enintra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_En
 
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to SuccessCassandra to ScyllaDB: Technical Comparison and the Path to Success
Cassandra to ScyllaDB: Technical Comparison and the Path to Success
 
Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0
 
Multimodal Retrieval Augmented Generation (RAG) with Milvus
Multimodal Retrieval Augmented Generation (RAG) with MilvusMultimodal Retrieval Augmented Generation (RAG) with Milvus
Multimodal Retrieval Augmented Generation (RAG) with Milvus
 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
 

Balancing Compaction Principles and Practices

  • 1. The Right Compaction Strategy Can Boost Your Performance Raphael Carvalho, Software Engineer at ScyllaDB
  • 2. Raphael Carvalho ■ Syslinux (bootloader) ■ OSv (unikernel) ■ Seastar (ScyllaDB’s heart) ■ ScyllaDB (the best db in the world)
  • 3. ■ LSM tree ■ What is Compaction? Why is it Needed? ■ Read, Write and Space Amplification ■ Different Compaction Strategies ■ When to Use Each One ■ SAG in ICS ■ Time Series Compaction Strategy (TWCS) Presentation Agenda
  • 4. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log
  • 5. What is LSM-tree Compaction? LSM storage engine’s write path: commit log Writes
  • 6. What is LSM-tree Compaction? LSM storage engine’s write path: commit log Writes
  • 7. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log compaction
  • 8. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log compaction
  • 9. What is LSM-tree Compaction? LSM storage engine’s write path: Writes commit log
  • 10. What is Compaction? (cont.) ■ This technique of keeping sorted files and merging them is well known and often called a Log-Structured Merge (LSM) Tree ■ Published in 1996; the earliest popular application I know of is the Lucene search engine, 1999 ■ High write performance ■ Immediately readable ■ Reasonable read performance
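The write path on the previous slides can be reduced to a toy model: memtables flush into immutable sorted runs (sstables), and compaction merges runs with last-write-wins semantics. This is an illustrative Python sketch, not ScyllaDB's actual Seastar/C++ code; `flush` and `compact` are hypothetical names.

```python
def flush(memtable):
    """Flush an in-memory write buffer into an immutable sorted run (sstable)."""
    return sorted(memtable.items())

def compact(*runs):
    """Merge sorted runs; for duplicate keys, the later (newer) run wins."""
    merged = {}
    for run in runs:              # later runs overwrite earlier entries
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())

# Two flushed memtables with an overlapping key "c"
run1 = flush({"a": 1, "c": 3})
run2 = flush({"b": 2, "c": 30})   # newer write to "c"
print(compact(run1, run2))        # [('a', 1), ('b', 2), ('c', 30)]
```

Note how the duplicate entry for `"c"` disappears after the merge: that reclaimed space is exactly what compaction buys you.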
  • 11. Compaction Strategy (a.k.a. File picking policy) ■ Which files to Compact, and When? ■ This is called the Compaction Strategy ■ The Goal of the Strategy is Low Amplification: ■ Avoid read requests needing many sstables ■ Read Amplification ■ Avoid overwritten/deleted/expired data staying on disk ■ Avoid excessive temporary disk space needs (scary!) ■ Space Amplification ■ Avoid compacting the same data again and again ■ Write Amplification Which compaction strategy shall I choose?
  • 12. Read, Write and Space Amplification Make a choice! ■ This kind of trade-off is well known in distributed databases (think of CAP) ■ The RUM Conjecture states: ■ We cannot design an access method for a storage system that is optimal in all of the following three aspects - Reads, Updates, and Memory ■ It is impossible to decrease Read, Write & Space Amplification all at once ■ One strategy can, e.g., optimize for Write while sacrificing Read & Space ■ Another can optimize for Space and Read while sacrificing Writes
  • 13. Read, Write and Space Amplification Make a choice! [RUM triangle diagram: READ, WRITE and SPACE at the corners; TIERED (STCS, ICS) and LEVELED occupy different regions of the triangle]
  • 14. Compaction Strategies History Cassandra and ScyllaDB ■ Starts with Size Tiered Compaction Strategy ■ Efficient Write performance ■ Inconsistent Read performance ■ Substantial waste of disk space = bad space amplification (due to slow GC) ■ To fix Read / Space issues in Tiered Compaction, Leveled Compaction is introduced ■ Fixes Read & Space issues ■ BUT it introduces a new problem - Write Amplification
  • 15. Strategy #1: Size-Tiered Compaction ■ Cassandra’s oldest and still default Compaction Strategy ■ Dates back to Google’s BigTable paper (2006) ■ Idea used even earlier (e.g., Lucene, 1999)
  • 17. Size-Tiered Compaction Strategy Compact N similar-sized files together, with result being placed into next tier
  • 18. Size-Tiered Compaction Strategy Compacted N similar-sized files together, with result placed into next tier OUTPUT
  • 19. Size-Tiered Compaction Strategy Compacting N similar-sized files together, with result placed into next tier
  • 20. Size-Tiered Compaction Strategy Compacted N similar-sized files together, with result placed into next tier OUTPUT
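The "compact N similar-sized files" policy shown above can be sketched as a bucketing function. The option names (`bucket_low`, `bucket_high`, `min_threshold`) follow Cassandra's STCS options, but this is a simplified illustration, not the real picking code.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group sstables whose size falls within [avg*bucket_low, avg*bucket_high]
    of a bucket's running average -- STCS's notion of 'similar-sized'."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Only buckets with at least min_threshold files become compaction candidates
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~10 MB sstables qualify; the two ~100 MB ones are too few to compact yet
print(size_tiered_buckets([10, 11, 9, 10, 100, 110]))  # [[9, 10, 10, 11]]
```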
  • 21. Size-Tiered Compaction - Amplification ■ Write Amplification: O(logN) ■ Where “N” is (data size) / (flushed sstable size). ■ Most data is in highest tier - needed to pass through O(logN) tiers ■ This is asymptotically optimal
  • 22. Size-Tiered Compaction - Amplification What is Read Amplification? O(logN) sstables, but: ■ If workload writes a partition once and never modifies it: ■ Eventually each partition’s data will be compacted into one sstable ■ In-memory bloom filter will usually allow reading only one sstable ■ Optimal ■ But if workload continues to update a partition: ■ All sstables will contain updates to the same partition ■ O(logN) reads per read request ■ Reasonable, but not great
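The "in-memory bloom filter will usually allow reading only one sstable" point works because a Bloom filter can say "definitely not here" cheaply. A tiny sketch of the idea (illustrative only; real sstable filters are sized per-table and use faster hashes):

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: lets a read skip sstables that definitely do not
    contain the partition key (false positives possible, false negatives not)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

bf = Bloom()
bf.add("partition-42")
print(bf.might_contain("partition-42"))  # True -- must read this sstable
print(bf.might_contain("partition-99"))  # almost certainly False -- skip it
```

With one such filter per sstable, a point read touches O(log N) filters in memory but usually only one sstable on disk.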
  • 23. Size-Tiered Compaction - Amplification ■ Space amplification
  • 24. Strategy #2: Leveled Compaction ■ Introduced in Cassandra 1.0, in 2011 ■ Based on Google’s LevelDB (itself based on Google’s BigTable) ■ No longer has size-tiered's huge sstables ■ Instead have runs: ■ A run is a collection of small (160 MB by default) SSTables ■ Have non-overlapping key ranges ■ A huge SSTable must be rewritten as a whole, but in a run we can modify only parts of it (individual sstables) while keeping the disjoint key requirement
  • 25. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ...
  • 26. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... COMPACTING LEVEL 0 INTO ALL SSTABLES FROM LEVEL 1, DUE TO KEY RANGE OVERLAPPING
  • 27. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... OUTPUT IS PLACED INTO LEVEL 1, WHICH MAY HAVE EXCEEDED ITS CAPACITY… MAY NEED TO COMPACT LEVEL 1 INTO 2
  • 28. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... PICKS ONE EXCEEDING SSTABLE FROM LEVEL 1 AND COMPACTS IT WITH THE OVERLAPPING SSTABLES IN LEVEL 2 (ABOUT ~10, DUE TO THE DEFAULT FAN-OUT OF 10)
  • 29. Leveled Compaction Strategy Level 0 Level 1 (run of 10 sstables) Level 2 (run of 100 sstables) ... INPUT IS REMOVED FROM LEVEL 1 AND OUTPUT PLACED INTO LEVEL 2, WITHOUT BREAKING KEY DISJOINTNESS IN LEVEL 2
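The promotion step in the slides above hinges on finding which sstables in the next level overlap the candidate's key range. Since each level's sstables have disjoint ranges, that set is small. A minimal sketch (hypothetical `overlapping` helper, ranges modeled as `(first_key, last_key)` pairs):

```python
def overlapping(candidate, level):
    """Return sstables in `level` whose [first, last] key range overlaps the
    candidate's range -- the set LCS must rewrite when promoting the candidate."""
    lo, hi = candidate
    return [(a, b) for (a, b) in level if not (b < lo or a > hi)]

# Level 2 holds disjoint key ranges; promoting (25, 47) from level 1 touches two
level2 = [(0, 9), (10, 29), (30, 49), (50, 69)]
print(overlapping((25, 47), level2))  # [(10, 29), (30, 49)]
```

With the default fan-out of 10, a promoted sstable typically overlaps roughly 10 sstables in the next level, which is where LCS's write amplification comes from.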
  • 30. Leveled Compaction - Amplification ■ Space Amplification: ■ Because of sstable counts, 90% of the data is in the deepest level (if full!) ■ These sstables do not overlap, so it can’t have duplicate data! ■ So at most, 10% of the space is wasted ■ Also, each compaction needs a constant (~12*160MB) temporary space ■ Nearly optimal
  • 31. Leveled Compaction - Amplification ■ Read Amplification: ■ We have O(N) tables! ■ But in each level sstables have disjoint ranges (cached in memory) ■ Worst-case, O(logN) sstables relevant to a partition - plus L0 size. ■ Under some assumptions (update complete rows, of similar sizes) space amplification implies: 90% of the reads will need just one sstable! ■ Nearly optimal
  • 32. Leveled Compaction - Amplification ■ Write Amplification:
  • 33. Example 1 - Write-Heavy Workload ■ Size-tiered compaction: At some points needs twice the disk space ■ In ScyllaDB with many shards, “usually” maximum space use is not concurrent ■ Leveled compaction: More than double the amount of disk I/O ■ Test used smaller-than-default sstables (10 MB) to illustrate the problem ■ Same problem with default sstable size (160 MB) - with larger workloads
  • 34. Example 1 (Space Amplification) constant multiple of flushed memtable & sstable size x2 space amplification
  • 35. Example 2 - Overwrite Workload ■ Write 15 times the same 4 million partitions ■ cassandra-stress write n=4000000 -pop seq=1..4000000 -schema "replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)" ■ In this test cassandra-stress is not rate limited ■ Again, small (10MB) LCS tables ■ Necessary amount of sstable data: 1.2 GB ■ STCS space amplification: x7.7 ! ■ LCS space amplification lower, constant multiple of sstable size ■ Incremental will be around x2 (if it decides to compact fewer files)
  • 36. Example 2 (Space Amplification) x7.7 space amplification
  • 37. Strategy #3: Incremental Compaction ■ Size-tiered Compaction needs temporary space because we only remove a huge SSTable after we fully compact it. ■ Let’s split each huge sstable into a run (a la LCS) of “fragments”: ■ Treat the entire run (not individual SSTables) as a file for STCS ■ Remove individual sstables as compacted. Low temporary space.
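The temporary-space difference can be made concrete with a toy accounting model: STCS keeps all inputs on disk until the whole output is written (so usage peaks near 2x), while the incremental approach deletes each input fragment as soon as the corresponding output fragment is sealed. This is an illustrative model with a hypothetical `peak_space` function, not ScyllaDB code.

```python
def peak_space(fragment_sizes, incremental):
    """Peak disk usage (in MB) while compacting one run of fragments."""
    total = sum(fragment_sizes)
    used = peak = total                  # inputs are already on disk
    for size in fragment_sizes:
        used += size                     # seal one output fragment on disk
        peak = max(peak, used)
        if incremental:
            used -= size                 # ICS idea: drop the consumed input now
    if not incremental:
        used -= total                    # STCS: inputs removed only at the end
    return peak

sizes = [160] * 10                       # a 1.6 GB run in 160 MB fragments
print(peak_space(sizes, incremental=False))  # 3200 -> ~2x temporary space
print(peak_space(sizes, incremental=True))   # 1760 -> ~one extra fragment
```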
  • 38. Incremental Compaction - Amplification ■ Space Amplification: ■ Small constant temporary space needs, even smaller than LCS (M*S per parallel compaction, e.g., M=4, S=160 MB) ■ Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold ■ Optimal. ■ Write Amplification: ■ O(logN), small constant — same as Size-Tiered compaction ■ Read Amplification: ■ Like Size-Tiered, at worst O(logN) if updating the same partitions
  • 39. Example 1 - Size Tiered vs Incremental Incremental compaction
  • 40. Is it Enough? ■ Space overhead problem was efficiently fixed in Incremental (ICS), however… ■ Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x) when facing overwrite-intensive workloads, where: ■ They cover a similar region in the three-dimensional efficiency space (RUM trade-offs): READ WRITE SPACE STCS ICS
  • 41. Space Amplification Goal (SAG) ■ Leveled and Size-Tiered (or ICS) cover different regions ■ Interesting regions cannot be reached with either strategy ■ But interesting regions can be reached by combining the data layout of both strategies ■ i.e. a Hybrid (Tiered+Leveled) approach: READ WRITE SPACE STCS ICS LCS
  • 42. ■ Space Amplification Goal (SAG) is a property that controls the size ratio between the largest and the second-largest tier ■ It’s a value between 1 and 2 (defined in the table’s schema). A value of 1.5 implies cross-tier compaction when the second-largest tier is half the size of the largest. ■ Effectively, it helps control Space Amplification. Not an upper bound, but results show that compaction will work towards reducing the actual SA to below the configured value. ■ The lower the SAG value, the lower the SA but the higher the WA. A good initial value is 1.5; then decrease conservatively. Further on ICS + SAG
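One plausible reading of the trigger described above (with SAG = 1.5, cross-tier compaction fires once the second-largest tier reaches half the size of the largest) can be written as a one-line predicate. This is a hypothetical condition for illustration, not ScyllaDB's exact internal check.

```python
def sag_triggers(largest, second_largest, sag=1.5):
    """With sag = 1.5, compacting the two largest tiers together is triggered
    once the second-largest reaches (sag - 1) = half the size of the largest."""
    return second_largest >= (sag - 1) * largest

print(sag_triggers(1000, 400))  # False: second tier still under half
print(sag_triggers(1000, 500))  # True: cross-tier compaction would fire
```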
  • 43. ALTER TABLE foo.bar WITH compaction = { 'class': 'IncrementalCompactionStrategy', 'space_amplification_goal': '1.5' }; Schema for ICS’ SAG
  • 44. ICS with Space Amplification Goal (SAG)
  • 45. ■ Accumulation of tombstone records is a known problem in LSM trees ■ Makes queries slower ■ Read amplification ■ More CPU work (preserve latest) ■ Employs a SAG-like, but with focus on expired data, rather than space. ■ Enabled by default, can be controlled with usual params ■ tombstone_compaction_interval (defaults to 864000 (10 days)) ■ tombstone_threshold (defaults to 0.2) ICS has more efficient GC
  • 46. Strategy #4: Time Window (TWCS) ■ Designed for time series workloads ■ Groups data of similar age together ■ Helps with: ■ Garbage collecting expired data ■ as data of similar age will expire at roughly the same time ■ Read performance ■ Queries using a time range will find the data in a small number of files ■ Common anti-patterns: ■ Not having every cell TTL’d (recommendation is to use default_time_to_live) ■ Deletions, overwrites (not well supported; a major compaction is usually needed after) ■ Keep the number of windows to a small constant. Recommendation: 20.
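The "groups data of similar age" idea boils down to bucketing sstables by the time window their data falls into; expired windows can then be dropped wholesale. A minimal sketch (hypothetical `time_windows` helper; real TWCS has considerably more machinery):

```python
from collections import defaultdict

def time_windows(sstables, window_seconds=86_400):
    """Bucket sstables by the window of their max data timestamp, the way
    TWCS groups data of similar age (here: one-day windows)."""
    windows = defaultdict(list)
    for name, max_ts in sstables:
        windows[max_ts // window_seconds].append(name)
    return dict(windows)

day = 86_400
tables = [("a", 10), ("b", day + 5), ("c", day + 9_000), ("d", 2 * day + 1)]
print(time_windows(tables))  # {0: ['a'], 1: ['b', 'c'], 2: ['d']}
```

Once every cell in window 0 has passed its TTL, the whole window's sstables can be deleted without any compaction work at all.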
  • 47. Compaction Strategies Summary
  Workload                      | Size-Tiered                  | Leveled                           | Incremental                         | Time-Window
  Write-only                    | 2x peak space                | 2x writes                         | Best                                | -
  Overwrite                     | Huge peak space              | write amplification               | SAG helps                           | -
  Read-mostly, few updates      | read amplification           | Best                              | read amplification                  | -
  Read-mostly, a lot of updates | read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
  Time series                   | write, read, and space ampl. | write and space amplification     | write and read amplification        | Best
  • 48. Stay in Touch Raphael S. Carvalho raphaelsc@scylladb.com @raphael_scarv @raphaelsc https://www.linkedin.com/in/raphaelscarvalho /