SriSatish Ambati
Performance, Riptano, Cassandra
Azul Systems & OpenJDK
Twitter: @srisatish
srisatish.ambati@gmail.com
Cache & Concurrency considerations
for a high performance Cassandra
Trail ahead
Elements of Cache Performance
Metrics, Monitors
JVM goes to BigData Land!
Examples
Lucandra, Twissandra
Cassandra Performance with JVM
Commentary
Runtime Views
Non Blocking HashMap
Locking: concurrency
Garbage Collection
A feather in the CAP
• Eventual Consistency
  – Levels
  – Doesn’t mean data loss (journaled)
• SEDA
  – Partitioning, Cluster & Failure detection, Storage engine mod
  – Event driven & non-blocking io
  – Pure Java
Count what is countable, measure what is measurable, and what is not
measurable, make measurable
-Galileo
Elements of Cache Performance
Metrics
• Operations:
– Ops/s: Puts/sec, Gets/sec, updates/sec
– Latencies, percentiles (see the recorder sketch after this slide)
– Indexing
• # of nodes – scale, elasticity
• Replication
– Synchronous, Asynchronous (fast writes)
• Tuneable Consistency
• Durability/Persistence
• Size & Number of Objects, Size of Cache
• # of user clients
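Not in the original deck: a minimal sketch of how the Ops/s and latency-percentile metrics above might be tracked in-process. The class and method names are illustrative, and a real recorder would use a bounded histogram rather than an ever-growing list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative latency recorder: collect per-op latencies (micros) and
// report a percentile. Real systems use histograms to bound memory.
public class LatencyRecorder {
    private final List<Long> samplesMicros = new ArrayList<Long>();

    public synchronized void record(long micros) {
        samplesMicros.add(micros);
    }

    // p in (0, 100], e.g. 99.0 for the p99 latency
    public synchronized long percentileMicros(double p) {
        if (samplesMicros.isEmpty()) return 0L;
        List<Long> sorted = new ArrayList<Long>(samplesMicros);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }
}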
Elements of Cache Performance:
“Think Locality”
• Hot or Not: The 80/20 rule.
– A small set of objects are very popular!
– What is the most retweeted (RT) tweet?
• Hit or Miss: Hit Ratio
– How effective is your cache? (see the cache sketch after this slide)
– LRU, LFU, FIFO.. Expiration
• Long-lived objects lead to better locality.
• Spikes happen
– Cascading events
– Cache Thrash: full table scans
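A minimal sketch of the hit/miss bookkeeping and LRU expiration mentioned above, built on java.util.LinkedHashMap; the class name and capacity handling are illustrative, not Cassandra's.

import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache that tracks its own hit ratio.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;
    private long hits, misses;

    public LruCache(int capacity) {
        super(16, 0.75f, true);           // accessOrder=true gives LRU eviction order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;         // evict the least-recently-used entry
    }

    public V lookup(K key) {
        V v = get(key);
        if (v == null) misses++; else hits++;
        return v;
    }

    public double hitRatio() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}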
Real World Performance
• Facebook Inbox
– Writes:0.12ms, Reads:15ms @ 50GB data
• Twitter performance
– Twissandra (simulation)
• Cassandra for Search & Portals
– Lucandra, solandra (simulation)
• ycsb/PNUTS benchmarks
– 5ms read/writes @ 5k ops/s (50/50 Update heavy)
– 8ms reads/5ms writes @ 5k ops/s (95/5 read heavy)
• Lab environment
– ~5k writes per sec per node, <5ms latencies
– ~10k reads per sec per node, <5ms latencies
• Performance has improved in newer versions
yahoo cloud store benchmark
50/50 – Update Heavy
yahoo cloud store benchmark
95/5 – read heavy
JVM in BigData Land!
Limits for scale
• Locks : synchronized
– Can’t use all my multi-cores!
– java.util.collections also hold locks
– Use non-blocking collections!
• (de)Serialization is expensive
– Hampers object portability
– Use avro, thrift!
• Object overhead
– average enterprise collection has 3 elements!
– Use byte[ ], primitives where possible!
• Garbage Collection
– Can’t throw memory at the problem!
– Mitigate, Monitor, Measure footprint
Tools
• What is the JVM doing:
– dtrace, hprof, introscope, jconsole,
visualvm, yourkit, azul zvision
• Invasive JVM observation tools
– bci, jvmti, jvmdi/pi agents, jmx, logging
• What is the OS doing:
– dtrace, oprofile, vtune
• What are the network & disk doing:
– Ganglia, iostat, lsof, netstat, nagios
furiously fast writes
• Append only writes
– Sequential disk access
• No locks in critical path
• Key based atomicity
(Write path diagram: a client issues a write; the partitioner finds the owning node (n1, n2); that node appends the write to its commit log and then applies it to the in-memory memtable. A sketch follows.)
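A rough, single-node sketch of the write path in the diagram: append the mutation sequentially to a commit log, then apply it to an in-memory, per-key map. Names and record framing are illustrative; this is not Cassandra's code.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative single-node write path: sequential commit-log append,
// then an atomic per-key update of the in-memory "memtable".
public class WritePathSketch {
    private final FileChannel commitLog;
    private final ConcurrentSkipListMap<String, byte[]> memtable =
            new ConcurrentSkipListMap<String, byte[]>();   // sorted, per-key atomic puts

    public WritePathSketch(File logFile) throws IOException {
        this.commitLog = new FileOutputStream(logFile, true).getChannel();
    }

    public void write(String key, byte[] value) throws IOException {
        byte[] keyBytes = key.getBytes("UTF-8");
        ByteBuffer record = ByteBuffer.allocate(8 + keyBytes.length + value.length);
        record.putInt(keyBytes.length).put(keyBytes)
              .putInt(value.length).put(value);
        record.flip();
        commitLog.write(record);          // append only, sequential disk access
        // a real log would also force() periodically for durability
        memtable.put(key, value);         // key-based atomicity, no global lock
    }
}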
furiously fast writes
• Use separate disks for commitlog
– Don’t forget to size them well
– Isolation difficult in the cloud..
• Memtable/SSTable sizes
– Delicately balanced with GC
• memtable_throughput_in_mb
Cassandra on EC2 cloud
*Corey Hulen, EC2
Compactions
(Compaction diagram: several SSTables, each sorted by key, e.g. {K1, K2, K3, …}, {K2, K10, K30, …}, {K4, K5, K10, …}, are merge-sorted in memory into a single sorted SSTable {K1, K2, K3, K4, K5, K10, K30}; the merged data file is written together with an index file of key offsets (K1, K5, K30) and a Bloom filter, and deleted/obsolete rows are dropped. A merge-sort sketch follows.)
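Compaction is essentially a k-way merge over already-sorted runs. A small illustration using a priority queue follows; real compaction also reconciles row versions and purges tombstones, which this sketch omits.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of sorted key runs, the core of SSTable compaction.
public class MergeSortedRuns {

    private static class Entry {
        final String key;
        final Iterator<String> source;
        Entry(String key, Iterator<String> source) { this.key = key; this.source = source; }
    }

    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>(
                Math.max(1, sortedRuns.size()),
                new Comparator<Entry>() {
                    public int compare(Entry a, Entry b) { return a.key.compareTo(b.key); }
                });
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new Entry(it.next(), it));
        }
        List<String> merged = new ArrayList<String>();
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();
            // emit each key once; duplicates from older runs are skipped
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(smallest.key)) {
                merged.add(smallest.key);
            }
            if (smallest.source.hasNext()) {
                heap.add(new Entry(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }
}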
Compactions
• Intense disk io & mem churn
• Triggers GC for tombstones
• Minor/Major Compactions
• Reduce priority for better reads
• Other Parameters -
– CompactionManager.minimumCompactionThreshold=xxxx
Example: compaction in the real world, Cloudkick
reads design
reads performance
• BloomFilter used to identify the right file
• Maintain column indices to look up columns
– Which can span different SSTables
• Less io than typical b-tree
• Cold read: Two seeks
– One for the key lookup, another for the row lookup (see the read-path sketch after this slide)
• Key Cache
– Optimized in latest cassandra
• Row Cache
– Improves read performance
– GC sensitive for large rows.
• Most (google) applications require single row
transactions*
*Sanjay G, BigTable Design, Google.
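A rough sketch of the read flow above: check the memtable first, then walk SSTables newest-to-oldest, consulting each file's Bloom filter before paying for disk seeks. The SSTable interface here is hypothetical, for illustration only.

import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative read path: memtable first, then SSTables newest-to-oldest,
// consulting each SSTable's Bloom filter before touching disk.
public class ReadPathSketch {

    interface SSTable {
        boolean mightContain(byte[] key);   // Bloom filter: no false negatives
        byte[] read(byte[] key);            // index lookup + data-file seek, may return null
    }

    private final ConcurrentSkipListMap<String, byte[]> memtable =
            new ConcurrentSkipListMap<String, byte[]>();
    private final List<SSTable> sstablesNewestFirst;

    ReadPathSketch(List<SSTable> sstablesNewestFirst) {
        this.sstablesNewestFirst = sstablesNewestFirst;
    }

    byte[] get(String key) {
        byte[] v = memtable.get(key);
        if (v != null) return v;                           // freshest data wins
        byte[] keyBytes = key.getBytes();
        for (SSTable sstable : sstablesNewestFirst) {
            if (!sstable.mightContain(keyBytes)) continue; // skip files that cannot hold the key
            v = sstable.read(keyBytes);                    // cold read: key lookup + row lookup
            if (v != null) return v;                       // null here means a false positive
        }
        return null;
    }
}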
Client Performance
Marshal Arts:
Ser/Deserialization
• Clients dominated by Thrift, Avro
– Hector, Pelops
• Thrift: upgrade to latest: 0.5, 0.4
• No news: java.io.Serializable is S..L..O..W
• Use “transient”
• avro, thrift, proto-buf
• Common Patterns of Doom:
– Death by a million gets
Serialization + Deserialization µBench
• http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
Adding Nodes
• New nodes
– Add themselves to busiest node
– And then Split its Range
• Busy Node starts transmit to new node
• Bootstrap logic initiated from any node, cli, web
• Each node capable of ~40MB/s
– Multiple replicas to parallelize bootstrap
• UDP for control messages
• TCP for request routing
inter-node comm
• Gossip Protocol
– It’s exponential
– (epidemic algorithm; see the toy simulation after this slide)
• Failure Detector
– Accrual rate phi
• Anti-Entropy
– Bringing replicas up to date
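To see why gossip spread is "exponential", a toy epidemic simulation: each round every informed node tells one random peer, so coverage roughly doubles per round. This sketches the epidemic idea only, not Cassandra's gossip protocol.

import java.util.Random;

// Toy epidemic/gossip simulation: each round, every informed node gossips
// to one random peer; the informed set grows roughly exponentially.
public class GossipSketch {
    public static void main(String[] args) {
        int nodes = 1024;
        boolean[] informed = new boolean[nodes];
        informed[0] = true;                       // rumor starts at node 0
        Random rnd = new Random(42);

        int round = 0, informedCount = 1;
        while (informedCount < nodes) {
            boolean[] next = informed.clone();
            for (int i = 0; i < nodes; i++) {
                if (informed[i]) next[rnd.nextInt(nodes)] = true;   // tell one random peer
            }
            informed = next;
            informedCount = 0;
            for (boolean b : informed) if (b) informedCount++;
            round++;
            System.out.println("round " + round + ": " + informedCount + " of " + nodes);
        }
    }
}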
Bloom Filter: in full bloom
• “constant” time
• size: compact
• false positives
• Single lookup for key in file
• No deletion (in the basic filter)
• Improve
– Counting BF
– Bloomier filters
(A minimal implementation sketch follows.)
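A minimal Bloom filter to make the trade-offs above concrete; the double-hashing scheme here is purely illustrative and not the hash functions Cassandra uses.

import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter: k bit positions per key, derived by double hashing.
// No false negatives; false-positive rate depends on bits-per-key and k.
// A plain Bloom filter cannot delete (a counting Bloom filter can).
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | (h1 << 16);      // cheap second hash, illustration only
        return Math.abs((h1 + i * h2) % numBits);
    }

    public void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    public boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) return false;  // definitely absent
        }
        return true;                                        // present, or a false positive
    }
}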
Birthdays, Collisions &
Hashing functions
• Birthday Paradox
For the N=21 people in this room, the probability that at least 2 of them
share the same birthday is ~0.44 (quick check below)
• Collisions are real!
• An unbalanced HashMap degrades to a list: O(n) retrieval
• Chaining & Linear probing
• Performance degrades at ~80% table density
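A minimal check of the ~0.44 figure, assuming birthdays uniformly distributed over 365 days: the probability that no two of N people collide is (365/365)(364/365)…((365-N+1)/365), and one minus that gives the quoted value.

// Quick check of the birthday-paradox figure quoted above.
public class BirthdayParadox {
    public static void main(String[] args) {
        int people = 21;
        double pNoCollision = 1.0;
        for (int i = 0; i < people; i++) {
            pNoCollision *= (365.0 - i) / 365.0;   // i-th person avoids all earlier birthdays
        }
        // prints roughly 0.44 for 21 people; the probability crosses 0.5 at 23 people
        System.out.printf("P(shared birthday among %d people) = %.2f%n", people, 1 - pNoCollision);
    }
}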
the devil’s in the details
CFS
• All in the family!
• denormalize
Memtable
• In-memory
• ColumnFamily specific
• throughput determines size before flush
• Larger memtables can improve reads
SSTable
• MemTable “flushes” to an SSTable
• Immutable after flush
• Read: Multiple SSTable lookups possible
• Chief Execs:
– SSTableWriter
– SSTableReader
Write: Runtime threads
Writes: runtime mem
Example: Java Overheads
writes: monitors
UUID
• java.util.UUID is slow
– static use leads to contention
SecureRandom
• Uses /dev/urandom for seed initialization
-Djava.security.egd=file:/dev/urandom
• PRNG without the file is at least 20%-40% faster.
• Use TimeUUIDs where possible – much faster (see the micro-benchmark sketch after this slide)
• JUG – java.uuid.generator
• http://github.com/cowtowncoder/java-uuid-generator
• http://jug.safehaus.org/
• http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
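A crude way to see the SecureRandom cost: time UUID.randomUUID(), which funnels all threads through one shared SecureRandom, against version-4-style UUIDs built from a per-thread java.util.Random. Illustrative only: no proper benchmark harness, and plain Random is not cryptographically strong.

import java.util.Random;
import java.util.UUID;

// Crude micro-benchmark: SecureRandom-backed UUID.randomUUID() vs. UUIDs built
// from a plain per-thread Random. Illustrative only (no warmup, not crypto-safe).
public class UuidCost {
    public static void main(String[] args) {
        int n = 200000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) UUID.randomUUID();           // shared SecureRandom underneath
        long secureNanos = System.nanoTime() - t0;

        Random r = new Random();                                 // one per thread in real code
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) new UUID(r.nextLong(), r.nextLong());
        long plainNanos = System.nanoTime() - t1;

        System.out.println("randomUUID():      " + secureNanos / n + " ns/op");
        System.out.println("per-thread Random: " + plainNanos / n + " ns/op");
    }
}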
synchronized
• Coarse grained locks
• io under lock
• Stop signal on a highway
• java.util.concurrent does not mean no locks
• Non Blocking, Lock free, Wait free collections
Scalable Lock-Free Coding Style
• Big Array to hold Data
• Concurrent writes via: CAS & Finite State Machine
– No locks, no volatile
– Much faster than locking under heavy load
– Directly reach main data array in 1 step
• Resize as needed
– Copy Array to a larger Array on demand
– Use State Machine to help copy
– “Mark” old Array words to avoid missing late updates
(A CAS-retry sketch follows.)
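The CAS-retry idiom at the heart of this style, shown on a plain AtomicLongArray ("big array to hold data"): read a slot, compute the new value, compareAndSet it, and retry on contention instead of locking. The resize and state-machine copy of the real NonBlockingHashMap are far more involved; this only sketches the per-slot update.

import java.util.concurrent.atomic.AtomicLongArray;

// The basic CAS-retry loop used by lock-free structures: no locks, no blocking;
// a losing thread simply re-reads and retries.
public class CasCounterSlots {
    private final AtomicLongArray slots;           // "big array to hold data"

    public CasCounterSlots(int size) {
        this.slots = new AtomicLongArray(size);
    }

    public long increment(int slot) {
        while (true) {
            long current = slots.get(slot);
            long updated = current + 1;
            if (slots.compareAndSet(slot, current, updated)) {
                return updated;                    // CAS won: update is visible to all threads
            }
            // CAS lost: another thread changed the slot first, retry
        }
    }
}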
Non-Blocking HashMap
(Charts: throughput in M-ops/sec vs. number of threads, up to ~800 threads, comparing NonBlockingHashMap (NB) and ConcurrentHashMap (CHM) at 75% and 99% read mixes, for 1K-entry and 1M-entry tables, measured on an Azul Vega2 with 768 cpus.)
Cassandra uses High Scale
Non-Blocking Hashmap
public class BinaryMemtable implements IFlushable
{
  …
  // NonBlockingHashMap comes from Cliff Click's high-scale-lib
  private final Map<DecoratedKey, byte[]> columnFamilies =
      new NonBlockingHashMap<DecoratedKey, byte[]>();
  /* Lock and Condition for notifying new clients about Memtable switches */
  private final Lock lock = new ReentrantLock();
  Condition condition;
  …
}

public class Table
{
  …
  private static final Map<String, Table> instances =
      new NonBlockingHashMap<String, Table>();
  …
}
GC-sensitive elements within
Cassandra
• Compaction triggers System.gc()
– Tombstones from files
• “GCInspector” logs GC pauses (see the sketch after this slide)
• Memtable Threshold, sizes
• SSTable sizes
• Low overhead collection choices
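The GCInspector idea can be approximated with the standard GarbageCollectorMXBean API: poll collection counts and accumulated collection time, and report when a polling interval shows a large jump. This is a sketch, not the actual GCInspector code; the threshold and scheduling are arbitrary.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// GCInspector-like sketch: periodically poll GC MXBeans and report how much
// time was spent collecting since the previous poll.
public class GcWatcher implements Runnable {
    private final Map<String, Long> lastTimeMs = new HashMap<String, Long>();

    public void run() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long total = gc.getCollectionTime();                  // ms spent in this collector so far
            Long prev = lastTimeMs.put(gc.getName(), total);
            long delta = (prev == null) ? total : total - prev;
            if (delta > 200) {                                    // arbitrary threshold for the sketch
                System.out.println(gc.getName() + " spent " + delta
                        + " ms in GC over the last interval ("
                        + gc.getCollectionCount() + " collections total)");
            }
        }
    }
    // Schedule with e.g. Executors.newSingleThreadScheduledExecutor()
    //   .scheduleAtFixedRate(new GcWatcher(), 1, 1, TimeUnit.SECONDS);
}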
Garbage Collection
• Pause Times
if stop_the_world_FullGC > ttl_of_node
=> failed requests; failure accrual & node repair.
• Allocation Rate
– New object creation, insertion rate
• Live Objects (residency)
– if residency in heap > 50%
– GC overheads dominate.
• Overhead
– space, cpu cycles spent GC
• 64-bit not addressing pause times
– Bigger is not better!
Memory Fragmentation
• Fragmentation
– Performance degrades over time
– Inducing “Full GC” makes problem go away
– Free memory that cannot be used
• Reduce occurrence
– Use a compacting collector
– Promote less often
– Use uniform sized objects
• Solution – unsolved
– Use latest CMS with CR:6631166
– Azul’s Zing JVM & Pauseless GC
CASSANDRA-1014
Best Practices:
Garbage Collection
• GC Logs are cheap even in
production
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC
• Slightly expensive ones:
-XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1
-XX:+PrintCMSInitiationStatistics
Sizing: Young Generation
• Should we set –Xms == -Xmx ?
• Use –Xmn (fixed eden)
(Heap diagram: allocations, i.e. new Object(), land in eden; the survivor spaces, survivor ratio, and tenuring threshold govern how objects age before promotion into the old generation; some allocation is performed directly by the jvm.)
Tuning CMS
• Don’t promote too often!
– Frequent promotion causes fragmentation
• Size the generations
– Min GC times are a function of Live Set
– Old Gen should host steady state comfortably
• Parallelize on multicores:
– -XX:ParallelCMSThreads=4
– -XX:ParallelGCThreads=4
• Avoid CMS Initiating heuristic
– -XX:+UseCMSInitiatingOccupancyOnly
• Use Concurrent for System.gc()
– -XX:+ExplicitGCInvokesConcurrent
Summary
Design & Implementation of Cassandra takes advantage of the JVM's
strengths while avoiding common JVM issues.
• Locks:
– Avoids locks in critical path
– Uses non-blocking collections, TimeUUIDs!
– Still Can’t use all my multi-cores..?
>> Other bottlenecks to find!
• De/Serialization:
– Uses avro, thrift!
• Object overhead
– Uses mostly byte[ ], primitives where possible!
• Garbage Collection
– Mitigate: Monitor, Measure footprint.
– Work in progress by all jvm vendors!
Cassandra starts from a great footing from a JVM standpoint
and will reap the benefits of the platform!
Q&A
References
• Werner Vogels, Eventually Consistent
  http://www.allthingsdistributed.com/2008/12/eventually_consistent.htm
• Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors"
• Avinash Lakshman, http://static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
• Eric Brewer, CAP, http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Tony Printezis, Charlie Hunt, JavaOne talk, http://www.scribd.com/doc/36090475/GC-Tuning-in-the-Java
• http://github.com/digitalreasoning/PyStratus/wiki/Documentation
• http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
• Corey Hulen, Cassandra on Cloud, http://www.coreyhulen.org/?p=326
• Cliff Click, Non-blocking HashMap, http://sourceforge.net/projects/high-scale-lib/
• Brian F. Cooper et al., Yahoo! Cloud Serving Benchmark, http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
Editor's Notes

  1. Typical write operation involves a write into a commit log for durability and recoverability and an update into an in-memory data structure. The write into the in-memory data structure is performed only after a successful write into the commit log. We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk. This write is performed on one of many commodity disks that machines are equipped with. All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file. Over time many such files could exist on disk and a merge process runs in the background to collate the different files into one file. This process is very similar to the compaction process that happens in the Bigtable system.
  2. "A typical read operation first queries the in-memory data structure before looking into the files on disk. The files are looked at in the order of newest to oldest. When a disk lookup occurs we could be looking up a key in multiple files on disk. In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory. This bloom filter is first consulted to check if the key being looked up does indeed exist in the given file. A key in a column family could have many columns. Some special indexing is required to retrieve columns which are further away from the key. In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads."
  3. Description of Graph: Shows the average number of cache misses expected when inserting into a hash table with various collision resolution mechanisms; on modern machines, this is a good estimate of actual clock time required. This seems to confirm the common heuristic that performance begins to degrade at about 80% table density. It is based on a simulated model of a hash table where the hash function chooses indexes for each insertion uniformly at random. The parameters of the model were: You may be curious what happens in the case where no cache exists. In other words, how does the number of probes (number of reads, number of comparisons) rise as the table fills? The curve is similar in shape to the one above, but shifted left: it requires an average of 24 probes for an 80% full table, and you have to go down to a 50% full table for only 3 probes to be required on average. This suggests that in the absence of a cache, ideally your hash table should be about twice as large for probing as for chaining.