SriSatish Ambati
Performance, Riptano, Cassandra
Azul Systems & OpenJDK
Twitter: @srisatish
srisatish.ambati@gmail.com
Cache & Concurrency considerations
for a high performance Cassandra
Trail ahead
Elements of Cache Performance
Metrics, Monitors
JVM goes to BigData Land!
Examples
Lucandra, Twissandra
Cassandra Performance with JVM
Commentary
Runtime Views
Non Blocking HashMap
Locking: concurrency
Garbage Collection
A feather in the CAP
• Eventual Consistency
  – Levels
  – Doesn’t mean data loss (journaled)
• SEDA
  – Partitioning, Cluster & Failure detection, Storage engine mod
  – Event driven & non-blocking io
  – Pure Java
Count what is countable, measure what is measurable, and what is not
measurable, make measurable
-Galileo
Elements of Cache Performance
Metrics
• Operations:
– Ops/s: Puts/sec, Gets/sec, updates/sec
– Latencies, percentiles (see the recorder sketch after this slide)
– Indexing
• # of nodes – scale, elasticity
• Replication
– Synchronous, Asynchronous (fast writes)
• Tuneable Consistency
• Durability/Persistence
• Size & Number of Objects, Size of Cache
• # of user clients
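Not in the original deck: a minimal sketch of how the Ops/s and latency-percentile metrics above might be tracked in-process. The class and method names are illustrative, and a real recorder would use a bounded histogram rather than an ever-growing list.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative latency recorder: collect per-op latencies (micros) and
// report a percentile. Real systems use histograms to bound memory.
public class LatencyRecorder {
    private final List<Long> samplesMicros = new ArrayList<Long>();

    public synchronized void record(long micros) {
        samplesMicros.add(micros);
    }

    // p in (0, 100], e.g. 99.0 for the p99 latency
    public synchronized long percentileMicros(double p) {
        if (samplesMicros.isEmpty()) return 0L;
        List<Long> sorted = new ArrayList<Long>(samplesMicros);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }
}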
Elements of Cache Performance:
“Think Locality”
• Hot or Not: The 80/20 rule.
– A small set of objects are very popular!
– What is the most retweeted (RT) tweet?
• Hit or Miss: Hit Ratio
– How effective is your cache? (see the cache sketch after this slide)
– LRU, LFU, FIFO.. Expiration
• Long-lived objects lead to better locality.
• Spikes happen
– Cascading events
– Cache Thrash: full table scans
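A minimal sketch of the hit/miss bookkeeping and LRU expiration mentioned above, built on java.util.LinkedHashMap; the class name and capacity handling are illustrative, not Cassandra's.

import java.util.LinkedHashMap;
import java.util.Map;

// Tiny LRU cache that tracks its own hit ratio.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;
    private long hits, misses;

    public LruCache(int capacity) {
        super(16, 0.75f, true);           // accessOrder=true gives LRU eviction order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;         // evict the least-recently-used entry
    }

    public V lookup(K key) {
        V v = get(key);
        if (v == null) misses++; else hits++;
        return v;
    }

    public double hitRatio() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}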
Real World Performance
• Facebook Inbox
– Writes:0.12ms, Reads:15ms @ 50GB data
• Twitter performance
– Twissandra (simulation)
• Cassandra for Search & Portals
– Lucandra, solandra (simulation)
• ycsb/PNUTS benchmarks
– 5ms read/writes @ 5k ops/s (50/50 Update heavy)
– 8ms reads/5ms writes @ 5k ops/s (95/5 read heavy)
• Lab environment
– ~5k writes per sec per node, <5ms latencies
– ~10k reads per sec per node, <5ms latencies
• Performance has improved in newer versions
yahoo cloud store benchmark
50/50 – Update Heavy
yahoo cloud store benchmark
95/5 – read heavy
JVM in BigData Land!
Limits for scale
• Locks : synchronized
– Can’t use all my multi-cores!
– java.util.collections also hold locks
– Use non-blocking collections!
• (de)Serialization is expensive
– Hampers object portability
– Use avro, thrift!
• Object overhead
– average enterprise collection has 3 elements!
– Use byte[ ], primitives where possible!
• Garbage Collection
– Can’t throw memory at the problem!
– Mitigate, Monitor, Measure footprint
Tools
• What is the JVM doing:
– dtrace, hprof, introscope, jconsole,
visualvm, yourkit, azul zvision
• Invasive JVM observation tools
– bci, jvmti, jvmdi/pi agents, jmx, logging
• What is the OS doing:
– dtrace, oprofile, vtune
• What are the network & disk doing:
– Ganglia, iostat, lsof, netstat, nagios
furiously fast writes
• Append only writes
– Sequential disk access
• No locks in critical path
• Key based atomicity
(Write path diagram: a client issues a write; the partitioner finds the owning node (n1, n2); that node appends the write to its commit log and then applies it to the in-memory memtable. A sketch follows.)
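A rough, single-node sketch of the write path in the diagram: append the mutation sequentially to a commit log, then apply it to an in-memory, per-key map. Names and record framing are illustrative; this is not Cassandra's code.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative single-node write path: sequential commit-log append,
// then an atomic per-key update of the in-memory "memtable".
public class WritePathSketch {
    private final FileChannel commitLog;
    private final ConcurrentSkipListMap<String, byte[]> memtable =
            new ConcurrentSkipListMap<String, byte[]>();   // sorted, per-key atomic puts

    public WritePathSketch(File logFile) throws IOException {
        this.commitLog = new FileOutputStream(logFile, true).getChannel();
    }

    public void write(String key, byte[] value) throws IOException {
        byte[] keyBytes = key.getBytes("UTF-8");
        ByteBuffer record = ByteBuffer.allocate(8 + keyBytes.length + value.length);
        record.putInt(keyBytes.length).put(keyBytes)
              .putInt(value.length).put(value);
        record.flip();
        commitLog.write(record);          // append only, sequential disk access
        // a real log would also force() periodically for durability
        memtable.put(key, value);         // key-based atomicity, no global lock
    }
}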
furiously fast writes
• Use separate disks for commitlog
– Don’t forget to size them well
– Isolation difficult in the cloud..
• Memtable/SSTable sizes
– Delicately balanced with GC
• memtable_throughput_in_mb
Cassandra on EC2 cloud
*Corey Hulen, EC2
Compactions
(Compaction diagram: several SSTables, each sorted by key, e.g. {K1, K2, K3, …}, {K2, K10, K30, …}, {K4, K5, K10, …}, are merge-sorted in memory into a single sorted SSTable {K1, K2, K3, K4, K5, K10, K30}; the merged data file is written together with an index file of key offsets (K1, K5, K30) and a Bloom filter, and deleted/obsolete rows are dropped. A merge-sort sketch follows.)
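Compaction is essentially a k-way merge over already-sorted runs. A small illustration using a priority queue follows; real compaction also reconciles row versions and purges tombstones, which this sketch omits.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of sorted key runs, the core of SSTable compaction.
public class MergeSortedRuns {

    private static class Entry {
        final String key;
        final Iterator<String> source;
        Entry(String key, Iterator<String> source) { this.key = key; this.source = source; }
    }

    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>(
                Math.max(1, sortedRuns.size()),
                new Comparator<Entry>() {
                    public int compare(Entry a, Entry b) { return a.key.compareTo(b.key); }
                });
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new Entry(it.next(), it));
        }
        List<String> merged = new ArrayList<String>();
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();
            // emit each key once; duplicates from older runs are skipped
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(smallest.key)) {
                merged.add(smallest.key);
            }
            if (smallest.source.hasNext()) {
                heap.add(new Entry(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }
}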
Compactions
• Intense disk io & mem churn
• Triggers GC for tombstones
• Minor/Major Compactions
• Reduce priority for better reads
• Other Parameters -
– CompactionManager.minimumCompactionThreshold=xxxx
Example: compaction in the real world, Cloudkick
reads design
reads performance
• BloomFilter used to identify the right file
• Maintain column indices to look up columns
– Which can span different SSTables
• Less io than typical b-tree
• Cold read: Two seeks
– One for the key lookup, another for the row lookup (see the read-path sketch after this slide)
• Key Cache
– Optimized in latest cassandra
• Row Cache
– Improves read performance
– GC sensitive for large rows.
• Most (google) applications require single row
transactions*
*Sanjay G, BigTable Design, Google.
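A rough sketch of the read flow above: check the memtable first, then walk SSTables newest-to-oldest, consulting each file's Bloom filter before paying for disk seeks. The SSTable interface here is hypothetical, for illustration only.

import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative read path: memtable first, then SSTables newest-to-oldest,
// consulting each SSTable's Bloom filter before touching disk.
public class ReadPathSketch {

    interface SSTable {
        boolean mightContain(byte[] key);   // Bloom filter: no false negatives
        byte[] read(byte[] key);            // index lookup + data-file seek, may return null
    }

    private final ConcurrentSkipListMap<String, byte[]> memtable =
            new ConcurrentSkipListMap<String, byte[]>();
    private final List<SSTable> sstablesNewestFirst;

    ReadPathSketch(List<SSTable> sstablesNewestFirst) {
        this.sstablesNewestFirst = sstablesNewestFirst;
    }

    byte[] get(String key) {
        byte[] v = memtable.get(key);
        if (v != null) return v;                           // freshest data wins
        byte[] keyBytes = key.getBytes();
        for (SSTable sstable : sstablesNewestFirst) {
            if (!sstable.mightContain(keyBytes)) continue; // skip files that cannot hold the key
            v = sstable.read(keyBytes);                    // cold read: key lookup + row lookup
            if (v != null) return v;                       // null here means a false positive
        }
        return null;
    }
}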
Client Performance
Marshal Arts:
Ser/Deserialization
• Clients dominated by Thrift, Avro
– Hector, Pelops
• Thrift: upgrade to latest: 0.5, 0.4
• No news: java.io.Serializable is S..L..O..W
• Use “transient”
• avro, thrift, proto-buf
• Common Patterns of Doom:
– Death by a million gets
Serialization + Deserialization µBench
• http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
Adding Nodes
• New nodes
– Add themselves to busiest node
– And then Split its Range
• Busy Node starts transmit to new node
• Bootstrap logic initiated from any node, cli, web
• Each node capable of ~40MB/s
– Multiple replicas to parallelize bootstrap
• UDP for control messages
• TCP for request routing
inter-node comm
• Gossip Protocol
– It’s exponential
– (epidemic algorithm; see the toy simulation after this slide)
• Failure Detector
– Accrual rate phi
• Anti-Entropy
– Bringing replicas up to date
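To see why gossip spread is "exponential", a toy epidemic simulation: each round every informed node tells one random peer, so coverage roughly doubles per round. This sketches the epidemic idea only, not Cassandra's gossip protocol.

import java.util.Random;

// Toy epidemic/gossip simulation: each round, every informed node gossips
// to one random peer; the informed set grows roughly exponentially.
public class GossipSketch {
    public static void main(String[] args) {
        int nodes = 1024;
        boolean[] informed = new boolean[nodes];
        informed[0] = true;                       // rumor starts at node 0
        Random rnd = new Random(42);

        int round = 0, informedCount = 1;
        while (informedCount < nodes) {
            boolean[] next = informed.clone();
            for (int i = 0; i < nodes; i++) {
                if (informed[i]) next[rnd.nextInt(nodes)] = true;   // tell one random peer
            }
            informed = next;
            informedCount = 0;
            for (boolean b : informed) if (b) informedCount++;
            round++;
            System.out.println("round " + round + ": " + informedCount + " of " + nodes);
        }
    }
}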
Bloom Filter: in full bloom
• “constant” time
• size: compact
• false positives
• Single lookup for key in file
• No deletion (in the basic filter)
• Improve
– Counting BF
– Bloomier filters
(A minimal implementation sketch follows.)
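A minimal Bloom filter to make the trade-offs above concrete; the double-hashing scheme here is purely illustrative and not the hash functions Cassandra uses.

import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter: k bit positions per key, derived by double hashing.
// No false negatives; false-positive rate depends on bits-per-key and k.
// A plain Bloom filter cannot delete (a counting Bloom filter can).
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | (h1 << 16);      // cheap second hash, illustration only
        return Math.abs((h1 + i * h2) % numBits);
    }

    public void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    public boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) return false;  // definitely absent
        }
        return true;                                        // present, or a false positive
    }
}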
Birthdays, Collisions &
Hashing functions
• Birthday Paradox
For the N=21 people in this room, the probability that at least 2 of them
share the same birthday is ~0.44 (quick check below)
• Collisions are real!
• An unbalanced HashMap degrades to a list: O(n) retrieval
• Chaining & Linear probing
• Performance degrades at ~80% table density
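A minimal check of the ~0.44 figure, assuming birthdays uniformly distributed over 365 days: the probability that no two of N people collide is (365/365)(364/365)…((365-N+1)/365), and one minus that gives the quoted value.

// Quick check of the birthday-paradox figure quoted above.
public class BirthdayParadox {
    public static void main(String[] args) {
        int people = 21;
        double pNoCollision = 1.0;
        for (int i = 0; i < people; i++) {
            pNoCollision *= (365.0 - i) / 365.0;   // i-th person avoids all earlier birthdays
        }
        // prints roughly 0.44 for 21 people; the probability crosses 0.5 at 23 people
        System.out.printf("P(shared birthday among %d people) = %.2f%n", people, 1 - pNoCollision);
    }
}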
the devil’s in the details
CFS
• All in the family!
• denormalize
Memtable
• In-memory
• ColumnFamily specific
• throughput determines size before flush
• Larger memtables can improve reads
SSTable
• MemTable “flushes” to an SSTable
• Immutable after flush
• Read: Multiple SSTable lookups possible
• Chief Execs:
– SSTableWriter
– SSTableReader
Write: Runtime threads
Writes: runtime mem
Example: Java Overheads
writes: monitors
UUID
• java.util.UUID is slow
– static use leads to contention
SecureRandom
• Uses /dev/urandom for seed initialization
-Djava.security.egd=file:/dev/urandom
• PRNG without the file is at least 20%-40% faster.
• Use TimeUUIDs where possible – much faster (see the micro-benchmark sketch after this slide)
• JUG – java.uuid.generator
• http://github.com/cowtowncoder/java-uuid-generator
• http://jug.safehaus.org/
• http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
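A crude way to see the SecureRandom cost: time UUID.randomUUID(), which funnels all threads through one shared SecureRandom, against version-4-style UUIDs built from a per-thread java.util.Random. Illustrative only: no proper benchmark harness, and plain Random is not cryptographically strong.

import java.util.Random;
import java.util.UUID;

// Crude micro-benchmark: SecureRandom-backed UUID.randomUUID() vs. UUIDs built
// from a plain per-thread Random. Illustrative only (no warmup, not crypto-safe).
public class UuidCost {
    public static void main(String[] args) {
        int n = 200000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) UUID.randomUUID();           // shared SecureRandom underneath
        long secureNanos = System.nanoTime() - t0;

        Random r = new Random();                                 // one per thread in real code
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) new UUID(r.nextLong(), r.nextLong());
        long plainNanos = System.nanoTime() - t1;

        System.out.println("randomUUID():      " + secureNanos / n + " ns/op");
        System.out.println("per-thread Random: " + plainNanos / n + " ns/op");
    }
}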
synchronized
• Coarse grained locks
• io under lock
• Stop signal on a highway
• java.util.concurrent does not mean no locks
• Non Blocking, Lock free, Wait free collections
Scalable Lock-Free Coding Style
• Big Array to hold Data
• Concurrent writes via: CAS & Finite State Machine
– No locks, no volatile
– Much faster than locking under heavy load
– Directly reach main data array in 1 step
• Resize as needed
– Copy Array to a larger Array on demand
– Use State Machine to help copy
– “Mark” old Array words to avoid missing late updates
(A CAS-retry sketch follows.)
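The CAS-retry idiom at the heart of this style, shown on a plain AtomicLongArray ("big array to hold data"): read a slot, compute the new value, compareAndSet it, and retry on contention instead of locking. The resize and state-machine copy of the real NonBlockingHashMap are far more involved; this only sketches the per-slot update.

import java.util.concurrent.atomic.AtomicLongArray;

// The basic CAS-retry loop used by lock-free structures: no locks, no blocking;
// a losing thread simply re-reads and retries.
public class CasCounterSlots {
    private final AtomicLongArray slots;           // "big array to hold data"

    public CasCounterSlots(int size) {
        this.slots = new AtomicLongArray(size);
    }

    public long increment(int slot) {
        while (true) {
            long current = slots.get(slot);
            long updated = current + 1;
            if (slots.compareAndSet(slot, current, updated)) {
                return updated;                    // CAS won: update is visible to all threads
            }
            // CAS lost: another thread changed the slot first, retry
        }
    }
}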
Non-Blocking HashMap
(Charts: throughput in M-ops/sec vs. number of threads, up to ~800 threads, comparing NonBlockingHashMap (NB) and ConcurrentHashMap (CHM) at 75% and 99% read mixes, for 1K-entry and 1M-entry tables, measured on an Azul Vega2 with 768 cpus.)
Cassandra uses High Scale
Non-Blocking Hashmap
public class BinaryMemtable implements IFlushable
{
  …
  // NonBlockingHashMap comes from Cliff Click's high-scale-lib
  private final Map<DecoratedKey, byte[]> columnFamilies =
      new NonBlockingHashMap<DecoratedKey, byte[]>();
  /* Lock and Condition for notifying new clients about Memtable switches */
  private final Lock lock = new ReentrantLock();
  Condition condition;
  …
}

public class Table
{
  …
  private static final Map<String, Table> instances =
      new NonBlockingHashMap<String, Table>();
  …
}
GC-sensitive elements within
Cassandra
• Compaction triggers System.gc()
– Tombstones from files
• “GCInspector” logs GC pauses (see the sketch after this slide)
• Memtable Threshold, sizes
• SSTable sizes
• Low overhead collection choices
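The GCInspector idea can be approximated with the standard GarbageCollectorMXBean API: poll collection counts and accumulated collection time, and report when a polling interval shows a large jump. This is a sketch, not the actual GCInspector code; the threshold and scheduling are arbitrary.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// GCInspector-like sketch: periodically poll GC MXBeans and report how much
// time was spent collecting since the previous poll.
public class GcWatcher implements Runnable {
    private final Map<String, Long> lastTimeMs = new HashMap<String, Long>();

    public void run() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long total = gc.getCollectionTime();                  // ms spent in this collector so far
            Long prev = lastTimeMs.put(gc.getName(), total);
            long delta = (prev == null) ? total : total - prev;
            if (delta > 200) {                                    // arbitrary threshold for the sketch
                System.out.println(gc.getName() + " spent " + delta
                        + " ms in GC over the last interval ("
                        + gc.getCollectionCount() + " collections total)");
            }
        }
    }
    // Schedule with e.g. Executors.newSingleThreadScheduledExecutor()
    //   .scheduleAtFixedRate(new GcWatcher(), 1, 1, TimeUnit.SECONDS);
}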
Garbage Collection
• Pause Times
if stop_the_world_FullGC > ttl_of_node
=> failed requests; failure accrual & node repair.
• Allocation Rate
– New object creation, insertion rate
• Live Objects (residency)
– if residency in heap > 50%
– GC overheads dominate.
• Overhead
– space, cpu cycles spent GC
• 64-bit not addressing pause times
– Bigger is not better!
Memory Fragmentation
• Fragmentation
– Performance degrades over time
– Inducing “Full GC” makes problem go away
– Free memory that cannot be used
• Reduce occurrence
– Use a compacting collector
– Promote less often
– Use uniform sized objects
• Solution – unsolved
– Use latest CMS with CR:6631166
– Azul’s Zing JVM & Pauseless GC
CASSANDRA-1014
Best Practices:
Garbage Collection
• GC Logs are cheap even in
production
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC
• Slightly expensive ones:
-XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1
-XX:+PrintCMSInitiationStatistics
Sizing: Young Generation
• Should we set –Xms == -Xmx ?
• Use –Xmn (fixed eden)
(Heap diagram: allocations, i.e. new Object(), land in eden; the survivor spaces, survivor ratio, and tenuring threshold govern how objects age before promotion into the old generation; some allocation is performed directly by the jvm.)
Tuning CMS
• Don’t promote too often!
– Frequent promotion causes fragmentation
• Size the generations
– Min GC times are a function of Live Set
– Old Gen should host steady state comfortably
• Parallelize on multicores:
– -XX:ParallelCMSThreads=4
– -XX:ParallelGCThreads=4
• Avoid CMS Initiating heuristic
– -XX:+UseCMSInitiatingOccupancyOnly
• Use Concurrent for System.gc()
– -XX:+ExplicitGCInvokesConcurrent
Summary
Design & Implementation of Cassandra takes advantage of the JVM's
strengths while avoiding common JVM issues.
• Locks:
– Avoids locks in critical path
– Uses non-blocking collections, TimeUUIDs!
– Still Can’t use all my multi-cores..?
>> Other bottlenecks to find!
• De/Serialization:
– Uses avro, thrift!
• Object overhead
– Uses mostly byte[ ], primitives where possible!
• Garbage Collection
– Mitigate: Monitor, Measure footprint.
– Work in progress by all jvm vendors!
Cassandra starts from a great footing from a JVM standpoint
and will reap the benefits of the platform!
Q&A
References
• Werner Vogels, Eventually Consistent
  http://www.allthingsdistributed.com/2008/12/eventually_consistent.htm
• Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors"
• Avinash Lakshman, http://static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
• Eric Brewer, CAP, http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Tony Printezis, Charlie Hunt, JavaOne talk, http://www.scribd.com/doc/36090475/GC-Tuning-in-the-Java
• http://github.com/digitalreasoning/PyStratus/wiki/Documentation
• http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
• Corey Hulen, Cassandra on Cloud, http://www.coreyhulen.org/?p=326
• Cliff Click, Non-blocking HashMap, http://sourceforge.net/projects/high-scale-lib/
• Brian F. Cooper et al., Yahoo! Cloud Serving Benchmark, http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
Editor's Notes

  1. Typical write operation involves a write into a commit log for durability and recoverability and an update into an in-memory data structure. The write into the in-memory data structure is performed only after a successful write into the commit log. We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk. This write is performed on one of many commodity disks that machines are equipped with. All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file. Over time many such files could exist on disk and a merge process runs in the background to collate the different files into one file. This process is very similar to the compaction process that happens in the Bigtable system.
  2. "A typical read operation first queries the in-memory data structure before looking into the files on disk. The files are looked at in the order of newest to oldest. When a disk lookup occurs we could be looking up a key in multiple files on disk. In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory. This bloom filter is first consulted to check if the key being looked up does indeed exist in the given file. A key in a column family could have many columns. Some special indexing is required to retrieve columns which are further away from the key. In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads."
  3. Description of Graph: Shows the average number of cache misses expected when inserting into a hash table with various collision resolution mechanisms; on modern machines, this is a good estimate of actual clock time required. This seems to confirm the common heuristic that performance begins to degrade at about 80% table density. It is based on a simulated model of a hash table where the hash function chooses indexes for each insertion uniformly at random. The parameters of the model were: You may be curious what happens in the case where no cache exists. In other words, how does the number of probes (number of reads, number of comparisons) rise as the table fills? The curve is similar in shape to the one above, but shifted left: it requires an average of 24 probes for an 80% full table, and you have to go down to a 50% full table for only 3 probes to be required on average. This suggests that in the absence of a cache, ideally your hash table should be about twice as large for probing as for chaining.