Wikimedia Content API:
A Cassandra Use-case
Eric Evans <eevans@wikimedia.org>
@jericevans
Apache Bigdata | May 10, 2016
Our Vision:
A world in which every single human can freely
share in the sum of all knowledge.
About:
● Global movement
● Largest collection of free, collaborative knowledge in human history
● 16 projects
● 16.5 billion total page views per month
● 58.9 million unique devices per day
● More than 13k new editors each month
● More than 75k active editors month-to-month
About: Wikipedia
● More than 38 million articles in 290 languages
● Over 10k new articles added per day
● 13 million edits per month
● Ranked #6 globally in web traffic
Wikimedia Architecture
LAMP
THE ARCHITECTURE
ALL OF IT
Wikitext
= Star Wars: The Force Awakens =
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by [[J. J. Abrams]].
HTML
<h1>
Star Wars: The Force Awakens
</h1>
<p>
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by
<a href="/wiki/J._J._Abrams" title="J. J. Abrams">
J. J. Abrams
</a>
</p>
Wikitext
HTML
Conversion
wikitext html
WYSIWYG
Conversion
wikitext html
Character-based diffs
Metadata
[[Foo|bar]]
<a rel="mw:WikiLink" href="./Foo">bar</a>
Metadata
[[Foo|{{echo|bar}}]]
<a rel="mw:WikiLink" href="./Foo">
<span about="#mwt1" typeof="mw:Object/Template"
data-parsoid="{...}" >bar</span>
</a>
Parsoid
● Node.js service
● Converts wikitext to HTML/RDFa
● Converts HTML/RDFa to wikitext
● Preserves semantics and syntax (avoids dirty diffs)!
● Expensive (slow)
● Resulting output is large
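To make the round-trip idea concrete, here is a toy sketch of how the `rel="mw:WikiLink"` annotation from the Metadata slides lets a converter recover the original link syntax from HTML. It is not Parsoid's code (Parsoid is a Node.js service); the class and function names are hypothetical, and only the simple `[[Foo|bar]]` case is handled.

```python
from html.parser import HTMLParser


class WikiLinkSerializer(HTMLParser):
    """Toy HTML -> wikitext serializer that only understands mw:WikiLink anchors."""

    def __init__(self):
        super().__init__()
        self.out = []          # pieces of the wikitext output
        self._target = None    # link target while inside an annotated <a>
        self._label = []       # link text collected while inside the <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "mw:WikiLink":
            href = attrs.get("href", "")
            # The sample markup uses relative hrefs such as "./Foo".
            self._target = href[2:] if href.startswith("./") else href
            self._label = []

    def handle_data(self, data):
        if self._target is not None:
            self._label.append(data)
        else:
            self.out.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            label = "".join(self._label)
            if label == self._target:
                self.out.append(f"[[{self._target}]]")
            else:
                self.out.append(f"[[{self._target}|{label}]]")
            self._target = None


def html_to_wikitext(html):
    parser = WikiLinkSerializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(html_to_wikitext('See <a rel="mw:WikiLink" href="./Foo">bar</a>.'))
```

Without the `rel` annotation a serializer would have to guess whether an anchor came from a wiki link, an external link, or a template, which is where dirty diffs come from.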
RESTBase
● Services aggregator / proxy (REST)
● Durable cache (Cassandra)
● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
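As an illustration of what a client request to the content API looks like, the sketch below builds the URL for a page's stored HTML. The `/page/html/{title}` route comes from the public REST API documentation rather than from these slides, and the helper name is hypothetical.

```python
from urllib.parse import quote

BASE = "https://en.wikipedia.org/api/rest_v1"


def page_html_url(title, base=BASE):
    """Build the content-API URL for the rendered HTML of a page title."""
    # Page titles use underscores in place of spaces.
    normalized = title.replace(" ", "_")
    # Percent-encode everything else that is not URL-safe, including ':' and '/'.
    return f"{base}/page/html/{quote(normalized, safe='')}"


print(page_html_url("Star Wars: The Force Awakens"))
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the Parsoid HTML for the page; the network call is left out so the sketch runs offline.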
[Diagram: RESTBase instances backed by Cassandra, in front of Parsoid and other services (...)]
Other use-cases
● Mobile content service
● Math formula rendering service
● Dumps
● ...
Cassandra
Environment
● 2 datacenters
● 3 racks per datacenter
● 18 hosts (16 core, 128G, SSDs)
● 36 nodes
● Deflate compression (compressed size ~14-18% of uncompressed)
● 31T storage (~206T uncompressed)
● Cassandra 2.1.13 (moving to 2.2.6)
● Read-heavy workload (5:1)
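A quick arithmetic check of the figures above, as a sketch. Only the numbers from the slide are used; the function name is hypothetical, and the per-node figure assumes data is spread evenly, which the slides do not state.

```python
def cluster_summary(hosts=18, nodes=36, compressed_tb=31, uncompressed_tb=206):
    """Derive a few ratios from the environment figures on the slide."""
    return {
        # Cassandra instances per physical host
        "nodes_per_host": nodes // hosts,
        # On-disk size as a fraction of the uncompressed size
        "compression_ratio": compressed_tb / uncompressed_tb,
        # Assumes an even spread of data across nodes (an assumption)
        "compressed_tb_per_node": compressed_tb / nodes,
    }


summary = cluster_summary()
print(f"nodes per host:    {summary['nodes_per_host']}")
print(f"compression ratio: {summary['compression_ratio']:.1%}")
print(f"TB per node:       {summary['compressed_tb_per_node']:.2f}")
```

The storage totals work out to roughly 15%, inside the ~14-18% range quoted for deflate, and 36 nodes on 18 hosts means two Cassandra instances per machine.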
Data model
CREATE TABLE data (
    domain text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY ((domain, title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
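One way to read this schema: each (domain, title) pair is a partition, and within it rows are kept newest revision first, then newest render first. The sketch below models that ordering in plain Python so the "latest render" lookup is visible. It does not talk to Cassandra; `Row`, `partition_rows`, and `latest` are hypothetical names, and the timeuuid is stood in for by an integer.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    domain: str
    title: str
    rev: int
    tid: int       # stand-in for a timeuuid; larger means later (assumption)
    value: bytes


def partition_rows(rows, domain, title):
    """Rows of one partition, in CLUSTERING ORDER BY (rev DESC, tid DESC)."""
    selected = [r for r in rows if (r.domain, r.title) == (domain, title)]
    return sorted(selected, key=lambda r: (r.rev, r.tid), reverse=True)


def latest(rows, domain, title) -> Optional[Row]:
    """The first row of the partition: newest render of the newest revision."""
    ordered = partition_rows(rows, domain, title)
    return ordered[0] if ordered else None


rows = [
    Row("en.wikipedia.org", "Star_Wars:_The_Force_Awakens", 717862573, 1, b"old"),
    Row("en.wikipedia.org", "Star_Wars:_The_Force_Awakens", 717873822, 2, b"first render"),
    Row("en.wikipedia.org", "Star_Wars:_The_Force_Awakens", 717873822, 3, b"re-render"),
]
print(latest(rows, "en.wikipedia.org", "Star_Wars:_The_Force_Awakens").value)
```

With the descending clustering order, the equivalent read in Cassandra is a single-partition query that takes the first row, so the most common request touches the head of the partition.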
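Data model
[Diagram: one partition and its rows]
Partition key: en.wikipedia.org + Star_Wars:_The_Force_Awakens
Revisions (rev): 717862573, 717873822
Renders (tid, truncated): ..., 97466b12...7c7a913d3d8a, 1f2dd66c...7c7a913d3d8a, ..., 09877568...7c7a913d3d8a, bdebc9a6...7c7a913d3d8a, 827e2ec2...7c7a913d3d8a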
Compression
chunk_length_kb
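`chunk_length_kb` sets the size of the blocks that are compressed independently, so it bounds how much repetition the compressor can see. The sketch below illustrates the effect with Python's `zlib` (deflate) on synthetic data: a 4 KB pseudo-random "document" repeated 64 times, standing in for many near-identical renders. It is a toy, not Cassandra's compressor, and the helper names are hypothetical.

```python
import random
import zlib


def sample_data(doc_bytes=4096, copies=64, seed=0):
    """A pseudo-random document repeated many times (synthetic stand-in)."""
    rng = random.Random(seed)
    doc = bytes(rng.randrange(256) for _ in range(doc_bytes))
    return doc * copies


def compressed_size(data, chunk_bytes):
    """Total size when each chunk is deflated independently."""
    total = 0
    for start in range(0, len(data), chunk_bytes):
        total += len(zlib.compress(data[start:start + chunk_bytes], 6))
    return total


data = sample_data()
for chunk_kb in (1, 4, 64):
    size = compressed_size(data, chunk_kb * 1024)
    print(f"chunk={chunk_kb:>2} KB -> {size / len(data):.1%} of original")
```

Chunks smaller than the repeated document cannot exploit the repetition at all, while larger chunks can. Deflate itself only looks back 32 KB, so growing the chunk far beyond that yields little extra with this codec.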
Brotli compression
● Brought to you by the folks at Google; Successor to deflate
● Cassandra implementation (https://github.com/eevans/cassandra-brotli)
● Initial results very promising
● Better compression, lower cost (apples-apples)
● And, wider windows are possible (apples-oranges)
○ GC/memory permitting
○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size!
○ https://phabricator.wikimedia.org/T122028
● Stay tuned!
Compaction
● The cost of having log-structured storage
● Asynchronously (post-write) optimize data on disk for reads
● At a minimum, reorganize into fewer files
○ Dropping what is obsolete
○ Expiring TTLs
○ Removing deleted (aka tombstoned) data (after a fashion)
● Reorganize data so results are nearer each other
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Needs working space for largest possible compaction (100%)
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Amortized; Lots of continuous compaction
○ Very efficient reads, but also quite IO intensive
● Date-tiered
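The size-tiered bullet ("combines tables of similar size") can be sketched as a bucketing pass over SSTable sizes. This is a simplified model, not Cassandra's implementation: the function names are hypothetical, and the 0.5/1.5 bucket bounds and threshold of 4 are used here as illustrative defaults.

```python
def bucket_by_size(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sizes so each member is within [low, high] x the bucket's running average."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size; start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Buckets holding enough similar-sized tables to be worth compacting."""
    return [b for b in bucket_by_size(sizes) if len(b) >= min_threshold]


sstable_sizes_mb = [10, 11, 12, 9, 100, 105, 1000]
print(bucket_by_size(sstable_sizes_mb))
print(compaction_candidates(sstable_sizes_mb))
```

Because grouping looks only at size, the strategy knows nothing about which partitions or columns a table holds, which is why the slide calls it oblivious to column distribution.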
Date-tiered compaction
● Newest of the bunch; Introduced in 2.1
● For append-only workloads (no overwrites, no deletes)
● Where data is written in ascending time order (and never out-of-order)
● Windows of time, arranged in tiers
● Avoids mixing “old” data with “new”
● Cold data eventually ceases to be compacted
[Diagram: SSTables grouped into time windows arranged in tiers; labeled parameters: min_threshold (4), base_time_seconds]
● Hard to reason about
● Optimizations are easily defeated
● https://phabricator.wikimedia.org/T126221
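To show what "windows of time, arranged in tiers" means, the sketch below lists the windows walking backwards from the current time: the newest windows are `base_time_seconds` wide, and each time a window boundary lines up with a multiple of `min_threshold`, the next (older) window becomes `min_threshold` times wider. This is a simplified model based on my reading of the strategy, not code from Cassandra, and the function name is hypothetical.

```python
def dtcs_windows(now, base_time_seconds, min_threshold=4, count=6):
    """Return (start, end) second ranges, newest window first."""
    size = base_time_seconds
    position = now // size          # index of the window containing `now`
    windows = []
    while len(windows) < count and position >= 0:
        windows.append((position * size, (position + 1) * size))
        if position % min_threshold > 0:
            # Stay in the same tier: step back one window of the same width.
            position -= 1
        else:
            # Boundary aligned with the next tier: older windows get wider.
            size *= min_threshold
            position = position // min_threshold - 1
    return windows


for start, end in dtcs_windows(now=1085, base_time_seconds=60):
    print(f"[{start:>5}, {end:>5})  width={end - start}s")
```

Data falls into a window according to its timestamp, so a late, out-of-order write lands in an old, wide window and drags cold SSTables back into compaction. That is one way the optimizations are easily defeated.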
DTCS: So now what?
● Size-tiered compaction? Might as well.
● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)?
Maybe...
● Reduce node density?
Garbage Collection
GC
● Early adopters of G1 (Garbage-First)
● Successor to Concurrent Mark-sweep (CMS)
● Incremental parallel compacting collector
● More predictable than CMS
● Configurable pause-time target
Concurrent Mark-sweep
Heap
Old generation
Young generation
Eden Survivor
G1
[Diagram: heap divided into equal-sized regions, each labeled Eden, Survivor, or Old-Gen]
Humongous objects
● Anything >= ½ region size is classified as Humongous
● Humongous objects are allocated into Humongous Regions
● Only one object for a region (wastes space, creates fragmentation)
● Until 1.8u40, humongous regions collected only during full collections (Bad)
● Since 1.8u40, also collected at the end of the marking cycle, during the cleanup phase (Better)
● Treated as exceptions, so should be exceptional
○ For us, that means 8MB regions
● Enable GC logging and have a look!
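The humongous-object rule is simple arithmetic, sketched below using the slide's ">= half a region" threshold. The helper names are hypothetical; the 8 MB region size is the value quoted above, and the sizes of actual allocations would have to come from GC logs.

```python
import math

MB = 1024 * 1024


def is_humongous(object_bytes, region_bytes):
    """Slide's rule: anything at least half a region is humongous."""
    return object_bytes >= region_bytes // 2


def regions_needed(object_bytes, region_bytes):
    """A humongous object occupies whole regions, shared with nothing else."""
    return math.ceil(object_bytes / region_bytes)


def wasted_bytes(object_bytes, region_bytes):
    """Space left unusable in the last region of a humongous allocation."""
    return regions_needed(object_bytes, region_bytes) * region_bytes - object_bytes


region = 8 * MB
for size in (3 * MB, 4 * MB, 9 * MB):
    if is_humongous(size, region):
        print(f"{size // MB} MB: humongous, {regions_needed(size, region)} region(s), "
              f"{wasted_bytes(size, region) // MB} MB wasted")
    else:
        print(f"{size // MB} MB: regular allocation")
```

Raising the region size raises the threshold, so fewer large allocations take the humongous path. That is the trade-off behind choosing 8 MB regions.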
Multi-instance
“Many smaller-sized Cassandra nodes is
always better than fewer, dense ones.”
— Everyone
Motivation
● Compaction
● GC
What we do
● Processes (yup)
● Puppetized configuration
○ /etc/cassandra-a/
○ /etc/cassandra-b/
○ systemd units
○ Etc
● Shared RAID-0
What we should have done
● Virtualization
● Containers
● Blades
The Good
● Fault-tolerance
● Availability
● Datacenter / rack awareness
● Nice, helpful people (tickets, IRC, etc)
The Bad
● Usability
○ Compaction
○ Streaming
○ JMX
● Vertical scaling
● JVM
The Ugly
● Release process
● Upgrades
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 

Recently uploaded (20)

BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 

Wikimedia Content API: A Cassandra Use-case

  • 1. Wikimedia Content API: A Cassandra Use-case Eric Evans <eevans@wikimedia.org> @jericevans Apache Bigdata | May 10, 2016
  • 3. Our Vision: A world in which every single human can freely share in the sum of all knowledge.
  • 4. About:
      ● Global movement
      ● Largest collection of free, collaborative knowledge in human history
      ● 16 projects
      ● 16.5 billion total page views per month
      ● 58.9 million unique devices per day
      ● More than 13k new editors each month
      ● More than 75k active editors month-to-month
  • 5. About: Wikipedia
      ● More than 38 million articles in 290 languages
      ● Over 10k new articles added per day
      ● 13 million edits per month
      ● Ranked #6 globally in web traffic
  • 11. Wikitext
      = Star Wars: The Force Awakens =
      Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by [[J. J. Abrams]].
  • 12. HTML
      <h1>Star Wars: The Force Awakens</h1>
      <p>Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by <a href="/wiki/J._J._Abrams" title="J. J. Abrams">J. J. Abrams</a></p>
  • 14. HTML
  • 20. Metadata
      [[Foo|{{echo|bar}}]]
      <a rel="mw:WikiLink" href="./Foo">
        <span about="#mwt1" typeof="mw:Object/Template" data-parsoid="{...}">bar</span>
      </a>
  • 21. Parsoid
      ● Node.js service
      ● Converts wikitext to HTML/RDFa
      ● Converts HTML/RDFa to wikitext
      ● Semantics, and syntax (avoid dirty diffs)!
      ● Expensive (slow)
      ● Resulting output is large
  • 22. RESTBase
      ● Services aggregator / proxy (REST)
      ● Durable cache (Cassandra)
      ● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
  • 24. Other use-cases
      ● Mobile content service
      ● Math formula rendering service
      ● Dumps
      ● ...
  • 26. Environment
      ● 2 datacenters
      ● 3 racks per datacenter
      ● 18 hosts (16 core, 128G, SSDs)
      ● 36 nodes
      ● Deflate compression (~14-18%)
      ● 31T storage (~206T uncompressed)
      ● Cassandra 2.1.13 (moving to 2.2.6)
      ● Read-heavy workload (5:1)
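The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; reading "T" as terabytes and spreading the data evenly across nodes are assumptions made for illustration.

```python
# Numbers quoted on the Environment slide.
hosts = 18
nodes = 36
stored_tb = 31          # compressed, on disk
uncompressed_tb = 206   # approximate

nodes_per_host = nodes // hosts
compressed_fraction = stored_tb / uncompressed_tb
# Assumption: data is spread evenly across all nodes.
per_node_tb = stored_tb / nodes

print(f"{nodes_per_host} Cassandra nodes per host")
print(f"compressed size is {compressed_fraction:.1%} of uncompressed")
print(f"about {per_node_tb:.2f}T stored per node")
```

The computed ratio of roughly 15% falls inside the ~14-18% range given for deflate.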
  • 28. Data model
      CREATE TABLE data (
          domain text,
          title text,
          rev int,
          tid timeuuid,
          value blob,
          PRIMARY KEY ((domain, title), rev, tid)
      ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
  • 29. Data model [Figure: one partition, keyed by en.wikipedia.org + Star_Wars:_The_Force_Awakens, holding revisions 717862573 and 717873822 with timeuuid-identified values 97466b12...7c7a913d3d8a, 1f2dd66c...7c7a913d3d8a, ..., 09877568...7c7a913d3d8a, bdebc9a6...7c7a913d3d8a, 827e2ec2...7c7a913d3d8a]
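To make the effect of the partition key and clustering order concrete, here is a toy in-memory model of the table. It is only a sketch, not a Cassandra client: the class name, its methods, and the small integers standing in for timeuuids are all hypothetical.

```python
from bisect import insort

class ContentTable:
    """Toy in-memory stand-in for the `data` table (not a Cassandra client)."""

    def __init__(self):
        # partition key (domain, title) -> rows kept in (rev DESC, tid DESC) order
        self._partitions = {}

    def insert(self, domain, title, rev, tid, value):
        rows = self._partitions.setdefault((domain, title), [])
        # Negate the clustering columns so an ascending sort yields DESC order.
        insort(rows, (-rev, -tid, value))

    def latest(self, domain, title):
        """Newest render of the newest revision: the first row of the partition."""
        rows = self._partitions.get((domain, title), [])
        if not rows:
            return None
        neg_rev, neg_tid, value = rows[0]
        return (-neg_rev, -neg_tid, value)

    def renders(self, domain, title, rev):
        """All stored renders of one revision, newest first."""
        rows = self._partitions.get((domain, title), [])
        return [(-r, -t, v) for r, t, v in rows if -r == rev]

# Hypothetical rows; integer tids stand in for timeuuids.
table = ContentTable()
key = ("en.wikipedia.org", "Star_Wars:_The_Force_Awakens")
table.insert(*key, rev=717862573, tid=1, value=b"<html>old</html>")
table.insert(*key, rev=717873822, tid=2, value=b"<html>first render</html>")
table.insert(*key, rev=717873822, tid=3, value=b"<html>re-render</html>")
print(table.latest(*key))
```

Because rows sort newest-first within a partition, fetching the current content of a page amounts to reading the first row.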
  • 33. Brotli compression
      ● Brought to you by the folks at Google; Successor to deflate
      ● Cassandra implementation (https://github.com/eevans/cassandra-brotli)
      ● Initial results very promising
      ● Better compression, lower cost (apples-apples)
      ● And, wider windows are possible (apples-oranges)
          ○ GC/memory permitting
          ○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size!
          ○ https://phabricator.wikimedia.org/T122028
      ● Stay tuned!
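Why a wider window helps can be illustrated without Brotli itself. The sketch below uses deflate from Python's standard `zlib` module as a stand-in and varies only the window size; the test data (one random block repeated, loosely imitating near-identical renders stored side by side) is made up for the demonstration.

```python
import random
import zlib

def deflate_size(data, wbits):
    """Compressed size of `data` using deflate with a 2**wbits byte window."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return len(compressor.compress(data) + compressor.flush())

# Hypothetical data: one 8 KiB block of incompressible bytes repeated 8 times.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(8192))
data = block * 8

narrow = deflate_size(data, wbits=9)    # 512-byte window: repeats are out of reach
wide = deflate_size(data, wbits=15)     # 32 KiB window: repeats are in range

print(f"original: {len(data)} bytes")
print(f"512 B window: {narrow} bytes, 32 KiB window: {wide} bytes")
```

With the narrow window the repeats sit too far back to be matched, so the output stays close to the input size; the wide window stores the block once and references it afterwards. Deflate tops out at a 32 KiB window, which is the limit the slide's "wider windows" point is about.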
  • 35. Compaction
      ● The cost of having log-structured storage
      ● Asynchronously (post-write) optimize data on disk for reads
      ● At a minimum, reorganize into fewer files
          ○ Dropping what is obsolete
          ○ Expiring TTLs
          ○ Removing deleted (aka tombstoned) data (after a fashion)
      ● Reorganize data so results are nearer each other
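The core of the idea can be sketched as a merge that keeps only the newest live cell per key. This is a deliberately simplified model: tables are plain dicts, TTLs are ignored, and tombstones are dropped immediately, whereas Cassandra keeps them for a grace period first (the "after a fashion" above). All names are hypothetical.

```python
TOMBSTONE = object()  # marker for a deleted value

def compact(sstables):
    """Merge several tables into one, keeping only the newest live cell per key.

    Each table is a dict mapping key -> (write_timestamp, value).
    """
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    # Drop tombstoned entries and emit rows in sorted key order.
    return {
        key: cell
        for key, cell in sorted(newest.items())
        if cell[1] is not TOMBSTONE
    }

old = {"a": (1, "v1"), "b": (1, "v1"), "c": (1, "v1")}
new = {"a": (2, "v2"), "b": (3, TOMBSTONE)}
merged = compact([old, new])
print(merged)
```

Two input files become one, the overwritten value of `a` and the deleted `b` disappear, and a read now touches a single table.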
  • 36. Compaction strategies
      ● Size-tiered
          ○ Combines tables of similar size
          ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
          ○ Needs working space for largest possible compaction (100%)
          ○ Minimal IO
      ● Leveled
          ○ Small, fixed size files in levels of exponentially increasing size
          ○ Files have non-overlapping ranges within a level
          ○ Amortized; Lots of continuous compaction
          ○ Very efficient reads, but also quite IO intensive
      ● Date-tiered
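"Combines tables of similar size" can be sketched as a bucketing pass. The version below is a simplification written for illustration, not Cassandra's implementation: a table joins a bucket when its size lies within a band around the bucket's average, and a bucket becomes a compaction candidate once it holds `min_threshold` tables. The parameter names and defaults follow the size-tiered strategy's options, but the helper functions themselves are hypothetical.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group table sizes into buckets of similar size."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets

def compaction_candidates(sizes, min_threshold=4):
    """Buckets holding enough similar-sized tables to be worth compacting."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]

sizes_mb = [10, 11, 9, 12, 100, 110, 1000]
print(size_tiered_buckets(sizes_mb))
print(compaction_candidates(sizes_mb))
```

Only the four small tables are merged here; the larger ones wait until enough peers of similar size accumulate, which is why the largest compaction can need as much free space as the data it rewrites.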
  • 37. Date-tiered compaction
      ● Newest of the bunch; Introduced in 2.1
      ● For append-only workloads (no overwrites, no deletes)
      ● Where data is ordered sort-ascending, (and not written out-of-order)
      ● Windows of time, arranged in tiers
      ● Avoids mixing “old” data with “new”
      ● Cold data eventually ceases to be compacted
  • 38. [Figure: SSTables grouped into time windows arranged in tiers, annotated with min_threshold (4) and base_time_seconds]
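One way to picture "windows of time, arranged in tiers" is to walk backwards from the present, starting with windows of `base_time_seconds` and stepping up to a window `min_threshold` times larger at each aligned boundary. The function below is a simplified sketch of that idea under my own reading of it; it is not Cassandra's code, it ignores the strategy's other options, and the function name is hypothetical.

```python
def dtcs_windows(now, base_time_seconds, min_threshold, count):
    """Return `count` (start, end) time windows, newest first."""
    size = base_time_seconds
    position = now // size          # index of the window containing `now`
    windows = []
    for _ in range(count):
        windows.append((position * size, (position + 1) * size))
        if position % min_threshold == 0:
            # Aligned boundary: older data falls into a window
            # min_threshold times larger.
            size *= min_threshold
            position = position // min_threshold - 1
        else:
            position -= 1
    return windows

print(dtcs_windows(now=1000, base_time_seconds=60, min_threshold=4, count=4))
print(dtcs_windows(now=1100, base_time_seconds=60, min_threshold=4, count=4))
```

Recent data sits in small windows and older data in progressively larger ones, so tables are only merged with others of similar age.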
  • 39. Date-tiered compaction (continued)
      ● Hard to reason about
      ● Optimizations are easily defeated
      ● https://phabricator.wikimedia.org/T126221
  • 41. DTCS: So now what?
      ● Size-tiered compaction? Might as well.
      ● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)? Maybe...
      ● Reduce node density?
  • 43. GC
      ● Early adopters of G1 (Garbage 1st)
      ● Successor to Concurrent Mark-sweep (CMS)
      ● Incremental parallel compacting collector
      ● More predictable than CMS
      ● Configurable pause-time target
  • 45. G1 [Figure: heap divided into regions labeled Eden, Survivor, and Old-Gen]
  • 46. Humongous objects
      ● Anything >= ½ region size is classified as Humongous
      ● Humongous objects are allocated into Humongous Regions
      ● Only one object for a region (wastes space, creates fragmentation)
      ● Until 1.8u40, humongous regions collected only during full collections (Bad)
      ● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better)
      ● Treated as exceptions, so should be exceptional
          ○ For us, that means 8MB regions
      ● Enable GC logging and have a look!
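The cost of a humongous allocation is easy to estimate. The sketch below applies the rule exactly as stated on the slide (half a region or more counts as humongous) with the 8MB region size mentioned above; the function name is hypothetical, and the model ignores object headers and alignment.

```python
import math

MB = 1024 * 1024

def humongous_footprint(object_bytes, region_bytes=8 * MB):
    """Return (is_humongous, regions_used, wasted_bytes) for one allocation."""
    is_humongous = object_bytes >= region_bytes / 2
    if not is_humongous:
        return (False, 0, 0)
    # A humongous object occupies whole regions; the tail of the last one is unused.
    regions = math.ceil(object_bytes / region_bytes)
    wasted = regions * region_bytes - object_bytes
    return (True, regions, wasted)

for size_mb in (3, 4, 9):
    humongous, regions, wasted = humongous_footprint(size_mb * MB)
    print(f"{size_mb}MB: humongous={humongous}, regions={regions}, wasted={wasted // MB}MB")
```

A 9MB object ties up two 8MB regions and leaves 7MB unusable, which is why larger regions help keep such allocations exceptional.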
  • 48. “Many smaller-sized Cassandra nodes is always better than fewer, dense ones.” — Everyone
  • 50. What we do
      ● Processes (yup)
      ● Puppetized configuration
          ○ /etc/cassandra-a/
          ○ /etc/cassandra-b/
          ○ systemd units
          ○ Etc
      ● Shared RAID-0
  • 51. What we should have done
      ● Virtualization
      ● Containers
      ● Blades
  • 52. The Good
      ● Fault-tolerance
      ● Availability
      ● Datacenter / rack awareness
      ● Nice, helpful people (tickets, IRC, etc)
  • 53. The Bad
      ● Usability
          ○ Compaction
          ○ Streaming
          ○ JMX
      ● Vertical scaling
      ● JVM
  • 54. The Ugly
      ● Release process
      ● Upgrades