Wikimedia Content API:
A Cassandra Use-case
Eric Evans <eevans@wikimedia.org>
@jericevans
Berlin Buzzwords | June 6, 2016
Our Vision:
A world in which every single human can freely
share in the sum of all knowledge.
About:
● Global movement
● Largest collection of free, collaborative knowledge in human history
● 16 projects
● 16.5 billion total page views per month
● 58.9 million unique devices per day
● More than 13k new editors each month
● More than 75k active editors month-to-month

About: Wikipedia
● More than 38 million articles in 290 languages
● Over 10k new articles added per day
● 13 million edits per month
● Ranked #6 globally in web traffic
Wikimedia Architecture
LAMP
THE ARCHITECTURE
ALL OF IT

Wikitext
= Star Wars: The Force Awakens =
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by [[J. J. Abrams]].
HTML
<h1>
Star Wars: The Force Awakens
</h1>
<p>
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by
<a href="/wiki/J._J._Abrams" title="J. J. Abrams">
J. J. Abrams
</a>
</p>
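The wikitext and HTML above can be related by a toy converter. The following is a minimal sketch, assuming only level-1 headings and `[[links]]` need handling; it is not how Parsoid works, and the names `wikitext_to_html`, `LINK` and `HEADING` are made up for illustration.

```python
import re
from html import escape

# Hypothetical patterns covering only the two constructs shown on the slide.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
HEADING = re.compile(r"=\s*([^=]+?)\s*=")


def _link(match):
    """Render [[Target]] or [[Target|label]] as an anchor."""
    target = match.group(1)
    label = match.group(2) or target
    href = "/wiki/" + target.replace(" ", "_")
    return f'<a href="{href}" title="{target}">{label}</a>'


def wikitext_to_html(text):
    """Convert a tiny wikitext subset to HTML (illustration only)."""
    blocks, paragraph = [], []

    def flush():
        # Close the paragraph collected so far, if any.
        if paragraph:
            blocks.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for raw in text.splitlines():
        line = escape(raw.strip())  # escape first, then add markup
        heading = HEADING.fullmatch(line)
        if heading:
            flush()
            blocks.append(f"<h1>{heading.group(1)}</h1>")
        elif line:
            paragraph.append(LINK.sub(_link, line))
        else:
            flush()
    flush()
    return "\n".join(blocks)


if __name__ == "__main__":
    print(wikitext_to_html(
        "= Star Wars: The Force Awakens =\n"
        "film directed, co-produced, and co-written by [[J. J. Abrams]]."
    ))
```

Real wikitext has templates, tables, references and much more, which is part of why the real conversion is expensive.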

Wikitext
HTML
Conversion
wikitext → html
WYSIWYG

Conversion
wikitext ← html
Character-based diffs
Metadata
[[Foo|bar]]
<a rel="mw:WikiLink" href="./Foo">bar</a>
Metadata
[[Foo|{{echo|bar}}]]
<a rel="mw:WikiLink" href="./Foo">
<span about="#mwt1" typeof="mw:Object/Template"
data-parsoid="{...}" >bar</span>
</a>
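The `rel="mw:WikiLink"` annotation is what makes the reverse direction possible. A minimal sketch of that idea, using only the standard library; the class and function names are hypothetical, and the template case (`{{echo|bar}}`) is deliberately not reconstructed because that would need the elided `data-parsoid` payload.

```python
from html.parser import HTMLParser


class WikiLinkSerializer(HTMLParser):
    """Turn <a rel="mw:WikiLink"> anchors back into [[target|label]]."""

    def __init__(self):
        super().__init__()
        self.out = []        # pieces of the resulting wikitext
        self._target = None  # link target while inside an annotated anchor
        self._label = []     # text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "mw:WikiLink":
            href = attrs.get("href") or ""
            self._target = href[2:] if href.startswith("./") else href
            self._label = []

    def handle_data(self, data):
        if self._target is not None:
            self._label.append(data)
        else:
            self.out.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            label = "".join(self._label).strip()
            if label == self._target:
                self.out.append(f"[[{self._target}]]")
            else:
                self.out.append(f"[[{self._target}|{label}]]")
            self._target = None


def html_to_wikitext(html):
    parser = WikiLinkSerializer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(html_to_wikitext('<a rel="mw:WikiLink" href="./Foo">bar</a>'))
```

Without the annotation, a serializer would have to guess whether an anchor came from a wiki link, an external link, or a template.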

Parsoid
● Node.js service
● Converts wikitext to HTML/RDFa
● Converts HTML/RDFa to wikitext
● Semantics, and syntax (avoid dirty diffs)!
● Expensive (slow)
● Resulting output is large
RESTBase
● Services aggregator / proxy (REST)
● Durable cache (Cassandra)
● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
[Diagram: RESTBase nodes, with Cassandra as storage and Parsoid (among other services) as backends]
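One way to picture "durable cache in front of an expensive service" is a read-through store. This is a sketch under assumptions, not RESTBase's implementation: the in-memory dict stands in for Cassandra, `render` stands in for a call to Parsoid, and rendering on a miss (rather than ahead of time) is chosen only to keep the example short.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, int]  # (domain, title, revision)


class DurableCache:
    """Read-through cache: render once, then serve the stored result."""

    def __init__(self, render: Callable[[str, str, int], bytes]):
        self._render = render              # stand-in for the slow backend
        self._store: Dict[Key, bytes] = {}  # stand-in for Cassandra
        self.misses = 0

    def get(self, domain: str, title: str, rev: int) -> bytes:
        key = (domain, title, rev)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render(domain, title, rev)
        return self._store[key]


if __name__ == "__main__":
    def slow_render(domain, title, rev):
        # Hypothetical placeholder for an expensive wikitext -> HTML call.
        return f"<h1>{title}</h1><!-- {domain} r{rev} -->".encode()

    cache = DurableCache(slow_render)
    cache.get("en.wikipedia.org", "Foo", 1)
    cache.get("en.wikipedia.org", "Foo", 1)
    print(cache.misses)
```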
Other use-cases
● Mobile content service
● Math formula rendering service
● Dumps
● ...

Cassandra
Environment
● 2 datacenters
● 3 racks per datacenter
● 18 hosts (16 core, 128G, SSDs)
● 54 nodes
● Deflate compression (compressed size ~14-18% of original)
● 31T storage (~206T uncompressed)
● Cassandra 2.1.13 (moving to 2.2.6)
● Read-heavy workload (5:1)
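A quick back-of-the-envelope check of the figures above. It assumes the 31T and 206T numbers are cluster-wide totals, which the slide does not state explicitly.

```python
hosts = 18
nodes = 54
compressed_tb = 31
uncompressed_tb = 206

# 54 Cassandra nodes on 18 hosts means several instances per machine.
instances_per_host = nodes / hosts

# Compressed size as a share of the uncompressed size.
compression_ratio = compressed_tb / uncompressed_tb

# Average on-disk data per node, if evenly spread (an assumption).
tb_per_node = compressed_tb / nodes

if __name__ == "__main__":
    print(f"instances per host: {instances_per_host:.0f}")
    print(f"compressed/uncompressed: {compression_ratio:.1%}")
    print(f"average per node: {tb_per_node:.2f}T")
```

The ratio lands inside the ~14-18% range quoted for deflate, and the instances-per-host figure is what the later "Node density" slides are about.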
Data model
CREATE TABLE data (
    domain text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY ((domain, title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
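To make the effect of the primary key concrete, here is a small in-memory model of the table: one partition per (domain, title), rows inside it kept in (rev DESC, tid DESC) order. It is only a sketch; plain integers stand in for timeuuid values, and the class and method names are invented.

```python
from collections import defaultdict


class DataTable:
    """In-memory stand-in for the `data` table's layout."""

    def __init__(self):
        # partition key (domain, title) -> rows of (rev, tid, value)
        self._partitions = defaultdict(list)

    def insert(self, domain, title, rev, tid, value):
        rows = self._partitions[(domain, title)]
        rows.append((rev, tid, value))
        # Mirror CLUSTERING ORDER BY (rev DESC, tid DESC).
        rows.sort(key=lambda row: (row[0], row[1]), reverse=True)

    def latest(self, domain, title):
        """Newest render of the newest revision: the first row on disk."""
        rows = self._partitions.get((domain, title))
        return rows[0] if rows else None

    def renders_of(self, domain, title, rev):
        """All renders of one revision, newest first."""
        rows = self._partitions.get((domain, title), [])
        return [row for row in rows if row[0] == rev]


if __name__ == "__main__":
    table = DataTable()
    page = ("en.wikipedia.org", "Star_Wars:_The_Force_Awakens")
    table.insert(*page, 717862573, 1, b"first render")
    table.insert(*page, 717862573, 2, b"re-render")
    table.insert(*page, 717873822, 3, b"new revision")
    print(table.latest(*page))
```

With this ordering, the most common request (latest content of a page) reads from the front of a single partition.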

Data model
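[Figure: the partition (en.wikipedia.org, Star_Wars:_The_Force_Awakens) holds revisions 717862573 and 717873822 (...), each with one or more renders keyed by timeuuid: 97466b12...7c7a913d3d8a, 1f2dd66c...7c7a913d3d8a, 09877568...7c7a913d3d8a, bdebc9a6...7c7a913d3d8a, 827e2ec2...7c7a913d3d8a]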
Compression
chunk_length_kb
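Cassandra compresses table data in independent chunks whose size is set by chunk_length_kb, so larger chunks give the compressor more repetition to work with. The sketch below imitates that with zlib (deflate) on synthetic, HTML-like data; the sample content and the function name are made up, and the numbers it prints say nothing about Wikimedia's real data.

```python
import zlib


def compressed_size(data: bytes, chunk_length_kb: int) -> int:
    """Total size when each chunk is compressed on its own."""
    chunk = chunk_length_kb * 1024
    return sum(
        len(zlib.compress(data[i:i + chunk]))
        for i in range(0, len(data), chunk)
    )


# Synthetic, repetitive sample (an assumption, loosely HTML-shaped).
sample = b"".join(
    b'<p>Revision %d of <a rel="mw:WikiLink" href="./Foo">Foo</a></p>\n' % i
    for i in range(20000)
)

if __name__ == "__main__":
    for kb in (4, 64, 1024):
        size = compressed_size(sample, kb)
        print(f"chunk_length_kb={kb}: {size / len(sample):.2%} of original")
```

The trade-off is that a read has to decompress a whole chunk to get at any part of it, so larger chunks cost more memory and CPU per read.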

Brotli compression
● Brought to you by the folks at Google; Successor to deflate
● Cassandra implementation (https://github.com/eevans/cassandra-brotli)
● Initial results very promising
● Better compression, lower cost (apples-apples)
● And, wider windows are possible (apples-oranges)
○ GC/memory permitting
○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size!
○ https://phabricator.wikimedia.org/T122028
● Stay tuned!
Compaction
● The cost of having log-structured storage
● Asynchronously (post-write) optimize data on disk for reads
● At a minimum, reorganize into fewer files
○ Dropping what is obsolete
○ Expiring TTLs
○ Removing deleted (aka tombstoned) data (after a fashion)
● Reorganize data so results are nearer each other
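The points above can be sketched as a merge over several immutable files. This is a simplified model, not Cassandra's code: each "SSTable" is a dict of key to (write timestamp, value, expiry), a value of None marks a tombstone, and `gc_before` is a hypothetical cut-off standing in for the grace period that tombstones must outlive before they can be purged.

```python
def compact(sstables, now, gc_before):
    """Merge several tables into one, keeping only what is still needed."""
    newest = {}
    for table in sstables:
        for key, cell in table.items():
            # The write with the highest timestamp wins; older ones are obsolete.
            if key not in newest or cell[0] > newest[key][0]:
                newest[key] = cell

    merged = {}
    for key in sorted(newest):  # output stays sorted by key
        written_at, value, expires_at = newest[key]
        if expires_at is not None and expires_at <= now:
            continue  # TTL has expired
        if value is None and written_at < gc_before:
            continue  # tombstone is old enough to be purged
        merged[key] = (written_at, value, expires_at)
    return merged


if __name__ == "__main__":
    older = {"a": (1, "a1", None), "b": (1, "b1", None), "c": (1, "c1", 5)}
    newer = {"a": (2, "a2", None), "b": (3, None, None), "d": (4, None, None)}
    print(compact([older, newer], now=10, gc_before=4))
```

Note that the recent tombstone for "d" survives the merge, which is the "after a fashion" caveat: deleted data only disappears once its tombstone is old enough.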
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Very efficient reads, but also quite IO intensive
● Date-tiered
○ For append-only, totally ordered data
○ Avoids mixing old data with new
○ Cold data eventually ceases to be compacted
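"Combines tables of similar size" can be illustrated with a simplified bucketing pass. The parameter names follow the size-tiered strategy's bucket_low, bucket_high and min_threshold options, but the logic here is a reduced sketch and not the actual selection code.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Group table sizes into buckets of similar size.

    A size joins a bucket when it falls within [low, high] times the
    bucket's current average; only buckets with at least min_threshold
    members are returned as compaction candidates.
    """
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if average * bucket_low <= size <= average * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # nothing similar yet: start a new bucket
    return [bucket for bucket in buckets if len(bucket) >= min_threshold]


if __name__ == "__main__":
    # Sizes in arbitrary units (for example MB).
    print(size_tiered_buckets([10, 11, 9, 12, 100, 105, 400]))
```

Nothing in this selection looks at what the tables contain, which is the sense in which the strategy is oblivious to column distribution.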

Compaction strategies
● Date-tiered
○ Cold data eventually ceases to be compacted (OMG, THIS!)
DTCS: Well...no, actually
● Hard to reason about
● Optimizations easily defeated
● See: https://phabricator.wikimedia.org/T126221
DTCS: So now what?
● Size-tiered compaction? Might as well.
● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)?
Maybe...
● Reduce node density?
Garbage Collection

G1GC
● Early adopters of G1 (aka “Garbage 1st”)
● Successor to Concurrent Mark-sweep (CMS)
● Incremental parallel compacting collector
● More predictable performance than CMS
Humongous objects
● Anything >= ½ region size is classified as Humongous
● Humongous objects are allocated into Humongous Regions
● Only one object for a region (wastes space, creates fragmentation)
● Until 1.8u40, humongous regions were collected only during full collections (Bad)
● Since 1.8u40, they are also collected at the end of the marking cycle, during the cleanup phase (Better)
● Treated as exceptions, so should be exceptional
○ For us, that means 8MB regions
● Enable GC logging and have a look!
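The sizing rule above is easy to check with a little arithmetic. This sketch takes the threshold exactly as stated on the slide (at least half a region) and assumes a humongous object occupies whole regions, with the tail of the last one left unused; the function names are made up.

```python
import math

MB = 1024 * 1024


def is_humongous(object_bytes: int, region_bytes: int) -> bool:
    """Humongous, per the slide: at least half of one region."""
    return object_bytes >= region_bytes / 2


def regions_needed(object_bytes: int, region_bytes: int) -> int:
    """Whole regions taken up by one humongous object."""
    return math.ceil(object_bytes / region_bytes)


def wasted_bytes(object_bytes: int, region_bytes: int) -> int:
    """Space in those regions that the object does not use."""
    return regions_needed(object_bytes, region_bytes) * region_bytes - object_bytes


if __name__ == "__main__":
    region = 8 * MB  # the region size mentioned above
    for size in (3 * MB, 4 * MB, 9 * MB):
        print(size // MB, "MB:", is_humongous(size, region),
              regions_needed(size, region), wasted_bytes(size, region) // MB)
```

With 8MB regions the threshold sits at 4MB, so a larger region size turns fewer large allocations into humongous ones.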
Node density
“Many smaller-sized Cassandra nodes is
always better than fewer, dense ones.”
— Everyone

Motivation
● Compaction
● GC
● ...
What we do
● Processes (yup)
● Puppetized configuration
○ /etc/cassandra-a/
○ /etc/cassandra-b/
○ systemd units
○ Etc
● Shared RAID-0
What we should have done
● Virtualization
● Containers
● Blades
● Not processes
Cassandra: The Good
● Fault-tolerance
● Availability
● Datacenter / rack awareness
● Visibility
● Ubiquity
● Nice, helpful people (tickets, IRC, etc)

Cassandra: The Bad
● Usability
○ Compaction
○ Streaming
○ JMX
○ etc
● Vertical scaling
● JVM
Cassandra: The Ugly
● Upgrading
● Release process
Wikimedia Content API: A Cassandra Use-case

More Related Content

What's hot

SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealm
Pluu love
 
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
Sathyajith Bhat
 
Google Drive vs. Dropbox
Google Drive vs. DropboxGoogle Drive vs. Dropbox
Google Drive vs. Dropbox
Fireworks Websites
 
Object Storage in a Cloud-Native Container Envirnoment
Object Storage in a Cloud-Native Container EnvirnomentObject Storage in a Cloud-Native Container Envirnoment
Object Storage in a Cloud-Native Container Envirnoment
Minio
 
Unleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformUnleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platform
Lakmal Warusawithana
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS Prevention
Walter Liu
 
Introduction to Bizur
Introduction to BizurIntroduction to Bizur
Introduction to Bizur
Akira Hayakawa
 
Cassandra Lunch #59 Functions in Cassandra
Cassandra Lunch #59  Functions in CassandraCassandra Lunch #59  Functions in Cassandra
Cassandra Lunch #59 Functions in Cassandra
Anant Corporation
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
Adam Doyle
 
PTD and beyond
PTD and beyondPTD and beyond
PTD and beyond
Johan Gustavsson
 
Corwin on containers
Corwin on containersCorwin on containers
Corwin on containers
Corwin Brown
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыба��
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Ontico
 
Cloud storage: the right way OSS EU 2018
Cloud storage: the right way OSS EU 2018Cloud storage: the right way OSS EU 2018
Cloud storage: the right way OSS EU 2018
Orit Wasserman
 
Memcached
MemcachedMemcached
Memcached
Dori Waldman
 
Marble talk at akademy 2008
Marble talk  at akademy 2008Marble talk  at akademy 2008
Marble talk at akademy 2008
Marble Virtual Globe
 
A Hands-on Introduction to MapReduce (in Python)
A Hands-on Introduction to MapReduce (in Python)A Hands-on Introduction to MapReduce (in Python)
A Hands-on Introduction to MapReduce (in Python)
David Massart
 
The Concierge Paradigm
The Concierge ParadigmThe Concierge Paradigm
The Concierge Paradigm
Gareth Brown
 
Effective Git
Effective GitEffective Git
Effective Git
Tejas Bubane
 
Memory management
Memory managementMemory management
Memory management
mitesh_sharma
 
KubeVirt 101 Workshop - Containerdays.io 2019
KubeVirt 101 Workshop - Containerdays.io 2019KubeVirt 101 Workshop - Containerdays.io 2019
KubeVirt 101 Workshop - Containerdays.io 2019
Fabian Deutsch
 

Wikimedia Content API: A Cassandra Use-case

  • 1. Wikimedia Content API: A Cassandra Use-case Eric Evans <eevans@wikimedia.org> @jericevans Berlin Buzzwords | June 6, 2016
  • 3. Our Vision: A world in which every single human can freely share in the sum of all knowledge.
  • 4. About:
      ● Global movement
      ● Largest collection of free, collaborative knowledge in human history
      ● 16 projects
      ● 16.5 billion total page views per month
      ● 58.9 million unique devices per day
      ● More than 13k new editors each month
      ● More than 75k active editors month-to-month
  • 5. About: Wikipedia
      ● More than 38 million articles in 290 languages
      ● Over 10k new articles added per day
      ● 13 million edits per month
      ● Ranked #6 globally in web traffic
  • 11. Wikitext
      = Star Wars: The Force Awakens =
      Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by [[J. J. Abrams]].
  • 12. HTML
      <h1>Star Wars: The Force Awakens</h1>
      <p>Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by <a href="/wiki/J._J._Abrams" title="J. J. Abrams">J. J. Abrams</a></p>
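The wikitext and HTML slides above show the same sentence before and after conversion. As a rough illustration of the link part of that conversion only, here is a toy Python sketch. The function name `wikilinks_to_html` and the regular expression are assumptions made for this example; it is not how Parsoid or MediaWiki parse wikitext, and it ignores templates, namespaces, and URL percent-encoding.

```python
import html
import re

# Matches [[Target]] and [[Target|label]]; a deliberately tiny subset of wikitext.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def wikilinks_to_html(wikitext: str) -> str:
    """Replace simple wiki links with anchor tags (illustrative only)."""

    def repl(match: re.Match) -> str:
        target = match.group(1).strip()
        label = match.group(2) or target
        # Page titles use underscores in place of spaces in the URL path.
        href = "/wiki/" + target.replace(" ", "_")
        return '<a href="{}" title="{}">{}</a>'.format(
            html.escape(href, quote=True),
            html.escape(target, quote=True),
            html.escape(label),
        )

    return WIKILINK.sub(repl, wikitext)


print(wikilinks_to_html("co-written by [[J. J. Abrams]]."))
```

The output for the sample sentence matches the anchor shown on the HTML slide, which is the only case this sketch is meant to cover.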
  • 14. HTML
  • 20. Metadata
      [[Foo|{{echo|bar}}]]
      <a rel="mw:WikiLink" href="./Foo">
        <span about="#mwt1" typeof="mw:Object/Template" data-parsoid="{...}">bar</span>
      </a>
  • 21. Parsoid
      ● Node.js service
      ● Converts wikitext to HTML/RDFa
      ● Converts HTML/RDFa to wikitext
      ● Semantics, and syntax (avoid dirty diffs)!
      ● Expensive (slow)
      ● Resulting output is large
  • 22. RESTBase
      ● Services aggregator / proxy (REST)
      ● Durable cache (Cassandra)
      ● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
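The Parsoid and RESTBase slides describe an expensive renderer sitting behind a durable cache. One common way to arrange that is a read-through cache, sketched below in Python. Treat it as an assumption about the general pattern rather than a description of RESTBase internals: `ReadThroughCache`, `fake_render`, and the in-memory dict standing in for Cassandra are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, int]  # (domain, title, revision)


class ReadThroughCache:
    """Serve stored renders; call the slow renderer only on a miss."""

    def __init__(self, render: Callable[[str, str, int], str]) -> None:
        self._render = render
        self._store: Dict[Key, str] = {}  # stand-in for durable storage
        self.render_calls = 0

    def get_html(self, domain: str, title: str, rev: int) -> str:
        key = (domain, title, rev)
        if key not in self._store:
            # Miss: pay the rendering cost once, then keep the result.
            self.render_calls += 1
            self._store[key] = self._render(domain, title, rev)
        return self._store[key]


def fake_render(domain: str, title: str, rev: int) -> str:
    """Hypothetical placeholder for a slow wikitext-to-HTML conversion."""
    return f"<h1>{title}</h1><!-- {domain} rev {rev} -->"


cache = ReadThroughCache(fake_render)
first = cache.get_html("en.wikipedia.org", "Star_Wars:_The_Force_Awakens", 717862573)
second = cache.get_html("en.wikipedia.org", "Star_Wars:_The_Force_Awakens", 717862573)
print(cache.render_calls)
```

The second request is answered from storage, so the renderer runs once for that revision.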
  • 24. Other use-cases
      ● Mobile content service
      ● Math formula rendering service
      ● Dumps
      ● ...
  • 26. Environment
      ● 2 datacenters
      ● 3 racks per datacenter
      ● 18 hosts (16 core, 128G, SSDs)
      ● 54 nodes
      ● Deflate compression (~14-18%)
      ● 31T storage (~206T uncompressed)
      ● Cassandra 2.1.13 (moving to 2.2.6)
      ● Read-heavy workload (5:1)
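A quick arithmetic check ties the environment figures together. The sketch below only divides numbers quoted on the slide; reading 31T and 206T as cluster-wide totals is an assumption.

```python
hosts = 18
nodes = 54
compressed_tb = 31      # "31T storage"
uncompressed_tb = 206   # "~206T uncompressed"

nodes_per_host = nodes / hosts
ratio = compressed_tb / uncompressed_tb

print(f"{nodes_per_host:.0f} Cassandra nodes per host")
print(f"compressed size is {ratio:.1%} of uncompressed")
```

The ratio comes out at about 15%, which sits inside the ~14-18% range quoted for deflate.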
  • 28. Data model CREATE TABLE data ( domain text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY ((domain, title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
  • 29. Data model (diagram) ● Partition key: en.wikipedia.org + Star_Wars:_The_Force_Awakens ● Revisions (rev): 717862573, 717873822, ... ● Renders (tid) within the revisions: 97466b12...7c7a913d3d8a, 1f2dd66c...7c7a913d3d8a, ..., 09877568...7c7a913d3d8a, bdebc9a6...7c7a913d3d8a, 827e2ec2...7c7a913d3d8a
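To make the effect of the clustering order concrete, here is a small in-memory sketch of one partition. It is not a Cassandra client: `Row`, `timeuuid`, `read_partition`, and `latest_render` are hypothetical helpers, the values are made up, and the node value merely reuses the trailing digits visible in the example tids. With `rev DESC, tid DESC`, the first row of a partition is the newest render of the newest revision, which is what a query such as `SELECT value FROM data WHERE domain = ? AND title = ? LIMIT 1` would be expected to return.

```python
import uuid
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    domain: str
    title: str
    rev: int
    tid: uuid.UUID
    value: bytes


def timeuuid(timestamp: int, node: int = 0x7C7A913D3D8A) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (deterministic, for the demo)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


def read_partition(rows: List[Row], domain: str, title: str) -> List[Row]:
    """Rows of one (domain, title) partition in clustering order: rev DESC, tid DESC."""
    matching = [r for r in rows if (r.domain, r.title) == (domain, title)]
    # Simplification: order timeuuids by their embedded timestamp only.
    return sorted(matching, key=lambda r: (r.rev, r.tid.time), reverse=True)


def latest_render(rows: List[Row], domain: str, title: str) -> Optional[Row]:
    ordered = read_partition(rows, domain, title)
    return ordered[0] if ordered else None


DOMAIN, TITLE = "en.wikipedia.org", "Star_Wars:_The_Force_Awakens"
rows = [
    Row(DOMAIN, TITLE, 717862573, timeuuid(1000), b"render A"),
    Row(DOMAIN, TITLE, 717873822, timeuuid(2000), b"render B"),
    Row(DOMAIN, TITLE, 717873822, timeuuid(3000), b"render C"),  # re-render of the same revision
    Row(DOMAIN, "Other_Page", 1, timeuuid(4000), b"unrelated"),
]
newest = latest_render(rows, DOMAIN, TITLE)
```

Because every render of every revision of a page lives in the same partition, heavily edited pages produce wide partitions, which is part of why compression and compaction matter so much in the slides that follow.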
  • 33. Brotli compression ● Brought to you by the folks at Google; Successor to deflate ● Cassandra implementation (https://github.com/eevans/cassandra-brotli) ● Initial results very promising ● Better compression, lower cost (apples-apples) ● And, wider windows are possible (apples-oranges) ○ GC/memory permitting ○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size! ○ https://phabricator.wikimedia.org/T122028 ● Stay tuned!
  • 35. Compaction ● The cost of having log-structured storage ● Asynchronously (post-write) optimize data on disk for reads ● At a minimum, reorganize into fewer files ○ Dropping what is obsolete ○ Expiring TTLs ○ Removing deleted (aka tombstoned) data (after a fashion) ● Reorganize data so results are nearer each other
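The merge step described above can be sketched in a few lines. This is a deliberately simplified model, not Cassandra's implementation: each "SSTable" is a dict of key to cell, the newest write wins, expired TTL cells are dropped outright, and tombstones are dropped once a grace period has passed. `Cell`, `compact`, and `gc_grace` are illustrative names, and the sketch assumes every table holding a key takes part in the merge.

```python
from typing import Dict, List, NamedTuple, Optional


class Cell(NamedTuple):
    write_ts: int
    value: Optional[bytes]            # None marks a tombstone (a delete)
    expires_at: Optional[int] = None  # absolute expiry time for TTL'd cells


def compact(sstables: List[Dict[str, Cell]], now: int, gc_grace: int = 0) -> Dict[str, Cell]:
    """Merge several tables into one, keeping only what is still needed."""
    newest: Dict[str, Cell] = {}
    for table in sstables:
        for key, cell in table.items():
            # Obsolete (overwritten) versions lose to the latest write.
            if key not in newest or cell.write_ts > newest[key].write_ts:
                newest[key] = cell

    result: Dict[str, Cell] = {}
    for key in sorted(newest):  # output stays sorted by key
        cell = newest[key]
        if cell.expires_at is not None and cell.expires_at <= now:
            continue  # TTL expired
        if cell.value is None and cell.write_ts + gc_grace <= now:
            continue  # tombstone old enough to purge
        result[key] = cell
    return result


older = {"a": Cell(1, b"v1"), "b": Cell(1, b"x"), "c": Cell(1, b"ttl", expires_at=5)}
newer = {"a": Cell(2, b"v2"), "b": Cell(3, None)}

purged = compact([older, newer], now=10, gc_grace=0)
kept = compact([older, newer], now=10, gc_grace=100)
```

With `gc_grace=0` only the latest version of `a` survives; with a longer grace period the tombstone for `b` is carried forward, which is the "after a fashion" caveat on removing deleted data.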
  • 36. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Very efficient reads, but also quite IO intensive ● Date-tiered ○ For append only, total ordered data ○ Avoids mixing old data with new ○ Cold data eventually ceases to be compacted
  • 37. Compaction strategies ● Date-tiered: OMG, THIS!
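The size-tiered idea of combining tables of similar size can be illustrated with a simple bucketing pass. This is a simplified sketch, not Cassandra's code: a table joins a bucket when its size lies within a band around the bucket's running average, and a bucket becomes a compaction candidate once it holds enough tables. The band limits (0.5 and 1.5) and the threshold of 4 mirror the strategy's commonly documented defaults, but treat them as assumptions here; the function names are illustrative.

```python
from typing import List


def size_tiered_buckets(sizes: List[int], bucket_low: float = 0.5,
                        bucket_high: float = 1.5) -> List[List[int]]:
    """Group table sizes into buckets of 'similar' size."""
    buckets: List[dict] = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = bucket["avg"]
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket["members"].append(size)
                bucket["avg"] = sum(bucket["members"]) / len(bucket["members"])
                break
        else:
            # No existing bucket is close enough: start a new tier.
            buckets.append({"avg": float(size), "members": [size]})
    return [b["members"] for b in buckets]


def compaction_candidates(sizes: List[int], min_threshold: int = 4) -> List[List[int]]:
    """Buckets with enough similar-sized tables to be worth merging."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]


table_sizes_mb = [10, 11, 12, 9, 100, 110, 1000]
tiers = size_tiered_buckets(table_sizes_mb)
candidates = compaction_candidates(table_sizes_mb)
```

Here the four small tables form one tier and are the only candidate; the two mid-sized tables and the single large one wait until more tables of their size accumulate. Nothing in the grouping looks at what the tables contain, which is the "oblivious to column distribution" point.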
  • 38. DTCS: Well...no, actually ● Hard to reason about ● Optimizations easily defeated ● See: https://phabricator.wikimedia.org/T126221
  • 39. DTCS: So now what? ● Size-tiered compaction? Might as well. ● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)? Maybe... ● Reduce node density?
  • 41. G1GC ● Early adopters of G1 (aka “Garbage 1st”) ● Successor to Concurrent Mark-sweep (CMS) ● Incremental parallel compacting collector ● More predictable performance than CMS
  • 42. Humongous objects ● Anything >= ½ region size is classified as Humongous ● Humongous objects are allocated into Humongous Regions ● Only one object for a region (wastes space, creates fragmentation) ● Until 1.8u40, humongous regions collected only during full collections (Bad) ● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better) ● Treated as exceptions, so should be exceptional ○ For us, that means 8MB regions ● Enable GC logging and have a look!
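The half-region rule is easy to check with a little arithmetic. The sketch below assumes the 8MB region size mentioned above (on HotSpot the region size is set with `-XX:G1HeapRegionSize`); `is_humongous` and `humongous_footprint` are illustrative helpers, and the model of a humongous allocation occupying whole regions is a simplification.

```python
import math
from typing import Tuple

MIB = 1024 * 1024


def is_humongous(object_bytes: int, region_bytes: int) -> bool:
    """An allocation of at least half a region is treated as humongous."""
    return object_bytes >= region_bytes // 2


def humongous_footprint(object_bytes: int, region_bytes: int) -> Tuple[int, int]:
    """Whole regions consumed by a humongous object, and the bytes left unusable."""
    regions = math.ceil(object_bytes / region_bytes)
    wasted = regions * region_bytes - object_bytes
    return regions, wasted


REGION = 8 * MIB  # assumed region size, per the slide

threshold_hit = is_humongous(4 * MIB, REGION)       # exactly half a region
just_under = is_humongous(4 * MIB - 1, REGION)      # one byte short
five_mib = humongous_footprint(5 * MIB, REGION)     # fits in one region
nine_mib = humongous_footprint(9 * MIB, REGION)     # spills into a second region
```

A 5 MiB buffer takes a whole 8 MiB region and strands 3 MiB; a 9 MiB buffer takes two regions and strands 7 MiB. Larger regions raise the threshold, so fewer allocations are classified as humongous in the first place.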
  • 44. “Many smaller-sized Cassandra nodes is always better than fewer, dense ones.” — Everyone
  • 46. What we do ● Processes (yup) ● Puppetized configuration ○ /etc/cassandra-a/ ○ /etc/cassandra-b/ ○ systemd units ○ Etc ● Shared RAID-0
  • 47. What we should have done ● Virtualization ● Containers ● Blades ● Not processes
  • 48. Cassandra: The Good ● Fault-tolerance ● Availability ● Datacenter / rack awareness ● Visibility ● Ubiquity ● Nice, helpful people (tickets, IRC, etc)
  • 49. Cassandra: The Bad ● Usability ○ Compaction ○ Streaming ○ JMX ○ etc ● Vertical scaling ● JVM
  • 50. Cassandra: The Ugly ● Upgrading ● Release process