SlideShare a Scribd company logo
Breakthrough OLAP
Performance with
Cassandra and Spark
Evan Chan
August 2015
Who am I?
Distinguished Engineer,
@evanfchan
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of
TupleJump
http://velvia.github.io
Spark Job Server
About Tuplejump
is a big data technology leader providing solutions for
rapid insights from data.
Tuplejump
- the first Spark-Cassandra integration
- an open source Lucene indexer for Cassandra
- open source HDFS for Cassandra
Calliope
Stargate
SnackFS
Didn't I attend the same talk last year?
Similar title, but mostly new material
Will reveal new open source projects! :)

Recommended for you

What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability

Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955

cachingdistributedreplication
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction

Apache Doris (incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis. It is open-sourced by Baidu. Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris not only provides batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.

olapdorislide
YugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on KubernetesYugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on Kubernetes

ABSTRACT OF THE TALK Kubernetes has hit a home run for stateless workloads, but can it do the same for stateful services such as distributed databases? Before we can answer that question, we need to understand the challenges of running stateful workloads on, well anything. In this talk, we will first look at which stateful workloads, specifically databases, are ideal for running inside Kubernetes. Secondly, we will explore the various concerns around running databases in Kubernetes for production environments, such as: - The production-readiness of Kubernetes for stateful workloads in general - The pros and cons of the various deployment architectures - The failure characteristics of a distributed database inside containers In this session, we will demonstrate what Kubernetes brings to the table for stateful workloads and what database servers must provide to fit the Kubernetes model. This talk will also highlight some of the modern databases that take full advantage of Kubernetes and offer a peek into what’s possible if stateful services can meet Kubernetes halfway. We will go into the details of deployment choices, how the different cloud-vendor managed container offerings differ in what they offer, as well as compare performance and failure characteristics of a Kubernetes-based deployment with an equivalent VM-based deployment. BIO Amey is a VP of Data Engineering at Yugabyte with a deep passion for Data Analytics and Cloud-Native technologies. In his current role, he collaborates with Fortune 500 enterprises to architect their business applications with scalable microservices and geo-distributed, fault-tolerant data backend using YugabyteDB. Prior to joining Yugabyte, he spent 5 years at Pivotal as Platform Data Architect and has helped enterprise customers across multiple industry verticals to extend their analytical capabilities using Pivotal & OSS Big Data platforms. He is originally from Mumbai, India, and has a Master's degree in Computer Science from the University of Pennsylvania(UPenn), Philadelphia. Twitter: @ameybanarse LinkedIn: linkedin.com/in/ameybanarse/

 
by DoKC
sqlkubernetesyugabytedb
Problem Space
Need analytical database / queries on structured big data
Something SQL-like, very flexible and fast
Pre-aggregation too limiting
Fast data / constant updates
Ideally, want my queries to run over fresh data too
Example: Video analytics
Typical collection and analysis of consumer events
3 billion new events every day
Video publishers want updated stats, the sooner the better
Pre-aggregation only enables simple dashboard UIs
What if one wants to offer more advanced analysis, or a
generic data query API?
Eg, top countries filtered by device type, OS, browser
Requirements
Scalable - rules out PostGreSQL, etc.
Easy to update and ingest new data
Not traditional OLAP cubes - that's not what I'm talking
about
Very fast for analytical queries - OLAP not OLTP
Extremely flexible queries
Preferably open source
Parquet
Widely used, lots of support (Spark, Impala, etc.)
Problem: Parquet is read-optimized, not easy to use for writes
Cannot support idempotent writes
Optimized for writing very large chunks, not small updates
Not suitable for time series, IoT, etc.
Often needs multiple passes of jobs for compaction of small
files, deduplication, etc.
 
People really want a database-like abstraction, not a file format!

Recommended for you

Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdfOracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf

This document discusses Oracle Multitenant 19c and pluggable databases. It begins with an introduction to the speaker and overview of pluggable databases. It then describes the traditional Oracle database architecture and the multitenant architecture in Oracle 19c. It discusses the different components of a container database including the root, seed PDB, and application containers. It also covers how to create pluggable databases from scratch, through cloning locally and remotely, relocating PDBs, and plugging in unplugged PDBs.

pdb
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra

This document discusses using Apache Cassandra for business intelligence, reporting and analytics. It covers: - Data modeling and querying Cassandra data using CQL - Accessing Cassandra data through drivers, ODBC/JDBC, and analytics frameworks like Spark and Hadoop - Doing reporting, dashboards, and analytics on Cassandra data using CQL, Solr, Spark, and BI tools - Capabilities of DataStax Enterprise for integrated search, batch analytics, and real-time analytics on Cassandra - Example architectures that isolate workloads and handle hot vs cold data

nosqlcassandrabusiness intelligence
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2

This document summarizes a series of performance issues seen by the author in their work with Oracle Exadata systems. It describes random session hangs occurring across several minutes, with long transaction locks and I/O waits seen. Analysis of AWR reports and blocking trees revealed that many sessions were blocked waiting on I/O, though initial I/O metrics from the OS did not show issues. Further analysis using ASH activity breakdowns and OS tools like sar and vmstat found high apparent CPU usage in ASH that was not reflected in actual low CPU load on the system. This discrepancy was due to the way ASH attributes non-waiting time to CPU. The root cause remained unclear.

troubleshootingperformanceoracle
Turns out this has been solved before!
Even .Facebook uses Vertica
MPP Databases
Easy writes plus fast queries, with constant transfers
Automatic query optimization by storing intermediate query
projections
Stonebraker, et. al. - paper (Brown Univ)CStore
What's wrong with MPP Databases?
Closed source
$$$
Usually don't scale horizontally that well (or cost is prohibitive)
Cassandra
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Perfect for ingestion of real time / machine data
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented

Recommended for you

MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive

Detailed technical material about MyRocks -- RocksDB storage engine for MySQL -- https://github.com/facebook/mysql-5.6

myrocksmysqlrocksdb
Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020

This talk presents 15 different tips and tricks using tools to better troubleshoot and debug problems with Database , Oracle RAC and Oracle Clusterware , ASM and how to get the right pieces of data with the least of commands which today most people do manually. This session will cover tools from the Oracle Autonomous Health Framework (AHF) like Trace file Analyzer (TFA) to collect , organize and analyze log data , Exachk and orachk to perform mass best practices analysis and automation , Cluster Health Advisor to debug node evictions and calibrate the framework , OSWatcher and its analysis engine , oratop for pinpointing performance issues and many others to make one feel like a rockstar DBA.

oracle tfa exachk troubleshoot diagnose debugoracle rac troubleshooting diagnosing 12.2 oracleoracle 19c debugging and troubleshooting
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala

Query compilation in Impala involves parsing the SQL, semantic analysis to validate the query, planning to generate an executable query plan, and finally executing the query. The query planner considers different join orders and strategies like broadcast joins and partitioned joins to minimize data transfer during query execution based on table and column statistics. The explain output provides details on how the query will be executed in a distributed fashion across nodes.

sqlimpalahadoop
Apache Spark
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort
etc.
SQL, machine learning, streaming, graph, R, many more plugins
all on ONE platform - feed your SQL results to a logistic
regression, easy!
Huge number of connectors with every single storage
technology
Spark provides the missing fast, deep
analytics piece of Cassandra!
Spark and Cassandra
OLAP Architectures
Separate Storage and Query Layers
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency -
independent!

Recommended for you

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems

PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.

postgresqlorioledbdatabase
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake

As cloud computing continues to gather speed, organizations with years’ worth of data stored on legacy on-premise technologies are facing issues with scale, speed, and complexity. Your customers and business partners are likely eager to get data from you, especially if you can make the process easy and secure. Challenges with performance are not uncommon and ongoing interventions are required just to “keep the lights on”. Discover how Snowflake empowers you to meet your analytics needs by unlocking the potential of your data. Agenda of Webinar : ~Understand Snowflake and its Architecture ~Quickly load data into Snowflake ~Leverage the latest in Snowflake’s unlimited performance and scale to make the data ready for analytics ~Deliver secure and governed access to all data – no more silos

snowflake and its architecturesnowflakecloud computing
Oracle data guard for beginners
Oracle data guard for beginnersOracle data guard for beginners
Oracle data guard for beginners

The document discusses Oracle Data Guard, a disaster recovery solution for Oracle databases. It provides: 1) An overview of Data Guard, explaining that it maintains a physical or logical standby copy of the primary database to enable failover in the event of outages or disasters. 2) Details on the different types of standby databases - physical, logical, and snapshot - and how they are maintained through redo application or SQL application. 3) The various Data Guard configuration options like real-time apply, time delay, and role transitions such as switchover and failover.

oracle
Spark as Cassandra's Cache
Spark SQL
Appeared with Spark 1.0
In-memory columnar store
Parquet, Json, Cassandra connector, Avro, many more
SQL as well as DataFrames (Pandas-style) API
Indexing integrated into data sources (eg C* secondary
indexes)
Write custom functions in Scala .... take that Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python
Connecting Spark to Cassandra
Datastax's
Tuplejump
Spark Cassandra Connector
Calliope
 
Get started in one line with spark-shell!
bin/spark-shell
--packagescom.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3
--confspark.cassandra.connection.host=127.0.0.1
Caching a SQL Table from Cassandra
DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):
valsqlContext=neworg.apache.spark.sql.SQLContext(sc)
valdf=sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table"->"gdelt","keyspace"->"test"))
.load()
df.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECTcount(monthyear)FROMgdelt").show()
 
Spark does no caching by default - you will always be reading
from C*!

Recommended for you

Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase

Vladimir Rodionov (Hortonworks) Time-series applications (sensor data, application/system logging events, user interactions etc) present a new set of data storage challenges: very high velocity and very high volume of data. This talk will present the recent development in Apache HBase that make it a good fit for time-series applications.

opensourceopen sourcehortonworks
Exadata Backup
Exadata BackupExadata Backup
Exadata Backup

This document discusses backup and recovery strategies for Oracle Exadata systems. It outlines the fundamental principles of backups including having multiple copies of data stored on different media with one copy offsite. It then describes the various backup options for Exadata, including using additional Exadata storage cells for the fastest backups, using a ZFS storage appliance for flexibility, or backing up to tape for economical long-term storage with removable offline copies. Key metrics like backup and restore speeds are provided for each option.

oracelbackupexadata
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation

This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.

experimentationclickhouseolap
How Spark SQL's Table Caching Works
Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost
Method secs
Uncached 317
Cached 0.38
 
Almost a 1000x speedup!
On an 8-node EC2 c3.XL cluster, 117 million rows, can run
common queries 1-2 seconds against cached dataset.
Tuning Connector Partitioning
spark.cassandra.input.split.size
Guideline: One split per partition, one partition per CPU core
Much more parallelism won't speed up job much, but will
starve other C* requests
Lesson #1: Take Advantage of Spark
Caching!

Recommended for you

Adapting and adopting spm v04
Adapting and adopting spm v04Adapting and adopting spm v04
Adapting and adopting spm v04

Adapting and adopting SQL Plan Management (SPM) to achieve execution plan stability for sub-second queries on a high-rate OLTP mission-critical application

oracle performanceoracle databaseplan stability
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...

Several free tools are available to help with SQL tuning and performance diagnostics, including SQLd360, SQLT, and eDB360. SQLd360 and SQLT are good for diagnosing a single SQL statement, while eDB360 provides a 360-degree view of an entire Oracle database. Snapper and TUNAs360 can diagnose sessions and database activity. Standalone scripts like planx and sqlmon provide specialized diagnostics for individual cases. These free tools vary in size and capabilities, but all aim to help tune and diagnose SQL and database performance issues.

oracle performanceoracleedb360
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark

Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet. First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We'll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations - see Spark's Project Tungsten. FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as: * Easy exactly-once ingestion from Kafka for streaming and IoT applications * Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.

datastaxdatastax communityscalable
Problems with Cached Tables
Still have to read the data from Cassandra first, which is slow
Amount of RAM: your entire data + extra for conversion to
cached table
Cached tables only live in Spark executors - by default
tied to single context - not HA
once any executor dies, must re-read data from C*
Caching takes time: convert from RDD[Row] to compressed
columnar format
Cannot easily combine new RDD[Row] with cached tables
(and keep speed)
Problems with Cached Tables
If you don't have enough RAM, Spark can cache your tables
partly to disk. This is still way, way, faster than scanning an entire
C* table. However, cached tables are still tied to a single Spark
context/application.
Also: rdd.cache()is NOT the same as SQLContext's
cacheTable!
What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by
using Cassandra secondary indices. Does this work with caching?
No, not really, because only the filtered rows would be cached.
Subsequent queries against this limited cached table would not
give you expected results.
Tachyon Off-Heap Caching

Recommended for you

FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark

You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk of combining Apache Cassandra and Apache Spark, using a new open-source database, FiloDB.

big datastreamingspark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark

The document discusses using Apache Spark and Cassandra for online analytical processing (OLAP) of big data. It describes challenges with relational databases and OLAP cubes at large scales and how Spark can provide fast, distributed querying of data stored in Cassandra. The key points made are that Spark and Cassandra combine to provide horizontally scalable storage with Cassandra and fast, in-memory analytics with Spark; and that for optimal performance, data should be cached in Spark SQL tables for column-oriented querying and aggregation.

nosql cassandraplanet cassandrabig data
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...

O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.

analyticsfilodbscala
Intro to Tachyon
Tachyon: an in-memory cache for HDFS and other binary data
sources
Keeps data off-heap, so multiple Spark applications/executors
can share data
Solves HA problem for data
Wait, wait, wait!
What am I caching exactly? Tachyon is designed for caching files
or binary blobs.
A serialized form of CassandraRow/CassandraRDD?
Raw output from Cassandra driver?
What you really want is this:
Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
Bad programmers worry about the code. Good
programmers worry about data structures.
- Linus Torvalds
 
Are we really thinking holistically about data modelling, caching,
and how it affects the entire systems architecture?
Efficient Columnar Storage in Cassandra
Wait, I thought Cassandra was columnar?

Recommended for you

Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir

کارگاه کلان داده در سومین کنفرانس ملی محاسبات توزیعی و پردازش داده های بزرگ دانشگاه شهید مدنی آذربایجان تبریز اردیبهشت 96

اسپارکhadoopبیگ دیتا
Cassndra (4).pptx
Cassndra (4).pptxCassndra (4).pptx
Cassndra (4).pptx

This document provides an overview of Cassandra, a NoSQL database. It discusses that Cassandra is an open source, distributed database designed to handle large amounts of structured data across nodes. The document outlines Cassandra's architecture, which involves distributing data across peer nodes so that there is no single point of failure. It also discusses Cassandra's data model, including keyspaces, column families, and the use of the Cassandra Query Language to define schemas, insert and query data. In closing, the document notes that Cassandra is well-suited for applications that require scaling to handle large, variable workloads across data centers with high performance and availability.

nosql
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai

A talk covering the best-of-breed platform consisting of Spark, Mesos, Akka, Cassandra and Kafka. SMACK is more of a toolbox of technologies to allow the building of resilient ingestion pipelines, offering a high degree of freedom in the selection of analysis and query possibilities and baked in support for flow-control. More and more customers are using this stack, which is rapidly becoming the new industry standard for Big Data solutions. Session can be seen here - in German - https://speakerdeck.com/stefan79/fast-data-smack-down

analyticsbig dataplatform
How Cassandra stores your CQL Tables
Suppose you had this CQL table:
CREATETABLE(
departmenttext,
empIdtext,
firsttext,
lasttext,
ageint,
PRIMARYKEY(department,empId)
);
How Cassandra stores your CQL Tables
PartitionKey 01:first 01:last 01:age 02:first 02:last 02:age
Sales Bob Jones 34 Susan O'Connor 40
Engineering Dilbert P ? Dogbert Dog 1
 
Each row is stored contiguously. All columns in row 2 come after
row 1.
To analyze only age, C* still has to read every field.
Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
The traditional row-based data storage
approach is dead
- Michael Stonebraker

Recommended for you

Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming

My presentation at recently concluded Apache Big Data Conference Europe about the Reliable Low Level Kafka Spark Consumer I developed and an use case of real time indexing to Apache Blur using this consumer

sparkspark streamingkafka
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Application

This document provides an overview of using Cassandra in web applications. It discusses why developers may consider using a NoSQL solution like Cassandra over traditional SQL databases. It then covers topics like Cassandra's architecture, data modeling, configuration options, APIs, development tools, and examples of companies using Cassandra in production systems. Key points emphasized are that Cassandra offers high performance but requires rewriting code and developing new processes and tools to support its flexible schema and data model.

ipc10 cassandra
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users

At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.

databricksapache sparkspark summit
Columnar Storage (Memory)
Name column
0 1
0 1
 
Dictionary: {0: "Barak", 1: "Hillary"}
 
Age column
0 1
46 66
Columnar Storage (Cassandra)
Review: each physical row in Cassandra (e.g. a "partition key")
stores its columns together on disk.
 
Schema CF
Rowkey Type
Name StringDict
Age Int
 
Data CF
Rowkey 0 1
Name 0 1
Age 46 66
Columnar Format solves I/O
Compression
Dictionary compression - HUGE savings for low-cardinality
string columns
RLE, other techniques
Reduce I/O
Only columns needed for query are loaded from disk
Batch multiple rows in one cell for efficiency (avoid cluster key
overhead)
Columnar Format solves Caching
Use the same format on disk, in cache, in memory scan
Caching works a lot better when the cached object is the
same!!
No data format dissonance means bringing in new bits of data
and combining with existing cached data is seamless

Recommended for you

Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish

Cassandra at Pollfish. Athens Cassandra Users MeetUp (Datastax)- 11 March 2015. http://www.meetup.com/Athens-Cassandra-Users/events/220922960/

Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish

Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps. (https://www.pollfish.com/). At pollfish we use Cassandra for difference use cases, eg. for application data store to maximize write throughput when appropriate and for our analytics project to find insights in application generated data. As a medium to accomplish our success so far, we use the Datastax's DSE 4.6 environment which integrates Appache Cassadra, Spark and a hadoop compatible file system (CFS). We will discuss how we started, how the journey was and the impressions gained so far along with some tips learned the hard way. This is a result of joint work of an excellent team here at Pollfish.

cassandrapollfish
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql

This presentation explains how to get started with Apache Cassandra to provide a scale out, fault tolerant backend for inventory storage on OpenSimulator.

apache cassandraopensimulatoroscc14
So, why isn't everybody doing this?
No columnar storage format designed to work with NoSQL
stores
Efficient conversion to/from columnar format a hard problem
Most infrastructure is still row oriented
Spark SQL/DataFrames based on RDD[Row]
Spark Catalyst is a row-oriented query parser
All hard work leads to profit, but mere talk leads
to poverty.
- Proverbs 14:23
Breakthrough OLAP performance with Cassandra and Spark
Columnar Storage Performance Study
 
http://github.com/velvia/cassandra-gdelt

Recommended for you

Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...

How would you build a database to support sustained ingestion of several hundreds of thousands rows per second while running near real-time queries on top? In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in JAVA, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.

time-seriesquestdbdatabase
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics

This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points: - SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes. - Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL. - The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and BI integration help meet requirements for timely processing and quick responses.

sparksparkstreamingsparksql
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong

This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points: - SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes. - Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL. - The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick

spark summit euapache spark
GDELT Dataset
1979 to now
60 columns, 250 million+ rows, 250GB+
Let's compare Cassandra I/O only, no caching or Spark
Global Database of Events, Language, and Tone
The scenarios
1. Narrow table - CQL table with one row per partition key
2. Wide table - wide rows with 10,000 logical rows per partition
key
3. Columnar layout - 1000 rows per columnar chunk, wide rows,
with dictionary compression
First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression.
Compaction performed before read benchmarks.
Query and ingest times
Scenario Ingest Read all
columns
Read one
column
Narrow
table
1927
sec
505 sec 504 sec
Wide
table
3897
sec
365 sec 351 sec
Columnar 93 sec 8.6 sec 0.23 sec
 
On reads, using a columnar format is up to 2190x faster, while
ingestion is 20-40x faster.
Of course, real life perf gains will depend heavily on query,
table width, etc. etc.
Disk space usage
Scenario Disk used
Narrow table 2.7 GB
Wide table 1.6 GB
Columnar 0.34 GB
The disk space usage helps explain some of the numbers.

Recommended for you

Nosql seminar
Nosql seminarNosql seminar
Nosql seminar

The document provides an introduction to NOSQL databases. It begins with basic concepts of databases and DBMS. It then discusses SQL and relational databases. The main part of the document defines NOSQL and explains why NOSQL databases were developed as an alternative to relational databases for handling large datasets. It provides examples of popular NOSQL databases like MongoDB, Cassandra, HBase, and CouchDB and describes their key features and use cases.

Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy

Presented to the Dublin Cassandra User Group by Niall Milton of DigBigData. This presentation is on Cassandra and its use with other technologies such as Storm, Spark, Hadoop, ElasticSearch and Redis. This presentation should act as a solid foundation to explore some of the mentioned technologies in more depth.

sparksharkredis
cassandra
cassandracassandra
cassandra

Apache Cassandra is a highly scalable, distributed database designed to handle large amounts of data across many servers with no single point of failure. It uses a peer-to-peer distributed system where data is replicated across multiple nodes for availability even if some nodes fail. Cassandra uses a column-oriented data model with dynamic schemas and supports fast writes and linear scalability.

big datadatabasenosql
Towards Extreme Query Performance
The filo project
is a binary data vector library
designed for extreme read performance with minimal
deserialization costs.
http://github.com/velvia/filo
Designed for NoSQL, not a file format
random or linear access
on or off heap
missing value support
Scala only, but cross-platform support possible
What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate
of 2 billion integers per second - single threaded:
defsumAllInts():Int={
vartotal=0
for{i<-0untilnumValuesoptimized}{
total+=sc(i)
}
total
}
Vectorization of Spark Queries
The project.Tungsten
Process many elements from the same column at once, keep data
in L1/L2 cache.
Coming in Spark 1.4 through 1.6

Recommended for you

Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins

Quick introduction to the moving parts inside Cassandra and essential commands and tasks for System Administrators.

distributednosqldatabase
Time-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 TalkTime-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 Talk

Slides from my talk at MinneAnalytics 2024 - June 7, 2024 https://datatech2024.sched.com/event/1eO0m/time-state-analytics-a-new-paradigm Across many domains, we see a growing need for complex analytics to track precise metrics at Internet scale to detect issues, identify mitigations, and analyze patterns. Think about delays in airlines (Logistics), food delivery tracking (Apps), detect fraudulent transactions (Fintech), flagging computers for intrusion (Cybersecurity), device health (IoT), and many more. For instance, at Conviva, our customers want to analyze the buffering that users on some types of devices suffer, when using a specific CDN. We refer to such problems as Multidimensional Time-State Analytics. Time-State here refers to the stateful context-sensitive analysis over event streams needed to capture metrics of interest, in contrast to simple aggregations. Multidimensional refers to the need to run ad hoc queries to drill down into subpopulations of interest. Furthermore, we need both real-time streaming and offline retrospective analysis capabilities. In this talk, we will share our experiences to explain why state-of-art systems offer poor abstractions to tackle such workloads and why they suffer from poor cost-performance tradeoffs and significant complexity. We will also describe Conviva’s architectural and algorithmic efforts to tackle these challenges. We present early evidence on how raising the level of abstraction can reduce developer effort, bugs, and cloud costs by (up to) an order of magnitude, and offer a unified framework to support both streaming and retrospective analysis. We will also discuss how our ideas can be plugged into existing pipelines and how our new ``visual'' abstraction can democratize analytics across many domains and to non-programmers.

analyticsbig datastreaming
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust

How we at Conviva ported a streaming data pipeline in months from Scala to Rust. What are the important human and technical factors in our port, and what did we learn?

big datadatasoftware development
Hot Column Caching in Tachyon
Has a "table" feature, originally designed for Shark
Keep hot columnar chunks in shared off-heap memory for fast
access
Introducing FiloDB
 
http://github.com/velvia/FiloDB
What's in the name?
Rich sweet layers of distributed, versioned database goodness
Distributed
Apache Cassandra. Scale out with no SPOF. Cross-datacenter
replication. Proven storage and database technology.

Recommended for you

Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes

Almost all applications have some kind of state. Some data processing apps and databases have huge amounts of state. How do we navigate a cloud-based world of containers where stateless and functions-as-a-service is all the rage? As a long-time architect, designer, and developer of very stateful apps (databases and data processing apps), I’d like to take you on a journey through the modern cloud world and Kubernetes, offering helpful design patterns, considerations, tips, and where things are going. How is Kubernetes shaking up stateful app design?

kubernetescloudstateless
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019

Slides for my talk at Monitorama PDX 2019. Histograms have the potential to give us tools to meet SLO/SLAs, quantile measurements, and very rich heatmap displays for debugging. Their promise has not been fulfilled by TSDB backends however. This talk talks about the concept of histograms as first class citizens in storage. What does accuracy mean for histograms? How can we store and compress rich histograms for evaluation and querying at massive scale? How can we fix some of the issues with histograms in Prometheus, such as proper aggregation, bucketing, avoiding clipping, etc.?

histogramsstatisticsprometheus
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale

My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.

big datakafkastreaming
Versioned
Incrementally add a column or a few rows as a new version. Easily
control what versions to query. Roll back changes inexpensively.
Stream out new versions as continuous queries :)
Columnar
Parquet-style storage layout
Retrieve select columns and minimize I/O for OLAP queries
Add a new column without having to copy the whole table
Vectorization and lazy/zero serialization for extreme
efficiency
100% Reactive
Built completely on the Typesafe Platform:
Scala 2.10 and SBT
Spark (including custom data source)
Akka Actors for rational scale-out concurrency
Futures for I/O
Phantom Cassandra client for reactive, type-safe C* I/O
Typesafe Config
Spark SQL Queries!
SELECTfirst,last,ageFROMcustomers
WHERE_version>3ANDage<40LIMIT100
Read to and write from Spark Dataframes
Append/merge to FiloDB table from Spark Streaming

Recommended for you

Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark

Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.

akkabig datascala
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service

700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!

real-timesparkanalytics
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle

An overview of building scalable data pipelines and specific technologies including Apache #Kafka, #Spark, Spark Streaming, #Cassandra, and #FiloDB.

cassandrasparkkafka
FiloDB vs Parquet
Comparable read performance - with lots of space to improve
Assuming co-located Spark and Cassandra
On localhost, both subsecond for simple queries (GDELT
1979-1984)
FiloDB has more room to grow - due to hot column caching
and much less deserialization overhead
Lower memory requirement due to much smaller block sizes
Much better fit for IoT / Machine / Time-series applications
Limited support for types
array / set / map support not there, but will be added later
Where FiloDB Fits In
Use regular C* denormalized tables for OLTP and single-key
lookups
Use FiloDB for the remaining ad-hoc or more complex
analytical queries
Simplify your analytics infrastructure!
No need to export to Hadoop/Parquet/data warehouse.
Use Spark and C* for both OLAP and OLTP!
Perform ad-hoc OLAP analysis of your time-series, IoT data
Simplify your Lambda Architecture...
( )https://www.mapr.com/developercentral/lambda-architecture
With Spark, Cassandra, and FiloDB
Ma, where did all the components go?
You mean I don't have to deal with Hadoop?
Use Cassandra as a front end to store IoT data first

Recommended for you

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server

You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.

productionscaladevops
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015

Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications. When does one use actors vs futures? Can we use Akka with, or in place of, Storm? How did we set up instrumentation and monitoring in production? How does one use VisualVM to debug Akka apps in production? What happens if the mailbox gets full? What is our Akka stack like? I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.

akkamonitoringscala
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture

Socrata is a software company that provides an open data platform to enable governments to publish and share data with the public and developers in order to spur innovation; their platform allows users to find, explore, and analyze datasets through tools for visualization, analysis, and application building. The document discusses Socrata's architecture and technologies that power their open data platform and allow it to handle large volumes of data and queries in a scalable way.

distributed systemsarchitecture#opendata
Exactly-Once Ingestion from Kafka
New rows appended via Kafka
Writes are idempotent - no need to dedup!
Converted to columnar chunks on ingest and stored in C*
Only necessary columnar chunks are read into Spark for
minimal I/O
You can help!
Send me your use cases for OLAP on Cassandra and Spark
Especially IoT and Geospatial
Email if you want to contribute
Thanks...
to the entire OSS community, but in particular:
Lee Mighdoll, Nest/Google
Rohit Rai and Satya B., Tuplejump
My colleagues at Socrata
 
If you want to go fast, go alone. If you want to go
far, go together.
-- African proverb
DEMO TIME
GDELT: Regular C* Tables vs FiloDB

Recommended for you

OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark

How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets onCassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.

queriesscalaspark
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk

This document discusses Spark Job Server, an open source project that allows Spark jobs to be submitted and run via a REST API. It provides features like job monitoring, context sharing between jobs to reuse cached data, and asynchronous APIs. The document outlines motivations for the project, how to use it including submitting and monitoring jobs, and future plans like high availability and hot failover support.

sparkrest apideployment
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies. We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts. We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.

scalasparkinfrastructure
Extra Slides
When in doubt, use brute force
- Ken Thompson
Automatic Columnar Conversion using
Custom Indexes
Write to Cassandra as you normally do
Custom indexer takes changes, merges and compacts into
columnar chunks behind scenes
Implementing Lambda is Hard
Use real-time pipeline backed by a KV store for new updates
Lots of moving parts
Key-value store, real time sys, batch, etc.
Need to run similar code in two places
Still need to deal with ingesting data to Parquet/HDFS
Need to reconcile queries against two different places

Recommended for you

Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.

distributed systemssparkooyala
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark

"Real-time Analytics with Cassandra, Spark, and Shark", my talk at the Cassandra Summit 2013 conference in San Francisco

#cassandra13cassandraspark
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration

This was our 9th Sem Design Studio Project, introduced as Conservation of Taksar Bazar, Bhojpur, an ancient city famous for Taksar- Making Coins. Taksar Bazaar has a civilization of Newars shifted from Patan, with huge socio-economic and cultural significance having a settlement of about 300 years. But in the present scenario, Taksar Bazar has lost its charm and importance, due to various reasons like, migration, unemployment, shift of economic activities to Bhojpur and many more. The scenario was so pityful that when we went to make inventories, take survey and study the site, the people and the context, we barely found any youth of our age! Many houses were vacant, the earthquake devasted and ruined heritages. Conservation of those heritages, ancient marvels,a nd history was in dire need, so we proposed the Conservation of Taksar through economic regeneration because the lack of economy was the main reason for the people to leave the settlement and the reason for the overall declination.

architectureconservationregeneration

More Related Content

What's hot

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
jbellis
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
leanderlee2
 
YugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on KubernetesYugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on Kubernetes
DoKC
 
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdfOracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
SrirakshaSrinivasan2
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
Yoshinori Matsunobu
 
Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020
Sandesh Rao
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
Cloudera, Inc.
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
Alexander Korotkov
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
Knoldus Inc.
 
Oracle data guard for beginners
Oracle data guard for beginnersOracle data guard for beginners
Oracle data guard for beginners
Pini Dibask
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
HBaseCon
 
Exadata Backup
Exadata BackupExadata Backup
Exadata Backup
Fran Navarro
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
Adapting and adopting spm v04
Adapting and adopting spm v04Adapting and adopting spm v04
Adapting and adopting spm v04
Carlos Sierra
 
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Carlos Sierra
 

What's hot (20)

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
YugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on KubernetesYugabyteDB - Distributed SQL Database on Kubernetes
YugabyteDB - Distributed SQL Database on Kubernetes
 
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdfOracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
Oracle_Multitenant_19c_-_All_About_Pluggable_D.pdf
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020Troubleshooting tips and tricks for Oracle Database Oct 2020
Troubleshooting tips and tricks for Oracle Database Oct 2020
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
 
Oracle data guard for beginners
Oracle data guard for beginnersOracle data guard for beginners
Oracle data guard for beginners
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Exadata Backup
Exadata BackupExadata Backup
Exadata Backup
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Adapting and adopting spm v04
Adapting and adopting spm v04Adapting and adopting spm v04
Adapting and adopting spm v04
 
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
Survey of some free Tools to enhance your SQL Tuning and Performance Diagnost...
 

Similar to Breakthrough OLAP performance with Cassandra and Spark

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
DataStax Academy
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
DataStax Academy
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Cassndra (4).pptx
Cassndra (4).pptxCassndra (4).pptx
Cassndra (4).pptx
NikhilAmauriya
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
Codemotion Dubai
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Application
supertom
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Stavros Kontopoulos
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
David Daeschler
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
javier ramirez
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Nosql seminar
Nosql seminarNosql seminar
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
niallmilton
 
cassandra
cassandracassandra
cassandra
Akash R
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 

Similar to Breakthrough OLAP performance with Cassandra and Spark (20)

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Cassndra (4).pptx
Cassndra (4).pptxCassndra (4).pptx
Cassndra (4).pptx
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Application
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
 
cassandra
cassandracassandra
cassandra
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 

More from Evan Chan

Time-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 TalkTime-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 Talk
Evan Chan
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
Evan Chan
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
Evan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Evan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
Evan Chan
 

More from Evan Chan (16)

Time-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 TalkTime-State Analytics: MinneAnalytics 2024 Talk
Time-State Analytics: MinneAnalytics 2024 Talk
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

Recently uploaded

Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
PriyankaKarn3
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
VICTOR MAESTRE RAMIREZ
 
Introduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer NetworkingIntroduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer Networking
Md.Shohel Rana ( M.Sc in CSE Khulna University of Engineering & Technology (KUET))
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
yadavsuyash008
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
bookhotbebes1
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
Celine George
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
binna singh$A17
 
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
Mani Krishna Sarkar
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
surekha1287
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
Tool and Die Tech
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Sinan KOZAK
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
PradeepKumarSK3
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
maisnampibarel
 
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
Miss Khusi #V08
 
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdfOCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
Muanisa Waras
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
pavanaroshni1977
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
santoshpatilrao33
 
IS Code SP 23: Handbook on concrete mixes
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
Mani Krishna Sarkar
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
hamedmustafa094
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
IJAEMSJORNAL
 

Recently uploaded (20)

Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
 
Introduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer NetworkingIntroduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer Networking
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
 
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeRohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
 
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
Phone Us ❤ X000XX000X ❤ #ℂall #gIRLS In Chennai By Chenai @ℂall @Girls Hotel ...
 
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdfOCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
 
IS Code SP 23: Handbook on concrete mixes
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
 

Breakthrough OLAP performance with Cassandra and Spark

  • 1. Breakthrough OLAP Performance with Cassandra and Spark Evan Chan August 2015
  • 2. Who am I? Distinguished Engineer, @evanfchan User and contributor to Spark since 0.9, Cassandra since 0.6 Co-creator and maintainer of TupleJump http://velvia.github.io Spark Job Server
  • 3. About Tuplejump is a big data technology leader providing solutions for rapid insights from data. Tuplejump - the first Spark-Cassandra integration - an open source Lucene indexer for Cassandra - open source HDFS for Cassandra Calliope Stargate SnackFS
  • 4. Didn't I attend the same talk last year? Similar title, but mostly new material Will reveal new open source projects! :)
  • 5. Problem Space Need analytical database / queries on structured big data Something SQL-like, very flexible and fast Pre-aggregation too limiting Fast data / constant updates Ideally, want my queries to run over fresh data too
  • 6. Example: Video analytics Typical collection and analysis of consumer events 3 billion new events every day Video publishers want updated stats, the sooner the better Pre-aggregation only enables simple dashboard UIs What if one wants to offer more advanced analysis, or a generic data query API? Eg, top countries filtered by device type, OS, browser
  • 7. Requirements Scalable - rules out PostGreSQL, etc. Easy to update and ingest new data Not traditional OLAP cubes - that's not what I'm talking about Very fast for analytical queries - OLAP not OLTP Extremely flexible queries Preferably open source
  • 8. Parquet Widely used, lots of support (Spark, Impala, etc.) Problem: Parquet is read-optimized, not easy to use for writes Cannot support idempotent writes Optimized for writing very large chunks, not small updates Not suitable for time series, IoT, etc. Often needs multiple passes of jobs for compaction of small files, deduplication, etc.   People really want a database-like abstraction, not a file format!
  • 9. Turns out this has been solved before! Even .Facebook uses Vertica
  • 10. MPP Databases Easy writes plus fast queries, with constant transfers Automatic query optimization by storing intermediate query projections Stonebraker, et. al. - paper (Brown Univ)CStore
  • 11. What's wrong with MPP Databases? Closed source $$$ Usually don't scale horizontally that well (or cost is prohibitive)
  • 12. Cassandra Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate Perfect for ingestion of real time / machine data Best of breed storage technology, huge community BUT: Simple queries only OLTP-oriented
  • 13. Apache Spark Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! Huge number of connectors with every single storage technology
  • 14. Spark provides the missing fast, deep analytics piece of Cassandra!
  • 15. Spark and Cassandra OLAP Architectures
  • 16. Separate Storage and Query Layers Combine best of breed storage and query platforms Take full advantage of evolution of each Storage handles replication for availability Query can replicate data for scaling read concurrency - independent!
  • 18. Spark SQL Appeared with Spark 1.0 In-memory columnar store Parquet, Json, Cassandra connector, Avro, many more SQL as well as DataFrames (Pandas-style) API Indexing integrated into data sources (eg C* secondary indexes) Write custom functions in Scala .... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 19. Connecting Spark to Cassandra Datastax's Tuplejump Spark Cassandra Connector Calliope   Get started in one line with spark-shell! bin/spark-shell --packagescom.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 --confspark.cassandra.connection.host=127.0.0.1
  • 20. Caching a SQL Table from Cassandra DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0): valsqlContext=neworg.apache.spark.sql.SQLContext(sc) valdf=sqlContext.read .format("org.apache.spark.sql.cassandra") .options(Map("table"->"gdelt","keyspace"->"test")) .load() df.registerTempTable("gdelt") sqlContext.cacheTable("gdelt") sqlContext.sql("SELECTcount(monthyear)FROMgdelt").show()   Spark does no caching by default - you will always be reading from C*!
  • 21. How Spark SQL's Table Caching Works
  • 22. Spark Cached Tables can be Really Fast GDELT dataset, 4 million rows, 60 columns, localhost Method secs Uncached 317 Cached 0.38   Almost a 1000x speedup! On an 8-node EC2 c3.XL cluster, 117 million rows, can run common queries 1-2 seconds against cached dataset.
  • 23. Tuning Connector Partitioning spark.cassandra.input.split.size Guideline: One split per partition, one partition per CPU core Much more parallelism won't speed up job much, but will starve other C* requests
  • 24. Lesson #1: Take Advantage of Spark Caching!
  • 25. Problems with Cached Tables Still have to read the data from Cassandra first, which is slow Amount of RAM: your entire data + extra for conversion to cached table Cached tables only live in Spark executors - by default tied to single context - not HA once any executor dies, must re-read data from C* Caching takes time: convert from RDD[Row] to compressed columnar format Cannot easily combine new RDD[Row] with cached tables (and keep speed)
  • 26. Problems with Cached Tables If you don't have enough RAM, Spark can cache your tables partly to disk. This is still way, way, faster than scanning an entire C* table. However, cached tables are still tied to a single Spark context/application. Also: rdd.cache()is NOT the same as SQLContext's cacheTable!
  • 27. What about C* Secondary Indexing? Spark-Cassandra Connector and Calliope can both reduce I/O by using Cassandra secondary indices. Does this work with caching? No, not really, because only the filtered rows would be cached. Subsequent queries against this limited cached table would not give you expected results.
  • 29. Intro to Tachyon Tachyon: an in-memory cache for HDFS and other binary data sources Keeps data off-heap, so multiple Spark applications/executors can share data Solves HA problem for data
  • 30. Wait, wait, wait! What am I caching exactly? Tachyon is designed for caching files or binary blobs. A serialized form of CassandraRow/CassandraRDD? Raw output from Cassandra driver? What you really want is this: Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
  • 31. Bad programmers worry about the code. Good programmers worry about data structures. - Linus Torvalds   Are we really thinking holistically about data modelling, caching, and how it affects the entire systems architecture?
  • 32. Efficient Columnar Storage in Cassandra Wait, I thought Cassandra was columnar?
  • 33. How Cassandra stores your CQL Tables Suppose you had this CQL table: CREATETABLE( departmenttext, empIdtext, firsttext, lasttext, ageint, PRIMARYKEY(department,empId) );
  • 34. How Cassandra stores your CQL Tables PartitionKey 01:first 01:last 01:age 02:first 02:last 02:age Sales Bob Jones 34 Susan O'Connor 40 Engineering Dilbert P ? Dogbert Dog 1   Each row is stored contiguously. All columns in row 2 come after row 1. To analyze only age, C* still has to read every field.
  • 35. Cassandra is really a row-based, OLTP-oriented datastore. Unless you know how to use it otherwise :)
  • 36. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 37. Columnar Storage (Memory) Name column 0 1 0 1   Dictionary: {0: "Barak", 1: "Hillary"}   Age column 0 1 46 66
  • 38. Columnar Storage (Cassandra) Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.   Schema CF Rowkey Type Name StringDict Age Int   Data CF Rowkey 0 1 Name 0 1 Age 46 66
  • 39. Columnar Format solves I/O Compression Dictionary compression - HUGE savings for low-cardinality string columns RLE, other techniques Reduce I/O Only columns needed for query are loaded from disk Batch multiple rows in one cell for efficiency (avoid cluster key overhead)
  • 40. Columnar Format solves Caching Use the same format on disk, in cache, in memory scan Caching works a lot better when the cached object is the same!! No data format dissonance means bringing in new bits of data and combining with existing cached data is seamless
  • 41. So, why isn't everybody doing this? No columnar storage format designed to work with NoSQL stores Efficient conversion to/from columnar format a hard problem Most infrastructure is still row oriented Spark SQL/DataFrames based on RDD[Row] Spark Catalyst is a row-oriented query parser
  • 42. All hard work leads to profit, but mere talk leads to poverty. - Proverbs 14:23
  • 44. Columnar Storage Performance Study   http://github.com/velvia/cassandra-gdelt
  • 45. GDELT Dataset 1979 to now 60 columns, 250 million+ rows, 250GB+ Let's compare Cassandra I/O only, no caching or Spark Global Database of Events, Language, and Tone
  • 46. The scenarios 1. Narrow table - CQL table with one row per partition key 2. Wide table - wide rows with 10,000 logical rows per partition key 3. Columnar layout - 1000 rows per columnar chunk, wide rows, with dictionary compression First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression. Compaction performed before read benchmarks.
  • 47. Query and ingest times Scenario Ingest Read all columns Read one column Narrow table 1927 sec 505 sec 504 sec Wide table 3897 sec 365 sec 351 sec Columnar 93 sec 8.6 sec 0.23 sec   On reads, using a columnar format is up to 2190x faster, while ingestion is 20-40x faster. Of course, real life perf gains will depend heavily on query, table width, etc. etc.
  • 48. Disk space usage Scenario Disk used Narrow table 2.7 GB Wide table 1.6 GB Columnar 0.34 GB The disk space usage helps explain some of the numbers.
  • 49. Towards Extreme Query Performance
  • 50. The filo project is a binary data vector library designed for extreme read performance with minimal deserialization costs. http://github.com/velvia/filo Designed for NoSQL, not a file format random or linear access on or off heap missing value support Scala only, but cross-platform support possible
  • 51. What is the ceiling? This Scala loop can read integers from a binary Filo blob at a rate of 2 billion integers per second - single threaded: defsumAllInts():Int={ vartotal=0 for{i<-0untilnumValuesoptimized}{ total+=sc(i) } total }
  • 52. Vectorization of Spark Queries The project.Tungsten Process many elements from the same column at once, keep data in L1/L2 cache. Coming in Spark 1.4 through 1.6
  • 53. Hot Column Caching in Tachyon Has a "table" feature, originally designed for Shark Keep hot columnar chunks in shared off-heap memory for fast access
  • 55. What's in the name? Rich sweet layers of distributed, versioned database goodness
  • 56. Distributed Apache Cassandra. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.
  • 57. Versioned Incrementally add a column or a few rows as a new version. Easily control what versions to query. Roll back changes inexpensively. Stream out new versions as continuous queries :)
  • 58. Columnar Parquet-style storage layout Retrieve select columns and minimize I/O for OLAP queries Add a new column without having to copy the whole table Vectorization and lazy/zero serialization for extreme efficiency
  • 59. 100% Reactive Built completely on the Typesafe Platform: Scala 2.10 and SBT Spark (including custom data source) Akka Actors for rational scale-out concurrency Futures for I/O Phantom Cassandra client for reactive, type-safe C* I/O Typesafe Config
  • 60. Spark SQL Queries! SELECTfirst,last,ageFROMcustomers WHERE_version>3ANDage<40LIMIT100 Read to and write from Spark Dataframes Append/merge to FiloDB table from Spark Streaming
  • 61. FiloDB vs Parquet Comparable read performance - with lots of space to improve Assuming co-located Spark and Cassandra On localhost, both subsecond for simple queries (GDELT 1979-1984) FiloDB has more room to grow - due to hot column caching and much less deserialization overhead Lower memory requirement due to much smaller block sizes Much better fit for IoT / Machine / Time-series applications Limited support for types array / set / map support not there, but will be added later
  • 62. Where FiloDB Fits In Use regular C* denormalized tables for OLTP and single-key lookups Use FiloDB for the remaining ad-hoc or more complex analytical queries Simplify your analytics infrastructure! No need to export to Hadoop/Parquet/data warehouse. Use Spark and C* for both OLAP and OLTP! Perform ad-hoc OLAP analysis of your time-series, IoT data
  • 63. Simplify your Lambda Architecture... ( )https://www.mapr.com/developercentral/lambda-architecture
  • 64. With Spark, Cassandra, and FiloDB Ma, where did all the components go? You mean I don't have to deal with Hadoop? Use Cassandra as a front end to store IoT data first
  • 65. Exactly-Once Ingestion from Kafka New rows appended via Kafka Writes are idempotent - no need to dedup! Converted to columnar chunks on ingest and stored in C* Only necessary columnar chunks are read into Spark for minimal I/O
  • 66. You can help! Send me your use cases for OLAP on Cassandra and Spark Especially IoT and Geospatial Email if you want to contribute
  • 67. Thanks... to the entire OSS community, but in particular: Lee Mighdoll, Nest/Google Rohit Rai and Satya B., Tuplejump My colleagues at Socrata   If you want to go fast, go alone. If you want to go far, go together. -- African proverb
  • 68. DEMO TIME GDELT: Regular C* Tables vs FiloDB
  • 70. When in doubt, use brute force - Ken Thompson
  • 71. Automatic Columnar Conversion using Custom Indexes Write to Cassandra as you normally do Custom indexer takes changes, merges and compacts into columnar chunks behind scenes
  • 72. Implementing Lambda is Hard Use real-time pipeline backed by a KV store for new updates Lots of moving parts Key-value store, real time sys, batch, etc. Need to run similar code in two places Still need to deal with ingesting data to Parquet/HDFS Need to reconcile queries against two different places