SnappyData
Unified stream and interactive analytics in a single in-memory cluster with Spark
www.snappydata.io
Jags Ramnarayan
CTO, Co-founder SnappyData
Feb 2016
SnappyData - an EMC/Pivotal spin-out
● New Spark-based open source project started by Pivotal GemFire founders and engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP + OLAP database
www.snappydata.io
Lambda Architecture (LA) for Analytics
Is Lambda Complex?
[Diagram: the application federates queries across Impala/HBase and Cassandra, Redis, ..]
• Application has to federate queries?
• Complex application – deal with multiple data models? Disparate APIs?
• Slow?
• Expensive to maintain?
Can we simplify, optimize?
Can the batch and speed layer use a single, unified serving DB?
Deeper Look – Ad Impression Analytics
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
• Stream micro batches in parallel from Kafka to each Spark executor
• Emit key-value pairs so we can group by Publisher, Geo
• Execute GROUP BY … expensive Spark shuffle …
• Shuffle again in the DB cluster … data format changes … serialization costs
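A minimal sketch of this write path, assuming an existing SparkContext sc, the Spark 1.x direct Kafka API, the 'adnetwork-topic' topic from the referenced pipeline, and hypothetical parseImpression/saveToStore helpers; the groupByKey triggers the expensive Spark shuffle, and pushing the result into a separate store repartitions and re-serializes the data once more:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Micro batches are pulled in parallel from Kafka into each Spark executor
val logs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("adnetwork-topic"))

logs
  .map { case (_, line) => parseImpression(line) }   // decode the raw record (hypothetical helper)
  .map(imp => ((imp.publisher, imp.geo), imp.bid))   // emit key-value pairs keyed by (Publisher, Geo)
  .groupByKey()                                      // GROUP BY => expensive Spark shuffle
  .mapValues(bids => bids.sum / bids.size)           // e.g. average bid per (Publisher, Geo)
  .foreachRDD(rdd => saveToStore(rdd))               // second shuffle in the DB cluster, format change,
                                                     // serialization costs (hypothetical sink helper)
ssc.start()
ssc.awaitTermination()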
Bottlenecks in the Write Path
Shuffle costs:
• Aggregations – GroupBy, MapReduce
• Joins with other streams, reference data
• Replication (HA) in the fast data store
Copying, serialization:
• Data models across Spark and the fast data store are different
• Data flows through too many processes
• In-JVM copying
• Excessive copying in Java-based scale-out stores
Goal – Localize processing
• Can we localize processing with state and avoid the shuffle?
• Can Kafka, Spark partitions and the data store share the same partitioning policy? (partition by Advertiser, Geo)
-- Show Cassandra, MemSQL ingestion performance (maybe later section?)
• Embedded state: Spark’s native MapWithState, Apache Samza, Flink? Good, scalable KV stores, but is this enough?
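For the "embedded state" option, a rough sketch using Spark's mapWithState (Spark 1.6+): the running aggregate per (Publisher, Geo) key stays inside the executors instead of being shuffled out to an external store. The pairs DStream and field names are illustrative, following the earlier sketch.

import org.apache.spark.streaming.{State, StateSpec}

// Keep a running (count, sum of bids) per (publisher, geo) key in executor memory
def updateImpressions(key: (String, String),
                      bid: Option[Double],
                      state: State[(Long, Double)]): ((String, String), (Long, Double)) = {
  val (count, sum) = state.getOption.getOrElse((0L, 0.0))
  val updated = (count + 1, sum + bid.getOrElse(0.0))
  state.update(updated)
  (key, updated)
}

// pairs: DStream[((String, String), Double)] keyed by (publisher, geo), value = bid
val runningAggregates = pairs.mapWithState(StateSpec.function(updateImpressions _))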
Impedance mismatch for Interactive Queries
On ingested ad impressions we want to run queries like:
- Find total uniques for a certain ad, grouped on geography and month
- Impression trends for advertisers, or detect outliers in the bidding price
Unfortunately, in most KV-oriented stores this is very inefficient:
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in a KV store is poor
Why columnar storage?
In-memory Columns but still slow?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
Can we use statistical methods to shrink data?
• It is not always possible to store the data in full: many applications (telecoms, ISPs, search engines) can’t keep everything
• It is inconvenient to work with data in full: just because we can, doesn’t mean we should
• It is faster to work with a compact summary: better to explore data on a laptop than a cluster
Ref: Graham Cormode - Sampling for Big Data
Can we use statistical techniques to understand the data and synthesize something very small, but still answer analytical queries?
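A toy illustration of the idea (plain Spark, not SnappyData's AQP engine): answer an aggregate from a small uniform sample of the uservisits DataFrame from the benchmark above and scale the result back up. Getting error bounds right needs stratified sampling, which is covered later.

import org.apache.spark.sql.functions.sum

val fraction = 0.01  // keep roughly 1% of the rows
val sampled = uservisits.sample(withReplacement = false, fraction, seed = 42L)

// Approximate total ad revenue: aggregate the sample, then scale by 1/fraction
val approxRevenue = sampled.agg(sum("adRevenue")).first().getDouble(0) / fraction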
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Vision: Drastically reduce the cost and complexity in modern big data
[Diagram: Spark – batch design, high throughput, rapidly maturing; in-memory store – real-time design center (low latency, HA, concurrent), matured over 13 years]
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Real-time operational analytics – TBs in memory
First commercial project on Approximate Query Processing (AQP)
[Diagram: unified engine combining an RDB (rows, transactions, indexes), a columnar store, stream processing and AQP, with HDFS/MPP DB integration; accessed via ODBC, JDBC, REST and Spark APIs in Scala, Java, Python, R]
Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore’s law
● 100% Spark compatible – powerful, concise; simplify the runtime
● Aim for Google-Search-like speed for analytic queries
● Dramatically reduce the costs associated with analytics: distributed systems are expensive, especially in production
Use Case Patterns
• Stream ingestion database for Spark: process streams, transform, real-time scoring, store, query
• In-memory database for apps: highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern: caching for analytics over any “Big Data” store (esp. MPP); federate queries between samples and the backend
Why Spark – Confluence of streaming, interactive, batch
• Unifies batch, streaming and interactive computation
• Easy to build sophisticated applications
• Support for iterative, graph-parallel algorithms
• Powerful APIs in Scala, Python, Java
[Diagram: Spark stack – Spark core with Spark Streaming, SQL, BlinkDB, GraphX and MLlib, spanning streaming, batch, interactive, data-parallel/iterative and sophisticated algorithms]
Source: Spark Summit presentation from Ion Stoica
Spark Cluster
Snappy Spark Cluster Deployment topologies
• Unified Cluster: Snappy store and Spark executor share the JVM memory; reference-based access – zero copy
• Split Cluster: the SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of the DataFrame APIs …>
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name', // default: none
  PARTITION_BY 'PRIMARY KEY | column name', // default: replicated table
  REDUNDANCY '1', // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
  // an empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …
)
[AS select_statement];
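A hypothetical instance of this DDL for the ad impression example, issued through the Snappy context; the table and column names are illustrative, and the options simply follow the grammar above rather than the product's exact reference:

snappyContext.sql("""
  CREATE TABLE adImpressions (
    publisher VARCHAR(64),
    geo       VARCHAR(32),
    bid       DOUBLE,
    cookie    VARCHAR(128)
  ) USING 'COLUMN'
  OPTIONS (
    PARTITION_BY 'publisher, geo',  -- align with the Kafka/stream partitioning
    REDUNDANCY '1',                 -- one extra copy for HA
    EVICTION_BY 'HEAPPERCENT'       -- evict/overflow when the heap fills up
  )""")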
Simple to Ingest Streams using SQL
Consume from stream → Transform raw data → Continuous analytics → Ingest into the in-memory store → Overflow table to HDFS

create stream table AdImpressionLog
  (<columns>) using directkafka_stream options (
    <socket endpoints>
    "topics 'adnetwork-topic'",
    "rowConverter 'AdImpressionLogAvroDecoder'")

streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps,
     count(distinct(cookie)) uniques from AdImpressionLog
   window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo") // register the CQ
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL
AdImpression Demo
Performance comparison to Cassandra,
MemSQL - TBD
AdImpression Ingest Performance
[Chart: loading from Parquet files using 4 cores vs. stream ingest (Kafka + Spark Streaming into the store) using 1 core]
Why is ingestion and querying fast?
• Linear scaling with partition pruning
• The input queue, stream, and in-memory DB all share the same partitioning strategy
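As a sketch of the pruning claim (using the hypothetical adImpressions table defined earlier, partitioned by publisher, geo): a filter on the partitioning columns lets the engine route the query to the buckets that can hold the data instead of fanning out across the whole cluster.

// The predicate is on the partitioning key, so only the matching
// partition/bucket needs to be scanned
val usTrend = snappyContext.sql("""
  SELECT avg(bid) AS avg_bid, count(*) AS imps
  FROM adImpressions
  WHERE publisher = 'pub42' AND geo = 'US'""")
usTrend.show()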
How does it scale with concurrency?
● A parallel query engine skips Spark SQL scheduling for low-latency queries
● Column tables: for fast scans/aggregations; also automatically compressed
● Row tables: fast key-based or selective queries; can have any number of secondary indices
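A small sketch of the two access patterns (table names reuse the earlier examples; campaignRef is assumed to be a row table keyed by campaign_id):

// Row table: key-based, selective lookups stay cheap, optionally served by a secondary index
val campaign = snappyContext.sql(
  "SELECT * FROM campaignRef WHERE campaign_id = 42")

// Column table: scans and aggregations over many rows benefit from the
// compressed columnar layout
val topPublishers = snappyContext.sql("""
  SELECT publisher, count(*) AS imps
  FROM adImpressions
  GROUP BY publisher
  ORDER BY imps DESC""")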
How does it scale with concurrency?
● Distributed shuffles for joins, ordering, etc. are expensive
● Two techniques to minimize this:
  1) Colocation of related tables: all related records are collocated. For instance, the 'Users' table is partitioned across nodes and all related 'Ad impressions' can be collocated on the same partition; the parallel query then executes the join locally on each partition.
  2) Replication of 'dimension' tables: joins to dimension tables are always localized.
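A hedged sketch of these two techniques using the CREATE TABLE options from the "Extends Spark" slide; the schemas are illustrative:

// Parent table, partitioned across the cluster
snappyContext.sql("""
  CREATE TABLE users (
    advertiser VARCHAR(64), geo VARCHAR(32), segment VARCHAR(32)
  ) USING 'ROW' OPTIONS (PARTITION_BY 'advertiser, geo')""")

// 1) Colocation: related ad impressions land on the same partition as their
//    user records, so the join executes locally on each node
snappyContext.sql("""
  CREATE TABLE impressions (
    advertiser VARCHAR(64), geo VARCHAR(32), bid DOUBLE
  ) USING 'COLUMN' OPTIONS (
    PARTITION_BY 'advertiser, geo',
    COLOCATE_WITH 'users')""")

// 2) Replication: a small dimension table with no PARTITION_BY is replicated
//    to every node, so joins against it never shuffle
snappyContext.sql("""
  CREATE TABLE geoRef (geo VARCHAR(32), region VARCHAR(32)) USING 'ROW'""")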
Low latency Interactive Analytic Queries – Exact or Approximate
Exact:
Select avg(volume), symbol from T1 where <time range> group by symbol
Approximate:
Select avg(volume), symbol from T1 where <time range> group by symbol
  with error 0.1 confidence 0.8
Speed/Accuracy tradeoff
[Chart: error vs. execution time (sample size) – interactive queries on samples answer in ~2 secs, versus ~100 secs to 30 mins to execute on the entire dataset]
Key feature: Synopses Data
● Maintain stratified samples
  ○ Intelligent sampling to keep error bounds low
● Probabilistic data structures
  ○ TopK for time series (using time aggregation CMS, item aggregation)
  ○ Histograms, HyperLogLog, Bloom filters, Wavelets

CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
  BASETABLE 'table_name' // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+ // one or more QCS
)
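Putting the pieces together, a hypothetical stratified sample over the ad impressions column table, queried with the error/confidence clause from the earlier slide; the QCS, fraction and names are illustrative and follow the slide's grammar:

// Stratified sample whose QCS (query column set) matches the grouping columns we query on
snappyContext.sql("""
  CREATE SAMPLE TABLE adImpressions_sample USING columnar
  OPTIONS (
    BASETABLE 'adImpressions',
    SAMPLINGMETHOD 'stratified',
    STRATA s1 (
      QCS ('publisher, geo')
      FRACTION '0.03'
    )
  )""")

// Approximate aggregation with an error bound, instead of scanning the full base table
val approxUniques = snappyContext.sql("""
  SELECT publisher, geo, count(distinct cookie) AS uniques
  FROM adImpressions
  GROUP BY publisher, geo
  WITH ERROR 0.1 CONFIDENCE 0.8""")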
Unified Cluster Architecture
How do we extend Spark for Real Time?
• Spark executors are long-running; a driver failure doesn’t shut down the executors
• Driver HA – drivers run “managed”, with a standby secondary
• Data HA – consensus-based clustering integrated for eager replication
How do we extend Spark for Real Time?
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) – colocation optimizations, index use, etc.
• Full SQL support – persistent catalog, transactions, DML
Performance – Spark vs Snappy (TPC-H)
See the ACM SIGMOD 2016 paper for details, available on the snappydata.io blog
Performance – Snappy vs in-memory DB (YCSB)
Unified OLAP/OLTP and streaming with Spark
● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ A single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
  ○ Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Available for download on Github today
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
www.snappydata.io
Colocated row/column Tables in Spark
[Diagram: three Spark executors, each hosting row tables, column tables, stream-processing tasks and the Spark Block Manager]
● Spark executors are long-lived and shared across multiple apps
● The Snappy memory manager and the Spark Block Manager are integrated
Table can be partitioned or replicated
[Diagram: a partitioned table spread across nodes in buckets (A-H, I-P, Q-W) with partition replicas on other nodes, and a replicated table with a consistent replica on each node]
● Replicated tables keep a consistent replica on each node
● Partitioned data has one or more replicas
● Use partitioned tables for large fact tables and replicated tables for small dimension tables
SnappyData Components
[Diagram: Spark core with micro-batch streaming and the Spark SQL Catalyst layer on top of SnappyData's transaction (TXN), OLAP job scheduler, OLAP/OLTP query and AQP engines; a P2P cluster replication service; row/column tables, indexes, sample tables and shared-nothing logs; accessed from Spark programs and via JDBC/ODBC]

Editor's Notes

  1. There is a reciprocal relationship with Spark RDDs/DataFrames: any table is visible as a DataFrame and vice versa. Hence, all the Spark APIs and transformations can also be applied to Snappy-managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a Snappy table, as shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema, i.e. turn it into a DataFrame and store it.
  2. The SQL dialect will be Spark SQL++, i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc. are all specified through OPTIONS in Spark SQL.
  3. CREATE HDFSSTORE streamingstore NameNode 'hdfs://gfxd1:8020' HomeDir 'stream-tables' BatchSize 10 BatchTimeInterval 2000 milliseconds QueuePersistent true MaxWriteOnlyFileSize 200 WriteOnlyFileRolloverInterval 1 minute;
  4. And, of course, the whole point behind colocation is to scale linearly with minimal or even no shuffling. So, for instance, when using Kafka, all three components - Kafka, the native RDD in Spark and the table in Snappy - can share the same partitioning strategy. As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node - the queue, the Spark processing of partitions and the related reference data in the Snappy store.
  5. When it comes to interactive analytics, a lot is exploratory in nature. Folks are looking at trends for different time periods, studying outlier patterns, etc. Unfortunately, as pointed out before, analytic queries can take a long time even when in-memory. We want such exploratory analytics to ultimately happen at Google-like speeds: don't break the speed of thought. In many cases, do we really need a precise answer, like when watching a trend on a visualization tool? We are throwing linear improvements at what seems like an exponential problem - like in some IoT scenarios. Stratified sampling allows the user to sample more intelligently so we can answer queries with a very small fraction of the data with good accuracy. What we do is allow the user to create one or more stratified samples on some “base” table data. The base table may be all in-memory too, or, more often than not, could reside in HDFS.
  6. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  7. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  8. By default, we start the Spark cluster in an “embedded” mode, i.e. the in-memory store is fully collocated and in the same process space. We had to change the Spark block manager so both Gem and Spark share the same space for tables, cached RDDs, shuffle space, sorting, etc. This space can extend from the JVM heap to off-heap. GemFire proactively monitors the JVM “old gen” so it never goes beyond a certain critical threshold, i.e. we do a number of things so you don’t run OOM. We are hoping to contribute this back to Spark. And, when running in the embedded mode, we also make sure the executors are long-lived, i.e. the life cycle of these nodes is no longer tied to driver availability. Everything Spark does is cleaned up as expected, though.
  9. The partitioning strategy, by default, is the same as Spark's. We try to do a uniform random distribution of the records across all the nodes designated to host a partitioned table. Any table can have one or more replicas. Replicas are always consistent with each other - sync writes. We send the write to each replica in parallel and wait for ACKs. If an ACK is not received, we start SUSPECT processing. Replicated tables, by default, are replicated to each node. Replicas are guaranteed to be consistent when failures occur, i.e. when the failed node rejoins. How to recreate the state of the replica while thousands of other concurrent writes are in progress is a hard problem to solve.