SnappyData
Unified stream and interactive analytics in a single in-memory cluster with Spark
www.snappydata.io
Jags Ramnarayan
CTO, Co-founder SnappyData
Feb 2016
SnappyData - an EMC/Pivotal spin-out
● New Spark-based open source project started by Pivotal GemFire founders and engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP + OLAP database
www.snappydata.io
Lambda Architecture (LA) for Analytics
Is Lambda Complex?
[Diagram: the application federates queries across Impala/HBase and Cassandra, Redis, ..]
• Application has to federate queries?
• Complex application – deal with multiple data models? Disparate APIs?
• Slow?
• Expensive to maintain?
Can we simplify, optimize?
Can the batch and speed layer use a single, unified serving DB?
Deeper Look – Ad Impression Analytics
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
• Stream micro batches in parallel from Kafka to each Spark executor
• Emit key-value pairs so we can group by Publisher, Geo
• Execute GROUP BY … expensive Spark shuffle …
• Shuffle again in the DB cluster … data format changes … serialization costs
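A minimal sketch of this write path, assuming an existing SparkContext sc, the Spark 1.x direct Kafka API, the 'adnetwork-topic' topic from the referenced pipeline, and hypothetical parseImpression/saveToStore helpers; the groupByKey triggers the expensive Spark shuffle, and pushing the result into a separate store repartitions and re-serializes the data once more:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Micro batches are pulled in parallel from Kafka into each Spark executor
val logs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("adnetwork-topic"))

logs
  .map { case (_, line) => parseImpression(line) }   // decode the raw record (hypothetical helper)
  .map(imp => ((imp.publisher, imp.geo), imp.bid))   // emit key-value pairs keyed by (Publisher, Geo)
  .groupByKey()                                      // GROUP BY => expensive Spark shuffle
  .mapValues(bids => bids.sum / bids.size)           // e.g. average bid per (Publisher, Geo)
  .foreachRDD(rdd => saveToStore(rdd))               // second shuffle in the DB cluster, format change,
                                                     // serialization costs (hypothetical sink helper)
ssc.start()
ssc.awaitTermination()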
Bottlenecks in the Write Path
Shuffle costs:
• Aggregations – GroupBy, MapReduce
• Joins with other streams, reference data
• Replication (HA) in the fast data store
Copying, serialization:
• Data models across Spark and the fast data store are different
• Data flows through too many processes
• In-JVM copying
• Excessive copying in Java-based scale-out stores
Goal – Localize processing
• Can we localize processing with state and avoid the shuffle?
• Can Kafka, Spark partitions and the data store share the same partitioning policy? (partition by Advertiser, Geo)
-- Show Cassandra, MemSQL ingestion performance (maybe later section?)
• Embedded state: Spark’s native MapWithState, Apache Samza, Flink? Good, scalable KV stores, but is this enough?
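For the "embedded state" option, a rough sketch using Spark's mapWithState (Spark 1.6+): the running aggregate per (Publisher, Geo) key stays inside the executors instead of being shuffled out to an external store. The pairs DStream and field names are illustrative, following the earlier sketch.

import org.apache.spark.streaming.{State, StateSpec}

// Keep a running (count, sum of bids) per (publisher, geo) key in executor memory
def updateImpressions(key: (String, String),
                      bid: Option[Double],
                      state: State[(Long, Double)]): ((String, String), (Long, Double)) = {
  val (count, sum) = state.getOption.getOrElse((0L, 0.0))
  val updated = (count + 1, sum + bid.getOrElse(0.0))
  state.update(updated)
  (key, updated)
}

// pairs: DStream[((String, String), Double)] keyed by (publisher, geo), value = bid
val runningAggregates = pairs.mapWithState(StateSpec.function(updateImpressions _))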
Impedance mismatch for Interactive Queries
On ingested ad impressions we want to run queries like:
- Find total uniques for a certain ad, grouped on geography and month
- Impression trends for advertisers, or detect outliers in the bidding price
Unfortunately, in most KV-oriented stores this is very inefficient:
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in a KV store is poor
Why columnar storage?
In-memory Columns but still slow?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
Can we use statistical methods to shrink data?
• It is not always possible to store the data in full: many applications (telecoms, ISPs, search engines) can’t keep everything
• It is inconvenient to work with data in full: just because we can, doesn’t mean we should
• It is faster to work with a compact summary: better to explore data on a laptop than a cluster
Ref: Graham Cormode - Sampling for Big Data
Can we use statistical techniques to understand the data and synthesize something very small, but still answer analytical queries?
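A toy illustration of the idea (plain Spark, not SnappyData's AQP engine): answer an aggregate from a small uniform sample of the uservisits DataFrame from the benchmark above and scale the result back up. Getting error bounds right needs stratified sampling, which is covered later.

import org.apache.spark.sql.functions.sum

val fraction = 0.01  // keep roughly 1% of the rows
val sampled = uservisits.sample(withReplacement = false, fraction, seed = 42L)

// Approximate total ad revenue: aggregate the sample, then scale by 1/fraction
val approxRevenue = sampled.agg(sum("adRevenue")).first().getDouble(0) / fraction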
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Vision: Drastically reduce the cost and complexity in modern big data
[Diagram: Spark – batch design, high throughput, rapidly maturing; in-memory store – real-time design center (low latency, HA, concurrent), matured over 13 years]
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Real-time operational analytics – TBs in memory
First commercial project on Approximate Query Processing (AQP)
[Diagram: unified engine combining an RDB (rows, transactions, indexes), a columnar store, stream processing and AQP, with HDFS/MPP DB integration; accessed via ODBC, JDBC, REST and Spark APIs in Scala, Java, Python, R]
Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore’s law
● 100% Spark compatible – powerful, concise; simplify the runtime
● Aim for Google-Search-like speed for analytic queries
● Dramatically reduce the costs associated with analytics: distributed systems are expensive, especially in production
Use Case Patterns
• Stream ingestion database for Spark: process streams, transform, real-time scoring, store, query
• In-memory database for apps: highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern: caching for analytics over any “Big Data” store (esp. MPP); federate queries between samples and the backend
Why Spark – Confluence of streaming, interactive, batch
• Unifies batch, streaming and interactive computation
• Easy to build sophisticated applications
• Support for iterative, graph-parallel algorithms
• Powerful APIs in Scala, Python, Java
[Diagram: Spark stack – Spark core with Spark Streaming, SQL, BlinkDB, GraphX and MLlib, spanning streaming, batch, interactive, data-parallel/iterative and sophisticated algorithms]
Source: Spark Summit presentation from Ion Stoica
Spark Cluster
Snappy Spark Cluster Deployment topologies
• Unified Cluster: Snappy store and Spark executor share the JVM memory; reference-based access – zero copy
• Split Cluster: the SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access a table as a DataFrame; the catalog is automatically recovered
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of the DataFrame APIs …>
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name', // default: none
  PARTITION_BY 'PRIMARY KEY | column name', // default: replicated table
  REDUNDANCY '1', // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
  // an empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …
)
[AS select_statement];
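A hypothetical instance of this DDL for the ad impression example, issued through the Snappy context; the table and column names are illustrative, and the options simply follow the grammar above rather than the product's exact reference:

snappyContext.sql("""
  CREATE TABLE adImpressions (
    publisher VARCHAR(64),
    geo       VARCHAR(32),
    bid       DOUBLE,
    cookie    VARCHAR(128)
  ) USING 'COLUMN'
  OPTIONS (
    PARTITION_BY 'publisher, geo',  -- align with the Kafka/stream partitioning
    REDUNDANCY '1',                 -- one extra copy for HA
    EVICTION_BY 'HEAPPERCENT'       -- evict/overflow when the heap fills up
  )""")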
Simple to Ingest Streams using SQL
Consume from stream → Transform raw data → Continuous analytics → Ingest into the in-memory store → Overflow table to HDFS

create stream table AdImpressionLog
  (<columns>) using directkafka_stream options (
    <socket endpoints>
    "topics 'adnetwork-topic'",
    "rowConverter 'AdImpressionLogAvroDecoder'")

streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps,
     count(distinct(cookie)) uniques from AdImpressionLog
   window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo") // register the CQ
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL
AdImpression Demo
Performance comparison to Cassandra,
MemSQL - TBD
AdImpression Ingest Performance
[Chart: loading from Parquet files using 4 cores vs. stream ingest (Kafka + Spark Streaming into the store) using 1 core]
Why is ingestion and querying fast?
• Linear scaling with partition pruning
• The input queue, stream, and in-memory DB all share the same partitioning strategy
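As a sketch of the pruning claim (using the hypothetical adImpressions table defined earlier, partitioned by publisher, geo): a filter on the partitioning columns lets the engine route the query to the buckets that can hold the data instead of fanning out across the whole cluster.

// The predicate is on the partitioning key, so only the matching
// partition/bucket needs to be scanned
val usTrend = snappyContext.sql("""
  SELECT avg(bid) AS avg_bid, count(*) AS imps
  FROM adImpressions
  WHERE publisher = 'pub42' AND geo = 'US'""")
usTrend.show()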
How does it scale with concurrency?
● A parallel query engine skips Spark SQL scheduling for low-latency queries
● Column tables: for fast scans/aggregations; also automatically compressed
● Row tables: fast key-based or selective queries; can have any number of secondary indices
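A small sketch of the two access patterns (table names reuse the earlier examples; campaignRef is assumed to be a row table keyed by campaign_id):

// Row table: key-based, selective lookups stay cheap, optionally served by a secondary index
val campaign = snappyContext.sql(
  "SELECT * FROM campaignRef WHERE campaign_id = 42")

// Column table: scans and aggregations over many rows benefit from the
// compressed columnar layout
val topPublishers = snappyContext.sql("""
  SELECT publisher, count(*) AS imps
  FROM adImpressions
  GROUP BY publisher
  ORDER BY imps DESC""")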
How does it scale with concurrency?
● Distributed shuffles for joins, ordering, etc. are expensive
● Two techniques to minimize this:
  1) Colocation of related tables: all related records are collocated. For instance, the 'Users' table is partitioned across nodes and all related 'Ad impressions' can be collocated on the same partition; the parallel query then executes the join locally on each partition.
  2) Replication of 'dimension' tables: joins to dimension tables are always localized.
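A hedged sketch of these two techniques using the CREATE TABLE options from the "Extends Spark" slide; the schemas are illustrative:

// Parent table, partitioned across the cluster
snappyContext.sql("""
  CREATE TABLE users (
    advertiser VARCHAR(64), geo VARCHAR(32), segment VARCHAR(32)
  ) USING 'ROW' OPTIONS (PARTITION_BY 'advertiser, geo')""")

// 1) Colocation: related ad impressions land on the same partition as their
//    user records, so the join executes locally on each node
snappyContext.sql("""
  CREATE TABLE impressions (
    advertiser VARCHAR(64), geo VARCHAR(32), bid DOUBLE
  ) USING 'COLUMN' OPTIONS (
    PARTITION_BY 'advertiser, geo',
    COLOCATE_WITH 'users')""")

// 2) Replication: a small dimension table with no PARTITION_BY is replicated
//    to every node, so joins against it never shuffle
snappyContext.sql("""
  CREATE TABLE geoRef (geo VARCHAR(32), region VARCHAR(32)) USING 'ROW'""")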
Low latency Interactive Analytic Queries – Exact or Approximate
Exact:
Select avg(volume), symbol from T1 where <time range> group by symbol
Approximate:
Select avg(volume), symbol from T1 where <time range> group by symbol
  with error 0.1 confidence 0.8
Speed/Accuracy tradeoff
[Chart: error vs. execution time (sample size) – interactive queries on samples answer in ~2 secs, versus ~100 secs to 30 mins to execute on the entire dataset]
Key feature: Synopses Data
● Maintain stratified samples
  ○ Intelligent sampling to keep error bounds low
● Probabilistic data structures
  ○ TopK for time series (using time aggregation CMS, item aggregation)
  ○ Histograms, HyperLogLog, Bloom filters, Wavelets

CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
  BASETABLE 'table_name' // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+ // one or more QCS
)
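Putting the pieces together, a hypothetical stratified sample over the ad impressions column table, queried with the error/confidence clause from the earlier slide; the QCS, fraction and names are illustrative and follow the slide's grammar:

// Stratified sample whose QCS (query column set) matches the grouping columns we query on
snappyContext.sql("""
  CREATE SAMPLE TABLE adImpressions_sample USING columnar
  OPTIONS (
    BASETABLE 'adImpressions',
    SAMPLINGMETHOD 'stratified',
    STRATA s1 (
      QCS ('publisher, geo')
      FRACTION '0.03'
    )
  )""")

// Approximate aggregation with an error bound, instead of scanning the full base table
val approxUniques = snappyContext.sql("""
  SELECT publisher, geo, count(distinct cookie) AS uniques
  FROM adImpressions
  GROUP BY publisher, geo
  WITH ERROR 0.1 CONFIDENCE 0.8""")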
Unified Cluster Architecture
How do we extend Spark for Real Time?
• Spark executors are long-running; a driver failure doesn’t shut down the executors
• Driver HA – drivers run “managed”, with a standby secondary
• Data HA – consensus-based clustering integrated for eager replication
How do we extend Spark for Real Time?
• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) – colocation optimizations, index use, etc.
• Full SQL support – persistent catalog, transactions, DML
Performance – Spark vs Snappy (TPC-H)
See the ACM SIGMOD 2016 paper for details, available on the snappydata.io blog
Performance – Snappy vs in-memory DB (YCSB)
Unified OLAP/OLTP and streaming with Spark
● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ A single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
  ○ Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Available for download on Github today
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
www.snappydata.io
Colocated row/column Tables in Spark
[Diagram: three Spark executors, each hosting row tables, column tables, stream-processing tasks and the Spark Block Manager]
● Spark executors are long-lived and shared across multiple apps
● The Snappy memory manager and the Spark Block Manager are integrated
Table can be partitioned or replicated
[Diagram: a partitioned table spread across nodes in buckets (A-H, I-P, Q-W) with partition replicas on other nodes, and a replicated table with a consistent replica on each node]
● Replicated tables keep a consistent replica on each node
● Partitioned data has one or more replicas
● Use partitioned tables for large fact tables and replicated tables for small dimension tables
SnappyData Components
[Diagram: Spark core with micro-batch streaming and the Spark SQL Catalyst layer on top of SnappyData's transaction (TXN), OLAP job scheduler, OLAP/OLTP query and AQP engines; a P2P cluster replication service; row/column tables, indexes, sample tables and shared-nothing logs; accessed from Spark programs and via JDBC/ODBC]

Editor's Notes

  1. There is a reciprocal relationship with Spark RDDs/DataFrames: any table is visible as a DataFrame and vice versa. Hence, all the Spark APIs and transformations can also be applied to Snappy-managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a Snappy table, as shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema, i.e. turn it into a DataFrame and store it.
  2. The SQL dialect will be Spark SQL++, i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc. are all specified through OPTIONS in Spark SQL.
  3. CREATE HDFSSTORE streamingstore NameNode 'hdfs://gfxd1:8020' HomeDir 'stream-tables' BatchSize 10 BatchTimeInterval 2000 milliseconds QueuePersistent true MaxWriteOnlyFileSize 200 WriteOnlyFileRolloverInterval 1 minute;
  4. And, of course, the whole point behind colocation is to scale linearly with minimal or even no shuffling. So, for instance, when using Kafka, all three components - Kafka, the native RDD in Spark and the table in Snappy - can share the same partitioning strategy. As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node - the queue, the Spark processing of partitions and the related reference data in the Snappy store.
  5. When it comes to interactive analytics, a lot is exploratory in nature. Folks are looking at trends for different time periods, studying outlier patterns, etc. Unfortunately, as pointed out before, analytic queries can take a long time even when in-memory. We want such exploratory analytics to ultimately happen at Google-like speeds: don't break the speed of thought. In many cases, do we really need a precise answer, like when watching a trend on a visualization tool? We are throwing linear improvements at what seems like an exponential problem - like in some IoT scenarios. Stratified sampling allows the user to sample more intelligently so we can answer queries with a very small fraction of the data with good accuracy. What we do is allow the user to create one or more stratified samples on some “base” table data. The base table may be all in-memory too, or, more often than not, could reside in HDFS.
  6. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  7. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  8. By default, we start the Spark cluster in an “embedded” mode, i.e. the in-memory store is fully collocated and in the same process space. We had to change the Spark block manager so both Gem and Spark share the same space for tables, cached RDDs, shuffle space, sorting, etc. This space can extend from the JVM heap to off-heap. GemFire proactively monitors the JVM “old gen” so it never goes beyond a certain critical threshold, i.e. we do a number of things so you don’t run OOM. We are hoping to contribute this back to Spark. And, when running in the embedded mode, we also make sure the executors are long-lived, i.e. the life cycle of these nodes is no longer tied to driver availability. Everything Spark does is cleaned up as expected, though.
  9. The partitioning strategy, by default, is the same as Spark's. We try to do a uniform random distribution of the records across all the nodes designated to host a partitioned table. Any table can have one or more replicas. Replicas are always consistent with each other - sync writes. We send the write to each replica in parallel and wait for ACKs. If an ACK is not received, we start SUSPECT processing. Replicated tables, by default, are replicated to each node. Replicas are guaranteed to be consistent when failures occur, i.e. when the failed node rejoins. How to recreate the state of the replica while thousands of other concurrent writes are in progress is a hard problem to solve.