A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
1
JackGudenkauf@gmail.com
WARNING!
Slides that follow
violate PowerPoint best practices
in favor of providing densely
packed information for later review
https://www.linkedin.com/in/jackglinkedin 2
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
https://www.linkedin.com/in/jackglinkedin 3
Agenda
1. Background
https://www.linkedin.com/in/jackglinkedin 4
My Background
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL),
Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex, a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercept Dataflex db calls to store and retrieve data to/from Btrieve and IBM DB2
Mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework,
Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
https://www.linkedin.com/in/jackglinkedin
5
A Quest
With attributes of
 Operational Robustness
 High Availability
 Stronger durability guarantees
 Idempotent (an operation that is safe to repeat)
 Productivity
 Analytics
 Streaming, Machine Learning, BI, BA, Data Science
 Rich Development env.
 Strongly typed, OO, Functional, with support for set-based logic and
aggregations (SQL)
 Performance
 Scalable in every tier
 MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
https://www.linkedin.com/in/jackglinkedin 6
https://www.linkedin.com/in/jackglinkedin 7
https://en.wikipedia.org/wiki/Extract,_transform,_load
ELT
“Extract, Load, Transform is an alternative to Extract,
transform, load (ETL) used with data lake implementations.
In ELT models the data is not processed on entry to the data
lake which enables faster loading times.
But does require sufficient processing within the data
processing engine to carry out the transform on demand and
return the results to the consumer in a timely manner.
Since the data is not processed on entry to the data lake the
query and schema do not need to be defined a-priori (often
the schema will be available during load since many data
sources are extracts from databases or similar structured data
systems and hence have an associated schema).”
https://www.linkedin.com/in/jackglinkedin 8
https://en.wikipedia.org/wiki/Extract,_load,_transform
Lambda Architecture
9
“Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data” -
https://en.wikipedia.org/wiki/Lambda_architecture
Questioning the Lambda Architecture
by Jay Kreps
The Lambda Architecture has its merits, but alternatives
are worth exploring.
“As someone who designs infrastructure, I think the
glaring question is this: why can’t the stream processing
system just be improved to handle the full problem set in
its target domain? Why do you need to glue on another
system? Why can’t you do both real-time processing and
also handle the reprocessing when code changes? Stream
processing systems already have a notion of parallelism;
why not just handle reprocessing by increasing the
parallelism and replaying history very, very fast? The
answer is that you can do this, and I think it is
actually a reasonable alternative architecture if you are
building this type of system today.”
10
Original Architecture (ETL)
Playtika Santa Monica original ETL Architecture — Extract Transform Load
Single Sources of Truth to Global SOT
1 Game Applications (GameX, GameY, GameZ) emit Unified Schema JSON
  GameX — UserId: INT, SessionId: UUId (36)
  GameY — UserId: INT, SessionId: UUId (32)
  GameZ — UserId: varchar(32), SessionId: varchar(255)
2 REST API → Apache Flume™
3 ETL: Java™ Parser & Loader (UserId <-> UserGId)
4 COPY → HP Vertica™ Cluster (MPP Columnar DW)
5 Local Data Warehouses → Global SOT
Analytics of Relational Data: Structured Relational and Aggregated Data
https://www.linkedin.com/in/jackglinkedin 11
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
https://www.linkedin.com/in/jackglinkedin 12
New PSTL Architecture — Parallelized Streaming Transformation Loader
1 Game Applications emit Unified Schema JSON (Local Data Warehouses)
  Bingo Blitz — UserId: INT, SessionId: UUId (36)
  Slotomania — UserId: INT, SessionId: UUId (32)
  WSOP — UserId: varchar(32), SessionId: varchar(255)
2 REST API or local Kafka → Apache Kafka™ (Real-Time Messaging)
3 Apache Spark™ Resilient Distributed Datasets (MPP)
4 Apache Hadoop™ / Parquet™
5 HP Vertica™ (MPP Columnar DW)
Analytics of [semi]Structured [non]Relational Data Stores:
✓ Real-Time Streaming  ✓ Machine Learning
✓ Semi-Structured Raw JSON Data
✖ Structured (non)relational Parquet Data
– Structured Relational and Aggregated Data
https://www.linkedin.com/in/jackglinkedin 13
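The checkmarks track which store serves which workload: raw JSON lands in HDFS, a structured Parquet copy serves Spark analytics, and relational aggregates live in Vertica. A small sketch of the JSON-to-Parquet leg, assuming Spark 1.5-era APIs; the HDFS path and function name are illustrative only:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.SQLContext

  // jsonRDD: RDD[String] of raw events such as {"appId":1,"sessionId":"1","userId":"JG"}
  def jsonToParquet(sqlContext: SQLContext, jsonRDD: RDD[String]): Unit = {
    val df = sqlContext.read.json(jsonRDD)              // schema inferred from the JSON
    df.write.parquet("hdfs:///pstl/appId_1/batch_0001") // columnar copy for analytics
  }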
14
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
Apache Kafka™
is a distributed, partitioned, replicated commit log service
Producers → Kafka Cluster (Brokers) → Consumers
Apache Kafka™
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number, called the offset, which uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time.
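Step 4 of the drill-down later in the deck reads explicit partition/offset ranges rather than a receiver-based stream. A minimal sketch of that read style, assuming Spark 1.x with the spark-streaming-kafka (Kafka 0.8) artifact; the broker address, topic, and offsets are illustrative, and in the PSTL the offsets would come from the bookkeeping table:

  import kafka.serializer.StringDecoder
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  object KafkaRangeRead {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("pstl-appId_1"))
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
      // (topic, partition, fromOffset, untilOffset) — one RDD partition per range,
      // so Kafka read parallelism maps directly onto Spark partitions
      val ranges = Array(
        OffsetRange("appId_1", 0, 0L, 1000L),
        OffsetRange("appId_1", 1, 0L, 1000L))
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
      println(s"partitions=${rdd.partitions.length} events=${rdd.count()}")
      sc.stop()
    }
  }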
Spark RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Example layout across Node 1, Node 2, Node 3, Node…:
RDD 1 — Partitions 1, 2, 3 spread across the nodes
RDD 2 — Partitions 1–64, 65–128, 129–192, 193–256 (64 partitions per node)
RDD 3 — Partitions 1, 2, 3 spread across the nodes
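The partition count drives the parallelism. The speaker notes point out that you typically want 2–4 partitions per CPU and can set the count explicitly; a spark-shell-style sketch with illustrative values:

  // Explicit partition count; Spark would otherwise pick one based on the cluster.
  val rdd = sc.parallelize(1 to 256, numSlices = 8)
  // Each partition is operated on independently, in parallel.
  val perPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, iter.sum))
  }
  perPartition.collect().foreach { case (idx, sum) =>
    println(s"partition $idx sum $sum")
  }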
Vertica Hashing & Partitioning
An Initiator Node shuffles data to storage nodes
18
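Because Vertica segments a projection's rows across nodes by a hash of the segmentation column, the PSTL can apply the same idea a priori in Spark: partition the RDD by hash(userGId) so each partition holds only rows bound for one node, and the initiator shuffle disappears. A hedged sketch with a custom Spark Partitioner; the modulo placement is a simplification of Vertica's real hash-space ranges, and the class name and node count are illustrative:

  import org.apache.spark.Partitioner

  // One RDD partition per Vertica storage node (simplified stand-in for
  // Vertica's segmentation of its 64-bit hash space into per-node ranges).
  class VerticaAffinityPartitioner(numVerticaNodes: Int) extends Partitioner {
    override def numPartitions: Int = numVerticaNodes
    override def getPartition(key: Any): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }
  }

  // Usage with a pair RDD keyed by userGId:
  // val byNode = userRows.partitionBy(new VerticaAffinityPartitioner(3))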
19
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
{"appId": 3, "sessionId": ”7”,
"userId": ”42” }
{"appId": 3, "sessionId": ”6”,
"userId": ”42” }
Node 1 Node 2 Node 3 Node 4
3 Import recent Sessions
Apache Kafka Cluster
Topic: “appId_1” Topic: “appId_2” Topic: “appId_3”
old new
Kafka Table
appId,
TopicOffsetRange,
Batch_Id
SessionMax Table
sessionGIdMax Int
UserMax Table
userGIdMax Int
appSessionMap_RDD
appId: Int
sessionId: String
sessionGId: Int
appUserMap_RDD
appId: Int
userId: String
userGId: Int
appSession
appId: Int
sessionId:
varchar(255)
sessionGId: Int
appUser
appId: Int
userId:
varchar(255)
userGId: Int
1 Start a Spark Driver
per APP
Node 1 Node 2 Node 3
4 Spark Kafka [non]Streaming job per APP
(read partition/offset range)
5 select for
update;
update max
GId
5 Assign userGIds To
userId
sessionGIds To
sessionId
6 Hash(userGId) to
RDD partitions with
affinity
To Vertica Node(s)
7
userGIdRDD.foreachPartition
{…stream.writeTo(socket)...}
8 Idempotent: Write
Raw JSON to hdfs
9 Idempotent: Write
Parsed JSON to .ORC
hdfs
10 Update
MySQL
Kafka Offsets
{"appId": 2, "sessionId": ”4”,
"userId": ”KA” }
{"appId": 2, "sessionId": ”3”,
"userId": ”KY” }{"appId": 1, "sessionId": ”2”,
"userId": ”CB” }
{"appId": 1, "sessionId": "1”,
"userId": ”JG” }
4 appId {Game events, Users, Sessions,…}
Partition 1..n RDDs
5 appId Users & Sessions
Partition 1..n RDDs
5 appId
appUserMap_RDD.union(assignedID_RDD)
6 appId Users & Sessions
Partition 1..n RDDs
7 copy jackg.DIM_USER
with source SPARK(port='12345’,
nodes=‘node0001:4, node0002:4,
node0003:4’) direct;
2 Import Users
Apache Hadoop™
Spark™ Cluster
HPE Vertica™ Cluster
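Steps 5 and 7 as a hedged sketch. The GId high-water mark is bumped under select-for-update exactly as the slide says, though whether the UserMax table lives in MySQL (alongside the Kafka offsets) or elsewhere is an assumption here, as are the JDBC URL, credentials, host, port, and row format; the COPY statement itself is the one shown above.

  import java.io.PrintWriter
  import java.net.Socket
  import java.sql.DriverManager

  // Step 5 (driver): reserve a contiguous block of userGIds.
  def reserveUserGIds(count: Int): Int = {
    val conn = DriverManager.getConnection("jdbc:mysql://mysql01/pstl", "user", "pw")
    try {
      conn.setAutoCommit(false)
      val rs = conn.createStatement().executeQuery(
        "SELECT userGIdMax FROM UserMax FOR UPDATE")
      rs.next()
      val base = rs.getInt(1)
      conn.createStatement().executeUpdate(
        s"UPDATE UserMax SET userGIdMax = ${base + count}")
      conn.commit()
      base // newly assigned GIds are base+1 .. base+count
    } finally conn.close()
  }

  // Step 7 (executors): each affinitized partition streams rows to the TCP
  // port that COPY ... WITH SOURCE SPARK(port='12345', ...) is listening on.
  // userGIdRDD.foreachPartition { rows =>
  //   val sock = new Socket("node0001", 12345) // node chosen to match the hash affinity
  //   val out = new PrintWriter(sock.getOutputStream)
  //   rows.foreach { case (appId, userId, userGId) =>
  //     out.println(s"$appId|$userId|$userGId") // '|' is Vertica's default COPY delimiter
  //   }
  //   out.flush(); out.close(); sock.close()
  // }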
21
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7 min 35 sec on an 8-node cluster
Key Takeaways
Parallel Kafka reads to Spark RDD (in memory) with parallel
writes to Vertica via a TCP server – ROCKS!
COPY 36 TB/hour with an 81-node cluster (see the sanity check below)
No ephemeral nodes needed for ingest
Kafka read parallelism to Spark RDD partitions
A priori hash() in Spark RDD partitions (in memory)
TCP server as a Vertica User Defined Copy Source
A single COPY does not preallocate memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 nodes (215 data nodes)
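Sanity-checking those numbers: 451 GB in 7 min 35 sec (455 s) is about 0.99 GB/s, i.e. roughly 3.6 TB/hour on 8 nodes; assuming linear scaling, 3.6 × 81/8 ≈ 36 TB/hour on an 81-node cluster — in the same range as the 35 TB/hour Facebook ingest SLA described at the link above.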
A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
23
JackGudenkauf@gmail.com
THANK YOU
Editor's Notes

  1. BigDataInfra – compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. Imho, noETL is really noE, as the T&L will always eventually need to happen, even with “schema-less” schema-on-read, Data Lakes, etc. noSQL should really be noRel, as is being proven out by everyone putting SQL consumption on non-RDBMS stores.
  2. My experience and influencers framed my architectural decisions
  3. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  4. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  5. We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes, e.g., 3 Kafka partitions would read in parallel into 3 Spark RDD partitions. Typically you want 2–4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10))
  6. SHUFFLE!
  7. BigDataInfra – compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. Imho, noETL is really noE, as the T&L will always eventually need to happen, even with “schema-less” schema-on-read, Data Lakes, etc. noSQL should really be noRel, as is being proven out by everyone putting SQL consumption on non-RDBMS stores.