A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
1
JackGudenkauf@gmail.com
WARNING!
Slides that follow
violate PowerPoint best practices
in favor of providing densely
packed information for later review
https://www.linkedin.com/in/jackglinkedin 2
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
https://www.linkedin.com/in/jackglinkedin 3
Agenda
1. Background
https://www.linkedin.com/in/jackglinkedin 4
My Background
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL),
Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex, a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercept Dataflex db calls to store and retrieve data to/from Btrieve and IBM DB2
Mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework,
Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
https://www.linkedin.com/in/jackglinkedin
5
A Quest
With attributes of
 Operational Robustness
 High Availability
 Stronger durability guarantees
 Idempotent (an operation that is safe to repeat)
 Productivity
 Analytics
 Streaming, Machine Learning, BI, BA, Data Science
 Rich Development env.
 Strongly typed, OO, Functional, with support for set-based logic and
aggregations (SQL)
 Performance
 Scalable in every tier
 MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
https://www.linkedin.com/in/jackglinkedin 6
https://www.linkedin.com/in/jackglinkedin 7
https://en.wikipedia.org/wiki/Extract,_transform,_load
ELT
“Extract, Load, Transform is an alternative to Extract,
transform, load (ETL) used with data lake implementations.
In ELT models the data is not processed on entry to the data
lake which enables faster loading times.
But does require sufficient processing within the data
processing engine to carry out the transform on demand and
return the results to the consumer in a timely manner.
Since the data is not processed on entry to the data lake the
query and schema do not need to be defined a-priori (often
the schema will be available during load since many data
sources are extracts from databases or similar structured data
systems and hence have an associated schema).”
https://www.linkedin.com/in/jackglinkedin 8
https://en.wikipedia.org/wiki/Extract,_load,_transform
Lambda Architecture
9
“Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data” -
https://en.wikipedia.org/wiki/Lambda_architecture
Questioning the Lambda Architecture
by Jay Kreps
The Lambda Architecture has its merits, but alternatives
are worth exploring.
“As someone who designs infrastructure, I think the
glaring question is this: why can’t the stream processing
system just be improved to handle the full problem set in
its target domain? Why do you need to glue on another
system? Why can’t you do both real-time processing and
also handle the reprocessing when code changes? Stream
processing systems already have a notion of parallelism;
why not just handle reprocessing by increasing the
parallelism and replaying history very, very fast? The
answer is that you can do this, and I think it is
actually a reasonable alternative architecture if you are
building this type of system today.”
10
Original Architecture (ETL)
Playtika Santa Monica original ETL Architecture — Extract Transform Load
Single Sources of Truth to Global SOT
1 Game Applications (GameX, GameY, GameZ) emit Unified Schema JSON
  GameX — UserId: INT, SessionId: UUId (36)
  GameY — UserId: INT, SessionId: UUId (32)
  GameZ — UserId: varchar(32), SessionId: varchar(255)
2 REST API → Apache Flume™
3 ETL: Java™ Parser & Loader (UserId <-> UserGId)
4 COPY → HP Vertica™ Cluster (MPP Columnar DW)
5 Local Data Warehouses → Global SOT
Analytics of Relational Data: Structured Relational and Aggregated Data
https://www.linkedin.com/in/jackglinkedin 11
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
https://www.linkedin.com/in/jackglinkedin 12
New PSTL Architecture — Parallelized Streaming Transformation Loader
1 Game Applications emit Unified Schema JSON (Local Data Warehouses)
  Bingo Blitz — UserId: INT, SessionId: UUId (36)
  Slotomania — UserId: INT, SessionId: UUId (32)
  WSOP — UserId: varchar(32), SessionId: varchar(255)
2 REST API or local Kafka → Apache Kafka™ (Real-Time Messaging)
3 Apache Spark™ Resilient Distributed Datasets (MPP)
4 Apache Hadoop™ / Parquet™
5 HP Vertica™ (MPP Columnar DW)
Analytics of [semi]Structured [non]Relational Data Stores:
✓ Real-Time Streaming  ✓ Machine Learning
✓ Semi-Structured Raw JSON Data
✖ Structured (non)relational Parquet Data
– Structured Relational and Aggregated Data
https://www.linkedin.com/in/jackglinkedin 13
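The checkmarks track which store serves which workload: raw JSON lands in HDFS, a structured Parquet copy serves Spark analytics, and relational aggregates live in Vertica. A small sketch of the JSON-to-Parquet leg, assuming Spark 1.5-era APIs; the HDFS path and function name are illustrative only:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.SQLContext

  // jsonRDD: RDD[String] of raw events such as {"appId":1,"sessionId":"1","userId":"JG"}
  def jsonToParquet(sqlContext: SQLContext, jsonRDD: RDD[String]): Unit = {
    val df = sqlContext.read.json(jsonRDD)              // schema inferred from the JSON
    df.write.parquet("hdfs:///pstl/appId_1/batch_0001") // columnar copy for analytics
  }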
14
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
Apache Kafka™
is a distributed, partitioned, replicated commit log service
Producers → Kafka Cluster (Brokers) → Consumers
Apache Kafka™
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number, called the offset, which uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time.
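Step 4 of the drill-down later in the deck reads explicit partition/offset ranges rather than a receiver-based stream. A minimal sketch of that read style, assuming Spark 1.x with the spark-streaming-kafka (Kafka 0.8) artifact; the broker address, topic, and offsets are illustrative, and in the PSTL the offsets would come from the bookkeeping table:

  import kafka.serializer.StringDecoder
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  object KafkaRangeRead {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("pstl-appId_1"))
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
      // (topic, partition, fromOffset, untilOffset) — one RDD partition per range,
      // so Kafka read parallelism maps directly onto Spark partitions
      val ranges = Array(
        OffsetRange("appId_1", 0, 0L, 1000L),
        OffsetRange("appId_1", 1, 0L, 1000L))
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
      println(s"partitions=${rdd.partitions.length} events=${rdd.count()}")
      sc.stop()
    }
  }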
Spark RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Example layout across Node 1, Node 2, Node 3, Node…:
RDD 1 — Partitions 1, 2, 3 spread across the nodes
RDD 2 — Partitions 1–64, 65–128, 129–192, 193–256 (64 partitions per node)
RDD 3 — Partitions 1, 2, 3 spread across the nodes
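The partition count drives the parallelism. The speaker notes point out that you typically want 2–4 partitions per CPU and can set the count explicitly; a spark-shell-style sketch with illustrative values:

  // Explicit partition count; Spark would otherwise pick one based on the cluster.
  val rdd = sc.parallelize(1 to 256, numSlices = 8)
  // Each partition is operated on independently, in parallel.
  val perPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, iter.sum))
  }
  perPartition.collect().foreach { case (idx, sum) =>
    println(s"partition $idx sum $sum")
  }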
Vertica Hashing & Partitioning
An Initiator Node shuffles data to storage nodes
18
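Because Vertica segments a projection's rows across nodes by a hash of the segmentation column, the PSTL can apply the same idea a priori in Spark: partition the RDD by hash(userGId) so each partition holds only rows bound for one node, and the initiator shuffle disappears. A hedged sketch with a custom Spark Partitioner; the modulo placement is a simplification of Vertica's real hash-space ranges, and the class name and node count are illustrative:

  import org.apache.spark.Partitioner

  // One RDD partition per Vertica storage node (simplified stand-in for
  // Vertica's segmentation of its 64-bit hash space into per-node ranges).
  class VerticaAffinityPartitioner(numVerticaNodes: Int) extends Partitioner {
    override def numPartitions: Int = numVerticaNodes
    override def getPartition(key: Any): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }
  }

  // Usage with a pair RDD keyed by userGId:
  // val byNode = userRows.partitionBy(new VerticaAffinityPartitioner(3))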
19
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
{"appId": 3, "sessionId": ”7”,
"userId": ”42” }
{"appId": 3, "sessionId": ”6”,
"userId": ”42” }
Node 1 Node 2 Node 3 Node 4
3 Import recent Sessions
Apache Kafka Cluster
Topic: “appId_1” Topic: “appId_2” Topic: “appId_3”
old new
Kafka Table
appId,
TopicOffsetRange,
Batch_Id
SessionMax Table
sessionGIdMax Int
UserMax Table
userGIdMax Int
appSessionMap_RDD
appId: Int
sessionId: String
sessionGId: Int
appUserMap_RDD
appId: Int
userId: String
userGId: Int
appSession
appId: Int
sessionId:
varchar(255)
sessionGId: Int
appUser
appId: Int
userId:
varchar(255)
userGId: Int
1 Start a Spark Driver
per APP
Node 1 Node 2 Node 3
4 Spark Kafka [non]Streaming job per APP
(read partition/offset range)
5 select for
update;
update max
GId
5 Assign userGIds To
userId
sessionGIds To
sessionId
6 Hash(userGId) to
RDD partitions with
affinity
To Vertica Node(s)
7
userGIdRDD.foreachPartition
{…stream.writeTo(socket)...}
8 Idempotent: Write
Raw JSON to hdfs
9 Idempotent: Write
Parsed JSON to .ORC
hdfs
10 Update
MySQL
Kafka Offsets
{"appId": 2, "sessionId": ”4”,
"userId": ”KA” }
{"appId": 2, "sessionId": ”3”,
"userId": ”KY” }{"appId": 1, "sessionId": ”2”,
"userId": ”CB” }
{"appId": 1, "sessionId": "1”,
"userId": ”JG” }
4 appId {Game events, Users, Sessions,…}
Partition 1..n RDDs
5 appId Users & Sessions
Partition 1..n RDDs
5 appId
appUserMap_RDD.union(assignedID_RDD)
6 appId Users & Sessions
Partition 1..n RDDs
7 copy jackg.DIM_USER
with source SPARK(port='12345’,
nodes=‘node0001:4, node0002:4,
node0003:4’) direct;
2 Import Users
Apache Hadoop™
Spark™ Cluster
HPE Vertica™ Cluster
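Steps 5 and 7 as a hedged sketch. The GId high-water mark is bumped under select-for-update exactly as the slide says, though whether the UserMax table lives in MySQL (alongside the Kafka offsets) or elsewhere is an assumption here, as are the JDBC URL, credentials, host, port, and row format; the COPY statement itself is the one shown above.

  import java.io.PrintWriter
  import java.net.Socket
  import java.sql.DriverManager

  // Step 5 (driver): reserve a contiguous block of userGIds.
  def reserveUserGIds(count: Int): Int = {
    val conn = DriverManager.getConnection("jdbc:mysql://mysql01/pstl", "user", "pw")
    try {
      conn.setAutoCommit(false)
      val rs = conn.createStatement().executeQuery(
        "SELECT userGIdMax FROM UserMax FOR UPDATE")
      rs.next()
      val base = rs.getInt(1)
      conn.createStatement().executeUpdate(
        s"UPDATE UserMax SET userGIdMax = ${base + count}")
      conn.commit()
      base // newly assigned GIds are base+1 .. base+count
    } finally conn.close()
  }

  // Step 7 (executors): each affinitized partition streams rows to the TCP
  // port that COPY ... WITH SOURCE SPARK(port='12345', ...) is listening on.
  // userGIdRDD.foreachPartition { rows =>
  //   val sock = new Socket("node0001", 12345) // node chosen to match the hash affinity
  //   val out = new PrintWriter(sock.getOutputStream)
  //   rows.foreach { case (appId, userId, userGId) =>
  //     out.println(s"$appId|$userId|$userGId") // '|' is Vertica's default COPY delimiter
  //   }
  //   out.flush(); out.close(); sock.close()
  // }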
21
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7 min 35 sec on an 8-node cluster
Key Takeaways
Parallel Kafka reads to Spark RDD (in memory) with parallel
writes to Vertica via a TCP server – ROCKS!
COPY 36 TB/hour with an 81-node cluster (see the sanity check below)
No ephemeral nodes needed for ingest
Kafka read parallelism to Spark RDD partitions
A priori hash() in Spark RDD partitions (in memory)
TCP server as a Vertica User Defined Copy Source
A single COPY does not preallocate memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 nodes (215 data nodes)
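Sanity-checking those numbers: 451 GB in 7 min 35 sec (455 s) is about 0.99 GB/s, i.e. roughly 3.6 TB/hour on 8 nodes; assuming linear scaling, 3.6 × 81/8 ≈ 36 TB/hour on an 81-node cluster — in the same range as the 35 TB/hour Facebook ingest SLA described at the link above.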
A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
23
JackGudenkauf@gmail.com
THANK YOU
Editor's Notes

  1. BigDataInfra – compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. Imho, noETL is really noE, as the T&L will always eventually need to happen, even with “schema-less” schema-on-read, Data Lakes, etc. noSQL should really be noRel, as is being proven out by everyone putting SQL consumption on non-RDBMS stores.
  2. My experience and influencers framed my architectural decisions
  3. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  4. http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  5. We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes, e.g., 3 Kafka partitions would read in parallel into 3 Spark RDD partitions. Typically you want 2–4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10))
  6. SHUFFLE!
  7. BigDataInfra – compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. Imho, noETL is really noE, as the T&L will always eventually need to happen, even with “schema-less” schema-on-read, Data Lakes, etc. noSQL should really be noRel, as is being proven out by everyone putting SQL consumption on non-RDBMS stores.