Cloud Native Data Pipelines (DataEngConf SF 2017)

Cloud Native Data
Pipelines
1
Sid Anand (@r39132)
DataEngConf SF 2017

About Me
2
Work [ed | s] @
Committer & PPMC on Apache Airflow
Father of 2
Co-Chair for

9
Enterprise
Customers
email
metadata
apply
trust
models
email md +
trust score
Agari’s Previous EP Version
Agari : What We Do
Batch

10
email
metadata
apply
trust
models
email md +
trust score
Agari’s Current EP Version
Enterprise
Customers
Agari : What We Do
Near-real
time
Quarantine,
Label,
PassThrough

Data Pipelines
BI vs Predictive
11

Data Pipelines (BI)
12
Web Servers
OLTP
DB
Data
Warehouse
Reporting
Tools
Query
Browsers
ETL (batch)
MySQL,
Oracle,
Cassandra
Teradata,
Redshift,
BigQuery

OLTP DB
or cache
ETL (batch or streaming)
MySQL,
Oracle,
Cassandra,
Redis
Spark,
Flink,
Beam,
Storm
Web Servers
Data Products
Ranking (Search, News Feed),
Recommender Products,
Fraud Detection / Prevention
Data
Source
Data Pipelines (Predictive)
13

BI Predictive
Common Focus of this talk
Data Pipelines
15
Web Servers
OLTP
DB
Data
Warehouse
Reporting
Tools
Query
Browsers
ETL (batch)
MySQL,
Oracle,
Cassandra
Teradata,
Redshift,
BigQuery
OLTP DB
or cache
ETL (batch or streaming)
MySQL,
Oracle,
Cassandra,
Redis
Spark,
Flink,
Beam,
Storm
Web Servers
Ranking (Search, News Feed),
Recommender Products,
Fraud Detection / Prevention
Data
Source

Motivation
Cloud Native Data Pipelines
16

17
Big Data Companies like LinkedIn, Facebook, Twitter, & Google
have large teams to manage their data pipelines

Most start-ups run in the public cloud. Can they leverage
aspects of the public cloud to build comparable pipelines?

18
Cloud Native Techniques
Open Source Technologies
~
Data Pipelines seen in Big Data companies

Design Goals
Desirable Qualities of a Resilient Data Pipeline
19

20
Desirable Qualities of a Resilient
Data Pipeline
Operability
Correctness
Timeliness
Cost

21
Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Minimize Operational Fatigue / Automate Everything
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go

Predictive Analytics @ Agari
Use Cases
22

Use Cases
23
Apply trust models
(message scoring)
batch + near real
time
Build trust models
batch
(Enterprise Protect)

Use Cases
24
Apply trust models
(message scoring)
batch + near real
time
Build trust models
batch
(Enterprise Protect)
Focus of this talk

Use-Case : Message
Scoring (batch)
Batch Pipeline Architecture
25

Use-Case : Message Scoring
26
enterprise A
enterprise B
enterprise C
S3
An Avro file is uploaded to S3
every 15 minutes

27
enterprise A
enterprise B
enterprise C
S3
Airflow kicks off a Spark
message scoring job
every hour (EMR)

28
enterprise A
enterprise B
enterprise C
S3
Spark job writes scored
messages and stats to
another S3 bucket
S3

29
enterprise A
enterprise B
enterprise C
S3
This triggers SNS/SQS
message events
S3
SNS
SQS
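An importer reading from this queue has to unwrap two layers before it knows which file to fetch. The sketch below assumes the S3 bucket notifies an SNS topic, the SQS queue subscribes to that topic, and raw message delivery is off, so the SQS body is an SNS envelope whose "Message" field holds the S3 event as a JSON string. The function name, bucket, and key are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SQS body wrapping an SNS
    notification that in turn wraps an S3 event."""
    envelope = json.loads(body)                  # SNS envelope delivered to SQS
    s3_event = json.loads(envelope["Message"])   # the S3 event is a JSON string
    objects = []
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical bucket and key, for illustration only.
sample_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Records": [{
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "scored-messages"},
                "object": {"key": "enterprise-a/2017-04-26/part+0001.avro"},
            },
        }]
    }),
})

print(s3_objects_from_sqs_body(sample_body))
```

Decoding the key matters in practice: a space in an object name shows up as "+" in the notification, and fetching the undecoded key would fail.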

30
enterprise A
enterprise B
enterprise C
S3
An Auto Scaling Group
(ASG) of Importers spins
up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG

31
enterprise A
enterprise B
enterprise C
S3
The importers rapidly ingest scored
messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASG
DB

32
enterprise A
enterprise B
enterprise C
S3
Users receive alerts of
untrusted emails &
can review them in
the web app
S3
SNS
SQS
Importers
ASG
DB

33
enterprise A
enterprise B
enterprise C
S3 S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process

34
Architectural Components
Component | Role | Uses | Salient Features | Operability Model
S3 | Data Lake | All data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless
SNS + SQS | Messaging | Reliable, Transactional, Pub/Sub | Performant | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Performant | Managed
Spark | Data Science Processing | Aggregation; Model Building; Scoring | Nice programming model at the cost of debugging complexity | We Operate
Airflow | Workflow Engine | Coordinates all Spark Jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate
DB | Persistence for WebApp | Holds subset of data needed for Web App | Rails + Postgres ('nuff said) | We Operate

Tackling Cost & Timeliness
Leveraging the AWS Cloud
35

Tackling Cost
36
Between Daily Runs During Daily Runs
When running daily, for 23 hours of a day, we didn’t
pay for instances in the ASG or EMR

Tackling Cost
37
Between Hourly Runs During Hourly Runs
When running daily, for 23 hours of a day, we didn’t pay for
instances in the ASG or EMR
This does not help when runs are hourly since AWS charges at
an hourly rate for EC2 instances!
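The arithmetic behind this slide can be sketched as follows. It assumes one instance is launched per run and terminated afterwards, and that every launch is billed in whole hours, as the slide describes. The 20-minute run length is a made-up figure.

```python
import math


def billed_instance_hours(runs_per_day: int, run_minutes: float) -> int:
    """Instance-hours billed per day for one instance that is launched for
    each run and terminated afterwards, with each launch rounded up to
    whole hours."""
    hours_per_run = math.ceil(run_minutes / 60)
    return runs_per_day * hours_per_run


# run_minutes=20 is a hypothetical run length.
daily = billed_instance_hours(runs_per_day=1, run_minutes=20)
hourly = billed_instance_hours(runs_per_day=24, run_minutes=20)

print(f"daily runs:  {daily} of 24 instance-hours billed")
print(f"hourly runs: {hourly} of 24 instance-hours billed")
```

With daily runs only the single rounded-up hour is paid for. With hourly runs every hour of the day contains a billed launch, so shutting instances down between runs saves nothing.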

Tackling Timeliness
Auto Scaling Group (ASG)
38

ASG - Overview
39
What is it?
A means to automatically scale out/in clusters to handle
variable load/traffic
A means to keep a cluster/service of a fixed size always up

ASG - Data Pipeline
40
importer
importer
importer
importer
Importer
ASG
scaleout/in
SQS
DB

41
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is
good at scaling in/out to
keep the average CPU
constant
ASG : CPU-based

ASG : CPU-based
42
Sent
CPU
Recv
Premature
Scale-in
Premature Scale-in:
• The CPU drops to noise-levels before all messages are
consumed
• This causes scale in to occur while the last few
messages are still being committed

43
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK'd). This causes the ASG to shrink.
ASG : Queue-based
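The queue-based rule can be written as a small decision function. This is a sketch, not the alarm configuration used at Agari: in a real deployment the two inputs would come from the SQS queue attributes ApproximateNumberOfMessages (visible) and ApproximateNumberOfMessagesNotVisible (in flight), and the decision would be wired to scaling policies. Requiring the visible count to be zero before scaling in is an added assumption; the slide states only the in-flight condition.

```python
def scaling_decision(visible: int, in_flight: int) -> str:
    """Decide what the importer ASG should do, given SQS queue depth.

    visible   -- messages waiting to be received
    in_flight -- messages received but not yet ACK'd (deleted)
    """
    if visible > 0:
        return "scale_out"   # work is waiting: grow the ASG
    if in_flight == 0:
        return "scale_in"    # nothing waiting, nothing being processed
    return "hold"            # last messages are still being committed


for visible, in_flight in [(5, 2), (0, 3), (0, 0)]:
    print(visible, in_flight, scaling_decision(visible, in_flight))
```

The "hold" branch is what avoids the premature scale-in seen with CPU-based scaling: instances stay up until the last in-flight message has been acknowledged.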

Auto Scaling Groups
Build & Deploy
44

ASG - Build & Deploy
45
Component | Role | Details
Terraform | Spins up Cloud Resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them using Terraform
Ansible | A better version of Chef & Puppet; sets up an EC2 instance | Agentless, idempotent, & declarative tool to set up EC2 instances by installing & configuring packages, and more
Packer | Spins up an EC2 instance for the purpose of building an AMI! | Can be used with Ansible & Terraform to bake AMIs & launch Auto Scaling Groups

46
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)

51
Data Pipeline
Timeliness Cost
• ASG
• EMR Spark
Daily
• ASG
• EMR Spark
Hourly ASG
• No Cost Savings

Tackling Operability &
Correctness
Leveraging Tooling
52

53
A simple way to author, configure, manage workflows
Provides visual insight into the state & performance of workflow
runs
Integrates with our alerting and monitoring tools
Tackling Operability : Requirements

Apache Airflow
Workflow Automation & Scheduling
54

55
Airflow: Author DAGs in Python! No need to bundle many config files!
Apache Airflow - Authoring DAGs
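The idea of "DAGs as code" can be shown without Airflow itself. The sketch below is not Airflow's API: it uses only the Python standard library (graphlib, Python 3.9+) to declare tasks and their dependencies as a plain data structure and derive a valid run order. Task names are hypothetical and loosely follow the batch pipeline described earlier.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical.
pipeline = {
    "wait_for_avro_uploads": set(),
    "run_spark_scoring": {"wait_for_avro_uploads"},
    "wait_for_importers": {"run_spark_scoring"},
    "send_alerts": {"wait_for_importers"},
}


def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError if the
    dependencies contain a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(" -> ".join(order))
```

Because the workflow is ordinary Python, it can be reviewed, versioned, and generated programmatically, which is the point the slide makes against bundles of config files.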

56
Airflow: Visualizing a DAG
Apache Airflow - Authoring DAGs

57
Airflow: It’s easy to manage multiple DAGs
Apache Airflow - Managing DAGs

Apache Airflow - Perf. Insights
58
Airflow: Gantt chart view reveals the slowest tasks for a run!

59
Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!

60
Airflow: …And easy to integrate with Ops tools!
Apache Airflow - Alerting

61
Apache Airflow - Correctness

62
Data Pipeline
Timeliness Cost

Use-Case : Message
Scoring (near-real time)
NRT Pipeline Architecture
63

64
enterprise A
enterprise B
enterprise C
Kinesis batch put every
second
K

65
enterprise A
enterprise B
enterprise C
K
An ASG of scorers is
scaled up to one process
per core per Kinesis shard
Scorers
ASG

66
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Scorers apply the trust
model and send scored
messages downstream

67
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
An ASG of importers is
scaled up to rapidly
import messages
DB

68
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are
also consumed by the
alerter
DB
K
Alerters
ASG

69
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
alerter
DB
K
Alerters
ASG
Quarantine Email

70
Stream Processing Architecture
Component | Role | Details | Pros | Operability Model
S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Performant, Serverless | Serverless
Kinesis | Messaging | Streaming transport modeled on Kafka | Serverless | Serverless
Lambda | General Processing | ASG Replacement except for Rails Apps | Serverless | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Managed | Managed
Spark | Data Science Processing | Model Building | | We Operate
Airflow | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate
DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres ('nuff said) | We Operate
ES + Elasticache Redis | Persistence for WebApp | Aggregation + Search moved from DB to ES; Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed

Innovations
NRT Pipeline Architecture
71

73
What is Avro?
Avro is a self-describing serialization format that supports
primitive data types : int, long, boolean, float, string, bytes, etc…
complex data types : records, arrays, unions, maps, enums, etc…
many language bindings : Java, Scala, Python, Ruby, etc…

74
What is Avro?
Avro is a self-describing serialization format that supports
primitive data types : int, long, boolean, float, string, bytes, etc…
complex data types : records, arrays, unions, maps, enums, etc…
many language bindings : Java, Scala, Python, Ruby, etc…
The most common format for storing structured Big Data at rest in
HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!

Apache Avro
Why is it useful?
75

76
Why is Avro Useful?
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to Agari’s
Cloud SAAS
Data is sent via Kinesis!
enterprise A
enterprise B
enterprise C Kinesis
Agari SAAS
in AWS

77
Why is Avro Useful?
enterprise A :
enterprise B :
enterprise C : Kinesis
v1
v2
v3
Agari Sensors, deployed at customer sites, stream data to Agari’s
Cloud SAAS
At any point in time, customers run different versions of the Agari
Sensor
Agari SAAS
in AWS

78
Why is Avro Useful?
enterprise A :
enterprise B :
v1
v2
v3
Agari Sensors, deployed at customer sites, stream data to
Agari’s Cloud SAAS
At any point in time, customers run different versions of the
Agari Sensor
These Sensors might send different format versions of the
data!
Agari SAAS
in AWS

79
Why is Avro Useful?
enterprise A :
enterprise B :
v1
v2
v3
Agari SAAS
in AWS
v4
Agari Sensors, deployed at customer sites, stream data to
Agari’s Cloud SAAS
At any point in time, customers run different versions of the
Agari Sensor
These Sensors might send different format versions of the
data!

80
Why is Avro Useful?
enterprise A :
enterprise B :
enterprise C :
v1
v2
v3
Avro allows Agari to seamlessly handle different IoT data format
versions
Agari SAAS
in AWS
Kinesis v4
from avro.io import DatumReader
datum_reader = DatumReader(writers_schema=writers_schema,
                           readers_schema=readers_schema)
Requirements:
• Schemas are backward-compatible
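What the reader/writer schema pair buys can be illustrated with a simplified, library-free sketch of Avro's schema resolution rules: fields the reader does not know are dropped, and fields the writer did not send take the reader's default. This is not the avro library's implementation, it works on already-decoded dicts, and the record and field names are hypothetical.

```python
def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema.

    - fields unknown to the reader are dropped
    - fields missing from the record take the reader's default
    - a missing field without a default is an incompatibility
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Hypothetical reader schema (v2): adds spf_result with a default.
reader_v2 = {
    "type": "record",
    "name": "Message",
    "fields": [
        {"name": "sender", "type": "string"},
        {"name": "subject", "type": "string"},
        {"name": "spf_result", "type": ["null", "string"], "default": None},
    ],
}

from_v1_sensor = {"sender": "a@example.com", "subject": "hello"}
from_v3_sensor = {"sender": "b@example.com", "subject": "hi",
                  "spf_result": "pass", "dkim_result": "fail"}

print(resolve(from_v1_sensor, reader_v2))
print(resolve(from_v3_sensor, reader_v2))
```

The ValueError branch is where the backward-compatibility requirement comes from: a field added without a default cannot be filled in when reading data from older sensors.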

81
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Avro is so useful, we don’t just use it to communicate between our
Sensors & our SAAS infrastructure
We also use it as the common data-interchange format between all
services (streaming & batch) within our AWS deployment

82
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Good Language Bindings :
Data Pipelines services are written in Java, Ruby, & Python

84
Avro Schema Example
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
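Since an Avro schema is plain JSON, it can be inspected with the standard library alone. The sketch below loads the schema above and separates fields whose type is a union containing "null" (treated here as optional) from the rest. The helper name is_nullable is hypothetical, and no Avro library is involved.

```python
import json

schema = json.loads("""
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
""")


def is_nullable(field: dict) -> bool:
    """A field is treated as optional when its type is a union that
    includes "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


required = [f["name"] for f in schema["fields"] if not is_nullable(f)]
optional = [f["name"] for f in schema["fields"] if is_nullable(f)]

print("required:", required)
print("optional:", optional)
```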

85
"type": "record",
"name": "User",
"fields": [
]
}
complex type (record)
Avro Schema Example

86
"type": "record",
"name": "User",
"fields": [
]
}
Schema name : User
Avro Schema Example

87
"type": "record",
"name": "User",
"fields": [
]
}
Schema name : User
3 fields in the record: 1 required, 2 optional
Avro Schema Example

88
Avro Schema Data File Example
[Figure: an Avro data file stores the schema once, followed by the data records (x 1,000,000,000). Schema: 0.0001 % of the file; Data: 99.999 %.]

89
Avro Schema Streaming Example
[Figure: a streamed message carries the schema plus one binary data block. Schema: 99 % of the message; Data: 1 %.]

90
Avro Schema Streaming Example
[Figure: a streamed message carries the schema plus one binary data block. Schema: 99 % of the message; Data: 1 %.]
OVERHEAD!!
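The overhead is simple arithmetic: the schema is sent once per container, so its share shrinks as more records share it. In the sketch below the byte sizes are made up, chosen so that the one-record-per-message case lands on the 99 % figure from the slide; they are not measurements from Agari.

```python
def schema_overhead(schema_bytes: int, record_bytes: int, records: int) -> float:
    """Fraction of the total payload taken up by the schema."""
    return schema_bytes / (schema_bytes + record_bytes * records)


# Hypothetical sizes: a 990-byte schema and 10-byte records.
SCHEMA_BYTES = 990
RECORD_BYTES = 10

streaming = schema_overhead(SCHEMA_BYTES, RECORD_BYTES, records=1)
data_file = schema_overhead(SCHEMA_BYTES, RECORD_BYTES, records=1_000_000_000)

print(f"one record per message: {streaming:.2%} schema")
print(f"one billion records per file: {data_file:.8%} schema")
```

Embedding the schema is negligible for files at rest but dominates small streamed messages, which motivates sending a schema identifier instead of the schema.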

Apache Avro
Schema Registry
91

92
Schema
Registry
(Lambda)
Avro Schema Registry
"type": "record",
"name": "User",
"fields": [
]
}
register_schema
Message
Producer (P)

93
Schema
Registry
(Lambda)
register_schema returns a UUID
Message
Producer (P)

94
Schema
Registry
(Lambda)
Message Producer sends UUID +
Message
Producer (P)
Data
Message
Consumer (C)

95
Schema
Registry
(Lambda)
Message
Producer (P)
Data
Message
Consumer (C)
getSchemaById (UUID)

96
Schema
Registry
(Lambda)
Message
Producer (P)
Data
Message
Consumer (C)
"type": "record",
"name": "User",
"fields": [
]
}

97
Schema
Registry
(Lambda)
Message
Producer (P)
Message
Consumer (C)
"type": "record",
"name": "User",
"fields": [
]
}
Message Consumers
• download & cache the schema
• then decode the data
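The register / send UUID / look up / cache flow can be sketched with an in-memory stand-in for the registry. Everything here is illustrative: the class names, the message layout, and returning the same UUID when the same schema is registered twice are assumptions, and the payload is a plain dict rather than Avro-encoded bytes.

```python
import json
import uuid


class InMemorySchemaRegistry:
    """Stand-in for the Lambda-backed registry: stores schemas by UUID."""

    def __init__(self):
        self._by_id = {}
        self._id_by_schema = {}

    def register_schema(self, schema: dict) -> str:
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._id_by_schema:      # assumed idempotent
            schema_id = str(uuid.uuid4())
            self._id_by_schema[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._id_by_schema[canonical]

    def get_schema_by_id(self, schema_id: str) -> dict:
        return self._by_id[schema_id]


class Consumer:
    """Downloads each schema once, caches it, then decodes messages."""

    def __init__(self, registry: InMemorySchemaRegistry):
        self._registry = registry
        self._cache = {}
        self.registry_lookups = 0

    def decode(self, message: dict) -> dict:
        schema_id = message["schema_id"]
        if schema_id not in self._cache:
            self.registry_lookups += 1
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        schema = self._cache[schema_id]
        # Real code would Avro-decode message["data"] with this schema;
        # here the payload is already a dict and is projected onto the fields.
        return {f["name"]: message["data"].get(f["name"])
                for f in schema["fields"]}


registry = InMemorySchemaRegistry()
user_schema = {"type": "record", "name": "User",
               "fields": [{"name": "name", "type": "string"}]}

# Producer: register once, then send UUID + data instead of schema + data.
schema_id = registry.register_schema(user_schema)
messages = [{"schema_id": schema_id, "data": {"name": n}} for n in ("ann", "bob")]

consumer = Consumer(registry)
decoded = [consumer.decode(m) for m in messages]
print(decoded, "lookups:", consumer.registry_lookups)
```

The cache means the registry is consulted once per schema version per consumer, not once per message, so each message carries only a short UUID.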

98
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
alerter
DB
K
Alerters
ASG
SR
SR
SR


Acknowledgments
100
None of this work would be possible without the
essential contributions of the team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Chris Buchanan
• Neil Chapin
• Wil Collins
• Don Spencer
• Scot Kennedy
• Natia Chachkhiani
• Patrick Cockwell
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
• Gabriel Poon
• Spencer Sun
• Nathan Bryant
