Serverless Data Platform
Scott (考特)
July 22nd, 2021 (Thur)
AWS User Group
Scott (考特)
Shu-Jeng, Hsieh
● Sr. Data Engineer, 104 Corporation
● AWS Community Builder
Agenda
Prologue
What is a lakehouse?
The lakehouse built on AWS
Summary
Prologue
104 Corporation region: Production and Research Environment. Amazon EMR. Stakeholders: Analysts, Scientists, Developers, PMs, Managers.
Analysts
● an SAP tool for ETL
● SPSS for insights & statistical modeling
● license fees
● outdated front-end technology
Scientists
● gained computing power from AWS
● explore potential ML applications
● unavailability of data (external, internal)
● data discrepancies
Data Engineers
● open-source tools for ETL
● productionize scientists’ inventions
● maintain & evolve existing services
● explore potential ML applications
Other BUs
● make wishes
● repetitive tasks on data expansion
● unavailability of data
What is a lakehouse?
Data Lake pattern: unstructured, semi-structured, and structured data land in a Data Lake that feeds Machine Learning and Data Science, and is copied into multiple Data Warehouses, each serving BI.
Data Warehouse pattern: External Data and Operational Data → ETL → Data Warehouse → BI Reports.
Data Warehouses
- Built for BI and reporting
- No support for video, audio, text
- No support for data science, ML
- Limited support for streaming
- Closed & proprietary formats
Data Lake flow: structured, semi-structured, and unstructured data; BI Reports still require ETL into a Data Warehouse.
Data Lakes
- Poor BI support
- Complex to set up
- Poor performance
- Unreliable data swamps
Data Prep and Validation, a Real-time Database, and a Data Lake of structured, semi-structured, and unstructured data, serving Machine Learning, Data Science, and BI.
Lakehouse: Machine Learning, Data Science, and Streaming Analytics on one platform for every use case, with a Data Lake for all your data.
The lakehouse built on AWS
Serverless Data Platform
CloudFormation vs. CDK
CloudFormation
● Over time, the YAML/JSON templates grow larger
● Large YAML/JSON files are difficult to work with
○ High error rate when copying and pasting
○ A template is a text file, not a programming language
● Infrastructure AS code
● No abstraction
CDK
● IDE integration
○ Multiple languages
○ Syntax checking, autocompletion, etc.
● Higher-level abstraction
○ Simplified statements
○ 500 lines of CloudFormation become ~30 lines of CDK code (sketched below)
● Infrastructure IS code
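As a rough illustration of that abstraction gap (not the speaker's actual stack), a versioned S3 bucket with a Glacier lifecycle transition takes only a handful of lines in CDK for Python, while the equivalent CloudFormation YAML is considerably longer:

```python
# A minimal CDK v2 (Python) sketch of a versioned lake bucket with a lifecycle rule.
# Bucket and stack names are illustrative.
from aws_cdk import App, Stack, Duration, RemovalPolicy, aws_s3 as s3
from constructs import Construct

class LakeBucketStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        s3.Bucket(
            self, "LakeBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
            lifecycle_rules=[
                s3.LifecycleRule(
                    transitions=[
                        # Move cold objects to Glacier after 90 days
                        s3.Transition(
                            storage_class=s3.StorageClass.GLACIER,
                            transition_after=Duration.days(90),
                        )
                    ]
                )
            ],
        )

app = App()
LakeBucketStack(app, "LakeBucketStack")
app.synth()
```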
Serverless Data Platform
Why S3
Access frequency (frequent → infrequent): S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, S3 Glacier
Apache Parquet
● Efficient, columnar data representation.
● Utilizes the record shredding and assembly algorithm.
● Supports schema evolution.
Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75
Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01
Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings
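The conversion itself can be a few lines of Spark. A minimal PySpark sketch (bucket names, paths, and the partition column are illustrative) of rewriting CSV as partitioned Parquet:

```python
# Convert raw CSV on S3 into partitioned, columnar Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/raw/events/"))

# Columnar, compressed Parquet shrinks storage and lets engines that charge by
# data scanned (such as Athena) read only the columns and partitions a query touches.
(df.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://example-bucket/curated/events/"))
```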
The lakehouse built on AWS
CDC pipeline
Debezium Architecture
AWS Database Migration Service: a Source Database connects through a Source Endpoint to a Replication Instance running a Replication Task, which writes through a Target Endpoint to the Target Database.
IBM® Db2®
AWS DMS
● CDC supported for some source databases
● C-language-based
● Configuration-oriented
● Abundant supported sources and targets
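Once the replication instance, endpoints, and task exist, starting a CDC task is a single API call. A hedged boto3 sketch (the task ARN and region are placeholders, not values from the deck):

```python
# Start an existing DMS replication task from Python.
import boto3

dms = boto3.client("dms", region_name="ap-northeast-1")  # placeholder region

response = dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:ap-northeast-1:123456789012:task:EXAMPLE",
    # "start-replication" for the first run; "resume-processing" to continue CDC
    StartReplicationTaskType="start-replication",
)
print(response["ReplicationTask"]["Status"])
```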
The lakehouse built on AWS
Big data tools
- Spark APIs
1. RDD
2. DataFrame
3. Dataset
- Various modes
1. Local
2. Standalone
3. Mesos
4. k8s
- Supported interfaces
1. Java/Scala
2. Python
3. R
4. SQL
- Unified interfaces for batch and streaming (see the sketch below)
Spark cluster: a Driver coordinating multiple Workers, each running an Executor inside a JVM.
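A small PySpark sketch of the unified batch/streaming DataFrame API; the S3 paths, schema, and column names are illustrative:

```python
# The same DataFrame transformations apply to a static dataset and to a stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-api").getOrCreate()

# Batch: read a static Parquet dataset and aggregate per day
batch_df = spark.read.parquet("s3://example-bucket/events/")
daily = batch_df.groupBy(F.to_date("event_time").alias("day")).count()

# Streaming: identical transformations on a streaming source
stream_df = (spark.readStream
             .schema(batch_df.schema)
             .parquet("s3://example-bucket/events_stream/"))
daily_stream = stream_df.groupBy(F.to_date("event_time").alias("day")).count()

query = (daily_stream.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("daily_counts")
         .start())
```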
Apache Hudi
Apache Iceberg
Delta Lake
 | Delta Lake (open source) | Apache Iceberg | Apache Hudi
Transaction (ACID) | Y | Y | Y
MVCC | Y | Y | Y
Time travel | Y | Y | Y
Schema evolution | Y | Y | Y
Data mutation | Y (update/delete/merge/merge into) | N | Y (upsert)
Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming, Flink (WIP) | DeltaStreamer, HiveIncrementalPuller
File format | Parquet | Parquet, ORC, Avro | Parquet
Compaction/cleanup | Manual | API available | Manual and auto
Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat
Multiple language support | Scala/Java/Python | Java/Python | Java/Python
Storage abstraction | Y | Y | N
API dependency | Spark-bundled | Native/engine bundled | DeltaStreamer
Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer
The lakehouse built on AWS
Databricks
Delta Lake
● ACID transactions on Spark
● Scalable metadata handling
● Streaming and batch unification
● Schema enforcement
● Time travel
● Upserts and deletes
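A hedged sketch of two of those features, upserts (merge) and time travel, using the open-source delta-spark package; the table path and join key are illustrative, not taken from the deck:

```python
# Upsert into a Delta table and read an older version of it.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.read.parquet("s3://example-bucket/staging/users/")
target = DeltaTable.forPath(spark, "s3://example-bucket/delta/users/")

# Upsert: update matching rows, insert new ones
(target.alias("t")
 .merge(updates.alias("s"), "t.user_id = s.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as of an earlier version
old = (spark.read.format("delta")
       .option("versionAsOf", 0)
       .load("s3://example-bucket/delta/users/"))
```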
https://github.com/databrickslabs/terraform-provider-databricks
Databricks
● User AWS account (data plane): the clusters run in the customer's own VPC.
● Databricks AWS account (control plane): the workspace web application, REST APIs, and other core services, exposed through AWS VPC endpoint services.
● Back-end AWS PrivateLink connections: VPC endpoints from the data plane for the secure cluster connectivity relay and for the REST APIs.
● Front-end AWS PrivateLink connection: a VPC endpoint in a user transit AWS account for user requests to the web app and REST APIs from on-premises or VPN networks.
Languages: Scala, Rust, Python, Ruby**, Golang**
Services / Connectors / Databases: Databricks, kafka-delta-ingest, Airbyte*
* Currently in development ** Coming soon
The lakehouse built on AWS
Data Platform
CDK Application (composed as sketched below)
● Construct: utility components
● Construct: S3-related
● Construct: Databricks
● Construct: DMS instances
● Construct: DMS endpoints
● Construct: DMS tasks
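A hedged sketch of how such a CDK app might compose its constructs; the construct class names are hypothetical, not the speaker's actual code, and the DMS resources use the L1 (Cfn) layer because CDK has no higher-level DMS constructs:

```python
# Compose the data platform stack from small, focused constructs.
from aws_cdk import App, Stack, aws_s3 as s3, aws_dms as dms
from constructs import Construct

class S3Related(Construct):
    """Hypothetical S3-related construct: the lake bucket."""
    def __init__(self, scope: Construct, construct_id: str) -> None:
        super().__init__(scope, construct_id)
        self.bucket = s3.Bucket(self, "LakeBucket", versioned=True)

class DmsInstances(Construct):
    """Hypothetical DMS instances construct (L1/Cfn resources)."""
    def __init__(self, scope: Construct, construct_id: str) -> None:
        super().__init__(scope, construct_id)
        self.instance = dms.CfnReplicationInstance(
            self, "ReplicationInstance",
            replication_instance_class="dms.t3.medium",
        )

class DataPlatformStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        S3Related(self, "S3Related")
        DmsInstances(self, "DmsInstances")
        # DMS endpoints, DMS tasks, Databricks, and utility constructs
        # would follow the same pattern.

app = App()
DataPlatformStack(app, "ServerlessDataPlatform")
app.synth()
```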
High-level Architecture
The lakehouse built on AWS
During the build
Glue ETL
1. Serverless Spark
2. DynamicFrame
3. Self-describing: no schema required up front
4. Notable built-in transforms (see the sketch below)
a. ResolveChoice
b. Unbox => similar to from_json
c. Spigot => similar to TABLESAMPLE
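A hedged AWS Glue job sketch showing a DynamicFrame read and ResolveChoice; the database, table, column, and bucket names are illustrative:

```python
# Read a catalogued table as a DynamicFrame, resolve an ambiguous type, write Parquet.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Self-describing DynamicFrame: no schema needed up front
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events"
)

# ResolveChoice: settle an ambiguous column type, e.g. cast a mixed column to long
resolved = dyf.resolveChoice(specs=[("amount", "cast:long")])

# Write the result out as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=resolved,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)
job.commit()
```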
Glue crawlers
Supported connection types: native client, JDBC, MongoDB client
Glue workflow
Mohit, S., 2020, Simplify data pipelines with AWS Glue automatic code generation and Workflows
Developer Endpoint → Glue scripts
https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-workflow.html
Before Jun 20, 2019: new log data arrives → start the Glue crawler that deals with metadata → a CloudWatch event → start the Glue job that executes the ETL → the stats needed by requirements.
Glue workflow: new log data arrives → start the Glue workflow, which runs the Glue crawler that deals with metadata and then the Glue job that executes the ETL → the stats needed by requirements.
CDK Application: constructs for S3 Event Notification (see the sketch below)
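A hedged CDK sketch of such an S3 event notification; the bucket, Lambda handler, and prefix are illustrative, and what the notification actually triggers in the speaker's platform is not shown in the deck:

```python
# Fire a Lambda whenever a new object lands under the raw/ prefix of a bucket.
from aws_cdk import App, Stack, aws_s3 as s3, aws_s3_notifications as s3n, aws_lambda as _lambda

app = App()
stack = Stack(app, "NotificationStack")

bucket = s3.Bucket(stack, "RawBucket")

handler = _lambda.Function(
    stack, "OnNewObject",
    runtime=_lambda.Runtime.PYTHON_3_9,
    handler="index.handler",
    code=_lambda.Code.from_asset("lambda"),  # hypothetical handler code directory
)

bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED,
    s3n.LambdaDestination(handler),
    s3.NotificationKeyFilter(prefix="raw/"),
)

app.synth()
```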
Version | Supported Spark and Python | Minimum DPU | Default DPU
AWS Glue 0.9 | Spark 2.2.1, Python 2.7 | 2 | 10
AWS Glue 1.0 | Spark 2.4.3, Python 2.7 / Python 3.6 | 2 | 10
AWS Glue 2.0 | Spark 2.4.3, Python 3.7 | 2 | 10
 | AWS Glue | Delta Lake
Spark version | Spark 2.2.1 & 2.4.3 | Always the latest
UI experience | More clicks to reach job logs | On hover
Reliability & quality - ACID transactions
● Dynamic Partition Pruning
● Adaptive Query Execution
● Accelerator-aware scheduler
● Unified table creation SQL syntax
● Shuffled hash join improvement
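A quick PySpark sketch of opting into two of the Spark 3.0 features above; these configuration keys are standard Spark 3.x settings, not something specific to the speaker's setup:

```python
# Enable Adaptive Query Execution and Dynamic Partition Pruning in a Spark 3.x session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark3-features")
         .config("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution
         .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # Dynamic Partition Pruning
         .getOrCreate())
```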
Summary
AWS data lake building blocks: Amazon S3, AWS Snowball, AWS Snowmobile, Amazon Kinesis (Video Streams, Data Firehose, Data Streams), Amazon Redshift, Amazon EMR, Amazon Athena, Amazon Elasticsearch Service, and AI services.
● Any type of data
● Security, compliance, and audit capabilities across the data lake
● Empower all personas
● Democratize ML with SQL
● Unified analytics
Lakehouse Platform: an Open Data Lake (S3) for structured, semi-structured, unstructured, and streaming data, with Data Management & Governance supporting Data Engineering, BI & SQL Analytics, Real-time Data Applications, and Data Science & Machine Learning.
Serverless Data Platform
Cheng-Wei
Bill Chih-Yi
Daniel Luffy Neil
Titan
Nick Scott
Ask Me Anything