© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
SEOUL | MAY 4, 2023
Real-Time CDC Data Processing!
Building a Modern Transactional Data Lake
AWS
Agenda
• Append-Only
• CDC-based UPSERT
▪ View
▪ Open Table Formats – Apache Iceberg, Hudi, Delta Lake
• Modern Transactional Data Lake Architecture
CRM
IoT
WEB
Messages
CDC*
Event Streams
* CDC: Change Data Capture
RDBMS Data Insights
RDBMS Scalability
• Scale-Up: RDBMS (Primary)
• Scale-Out: Primary-Replica Cluster (RDBMS (Primary), RDBMS (Replica))
• Scale-Out: Query Engine (1), Query Engine (2), Query Engine (3), Storage interface, Storage (Distributed File System)
DFS*
Stream Storage
Data Lake
Data Mart
AI/ML
CRM
IoT
WEB
Messages
CDC
Event Streams
* DFS: Distributed File System
Data Warehouse
Stream Delivery
CRM
IoT
WEB
Messages
CDC
Event Streams
Data Lake
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Amazon Athena
Amazon S3
Data Lake
Amazon QuickSight
RDBMS vs. S3 (≈ Distributed Object Storage)
Amazon S3
• IMMUTABLE Objects
• Distributed
• CAN NOT Update/Delete In-Place
• Insert (Append)-Only
• Interface (HTTPS, SDK APIs)
• Transactional (X)
RDBMS
• MUTABLE Records
• Files per table (table1, table2, table3) on a File System
• Update/Delete In-Place
• Insert/Update/Delete
• Transactional (O)
RDBMS
CDC
CDC Update/Delete ?
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Amazon Athena
Amazon S3
AWS DMS
datalake/
year=2023/month=05/day=03/hour=01/
obj1.parquet
obj2.parquet
…
year=2023/month=05/day=03/hour=02/
updated-obj1.parquet
…
Data Lake
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
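The hourly prefixes above follow the Hive-style key=value layout. A minimal sketch of why an update ends up next to, rather than on top of, the original insert: each change is written under the prefix of the hour it arrived in. The function name `partition_prefix` and the `datalake` root argument are illustrative, not taken from the talk.

```python
from datetime import datetime


def partition_prefix(ts: datetime, root: str = "datalake") -> str:
    """Build a Hive-style hourly partition prefix for a change timestamp."""
    return f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


# The insert (t1) and the later update (t2) of the same key land in
# different prefixes, so the append-only store keeps both versions.
insert_prefix = partition_prefix(datetime(2023, 5, 3, 1, 30))
update_prefix = partition_prefix(datetime(2023, 5, 3, 2, 10))
print(insert_prefix)
print(update_prefix)
```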
View UPSERT : Merge-On-Read
RDBMS
Updated/
Deleted
Data
Inserted Data
View Table
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
I, pk0, c1, c2, t0
D, pk0, c1, c2, t3
I, pk0, c1, c2, t0
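The merge-on-read idea on this slide can be sketched in a few lines: keep every change record as written, and at read time pick the newest record per primary key, hiding keys whose newest record is a delete. The `Change` type, the integer timestamps, and the sample values are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple


class Change(NamedTuple):
    op: str   # "I" (insert), "U" (update) or "D" (delete)
    pk: str   # primary key
    c1: str
    c2: str
    ts: int   # change timestamp; larger means newer


def merge_on_read(changes: List[Change]) -> Dict[str, Change]:
    """Return the current row per primary key from an append-only change log."""
    latest: Dict[str, Change] = {}
    for ch in changes:
        seen = latest.get(ch.pk)
        if seen is None or ch.ts >= seen.ts:
            latest[ch.pk] = ch
    # A key whose newest change is a delete is not part of the view.
    return {pk: ch for pk, ch in latest.items() if ch.op != "D"}


log = [
    Change("I", "pk0", "a", "b", 0),
    Change("I", "pk1", "a", "b", 1),
    Change("U", "pk1", "a2", "b2", 2),
    Change("D", "pk0", "a", "b", 3),
]
view = merge_on_read(log)
print(view)
```

Nothing is rewritten in storage; the cost of reconciling inserts, updates, and deletes is paid on every read.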
View UPSERT : Merge-On-Read
RDBMS
Updated/Deleted
Data
Inserted Data
View Table
Amazon S3
Amazon Athena
Amazon Redshift
Logical View
Materialized
View
CDC
Logical View vs. Materialized View
CREATE VIEW view_tbl AS
SELECT *
FROM org_tbl, delta_tbl;
Logical View
SELECT *
FROM view_tbl;
is rewritten at read time into
SELECT *
FROM (
  SELECT *
  FROM org_tbl, delta_tbl
);
Materialized View
SELECT *
FROM view_tbl;
reads the stored result: view_tbl = org_tbl + delta_tbl
Source tables on Amazon S3: org_tbl, delta_tbl
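A small runnable sketch of the difference, using Python's built-in `sqlite3`. Two assumptions to note: SQLite has no materialized views, so a table created from the query result (`mview_tbl`, a hypothetical name) stands in for one, and the two tables are combined with `UNION ALL` here rather than the comma join shown on the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE org_tbl   (pk TEXT, val TEXT);
    CREATE TABLE delta_tbl (pk TEXT, val TEXT);
    INSERT INTO org_tbl VALUES ('pk1', 'v1');

    -- Logical view: only the query text is stored.
    CREATE VIEW view_tbl AS
        SELECT * FROM org_tbl UNION ALL SELECT * FROM delta_tbl;

    -- Stand-in for a materialized view: the query result is stored.
    CREATE TABLE mview_tbl AS SELECT * FROM view_tbl;
""")

# New change data arrives after both objects were created.
con.execute("INSERT INTO delta_tbl VALUES ('pk2', 'v2')")

logical_rows = con.execute("SELECT COUNT(*) FROM view_tbl").fetchone()[0]
stored_rows = con.execute("SELECT COUNT(*) FROM mview_tbl").fetchone()[0]

# "Refresh" the stand-in by recomputing the stored result.
con.executescript("""
    DELETE FROM mview_tbl;
    INSERT INTO mview_tbl SELECT * FROM view_tbl;
""")
refreshed_rows = con.execute("SELECT COUNT(*) FROM mview_tbl").fetchone()[0]
print(logical_rows, stored_rows, refreshed_rows)
```

The logical view sees the new row immediately because it re-runs the query; the stored copy is stale until it is refreshed.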
Amazon Redshift Materialized Views
Amazon Kinesis
Data Streams
Amazon Redshift / Redshift Serverless
Permanent
Tables
Real-time
Materialized
View
Streaming
Table
…
…
Amazon
QuickSight
Amazon MSK
Amazon Redshift Streaming Ingestion
MATERIALIZED VIEW
Auto Refresh
Data Source
t1 t2
Inserted Data
(t1)
Amazon S3
Inserted Data
(t2)
+
+ a b c d e f
Merge & Compaction
time
Data Size
Updated/
Deleted Data
(t1)
Updated/
Deleted Data
(t2)
year=2022/month=01/day=01/hour=00/
p1.parquet
p2.parquet
year=2022/month=02/day=01/hour=00/
...
year=2022/month=12/day=01/hour=00/
...
year=2023/month=01/day=02/hour=00/
p1.parquet
p2.parquet
year=2023/month=01/day=02/hour=01/
p1.parquet
p2.parquet
S3 Glacier
Deep
Archive
S3
Standard
Logical View
Update/
Delete
View
Merge-On-Read
Logical View
• – Read ,
•
• = Merge & Compaction +
•
Real-time
Materialized View
org_tbl
delta_tbl
Auto Refresh
Streaming
Table
Permanent
Table
Materialized View
Amazon Redshift
Data
Volume
Data
Volume
Data
Volume
t1
tN time
t2
Data Size Unlimited Data Volume
.....
Real-time
Materialized View
org_tbl
delta_tbl
Auto Refresh
Table
data files commit log
Merge-On-Read
Streaming
Table
Permanent
Table
Amazon S3
Materialized View S3 ?
Amazon Redshift
Table
data files commit log
Merge-On-Read
Amazon S3
“Table Format” = Layout of Files in Table
commit_log
date=2023-01-01
Amazon S3 RDBMS
RDBMS
Index
Field1
(v1, t1)
Files
binlog
Read
Field1
(v2, t2)
my_table/
date=2023-01-01/
file-1.parquet
......
file-2.parquet
......
commit_log/
00000.json
00001.json
......
Amazon S3
Write
t1 t2 time
Table
data files
Merge-On-Read
commit log
Insert file-1.parquet
Insert file-2.parquet
Delete file-1.parquet
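The commit-log idea can be sketched as follows: data files are never modified, and the table's current contents are whatever file set results from replaying the log in order. The tuple encoding of a commit and the function name `live_files` are illustrative assumptions, not the on-disk format of any particular table format.

```python
from typing import List, Set, Tuple


def live_files(commit_log: List[Tuple[str, str]]) -> Set[str]:
    """Replay (action, file) commits in order and return the files readers should scan."""
    files: Set[str] = set()
    for action, path in commit_log:
        if action == "insert":
            files.add(path)
        elif action == "delete":
            files.discard(path)
        else:
            raise ValueError(f"unknown action: {action}")
    return files


commit_log = [
    ("insert", "file-1.parquet"),
    ("insert", "file-2.parquet"),
    ("delete", "file-1.parquet"),
]
current = live_files(commit_log)          # state after all three commits
earlier = live_files(commit_log[:2])      # state as of the second commit
print(current, earlier)
```

Replaying only a prefix of the log yields an older table state, which is the basis for time travel.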
“Table Format” = Layout of Files in Table
OPEN TABLE FORMATS
Apache Hudi
© hudi.apache.org
Apache Iceberg
s0
Data
Snapshots
t0 t1
Partition
File
Location
Schema
Format
Stats
Write & Commit
time
Snapshots: State of table at some time
s1
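A snapshot, as described here, is the complete state of the table at one commit time. A minimal sketch of looking up the table state "as of" a timestamp; the `Snapshot` class and `snapshot_as_of` function are hypothetical illustrations of the concept and are not Apache Iceberg's API.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    committed_at: int            # commit time
    data_files: FrozenSet[str]   # complete file list for this table state


def snapshot_as_of(snapshots: List[Snapshot], ts: int) -> Optional[Snapshot]:
    """Return the newest snapshot committed at or before ts (list sorted by commit time)."""
    times = [s.committed_at for s in snapshots]
    idx = bisect_right(times, ts)
    return snapshots[idx - 1] if idx else None


history = [
    Snapshot("s0", 0, frozenset({"file-1.parquet"})),
    Snapshot("s1", 1, frozenset({"file-1.parquet", "file-2.parquet"})),
]
at_t0 = snapshot_as_of(history, 0)
latest = snapshot_as_of(history, 5)
before_any = snapshot_as_of(history, -1)
print(at_t0, latest, before_any)
```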
Apache Iceberg
METADATA FILES TO TRACK DATA
schema, partitions, snapshots
list of files and mappings to snapshots
tracks data files and statistics
© iceberg.apache.org
Apache Iceberg
METADATA FILES TO TRACK DATA
my_table/
├── metadata/
│   ├── 00000.metadata.json
│   ├── 00001.metadata.json
│   ├── 00002.metadata.json
│   .......
│   ├── a39f-e190-b871-ac8e5b-m0.avro
│   ├── a39f-e190-b871-ac8e5b-m1.avro
│   ├── a39f-e190-b871-ac8e5b-m2.avro
│   .......
│   ├── snap-1954-1-2e934.avro
│   ├── snap-4381-1-255b.avro
│   └── snap-4866-1-8bf57.avro
└── data/
    ├── date=2023-01-01
    │   └── file-1.parquet
    └── date=2023-01-02
        └── file-2.parquet
© iceberg.apache.org
Delta Lake
my_table/
├── _delta_log                        (Transaction Log)
│   ├── 00000.json                    (Single commits)
│   ├── 00001.json
│   ├── 00002.json
│   .......
│   ├── 00010.json
│   └── 00010.checkpoint.parquet      (Checkpoint Files)
├── date=2023-01-01                   ((Optional) Partition Directories)
│   └── file-1.parquet                (Data Files)
└── date=2023-01-02
    └── file-2.parquet
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Open Table Formats – Iceberg, Hudi, Delta Lake
Feature | Apache Iceberg | Hudi | Delta Lake
ACID | Yes | Yes | Yes
Partition Evolution | Yes | No | No
Schema Evolution | Yes | Partial | Limited
Time Travel | Yes | Yes | Yes
Merge | Yes | Yes | Yes
Compaction | API based | Manual | Automated
Data Format | Parquet, Avro, ORC, CSV | Parquet, ORC | Parquet
Current Pointer | Metastore, File system with version File | Timeline commit | Transaction log
Conflict Resolution | Optimistic | Optimistic | Optimistic
Programming Language | Java & Python | Scala, Java & Python | Java & Python
Modern Transactional Data Lake
Typical Data Pipeline & Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS
Payments
• : Insert
• : Update
• : Delete
• :
Append Only
Amazon Kinesis
Data Firehose
Data Source Data Pipeline Data Lake
User Profile
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS Amazon Kinesis
Data Firehose
S3
User Profile iceberg
Payments
parquet, orc, avro
iceberg, hudi, delta lake
Athena | Hudi | Iceberg | Delta Lake
Insert | X | O | X
Delete | X | O | X
Select | O | O | O
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS
S3
User Profile iceberg
Payments
parquet, orc, avro
iceberg, hudi, delta lake
Athena | Hudi | Iceberg | Delta Lake
Insert | X | O | X
Delete | X | O | X
Select | O | O | O
AWS Glue
Flink /
Spark
Amazon EMR
Open Source
Serverless Fully Managed
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS AWS Glue
Streaming
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
CDC
{ JSON }
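A hedged sketch of turning one JSON change event from the stream into the (operation, key, row, timestamp) form shown above. It assumes an envelope with a `data` object and a `metadata` object carrying `record-type`, `operation`, and `timestamp`, which resembles what AWS DMS writes to Kinesis; the exact field names, the treatment of `load` as an insert, and the function name `parse_cdc` are assumptions to verify against the actual records.

```python
import json

# Assumed mapping from the event's operation name to the I/U/D codes on the slide.
OP_CODES = {"insert": "I", "load": "I", "update": "U", "delete": "D"}


def parse_cdc(raw: str, key_column: str):
    """Turn one JSON change event into (op, pk, row, ts), or None for non-data records."""
    event = json.loads(raw)
    meta = event.get("metadata", {})
    if meta.get("record-type") != "data":
        return None  # skip anything that is not a row change
    op = OP_CODES[meta["operation"]]
    row = event["data"]
    return op, row[key_column], row, meta["timestamp"]


raw_event = json.dumps({
    "data": {"id": "pk1", "c1": "x", "c2": "y"},
    "metadata": {
        "record-type": "data",
        "operation": "update",
        "timestamp": "2023-05-03T01:00:00Z",
    },
})
parsed = parse_cdc(raw_event, key_column="id")
print(parsed)
```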
Transactional Data Lake
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Amazon RDS
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS Amazon Kinesis
Data Firehose
{JSON}
{JSON}
Demo
Reference Architecture
https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue
Spark + Glue Context
Kinesis Data Streams
Apache Iceberg
Insert/Update/Delete
1
2
3
Glue Streaming Job Code
Glue Streaming Job Code
5
Glue Streaming
Upsert
Delete
1
2
3
Summary
“Table Format” = Layout of Files in Table
OPEN TABLE FORMATS
Amazon S3
Update/Delete In-Place
table1
table2
table3
RDBMS
Transactional
Data Lake RDBMS
Transactional Data Lake:
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
ETL
Amazon Athena
Amazon S3
Amazon RDS
(Apache Iceberg,
Hudi, Delta Lake)
Amazon S3
Amazon Kinesis
Data Firehose
Raw Zone Curated Zone
Transactional Data Lake: +
LAMBDA ARCHITECTURE
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
ETL
Amazon Athena
Amazon S3
Amazon RDS
Amazon Redshift / Redshift Serverless
Real-Time
Materialized
View
Streaming
Table
Permanent
Tables
(Apache Iceberg,
Hudi, Delta Lake)
Amazon S3
Amazon Kinesis
Data Firehose
Raw Zone Curated Zone
Batch Layer
Speed Layer
Transactional Data Lake:
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Amazon RDS
(Apache Iceberg,
Hudi, Delta Lake)
Amazon Redshift / Redshift Serverless
Real-Time
Materialized
View
Streaming
Table
Permanent
Tables
On-Premise Transactional Data Lake
Generic
database
Corporate
data center
Long Time-to-build High Cost in TCO
Deep Expertise
Required
Security
HDFS
Kafka
Connect
Connect
Hive /
Presto
Flink /
Spark
Streaming
Generic
database
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Corporate
data center
AWS Cloud
Streaming Migrations for Analytics on
Generic
database
Corporate
data center
HDFS
Hive /
Presto
Kafka
Connect
Connect
(Apache Iceberg,
Hudi, Delta Lake)
(Apache Iceberg,
Hudi, Delta Lake)
Flink /
Spark
S
Data Lake
Thank you
AWS Summit Seoul 2023 | Real-Time CDC Data Processing! Building a Modern Transactional Data Lake
