Data Platform Architecture Principles
Julien Le Dem
CTO and co-founder Datakin
@J_
AGENDA
01 A Healthy Data Ecosystem
02 Data Platform Abstractions and Services
03 Observability for data pipelines
01 A Healthy Data Ecosystem
Team interdependencies
Team A Team B
Team C
Explicit contracts
● Schemas
● Shared or Private
● SLA: experimental, production ready
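The contract attributes above can be written down as data. The following is a minimal sketch only: the class names, the fields, and the rule that other teams may depend only on shared, production-ready datasets are assumptions made for illustration, not something the slides prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    SHARED = "shared"
    PRIVATE = "private"


class Sla(Enum):
    EXPERIMENTAL = "experimental"
    PRODUCTION_READY = "production ready"


@dataclass
class DatasetContract:
    """Hypothetical explicit contract published by the team owning a dataset."""
    name: str
    owner: str
    schema: dict  # column name -> type name
    visibility: Visibility
    sla: Sla

    def can_depend(self, consumer_team: str) -> bool:
        # Assumed policy: the owner may always use its own data; other teams
        # may depend only on shared, production-ready datasets.
        if consumer_team == self.owner:
            return True
        return (self.visibility is Visibility.SHARED
                and self.sla is Sla.PRODUCTION_READY)


orders = DatasetContract(
    name="orders",
    owner="team_a",
    schema={"order_id": "long", "amount": "double"},
    visibility=Visibility.SHARED,
    sla=Sla.EXPERIMENTAL,
)
```

Here `orders.can_depend("team_b")` is false until Team A promotes the dataset to production ready, which makes the dependency an explicit decision rather than an accident.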
Understanding dependencies
● Who do I depend on?
● Who depends on me?
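Both questions are traversals of the same dependency graph, in opposite directions. A minimal sketch, assuming dependencies are recorded as (consumer, producer) pairs; the class and method names are hypothetical.

```python
from collections import defaultdict


class DependencyGraph:
    """Records which node (team, job, or dataset) depends on which."""

    def __init__(self):
        self._upstream = defaultdict(set)    # consumer -> producers
        self._downstream = defaultdict(set)  # producer -> consumers

    def add_dependency(self, consumer, producer):
        self._upstream[consumer].add(producer)
        self._downstream[producer].add(consumer)

    @staticmethod
    def _closure(start, edges):
        # Depth-first walk collecting every node reachable from start.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def depends_on(self, node):
        """Who do I depend on? (direct and transitive)"""
        return self._closure(node, self._upstream)

    def dependents_of(self, node):
        """Who depends on me? (direct and transitive)"""
        return self._closure(node, self._downstream)


graph = DependencyGraph()
graph.add_dependency("team_b", "team_a")
graph.add_dependency("team_c", "team_b")
```

With these two edges, `graph.dependents_of("team_a")` includes Team C even though Team C never reads Team A's data directly.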
Quick iterations
● Fail safe environment: Easy to undo
● Quick troubleshooting
● Quick feedback
02 Data Platform Abstractions and Services
Storage and ingestion
[Diagram: Services and Online storage; Events and CDC; Data-in-motion; Data-at-rest; Archival]
Data Products
Data-in-motion
● Schema registry
● Keyed for CDC
● Horizontally scalable
○ Partitioning
● Candidates: Kafka, Pulsar, …
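To illustrate why keying matters for CDC: if every change to a row carries the row's key and the partition is derived from that key, all changes to one row stay in one partition and keep their order. The sketch below is a toy model, not a broker client; CRC32 is used only because it is stable across runs, and real systems use their own partitioners.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Distribute (key, value) change events over partitions, keeping arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


changes = [
    ("user-1", "insert"),
    ("user-2", "insert"),
    ("user-1", "update"),
    ("user-1", "delete"),
]
partitions = route(changes, num_partitions=4)
```

Every `user-1` event lands in the same partition in insert, update, delete order, so a consumer of that partition can rebuild the row's state without coordinating with other consumers.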
Data-at-rest
● Table abstraction:
○ Snapshot Isolation
○ Time travel: can roll back a change
○ Schema evolution
○ Partitioning decoupled from job
● Candidates:
○ Iceberg or Delta Lake, over cloud blob storage
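Snapshot isolation and time travel both follow from treating each commit as a new immutable snapshot. The class below is an in-memory toy with hypothetical names; it is not the Iceberg or Delta Lake API, only a sketch of the idea.

```python
class SnapshotTable:
    """Toy table where each commit creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table
        self._current = 0

    def commit(self, rows):
        """Append rows as a new snapshot and make it current."""
        new_snapshot = self._snapshots[self._current] + tuple(rows)
        self._snapshots.append(new_snapshot)
        self._current = len(self._snapshots) - 1
        return self._current

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])

    def rollback(self, snapshot_id):
        """Undo later commits by pointing the table at an earlier snapshot."""
        self._current = snapshot_id


table = SnapshotTable()
first = table.commit([{"id": 1}, {"id": 2}])
second = table.commit([{"id": 3}])
table.rollback(first)
```

After the rollback, readers see two rows again, while `table.scan(second)` can still inspect the snapshot that was undone. Because snapshots are never modified, a reader that started on one snapshot is unaffected by later commits.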
Processing
[Diagram: Data-in-motion, Data-at-rest, Archival, Stream processing, Batch processing, Data Products]
Stream processing
● Anti-pattern:
○ Dependencies outside the streaming bubble:
■ Synchronous service calls
■ Database lookup
○ Ingest that data instead (CDC / Domain events)
■ kafka.KTable, flink.DynamicTable
● Candidates:
○ Flink, Spark Streaming, Kafka Streams
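The alternative to a database lookup can be sketched without any streaming framework: the job consumes the change events itself and keeps a local table, which is roughly the role a KTable or dynamic table plays. The event shapes and the function name below are assumptions for illustration.

```python
def enrich_orders(events):
    """Join order events against a table materialized from customer change events.

    `events` is a single time-ordered sequence of (kind, payload) pairs.
    No call leaves the job: customer data arrives as CDC events.
    """
    customers = {}  # local table keyed by customer id, built from the changelog
    enriched = []
    for kind, payload in events:
        if kind == "customer_change":
            if payload.get("deleted"):
                customers.pop(payload["id"], None)
            else:
                customers[payload["id"]] = payload
        elif kind == "order":
            customer = customers.get(payload["customer_id"])
            name = customer["name"] if customer else None
            enriched.append({**payload, "customer_name": name})
    return enriched


events = [
    ("customer_change", {"id": 1, "name": "Ada"}),
    ("order", {"order_id": 10, "customer_id": 1}),
    ("customer_change", {"id": 1, "name": "Ada L."}),
    ("order", {"order_id": 11, "customer_id": 1}),
    ("order", {"order_id": 12, "customer_id": 2}),
]
result = enrich_orders(events)
```

Each order is enriched with the customer state as of that point in the stream, so replaying the same events gives the same output, which a synchronous lookup against a live database cannot guarantee.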
Batch processing
● Your job as a function: inputs and outputs are parameters
○ Testable transformation
○ Multiple instances in parallel
● Atomic runs:
○ Output is complete or not visible
● Understand dependencies
○ Jobs depend on their inputs
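A minimal sketch of a batch job written as a function with atomic output, assuming local files on a POSIX filesystem; the function names and the uppercase transformation are placeholders. The transformation is a pure function that can be tested on its own, and the output is written to a temporary file and renamed into place only once complete.

```python
import os
import tempfile
from pathlib import Path


def transform(lines):
    """Pure transformation: testable without touching storage."""
    return [line.upper() for line in lines]


def run_job(input_path: Path, output_path: Path) -> None:
    """Inputs and outputs are parameters, so several instances can run in parallel."""
    result = transform(input_path.read_text().splitlines())
    # Write to a temporary file in the target directory first.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(result) + "\n")
        # The rename is atomic on POSIX: readers see the complete output or nothing.
        os.replace(tmp_name, output_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A failed run leaves no partial output behind, so rerunning the job with the same parameters is always safe.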
Interactive
● Notebooks:
○ Source control for saving state
○ Repeatable environments: docker images
● Warehouse technology:
○ Decoupled storage and compute
○ Interconnection with data storage
03 Observability for data pipelines
DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
Today: Limited context
Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Accuracy
Data Timeliness
Data Availability
Observability for data
● Dependencies: Lineage
● Availability, timeliness, accuracy
● Change management
○ Schema
○ Code
○ Size
○ Duration
In the services world it’s called traces
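Change management across schema, code, size, and duration can be sketched as a comparison between two runs of the same job. The record fields and the 50% tolerance below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-run record of the four dimensions listed above."""
    schema: dict       # column name -> type name
    code_version: str
    row_count: int
    duration_s: float


def describe_changes(previous: RunStats, current: RunStats, tolerance: float = 0.5):
    """List what changed between two runs; size and duration use a relative tolerance."""
    changes = []
    before_cols, after_cols = set(previous.schema), set(current.schema)
    for col in sorted(after_cols - before_cols):
        changes.append(f"schema: added {col}")
    for col in sorted(before_cols - after_cols):
        changes.append(f"schema: removed {col}")
    for col in sorted(before_cols & after_cols):
        if previous.schema[col] != current.schema[col]:
            changes.append(
                f"schema: {col} changed type {previous.schema[col]} -> {current.schema[col]}"
            )
    if previous.code_version != current.code_version:
        changes.append(f"code: {previous.code_version} -> {current.code_version}")
    metrics = (
        ("size", previous.row_count, current.row_count),
        ("duration", previous.duration_s, current.duration_s),
    )
    for label, before, after in metrics:
        if before and abs(after - before) / before > tolerance:
            changes.append(f"{label}: {before} -> {after}")
    return changes


previous = RunStats({"id": "long", "amount": "double"}, "abc", 1000, 60.0)
current = RunStats({"id": "long", "amount": "string", "note": "string"}, "def", 100, 65.0)
report = describe_changes(previous, current)
```

In this example the report flags the new column, the type change, the code change, and the drop in size, but not the small change in duration.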
Metadata:
Ingest, Storage, Compute
Streaming, Batch/ETL
● Data Platform built around Observability
● Integrations
○ Ingest
○ Storage
○ Compute
○ BI dashboards
Flink
Airflow
Kafka
Iceberg / S3
BI
Marquez: Data model
[Diagram: entities Source, Dataset, Dataset Version, Job, Job Version, and Run, linked by one-to-many (1 to *) relationships]
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
● BATCH
● STREAM
● SERVICE
Marquez
API
● Marquez standardizes metadata collection
○ Job runs
○ parameters
○ version
○ inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
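Once job runs are recorded with their version and input dataset versions, the troubleshooting question becomes a diff between the failing run and the last successful one. The sketch below uses a hypothetical in-memory `Run` record; it does not use the Marquez API or reproduce its actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical run record: job version plus the dataset versions it read."""
    job: str
    job_version: str
    succeeded: bool
    inputs: dict = field(default_factory=dict)  # dataset name -> dataset version


def changed_since_last_success(runs):
    """Compare the latest run with the most recent successful run before it.

    `runs` is a chronological list for one job. Returns a dict mapping what
    changed to (before, after), or None if no earlier run succeeded.
    """
    latest = runs[-1]
    last_ok = next((r for r in reversed(runs[:-1]) if r.succeeded), None)
    if last_ok is None:
        return None
    diff = {}
    if latest.job_version != last_ok.job_version:
        diff["job_version"] = (last_ok.job_version, latest.job_version)
    for name in sorted(set(latest.inputs) | set(last_ok.inputs)):
        before, after = last_ok.inputs.get(name), latest.inputs.get(name)
        if before != after:
            diff[name] = (before, after)
    return diff


history = [
    Run("daily_report", "v1", True, {"orders": "s41", "customers": "s7"}),
    Run("daily_report", "v1", False, {"orders": "s42", "customers": "s7"}),
    Run("daily_report", "v2", False, {"orders": "s42", "customers": "s7"}),
]
diff = changed_since_last_success(history)
```

Here the diff points at both the code change (v1 to v2) and the new version of `orders`, which narrows the investigation to two candidates.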
Datakin leverages Marquez metadata
Datakin
Lineage analysis
Graph
Integrations
Thanks! <o/
Questions?

Data Platform Architecture Principles - IEEE Infrastructure 2020