Data Platform Architecture Principles - IEEE Infrastructure 2020

Data Platform Architecture Principles
Julien Le Dem
CTO and co-founder, Datakin
@J_

AGENDA
01 A Healthy Data Ecosystem
02 Data Platform Abstractions and Services
03 Observability for data pipelines

Team interdependencies
Team A Team B
Team C

Explicit contracts
● Schemas
● Shared or Private
● SLA: experimental, production ready
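
One way to make such a contract explicit is to declare it as data next to the dataset. The sketch below is only an illustration: the names `DatasetContract`, `Visibility`, `Sla`, and the `team_a.orders` dataset are hypothetical and do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    SHARED = "shared"
    PRIVATE = "private"


class Sla(Enum):
    EXPERIMENTAL = "experimental"
    PRODUCTION_READY = "production ready"


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical explicit contract published by the owning team."""
    name: str
    schema: dict  # column name -> Python type name, e.g. {"order_id": "int"}
    visibility: Visibility
    sla: Sla

    def violations(self, record: dict) -> list:
        """Return human-readable schema violations for one record."""
        problems = []
        for column, type_name in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif type(record[column]).__name__ != type_name:
                actual = type(record[column]).__name__
                problems.append(f"{column}: expected {type_name}, got {actual}")
        return problems


# Illustrative contract for a dataset shared with other teams.
orders = DatasetContract(
    name="team_a.orders",
    schema={"order_id": "int", "amount": "float"},
    visibility=Visibility.SHARED,
    sla=Sla.PRODUCTION_READY,
)
```

A consumer can then check records against `orders.violations(...)` before relying on the dataset, and can see from `visibility` and `sla` whether it is meant to be depended on at all.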

Understanding dependencies
● Who do I depend on?
● Who depends on me?
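
Both questions can be answered from job inputs and outputs alone. The following is a minimal sketch under that assumption; the job and dataset names in `JOBS` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical jobs: name -> (input datasets, output datasets)
JOBS = {
    "team_a.ingest_orders": ([], ["orders"]),
    "team_b.daily_revenue": (["orders"], ["revenue"]),
    "team_c.dashboard": (["revenue", "orders"], []),
}


def build_indexes(jobs):
    """Index which job produces each dataset and which jobs consume it."""
    producers = {}
    consumers = defaultdict(set)
    for job, (inputs, outputs) in jobs.items():
        for dataset in outputs:
            producers[dataset] = job
        for dataset in inputs:
            consumers[dataset].add(job)
    return producers, consumers


def depends_on(job, jobs):
    """Who do I depend on? The producers of my inputs."""
    producers, _ = build_indexes(jobs)
    inputs, _ = jobs[job]
    return {producers[d] for d in inputs if d in producers}


def depended_on_by(job, jobs):
    """Who depends on me? The consumers of my outputs."""
    _, consumers = build_indexes(jobs)
    _, outputs = jobs[job]
    return set().union(*(consumers[d] for d in outputs))
```

This only covers direct neighbors; following the same indexes transitively gives the full upstream or downstream lineage.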

Quick iterations
● Fail-safe environment: easy to undo
● Quick troubleshooting
● Quick feedback

02 Data Platform Abstractions and Services

Storage and ingestion
Diagram: Services and Online storage; Events and CDC; Data-in-motion; Data-at-rest; Archival; Data Products

Data-in-motion
● Schema registry
● Keyed for CDC
● Horizontally scalable
○ Partitioning
● Candidates: Kafka, Pulsar, …
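
Keying CDC events matters because partitioning by key keeps all changes to one row in a single partition, and therefore in order. The sketch below illustrates the idea only: `partition_for` and `route` are hypothetical helpers, and CRC32 is a stand-in hash, not the partitioner that Kafka or Pulsar actually use.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    zlib.crc32 is used because Python's built-in hash() for str is
    randomized per process; real brokers use their own hash functions.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(event["key"], num_partitions)].append(event)
    return partitions


# Illustrative change events for two rows of a hypothetical users table.
events = [
    {"key": "user-1", "op": "insert"},
    {"key": "user-2", "op": "insert"},
    {"key": "user-1", "op": "update"},
    {"key": "user-1", "op": "delete"},
]
partitions = route(events, num_partitions=4)
```

Adding partitions scales throughput horizontally, while each key's insert, update, and delete still arrive in the order they were produced.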

Data-at-rest
● Table abstraction:
○ Snapshot Isolation
○ Time travel: can roll back a change
○ Schema evolution
○ Partitioning decoupled from job
● Candidates:
○ Iceberg
○ Delta Lake over cloud blob storage
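
Snapshot isolation and time travel can be pictured as a list of immutable snapshots plus a pointer to the current one. This is a toy in-memory sketch, not the Iceberg or Delta Lake API; the `SnapshotTable` class and its methods are hypothetical.

```python
class SnapshotTable:
    """Toy table: every commit publishes a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table
        self._current = 0

    def commit(self, rows):
        """Publish a new snapshot and return its id."""
        self._snapshots.append(tuple(rows))
        self._current = len(self._snapshots) - 1
        return self._current

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])

    def rollback(self, snapshot_id):
        """Undo later changes by pointing back at an earlier snapshot."""
        self._current = snapshot_id


table = SnapshotTable()
good = table.commit([{"id": 1}, {"id": 2}])
bad = table.commit([{"id": 1}])   # a faulty job dropped a row
table.rollback(good)              # readers see the good data again
```

Readers always see one complete snapshot, never a half-written one, and a bad change is undone by moving the pointer rather than by rewriting data.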

Processing
Diagram: Data-in-motion; Stream processing; Data-at-rest; Archival; Batch processing; Data Products

Stream processing
● Anti-pattern:
○ Dependencies outside the streaming bubble:
■ Synchronous service calls
■ Database lookup
○ Ingest that data instead (CDC / Domain events)
■ kafka.KTable, flink.DynamicTable
● Candidates:
○ Flink, Spark Streaming, Kafka Streams
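
The alternative to an external lookup is to ingest the reference data as a change stream and keep it as local state. The sketch below shows the idea in plain Python rather than in any of the frameworks above; the stream names, message shape, and the helpers `apply_cdc` and `process` are assumptions made for illustration.

```python
def apply_cdc(table: dict, change: dict) -> None:
    """Apply one keyed change event to the locally materialized table."""
    if change["op"] == "delete":
        table.pop(change["key"], None)
    else:  # "upsert"
        table[change["key"]] = change["value"]


def process(messages):
    """Enrich click events from local state only: no service or database call."""
    users = {}  # materialized from the ingested users CDC stream
    enriched = []
    for msg in messages:
        if msg["stream"] == "users_cdc":
            apply_cdc(users, msg)
        else:  # "clicks"
            click = msg["value"]
            profile = users.get(click["user_id"]) or {}
            enriched.append({**click, "country": profile.get("country")})
    return enriched


# Hypothetical interleaving of a CDC stream and an event stream.
messages = [
    {"stream": "users_cdc", "op": "upsert", "key": 1, "value": {"country": "FR"}},
    {"stream": "clicks", "value": {"user_id": 1, "page": "/home"}},
    {"stream": "users_cdc", "op": "delete", "key": 1, "value": None},
    {"stream": "clicks", "value": {"user_id": 1, "page": "/pricing"}},
]
enriched = process(messages)
```

Because the job depends only on its input streams, replaying the same messages gives the same output, and the job does not fail or slow down when another service does.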

Batch processing
● Your job as a function: inputs and outputs are parameters.
○ Testable transformation
○ Multiple instances in parallel
● Atomic runs:
○ Output is complete or not visible
● Understand dependencies
○ Jobs depend on their inputs
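
A minimal sketch of a batch job written as a function with an atomic output, assuming JSON-lines files on a local filesystem. The `transform` logic, field names, and paths are hypothetical; on blob storage the atomic publish step would instead rely on a table format's commit.

```python
import json
import os
import tempfile


def transform(rows):
    """Pure transformation: unit-testable without any I/O."""
    return [{"user": r["user"], "total": r["price"] * r["qty"]} for r in rows]


def run_job(input_path: str, output_path: str) -> None:
    """Inputs and outputs are parameters, so instances can run in parallel."""
    with open(input_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    result = transform(rows)

    # Write to a temporary file in the target directory, then rename.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for row in result:
                tmp.write(json.dumps(row) + "\n")
        # os.replace is atomic on the same filesystem: readers see either
        # the previous output (or none) or the complete new output.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A failed run leaves no partial output behind, so rerunning the job with the same parameters is safe.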

Interactive
● Notebooks:
○ Source control for saving state
○ Repeatable environments: docker images
● Warehouse technology:
○ Decoupled storage and compute
○ Interconnection with data storage

03 Observability for data pipelines

Today: Limited context
DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?

Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Accuracy
Data Timeliness
Data Availability

● Dependencies: Lineage
● Availability, timeliness, accuracy
● Change management
○ Schema
○ Code
○ Size
○ Duration
In the services world it’s called traces
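
Change management along these four dimensions can be sketched as a comparison between the metadata of the last good run and the current run. The metadata shape, the field names, and the 50% tolerance below are all assumptions chosen for illustration.

```python
def diff_runs(last_good: dict, current: dict, tolerance: float = 0.5) -> list:
    """List which of schema, code, size, duration changed between two runs.

    Size and duration are flagged only when the relative change exceeds
    `tolerance` (an arbitrary threshold picked for this sketch).
    """
    changes = []
    if last_good["schema"] != current["schema"]:
        changes.append("schema")
    if last_good["code_version"] != current["code_version"]:
        changes.append("code")
    for metric, label in (("output_rows", "size"), ("duration_s", "duration")):
        before, after = last_good[metric], current[metric]
        if before and abs(after - before) / before > tolerance:
            changes.append(label)
    return changes


# Hypothetical run metadata collected for the same job on two days.
last_good = {
    "schema": {"id": "int"},
    "code_version": "abc123",
    "output_rows": 1000,
    "duration_s": 60,
}
current = {
    "schema": {"id": "int", "email": "str"},
    "code_version": "abc123",
    "output_rows": 120,
    "duration_s": 62,
}
changed = diff_runs(last_good, current)
```

Here the schema gained a column and the output shrank sharply while the code stayed the same, which points the investigation at the inputs rather than at the job itself.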

Metadata
● Data Platform built around Observability
● Integrations
○ Ingest
○ Storage
○ Compute
○ BI dashboards
Diagram labels: Ingest, Storage, Compute (Streaming, Batch/ETL), Flink, Airflow, Kafka, Iceberg / S3, BI

Marquez: Data model
● Entities: Source, Dataset, Dataset Version, Job, Job Version, Run
● Relationships between entities are one-to-many (1 to *)
● Source types: MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA, S3, ICEBERG, DELTALAKE
● Job types: BATCH, STREAM, SERVICE

Datakin leverages Marquez metadata
● Marquez standardizes metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
Diagram labels: Integrations, Marquez API, Datakin (Lineage analysis, Graph)
