Data platform architecture principles - IEEE Infrastructure 2020
- 2. AGENDA
01 A Healthy Data Ecosystem
02 Data Platform Abstractions and Services
03 Observability for data pipelines
- 12. Data-at-rest
● Table abstraction:
○ Snapshot Isolation
○ Time travel: can roll back a change
○ Schema evolution
○ Partitioning decoupled from job
● Candidates:
○ Iceberg
○ Delta Lake over cloud blob storage
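The snapshot isolation and time travel properties above can be illustrated with a toy in-memory table. This is a minimal sketch, not the Iceberg or Delta Lake API: `ToyTable`, `commit`, `read`, and `rollback` are hypothetical names, and real table formats track snapshots through metadata files rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyTable:
    """In-memory stand-in for a snapshot-based table (hypothetical, not a real API)."""
    # Each snapshot is an immutable tuple of rows; existing snapshots are never modified.
    snapshots: List[Tuple[dict, ...]] = field(default_factory=lambda: [()])
    current: int = 0  # index of the snapshot that new readers see

    def commit(self, new_rows: List[dict]) -> int:
        """Append rows by publishing a new snapshot; returns the new snapshot id."""
        base = self.snapshots[self.current]
        self.snapshots.append(base + tuple(new_rows))
        self.current = len(self.snapshots) - 1
        return self.current

    def read(self, snapshot_id: Optional[int] = None) -> Tuple[dict, ...]:
        """Read the current snapshot, or an older one by id (time travel)."""
        sid = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def rollback(self, snapshot_id: int) -> None:
        """Roll back a change by pointing the table at an earlier snapshot."""
        self.current = snapshot_id


table = ToyTable()
v1 = table.commit([{"id": 1}])
pinned = table.read()            # a reader holds snapshot v1
v2 = table.commit([{"id": 2}])   # a later commit does not affect that reader
```

The reader holding `pinned` keeps seeing one row after the second commit, and `table.rollback(v1)` makes new readers see one row again while `table.read(v2)` remains available.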
- 14. Stream processing
● Anti-pattern:
○ Dependencies outside the streaming bubble:
■ Synchronous service calls
■ Database lookup
○ Ingest that data instead (CDC / Domain events)
■ kafka.KTable, flink.DynamicTable
● Candidates:
○ Flink, Spark Streaming, Kafka Streams
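The idea of ingesting reference data as change events, instead of calling out to a service or database per record, can be sketched in plain Python. This is an illustration only, under simplifying assumptions: the event shapes (`customer_change`, `order`) are hypothetical, both streams are interleaved into one iterable, and a real job would use a framework's table abstraction with managed state rather than a dict.

```python
from typing import Dict, Iterable, Iterator


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Join order events against customer state built from change events."""
    # Local materialized view of the customer table, kept up to date by CDC events.
    customers: Dict[str, dict] = {}
    for event in events:
        if event["type"] == "customer_change":
            if event["row"] is None:
                customers.pop(event["key"], None)       # delete
            else:
                customers[event["key"]] = event["row"]  # insert or update
        elif event["type"] == "order":
            # Lookup stays inside the streaming job: no synchronous external call.
            yield {**event, "customer": customers.get(event["customer_id"])}


stream = [
    {"type": "customer_change", "key": "c1", "row": {"tier": "gold"}},
    {"type": "order", "order_id": 1, "customer_id": "c1"},
    {"type": "customer_change", "key": "c1", "row": None},
    {"type": "order", "order_id": 2, "customer_id": "c1"},
]
enriched = list(enrich(stream))
```

Because the lookup reads local state, throughput and availability of the job no longer depend on an external system at processing time.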
- 15. Batch processing
● Your job as a function: inputs and outputs are parameters.
○ Testable transformation
○ Multiple instances in parallel
● Atomic runs:
○ Output is complete or not visible
● Understand dependencies
○ Jobs depend on their inputs
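A minimal sketch of a batch job written as a function with an atomic publish step. The names `transform` and `run_job`, the JSON-lines layout, and the `part-0.jsonl` file name are assumptions for illustration. The rename trick assumes a POSIX filesystem with staging and output on the same volume; object stores generally need a table format or commit protocol instead.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def transform(records):
    """Pure transformation: testable without any I/O."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in records]


def run_job(input_path: Path, output_dir: Path) -> None:
    """Inputs and outputs are parameters, so several instances can run in parallel."""
    records = [json.loads(line) for line in input_path.read_text().splitlines() if line]
    # Write to a hidden staging directory first so partial output is never visible.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=output_dir.parent))
    try:
        with open(staging / "part-0.jsonl", "w") as out:
            for rec in transform(records):
                out.write(json.dumps(rec) + "\n")
        # Publish in one step: the output is either complete or absent.
        os.rename(staging, output_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.jsonl"
    src.write_text('{"amount": 12.5}\n')
    out_dir = Path(tmp) / "output"
    run_job(src, out_dir)
    result = (out_dir / "part-0.jsonl").read_text()
```

Keeping `transform` free of I/O is what makes the transformation unit-testable, while `run_job` owns the reading, staging, and publishing.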
- 16. Interactive
● Notebooks:
○ Source control for saving state
○ Repeatable environments: docker images
● Warehouse technology:
○ Decoupled storage and compute
○ Interconnection with data storage
- 18. DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
Today: Limited context
- 19. Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Accuracy
Data Timeliness
Data Availability
- 20. Observability for data
● Dependencies: Lineage
● Availability, timeliness, accuracy
● Change management
○ Schema
○ Code
○ Size
○ Duration
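The four change-management dimensions above (schema, code, size, duration) can be compared between two runs of the same job. The following is a sketch under assumptions: `RunSummary`, its fields, and the 50% tolerance are hypothetical choices, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-run metadata record."""
    schema: Dict[str, str]   # column name -> type
    code_version: str        # e.g. a git commit hash
    output_rows: int
    duration_s: float


def what_changed(prev: RunSummary, curr: RunSummary, tolerance: float = 0.5) -> List[str]:
    """List the differences between two runs along the four dimensions."""
    changes = []
    if prev.schema != curr.schema:
        added = sorted(set(curr.schema) - set(prev.schema))
        removed = sorted(set(prev.schema) - set(curr.schema))
        retyped = sorted(c for c in set(prev.schema) & set(curr.schema)
                         if prev.schema[c] != curr.schema[c])
        changes.append(f"schema: added={added} removed={removed} retyped={retyped}")
    if prev.code_version != curr.code_version:
        changes.append(f"code: {prev.code_version} -> {curr.code_version}")
    # Flag size and duration only when the relative change exceeds the tolerance.
    for name, before, after in [("size", prev.output_rows, curr.output_rows),
                                ("duration", prev.duration_s, curr.duration_s)]:
        if before and abs(after - before) / before > tolerance:
            changes.append(f"{name}: {before} -> {after}")
    return changes


prev = RunSummary({"id": "int"}, "abc123", 1000, 60.0)
curr = RunSummary({"id": "int", "email": "string"}, "def456", 100, 62.0)
changes = what_changed(prev, curr)
```

For this pair of runs the function reports a schema change, a code change, and a size drop, while the small change in duration stays under the tolerance.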
- 21. Observability for data
● Dependencies: Lineage
○ In the services world it's called traces
- 23. Marquez: Data model
[Diagram: entities Source, Dataset, Dataset Version, Job, Job Version, and Run, connected by one-to-many (1 to *) relationships]
● Source types: MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA, S3, ICEBERG, DELTALAKE
● Job types: BATCH, STREAM, SERVICE
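The entities named on this slide can be sketched as plain dataclasses. This is an illustration, not Marquez's actual schema or client API: the field names and the direction of each relationship are assumptions, since the slide's diagram only names the entities and shows one-to-many links between them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class JobType(Enum):
    BATCH = "BATCH"
    STREAM = "STREAM"
    SERVICE = "SERVICE"


@dataclass
class Source:
    name: str
    type: str  # one of the listed types, e.g. "POSTGRESQL", "KAFKA", "S3"


@dataclass
class Dataset:
    name: str
    source: Source  # assumed: one source holds many datasets


@dataclass
class DatasetVersion:
    dataset: Dataset  # assumed: one dataset has many versions
    version: int


@dataclass
class Job:
    name: str
    type: JobType


@dataclass
class JobVersion:
    job: Job  # assumed: one job has many versions
    code_ref: str  # hypothetical field, e.g. a git commit


@dataclass
class Run:
    job_version: JobVersion  # assumed: one job version has many runs
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)


lake = Source("lake", "S3")
orders_v1 = DatasetVersion(Dataset("raw.orders", lake), 1)
clean_v1 = DatasetVersion(Dataset("clean.orders", lake), 1)
job = Job("clean_orders", JobType.BATCH)
run = Run(JobVersion(job, "abc123"), inputs=[orders_v1], outputs=[clean_v1])
```

Recording dataset versions as run inputs and outputs is what lets lineage be reconstructed per run rather than only per job.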
- 24. Marquez
API
● Marquez standardizes metadata collection
○ Job runs
○ parameters
○ version
○ inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram labels: Datakin, Lineage analysis, Graph, Integrations. Caption: Datakin leverages Marquez metadata]
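Impact analysis over collected lineage amounts to a graph traversal: given a dataset or job, find everything downstream of it. The sketch below is a generic breadth-first search over a hypothetical adjacency map. It does not use the Marquez or Datakin APIs, and the node names are made up.

```python
from collections import deque
from typing import Dict, List, Set


def downstream(lineage: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every job and dataset reachable from `start` in the lineage graph."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in lineage.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Edges point from a dataset to the jobs reading it, and from a job to the datasets it writes.
lineage = {
    "dataset:raw.orders": ["job:clean_orders"],
    "job:clean_orders": ["dataset:clean.orders"],
    "dataset:clean.orders": ["job:daily_revenue", "job:churn_features"],
    "job:daily_revenue": ["dataset:reports.revenue"],
}
impacted = downstream(lineage, "dataset:clean.orders")
```

Here a change to `dataset:clean.orders` would affect both consuming jobs and the revenue report. Running the same traversal on reversed edges gives the upstream candidates to inspect when troubleshooting.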