GCP Dataflow Architecture
Svetak Sundhar
Solutions Engineer, Google
Agenda
1. Overview of Dataflow Runner architecture
2. Overview of Dataflow Runner core features
3. GCP Horizontal services integrations
4. New Dataflow Runner features
Beam College 2021
Google Cloud Dataflow Service (architecture overview)
Optional GCP services: Virtual Private Cloud, Cloud Network, Cloud IAM, Stackdriver, Key Management Service, Compute Engine
Dataflow service features: Flexible Resources, Monitoring, Log Collection, Graph Optimization, Resource Auto-scaler, Intelligent Watermarking, Shuffle Service, Dynamic Work Rebalancer, Streaming Engine, Dataflow SQL
Pipelines read from sources and write to sinks.
Beam College 2021
Dataflow features
Graph optimization
● Producer-Consumer Fusion
● Sibling Fusion
● Others...
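As a rough sketch (not the optimizer's actual output), the chained element-wise transforms below are the kind of steps Dataflow can fuse into a single stage via producer-consumer fusion, while the grouping step at the end forces a stage boundary. The pipeline and step names are invented for illustration.

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.Create(["a,1", "b,2", "a,3"])
    parsed = lines | "Parse" >> beam.Map(lambda s: s.split(","))        # producer
    keyed = parsed | "Key" >> beam.Map(lambda kv: (kv[0], int(kv[1])))  # consumer, fusable with Parse
    totals = keyed | "SumPerKey" >> beam.CombinePerKey(sum)             # grouping introduces a stage boundary
    totals | "Print" >> beam.Map(print)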
Beam College 2021
Dataflow features
Monitoring
● Dataflow job page
○ Enhanced observability features
Beam College 2021
Dataflow features
Centralized Logging
● Single, searchable log stream via Cloud
Logging
Worker pool: Compute Engine VMs, each with a 10 GB persistent disk.
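As a minimal sketch of how worker logs reach that centralized view: messages written with Python's standard logging module inside a DoFn surface in the job's logs in Cloud Logging. The DoFn and message below are made up for the example.

import logging
import apache_beam as beam

class AuditFn(beam.DoFn):
    def process(self, element):
        # Appears under the Dataflow job's logs in Cloud Logging.
        logging.info("processing element: %s", element)
        yield element

In a pipeline this would be applied with elements | beam.ParDo(AuditFn()).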
Beam College 2021
Cloud Platform regional endpoint: user pipeline code and SDK go to the Job Manager, which deploys and schedules the work.
At a very high level: a user submits a processing pipeline to our managed service, which optimizes it and runs a pool of virtual machines (sometimes called workers) to do the work.
Worker pool: Compute Engine VMs, each with a 10 GB persistent disk.
Beam College 2021
Regional endpoint
● Deploys and controls Dataflow workers and stores job metadata
● Region is us-central1 by default, unless explicitly set using the
region parameter
Zone
● Defines the locations of the Dataflow workers
● Defaults to a zone in the region selected based on available zone
capacity. It can be overridden using the zone parameter.
The zone does not need to be in the same region as the endpoint.
Reasons you may want to do this include:
● Security and Compliance
● Data locality
● Resilience and geographic separation
Caution: If you override the zone and the zone is in a different region
than the regional endpoint, there may be a negative impact on
performance, network traffic, and network latency.
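A minimal sketch of setting these options in the Python SDK, assuming placeholder project, bucket, and location names; flag spellings differ slightly by SDK and version (the Java SDK uses --region and --workerZone, and older SDKs accepted --zone).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--region=europe-west1",               # regional endpoint
    "--worker_zone=europe-west1-b",        # optional: pin workers to a zone
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)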
Dataflow & Compute Engine (diagram): worker pool of Compute Engine VMs, each with a 10 GB persistent disk, alongside Cloud IAM, Compute Engine, and Cloud Network.
Beam College 2021
Identity and Access Management
At least two service accounts are used by the Dataflow service:
● Dataflow Service Account
(service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com)
○ Used for worker creation, monitoring, etc.
● Controller Service Account
○ <project-number>-compute@developer.gserviceaccount.com
○ Used by the workers to access resources needed
by the pipeline, for example files on a Google
Cloud Storage Bucket
○ Can be overridden using --serviceAccount
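A minimal sketch of overriding the controller service account from the Python SDK, where the flag is spelled --service_account_email (--serviceAccount is the Java SDK spelling); the account below is a placeholder and must already hold the IAM roles the pipeline needs.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    # Controller (worker) service account override; placeholder address.
    "--service_account_email=my-worker-sa@my-project.iam.gserviceaccount.com",
])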
Dataflow & Compute Engine (diagram): worker pool of Compute Engine VMs, each with a 10 GB persistent disk, alongside Cloud IAM, Compute Engine, and Cloud Network.
Beam College 2021
Dataflow features
Batch Dynamic Work Redistribution
● Redistribute hot keys for more even
workload distribution.
● Fully automated
Beam College 2021
Dataflow Shuffle - Batch
Diagram: worker Compute VMs connect over a petabit network to the regional Dataflow Shuffle service, which consists of a shuffle proxy, a distributed in-memory file system, and a distributed on-disk file system. Workers use autozone placement across zones 'a', 'b', and 'c' within the region.
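A hedged sketch of opting a batch job into the service-based shuffle from the Python SDK; this was originally exposed as an experiment flag and is now the default in many regions, so the flag may be redundant on current SDKs. Project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    # Service-based shuffle; already the default for batch in many regions.
    "--experiments=shuffle_mode=service",
])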
Beam College 2021
Dataflow Streaming Engine
Benefits:
● Smoother autoscaling
● Better supportability
● Fewer worker resources
Diagram: workers run user code, while window state storage and streaming shuffle move into the Streaming Engine service.
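A minimal sketch of turning on Streaming Engine from the Python SDK (the Java flag is --enableStreamingEngine); project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--streaming",
    "--enable_streaming_engine",
])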
Beam College 2021
Dataflow SQL UI
No coding required
● Write SQL in the BigQuery UI
● Use schemas from Data Catalog
● Submit Dataflow jobs
SELECT payload.userId, payload.productId
FROM pubsub.topic.project.transactions
WHERE payload.location.latitude < 40.72
  AND payload.location.latitude > 40.699
  AND payload.location.longitude < -73.969
  AND payload.location.longitude > -74.747
Beam College 2021
Dataflow features
Flexible Resource Scheduling
● FlexRS reduces batch processing costs by
using advanced scheduling techniques, the
Cloud Dataflow Shuffle service, and a
combination of preemptible virtual machine
(VM) instances and regular VMs.
● Jobs with FlexRS use service-based Cloud
Dataflow Shuffle for joining and grouping.
Diagram: FlexRS worker pool mixes standard VMs and preemptible VMs.
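A minimal sketch of requesting FlexRS from the Python SDK via the FlexRS goal option (the Java spelling is --flexRSGoal); project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--flexrs_goal=COST_OPTIMIZED",   # or SPEED_OPTIMIZED
])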
Beam College 2021
Dataflow Templates
No coding required
● Select one of 20+ Google-provided
templates or use your own
● Popular ETL sources and sinks
● Streaming and Batch modes
● Launch from GCS or Pub/Sub browsers
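A hedged sketch of launching a Google-provided classic template programmatically with the Dataflow REST API via google-api-python-client, assuming application default credentials; the template path, parameter names, and bucket paths are examples to check against the template's documentation.

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",                               # placeholder
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/Word_Count",  # example template
    body={
        "jobName": "wordcount-from-template",
        "parameters": {
            "inputFile": "gs://my-bucket/input.txt",      # placeholder
            "output": "gs://my-bucket/output/result",     # placeholder
        },
    },
)
response = request.execute()
print(response["job"]["id"])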
GPU Support
Attach Graphics Processing Units (GPUs) to your Dataflow workers to accelerate ML model training, batch and streaming predictions, and general data processing.
What you can do with it:
● Select from a range of GPU types (NVIDIA K80, P100, P4, T4, and V100) for your job
● Accelerate ML workloads (preprocessing, feature engineering, ML inference)
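A hedged sketch of requesting a GPU for Dataflow workers from the Python SDK; the exact flag has changed across releases (newer SDKs use --dataflow_service_options, older ones an --experiments=worker_accelerator=... flag), and a container image with the GPU libraries is typically required as well. The accelerator type and names below are illustrative.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--dataflow_service_options="
    "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
])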
Beam College 2021
Dataflow Serverless
Dataflow Prime
● Serverless Auto Tuning
● Serverless Infrastructure
● Smart Diagnostics
● Simplified Billing
● Unified Batch and Streaming
● "Streaming ML" for real-time insights
● Open, intelligent, and flexible platform
● Governance, Security, Lineage, and Workflow Management
● Ingest and distribute data reliably with serverless and OSS systems
● Store and analyze at scale with serverless and OSS systems
Summary
1. Architecture
2. DAG optimization
3. Shuffle Service
4. Streaming Engine
5. Monitoring / Logging
6. Flexible Resource Scheduling
7. Out of the box Templates
8. SQL UI
9. Dataflow Prime