The document discusses Google Cloud Dataflow architecture and features. It describes how Dataflow optimizes data processing pipelines, leverages services like the Shuffle Service and Streaming Engine, provides flexible resource scheduling and monitoring, includes templates for common workflows, and offers an SQL UI and Dataflow Prime serverless option.
2. Agenda
1. Overview of Dataflow Runner architecture
2. Overview of Dataflow Runner core features
3. GCP Horizontal services integrations
4. New Dataflow Runner features
3. Beam College 2021
Google Cloud Dataflow Service
(Architecture diagram: Sources feed the Google Cloud Dataflow Service, which writes to Sinks.)
Dataflow service components: Flexible Resources, Monitoring, Log Collection, Graph Optimization, Resource Auto-scaler, Intelligent Watermarking, Shuffle Service, Dynamic Work Rebalancer, Streaming Engine, Dataflow SQL
Optional GCP Services: Virtual Private Cloud, Cloud Network, Cloud IAM, Stackdriver, Key Management Service, Compute Engine
4. Beam College 2021
Dataflow features
Graph optimization
● Producer - Consumer Fusion
● Sibling Fusion
● Others...
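To make producer-consumer fusion concrete, here is a minimal toy sketch in plain Python. It is not Dataflow's optimizer; the function names (run_unfused, fuse, run_fused) are hypothetical and only illustrate the idea that a chain of element-wise steps can be collapsed into one stage, so no intermediate collection has to be materialized between them.

```python
from typing import Any, Callable, Iterable, List

Step = Callable[[Any], Any]


def run_unfused(steps: List[Step], data: Iterable[Any]) -> List[Any]:
    """Run every step as its own stage, building a full intermediate list each time."""
    current = list(data)
    for step in steps:
        current = [step(x) for x in current]
    return current


def fuse(steps: List[Step]) -> Step:
    """Compose a chain of element-wise steps into one function (the 'fused' stage)."""
    def fused(x: Any) -> Any:
        for step in steps:
            x = step(x)
        return x
    return fused


def run_fused(steps: List[Step], data: Iterable[Any]) -> List[Any]:
    """Run the whole chain in a single pass over the input."""
    fused = fuse(steps)
    return [fused(x) for x in data]
```

Both functions return the same result for the same steps; the fused version simply avoids the per-step intermediate lists.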
8. Beam College 2021
Cloud Platform
Regional Endpoint
User pipeline code and SDK
Job Manager
Deploy and Schedule
At a very high level: a user submits a processing
pipeline to our managed service, which optimizes
it and runs a pool of virtual machines
(sometimes called workers) to do the work.
(Diagram: a pool of Compute Engine workers, each with a 10GB PD)
9. Beam College 2021
Regional endpoint
● Deploys and controls Dataflow workers and stores Job Metadata
● Region is us-central1 by default, unless explicitly set using the
region parameter
Zone
● Defines the locations of the Dataflow workers
● Defaults to a zone in the region selected based on available zone
capacity. It can be overridden using the zone parameter.
The zone does not need to be in the same region as the endpoint.
Reasons you may want to do this include:
● Security and Compliance
● Data locality
● Resilience and geographic separation
Caution: If you override the zone and the zone is in a different region
than the regional endpoint, there may be negative impact on
performance, network traffic, and network latency.
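The region and zone choices above can be sketched as a small helper that builds command-line arguments for a pipeline. This is only an illustration: the flag spellings --region and --worker_zone follow the Beam Python SDK as far as I know and should be checked against your SDK version, and the helper names location_args and zone_in_region are hypothetical.

```python
from typing import List, Optional

DEFAULT_REGION = "us-central1"  # default region stated in the slide


def location_args(region: Optional[str] = None,
                  worker_zone: Optional[str] = None) -> List[str]:
    """Build location flags; the zone is left to the service unless overridden."""
    args = [f"--region={region or DEFAULT_REGION}"]
    if worker_zone:
        # Assumed flag name (Beam Python SDK); verify for your SDK version.
        args.append(f"--worker_zone={worker_zone}")
    return args


def zone_in_region(zone: str, region: str) -> bool:
    """Compute Engine zone names are the region name plus a suffix, e.g. us-central1-a."""
    return zone.rsplit("-", 1)[0] == region
```

A check such as zone_in_region("us-east1-b", "us-central1") returning False is a cheap way to flag the cross-region setup that the caution above warns about.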
Dataflow & Compute Engine
(Diagram: Compute Engine workers, each with a 10GB PD; labels: Cloud IAM, Compute Engine, Cloud Network)
10. Beam College 2021
Identity Access Management
The Dataflow service uses at least two service accounts:
● Dataflow Service Account
(service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com)
○ Used for worker creation, monitoring, etc.
● Controller Service Account
○ Default: <project-number>-compute@developer.gserviceaccount.com
○ Used by the workers to access resources needed by the pipeline, for example files in a Google Cloud Storage bucket
○ Can be overridden using --serviceAccount
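As a small illustration of the two naming patterns above, the following sketch derives both account emails from a project number. The function names are hypothetical, and the project number used in any call is a placeholder, not a real project.

```python
def dataflow_service_account(project_number: str) -> str:
    """Service account used by the Dataflow service itself (pattern from the slide)."""
    return (f"service-{project_number}"
            "@dataflow-service-producer-prod.iam.gserviceaccount.com")


def default_controller_service_account(project_number: str) -> str:
    """Default account used by the workers (pattern from the slide)."""
    return f"{project_number}-compute@developer.gserviceaccount.com"
```

For a placeholder project number such as "123456789012", these helpers return the two addresses you would look for when granting IAM roles.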
12. Beam College 2021
Dataflow features
Batch Dynamic Work Redistribution
● Redistribute hot keys for more even
workload distribution.
● Fully automated
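One way to picture work redistribution is a worker that hands part of its unprocessed input range to an idle worker. The sketch below is a toy model under that assumption, not Dataflow's actual algorithm; split_remaining and its parameters are hypothetical names.

```python
from typing import Tuple

Range = Tuple[int, int]  # half-open range [start, end)


def split_remaining(start: int, end: int, position: int,
                    fraction: float = 0.5) -> Tuple[Range, Range]:
    """Split the unprocessed part of [start, end) at the given fraction.

    `position` is the next unprocessed offset. The first range stays with the
    current worker; the second can be handed to an idle worker.
    """
    if not start <= position < end:
        raise ValueError("position must lie inside [start, end)")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    remaining = end - position
    # The current worker always keeps at least the element it is working on.
    split = position + max(1, int(remaining * fraction))
    return (start, split), (split, end)
```

With a range of 0 to 100 and the worker at offset 20, the default split leaves the worker with offsets up to 60 and offers the rest to another worker.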
13. Beam College 2021
Dataflow Shuffle - Batch
(Diagram labels: Compute; Petabit network; Dataflow Shuffle; Region with Zones 'a', 'b' and 'c'; Shuffle proxy; Distributed in-memory file system; Distributed on-disk file system; Autozone placement)
14. Beam College 2021
Dataflow Streaming Engine
Benefits
● Smoother autoscaling
● Better supportability
● Fewer worker resources
(Diagram: workers run only user code; window state storage and streaming shuffle move into the Streaming Engine)
15. Beam College 2021
Dataflow SQL UI
No coding required
● Write SQL in BigQuery UI
● Use schemas from Data Catalog
● Submit Dataflow jobs
SELECT payload.userId,
       payload.productId
FROM pubsub.topic.project.transactions
WHERE payload.location.latitude < 40.72
  AND payload.location.latitude > 40.699
  AND payload.location.longitude < -73.969
  AND payload.location.longitude > -74.747
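For readers more at home in Python than SQL, the query above amounts to a bounding-box filter followed by a projection. The sketch below applies the same predicate to plain dictionaries; the event layout mirrors the field names in the query, and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]


def in_bounding_box(event: Event) -> bool:
    """Same latitude and longitude bounds as the WHERE clause."""
    loc = event["payload"]["location"]
    return (40.699 < loc["latitude"] < 40.72
            and -74.747 < loc["longitude"] < -73.969)


def select_user_and_product(events: Iterable[Event]) -> List[Tuple[Any, Any]]:
    """Same projection as the SELECT list, applied to matching events only."""
    return [(e["payload"]["userId"], e["payload"]["productId"])
            for e in events if in_bounding_box(e)]
```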
16. Beam College 2021
Dataflow features
Flexible Resource Scheduling
● FlexRS reduces batch processing costs by
using advanced scheduling techniques, the
Cloud Dataflow Shuffle service, and a
combination of preemptible virtual machine
(VM) instances and regular VMs.
● Jobs with FlexRS use service-based Cloud
Dataflow Shuffle for joining and grouping.
(Diagram: Standard VM, Preemptible VM)
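A batch job opts into FlexRS through a pipeline option. The sketch below builds that argument; the flag name --flexrs_goal and its two values follow the Beam Python SDK as far as I know, so treat them as an assumption to verify, and flexrs_args is a hypothetical helper.

```python
from typing import List

# Assumed option values (Beam Python SDK); verify against your SDK version.
FLEXRS_GOALS = ("COST_OPTIMIZED", "SPEED_OPTIMIZED")


def flexrs_args(goal: str = "COST_OPTIMIZED") -> List[str]:
    """Return the pipeline argument that requests FlexRS scheduling."""
    if goal not in FLEXRS_GOALS:
        raise ValueError(f"unknown FlexRS goal: {goal}")
    return [f"--flexrs_goal={goal}"]
```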
17. Beam College 2021
Dataflow Templates
No coding required
● Select one of 20+ Google-provided
templates or use your own
● Popular ETL sources and sinks
● Streaming and Batch modes
● Launch from GCS or Pub/Sub browsers
18. GPU Support
Attach graphics processing units (GPUs) to your Dataflow workers to accelerate ML model training, batch and streaming predictions, and general data processing.
What you can do with it
● Select from a range of GPU types (NVIDIA K80, P100, P4, T4, and V100) for your job
● Accelerate ML workloads (preprocessing, feature engineering, ML inference)
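GPUs are requested per job through a worker_accelerator setting. The sketch below assembles that string for the GPU types listed above. The string format and the accelerator type identifiers are written from memory of the Dataflow documentation and should be treated as assumptions; depending on SDK version the value is passed via --dataflow_service_options or --experiments. The helper name is hypothetical.

```python
# Assumed Compute Engine accelerator identifiers for the GPU types on the slide.
GPU_TYPES = {
    "K80": "nvidia-tesla-k80",
    "P100": "nvidia-tesla-p100",
    "P4": "nvidia-tesla-p4",
    "T4": "nvidia-tesla-t4",
    "V100": "nvidia-tesla-v100",
}


def worker_accelerator_option(gpu: str, count: int = 1,
                              install_driver: bool = True) -> str:
    """Build the worker_accelerator value for one GPU type."""
    if gpu not in GPU_TYPES:
        raise ValueError(f"unsupported GPU type: {gpu}")
    if count < 1:
        raise ValueError("count must be at least 1")
    parts = [f"type:{GPU_TYPES[gpu]}", f"count:{count}"]
    if install_driver:
        parts.append("install-nvidia-driver")
    return "worker_accelerator=" + ";".join(parts)
```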
19. Beam College 2021
Dataflow Serverless
Dataflow Prime
Serverless
Auto Tuning
Infrastructure
Serverless
Smart Diagnostics
Simplified Billing
Unified Batch and Streaming
“Streaming ML” for real time insights
Open, intelligent and flexible platform
Governance, Security, Lineage and Workflow Management
Ingest and distribute data reliably with serverless and OSS systems
Store and analyze at scale with serverless and OSS systems
20. Summary
1. Architecture
2. DAG optimization
3. Shuffle Service
4. Streaming Engine
5. Monitoring / Logging
6. Flexible Resource Scheduling
7. Out of the box Templates
8. SQL UI
9. Dataflow Prime