GCP Dataflow Architecture
Svetak Sundhar
Solutions Engineer, Google
Agenda
1. Overview of Dataflow Runner architecture
2. Overview of Dataflow Runner core features
3. GCP Horizontal services integrations
4. New Dataflow Runner features
Beam College 2021
Google Cloud Dataflow Service (architecture overview)
Optional GCP services: Virtual Private Cloud, Cloud Network, Cloud IAM, Stackdriver, Key Management Service, Compute Engine
Dataflow service features: Flexible Resources, Monitoring, Log Collection, Graph Optimization, Resource Auto-scaler, Intelligent Watermarking, Shuffle Service, Dynamic Work Rebalancer, Streaming Engine, Dataflow SQL
Pipelines read from sources and write to sinks.
Beam College 2021
Dataflow features
Graph optimization
● Producer-Consumer Fusion
● Sibling Fusion
● Others...
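As a rough sketch (not the optimizer's actual output), the chained element-wise transforms below are the kind of steps Dataflow can fuse into a single stage via producer-consumer fusion, while the grouping step at the end forces a stage boundary. The pipeline and step names are invented for illustration.

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.Create(["a,1", "b,2", "a,3"])
    parsed = lines | "Parse" >> beam.Map(lambda s: s.split(","))        # producer
    keyed = parsed | "Key" >> beam.Map(lambda kv: (kv[0], int(kv[1])))  # consumer, fusable with Parse
    totals = keyed | "SumPerKey" >> beam.CombinePerKey(sum)             # grouping introduces a stage boundary
    totals | "Print" >> beam.Map(print)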
Beam College 2021
Dataflow features
Monitoring
● Dataflow job page
○ Enhanced observability features
Beam College 2021
Dataflow features
Centralized Logging
● Single, searchable log stream via Cloud
Logging
Worker pool: Compute Engine VMs, each with a 10 GB persistent disk.
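As a minimal sketch of how worker logs reach that centralized view: messages written with Python's standard logging module inside a DoFn surface in the job's logs in Cloud Logging. The DoFn and message below are made up for the example.

import logging
import apache_beam as beam

class AuditFn(beam.DoFn):
    def process(self, element):
        # Appears under the Dataflow job's logs in Cloud Logging.
        logging.info("processing element: %s", element)
        yield element

In a pipeline this would be applied with elements | beam.ParDo(AuditFn()).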
Beam College 2021
Cloud Platform regional endpoint: user pipeline code and SDK go to the Job Manager, which deploys and schedules the work.
At a very high level: a user submits a processing pipeline to our managed service, which optimizes it and runs a pool of virtual machines (sometimes called workers) to do the work.
Worker pool: Compute Engine VMs, each with a 10 GB persistent disk.
Beam College 2021
Regional endpoint
● Deploys and controls Dataflow workers and stores job metadata
● Region is us-central1 by default, unless explicitly set using the
region parameter
Zone
● Defines the locations of the Dataflow workers
● Defaults to a zone in the region selected based on available zone
capacity. It can be overridden using the zone parameter.
The zone does not need to be in the same region as the endpoint.
Reasons you may want to do this include:
● Security and Compliance
● Data locality
● Resilience and geographic separation
Caution: If you override the zone and the zone is in a different region
than the regional endpoint, there may be a negative impact on
performance, network traffic, and network latency.
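A minimal sketch of setting these options in the Python SDK, assuming placeholder project, bucket, and location names; flag spellings differ slightly by SDK and version (the Java SDK uses --region and --workerZone, and older SDKs accepted --zone).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--region=europe-west1",               # regional endpoint
    "--worker_zone=europe-west1-b",        # optional: pin workers to a zone
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)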
Dataflow & Compute Engine (diagram): worker pool of Compute Engine VMs, each with a 10 GB persistent disk, alongside Cloud IAM, Compute Engine, and Cloud Network.
Beam College 2021
Identity and Access Management
At least two service accounts are used by the Dataflow service:
● Dataflow Service Account
(service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com)
○ Used for worker creation, monitoring, etc.
● Controller Service Account
○ <project-number>-compute@developer.gserviceaccount.com
○ Used by the workers to access resources needed
by the pipeline, for example files on a Google
Cloud Storage Bucket
○ Can be overridden using --serviceAccount
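A minimal sketch of overriding the controller service account from the Python SDK, where the flag is spelled --service_account_email (--serviceAccount is the Java SDK spelling); the account below is a placeholder and must already hold the IAM roles the pipeline needs.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    # Controller (worker) service account override; placeholder address.
    "--service_account_email=my-worker-sa@my-project.iam.gserviceaccount.com",
])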
Dataflow & Compute Engine (diagram): worker pool of Compute Engine VMs, each with a 10 GB persistent disk, alongside Cloud IAM, Compute Engine, and Cloud Network.
Beam College 2021
Dataflow features
Batch Dynamic Work Redistribution
● Redistribute hot keys for more even
workload distribution.
● Fully automated
Beam College 2021
Dataflow Shuffle - Batch
Diagram: worker Compute VMs connect over a petabit network to the regional Dataflow Shuffle service, which consists of a shuffle proxy, a distributed in-memory file system, and a distributed on-disk file system. Workers use autozone placement across zones 'a', 'b', and 'c' within the region.
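A hedged sketch of opting a batch job into the service-based shuffle from the Python SDK; this was originally exposed as an experiment flag and is now the default in many regions, so the flag may be redundant on current SDKs. Project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    # Service-based shuffle; already the default for batch in many regions.
    "--experiments=shuffle_mode=service",
])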
Beam College 2021
Dataflow Streaming Engine
Benefits:
● Smoother autoscaling
● Better supportability
● Fewer worker resources
Diagram: workers run user code, while window state storage and streaming shuffle move into the Streaming Engine service.
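A minimal sketch of turning on Streaming Engine from the Python SDK (the Java flag is --enableStreamingEngine); project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--streaming",
    "--enable_streaming_engine",
])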
Beam College 2021
Dataflow SQL UI
No coding required
● Write SQL in the BigQuery UI
● Use schemas from Data Catalog
● Submit Dataflow jobs
SELECT payload.userId, payload.productId
FROM pubsub.topic.project.transactions
WHERE payload.location.latitude < 40.72
  AND payload.location.latitude > 40.699
  AND payload.location.longitude < -73.969
  AND payload.location.longitude > -74.747
Beam College 2021
Dataflow features
Flexible Resource Scheduling
● FlexRS reduces batch processing costs by
using advanced scheduling techniques, the
Cloud Dataflow Shuffle service, and a
combination of preemptible virtual machine
(VM) instances and regular VMs.
● Jobs with FlexRS use service-based Cloud
Dataflow Shuffle for joining and grouping.
Diagram: FlexRS worker pool mixes standard VMs and preemptible VMs.
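A minimal sketch of requesting FlexRS from the Python SDK via the FlexRS goal option (the Java spelling is --flexRSGoal); project and region are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--flexrs_goal=COST_OPTIMIZED",   # or SPEED_OPTIMIZED
])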
Beam College 2021
Dataflow Templates
No coding required
● Select one of 20+ Google-provided
templates or use your own
● Popular ETL sources and sinks
● Streaming and Batch modes
● Launch from GCS or Pub/Sub browsers
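A hedged sketch of launching a Google-provided classic template programmatically with the Dataflow REST API via google-api-python-client, assuming application default credentials; the template path, parameter names, and bucket paths are examples to check against the template's documentation.

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",                               # placeholder
    location="us-central1",
    gcsPath="gs://dataflow-templates/latest/Word_Count",  # example template
    body={
        "jobName": "wordcount-from-template",
        "parameters": {
            "inputFile": "gs://my-bucket/input.txt",      # placeholder
            "output": "gs://my-bucket/output/result",     # placeholder
        },
    },
)
response = request.execute()
print(response["job"]["id"])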
GPU Support
Attach Graphics Processing Units (GPUs) to your Dataflow workers to accelerate ML model training, batch and streaming predictions, and general data processing.
What you can do with it:
● Select from a range of GPU types (NVIDIA K80, P100, P4, T4, and V100) for your job
● Accelerate ML workloads (preprocessing, feature engineering, ML inference)
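A hedged sketch of requesting a GPU for Dataflow workers from the Python SDK; the exact flag has changed across releases (newer SDKs use --dataflow_service_options, older ones an --experiments=worker_accelerator=... flag), and a container image with the GPU libraries is typically required as well. The accelerator type and names below are illustrative.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",   # placeholder
    "--region=us-central1",
    "--dataflow_service_options="
    "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
])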
Beam College 2021
Dataflow Serverless
Dataflow Prime
● Serverless Auto Tuning
● Serverless Infrastructure
● Smart Diagnostics
● Simplified Billing
● Unified Batch and Streaming
● "Streaming ML" for real-time insights
● Open, intelligent, and flexible platform
● Governance, Security, Lineage, and Workflow Management
● Ingest and distribute data reliably with serverless and OSS systems
● Store and analyze at scale with serverless and OSS systems
Summary
1. Architecture
2. DAG optimization
3. Shuffle Service
4. Streaming Engine
5. Monitoring / Logging
6. Flexible Resource Scheduling
7. Out of the box Templates
8. SQL UI
9. Dataflow Prime