Deep Dive into GPU Support in Apache Spark 3.x
- 2. Deep Dive into GPU Support in Apache Spark 3.x
Robert Evans and Jason Lowe
NVIDIA
- 5. Accelerator-Aware Scheduling
▪ SPARK-24615
▪ Request resources
▪ Executor
▪ Driver
▪ Task
▪ Resource discovery
▪ API to determine assignment
▪ Supported on YARN, Kubernetes, and Standalone
GPUs are now a schedulable resource
- 6. GPU Scheduling Example
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
- 7. GPU Discovery Script Example
#!/bin/bash
#
# Outputs a JSON formatted string that is expected by the
# spark.{driver/executor}.resource.gpu.discoveryScript config.
#
# Example output: {"name": "gpu", "addresses":["0","1","2","3","4","5","6","7"]}
ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader \
        | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g')
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}
- 8. GPU Assignments API
// Task API
val context = TaskContext.get()
val resources = context.resources()
val assignedGpuAddrs = resources("gpu").addresses
// Pass assignedGpuAddrs into TensorFlow or other AI code
// Driver API
scala> sc.resources("gpu").addresses
Array[String] = Array(0)
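A common pattern is to pin each task's external AI process to the GPUs Spark assigned it. A minimal sketch, assuming a placeholder RDD named rdd and a hypothetical train.py script:

import org.apache.spark.TaskContext
import scala.sys.process._

rdd.mapPartitions { iter =>
  // Look up the GPU addresses Spark assigned to this task
  val gpus = TaskContext.get().resources()("gpu").addresses
  // Launch the (hypothetical) training script restricted to those devices
  Process(Seq("python", "train.py"), None,
          "CUDA_VISIBLE_DEVICES" -> gpus.mkString(",")).!
  iter
}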
- 11. Stage Level Scheduling
▪ SPARK-27495
▪ Specify resource requirements per RDD operation
▪ Spark dynamically allocates containers to meet resource requirements
▪ Spark schedules tasks on appropriate containers
▪ Coming soon in Spark 3.1
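A minimal sketch of the stage-level scheduling API targeted for Spark 3.1 (dataRdd and doGpuWork are placeholders; resource amounts are illustrative):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Ask for GPU-equipped executors only for the stage that needs them
val execReqs = new ExecutorResourceRequests()
  .cores(2)
  .resource("gpu", 2, "./getGpusResources.sh")
val taskReqs = new TaskResourceRequests().resource("gpu", 1)
val gpuProfile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// Only this RDD operation is scheduled with the GPU profile
val result = dataRdd.withResources(gpuProfile).mapPartitions(doGpuWork)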
- 12. SQL Columnar Processing
▪ SPARK-27396
▪ Catalyst API for columnar processing
▪ Plugins can modify query plan with columnar operations
▪ Plan nodes can exchange RDD[ColumnarBatch] instead of RDD[Row]
▪ Enables efficient processing by vectorized accelerators
▪ SIMD
▪ FPGA
▪ GPU
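A minimal sketch of how a plugin hooks into this API through SparkSessionExtensions (class and rule names here are hypothetical; the RAPIDS Accelerator works along these lines):

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Hypothetical rule that would swap supported plan nodes for columnar ones
case class MyGpuOverrides() extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan // rewrite nodes here
}

// Registered with --conf spark.sql.extensions=MyColumnarExtensions
class MyColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectColumnar(_ => new ColumnarRule {
      override def preColumnarTransitions: Rule[SparkPlan] = MyGpuOverrides()
    })
}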
- 13. Spark 3
[Diagram: In Spark 2.x, data preparation (Spark) runs on a CPU-powered cluster and model training (XGBoost | TensorFlow | PyTorch) runs on a separate GPU-powered cluster, with data sources connected through shared storage. In Spark 3 with Project Hydrogen, a single Spark-orchestrated, GPU-powered cluster handles both data preparation and model training.]
- 14. Spark 3 with Project Hydrogen
▪ Single pipeline
▪ Ingest
▪ Data preparation
▪ Model Training
▪ Infrastructure is consolidated and simplified
▪ ETL can be GPU-accelerated
Enabling end-to-end acceleration
- 17. Yes
TPCx-BB Like Benchmark Results (10TB Dataset, Two Node DGX-2 Cluster)*
Query Time: GPU vs CPU (minutes)

          Query #5   Query #16   Query #21   Query #22
CPU         25.95       6.16        7.13        3.80
GPU          1.31       1.16        0.56        0.14
Environment: Two DGX-2 (96 CPU Cores, 1.5TB Host memory, 16 V100 GPUs, 512 GB GPU Memory)
* Not official or complete TPCx-BB runs (ETL power only).
- 18. Deep Learning Recommendation Model (DLRM)
▪ Anonymized 7-day clickstream (1 TB)
▪ Convert high-cardinality strings to contiguous integer IDs (see the sketch below)
▪ The DLRM GitHub repo has turnkey scripts
Example use case: Criteo Dataset
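A minimal sketch of the string-to-contiguous-ID step in plain Spark SQL (clicks and the column names are illustrative, not the DLRM scripts' actual code):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Map each distinct high-cardinality string to a dense 0-based integer ID
val ids = clicks.select("cat_feature").distinct()
  .withColumn("cat_id", row_number().over(Window.orderBy("cat_feature")) - 1)

// Replace the string column with its integer ID
val encoded = clicks.join(ids, Seq("cat_feature")).drop("cat_feature")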
- 19. DLRM on Criteo Dataset (Past)
ETL & Training Run Time for CPU & GPU, Criteo Dataset (1TB)

ETL (1 core CPU)*          144.0 hours
Spark ETL (96 core CPU)     12.1 hours
Training (96 core CPU)      45.0 hours
Training (1 - V100)          0.7 hours
- 21. DLRM End-to-End on Criteo Dataset (Present)
Spark ETL + Training for Criteo Dataset (1TB), Time (Hours)

Configuration                                             ETL     Training
Original CPU (1 core for ETL, 96-core CPU for training)   144.0   45.0
Spark CPU (96 cores for ETL & training)                    12.1   45.0
Spark CPU (96-core ETL) & Spark GPU (1-V100 training)      12.1    0.7
Spark GPU (8-V100 ETL & 1-V100 training)                    0.5    0.7
- 23. RAPIDS Accelerator for Apache Spark (Plugin)
[Architecture diagram: Distributed scale-out Spark applications run on Apache Spark Core (Spark SQL, DataFrame, Spark Shuffle). The RAPIDS Accelerator for Apache Spark plugs in underneath, mapping from Java/Scala to C++ via JNI bindings over the RAPIDS C++ and UCX libraries on CUDA.]

For each operation in the plan:
    if gpu_enabled(op, data_type):
        call out to RAPIDS
    else:
        execute the standard Spark op

● Custom implementation of Spark shuffle
● Optimized to use RDMA and GPU-to-GPU direct communication
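Enabling the plugin is a matter of putting the jars on the classpath and setting a few configs; a minimal sketch (the jar file names are illustrative and version-specific):

./bin/spark-shell \
  --jars rapids-4-spark.jar,cudf.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true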
- 26. What We Support
and growing…
Operators and expressions:
!, %, &, *, +, -, /, <, <=, <=>, =, ==, >, >=, ^, |, ~, abs, acos, and, asin,
atan, avg, bigint, boolean, cast, cbrt, ceil, ceiling, coalesce, concat, cos,
cosh, cot, count, cube, current_date, current_timestamp, date, datediff, day,
dayofmonth, degrees, double, e, exp, explode*, expm1, first, first_value,
float, floor, from_unixtime, hour, if, ifnull, in, initcap,
input_file_block_length, input_file_block_start, input_file_name, int, isnan,
isnotnull, isnull, last, last_value, length, lcase, like, ln, locate, log,
log10, log1p, log2, lower, ltrim, max, mean, min, minute, mod,
monotonically_increasing_id, month, nanvl, negative, not, now, nullif, nvl,
nvl2, or, pi, posexplode*, position, pow, power, radians, rand*,
regexp_replace*, replace, rint, rollup, row_number, rtrim, second, shiftleft,
shiftright, shiftrightunsigned, sign, signum, sin, sinh, smallint,
spark_partition_id, sqrt, string, substr, substring, sum, tan, tanh,
timestamp, tinyint, trim, ucase, upper, when, window, year

I/O and plan operations:
CSV Reading*, Orc Reading, Orc Writing, Parquet Reading, Parquet Writing,
ANSI casts, TimeSub for time ranges, startswith, endswith, contains, limit,
order by, group by, filter, union, repartition, equi-joins, select
- 27. Is This a Silver Bullet?
No
▪ Small amounts of data
▪ The GPU wants a few hundred MB per partition
▪ Cache coherent processing
▪ Data Movement
▪ Slow I/O (networking, disks, etc.)
▪ Going back and forth to the CPU (UDFs)
▪ Shuffle
▪ Limited GPU Memory
Throughput, MB/s (log scale):

Spinning Disk          160
SSD                    550
10GigE               1,250
NVMe                 3,500
PCIe gen3           12,288
PCIe gen4           24,576
DDR4-3200 DIMM      25,600
Typical PC RAM      46,080
NVLink             307,200
CPU Cache        1,048,576
- 28. But It Can Be Amazing
▪ High cardinality data
▪ Joins
▪ Aggregates
▪ Sort
▪ Window operations
▪ Especially on large windows
▪ Aggregate with lots of distinct operations
▪ Complicated processing
▪ Transcoding
▪ Encoding and compressing Parquet and ORC is expensive
▪ Parsing CSV is expensive
What the SQL plugin excels at
- 30. Spark SQL & DataFrame Compilation Flow (CPU Physical Plan)

QUERY (DataFrame API):
bar.groupBy(
    col("product_id"),
    col("ds"))
  .agg(
    (max(col("price")) -
     min(col("price"))).alias("range"))

QUERY (SQL):
SELECT product_id, ds,
       max(price) - min(price) AS range
FROM bar GROUP BY product_id, ds

Compilation: DataFrame → Logical Plan → Physical Plan → RDD[InternalRow]
- 31. Spark SQL & DataFrame Compilation Flow (GPU Physical Plan)

Same query as the previous slide; with the plugin, Catalyst can instead
produce a GPU physical plan:

DataFrame → Logical Plan → Physical Plan → RDD[InternalRow]      (CPU)
                         → Physical Plan → RDD[ColumnarBatch]    (GPU)
- 32. Spark SQL & DataFrame Compilation Flow

CPU PHYSICAL PLAN:
Read Parquet File → First Stage Aggregate → Shuffle Exchange (combine
shuffle data) → Second Stage Aggregate → Write Parquet File

GPU PHYSICAL PLAN:
Same pipeline, with Convert to Row Format transitions inserted around the
Shuffle Exchange.
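To see which nodes actually landed on the GPU, inspect the physical plan. A hedged example (GPU node names such as GpuHashAggregate follow the plugin's naming; exact output varies by version):

scala> bar.groupBy(col("product_id"), col("ds"))
          .agg((max(col("price")) - min(col("price"))).alias("range"))
          .explain()
// A GPU-enabled plan shows nodes such as GpuHashAggregate and
// GpuColumnarExchange in place of HashAggregate and Exchange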
- 33. ETL Technology Stack
[Stack diagram: Dask cuDF and cuDF/Pandas (Python) bind through Cython;
Spark DataFrame, Scala, and PySpark bind through Java JNI bindings. Both
paths sit on the cuDF C++ library, CUDA libraries, and CUDA.]
- 35. Demo Cluster Setup
CPU
Driver:
▪ 1 - r4.xlarge
▪ 30.5GB Memory
▪ 4 Cores
▪ 1 DBU
Workers:
▪ 12 - r4.2xlarge
▪ 61GB Memory
▪ 8 cores
▪ 2 DBU
Databricks (AWS)
GPU
Driver:
▪ 1 - p2.xlarge
▪ 61GB Memory
▪ 4 cores
▪ 1 - K80 (Not needed)
▪ 1.22 DBU
Workers:
▪ 12 - p3.2xlarge
▪ 61GB Memory
▪ 1 - V100
▪ 8 cores
▪ 4.15 DBU
- 36. Databricks Demo Results
“The more you buy, the more you save” – Jensen H Huang, CEO NVIDIA
ETL Time (seconds): 4x Speed-up
CPU (12 - r4.2xlarge)    1,736
GPU (12 - p3.2xlarge)      423

ETL Cost (AWS + DBU): 18% Cost Savings*
CPU (12 - r4.2xlarge)    $8.03
GPU (12 - p3.2xlarge)    $6.81
* Costs based on Databricks Standard edition
- 37. T4 Cluster Setup
EC2
V100 is optimized for ML/DL training
T4 fits better with SQL processing
Driver (Ran on one of the worker nodes)
Workers:
▪ 12 - g4dn.2xlarge
▪ 32GB Memory
▪ 1 - T4
▪ 8 cores
- 38. Coming Soon….T4 GPUs on Databricks
Same speed-up as V100 but more savings
ETL Time (seconds): 3.8x Speed-up
CPU (12 - r4.2xlarge)      1,736
GPU (12 - g4dn.2xlarge)      457

ETL Cost (AWS + DBU): 50% Cost Savings*
CPU (12 - r4.2xlarge)      $8.03
GPU (12 - g4dn.2xlarge)    $3.76
* Costs based on AWS T4 GPU instance market price & V100 GPU price on Databricks Standard edition
- 39. RAPIDS Accelerator on AWS
▪ ~3.5x Speed-up
▪ ~40% Cost Savings
Based on TPCx-BB like Queries #5 & #22 with 1TB scale factor input
ETL Time (seconds):
        Q5      Q22
CPU    221     82.68
GPU     61     26.83

CPU: 12 - m5dn.2xlarge (8-core 32GB)
GPU: 12 - g4dn.2xlarge (8-core 32GB 1x T4 GPU)
- 45. UCX Library
Unified Communication X
▪ Abstracts communication transports
▪ Selects best route(s)
▪ TCP
▪ RDMA
▪ Shared Memory
▪ CUDA IPC
▪ Zero-copy GPU transfers over RDMA
▪ RDMA requires network support
▪ InfiniBand
▪ RoCE
▪ http://openucx.org
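A hedged sketch of turning on the plugin's UCX-based shuffle (the shuffle manager class name is versioned per Spark release and the transport list is an assumption; check the spark-rapids docs for the exact keys):

--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager \
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp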
- 49. What’s Next
▪ Open Source (DONE)
▪ https://github.com/NVIDIA/spark-rapids
▪ https://nvidia.github.io/spark-rapids/
▪ Nested types
▪ Arrays
▪ Structs
▪ Maps
▪ Decimal type
▪ More operators
▪ GPU Direct Storage
▪ Time zone support for timestamps
▪ Only UTC supported now
▪ Higher order functions
▪ UDFs
Coming Soon / Further Out
- 50. Where to Get More Information
▪ https://NVIDIA.com/Spark
▪ Please use the “Contact Us” link to get in touch with NVIDIA’s Spark team
▪ https://github.com/NVIDIA/spark-rapids
▪ https://nvidia.github.io/spark-rapids/
▪ Listen to Adobe’s Email Marketing Intelligent Services Use-Case
▪ Free e-book at NVIDIA.com/Spark-Book