Deep Dive into GPU Support in Apache Spark 3.x
Robert Evans and Jason Lowe
NVIDIA
Agenda
GPU Features in Apache Spark 3
Accelerated SQL/DataFrame
Accelerated Shuffle
What’s Next
GPU Features in Apache Spark 3
Accelerator-Aware Scheduling
▪ SPARK-24615
▪ Request resources
▪ Executor
▪ Driver
▪ Task
▪ Resource discovery
▪ API to determine assignment
▪ Supported on YARN, Kubernetes, and Standalone
GPUs are now a schedulable resource
GPU Scheduling Example
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
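With two executor cores, two GPUs per executor, and one GPU per task, each executor runs up to two concurrent tasks, each assigned its own GPU. The discovery script is shipped to executors via --files, which is why the executor config references it by relative path.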
GPU Discovery Script Example
#!/bin/bash
#
# Outputs a JSON formatted string that is expected by the
# spark.{driver/executor}.resource.gpu.discoveryScript config.
#
# Example output: {"name": "gpu", "addresses":["0","1","2","3","4","5","6","7"]}
ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader | \
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g')
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}
GPU Assignments API
// Task API
val context = TaskContext.get()
val resources = context.resources()
val assignedGpuAddrs = resources("gpu").addresses
// Pass assignedGpuAddrs into TensorFlow or other AI code
// Driver API
scala> sc.resources("gpu").addresses
Array[String] = Array(0)
GPU Scheduling UI
Stage Level Scheduling
(Diagram: a Spark ML application whose ETL stage runs on CPU nodes and whose ML stage runs on GPU nodes)
Stage Level Scheduling
▪ SPARK-27495
▪ Specify resource requirements per RDD operation
▪ Spark dynamically allocates containers to meet resource requirements
▪ Spark schedules tasks on appropriate containers
▪ Coming soon in Spark 3.1 (see the API sketch below)
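As a sketch of what this looks like in code, here is the resource-profile API as it landed in Spark 3.1, applied to a hypothetical pipeline; the paths and the etlOutput RDD are illustrative, and dynamic allocation must be enabled so Spark can obtain matching executors:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// In spark-shell (sc available); the input path is illustrative.
val etlOutput = sc.textFile("hdfs:///data/etl-output")

// Executor-side requirements for the ML stage: 2 cores and 2 GPUs per
// executor, discovered with the same kind of script used at launch.
val execReqs = new ExecutorResourceRequests()
  .cores(2)
  .resource("gpu", 2, "/opt/spark/getGpusResources.sh")

// Task-side requirement: one GPU per task.
val taskReqs = new TaskResourceRequests().resource("gpu", 1)

val gpuProfile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build

// Tasks computed on mlInput are scheduled only on executors matching gpuProfile.
val mlInput = etlOutput.withResources(gpuProfile)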
SQL Columnar Processing
▪ SPARK-27396
▪ Catalyst API for columnar processing
▪ Plugins can modify the query plan with columnar operations (sketched after this list)
▪ Plan nodes can exchange RDD[ColumnarBatch] instead of RDD[Row]
▪ Enables efficient processing by vectorized accelerators
▪ SIMD
▪ FPGA
▪ GPU
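A minimal sketch of this extension point, assuming an illustrative MyExtensions class name; a real plugin such as the RAPIDS Accelerator replaces supported operators with columnar versions where this placeholder rule returns the plan unchanged:

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Placeholder rule: a real plugin pattern-matches supported operators here
// and swaps in columnar (e.g. GPU) implementations.
object ReplaceWithColumnarOps extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

case class MyColumnarRule() extends ColumnarRule {
  // Runs before Spark inserts the row <-> columnar transition nodes.
  override def preColumnarTransitions: Rule[SparkPlan] = ReplaceWithColumnarOps
}

// Enabled with: --conf spark.sql.extensions=com.example.MyExtensions
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectColumnar(_ => MyColumnarRule())
}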
Spark 2.x vs. Spark 3 with Project Hydrogen
(Diagram: in Spark 2.x, data sources feed Spark data preparation on a CPU-powered cluster, which hands off through shared storage to model training with XGBoost, TensorFlow, or PyTorch on a separate GPU-powered cluster. In Spark 3, data sources feed a single Spark-orchestrated, GPU-powered cluster that runs both data preparation and model training.)
Spark 3 with Project Hydrogen
▪ Single pipeline
▪ Ingest
▪ Data preparation
▪ Model Training
▪ Infrastructure is consolidated and simplified
▪ ETL can be GPU-accelerated
Enabling end-to-end acceleration
(Diagram: data sources feed a single Spark-orchestrated, GPU-powered cluster that runs both data preparation in Spark and model training with XGBoost, TensorFlow, or PyTorch.)
Accelerated SQL/DataFrame
Accelerated ETL?
Can a GPU make an elephant fast?
Yes
TPCx-BB Like Benchmark Results (10TB Dataset, Two Node DGX-2 Cluster)*
Query Time: GPU vs CPU (minutes)
       Query #5   Query #16   Query #21   Query #22
CPU    25.95      6.16        7.13        3.80
GPU    1.31       1.16        0.56        0.14
Environment: Two DGX-2 (96 CPU Cores, 1.5TB Host memory, 16 V100 GPUs, 512 GB GPU Memory)
* Not official or complete TPCx-BB runs (ETL power only).
Deep Learning Recommendation Machines
Example use case: Criteo Dataset
▪ Anonymized 7-day clickstream (1 TB)
▪ Convert high-cardinality strings to contiguous integer IDs
▪ DLRM GitHub repo has turnkey scripts
DLRM on Criteo Dataset (Past)
ETL & Training Run Time for CPU & GPU, Criteo Dataset (1TB), in hours:
ETL (1 core CPU)*: 144.0
Spark ETL (96 core CPU): 12.1
Training (96 core CPU): 45.0
Training (1 - V100): 0.7
DLRM ETL on Criteo Dataset (Present)
Spark ETL for Criteo Dataset (1TB), in hours:
Spark ETL (96 core CPU): 12.1
Spark ETL (1 - V100): 2.3
Spark ETL (8 - V100): 0.5
DLRM End-to-End on Criteo Dataset (Present)
Spark ETL + Training for Criteo Dataset (1TB), in hours:
                                                          ETL     Training
Original CPU (1 core ETL, 96 core CPU training)           144.0   45.0
Spark CPU (96 cores for ETL & training)                   12.1    45.0
Spark CPU (96 core ETL) & Spark GPU (1 V100 training)     12.1    0.7
Spark GPU (8 V100 ETL & 1 V100 training)                  0.5     0.7
Jensen Huang
GPU Technology Conference 2020
"The more you buy, the more you
save."
RAPIDS Accelerator for Apache Spark (Plugin)
(Diagram: distributed scale-out Spark applications run on Apache Spark core; the Spark SQL, DataFrame, and Spark Shuffle layers sit on top of the RAPIDS Accelerator for Apache Spark. The SQL/DataFrame side maps from Java/Scala to C++ through JNI bindings into the RAPIDS C++ libraries. The shuffle side is a custom implementation of Spark shuffle, optimized to use RDMA and GPU-to-GPU direct communication, with JNI bindings into the UCX libraries. Everything runs on CUDA.)
For each operator and data type the plugin decides:
if gpu_enabled(op, data_type)
    call out to RAPIDS
else
    execute standard Spark op
RAPIDS Accelerator for Apache Spark 3.0 Plugin
No Code Changes
Same SQL and DataFrame code
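As a sketch (config keys from the plugin's documentation; the RAPIDS and cuDF jars are assumed to already be on the classpath), enabling GPU execution is a matter of configuration rather than code:

import org.apache.spark.sql.SparkSession

// The only change is configuration: the plugin class is loaded at startup
// and takes over planning for the operators it supports.
val spark = SparkSession.builder()
  .appName("gpu-etl")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  .getOrCreate()

// The application's SQL/DataFrame code runs unmodified from here.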
What We Support
and growing…
!
%
&
*
+
-
/
<
<=
<=>
=
==
>
>=
^
abs
acos
and
asin
atan
avg
bigint
boolean
cast
cbrt
ceil
ceiling
coalesce
concat
cos
cosh
cot
count
cube
current_date
current_timestamp
date
datediff
day
dayofmonth
degrees
double
e
exp
explode*
expm1
first
first_value
float
floor
from_unixtime
hour
if
ifnull
in
initcap
input_file_block_length
input_file_block_start
input_file_name
int
isnan
isnotnull
isnull
last
last_value
length
lcase
like
ln
locate
log
log10
log1p
log2
lower
ltrim
max
mean
min
minute
mod
monotonically_increasing_id
month
nanvl
negative
not
now
nullif
nvl
nvl2
or
pi
posexplode*
position
pow
power
radians
rand*
regexp_replace*
replace
rint
rollup
row_number
rtrim
second
shiftleft
shiftright
shiftrightunsigned
sign
signum
sin
sinh
smallint
spark_partition_id
sqrt
string
substr
substring
sum
tan
tanh
timestamp
tinyint
trim
ucase
upper
when
window
year
|
~
CSV Reading*
Orc Reading
Orc Writing
Parquet Reading
Parquet Writing
ANSI casts
TimeSub for time ranges
startswith
endswith
contains
limit
order by
group by
filter
union
repartition
equi-joins
select
Is This a Silver Bullet?
No
▪ Small amounts of data
▪ Few hundred MB per partition for GPU
▪ Cache coherent processing
▪ Data Movement
▪ Slow I/O (networking, disks, etc.)
▪ Going back and forth to the CPU (UDFs)
▪ Shuffle
▪ Limited GPU Memory
Bandwidth in MB/s (log scale):
Spinning Disk: 160
SSD: 550
10GigE: 1250
NVMe: 3500
PCIe gen3: 12288
PCIe gen4: 24576
DDR4-3200 DIMM: 25600
Typical PC RAM: 46080
NVLink: 307200
CPU Cache: 1048576
But It Can Be Amazing
▪ High cardinality data
▪ Joins
▪ Aggregates
▪ Sort
▪ Window operations
▪ Especially on large windows
▪ Aggregate with lots of distinct operations
▪ Complicated processing
▪ Transcoding
▪ Encoding and compressing Parquet and ORC is expensive
▪ Parsing CSV is expensive
What the SQL plugin excels at
How Does It Work
Spark SQL & DataFrame Compilation Flow (CPU physical plan)
DataFrame → Logical Plan → Physical Plan → RDD[InternalRow]
Query as DataFrame code:
bar.groupBy(
    col("product_id"),
    col("ds"))
  .agg(
    (max(col("price")) -
     min(col("price"))).alias("range"))
Query as SQL:
SELECT product_id, ds,
  max(price) - min(price) AS range
FROM bar GROUP BY product_id, ds
Spark SQL & DataFrame Compilation Flow (GPU physical plan)
The same query compiles through the same DataFrame and logical plan, but a plugin can substitute a GPU physical plan:
DataFrame → Logical Plan → Physical Plan → RDD[ColumnarBatch] (instead of RDD[InternalRow])
Spark SQL & DataFrame Compilation Flow: CPU vs. GPU physical plans
(Diagram: the CPU physical plan runs Read Parquet File → First Stage Aggregate → Shuffle Exchange → Second Stage Aggregate → Write Parquet File. The GPU physical plan has the same shape, adding a step that combines shuffle data after the exchange and explicit Convert to Row Format nodes where data is handed back to row-based CPU operators.)
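To see which plan a query actually got, one option is explain(); below is a sketch using the aggregation query from the previous slides, with a hypothetical source path for bar. With the plugin active, GPU-placed operators carry a "Gpu" prefix and the row/columnar conversions appear as explicit plan nodes (exact node names vary by release):

import org.apache.spark.sql.functions.{col, max, min}

// Hypothetical source for the deck's `bar` table.
val bar = spark.read.parquet("/data/bar")

// The same query as on the previous slides; explain() prints the
// physical plan the planner (and plugin) actually produced.
val result = bar
  .groupBy(col("product_id"), col("ds"))
  .agg((max(col("price")) - min(col("price"))).alias("range"))
result.explain()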
ETL Technology Stack
(Diagram: two front ends share one native stack. Python: Dask cuDF and cuDF/Pandas over Python and Cython. JVM: Spark DataFrame, Scala, and PySpark over Java and JNI bindings. Both drive cuDF C++, the CUDA libraries, and CUDA.)
Demo
Demo Cluster Setup
CPU
Driver:
▪ 1 - r4.xlarge
▪ 30.5GB Memory
▪ 4 Cores
▪ 1 DBU
Workers:
▪ 12 - r4.2xlarge
▪ 61GB Memory
▪ 8 cores
▪ 2 DBU
Databricks (AWS)
GPU
Driver:
▪ 1 - p2.xlarge
▪ 61GB Memory
▪ 4 cores
▪ 1 - K80 (Not needed)
▪ 1.22 DBU
Workers:
▪ 12 – p3.2xlarge
▪ 61GB Memory
▪ 1 - V100
▪ 8 cores
▪ 4.15 DBU
Databricks Demo Results
“The more you buy, the more you save” – Jensen H Huang, CEO NVIDIA
ETL Time (seconds): CPU (12 - r4.2xlarge) 1,736 vs. GPU (12 - p3.2xlarge) 423, a 4x speed-up
ETL Cost (AWS+DBU): CPU $8.03 vs. GPU $6.81, an 18% cost savings*
* Costs based on Databricks Standard edition
T4 Cluster Setup
EC2
V100 is optimized for ML/DL training
T4 fits better with SQL processing
Driver (Ran on one of the worker nodes)
Workers:
▪ 12 – g4dn.2xlarge
▪ 32GB Memory
▪ 1 - T4
▪ 8 cores
Coming Soon… T4 GPUs on Databricks
Same speed-up as V100 but more savings
ETL Time (seconds): CPU (12 - r4.2xlarge) 1,736 vs. GPU (12 - g4dn.2xlarge) 457, a 3.8x speed-up
ETL Cost (AWS+DBU): CPU $8.03 vs. GPU $3.76, a 50% cost savings*
* Costs based on AWS T4 GPU instance market price & V100 GPU price on Databricks Standard edition
RAPIDS Accelerator on AWS
▪ ~3.5x Speed-up
▪ ~40% Cost Savings
Based on TPCx-BB like Queries #5 & #22 with 1TB scale factor input
ETL Time (seconds):
       Q5    Q22
CPU    221   82.68
GPU    61    26.83
CPU: 12 - m5dn.2xlarge (8-core 32GB)
GPU: 12 - g4dn.2xlarge (8-core 32GB 1xT4 GPU)
Accelerated Shuffle
Spark Shuffle
Data exchange between stages
(Diagram: Tasks 0-2 in Stage 1 each write shuffle output that Tasks 0-1 in Stage 2 read in an all-to-all exchange.)
Spark Shuffle
CPU-centric data movement
(Diagram: shuffle data from GPU 0 and GPU 1 crosses the PCI-e bus to the CPU, which writes it to local storage and sends it over the network.)
Accelerated Shuffle
GPU-centric data movement
(Diagram: GPU 0 and GPU 1 exchange data directly over NVLink, reach the network via RDMA, and reach local storage via GPU Direct Storage, bypassing the CPU.)
Accelerated Shuffle
Shuffling Spilled Data
(Diagram: batches spilled from GPU memory into host memory are sent to the network via RDMA without staging back through the GPU.)
UCX Library
Unified Communication X
▪ Abstracts communication transports
▪ Selects best route(s)
▪ TCP
▪ RDMA
▪ Shared Memory
▪ CUDA IPC
▪ Zero-copy GPU transfers over RDMA
▪ RDMA requires network support
▪ InfiniBand
▪ RoCE
▪ http://openucx.org
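A hedged sketch of wiring this in: the shuffle-manager class name is version/shim specific (the spark300 package here is an assumption) and the transport toggle name is also an assumption, so check the spark-rapids documentation for your release. UCX itself must be installed on every node.

import org.apache.spark.SparkConf

// Swap in the plugin's shuffle manager so shuffle data can move over
// UCX-selected transports (NVLink, RDMA, shared memory) instead of the
// default CPU-centric path. Names below are placeholders from the docs.
val conf = new SparkConf()
  .set("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark300.RapidsShuffleManager")
  .set("spark.rapids.shuffle.transport.enabled", "true") // assumed key name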
Accelerated Shuffle Results
Inventory pricing query
Query duration in seconds: CPU 228, GPU 45, GPU+UCX 8.4
Accelerated Shuffle Results
ETL for logistic regression model
Query duration in seconds: CPU 1556, GPU 172, GPU+UCX 79
What’s Next?
What’s Next
▪ Open Source (DONE)
▪ https://github.com/NVIDIA/spark-rapids
▪ https://nvidia.github.io/spark-rapids/
Coming soon and further out:
▪ Nested types (Arrays, Structs, Maps)
▪ Decimal type
▪ More operators
▪ GPU Direct Storage
▪ Time zone support for timestamps (only UTC is supported now)
▪ Higher order functions
▪ UDFs
Where to Get More Information
▪ https://NVIDIA.com/Spark
▪ Please use the “Contact Us” link to get in touch with NVIDIA’s Spark team
▪ https://github.com/NVIDIA/spark-rapids
▪ https://nvidia.github.io/spark-rapids/
▪ Listen to Adobe’s Email Marketing Intelligent Services Use-Case
▪ Free e-book at NVIDIA.com/Spark-Book
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.