RAPIDS: GPU-ACCELERATED ETL AND FEATURE ENGINEERING
Keith Kraus, 18-10-2018
2
REALITIES OF DATA
3
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
4
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark In-Memory Processing (25-100x Improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
5
NEED MORE SPEED
CPU Performance Has Plateaued

[Chart: transistor counts (thousands) and single-threaded performance, 1980-2020, log scale 10^2 to 10^7; single-threaded performance growth has slowed from 1.5X per year to 1.1X per year]
6
WE NEED MORE COMPUTE!
Basic workloads are bottlenecked by the CPU
Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR
• In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
• This is after the data is parsed and cached into memory, which is another common bottleneck
• The CPU bottleneck is even worse in more complex workloads!

SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
7
HOW CAN WE DO BETTER?
• Focus on the full Data Science workflow
• Data Loading
• Data Transformation
• Data Analytics
• Python
• Provide as close to a drop-in replacement for existing tools as possible
• Performance - Leverage GPUs
8
NEW BEGINNINGS
GPU Performance Grows

[Chart: single-threaded CPU performance vs GPU computing performance, 1980-2020, log scale 10^2 to 10^7; single-threaded growth has slowed from 1.5X to 1.1X per year, while GPU computing performance keeps growing at 1.5X per year, for a projected 1000X by 2025]
9
GPU ADOPTION BARRIERS
Yes, GPUs are fast, but …

• Too much data movement
• Too many makeshift data formats
• Writing CUDA C/C++ is hard
• No Python API for data manipulation
10
DATA MOVEMENT AND TRANSFORMATION
The bane of productivity and performance

[Diagram: APP A reads and loads data on the CPU; every hand-off between APP A and APP B then requires a copy & convert step between CPU memory and each app's own GPU data]
11
DATA MOVEMENT AND TRANSFORMATION
What if we could keep data on the GPU?

[Diagram: the same APP A / APP B pipeline, but with the data staying resident on the GPU, eliminating the repeated copy & convert steps between CPU and GPU]
12
DATA FORMATS
Plain Text vs Binary, Compressed vs Uncompressed

Avro, XML, JSON, GML, ProtoBuf, HDFS, Pickle, CSV, Parquet, Pandas, NumPy, CSR, COO, CSC

* Not a complete list
13
LEARNING FROM APACHE ARROW
From Apache Arrow Home Page - https://arrow.apache.org/
14
RAPIDS OPEN SOURCE SOFTWARE

Data Preparation → Model Training → Visualization, sharing GPU Memory:
• cuDF - Analytics
• cuML - Machine Learning
• cuGraph - Graph Analytics
• PyTorch & Chainer - Deep Learning
• Kepler.GL - Visualization
15
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark In-Memory Processing (25-100x Improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train

GPU/Spark In-Memory Processing (5-10x Improvement, more code, language rigid, substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train

RAPIDS (50-100x Improvement, same code, language flexible, primarily on GPU):
Arrow Read → Query → ETL → ML Train
16
RAPIDS
Rapid Accelerated Platform for Integrating Data Science

Stack: APPLICATIONS → SYSTEMS → ALGORITHMS → CUDA → ARCHITECTURE

• Learn what the data science community needs
• Use best practices and standards
• Build scalable systems and algorithms
• Test applications and workflows
• Iterate
17
RAPIDS
How can I download and use RAPIDS?
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://hub.docker.com/r/rapidsai/rapidsai/
• https://github.com/rapidsai
• WIP: https://anaconda.org/rapidsai/
• WIP:
  • https://pypi.org/project/cudf
  • https://pypi.org/project/cuml
18
AI LIBRARIES
Accelerating more of the AI ecosystem

Machine Learning is fundamental to prediction, classification, clustering, anomaly detection, and recommendations.
Graph Analytics is fundamental to network analysis.
Both can be accelerated with NVIDIA GPUs: 8x V100 is 20-90x faster than a dual-socket CPU.

Machine Learning: Decision Trees, Random Forests, Linear Regression, Logistic Regression, K-Means, K-Nearest Neighbors, DBSCAN, Kalman Filtering, Principal Components, Singular Value Decomposition, Bayesian Inference

Graph Analytics: PageRank, BFS, Jaccard Similarity, Single Source Shortest Path, Triangle Counting, Louvain Modularity

Time Series: ARIMA, Holt-Winters

cuML & cuGraph: XGBoost, Criteo Dataset, 90x - 3 hours to 2 minutes on 1 DGX-1
19
CUDF + XGBOOST
DGX-2 vs Scale-Out CPU Cluster

• Full end-to-end pipeline
• Leveraging Dask + PyGDF
• Store each GPU's results in system memory, then read them back in
• Arrow to DMatrix (CSR) for XGBoost
20
CUDF + XGBOOST
Scale-Out GPU Cluster vs DGX-2

[Chart: runtime in seconds (0-350) for 5x DGX-1 vs DGX-2, broken into ETL+CSV (s), ML Prep (s), and ML (s)]

• Full end-to-end pipeline
• Leveraging Dask for multi-node + PyGDF
• Store each GPU's results in system memory, then read them back in
• Arrow to DMatrix (CSR) for XGBoost
21
CUGRAPH
GPU-Accelerated Graph Analytics Library
22
CUDF

Data Preparation → Model Training → Visualization, sharing GPU Memory:
• cuDF - Analytics
• cuML - Machine Learning
• cuGraph - Graph Analytics
• PyTorch & Chainer - Deep Learning
• Kepler.GL - Visualization
23
GPU-ACCELERATED ETL
Is GPU-acceleration really needed?
24
GPU-ACCELERATED ETL
The average data scientist spends 90+% of their time in ETL as opposed
to training models
25
CUDF
GPU DataFrame library
• Apache Arrow data format
• Pandas-like API
• Unary and Binary Operations
• Joins / Merges
• GroupBys
• Filters
• User-Defined Functions (UDFs)
• Accelerated file readers
• Etc.
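As a rough illustration of that API, here is a minimal sketch, assuming the cuDF Python package is importable as cudf (method names varied across early releases such as PyGDF):

```python
import cudf

left = cudf.DataFrame({'key': [0, 1, 2, 3], 'a': [0.1, 0.2, 0.3, 0.4]})
right = cudf.DataFrame({'key': [0, 1, 2, 2], 'b': [10, 20, 30, 40]})

joined = left.merge(right, on=['key'], how='inner')  # join / merge on the GPU
grouped = joined.groupby('key').agg({'b': 'sum'})    # groupby aggregation
filtered = joined[joined['b'] > 15]                  # boolean-mask filter

print(grouped.to_pandas())  # copy the result back to a host Pandas DataFrame
```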
26
CUDF
Today

LibGDF
• Low level library containing function implementations and C/C++ API
• Importing/exporting a GDF using the CUDA IPC mechanism
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, and reduction operations on GPU DataFrames

PyGDF
• A Python library for manipulating GPU DataFrames
• Python interface to LibGDF library with additional functionality
• Creating GDFs from Numpy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions (UDFs) using Numba
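The last bullet is worth a sketch: PyGDF/cuDF exposed an apply_rows pattern that JIT-compiles a plain Python function into a GPU kernel with Numba. Exact signatures differed between early releases, so treat this as illustrative:

```python
import numpy as np
import cudf

df = cudf.DataFrame({'x': np.arange(5, dtype=np.float64),
                     'y': np.arange(5, dtype=np.float64)})

def kernel(x, y, out):
    # Numba compiles this loop into a CUDA kernel over the columns.
    for i, (xi, yi) in enumerate(zip(x, y)):
        out[i] = xi * 2.0 + yi

result = df.apply_rows(kernel,
                       incols=['x', 'y'],
                       outcols={'out': np.float64},
                       kwargs={})
print(result.to_pandas())
```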
27
CUDF
Refactor in Progress

• Single repository containing both the low level implementation and high level wrappers and APIs
• Future high level language bindings based on community demand, feedback, and contributions
• Moving from CFFI to Cython for Python bindings to better integrate into the PyData community

[Diagram: LibGDF + PyGDF merging into a single cuDF repository]
28
PANDAS-LIKE API
Python GPU DataFrame library
29
PANDAS-LIKE API
Pandas ↔ PyGDF
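A minimal round-trip sketch, assuming the PyGDF-era entry points DataFrame.from_pandas and to_pandas (both carried over into cuDF):

```python
import pandas as pd
import pygdf  # later renamed to cudf

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

gdf = pygdf.DataFrame.from_pandas(pdf)  # host (Pandas) -> GPU DataFrame
pdf2 = gdf.to_pandas()                  # GPU DataFrame -> host (Pandas)
assert pdf.equals(pdf2)
```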
30
PANDAS-LIKE API
Built-In Functions
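For example, reductions and element-wise math mirror their Pandas counterparts; a hedged sketch (coverage varied across early releases):

```python
import cudf

df = cudf.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]})

total = df['a'].sum()       # reduction, computed on the GPU
mean = df['a'].mean()
doubled = df['a'] * 2       # element-wise binary operation
summed = df['a'] + df['a']  # column-to-column arithmetic
```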
31
DEMO
32
DASK
What is Dask and why does RAPIDS use it for scaling out?

• Dask is a distributed computation scheduler built to scale Python workloads from laptops to supercomputer clusters.
• Extremely modular, with scheduling, compute, data transfer, and out-of-core handling all decoupled, which lets us plug in our own implementations.
• Can easily run multiple Dask workers per node, enabling an easier development model of one worker per GPU in both single-node and multi-node environments (a minimal sketch follows).
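The sketch of that one-worker-per-GPU pattern, using dask.distributed; the cluster sizing and device pinning here are assumptions, not the official RAPIDS setup:

```python
from dask.distributed import Client, LocalCluster

# One single-threaded worker per GPU on this node; in practice each
# worker would be pinned to its own device, e.g. via CUDA_VISIBLE_DEVICES.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

def work(x):
    return x * 2  # placeholder for per-GPU work

futures = client.map(work, range(8))
print(client.gather(futures))
```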
33
DASK
Scale up and out with cuDF

• Use cuDF primitives underneath in map-reduce style operations with the same high level API (see the sketch after this list)
• Instead of using typical Dask data movement of pickling objects and sending via TCP sockets, take advantage of hardware advancements using a communications framework called OpenUCX:
  • For intranode data movement, utilize NVLink and PCIe peer-to-peer communications
  • For internode data movement, utilize GPU RDMA over InfiniBand and RoCE

https://github.com/rapidsai/dask_gdf
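A hedged sketch of that map-reduce pattern, partitioning a GPU DataFrame across Dask workers; the entry point follows the dask_gdf repo above, and the exact names (from_pygdf, later dask_cudf.from_cudf) are assumptions here:

```python
import pandas as pd
import pygdf
import dask_gdf  # later renamed to dask_cudf

gdf = pygdf.DataFrame.from_pandas(
    pd.DataFrame({'key': [0, 1, 0, 1], 'val': [1.0, 2.0, 3.0, 4.0]}))

# Split the GPU DataFrame into partitions that Dask schedules as tasks,
# then run a map-reduce style reduction with the same high level API.
dgdf = dask_gdf.from_pygdf(gdf, npartitions=2)
print(dgdf['val'].sum().compute())
```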
34
DASK
Scale up and out with cuML

• Native integration with Dask + cuDF
• Can easily use Dask workers to initialize NCCL for optimized gather / scatter operations
• Example: this is how the dask-xgboost included in the container works for multi-GPU and multi-node, multi-GPU
• Provides easy-to-use, high-level primitives for the synchronization of workers needed by many ML algorithms
35
LOOKING TO THE FUTURE
36
GPU DATAFRAME
Next few months

• Continue improving performance and functionality
  • Single GPU
  • Single node, multi GPU
  • Multi node, multi GPU
• String Support
  • Support for a specific “string” dtype with GPU-accelerated functionality similar to Pandas
• Accelerated Data Loading
  • File formats: CSV, Parquet, ORC – to start
37
CUSTRING
GPU-Accelerated string functions with a Pandas-like API

• API and functionality follow Pandas: https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling (see the sketch below)
• lower(): ~22x speedup
• find(): ~40x speedup
• slice(): ~100x speedup

[Chart: runtime in milliseconds (0-800) of lower(), find(#), and slice(1,15), Pandas vs cudastrings]
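The Pandas side of the benchmark uses the standard .str accessor; the GPU side below is a hedged sketch against the early nvstrings ("cudastrings") library, whose module name and signatures are assumptions here:

```python
import pandas as pd

s = pd.Series(['GPU-Accelerated', 'String', 'Functions'])
s.str.lower()       # lower()
s.str.find('t')     # find(#)
s.str.slice(1, 15)  # slice(1,15)

# Hypothetical GPU equivalent (nvstrings API as an assumption):
# import nvstrings
# ds = nvstrings.to_device(['GPU-Accelerated', 'String', 'Functions'])
# ds.lower(); ds.find('t'); ds.slice(1, 15)
```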
38
ACCELERATED DATA LOADING
CPUs bottleneck data loading in high throughput systems

• CSV Reader (see the sketch below)
  • Follows API of pandas.read_csv
  • Current implementation is a >10x speed improvement over pandas
• Parquet Reader
  • Work in progress: https://github.com/gpuopenanalytics/libgdf/pull/85
  • Will follow API of pandas.read_parquet
• ORC Reader
• Additionally looking towards GPU-accelerating decompression for common compression schemes

Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
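A minimal sketch of the CSV reader; it follows the pandas.read_csv API as noted above, though early releases supported only a subset of keyword arguments and required explicit names/dtypes (the file name here is hypothetical):

```python
import cudf

gdf = cudf.read_csv('trips.csv',                # hypothetical input file
                    names=['cab_type', 'fare'],
                    dtype=['str', 'float64'])
print(gdf.head().to_pandas())
```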
39
PYTHON CUDA ARRAY INTERFACE
Interoperability for Python GPU Array Libraries

• The CUDA array interface is a standard format that describes a GPU array, allowing GPU arrays to be shared between different libraries without needing to copy or convert data (see the sketch below)
• Numba, CuPy, and PyTorch are the first libraries to adopt the interface:
  • https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
  • https://github.com/cupy/cupy/releases/tag/v5.0.0b4
  • https://github.com/pytorch/pytorch/pull/11984
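A short sketch of zero-copy sharing via __cuda_array_interface__ between two of the adopters listed above, Numba and CuPy (requires a CUDA-capable GPU; CuPy >= 5.0 consumes the interface):

```python
import numpy as np
import cupy
from numba import cuda

d_arr = cuda.to_device(np.arange(10, dtype=np.float32))  # Numba device array

# The interface dict exposes the device pointer, shape, and dtype of the buffer.
print(d_arr.__cuda_array_interface__)

# CuPy consumes the interface: no host round-trip, no copy of the data.
c_arr = cupy.asarray(d_arr)
print(cupy.sum(c_arr))
```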
40
JOIN THE REVOLUTION
Everyone Can Help!
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
• APACHE ARROW: https://arrow.apache.org/ - @ApacheArrow
• GPU Open Analytics Initiative: http://gpuopenanalytics.com/ - @GPUOAI
• RAPIDS: https://rapids.ai - @RAPIDSAI
41
WE’RE HIRING
Help us build the future!
• Junior/Mid/Senior Data Scientists
• Junior/Mid/Senior Data Engineers
• CUDA
• Internships
THANK YOU
Keith Kraus @keithjkraus
