S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
- 1. Accelerate Data Science in Python with RAPIDS
John Zedlewski, Senior Director, RAPIDS and Data Science @ NVIDIA
Ashwin Srinath, Senior Engineer, RAPIDS @NVIDIA
GTC 2023
- 2. RAPIDS brings GPU acceleration to the open-source data
science and data engineering ecosystem
- 3. General Purpose and Domain-Specific Libraries
Data Preparation/ETL | Analytics/ML/Graph | Visualization
cuDF
● GPU-accelerated ETL functions
● Tracks Pandas and other common
PyData APIs
● Dask + UCX integration for scaling
RAPIDS for Apache Spark
● RAPIDS accelerator for Apache Spark
RAPIDS ML
● GPU-native cuML library (scikit-learn-style APIs), XGBoost, RAFT,
FIL, HPO, DL interop, and more
cuGraph
● GPU graph analytics, including
Louvain, PageRank, and more
● Multi-Node Multi-GPU features
cuxfilter
● GPU-accelerated cross-filtering
Viz integration
● pyViz: Plotly Dash, Bokeh,
Datashader, HoloViews, hvPlot
● Node-RAPIDS bindings for node.js
Morpheus
Cybersecurity application
development framework
cuSignal
Signal processing
cuSpatial
Spatial analytics
Merlin
Recommender Systems
development framework
cuCIM
Computer vision & image processing primitives
NVTabular
Tabular data feature engineering
Application and Domain-Specific Frameworks
- 4. cuDF
A GPU DataFrame library in Python with a pandas-like API built into the PyData ecosystem
Pandas-like API on the GPU
CPU (pandas)
>>> import pandas as pd
>>> df = pd.read_csv("filepath")
>>> df.groupby("col").mean()
>>> df.rolling(window=3).sum()
GPU (cuDF)
>>> import cudf
>>> df = cudf.read_csv("filepath")
>>> df.groupby("col").mean()
>>> df.rolling(window=3).sum()
Best-in-Class Performance (Benchmark)
Average Speed-Ups: 10-100x
10 Minutes to cuDF
Groupby Time Series
Strings and Regex
Missing Data
Indexing
Nested Types
Rolling Windows
CuPy Interoperability
UDFs
NVIDIA A100 vs. AMD EPYC 7642 48-Core Processor
cuDF Python vs. Pandas 1.4
Performance maximized on large in-memory datasets
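To make the two calls on this slide concrete, here is a plain-Python sketch of what groupby("col").mean() and rolling(window=3).sum() compute. It is a reference for the semantics only, not how cuDF or pandas implement them; the helper names groupby_mean and rolling_sum and the sample data are made up for illustration.

```python
import math
from collections import defaultdict


def groupby_mean(keys, values):
    """Mean of `values` for each distinct key (keys sorted for stable output)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}


def rolling_sum(values, window=3):
    """Sum over a trailing window; positions without a full window are NaN."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out


col = ["a", "b", "a", "b", "a"]
val = [1.0, 2.0, 3.0, 4.0, 5.0]

means = groupby_mean(col, val)   # {'a': 3.0, 'b': 3.0}
rolled = rolling_sum(val)        # [nan, nan, 6.0, 9.0, 12.0]
print(means, rolled)
```

Both loops are independent per group or per window position, which is the kind of work that maps well onto many GPU threads.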
- 6. cuDF: Advanced Features
I/O and Interoperability
High-performance IO
▸ CUDA-accelerated readers and writers for CSV,
JSON, Parquet, ORC, Avro, and plain text
▸ GPUDirect Storage to bypass PCI bottlenecks
▸ On-GPU compression / decompression with nvComp
Support for text data on GPU
▸ Standard Pandas-style string functions and regular
expressions, accelerated in CUDA
▸ Advanced parsers and tokenizers for deep learning
and NLP, such as Byte-Pair Encoding
Complex Datatypes
▸ Struct, List, and Decimal128 columns – often found
in enterprise datasets but not in core Pandas
Interoperability
▸ Zero-copy data passing to CuPy, PyTorch, and more
via dlpack and __cuda_array_interface__
import cudf
import torch
from torch.utils import dlpack

nelem = 10000
df = cudf.DataFrame({
    'a': range(nelem),
    'b': range(500, nelem + 500),
    'c': range(1000, nelem + 1000),
})

# Convert to CuPy
arr_cupy = df.to_cupy()

# Convert from PyTorch via DLPack
data = torch.randn(40000).cuda()
df = cudf.from_dlpack(
    dlpack.to_dlpack(data))
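"Zero-copy" means producer and consumer share one allocation instead of duplicating it. As a CPU-side analogy only, Python's buffer protocol plays roughly the role that `__cuda_array_interface__` and DLPack play for device memory. The sketch below uses just the standard library and touches no GPU; the variable names are illustrative.

```python
from array import array

# A contiguous buffer of float64 values, standing in for a device column.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview wraps the same memory: no elements are copied.
view = memoryview(column)

# A slice of a memoryview is still a view onto the original buffer.
tail = view[2:]
tail[0] = 30.0            # writes through to `column`

copied = column.tolist()  # an explicit copy, for contrast
copied[3] = 40.0          # does not affect `column`

print(column[2], column[3])
```

The write through `tail` is visible in `column`, while the write to `copied` is not. That difference is what zero-copy hand-off between libraries buys: no transfer cost, at the price of shared ownership.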
- 7. cuDF: User-defined functions
▸ RAPIDS leverages Numba to compile a wide range of
your user-defined functions to CUDA code
▸ Numba has explicit CUDA JIT support, but it is
automatically applied and optimized in key RAPIDS
locations – totally transparently
▸ Can be used in:
* apply on Series and DataFrames
* Rolling windows for time series (.rolling)
* apply_grouped for aggregations
* On strings in many cases (newly added!)
Bringing all your Python code to CUDA
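import math
import cudf
from numba import cuda

sr = cudf.Series([-1, 1, 2, 3])
out = cudf.Series([0, 0, 0, 0])

# Explicit Numba: a CUDA kernel cannot return a value, so it writes into `out`
@cuda.jit
def rectified_linear(x, out):
    i = cuda.grid(1)
    if i < x.size:  # boundary guard
        if x[i] < 0:
            out[i] = 0
        elif x[i] < 1:
            out[i] = x[i]
        else:
            out[i] = 1

rectified_linear.forall(len(sr))(sr, out)

# Automatic Numba with a lambda
sr.apply(lambda x: math.log(x + 1))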
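For checking a UDF before sending it to the GPU, a CPU reference of the same element-wise logic can help. This is a minimal sketch in plain Python: the `apply` helper is a hypothetical stand-in for Series.apply, and the input values are chosen to be non-negative after clamping so that math.log stays defined.

```python
import math


def rectified_linear(x):
    """Clamp x into [0, 1]: negatives become 0, values of 1 or more become 1."""
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1


def apply(values, func):
    """Hypothetical stand-in for Series.apply: call func on every element."""
    return [func(v) for v in values]


values = [-1, 0.5, 1, 2, 3]
clamped = apply(values, rectified_linear)            # [0, 0.5, 1, 1, 1]
logged = apply(clamped, lambda x: math.log(x + 1))   # log(1), log(1.5), log(2), ...
print(clamped, logged)
```

Because the function only looks at one element at a time and uses simple arithmetic and branches, it is the kind of scalar UDF that a JIT compiler can turn into a per-thread GPU function.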
- 9. cuML
Accelerated Machine Learning with a scikit-learn API
CPU (scikit-learn)
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier()
>>> clf.fit(x, y)
GPU (cuML)
>>> from cuml.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier()
>>> clf.fit(x, y)
40+ GPU-Accelerated Algorithms & Growing
Time Series Preprocessing
Classification
Tree Models
Cross Validation
Clustering
Explainability
Dimensionality Reduction
Regression
A100 GPU vs. 2x Intel Xeon E5-2698 CPUs (80 logical cores)
cuML 23.02, scikit-learn 1.2, umap-learn 0.5.3
Performance maximized on large in-memory datasets
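Since only the import line differs between the two versions, one way to keep a script portable is to pick the estimator at run time. The sketch below assumes nothing is installed: load_random_forest is a hypothetical helper that prefers cuML, falls back to scikit-learn, and returns (None, None) if neither can be imported. The synthetic data is illustrative only.

```python
import importlib
import importlib.util


def load_random_forest():
    """Return (backend, RandomForestClassifier), preferring cuML over scikit-learn."""
    for package in ("cuml", "sklearn"):
        if importlib.util.find_spec(package) is None:
            continue  # package not installed
        try:
            module = importlib.import_module(package + ".ensemble")
        except Exception:
            continue  # installed but unusable here (for example, no GPU)
        return package, module.RandomForestClassifier
    return None, None


backend, RandomForestClassifier = load_random_forest()

if RandomForestClassifier is None:
    print("neither cuML nor scikit-learn is available")
else:
    import numpy as np  # both libraries depend on NumPy

    rng = np.random.default_rng(0)
    x = rng.random((200, 4), dtype=np.float32)
    y = (x[:, 0] > 0.5).astype(np.int32)

    clf = RandomForestClassifier()
    clf.fit(x, y)
    print(backend, "predictions:", clf.predict(x[:5]))
```

The float32 features and int32 labels are a conservative choice that both libraries accept; the rest of the script is identical for either backend.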
- 10. ● One line of code change to unlock up to 20x
speedups with GPUs
● Scalable to the world’s largest datasets with
Dask and PySpark
● Built-in SHAP support for model explainability
● Deployable with Triton for lightning-fast inference
in production
● Triton supports LightGBM and Random Forests as
well as XGBoost for inference
Accelerated XGBoost and Inference for Trees
“XGBoost is All You Need” – Bojan Tunguz, 4x Kaggle Grandmaster
CPU (XGBoost)
>>> from xgboost import XGBClassifier
>>> clf = XGBClassifier()
>>> clf.fit(x, y)
GPU (XGBoost)
>>> from xgboost import XGBClassifier
>>> clf = XGBClassifier(tree_method="gpu_hist")
>>> clf.fit(x, y)
Up to 20x Speedups
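The "gpu_hist" setting shown here matches XGBoost 1.x. XGBoost 2.0 introduced a `device` parameter and deprecated "gpu_hist", so the one-line change differs by version. A small sketch of handling both follows; gpu_params is a hypothetical helper, not part of XGBoost.

```python
def gpu_params(xgboost_version):
    """Constructor keyword arguments that request GPU training.

    `xgboost_version` is a string such as "1.7.4"; only the major number is used.
    """
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0 introduced the `device` parameter and deprecated "gpu_hist".
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}


print(gpu_params("1.7.4"))  # {'tree_method': 'gpu_hist'}
print(gpu_params("2.0.3"))  # {'tree_method': 'hist', 'device': 'cuda'}

# Usage, assuming xgboost is installed and a CUDA GPU is present:
#   import xgboost
#   clf = xgboost.XGBClassifier(**gpu_params(xgboost.__version__))
```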
- 11. RAPIDS Visualization
Scalable graphics with user-friendly interfaces
Leverage cuDF speedups to visualize, filter, and analyze
data frames fast with popular library integrations and
the RAPIDS-native cuxfilter and node-RAPIDS
- 12. Let’s code:
Intro to ML
(plus a little visualization)
Get the notebook at
rapids.ai/introgtc2023
- 13. • cuGraph is a library of graph
algorithms capable of processing the
world’s largest graphs
• Link analysis, community detection,
centrality, linear assignment, property
graphs, and more
• Friendly, consistent C, C++17, and
Python APIs compatible with NetworkX
• World-class performance for every
scale and use case
• Support for trillion+ edge graphs
• Graph neural network library
integration
• Graph database integration
cuGraph
Making Large-Scale Graph Analytics Possible
PageRank on 4.4 trillion edges at 1.5 seconds per iteration
Each node has eight A100 80GB GPUs, InfiniBand for inter-node
communication, and NVLink for intra-node communication
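For readers new to PageRank, the "per iteration" in the benchmark refers to one pass of the power iteration over the edge list. Below is a plain-Python reference sketch of that loop on a toy graph. It shows the algorithm only and is not cuGraph's implementation; the damping factor of 0.85 and the iteration count are conventional choices assumed here.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # One pass over the edge list: this is the per-iteration cost.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Nodes 0, 1, 2 form a cycle; node 3 points into the cycle at node 0.
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(edges, num_nodes=4)
print([round(r, 4) for r in ranks])
```

Each iteration touches every edge once, so the cost scales with edge count, which is why the benchmark is quoted in edges and seconds per iteration.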
- 14. Scaling with RAPIDS + Dask
▸ Distributed extensions with familiar APIs for
DataFrames and Arrays
▸ Scale from a laptop to a supercomputer
– tested up to 1024 GPUs
▸ Deploy on any cloud service or Kubernetes in minutes
▸ Integrate cuDF, CuPy, cuML, cuGraph, and XGBoost
with a common framework
▸ Easily switch between CPU and GPU backends
Easy Multi-GPU for Python Programmers
import dask
import dask.dataframe as dd

# Start by telling Dask to use the GPU backend
with dask.config.set({"dataframe.backend": "cudf"}):
    ddf_s = dd.read_parquet("stores.parquet")
    ddf_p = dd.read_parquet("purchases.parquet")

ddf_p["total"] = ddf_p.price * ddf_p.quantity

# Combine the two dataframes
ddf_join = ddf_p.merge(ddf_s,
                       on=["id"], how="inner")
ddf_join = ddf_join.set_index("key")
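Conceptually, a distributed DataFrame is a set of partitions that are processed independently and then combined. The sketch below mimics the slide's two steps (a per-row total, then an inner join on "id") with plain Python lists and a thread pool. It is a conceptual model only, not Dask's scheduler or cuDF's join; the sample rows and the helper names add_total and inner_join are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def add_total(partition):
    """Per-partition step: total = price * quantity for each row."""
    return [dict(row, total=row["price"] * row["quantity"]) for row in partition]


def inner_join(left_rows, right_rows, key):
    """Hash join: index the right side by key, then probe with the left side."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [
        {**match, **row}
        for row in left_rows
        for match in index.get(row[key], [])
    ]


stores = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}]
purchases = [
    {"id": 1, "price": 2.0, "quantity": 3},
    {"id": 2, "price": 5.0, "quantity": 1},
    {"id": 1, "price": 1.5, "quantity": 4},
    {"id": 3, "price": 9.0, "quantity": 2},  # no matching store: dropped by the join
]

# Split the rows into partitions and process them concurrently.
partitions = [purchases[0:2], purchases[2:4]]
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = list(pool.map(add_total, partitions))

rows = [row for part in processed for row in part]
joined = inner_join(rows, stores, key="id")
print(joined)
```

The column arithmetic needs no communication between partitions, while the join does; that split is why element-wise steps scale almost linearly and joins depend on interconnect speed.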
- 15. RAPIDS Accelerator for Apache Spark
Seamless integration with Apache Spark 3.x
CPU Spark
spark.sql("""
    select
        `order`,
        count(*) as order_count
    from
        orders
    group by
        `order`"""
)
GPU Spark
spark.conf.set("spark.plugins",
               "com.nvidia.spark.SQLPlugin")
spark.sql("""
    select
        `order`,
        count(*) as order_count
    from
        orders
    group by
        `order`"""
)
Average Speed-Ups: 10x
~5x faster than CPU-based servers
78% cheaper than CPU-based servers
*CPU-only 4-node cluster: 4xn1-standard-32 (32vCPU, 120GB RAM)
*GPU 4-node cluster: 4xn1-standard-32 (32vCPU, 120GB RAM) and 8xT4 NVIDIA GPU
*NDS stands for NVIDIA Decision Support benchmark that is derived from the TPC-DS benchmark
and is used for internal testing. Results from NDS are not comparable to TPC-DS
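The "~5x faster" and "78% cheaper" figures together imply something about cluster pricing. A quick worked calculation, assuming job cost is simply hourly cluster price times runtime for the same workload (an assumption of this sketch, not a statement from the benchmark):

```python
speedup = 5.0        # "~5x faster": GPU runtime = CPU runtime / 5
cost_saving = 0.78   # "78% cheaper": GPU job cost = 22% of CPU job cost

# Assume job cost = hourly cluster price * runtime, for the same workload.
cpu_hourly = 1.0                     # normalised CPU cluster price
cpu_runtime = 1.0                    # normalised CPU runtime
cpu_cost = cpu_hourly * cpu_runtime

gpu_runtime = cpu_runtime / speedup
gpu_cost = cpu_cost * (1.0 - cost_saving)

# Hourly price of the GPU cluster implied by the two headline numbers.
gpu_hourly = gpu_cost / gpu_runtime
print(f"implied GPU/CPU hourly price ratio: {gpu_hourly / cpu_hourly:.2f}")
```

Under that assumption the GPU cluster would cost about 1.1 times as much per hour as the CPU cluster, so nearly all of the saving comes from the shorter runtime.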
- 16. Powering Modern Data Teams
Battle tested on the most challenging workloads, integrated with the most
innovative tools, and backed by a huge community
350+ RAPIDS contributors on GitHub
100+ open-source and commercial software integrations
25% of Fortune 100 companies using RAPIDS
- 18. More on RAPIDS at GTC 2023
General RAPIDS and Data Science
Accelerate Spark With RAPIDS For Cost Savings [S52202]
Accelerating Your Prototypes with NVIDIA RAPIDS and Friends* [DLIT51679] (DLI)
Learn How to Create Features from Tabular Data and Accelerate your Data Science Pipeline* [DLIT51195]
Accelerating Exploratory Data Analysis at LinkedIn [S51399]
cuSpatial: Integrate High-Performance Spatial Computation with Your Existing Workflow [S51243]
ML and Recommender Systems
Using GPU-Optimized Software to Shorten the Feedback Loop in AML and Fraud Models by an Order of Magnitude [S51632]
Using GNNs in LinkedIn Recommendation Systems [S51400]
Merlin Updates - Build and Deploy Recommender Systems at Any Scale [S51335]
Graph and Operations
Accelerating Huge Graph GNN Training using DGL and PyG with Integrated Containers [S51156]
Batched Graph Community Detection on NVIDIA DGX Platforms [PS51057]
Using GNNs in LinkedIn Recommendation Systems [S51400]
Advances in Operations Optimization [S51717]
- 19. How to Get Started with RAPIDS
A Variety of Ways to Get Up & Running
More about RAPIDS
● Learn more at RAPIDS.ai
● Read the API docs
● Check out the RAPIDS blog
● Read the NVIDIA DevBlog
Self-Start Resources
● Get started with RAPIDS
● Deploy on the Cloud today
● Start with Google Colab
● Look at the cheat sheets
Discussion & Support
● Check the RAPIDS GitHub
● Use the NVIDIA Forums
● Reach out on Slack
● Talk to NVIDIA Services
Get Engaged
@RAPIDSai
https://github.com/rapidsai
https://rapids.ai/slack-invite/
https://rapids.ai