Accelerate Data Science in Python with RAPIDS
John Zedlewski, Senior Director, RAPIDS and Data Science @ NVIDIA
Ashwin Srinath, Senior Engineer, RAPIDS @NVIDIA
GTC 2023
RAPIDS brings GPU acceleration to the open-source data
science and data engineering ecosystem
General Purpose and Domain-Specific Libraries
Data Preparation/ETL
cuDF
● GPU-accelerated ETL functions
● Tracks pandas and other common PyData APIs
● Dask + UCX integration for scaling
RAPIDS for Apache Spark
● RAPIDS Accelerator for Apache Spark
Analytics/ML/Graph
RAPIDS ML
● GPU-native cuML library (scikit-learn-style APIs), XGBoost, RAFT, FIL, HPO, DL interop, and more
cuGraph
● GPU graph analytics, including Louvain, PageRank, and more
● Multi-Node Multi-GPU features
Visualization
cuxfilter
● GPU-accelerated cross-filtering
Viz integration
● PyViz: Plotly Dash, Bokeh, Datashader, HoloViews, hvPlot
● Node-RAPIDS bindings for Node.js
Application and Domain-Specific Frameworks
Morpheus
Cybersecurity application development framework
cuSignal
Signal processing
cuSpatial
Spatial analytics
Merlin
Recommender systems development framework
cuCIM
Computer vision & image processing primitives
NVTabular
Tabular data feature engineering
cuDF
A GPU DataFrame library in Python with a pandas-like API built into the PyData ecosystem
Pandas-like API on the GPU
CPU (pandas):
>>> import pandas as pd
>>> df = pd.read_csv("filepath")
>>> df.groupby("col").mean()
>>> df.rolling(window=3).sum()
GPU (cuDF):
>>> import cudf
>>> df = cudf.read_csv("filepath")
>>> df.groupby("col").mean()
>>> df.rolling(window=3).sum()
Best-in-Class Performance (Benchmark)
Average Speed-Ups: 10-100x
10 Minutes to cuDF, Groupby, Time Series, Strings and Regex, Missing Data, Indexing, Nested Types, Rolling Windows, CuPy Interoperability, UDFs
NVIDIA A100 vs. AMD EPYC 7642 48-Core Processor
cuDF Python vs. Pandas 1.4
Performance maximized on large in-memory datasets
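The pandas-like API can be tried with a small sketch like the one below. It uses only the calls shown on the slide; the toy column names (`store`, `sales`), the `to_host` helper, and the pandas fallback are illustrative assumptions, added so the snippet also runs on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf  # assumption of this sketch: same calls on the CPU


def to_host(obj):
    """Return a pandas object whether obj came from cuDF or pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Hypothetical toy data standing in for the CSV file on the slide.
df = xdf.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [1.0, 3.0, 5.0, 7.0],
})

# cuDF does not sort group keys by default, so sort for a stable order.
mean_sales = to_host(df.groupby("store").mean().sort_index())

# Rolling sum over a 3-row window; the first two rows have no full window.
rolling_sales = to_host(df["sales"].rolling(window=3).sum())

print(mean_sales)
print(rolling_sales)
```

The only line that differs between the CPU and GPU paths is the import, which is the point the slide makes.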
Let’s Code:
Loading and Preparing Data
Get the notebook at
rapids.ai/introgtc2023
cuDF: Advanced Features
I/O and Interoperability
High-performance IO
▸ CUDA-accelerated readers and writers for CSV,
JSON, Parquet, ORC, Avro, and plain text
▸ GPUDirect Storage to bypass PCI bottlenecks
▸ On-GPU compression / decompression with nvCOMP
Support for text data on GPU
▸ Standard Pandas-style string functions and regular
expressions, accelerated in CUDA
▸ Advanced parsers and tokenizers for deep learning
and NLP, such as Byte-Pair Encoding
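As a rough illustration of the pandas-style string functions mentioned above, the sketch below runs a few `.str` calls, including regular expressions. The product codes are made up, and the pandas fallback is an assumption so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # string kernels run in CUDA when RAPIDS is installed
except ImportError:
    import pandas as xdf  # assumption of this sketch: CPU fallback


def to_list(series):
    """Copy a cuDF or pandas Series into a plain Python list."""
    host = series.to_pandas() if hasattr(series, "to_pandas") else series
    return host.tolist()


skus = xdf.Series(["AB-1024", "cd-77", "EF-9"])  # hypothetical product codes

upper = to_list(skus.str.upper())
# Regex match: does the code contain a run of three or more digits?
has_long_number = to_list(skus.str.contains(r"\d{3,}"))
# Regex replace: strip the leading letters and the dash.
digits = to_list(skus.str.replace(r"^[A-Za-z]+-", "", regex=True))

print(upper, has_long_number, digits)
```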
Complex Datatypes
▸ Struct, List, and Decimal128 columns – often found
in enterprise datasets but not in core Pandas
Interoperability
▸ Zero-copy data passing to CuPy, PyTorch, and more
via DLPack and __cuda_array_interface__
import cudf
nelem = 10000
df = cudf.DataFrame({
    'a': range(nelem),
    'b': range(500, nelem + 500),
    'c': range(1000, nelem + 1000),
})
# Convert to CuPy
arr_cupy = df.to_cupy()
# Convert from PyTorch
import torch
from torch.utils import dlpack
data = torch.randn(40000).cuda()
df = cudf.from_dlpack(
    dlpack.to_dlpack(data))
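To see what "zero-copy" means in the DLPack exchange above, the sketch below uses NumPy on the CPU as a stand-in (an assumption made so it runs without a GPU; it needs a NumPy release that provides `from_dlpack`). The consumer wraps the producer's buffer instead of copying it, which is the same idea the GPU libraries apply to device memory.

```python
import numpy as np

source = np.arange(6, dtype=np.float32)

# from_dlpack asks `source` for a DLPack capsule and wraps the same buffer.
view = np.from_dlpack(source)

# If no copy was made, both arrays point at the same memory.
shares = np.shares_memory(source, view)
print(shares)
```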
cuDF: User-defined functions
▸ RAPIDS leverages Numba to compile a wide range of
your user-defined functions to CUDA code
▸ Numba has explicit CUDA JIT support, but it is
automatically applied and optimized in key RAPIDS
locations – totally transparently
▸ Can be used in:
* apply on Series and DataFrames
* Rolling windows for time series (.rolling)
* apply_grouped for aggregations
* On strings in many cases (newly added!)
Bringing all your Python code to CUDA
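>>> import math
>>> import cudf
>>> import cupy as cp
>>> from numba import cuda
>>> sr = cudf.Series([-1, 1, 2, 3])
# Explicit Numba: a CUDA kernel cannot return a value, so it writes into an output array
@cuda.jit
def rectified_linear(x, out):
    i = cuda.grid(1)
    if i < x.size:
        if x[i] < 0:
            out[i] = 0
        elif x[i] < 1:
            out[i] = x[i]
        else:
            out[i] = 1
>>> out = cp.empty_like(sr.values)
>>> rectified_linear.forall(len(sr))(sr.values, out)
# Automatic Numba with a lambda
>>> sr.apply(lambda x: math.log(x+1))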
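The `apply` path listed above can be sketched with ordinary Python functions. On cuDF these are compiled with Numba behind the scenes; the function names, the toy data, and the pandas fallback are assumptions of this sketch so it also runs without a GPU.

```python
try:
    import cudf as xdf  # UDFs are JIT-compiled to CUDA when RAPIDS is installed
except ImportError:
    import pandas as xdf  # assumption of this sketch: CPU fallback


def to_list(series):
    """Copy a cuDF or pandas Series into a plain Python list."""
    host = series.to_pandas() if hasattr(series, "to_pandas") else series
    return host.tolist()


def clamp_unit(x):
    """Scalar UDF: clamp a value to the range [0, 1]."""
    if x < 0:
        return 0.0
    elif x < 1:
        return x
    else:
        return 1.0


def weighted_sum(row):
    """Row-wise UDF: combine two columns of one row."""
    return row["a"] + 2 * row["b"]


sr = xdf.Series([-1.0, 0.5, 2.0, 3.0])
clamped = to_list(sr.apply(clamp_unit))

df = xdf.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
combined = to_list(df.apply(weighted_sum, axis=1))

print(clamped, combined)
```

Keeping every return branch the same type (all floats in `clamp_unit`) makes the function easier for a JIT compiler to type.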
Let’s Code:
Working with UDFs and
Feature Engineering
Get the notebook at
rapids.ai/introgtc2023
cuML
Accelerated Machine Learning with a scikit-learn API
CPU (scikit-learn):
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier()
>>> clf.fit(x, y)
GPU (cuML):
>>> from cuml.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier()
>>> clf.fit(x, y)
40+ GPU-Accelerated Algorithms & Growing
Time Series, Preprocessing, Classification, Tree Models, Cross Validation, Clustering, Explainability, Dimensionality Reduction, Regression
A100 GPU vs. 2x Intel Xeon E5-2698 CPUs (80 logical cores)
cuML 23.02, scikit-learn 1.2, umap-learn 0.5.3
Performance maximized on large in-memory datasets
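The import swap on this slide can be tried end to end with a sketch like the following. The synthetic two-cluster data, the hyperparameters, and the scikit-learn fallback are assumptions made for illustration; the features are cast to float32 and the labels to int32, which is the safest input for the GPU estimator.

```python
import numpy as np

try:
    from cuml.ensemble import RandomForestClassifier  # GPU estimator
except ImportError:
    from sklearn.ensemble import RandomForestClassifier  # assumption: CPU fallback

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters, 50 points each.
X = np.concatenate([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
]).astype(np.float32)
y = np.repeat([0, 1], 50).astype(np.int32)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, y)

# Fraction of training points the fitted forest labels correctly.
accuracy = float((np.asarray(clf.predict(X)) == y).mean())
print(f"training accuracy: {accuracy:.2f}")
```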
● One line of code change to unlock up to 20x
speedups with GPUs
● Scalable to the world’s largest datasets with
Dask and PySpark
● Built-in SHAP support for model explainability
● Deployable with Triton for lightning-fast inference
in production
● Triton supports LightGBM and Random Forests as
well as XGBoost for inference
Accelerated XGBoost and Inference for Trees
“XGBoost is All You Need” – Bojan Tunguz, 4x Kaggle Grandmaster
CPU (XGBoost):
>>> from xgboost import XGBClassifier
>>> clf = XGBClassifier()
>>> clf.fit(x, y)
GPU (XGBoost):
>>> from xgboost import XGBClassifier
>>> clf = XGBClassifier(tree_method="gpu_hist")
>>> clf.fit(x, y)
Up to 20x Speedups
RAPIDS Visualization
Scalable graphics with user-friendly interfaces
Leverage cuDF speedups to visualize, filter, and analyze
data frames fast with popular library integrations and
the RAPIDS-native cuxfilter and node-RAPIDS
Let’s Code:
Intro to ML
(plus a little visualization)
Get the notebook at
rapids.ai/introgtc2023
• cuGraph is a library of graph
algorithms capable of processing the
world’s largest graphs
• Link analysis, community detection,
centrality, linear assignment, property
graphs, and more
• Friendly, consistent C, C++17, and
Python APIs compatible with NetworkX
• World-class performance for every
scale and use case
• Support for trillion+ edge graphs
• Graph neural network library
integration
• Graph database integration
cuGraph
Making Large-Scale Graph Analytics Possible
PageRank on 4.4 trillion edges at 1.5 seconds per iteration
Each node has eight A100 80GB GPUs, InfiniBand for inter-node
communication, and NVLink for intra-node communication
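For readers unfamiliar with what one "iteration" in the PageRank figure involves, here is a minimal pure-Python sketch of a single PageRank step. It is not cuGraph's implementation; the function name, the four-vertex graph, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def pagerank_step(ranks, out_edges, damping=0.85):
    """Run one PageRank iteration over an adjacency-list graph."""
    n = len(ranks)
    # Every vertex starts the round with its teleport share.
    new_ranks = {v: (1.0 - damping) / n for v in ranks}
    for src, targets in out_edges.items():
        if targets:
            # Split the damped rank evenly across outgoing edges.
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        else:
            # Dangling vertex: spread its rank evenly over all vertices.
            for dst in new_ranks:
                new_ranks[dst] += damping * ranks[src] / n
    return new_ranks


# Hypothetical graph: vertex -> list of vertices it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {v: 1.0 / len(graph) for v in graph}

for _ in range(50):
    ranks = pagerank_step(ranks, graph)

top_vertex = max(ranks, key=ranks.get)
print(top_vertex, ranks)
```

Each step touches every edge once, so the cost of an iteration grows with the edge count.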
Scaling with RAPIDS + Dask
▸ Distributed extensions with familiar APIs for
DataFrames and Arrays
▸ Scale from a laptop to a supercomputer
– tested up to 1024 GPUs
▸ Deploy on any cloud service or Kubernetes in minutes
▸ Integrate cuDF, cuPy, cuML, cuGraph, and XGBoost
with a common framework
▸ Easily switch between CPU and GPU backends
Easy Multi-GPU for Python Programmers
import dask
import dask.dataframe as dd
# Start by telling Dask to use the GPU backend
with dask.config.set({"dataframe.backend": "cudf"}):
    ddf_s = dd.read_parquet("stores.parquet")
    ddf_p = dd.read_parquet("purchases.parquet")
ddf_p["total"] = ddf_p.price * ddf_p.quantity
# Combine the two dataframes
ddf_join = ddf_p.merge(ddf_s,
                       on=["id"], how="inner")
ddf_join = ddf_join.set_index("key")
RAPIDS Accelerator for Apache Spark
Seamless integration with Apache Spark 3.x
spark.sql("""
    select
        order,
        count(*) as order_count
    from
        orders
    group by
        order"""
)
spark.conf.set("spark.plugins",
               "com.nvidia.spark.SQLPlugin")
spark.sql("""
    select
        order,
        count(*) as order_count
    from
        orders
    group by
        order"""
)
CPU Spark
GPU Spark
Average Speed-Ups: 10x
~5x faster than CPU-based servers
78% cheaper than CPU-based servers
*CPU-only 4-node cluster: 4xn1-standard-32 (32vCPU, 120GB RAM)
*GPU 4-node cluster: 4xn1-standard-32 (32vCPU, 120GB RAM) and 8xT4 NVIDIA GPU
*NDS stands for NVIDIA Decision Support benchmark that is derived from the TPC-DS benchmark
and is used for internal testing. Results from NDS are not comparable to TPC-DS
Powering Modern Data Teams
Battle-tested on the most challenging workloads, integrated with the most innovative tools, and backed by a huge community
350+ RAPIDS contributors on GitHub
100+ open-source and commercial software integrations
25% of Fortune 100 companies using RAPIDS
Deploying RAPIDS
Running RAPIDS anywhere (https://docs.rapids.ai/deployment)
RAPIDS Deployment Documentation
More on RAPIDS at GTC 2023
General RAPIDS and Data Science
Accelerate Spark With RAPIDS For Cost Savings [S52202]
Accelerating Your Prototypes with NVIDIA RAPIDS and Friends* [DLIT51679] (DLI)
Learn How to Create Features from Tabular Data and Accelerate your Data Science Pipeline* [DLIT51195]
Accelerating Exploratory Data Analysis at LinkedIn [S51399]
cuSpatial: Integrate High-Performance Spatial Computation with Your Existing Workflow [S51243]
ML and Recommender Systems
Using GPU-Optimized Software to Shorten the Feedback Loop in AML and Fraud Models by an Order of Magnitude [S51632]
Using GNNs in LinkedIn Recommendation Systems [S51400]
Merlin Updates - Build and Deploy Recommender Systems at Any Scale [S51335]
Graph and Operations
Accelerating Huge Graph GNN Training using DGL and PyG with Integrated Containers [S51156]
Batched Graph Community Detection on NVIDIA DGX Platforms [PS51057]
Using GNNs in LinkedIn Recommendation Systems [S51400]
Advances in Operations Optimization [S51717]
How to Get Started with RAPIDS
A Variety of Ways to Get Up & Running
More about RAPIDS
● Learn more at RAPIDS.ai
● Read the API docs
● Check out the RAPIDS blog
● Read the NVIDIA DevBlog
Self-Start Resources
● Get started with RAPIDS
● Deploy on the Cloud today
● Start with Google Colab
● Look at the cheat sheets
Discussion & Support
● Check the RAPIDS GitHub
● Use the NVIDIA Forums
● Reach out on Slack
● Talk to NVIDIA Services
@RAPIDSai
https://github.com/rapidsai https://rapids.ai/slack-invite/ https://rapids.ai
Get Engaged
NVIDIA Launchpad
Instantly experience end-to-end workflows for AI, data science, 3D design collaboration, and more
Get Started with Launchpad