Raven: End-to-end Optimization of ML Prediction Queries

Raven: End-to-end Optimization
of ML Prediction Queries
Konstantinos Karanasos, Kwanghyun Park
Gray Systems Lab, Microsoft

App
logic
offline
online
Model Inference
Featurization Model
Model
optimization
policies
orchestratio
n
Data
Catalogs
Governance
Model
Tracking
& Provenance
Access
Control
Logs &
Telemetry
policies
Decisions
Live Data
deployment
other data
featurizatio
n
Model
Training
Model Development / Training
offline feat.
Model
Enterprise-grade ML lifecycle

Data Scientist
Analyst/Developer
model
training
model scoring
data exploration/
preparation
data selection/
transformation
model
deployment
Use Case: Length-of-stay in Hospital
Model:
“Predict length of stay of
a patient in the hospital”
Prediction query:
“Find pregnant patients that
are expected to stay in the
hospital more than a week”

Featurization Model
Container
REST
Prediction Queries: Baseline Approach
policies
HTTP
WebServer
App logic
ODBC
DBMS
Enterprise Features
• Security: data and models outside of the DB
• Extra infrastructure
• High TCO
• Lack tooling/best-practices
Performance
• Data movement
• Latency
• Throughput on batch-scoring

Prediction Queries: In-Engine Evaluation
policies
HTTP
WebServer
App logic
ODBC
DBMS
Enterprise Features
• Security: Data and models within the DBMS
• Reuse Existing infrastructure
• Language/tools/best practices
• Low TCO
Performance ?
• Up to 13x faster on Spark
• Up to 330x faster on SQL Server

Raven: An Optimizer for Prediction Queries in Azure Data
+
data
models
Unified IR
INSERT INTO model (name, model) AS
(“duration_of_stay”,
“from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from …
model_pipeline =
Pipeline([(‘union’, FeatureUnion(…
(‘scaler’,StandardScaler()), …))
(‘clf’,DecisionTreeClassifier())])”);
M: model pipeline (Data Scientist)
Q: SQL query invoking model (Data Analyst)
DECLARE @model varbinary(max) = (
SELECT model FROM scoring_models
WHERE model_name = ”duration_of_stay“ );
WITH data AS(
SELECT *
FROM patient_info AS pi
JOIN blood_tests AS be ON pi.id = be.id
JOIN prenatal_tests AS pt ON be.id = pt.id
);
SELECT d.id, p.length_of_stay
FROM PREDICT(MODEL=@model, DATA=data AS d)
WITH(length_of_stay Pred float) AS p
WHERE d.pregnant = 1 AND p.length_of_stay > 7;
patient_info blood_tests
Categorical
Encoding
FeatureExtractor
DecisionTreeClassifier
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
Optimized plan for MQ
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
σlength_of_stay >= 7
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
Optimized
IR
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
Embed high-performance
ML inference runtimes
within our data engines
Express data and
ML operations in
a common graph

Constructing the IR
Raven IR operators
Relational algebra
Linear algebra
Other ML operators and data featurizers
UDFs
Static analysis of the prediction query
Support for SQL+ML
Adding support for Python
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
Static
Analysis
Cross
Optimization
2 4 7
… … … …
MQ: inference query

ML Inference in Azure Data Engines
SQL Server
PREDICT statement in SQL Server
Embedded ONNX Runtime in the engine
Available in Azure SQL Edge and SQL DW
(part of Azure Synapse Analytics)
Spark
Introduced a new PREDICT operator
Similar syntax to SQL Server
Support for different types of models
from …
model_pipeline =
WITH data AS(
SELECT *
);
pa
Un
Static
Analysis
MQ: inference query

from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
Q: “Find pregnant patients
expected to stay in the hospital
more than a week”
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
Raven: An Optimizer for Prediction Queries
+
+
Runtime
Code
Gen

Raven optimizations in practice
(name, model) AS
”,
eline import Pipeline
rocessing import StandardScaler
import DecisionTreeClassifier
n’, FeatureUnion(…
scaler’,StandardScaler()), …))
reeClassifier())])”);
Data Scientist)
ng model (Data Analyst)
rbinary(max) = (
OM scoring_models
e = ”duration_of_stay“ );
nfo AS pi
ts AS be ON pi.id = be.id
tests AS pt ON be.id = pt.id
ngth_of_stay
L=@model, DATA=data AS d)
ay Pred float) AS p
= 1 AND p.length_of_stay > 7;
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
Runti
Code
1. Predicate-based
model pruning
2. Model projection
pushdown
3. Model splitting
4. Model-to-SQL
translation
5. NN translation
6. Standard DB
optimizations
7. Compiler
optimizations

1. Avoid unnecessary computation
Information passing between model and data
2. Pick the right runtime for each operation
Translation between data and ML operations
3. Hardware acceleration
Translation to tensor computations
(Hummingbird)
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
Raven optimizations: Key Ideas

0
1,000
2,000
3,000
4,000
5,000
DT-depth5 DT-depth8 LR-.001 GB-20est DT-depth5 DT-depth8 LR-.001 GB-20est
Hospital - 2 billion rows Expedia - 500 million rows
Elapsed
time
(seconds)
End-to-end inferene query time
SparkML Sklearn ONNX runtime Raven
Performance Evaluation: Raven in Spark (HDI)
Best of Raven:
• Decision Trees (DT) and Logistic Regressions (LR): Model Projection Pushdown + ML-to-SQL
• Gradient Boost (GB): Model Projection Pushdown
SELECT PREDICT(model, col1, …)
FROM Hospital
SELECT PREDICT(model, S.col1, …)
FROM listings S, hotels R1, searches R2
WHERE S.prop_id = R1.prop_id AND S.srch_id = R2.srch_id
Raven outperforms other ML runtimes (SparkML, Sklearn, ONNX runtime) by up to ~44x
~44x

0
500
60Est/Dep5 100Est/Dep4 100Est/Dep8 500Est/Dep8
Elapsed
time
(seconds)
Gradient Boost Models (Hospital 200M rows)
ONNX runtime Raven - CPU Raven - GPU
2500
3000
3500
End-to-end inference query time
Performance Evaluation: Raven in Spark with GPU
SELECT PREDICT(model, col1, …) FROM Hospital
Raven + GPU outperforms ONNX runtime by up to ~8x for complex models
~8x

1
10
100
1,000
10,000
100,000
DT-depth5 DT- depth8 LR-.001 GB/RF-20est DT-depth5 DT- depth8 LR-.001 GB-20est
hospital - 100M rows expedia - 100M rows
End-to-end
Time
(sec)
Log
Scale
End-to-end inference query time
MADlib SQL Server (DOP1) Raven (DOP1) SQL Server (DOP16) Raven (DOP16)
Performance Evaluation: Raven Plans in SQL Server
Potential gains with Raven in SQL Server are significantly large!
~230x
~100x
Best of Raven:
• Decision Trees (DT) and Logistic Regressions (LR): Model Projection Pushdown + ML-to-SQL
• Gradient Boost (GB): Model Projection Pushdown

Performance Evaluation: Raven in SQL Server with GPU
Potential gains with Raven and GPU acceleration are significantly large!
~100x
Batch size:
• CPU: Minimum query time obtained with optimal choice of batch size (50K/100K rows).
• GPU: 600K rows.
0
200
400
600
800
1000
1200
1400
depth3-
20est
depth5-
60est
depth4-
100est
depth8-
100est
depth8-
500est
End-to-end
Time
(secs)
Min. CPU-SKL GPU-HB
~2.6x
hospital – 100M rows, GB models

Conclusion: in-DBMS model inference
• Raven is the first step in a long journey of incorporating ML inference
as a foundational extension of relational algebra and an integral
part of SQL query optimizers and runtimes
• Novel Raven optimizer with cross optimizations and operator
transformations
Ø Up to 13x performance improvements on Spark
Ø Up to 330x performance improvements on SQL Server
• Integration of Raven within Spark and SQL Server

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Current state of affairs: In-application model inference
Use case: hospital length-of-stay
“Find pregnant patients that are
expected to stay in the hospital more
than a week”
Security
• Data leaves the DB
• Model outside of the DB
Performance
• Data movement
• Use of Python for data operations
DBMS

Raven: In-DBMS model inference
from …
model_pipeline =
WITH data AS(
SELECT *
);
Static
Analysis
MQ: inference query
Inference query: SQL + PREDICT (SQL Server
2017 syntax) to combine SQL operations with ML
inference
DBMS
DBMS
Raven
model
data
SQL
+
ML

Raven: In-DB model inference
DBMS
Raven
Security
• Data and models within the DB
• Treat models as data
User experience
• Leverage maturity of RDBMS
• Connectivity, tool integration
Can in-DBMS ML inference match (or exceed?)
the performance of state-of-the-art ML
frameworks?
Yes, by up to 230x!

Cross-optimizations in practice
l (name, model) AS
ay”,
ipeline import Pipeline
eprocessing import StandardScaler
ee import DecisionTreeClassifier
=
ion’, FeatureUnion(…
nTreeClassifier())])”);
(Data Scientist)
oking model (Data Analyst)
varbinary(max) = (
FROM scoring_models
ame = ”duration_of_stay“ );
_info AS pi
ests AS be ON pi.id = be.id
l_tests AS pt ON be.id = pt.id
length_of_stay
DEL=@model, DATA=data AS d)
stay Pred float) AS p
t = 1 AND p.length_of_stay > 7;
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
y
Run
Cod
Cross-IR optimizations
and operator
transformations:
Ø Predicate-based
model pruning
Ø Model projection
pushdown
Ø Model splitting
Ø Model inlining
Ø NN translation
Ø Standard DB
optimizations
Ø Compiler
optimizations

Raven overview
from …
model_pipeline =
WITH data AS(
SELECT *
);
Categorical
Encoding
FeatureExtractor
Rescaling
Concat
prenatal_tests
σ pregnant = 1
age
pregnant
gender
1 0
F M X
<35 >=35
…
bp … …
…
…
…
Unified IR for MQ
NeuralNet
prenatal_tests
switch:
case (bp>140): 7
case (120<bp<140): 4
case (bp<120): 2
σage >35
σ pregnant = 1
π π π
σage <=35
U
Static
Analysis
Cross
Optimization
2 4 7
… … … …
σbp>140
SQL-inlined model
MQ: inference query
Runtime
Code gen
+
Key ideas:
1. Novel cross-optimizations between SQL and ML operations
2. Combine high-performance ML inference engines with SQL Server

Effect of cross optimizations
1
10
102
103
104
1K 10K 100K 1M
Inference
Time
(ms)
Log
Scale
Dataset Size
RF (scikit-learn)
RF-NN (CPU)
RF-NN (GPU)
24.5x
15x
5.3x

Execution modes
In-process
Deep integration of
ONNX Runtime in
SQL Server
Out-of-process
For queries/models not
supported by our static
analyzer
sp_execute_external_script
(Python, R, Java)
Containerized
For languages not
supported by out-
of-process execution

In-process execution
Native predict: execute the model in the same process as SQL Server
Rudimentary support since SQL Server 2017 (five hardcoded models)
Take advantage of state-of-the-art ML inference engines
Compiler optimizations, Code generation, Hardware acceleration
SQL Server + ONNX Runtime
Some challenges
Align schemata between DB and model
Transform data to/from tensors (avoid copying)
Cache inference sessions
Allow for different ML engines
from …
model_pipeline =
WITH data AS(
SELECT *
);
St
Ana
MQ: inference query

Current status
In-process predictions
Ø Implementation in SQL Server 2019
Ø Public preview in Azure SQL DB Edge
Ø Private preview in Azure SQL DW
Out-of-process predictions
Ø ONNX Runtime as an external
language
(ongoing)

Benefits of deep integration
1
10
100
1K
10k
1K 10K 100K 1M 10M 1K 10K 100K 1M 10M
Total
Inference
Time
(ms)
Log
Scale
Dataset Size
Random Forest MLP
ORT
Raven
Raven ext.

Raven: End-to-end Optimization of ML Prediction Queries

More Related Content

Raven: End-to-end Optimization of ML Prediction Queries