Automated Machine Learning
Machine Learning Engineer, Core Modeling Team
Teaches sometimes: AI, Machine Learning, Summer/Winter ML Schools
Competes sometimes: currently holds an Expert rank, top 2% worldwide
Why This Talk?
Automated Machine Learning
Seriously misunderstood creature, AutoML is
Image copyright © Warner Bros. Source: syfy.com
AutoML provides methods and processes to make ML available
for non-ML experts, to improve efficiency of ML and to accelerate
research on ML.
[www.automl.org]
Automated machine learning (AutoML) is the process of automating, end to end, the application of machine learning to real-world problems.
Top 3 Questions Peers Ask Me
#1: Will all data scientists lose their jobs soon?
Image copyright © 20th Century Fox. Source: youtube.com
#2: AutoML is about a neural network
generating neural networks, right?
NIPS 2016 conference. Source: blog.ought.com
#3: DS/ML requires serious human expertise.
How can automation ever be “better”?
Image copyright © USA Network. Source: wallpaperplay.com
Three Levels of Scope
1. Academic AutoML
   Advance human knowledge in fundamental AutoML methods.
   Get publications, citations, degrees, inspire R&D.
2. Libraries and Open-Source AutoML Software
   Refine academic ideas to technical feasibility, gain product engineering experience.
   Find peers, validate ideas with early adopters, build a community of practitioners.
3. Commercial AutoML Product (the focus of this talk)
   Build a profitable business by solving real-world problems and delivering value at scale
   (from small businesses and NGOs to the largest corporations and governments).
Some Background
🦄 Unicorn startup from Boston, MA
🗓 Developing AutoML products since 2012
💵 $430M in funding (Series E)
🏢 Hundreds of enterprise customers (including ⅓ of Fortune 50)
🔮 1.3 billion ML models built so far
👨‍💻 1000 employees @ ~50 locations around the globe
“DataRobot sets the standard for augmented data science and machine learning”
– Gartner Magic Quadrant for DS and ML Platforms, 2019
“DataRobot leads the pack with a broad set of robust capabilities”
– Forrester New Wave, Automation-Focused ML Solutions, Q2 2019
Recap: DS Value Generation
[Diagram: a Business User brings a Problem and Raw Data → Data Science → Automation, Optimization, and Actionable Insights → Bottom Line Improvement & Executive Decision Support]
[Diagram: the same value chain, with the Data Science step expanded into the full ML lifecycle: Problem Framing → Data Prep & Annotation → Data Ingestion & Management → Partitioning → EDA & Quality Assessment → Feature Engineering → Modeling → Model Tuning → Evaluation & Selection → Software Construction → Deployment → Consumption → Model Maintenance → Risk & Compliance]
Persona: Data Scientist
[Diagram: the same lifecycle stages, annotated with the data scientist’s attitude toward each: “needs domain knowledge to do right”, “hates doing”, “enjoys doing and wants to keep doing it”, “often lacks skills or methodology to do right”]
In large organizations, a lot of “throwing over the wall” happens at the hand-off to production
~85% of DS projects never make it to production [bit.ly/30PGOZM]
Recall The Earlier Definitions:
1. “Accessible for non-ML experts”
2. “End-to-end automation”
[Diagram: the full ML lifecycle again, from Problem Framing through Model Maintenance and Risk & Compliance. The vast majority of ML research focuses on a narrow slice of it (modeling and tuning); the vast majority of AutoML research and emerging products focuses there as well, while the entire end-to-end lifecycle is what is actually needed to deliver value in the real world.]
Sculley et al. (Google)
“Hidden Technical Debt in Machine Learning Systems” [NIPS 2015]
Ideal Goal
[Diagram: the Business User supplies Raw Data and a Definition of the Business Objective → AutoML → Automatically Deployed Application with Monitoring and Continual Learning]
● Lots of capable and motivated people in non-DS teams that know the domain and can deliver value
● Data scientists focus on strategic projects, mentor “citizen data scientists”, and help with problem setup
Good AutoML:
1. Empowers non-experts but does not alienate experts.
2. Augments user’s domain knowledge with automation and fast iteration.
3. Provides guardrails and trust.
Enables more people to get more results with better quality.
Source: MovieFigures via youtube.com
Interesting Use Case: Model Factory
AutoML
● Models specific to data subsets (e.g. propensity per SKU)
● Models specific to time ranges (e.g. +1 day, +1 month forecast)
● Short-lived models with rapid refresh cycle (e.g. fraud, malware)
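To make the pattern concrete, here is a minimal model-factory sketch; plain scikit-learn stands in for a real AutoML backend, and the column names ("sku", "bought") are hypothetical.

```python
# A minimal "model factory" sketch: one model per data segment.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def build_model_factory(df: pd.DataFrame, segment_col: str, target_col: str) -> dict:
    """Train and return one propensity model per segment value (e.g. per SKU)."""
    models = {}
    for segment, group in df.groupby(segment_col):
        X = group.drop(columns=[segment_col, target_col])
        y = group[target_col]
        models[segment] = GradientBoostingClassifier().fit(X, y)
    return models

# Usage (hypothetical data): one propensity model per SKU.
# factory = build_model_factory(sales_df, segment_col="sku", target_col="bought")
# factory["SKU-123"].predict_proba(new_rows)
```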
Interesting Challenges
of Building an AutoML Product
[Diagram: the value chain and full ML lifecycle from earlier, repeated; the following slides walk through the challenges stage by stage.]
Problem Framing
● Automatic detection of the modeling problem from data layout
(regression, binary, multiclass, multilabel, ranking, recommendation, ...)
● Are there datetime features in the data? Maybe it’s a time series forecasting
problem? Maybe there are multiple series along the same axis?
● Maybe there’s no target at all? (E.g., user is interested in anomaly detection)
● If there’s a target, can we figure out its distribution and recommend a reliable
optimization metric?
● Are there any prior constraints? (E.g., prediction range, monotonicity, weights)
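As a flavor of what such detection looks like, here is a toy sketch that infers the task from the target column alone; real systems use far richer heuristics, and the cutoff values are purely illustrative.

```python
# Toy problem-type inference from the target column. Thresholds are illustrative.
from typing import Optional
import pandas as pd
from pandas.api.types import is_numeric_dtype

def infer_task(df: pd.DataFrame, target: Optional[str]) -> str:
    if target is None:
        return "anomaly_detection"  # no target column: unsupervised setting
    y = df[target].dropna()
    if y.nunique() == 2:
        return "binary_classification"
    if not is_numeric_dtype(y) or y.nunique() <= 20:  # illustrative cutoff
        return "multiclass_classification"
    return "regression"
```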
Data Preparation and Annotation
● Does the data have valid tabular shape? Are there various data sources to merge?
ⓘ Deep Feature Synthesis: automatic generation of features from snowflake-schema relational data
J. Kanter, K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. DSAA 2015.
“featuretools” Python package: https://github.com/Featuretools/featuretools
ⓘ Snorkel: rapid training data creation with weak supervision
https://github.com/snorkel-team/snorkel
https://arxiv.org/abs/1711.10160
● Is the target defined everywhere? Do we need weak supervision or active learning?
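A minimal Deep Feature Synthesis sketch using the featuretools API as of its 2019-era releases (newer versions renamed several of these calls); the customers/orders DataFrames and column names are hypothetical.

```python
import featuretools as ft

# customers_df and orders_df are hypothetical pandas DataFrames.
es = ft.EntitySet(id="retail")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers_df,
                              index="customer_id")
es = es.entity_from_dataframe(entity_id="orders", dataframe=orders_df,
                              index="order_id", time_index="order_date")
es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                         es["orders"]["customer_id"]))

# DFS stacks aggregation/transform primitives across the relationship graph.
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers",
                                      max_depth=2)
```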
Partitioning
● Automatically recommend a problem-aware validation schema
● Are there group relationships between rows? Need different validation
● Is datetime an important dimension in the dataset? Need different validation
● Seasonal time series detected? Validation needs to account for the seasonal cycles
● Do we need to oversample/undersample/stratify/augment?
● Do not reuse the same validation set for multiple purposes (HPO, early stopping, model ranking)
● The entire modeling pipeline must be robust enough to never peek into the holdout
until the final model deployment
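A small sketch of how such a recommendation could be wired up with scikit-learn splitters; the split counts are illustrative, and a production system would layer many more checks on top.

```python
# Problem-aware validation-scheme selection, in miniature.
from sklearn.model_selection import (GroupKFold, KFold, StratifiedKFold,
                                     TimeSeriesSplit)

def recommend_cv(has_time: bool, has_groups: bool, is_classification: bool):
    if has_time:
        return TimeSeriesSplit(n_splits=5)  # train on the past, validate on the future
    if has_groups:
        return GroupKFold(n_splits=5)       # keep each group within a single fold
    if is_classification:
        return StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return KFold(n_splits=5, shuffle=True, random_state=42)
```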
EDA & Quality Assessment
● Automatic data type / column intent detection
ⓘ Exercise: think how you would distinguish between numerics, ordinals, categoricals, text, datetime
● Are there features without meaningful information?
(IDs, constants, duplicates, extreme cardinality or sparsity, noise)
● Are there features that are a potential source of leakage?
ⓘ Watch my earlier talk :P https://github.com/YuriyGuts/odsc-target-leakage-workshop
● Is the format of the data consistent over time? (typical issue for long-lived systems)
● Are there outliers that are dangerous for the chosen optimization objective?
● Can be super insightful to view the data over time, over space, over target label
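One possible toy answer to the exercise above, assuming purely heuristic thresholds; it deliberately ignores harder cases such as datetimes stored as strings.

```python
# Heuristic column-intent detection. Thresholds are illustrative only.
import pandas as pd

def infer_intent(col: pd.Series) -> str:
    if pd.api.types.is_datetime64_any_dtype(col):
        return "datetime"
    if pd.api.types.is_numeric_dtype(col):
        return "numeric"
    sample = col.dropna().astype(str)
    if sample.nunique() / max(len(sample), 1) > 0.9:
        return "id_or_text"  # near-unique values: likely an ID or free text
    if sample.str.split().str.len().mean() > 3:
        return "text"        # multi-word values look like free text
    return "categorical"
```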
Feature Engineering
● Needs to be model-aware! Linear, tree-based, neural, factorization machine (FM), and classic time series models require different preprocessing and benefit from different feature engineering techniques
● Needs to be datatype-aware
ⓘ For example, correctly distinguishing between a text feature and a categorical feature pays off here.
By the way, language matters for text. We should auto-detect it too and derive features accordingly.
● Needs to be leakage-free (no peeking into test set, very careful peeking at the target)
● Needs to work at prediction time when the model is deployed, using the same raw
data format but with no ground truth available
● Resources are finite! Latency and scalability are just as important as accuracy
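A sketch of model-aware preprocessing with scikit-learn: linear models get scaling and one-hot encoding, tree ensembles get cheap ordinal codes. Collapsing everything into two families is a simplification.

```python
# Model-aware preprocessing: the same columns, prepared differently per model family.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

def make_preprocessor(model_family: str, num_cols: list, cat_cols: list):
    if model_family == "linear":
        return ColumnTransformer([
            ("num", StandardScaler(), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ])
    # Tree-based models: no scaling needed, ordinal codes suffice.
    return ColumnTransformer([
        ("num", "passthrough", num_cols),
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=-1), cat_cols),
    ])
```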
Automated Machine Learning
Modeling
● Accuracy is a must. Every percent pays off. Auto-ensembling can help too.
ⓘ Steward Healthcare: www.datarobot.com/casestudy/reducing-costs-with-datarobot-at-steward-health-care/
More accurate predictions: –1% in nurse hours saves $2,000,000/year; –0.1% of patient stay saves $10,000,000/year
● No Free Lunch Theorem is very relevant, especially with prior business constraints.
● Not enough to just have a “list of models”: need to construct pipelines dynamically.
ⓘ Zoubin Ghahramani. Keynote at ICML 2018 AutoML workshop.
● Training from scratch / exhaustive search vs. transfer learning / metalearning.
● Efficient data usage, CPU/GPU and RAM usage, training time, and prediction latency
are just as important as accuracy. Model search can also be constrained by time.
● Every model must be serializable, transferable, reproducible, autonomous.
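A sketch of constructing candidate pipelines dynamically instead of keeping a static model list, reusing make_preprocessor from the Feature Engineering sketch above; the candidate set itself is illustrative.

```python
# Dynamic pipeline construction: pair each estimator family with
# preprocessing that suits it (make_preprocessor defined earlier).
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def candidate_pipelines(num_cols: list, cat_cols: list) -> list:
    candidates = [
        ("linear", LogisticRegression(max_iter=1000)),
        ("tree", RandomForestClassifier(n_estimators=200)),
    ]
    return [
        Pipeline([("prep", make_preprocessor(family, num_cols, cat_cols)),
                  ("model", estimator)])
        for family, estimator in candidates
    ]
```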
Model Tuning
● Automated hyperparameter optimization (both for preprocessing and models)
ⓘ An extensively studied problem in AutoML research.
See www.automl.org/book/ for current approaches and libraries.
tl;dr: scikit-optimize, hyperopt, BOHB.
● Automated feature reduction / redundancy detection
● Models need to have well-calibrated probability outputs
ⓘ Guo et al. On Calibration of Modern Neural Networks, ICML 2017 arxiv.org/abs/1706.04599
● Pipeline optimization (also: Neural Architecture Search)
ⓘ Also a subject of extensive academic interest
See www.automl.org/book/ for current approaches
Pipeline optimization AutoML powered by genetic programming: TPOT https://github.com/EpistasisLab/tpot
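A minimal hyperparameter-optimization sketch with scikit-optimize, one of the libraries named above; the search space and iteration budget are illustrative.

```python
# Bayesian HPO over a gradient boosting model with scikit-optimize.
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import GradientBoostingClassifier

search = BayesSearchCV(
    GradientBoostingClassifier(),
    {
        "n_estimators": Integer(50, 500),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "max_depth": Integer(2, 8),
    },
    n_iter=32, cv=5, scoring="neg_log_loss",
)
# search.fit(X_train, y_train); search.best_params_
```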
Genetic Pipeline Optimization
Source: github.com/pprett/aml-class-19
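For comparison, genetic pipeline optimization with TPOT takes only a few lines; the generation/population budgets below are illustrative.

```python
# Genetic programming over whole pipelines with TPOT.
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20,
                      cv=5, random_state=42, verbosity=2)
# tpot.fit(X_train, y_train)
# tpot.export("best_pipeline.py")  # emits the winning sklearn pipeline as code
```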
Evaluation and Selection
● Fair model comparison and ranking on out-of-sample data
● Analysis of data efficiency (learning curves), resource usage, prediction throughput
● Analysis of model stability out-of-sample
Typical issue: how well a time series model handles different forecasting horizons
● Recommending the best model, considering accuracy, transparency, and speed
● Making use of the data: retraining the best model on more data if needed
ⓘ Quiz: what to do with hyperparameters?
● Fair “apples-to-apples” comparison with externally developed models
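A sketch of fair apples-to-apples ranking: every candidate is scored with the same CV splits and the same metric, and the holdout is never touched.

```python
# Rank candidate models on identical out-of-sample splits.
from sklearn.model_selection import StratifiedKFold, cross_val_score

def rank_models(candidates: dict, X, y, metric: str = "roc_auc") -> list:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # shared splits
    scores = {name: cross_val_score(model, X, y, cv=cv, scoring=metric).mean()
              for name, model in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```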
Risk and Compliance
● Explaining feature importance, feature interactions, partial dependence
● Explaining the kinds and ranges of tuned hyperparameters and optimal values
● Explaining individual predictions in terms of original features
● Feature sensitivity analysis (effect of perturbations on predictions)
● “What-if” simulations and analysis (e.g. for ethical evaluation)
● Access to preprocessed/final modeling data for external reproducibility
● Auto-documenting the methodology, results, and insights!
● All of the above should be available for every model!
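A sketch of two of these model-agnostic explanation outputs using scikit-learn's inspection module; the report structure itself is hypothetical.

```python
# Permutation importance plus partial dependence for a fitted model.
from sklearn.inspection import partial_dependence, permutation_importance

def compliance_report(model, X_val, y_val, feature_names: list) -> dict:
    imp = permutation_importance(model, X_val, y_val, n_repeats=10,
                                 random_state=42)
    ranked = sorted(zip(feature_names, imp.importances_mean),
                    key=lambda kv: kv[1], reverse=True)
    top_idx = feature_names.index(ranked[0][0])
    pdep = partial_dependence(model, X_val, features=[top_idx])
    return {"importance": ranked, "partial_dependence": pdep}
```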
Software Construction & Deployment
● Model needs to use the same dependencies it used during training.
OSS scientific packages also have bugs and breaking changes!
● Edge computing may require the model to be exportable and available offline
ⓘ Exercise: think how you would make a full model pipeline available for scoring on iOS, Android, Raspberry Pi, ...
● Application needs to be generated according to the initial business problem setup
(e.g. do we need to explain, predict, or prescribe/optimize). Needs to expose API/UI.
● IT policies and compliance have the same relevance here as for any other enterprise
software: OSS and security audits, legacy software compatibility
(Deployment targets range from cloud-native Docker/Kubernetes stacks all the way down to legacy CentOS 6 servers.)
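A sketch of freezing a model together with an environment manifest so that scoring can verify it runs against the same dependency versions as training; the manifest format is hypothetical.

```python
# Persist the pipeline plus a record of the training environment.
import json
import sys

import joblib
import sklearn

def save_artifact(pipeline, path: str) -> None:
    joblib.dump(pipeline, f"{path}/model.joblib")
    manifest = {"python": sys.version,
                "sklearn": sklearn.__version__}  # pin everything the model imports
    with open(f"{path}/manifest.json", "w") as f:
        json.dump(manifest, f)

# At load time: refuse to score if the manifest does not match the runtime.
```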
Model Maintenance
● Need to distinguish service health vs. input data health vs. model health
● Automated feature drift / response drift detection
The world never stops changing
● Feedback loop detection
And we never stop changing the world
● Continuous learning
● Challenger models / auto-fallback to a more robust model
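One common drift check is the Population Stability Index (PSI) between the training distribution and recent scoring data; a minimal sketch follows, with an illustrative bin count and the conventional 0.2 "investigate" threshold.

```python
# Population Stability Index between training and live feature distributions.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Quantile bin edges from the training data (deduplicated for discrete values).
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    e_pct = np.histogram(expected, edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# if psi(train_feature, live_feature) > 0.2: trigger retraining or fallback
```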
References
1. Rich Caruana (Microsoft Research). Open Research Problems in AutoML
https://sites.google.com/site/automlwsicml15/
2. AutoML: Methods, Systems, Challenges
http://automl.org/book/
3. Peter Prettenhofer: AutoML Class @ UCU Data Science School 2019
https://github.com/pprett/aml-class-19
yuriy.guts@gmail.com
linkedin.com/in/yuriyguts
