Bighead
Airbnb’s End-to-End
Machine Learning Infrastructure
Andrew Hoh
ML Infra @ Airbnb
● Background
● Design Goals
● Architecture Deep Dive
● Open Source
Background
Airbnb’s Product
A global travel community that offers magical end-to-end trips, including where you stay, what you do, and the people you meet.
Airbnb is already driven by Machine Learning
● Search Ranking
● Smart Pricing
● Fraud Detection
But there are *many* more opportunities for ML
● Paid Growth - Hosts
● Classifying / Categorizing Listings
● Experience Ranking + Personalization
● Room Type Categorizations
● Customer Service Ticket Routing
● Airbnb Plus
● Listing Photo Quality
● Object Detection - Amenities
● ....
Intrinsic Complexities with Machine Learning
● Understanding the business domain
● Selecting the appropriate Model
● Selecting the appropriate Features
● Fine tuning
Incidental Complexities with Machine Learning
● Integrating with Airbnb’s Data Warehouse
● Scaling model training & serving
● Keeping consistency between: Prototyping vs Production, Training vs Inference
● Keeping track of multiple models, versions, experiments
● Supporting iteration on ML models
→ ML models take on average 8 to 12 weeks to build
→ ML workflows tended to be slow, fragmented, and brittle
The ML Infrastructure Team addresses these challenges
Vision
Airbnb routinely ships ML-powered features throughout the product.
Mission
Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity.
Supporting the Full ML Lifecycle
Bighead: Design Goals
● Seamless
● Versatile
● Consistent
● Scalable
Seamless
● Easy to prototype, easy to productionize
● Same workflow across different frameworks
Versatile
● Supports all major ML frameworks
● Meets various requirements
○ Online and Offline
○ Data size
○ SLA
○ GPU training
○ Scheduled and Ad hoc
Consistent
● Consistent environment across the stack
● Consistent data transformation
○ Prototyping and Production
○ Online and Offline
Scalable
● Horizontal
● Elastic
Bighead: Architecture Deep Dive
[Architecture diagram]
● Prototyping: Redspot
● Lifecycle Management: Bighead Service / UI
● Production:
○ Batch Training + Inference: ML Automator (Airflow)
○ Real Time Inference: Deep Thought
● Execution Management: Bighead Library
● Environment Management: Docker Image Service
● Feature Data Management: Zipline
[Architecture diagram, repeated to introduce Redspot (Prototyping)]
Prototyping with Jupyter Notebooks
Jupyter Notebooks?
What are those?
“Creators need an immediate connection to what they are creating.” - Bret Victor
The ideal Machine Learning development environment?
● Interactivity and Feedback
● Access to Powerful Hardware
● Access to Data
Redspot
a Supercharged Jupyter Notebook Service
● A fork of the JupyterHub project
● Integrated with our Data Warehouse
● Access to specialized hardware (e.g. GPUs)
● File sharing between users via AWS EFS
● Packaged in a familiar JupyterHub UI
Redspot
a Supercharged Jupyter Notebook Service
Versatile
● Customized Hardware: AWS EC2 Instance Types, e.g. P3, X1
● Customized Dependencies: Docker Images, e.g. Py2.7, Py3.6 + TensorFlow
Consistent
● Promotes prototyping in the exact environment that your model will use in production
Seamless
● Integrated with Bighead Service & Docker Image Service via APIs & UI widgets
[Architecture diagram, repeated to introduce the Docker Image Service (Environment Management)]
Docker Image Service
Environment Customization
Docker Image Service - Why
● ML Users have a diverse, heterogeneous set of dependencies
● Need an easy way to bootstrap their own runtime environments
● Need to be consistent with the rest of Airbnb’s infrastructure
Docker Image Service - Dependency Customization
● Our configuration management solution
● A composition layer on top of Docker (see the sketch below)
● Includes a customization service that faces our users
● Promotes Consistency and Versatility
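To make the composition idea concrete, here is a minimal sketch (not Airbnb's actual service) of layering per-user pip dependencies onto a shared base image with the Docker SDK for Python; the base image name, tag scheme, and package list are hypothetical.

```python
# Illustrative sketch only -- not the actual Docker Image Service.
# Assumes the Docker SDK for Python ("pip install docker") and a local Docker daemon.
import io

import docker

BASE_IMAGE = "ml-infra/base-py36:latest"  # hypothetical shared base image


def build_user_image(username: str, pip_packages: list) -> str:
    """Compose a per-user image by layering extra pip packages onto the base image."""
    dockerfile = "\n".join([
        f"FROM {BASE_IMAGE}",
        f"RUN pip install --no-cache-dir {' '.join(pip_packages)}",
    ])
    tag = f"ml-infra/user-{username}:latest"
    client = docker.from_env()
    # fileobj lets us build from an in-memory Dockerfile instead of a checkout.
    client.images.build(fileobj=io.BytesIO(dockerfile.encode("utf-8")), tag=tag)
    return tag


# Example: an image with TensorFlow and XGBoost layered on top of the base.
# print(build_user_image("andrew", ["tensorflow==1.8.0", "xgboost==0.72"]))
```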
[Architecture diagram, repeated to introduce the Bighead Service (Lifecycle Management)]
Bighead Service
Model Lifecycle Management
Model Lifecycle Management - why?
● Tracking ML model changes is just as important as tracking code changes
● ML model work needs to be reproducible to be sustainable
● Comparing experiments before you launch models into production is critical
Bighead Service
Seamless
● Context-aware visualizations that carry over from the prototyping experience
Consistent
● Central model management service
● Single source of truth about the state of a model, its dependencies, and what’s deployed
[Architecture diagram, repeated to introduce the Bighead Library (Execution Management)]
Bighead Library
ML Models are highly heterogeneous in:
● Frameworks
● Training data
○ Data quality
○ Structured vs Unstructured (image, text)
● Environment
○ GPU vs CPU
○ Dependencies
ML Models are hard to keep consistent
● Data in production is different from data in training
● Offline pipeline is different from online pipeline
● Everyone does everything in a different way
Bighead Library
Versatile
● Pipeline on steroids - a compute graph for preprocessing / inference / training / evaluation / visualization
● Composable, Reusable, Shareable
● Supports popular frameworks
Consistent
● Uniform API
● Serializable - same pipeline used in training, offline inference, and online inference (see the sketch below)
● Fast primitives for preprocessing
● Metadata for trained models
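As a rough sketch of the “one serializable pipeline everywhere” idea, here is the same pattern expressed with scikit-learn (one of the supported frameworks) rather than the actual Bighead Library API; the toy training data is invented.

```python
# Minimal sketch of the "same pipeline for training and inference" idea,
# shown with scikit-learn; the real Bighead Library API differs.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical listing descriptions and a binary label).
train_texts = ["cozy room near the beach", "entire loft downtown", "shared room, great value"]
train_labels = [1, 0, 1]

# Preprocessing + model composed into one compute graph.
pipeline = Pipeline([
    ("features", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
pipeline.fit(train_texts, train_labels)  # offline training

# Serialize once; the identical artifact is loaded for offline (batch)
# and online inference, which avoids training/serving skew.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)
print(restored.predict(["bright studio near downtown"]))
```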
Bighead Library: ML Pipeline
Visualization - Pipeline
Easy to Serialize/Deserialize
Visualization - Training Data
Visualization - Transformer
[Architecture diagram, repeated to introduce Deep Thought (Real Time Inference)]
Deep Thought
Online Inference
Hard to make online model serving...
Easy to do
● Data scientists can’t launch models without an engineering team
● Engineers often need to rebuild models
Scalable
● Resource requirements vary across models
● Throughput fluctuates over time
Consistent with training
● Different data
● Different pipeline
● Different dependencies
Deep Thought
Seamless
● Integration with event logging, dashboards
● Integration with Zipline
Consistent
● Docker + Bighead Library: same data source, pipeline, and environment as in training (see the sketch below)
Scalable
● Kubernetes: model pods can easily scale
● Resource segregation across models
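A minimal sketch of that consistency guarantee, assuming a pickled pipeline artifact and a plain Flask endpoint (Deep Thought itself serves models as pods on Kubernetes; this only illustrates the idea): the service loads the exact artifact produced in training and does nothing but I/O around it.

```python
# Illustrative sketch of online inference reusing the training artifact;
# the artifact path and request schema are hypothetical.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical path to the artifact produced by the training job.
with open("model_artifact.pkl", "rb") as f:
    pipeline = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # Expects {"instances": [...]} and applies the same preprocessing + model
    # used offline, so online and offline scores cannot drift apart.
    instances = request.get_json()["instances"]
    return jsonify({"predictions": pipeline.predict(instances).tolist()})


if __name__ == "__main__":
    app.run(port=8080)
```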
[Architecture diagram, repeated to introduce ML Automator (Batch Training + Inference)]
ML Automator
Offline Training and Batch Inference
ML Automator - Why
Automated training, inference, and evaluation are necessary:
● Scheduling
● Resource allocation
● Saving results
● Dashboards and alerts
● Orchestration
ML Automator
Seamless
● Automate tasks via Airflow: generate DAGs for training, inference, etc. with appropriate resources (see the sketch below)
● Integration with Zipline for training and scoring data
Consistent
● Docker + Bighead Library: same data source, pipeline, and environment across the stack
Scalable
● Spark: distributed computing for large datasets
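For illustration, a hedged sketch of the kind of Airflow DAG that could be generated for one model; the dag_id, schedule, and the train/score callables are placeholders, not Airbnb's generated code.

```python
# Rough sketch of a per-model DAG with training followed by batch inference;
# the callables and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model(**context):
    """Placeholder: fit the pipeline on Zipline training features and save the artifact."""


def batch_inference(**context):
    """Placeholder: load the artifact and score the latest partition with Spark."""


with DAG(
    dag_id="bighead_example_model",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train", python_callable=train_model)
    score = PythonOperator(task_id="batch_inference", python_callable=batch_inference)
    train >> score  # score only after a successful training run
```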
[Architecture diagram, repeated to introduce Zipline (Feature Data Management)]
Zipline
ML Data Management Framework
Feature management is hard
● Inconsistent offline and online datasets
● Tricky to correctly generate training sets that depend on time
● Slow backfills of training sets
● Inadequate data quality checks or monitoring
● Unclear feature ownership and sharing
Zipline
Seamless
● Integration with Deep Thought and ML Automator
Consistent
● Consistent data across training/scoring
● Consistent data across development/production
● Point-in-time correctness across features to prevent label leakage (see the sketch below)
Scalable
● Leverages Spark and Flink to scale Batch and Streaming workloads
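To see what point-in-time correctness means in practice, here is a small pandas sketch (Zipline does the equivalent at scale with Spark and Flink); the listing IDs, timestamps, and feature are invented. Each label row is joined only to feature values observed at or before its timestamp, so future information cannot leak into training.

```python
# Sketch of a point-in-time join using pandas.merge_asof: a training row may
# only see feature values from on-or-before its own timestamp.
import pandas as pd

features = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "ts": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-01-15"]),
    "bookings_90d": [3, 7, 2],
}).sort_values("ts")

labels = pd.DataFrame({
    "listing_id": [1, 2],
    "ts": pd.to_datetime(["2018-01-20", "2018-01-10"]),
    "label": [1, 0],
}).sort_values("ts")

# direction="backward" takes the most recent feature value at or before the label ts.
training_set = pd.merge_asof(labels, features, on="ts", by="listing_id", direction="backward")
print(training_set)
# Listing 1 gets bookings_90d=3 (not the later value 7); listing 2's only feature
# value (observed 2018-01-15) is after its label ts, so it stays NaN.
```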
[Diagram: Zipline sits between the Data Warehouse / Production Data Stores and Model Training / Model Scoring, supplying Training Features and Scoring Features]
Zipline Addresses the Consistency Challenge Between Training and Scoring
Big Summary
End-to-End platform to build and deploy ML models to production that is seamless, versatile, consistent, and scalable
● Model lifecycle management
● Feature generation & management
● Online & offline inference
● Pipeline library supporting major frameworks
● Docker image customization service
● Multi-tenant training environment
Built on open source technology
● TensorFlow, PyTorch, Keras, MXNet, Scikit-learn, XGBoost
● Spark, Jupyter, Kubernetes, Docker, Airflow
To be Open Sourced
We are selecting our first couple of private collaborators. If you are interested, please email me at andrew.hoh@airbnb.com
Questions?
Appendix