PIPELINE.AI: HIGH PERFORMANCE MODEL
TRAINING & SERVING WITH GPUS…
…AND AWS SAGEMAKER, GOOGLE CLOUD ML,
AZURE ML & KUBERNETES!
CHRIS FREGLY
FOUNDER @ PIPELINE.AI
RECENT PIPELINE.AI NEWS
[News highlights, Sept 2017 and Dec 2017]
INTRODUCTIONS: ME
§ Chris Fregly, Founder & Engineer @PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Technology Center
§ Advanced Spark and TensorFlow Meetup
§ Please Join Our 60,000+ Global Members!!
Contact Me
chris@pipeline.ai
@cfregly
Global Locations
* San Francisco
* Chicago
* Austin
* Washington DC
* Dusseldorf
* London
INTRODUCTIONS: YOU
§ Software Engineer, Data Scientist, Data Engineer, Data Analyst
§ Interested in Optimizing and Deploying TF Models to Production
§ Nice to Have a Working Knowledge of TensorFlow (Not Required)
PIPELINE.AI IS 100% OPEN SOURCE
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Some VCs Value GitHub Stars @ $15,000 Each (?!)
PIPELINE.AI OVERVIEW
450,000 Docker Downloads
60,000 Users Registered for GA
60,000 Meetup Members
40,000 LinkedIn Followers
2,200 GitHub Stars
12 Enterprise Beta Users
WHY HEAVY FOCUS ON MODEL SERVING?
Model Training:
§ Batch & Boring
§ Offline in Research Lab
§ Pipeline Ends at Training
§ No Insight into Live Production
§ Small Number of Data Scientists
§ Optimizations Very Well-Known
§ 100's of Training Jobs per Day

Model Serving:
§ Real-Time & Exciting!!
§ Online in Live Production
§ Pipeline Extends into Production
§ Continuous Insight into Live Production
§ Huuuuuuge Number of Application Users
§ Many Optimizations Not Yet Utilized
§ 1,000,000's of Predictions per Sec
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
PACKAGE MODEL + RUNTIME AS ONE
§ Build Model with Runtime into Immutable Docker Image
§ Emphasize Immutable Deployment and Infrastructure
§ Same Runtime Dependencies in All Environments
§ Local, Development, Staging, Production
§ No Library or Dependency Surprises
§ Deploy and Tune Model + Runtime Together
Build Local Model Server A:

pipeline predict-server-build --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A \
                              --model-path=./models/tensorflow/mnist/
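
The --model-path directory above is assumed to hold a standard TensorFlow export. A minimal sketch of producing one with a toy MNIST softmax graph (this is not PipelineAI's example model; TF 1.x API):

import tensorflow as tf

# Toy MNIST softmax graph; export it as a TensorFlow SavedModel into a
# versioned subdirectory under the --model-path used above (assumed layout).
x = tf.placeholder(tf.float32, [None, 784], name='image')
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
scores = tf.nn.softmax(tf.matmul(x, W) + b, name='scores')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop omitted ...
    tf.saved_model.simple_save(
        sess,
        './models/tensorflow/mnist/1',   # versioned export directory
        inputs={'image': x},
        outputs={'scores': scores})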
LOAD TEST LOCAL MODEL + RUNTIME
§ Perform Mini-Load Test on Local Model Server
§ Immediate, Local Prediction Performance Metrics
§ Compare to Previous Model + Runtime Variations
Start Local Model Server A:

pipeline predict-server-start --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A

Load Test Local Model Server A:

pipeline predict --model-endpoint-url=http://localhost:8080 \
                 --test-request-path=test_request.json \
                 --test-request-concurrency=1000
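
For intuition, a rough stand-in for what such a mini-load test does, assuming a JSON-over-HTTP prediction endpoint (the payload format is an assumption, not PipelineAI's wire format):

import concurrent.futures
import json
import time

import requests

ENDPOINT = 'http://localhost:8080'   # local model server, as above
CONCURRENCY = 100

with open('test_request.json') as f:
    PAYLOAD = json.load(f)

def predict_once(_):
    # Fire one prediction request and measure its round-trip latency.
    start = time.time()
    resp = requests.post(ENDPOINT, json=PAYLOAD)
    return time.time() - start, resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(predict_once, range(CONCURRENCY)))

latencies = sorted(t for t, _ in results)
print('p50=%.3fs p95=%.3fs' % (latencies[len(latencies) // 2],
                               latencies[int(len(latencies) * 0.95)]))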
PUSH IMAGE TO DOCKER REGISTRY
§ Supports All Public + Private Docker Registries
§ DockerHub, Artifactory, Quay, AWS, Google, …
§ Or Self-Hosted, Private Docker Registry
Push Image to Docker Registry:

pipeline predict-server-push --image-registry-url=<your-registry> \
                             --image-registry-repo=<your-repo> \
                             --model-type=tensorflow \
                             --model-name=mnist \
                             --model-tag=A
CLOUD-BASED OPTIONS
§ AWS SageMaker
§ Released Nov 2017 @ re:Invent
§ Custom Docker Images for Training & Serving (e.g., PipelineAI Images)
§ Distributed TensorFlow Training through the Estimator API (sketched below)
§ Traffic Splitting for A/B Model Testing
§ Google Cloud ML Engine
§ Mostly Command-Line Based
§ Driving TensorFlow Open Source APIs (e.g., the Experiment API)
§ Azure ML
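
A sketch of the SageMaker Python SDK's TensorFlow Estimator flow (v1-era API; the entry-point script, IAM role, and S3 paths are placeholders, not from the talk):

from sagemaker.tensorflow import TensorFlow

# Configure distributed TensorFlow training on SageMaker-managed instances.
estimator = TensorFlow(
    entry_point='mnist_model.py',        # your training script
    role='SageMakerRole',                # IAM role with SageMaker access
    train_instance_count=2,              # distributed training
    train_instance_type='ml.p2.xlarge',  # GPU instances
    training_steps=10000,
    evaluation_steps=100)

# Train on data staged in S3, then deploy behind a real-time endpoint.
estimator.fit('s3://my-bucket/mnist/training-data')
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')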
TUNE MODEL + RUNTIME AS SINGLE UNIT
§ Model Training Optimizations
§ Model Hyper-Parameters (e.g., Learning Rate)
§ Reduced Precision (e.g., FP16 Half Precision)
§ Post-Training Model Optimizations
§ Quantize Model Weights + Activations from 32-bit to 8-bit
§ Fuse Neural Network Layers Together
§ Model Runtime Optimizations
§ Runtime Configs (e.g., Request Batch Size)
§ Different Runtimes (e.g., TensorFlow Lite, Nvidia TensorRT)
POST-TRAINING OPTIMIZATIONS
§ Prepare Model for Serving
§ Simplify Network
§ Reduce Model Size
§ Quantize for Fast Matrix Math
§ Some Tools
§ Graph Transform Tool (GTT)
§ tfcompile
[Figure: Linear Regression fit, after training vs. after optimizing]
pipeline optimize --optimization-list=[quantize_weights,tfcompile] \
                  --model-type=tensorflow \
                  --model-name=mnist \
                  --model-tag=A \
                  --model-path=./tensorflow/mnist/model \
                  --output-path=./tensorflow/mnist/optimized_model
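
Under the hood, weight quantization and layer fusion of this kind can be done with the Graph Transform Tool. A hedged sketch using GTT's TF 1.x Python binding (the frozen-graph path and tensor names are assumptions):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Load a frozen GraphDef produced by training.
graph_def = tf.GraphDef()
with tf.gfile.GFile('./tensorflow/mnist/model/frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

optimized = TransformGraph(
    graph_def,
    inputs=['image'],                # input tensor name(s)
    outputs=['scores'],              # output tensor name(s)
    transforms=['strip_unused_nodes',
                'fold_constants',
                'fold_batch_norms',  # fuse batch-norm into preceding weights
                'quantize_weights']) # 32-bit float -> 8-bit weights

with tf.gfile.GFile('./tensorflow/mnist/optimized_model/graph.pb', 'wb') as f:
    f.write(optimized.SerializeToString())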
RUNTIME OPTION: TENSORFLOW LITE
§ Post-Training Model Optimizations
§ Currently Supports iOS and Android
§ On-Device Prediction Runtime
§ Low-Latency, Fast Startup
§ Selective Operator Loading
§ 70KB Min - 300KB Max Runtime Footprint
§ Supports Accelerators (GPU, TPU)
§ Falls Back to CPU without Accelerator
§ Java and C++ APIs
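
A sketch of converting a SavedModel into a TensorFlow Lite flatbuffer for on-device serving. This uses the current tf.lite converter API; at the time of this talk the equivalent lived under tf.contrib.lite:

import tensorflow as tf

# Convert the SavedModel exported earlier (assumed path) into a .tflite
# flatbuffer that the on-device TensorFlow Lite runtime can load.
converter = tf.lite.TFLiteConverter.from_saved_model(
    './models/tensorflow/mnist/1')
tflite_model = converter.convert()

with open('mnist.tflite', 'wb') as f:
    f.write(tflite_model)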
RUNTIME OPTION: NVIDIA TENSORRT
§ Post-Training Model Optimizations
§ Specific to Nvidia GPUs
§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving
§ PipelineAI Supports TensorRT!
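
A sketch of the TF-TRT integration path (TF 1.x contrib API), which rewrites a frozen TensorFlow graph so supported subgraphs execute as TensorRT engines; the file path and output tensor name are assumptions:

import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# Load a frozen GraphDef, then let TF-TRT replace supported subgraphs
# with TensorRT engine ops for GPU-optimized inference.
graph_def = tf.GraphDef()
with tf.gfile.GFile('./tensorflow/mnist/model/frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=['scores'],
    max_batch_size=64,                 # tune alongside request batch size
    max_workspace_size_bytes=1 << 30,  # scratch space for TensorRT engines
    precision_mode='FP16')             # or 'INT8' after calibration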
DEPLOY MODELS SAFELY TO PROD
§ Deploy from CLI or Jupyter Notebook
§ Tear-Down or Rollback Models Quickly
§ Shadow Canary Deploy: e.g., Mirror 20% of Live Traffic
§ Split Canary Deploy: e.g., 97/2/1% Live Traffic Split
Start Production Model Cluster B:

pipeline predict-cluster-start --model-runtime=tflite \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=B \
                               --traffic-split=2

Start Production Model Cluster C:

pipeline predict-cluster-start --model-runtime=tensorrt \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=C \
                               --traffic-split=1

Start Production Model Cluster A:

pipeline predict-cluster-start --model-runtime=tfserving_gpu \
                               --model-type=tensorflow \
                               --model-name=mnist \
                               --model-tag=A \
                               --traffic-split=97
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
COMPARE MODELS OFFLINE & ONLINE
§ Offline, Batch Metrics
§ Validation + Training Accuracy
§ CPU + GPU Utilization
§ Live Prediction Values
§ Compare Relative Precision
§ Newly-Seen, Streaming Data
§ Online, Real-Time Metrics
§ Response Time, Throughput
§ Cost ($) Per Prediction
VIEW REAL-TIME PREDICTION STREAM
§ Visually Compare Real-Time Predictions
[Dashboard: prediction inputs alongside per-model prediction results & confidences for Models A, B, and C]
PREDICTION PROFILING AND TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Steps in Real-Time Prediction
1. transform_request()
2. predict()
3. transform_response()
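
A toy illustration of timing the three steps independently to pinpoint the bottleneck (the step bodies here are trivial stand-ins, not PipelineAI's implementation):

import json
import time

def transform_request(raw):            # 1. deserialize request into a tensor
    return json.loads(raw)['image']

def predict(tensor):                   # 2. run the model (stubbed out here)
    return [sum(tensor)]

def transform_response(output):        # 3. serialize the model output
    return json.dumps({'scores': output})

def handle_prediction(raw_request):
    t0 = time.time()
    tensor = transform_request(raw_request)
    t1 = time.time()
    output = predict(tensor)
    t2 = time.time()
    response = transform_response(output)
    t3 = time.time()
    print('transform_request=%.1fms predict=%.1fms transform_response=%.1fms'
          % ((t1 - t0) * 1e3, (t2 - t1) * 1e3, (t3 - t2) * 1e3))
    return response

print(handle_prediction('{"image": [0.1, 0.9, 0.0]}'))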
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, ADAPTIVE TRAFFIC ROUTING
§ A/B Tests
§ Inflexible and Boring
§ Multi-Armed Bandits
§ Adaptive and Exciting!
Adjust Traffic Routing Dynamically:

pipeline traffic-router-split --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag-list=[A,B,C] \
                              --model-weight-list=[1,2,97]
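
For intuition, a toy epsilon-greedy bandit over three model variants, with simulated per-prediction reward standing in for revenue (a sketch of the idea, not PipelineAI's actual routing algorithm):

import random

EPSILON = 0.1                          # fraction of traffic spent exploring
models = ['A', 'B', 'C']
counts = {m: 0 for m in models}
rewards = {m: 0.0 for m in models}

def route_request():
    if random.random() < EPSILON:      # explore: try a random variant
        return random.choice(models)
    return max(models, key=lambda m:   # exploit: best average reward so far
               rewards[m] / counts[m] if counts[m] else 0.0)

def record_reward(model, reward):
    counts[model] += 1
    rewards[model] += reward

# Simulated traffic: model C converts best, so it attracts the traffic.
true_rates = {'A': 0.02, 'B': 0.03, 'C': 0.05}
for _ in range(10000):
    m = route_request()
    record_reward(m, 1.0 if random.random() < true_rates[m] else 0.0)

print(counts)   # most requests end up routed to the winning model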
SHIFT TRAFFIC TO MAX(REVENUE)
§ Shift Traffic to Winning Model using AI Bandit Algos
SHIFT TRAFFIC TO MIN(CLOUD CO$T)
§ Based on Cost ($) Per Prediction
§ Cost Changes Throughout Day
§ Lose AWS Spot Instances
§ Google Cloud Becomes Cheaper
§ Shift Across Clouds & On-Prem
AGENDA
§ Deploy and Tune Models + Runtimes Safely in Prod
§ Compare Models Both Offline and Online
§ Auto-Shift Traffic to Winning Model or Cloud
§ Live, Continuous Model Training in Production
LIVE, CONTINUOUS MODEL TRAINING
§ The Holy Grail of Machine Learning
§ Q1 2018: PipelineAI Supports Continuous Model Training!
§ Kafka, Kinesis
§ Spark Streaming
PSEUDO-CONTINUOUS TRAINING
§ Identify and Fix Borderline Predictions (~50-50% Confidence; sketched below)
§ Fix Along Class Boundaries
§ Retrain on Newly-Labeled Data
§ Gamify the Labeling Process
§ Enable Crowd-Sourcing
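
A minimal sketch of flagging borderline predictions for re-labeling; the thresholds and record shapes are illustrative assumptions:

# Flag predictions whose top-class confidence is near 50% so they can be
# re-labeled (e.g., by a crowd) and fed back into training.
BORDERLINE_LOW, BORDERLINE_HIGH = 0.45, 0.55

def is_borderline(confidences):
    return BORDERLINE_LOW <= max(confidences) <= BORDERLINE_HIGH

predictions = [
    {'input_id': 1, 'confidences': [0.98, 0.02]},   # confident -> keep
    {'input_id': 2, 'confidences': [0.52, 0.48]},   # borderline -> re-label
]

relabel_queue = [p for p in predictions if is_borderline(p['confidences'])]
print([p['input_id'] for p in relabel_queue])        # -> [2]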
DEMO: TRAIN, DEPLOY, TEST MODEL
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
pipeline predict-server-build --model-type=tensorflow \
                              --model-name=mnist \
                              --model-tag=A \
                              --model-path=./models/tensorflow/mnist/
THANK YOU!!
§ https://github.com/PipelineAI/pipeline/
§ Please Star 🌟 this GitHub Repo!
§ Reminder: VCs Value GitHub Stars @ $15,000 Each (!!)
Contact Me
chris@pipeline.ai
@cfregly
