Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC London 2017 - Oct 13, 2017

BUILDING GOOGLE CLOUD ML ENGINE
FROM SCRATCH WITH PIPELINE.AI
ODSC CONFERENCE
LONDON, ENGLAND
OCTOBER 13, 2017
CHRIS FREGLY,
FOUNDER @ PIPELINE.AI

INTRODUCTIONS: ME
§ Chris Fregly, Research Engineer @ PipelineAI
§ Formerly Netflix, Databricks, IBM Spark Center
§ Advanced Spark and TensorFlow Meetup
Please Join Our 40,000+ Members Globally!
Contact Me
chris@pipeline.ai
@cfregly
*San Francisco
*Chicago
*Austin
*Washington DC
*London

INTRODUCTIONS: YOU
§ Software Engineer or Data Scientist interested in optimizing
and deploying TensorFlow models to production
§ Assume you have a working knowledge of TensorFlow

CONTENT BREAKDOWN
§ PipelineAI Features
§ 50% Training Optimizations (GPUs, Pipeline, XLA+JIT)
§ 50% Prediction Optimizations (XLA+AOT, TF Serving)
§ Why Heavy Focus on Predicting?
§ Training: boring batch O(num_data_scientists)
§ Inference: exciting real-time O(num_users_of_app)

100% OPEN SOURCE CODE
§ https://github.com/PipelineAI/pipeline/
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu

AGENDA
§ Experiment Safely in Production
§ Tune Both Model + Runtime Parameters
§ Compare Models Both Offline + Online
§ Shift Traffic (Across Clouds) to Winning Model
§ Optimize TensorFlow Training
§ GPUs + Ingestion + Training Pipeline
§ XLA JIT Compiler
§ Optimize TensorFlow Inference
§ XLA AOT Compiler + Graph Transform Tool (GTT)
§ TensorFlow Serving

EXPERIMENT SAFELY IN PRODUCTION
§ Setup Experiments Directly from Jupyter Notebooks
§ Deploy to 1% Prod Traffic
§ Or Deploy in Shadow Mode
§ Tear-Down Experiments Quickly

MODEL + RUNTIME PACKAGING
§ Package Model + Runtime into Immutable Docker Image
§ Same Package: Local, Dev, and Prod
§ No Dependency Surprises in Production

OPTIMIZE MODEL + RUNTIME AS ONE
§ Tune Model Params + Runtime Configs Together
§ Generate Native CPU + GPU Code
§ Quantize Model Weights + Activations
§ Swap Runtimes: TF Serving, TensorRT, CPU, GPU, TPU

NVIDIA TENSORRT RUNTIME
§ Performs Post-Training Optimizations
§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving

COMPARE MODELS OFFLINE + ONLINE
§ Offline Metrics
§ Training Accuracy
§ Validation Accuracy
§ Online / Real-Time Metrics
§ Prediction Precision
§ Latency + Throughput

PREDICTION PROFILING + TUNING
§ Pinpoint Performance Bottlenecks
§ Fine-Grained Prediction Metrics
§ 3 Logic Steps in a Prediction
1. transform_request()
2. predict()
3. transform_response()

SHIFT TRAFFIC TO MAXIMIZE REVENUE
§ Shift Traffic to Winning Model using Bandit AI Algorithms

SHIFT TRAFFIC TO MINIMIZE COST
§ Real-Time Cost Per Prediction
§ Across Clouds + On-Premise
§ Bandit-Based Explore/Exploit

VIEW LIVE PREDICTION STREAMS
§ Visually Compare Real-Time Predictions

CONTINUOUS MODEL TRAINING
§ Identify and Fix Borderline Predictions (50-50% Confidence)
§ Fix Along Class Boundaries
§ Retrain on New Labeled Data
§ Enables Crowd Sourcing
§ Game-ify Labeling Process

SETTING UP TENSORFLOW WITH GPUS
§ Very Painful!
§ Especially inside Docker
§ Use nvidia-docker
§ Especially on Kubernetes!
§ Use Kubernetes 1.7+
§ http://pipeline.ai for GitHub + DockerHub Links

GPU HALF-PRECISION SUPPORT
§ FP32 is “Full Precision”, FP16 is “Half Precision”
§ Supported by Pascal P100 (2016) and Volta V100 (2017)
§ Flexible FP32 GPU Cores Can Fit 2 FP16’s for 2x Throughput!
§ Half-Precision is OK for Approximate Deep Learning Use Cases

VOLTA V100 RECENTLY ANNOUNCED
§ 84 Streaming Multiprocessors (SM’s)
§ 5,376 GPU Cores
§ 672 Tensor Cores (ie. Google TPU)
§ Mixed FP16/FP32 Precision
§ More Shared Memory
§ New L0 Instruction Cache
§ Faster L1 Data Cache
§ V100 vs. P100 Performance
§ 12x TFLOPS @ Peak Training
§ 6x Inference Throughput

V100 AND CUDA 9
§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread
§ Still Optimized for SIMT (Same Instruction Multiple Thread)
§ SIMT units automatically scheduled together
§ Explicit Thread Synchronization
P100 V100

GPU CUDA PROGRAMMING
§ Barbaric, But Fun Barbaric
§ Must Know Hardware Very Well
§ Hardware Changes are Painful
§ Many Great Debuggers Exist

CUDA STREAMS
§ Asynchronous I/O Transfer
§ Overlap Compute and I/O
§ Keeps GPUs Saturated
§ Fundamental to Queue Framework in TensorFlow

TRAINING TERMINOLOGY
§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix
§ Operations: MatMul, Add, SummaryLog,…
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed inputs into Placeholder
§ Fetches: Fetch output from Operation
§ Variables: What we learn through training
§ aka “weights”, “parameters”
§ Devices: Hardware device on which we train
-TensorFlow-
Trains
Variables
-User-
Fetches
Outputs
-User-
Feeds
Inputs
-TensorFlow-
Performs
Operations
-TensorFlow-
Flows
Tensors
with tf.device(“/gpu:0,/gpu:1”)

TENSORFLOW MODEL
§ MetaGraph
§ Combines GraphDef and Metadata
§ GraphDef
§ Architecture of your model (nodes, edges)
§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external : internal tensors
§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are “frozen” into Constants when deployed for inference
GraphDef
x
W
mul add
b
MetaGraph
Metadata
Assets
SignatureDef
Tags
Version
Variables:
“W” : 0.328
“b” : -1.407

TENSORFLOW SESSION
Session
graph: GraphDef
Variables:
“W” : 0.328
“b” : -1.407
Variables are
Periodically
Checkpointed
GraphDef
is Static

EXTEND EXISTING DATA PIPELINES
§ Data Processing
§ HDFS/Hadoop
§ Spark
§ Containers
§ Docker
§ Google Container
§ Container Orchestrators
§ Kubernetes
§ Mesos
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow-hadoop</artifactId>
</dependency>
https://github.com/tensorflow/ecosystem

DON’T USE FEED_DICT
§ Not Optimized for Production Pipelines
§ feed_dict Requires Python <-> C++ Serialization
§ Single-threaded, Synchronous, SLOW!
§ Can’t Retrieve Until Current Batch is Complete
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API

QUEUES
§ More than just a traditional Queue
§ Perform I/O, pre-processing, cropping, shuffling
§ Pulls from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Use CPUs to free GPUs for compute
§ Uses CUDA Streams
§ Helps saturate CPUs and GPUs

QUEUE CAPACITY PLANNING
§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM
§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores
§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)

DETECT UNDERUTILIZED CPUS, GPUS
§ Instrument training code to generate “timelines”
§ Analyze with Google Web
Tracing Framework (WTF)
§ Monitor CPU with `top`, GPU with `nvidia-smi`
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))

SINGLE NODE, MULTI-GPU TRAINING
§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU
§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU
§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TensorFlow to attempt JIT Compile
with tf.device(“/cpu:0”):
with tf.device(“/gpu:0”):
with tf.device(“/gpu:1”):
GPU 0 GPU 1

MULTI-NODE DISTRIBUTED TRAINING
§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS
Worker0 Worker0
Worker1
Worker0 Worker1 Worker2
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0 gpu1
gpu2 gpu3
gpu0
gpu1
gpu0
gpu0

SYNCHRONOUS VS. ASYNCHRONOUS
§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients
§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes don’t update PS
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!

BATCH NORMALIZATION
§ Each mini-batch may have wildly different distributions
§ Normalize per batch (and layer)
§ Speeds up training!!
§ Weights are learned quicker
§ Final model is more accurate
§ Final mean and variance will be folded into Graph later
-- Pretty Much Always Use Batch Normalization! --
z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)
a_mean, a_var = tf.nn.moments(a, [0])
scale = tf.Variable(tf.ones([depth/channels]))
beta = tf.Variable(tf.zeros ([depth/channels]))
bn = tf.nn.batch_normalizaton(a, a_mean, a_var,
beta, scale, 0.001)

OPTIMIZE GRAPH EXECUTION ORDER
§ https://github.com/yaroslavvb/stuff
Linearize to
minimize graph
memory usage

SEPARATE TRAINING + VALIDATION
§ Separate Training and Validation Clusters
§ Validate Upon Checkpoint
§ Avoids Resource Contention
Training
Cluster
Validation
Cluster
Parameter Server
Cluster

XLA FRAMEWORK
§ Accelerated Linear Algebra (XLA)
§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability
§ Helps TensorFlow Stay Both Flexible and Performant

XLA HIGH LEVEL OPTIMIZER (HLO)
§ Compiler Intermediate Representation (IR)
§ Independent of Source and Target Language
§ Define Graphs using HLO Operations
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)

JIT COMPILER
§ Just-In-Time Compiler
§ Built on XLA Framework
§ Goals:
§ Reduce memory movement – especially useful on GPUs
§ Reduce overhead of multiple function calls
§ Similar to Spark Operator Fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Scope to session, device, or `with jit_scope():`

VISUALIZING JIT COMPILER IN ACTION
Before After
Google Web Tracing Framework:
http://google.github.io/tracing-framework/
from tensorflow.python.client import timeline
trace =
timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
trace_file.write(
trace.generate_chrome_trace_format(show_memory=True))

VISUALIZING FUSING OPERATORS
pip install graphviz
dot -Tpng
/tmp/hlo_graph_1.w5LcGs.dot
-o hlo_graph_1.png
GraphViz:
http://www.graphviz.org
hlo_*.dot files generated by XLA

AOT COMPILER
§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework
§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_libary header and object files to link into your app
§ Commonly used for mobile device inference graph
§ Currently, only CPU x86-64 and ARM are supported - no GPU

GRAPH TRANSFORM TOOL (GTT)
§ Optimize Trained Models for Inference
§ Remove training-only Ops (checkpoint, drop out, logs)
§ Remove unreachable nodes between given feed -> fetch
§ Fuse adjacent operators to improve memory bandwidth
§ Fold final batch norm mean and variance into variables
§ Round weights/variables improves compression (ie. 70%)
§ Quantize (FP32 -> INT8) to speed up math operations

GRAPH TRANSFORM TOOL
transform_graph
--in_graph=tensorflow_inception_graph.pb ß Original Graph
--out_graph=optimized_inception_graph.pb ß Transformed Graph
--inputs='Mul' ß Feed (Input)
--outputs='softmax' ß Fetch (Output)
--transforms=' ß List of Transforms
strip_unused_nodes
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
quantize_weights
quantize_nodes'

AFTER STRIPPING UNUSED NODES
§ Optimizations
§ strip_unused_nodes
§ Results
§ Graph much simpler
§ File size much smaller

AFTER REMOVING UNUSED NODES
§ Optimizations
§ remove_nodes
§ Results
§ Pesky nodes removed
§ File size a bit smaller

AFTER FOLDING CONSTANTS
§ Optimizations
§ remove_nodes
§ fold_constants
§ Results
§ Placeholders (feeds) -> Variables*
(*Why Variables and not Constants?)

AFTER FOLDING BATCH NORMS
§ Optimizations
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ Results
§ Graph remains the same
§ File size approximately the same

WEIGHT QUANTIZATION
§ FP16 and INT8 Are Computationally Simpler and Faster
§ Weights/Variables are Constants
§ Easy to Linearly Quantize

AFTER QUANTIZING WEIGHTS
§ Optimizations
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ Results
§ Graph is same, file size is smaller, compute is faster

ACTIVATION QUANTIZATION
§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a “representative” dataset
§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize…
KL_divergence(ref_distribution, quant_distribution)
§ Not Much Time or Data is Required (Minutes on Commodity Hardware)

AFTER ACTIVATION QUANTIZATION
§ Optimizations
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes (activations)
§ Results
§ Larger graph, needs calibration!
Requires additional
freeze_requantization_ranges

FREEZING MODEL FOR DEPLOYMENT
§ Optimizations
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes
§ freeze_graph
§ Results
§ Variables -> Constants
Finally!
We’re Ready to Deploy!!

TENSORFLOW SERVING OVERVIEW
§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …
§ Bundle
§ GraphDef, Variables, Metadata, …
§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”}
§ Version
§ Every Model Has a Version Number (Integer)
§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, …

MULTI-HEADED INFERENCE
§ Multiple “heads” (aka “responses”) from 1 model prediction
§ Optimizes bandwidth, CPU, latency, memory, coolness
§ Response includes both class and scores
§ Inputs sent only once
§ Feed scores into ensemble models
§ Use model for feature engineering

REQUEST BATCHING
§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM
§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM
§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores
§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM
Reaching either threshold
will trigger a batch

YOU JUST LEARNED…
§ Experiment Safely in Production
§ Tune Both Model + Runtime Parameters
§ Compare Models Both Offline + Online
§ Shift Traffic (Across Clouds) to Winning Model
§ Optimize TensorFlow Training
§ GPUs + Ingestion + Training Pipeline
§ XLA JIT Compiler
§ Optimize TensorFlow Inference
§ XLA AOT Compiler + Graph Transform Tool (GTT)
§ TensorFlow Serving

THANKS! ANY QUESTIONS?
§ https://github.com/PipelineAI/pipeline/
§ Please 🌟 this GitHub Repo!
§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu
Contact Me
chris@pipeline.ai
@cfregly

Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC London 2017 - Oct 13, 2017

More Related Content

What's hot

What's hot (20)

Similar to Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC London 2017 - Oct 13, 2017

Similar to Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC London 2017 - Oct 13, 2017 (20)

More from Chris Fregly

More from Chris Fregly (15)

Recently uploaded

Recently uploaded (20)

Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC London 2017 - Oct 13, 2017