JUSTIN LONG | justin@skymind.io
Deep Learning with GPUs in Production
AI By the Bay 2017
DEEPLEARNING4J &
KAFKA
April 2019
| OBJECTIVES
By the end of this presentation, you should…
1. Know the Deeplearning4j stack and how it works
2. Understand why aggregation is useful
3. Have an example of using Deeplearning4j and Kafka together
The Deeplearning4j stack
DL4J Ecosystem
- Deeplearning4j, ScalNet: build, train, and deploy neural networks on the JVM and in Spark.
- ND4J / libND4J: high-performance linear algebra on GPU/CPU. NumPy for the JVM.
- DataVec: data ingestion, normalization, and vectorization. Pandas integration.
- SameDiff: symbolic differentiation and computation graphs.
- Arbiter: hyperparameter search for optimizing neural networks.
- RL4J: reinforcement learning on the JVM.
- Model Import: import neural nets from ONNX, TensorFlow, Keras (Theano, Caffe).
- Jumpy: Python API for ND4J.
DL4J Training API
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .updater(new AmsGrad(0.05))
    .l2(5e-4).activation(Activation.RELU)
    .list(
        new ConvolutionLayer.Builder(5, 5).stride(1, 1).nOut(20).build(),
        new SubsamplingLayer.Builder(PoolingType.MAX).kernelSize(2, 2).build(),
        new ConvolutionLayer.Builder(5, 5).stride(1, 1).nOut(50).build(),
        new SubsamplingLayer.Builder(PoolingType.MAX).kernelSize(2, 2).padding(2, 2).build(),
        new DenseLayer.Builder().nOut(500).build(),
        new DenseLayer.Builder().nOut(nClasses).activation(Activation.SOFTMAX).build(), // nClasses = number of output classes
        new LossLayer.Builder().lossFunction(LossFunction.MCXENT).build()
    )
    .setInputType(InputType.convolutionalFlat(28, 28, 1))
    .build();

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.fit(...);
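
As a rough sketch of how a configuration like this gets trained and evaluated end to end (the MnistDataSetIterator, batch size, and epoch count below are illustrative assumptions, not part of the original slide):

import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator;
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.optimize.listeners.ScoreIterationListener;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// Hypothetical iterators for the 28x28x1 input defined above
DataSetIterator trainData = new MnistDataSetIterator(64, true, 12345);
DataSetIterator testData = new MnistDataSetIterator(64, false, 12345);

model.setListeners(new ScoreIterationListener(100)); // log the score every 100 iterations

for (int epoch = 0; epoch < 5; epoch++) {
    model.fit(trainData);
    trainData.reset();
}

Evaluation eval = model.evaluate(testData); // accuracy, precision, recall, F1
System.out.println(eval.stats());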
DL4J Training Features
A very extensive, feature-rich library:
- Large set of layers, including VAE
- Elaborate architectures, e.g. center loss
- Listeners: score and performance, checkpointing
- Extensive Eval classes
- Custom activations, custom layers
- Learning rate schedules (see the sketch after this list)
- Dropout, WeightNoise, WeightConstraints
- Transfer learning
- And so much more
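
A minimal sketch combining two of these features, a learning rate schedule plus dropout (the layer sizes and schedule values are placeholders, not from the slide):

import java.util.HashMap;
import java.util.Map;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;
import org.nd4j.linalg.schedule.MapSchedule;
import org.nd4j.linalg.schedule.ScheduleType;

// Step-down learning rate by iteration count
Map<Integer, Double> lr = new HashMap<>();
lr.put(0, 1e-3);
lr.put(1000, 1e-4);
lr.put(5000, 1e-5);

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .updater(new Adam(new MapSchedule(ScheduleType.ITERATION, lr)))
    .dropOut(0.5) // in DL4J this is the probability of retaining an activation
    .list(
        new DenseLayer.Builder().nIn(784).nOut(128).activation(Activation.RELU).build(),
        new OutputLayer.Builder(LossFunction.MCXENT)
            .nIn(128).nOut(10).activation(Activation.SOFTMAX).build()
    )
    .build();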
Inference with imported models
//Import model
model = KerasModelImport.import...
//Featurize input data into an INDArray
INDArray features = …
//Get prediction
INDArray prediction = model.output(features)
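
A slightly more complete sketch of that flow, assuming a Keras Sequential model saved as an HDF5 file (the path and input shape are placeholders):

import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Sequential models import as MultiLayerNetwork; functional-API models import as ComputationGraph
MultiLayerNetwork model =
    KerasModelImport.importKerasSequentialModelAndWeights("/path/to/model.h5");

INDArray features = Nd4j.rand(1, 784); // stand-in for real featurized input
INDArray prediction = model.output(features);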
Featurizing Data
DataVec: A tool for ETL
Runs natively on Spark with GPUs and CPUs
Designed to support all major types of input data (text, CSV, audio,
image, and video), each with its own input format
Define Schemas and Transform Processes
Transform processes can be serialized, which makes them more
portable for production environments.
DataVec Schema
Define Schemas
Schema inputDataSchema = new Schema.Builder()
    .addColumnsString("CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    // $0.0 or more, no maximum limit, no NaN and no Infinite values
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false)
    .addColumnCategorical("FraudLabel", Arrays.asList("Fraud", "Legit"))
    .build();
DataVec Transform Process
Basic Transform Example
- Filter rows by column value
- Handle invalid values with replacement (e.g. negative dollar amounts)
- Handle datetimes, extract hour of day, etc.
- Operate on columns in place
- Derive new columns from existing columns
- Join multiple sources of data
- And much more...
Serialize to JSON! (a minimal sketch follows below)
https://gist.github.com/eraly/3b15d35eb4285acd444f2f18976dd226
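
As a minimal sketch of a serializable transform process over the schema above (the specific transforms and the replacement value are illustrative assumptions):

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.writable.DoubleWritable;

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    // Replace negative transaction amounts with 0.0
    .conditionalReplaceValueTransform("TransactionAmountUSD", new DoubleWritable(0.0),
        new DoubleColumnCondition("TransactionAmountUSD", ConditionOp.LessThan, 0.0))
    // Drop identifier columns the model does not need
    .removeColumns("CustomerID", "MerchantID")
    .build();

// Serialize for production, then restore it elsewhere
String json = tp.toJson();
TransformProcess restored = TransformProcess.fromJson(json);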
DataVec Data Analysis
DataAnalysis dataAnalysis =
AnalyzeSpark.analyze(schema, parsedInputData, maxHistogramBuckets);
HtmlAnalysis.createHtmlAnalysisFile(dataAnalysis, new File("DataVecAnalysis.html"));
Parallel Inference
Model model =
    ModelSerializer.restoreComputationGraph("PATH_TO_YOUR_MODEL_FILE", false);

ParallelInference pi = new ParallelInference.Builder(model)
    .inferenceMode(InferenceMode.BATCHED)
    .batchLimit(32)
    .workers(2)
    .build();

INDArray result = pi.output(...);
DL4J Transfer Learning API
- Ability to freeze layers
- Modify layers, add new layers, change graph structure, etc.
- FineTuneConfiguration for changing learning hyperparameters
- Helper functions to pre-save featurized frozen-layer outputs
  (.featurize method in TransferLearningHelper)
Example with VGG16 that keeps the bottleneck and below frozen and edits the
new layers:
https://github.com/deeplearning4j/dl4j-examples/blob/5381c5f86170dc544522eb7926d8fbf8119bec67/dl4j-examples/src/main/java/org/deeplearning4j/examples/transferlearning/vgg16/EditAtBottleneckOthersFrozen.java#L74-L90
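
A simplified sketch of the same pattern, replacing only the final classifier of a pretrained VGG16. The layer names follow standard VGG16 naming; numClasses, the updater, and the frozen boundary are assumptions, and the linked example instead freezes at the bottleneck and edits more layers.

import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Nesterovs;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;

FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
    .updater(new Nesterovs(5e-5))
    .seed(123)
    .build();

// vgg16 is a pretrained ComputationGraph (model zoo or Keras import)
ComputationGraph vgg16Transfer = new TransferLearning.GraphBuilder(vgg16)
    .fineTuneConfiguration(fineTuneConf)
    .setFeatureExtractor("fc2")                 // freeze everything up to and including fc2
    .removeVertexKeepConnections("predictions") // drop the original 1000-way classifier
    .addLayer("predictions",
        new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
            .nIn(4096).nOut(numClasses)         // numClasses: your new label count (assumed)
            .activation(Activation.SOFTMAX).build(),
        "fc2")
    .build();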
DL4J Training UI
Helps with training and tuning by tracking gradients and updates; also works with Spark.
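
Attaching the UI to a training run typically looks something like this sketch (in-memory stats storage is one of several storage options; the dashboard serves on localhost:9000 by default):

import org.deeplearning4j.api.storage.StatsStorage;
import org.deeplearning4j.ui.api.UIServer;
import org.deeplearning4j.ui.stats.StatsListener;
import org.deeplearning4j.ui.storage.InMemoryStatsStorage;

UIServer uiServer = UIServer.getInstance();
StatsStorage statsStorage = new InMemoryStatsStorage();
uiServer.attach(statsStorage);

// Stream scores, gradients, and updates from training into the dashboard
model.setListeners(new StatsListener(statsStorage));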
Parallel Inference
Commercial Performance
• Skymind integrates Deeplearning4j into its commercial model server, SKIL
• Underlying code uses the ParallelInference class
• Promising scalability as minibatch size and the number of local devices increase
[Chart: performance vs. minibatch size]
Parallel GPUs
• The ParallelInference class automatically picks up available GPUs and balances requests across them
• Backpressure can be handled by "batching" the requests in a queue
• Single-node; it is up to the programmer to scale out, or a commercial solution like SKIL can be used
[Diagram: ParallelInference dispatching batched requests across multiple GPUs]
Example
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/inference/ParallelInferenceExample.java
ParallelInference pi = new ParallelInference.Builder(model)
    .inferenceMode(InferenceMode.BATCHED)
    .batchLimit(32)
    .workers(2)
    .build();
Backlogged Inference
Prerequisites
What is anomaly detection?
In layman’s terms, anomaly detection is the identification of rare
events or items that differ significantly from what is “normal” for a
dataset.
Something is not like the others...
The Problem
How to monitor 1 terabyte of CDN logs per day and detect anomalies.
We want to monitor the health of a live sports score websocket API.
Let’s analyze packet logs from a server farm streaming the latest
NFL game. It produces 1 TB of logs per day with files that look like:
91739747923947 live.nfl.org GET /panthers_chargers 0 1554863750 250 6670 wss 0
Let’s do some math. This line is 73 bytes...
Analysis
What’s the most efficient way to monitor for system disruptions?
I’ve seen attempts to perform anomaly detection on every single
packet! If we have 1 TB of logs per day and each line is 73 bytes,
how many lines is that?
1e+12 bytes / 73 bytes ≈ 13,698,630,137 log lines
Available Hardware
I have a 2 x Titan X Pascal GPU Workstation at home.
Titan X has 342.9 GFLOPS of FP64 (double) computing power.
Sounds like a lot. Can we process a terabyte of logs per day?
Let’s benchmark it!
Data Vectorization
Format of log file is:
{id} {domain} {http_method} {uri} {server_errors}
{timestamp} {round_trip} {payload_size} {protocol}
{client_errors}
How anomalous is our packet when comparing errors, timing, and
round trip?
Let’s build an input using the above...
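
One plausible way to turn a single log line in that format into a model input (the choice of fields and the scaling constants are assumptions made for illustration):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// logLine example: "91739747923947 live.nfl.org GET /panthers_chargers 0 1554863750 250 6670 wss 0"
String[] f = logLine.split(" ");

double serverErrors = Double.parseDouble(f[4]);
double roundTrip    = Double.parseDouble(f[6]);
double payloadSize  = Double.parseDouble(f[7]);
double clientErrors = Double.parseDouble(f[9]);

// Crude scaling so every input lands in a similar range (constants are placeholders)
INDArray features = Nd4j.create(new double[] {
    serverErrors,
    clientErrors,
    roundTrip / 1000.0,
    payloadSize / 10000.0
}, new int[] {1, 4});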
MLP Architecture
We need to encode our data into a representation that has some sort
of computational meaning. Potentially a small MLP encoder can work.
Model size: 158 parameters (very small)
Benchmarks: 43,166 logs/sec on 2xGPU
Total Capacity: 3,729,542,400 logs/day
We need at least 8 GPUs!!! And backpressure!
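
For reference, a tiny MLP autoencoder in this spirit might look like the sketch below; the layer sizes are assumptions and do not reproduce the 158-parameter figure exactly, and the anomaly score would be the reconstruction error.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;

// 4 input features -> 3-unit bottleneck -> reconstruct the 4 inputs
MultiLayerConfiguration mlpConf = new NeuralNetConfiguration.Builder()
    .updater(new Adam(1e-3))
    .list(
        new DenseLayer.Builder().nIn(4).nOut(8).activation(Activation.RELU).build(),
        new DenseLayer.Builder().nIn(8).nOut(3).activation(Activation.RELU).build(),
        new DenseLayer.Builder().nIn(3).nOut(8).activation(Activation.RELU).build(),
        new OutputLayer.Builder(LossFunction.MSE)
            .nIn(8).nOut(4).activation(Activation.IDENTITY).build()
    )
    .build();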
Analysis
What if there was a better way?
We already know we can leverage Kafka for backpressure. That
eliminates high burst loads. What if there was a way we could turn 13
billion packet logs into a fraction of that?
Aggregate!
We can add a Spark Streaming component, use microbatching, and
aggregate the packet logs into a much smaller number of sequences, as sketched below.
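
A minimal Spark Streaming sketch of that idea, counting lines with server errors over a sliding 30-second window with one-second microbatches (the socket source, port, and error field index are assumptions; in practice this would read from Kafka):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf sparkConf = new SparkConf().setAppName("log-aggregation").setMaster("local[*]");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); // 1s microbatches

// Stand-in source; a Kafka direct stream would normally go here
JavaDStream<String> logLines = jssc.socketTextStream("localhost", 9999);

// Aggregate: count lines with server errors over a 30s window, sliding every second
JavaDStream<Long> errorCounts = logLines
    .filter(line -> !line.split(" ")[4].equals("0"))
    .window(Durations.seconds(30), Durations.seconds(1))
    .count();

errorCounts.print();
jssc.start();
jssc.awaitTermination();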
LSTM Architecture
Our MLP encoder turns into an LSTM sequence encoder. We
aggregate across a rolling window of 30 seconds, every second. Do
we become more efficient?
Model size: 14,178 parameters (small)
Benchmarks: 1,494 aggregations/sec on 2xGPU
Total Capacity: 129,081,600 aggregations/day
Aggregation gains significant efficiency: at one aggregation per second we only
need 86,400 aggregations per day, well within capacity on the existing two GPUs.
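
A sketch of the corresponding sequence model in DL4J (the window length, feature count, and layer sizes are assumptions and do not reproduce the 14,178-parameter figure):

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;

// Input: one aggregated record per second over a 30-step window, e.g. 4 features per step
MultiLayerConfiguration lstmConf = new NeuralNetConfiguration.Builder()
    .updater(new Adam(1e-3))
    .list(
        new LSTM.Builder().nIn(4).nOut(32).activation(Activation.TANH).build(),
        new RnnOutputLayer.Builder(LossFunction.MSE)
            .nIn(32).nOut(4).activation(Activation.IDENTITY).build() // reconstruct each step
    )
    .build();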
Lessons
Still need additional hardware.
Spark Streaming will still require additional hardware. However, you are
optimizing CPU-side aggregation rather than paying for expensive GPU usage.
Aggregating across all packets also gives a big-picture view, which is a
better indicator of overall system health.
Number of parameters.
While the models used for this thought experiment are small, you
could very well increase their size by 10x for better performance or higher
dimensionality. That would require additional hardware.
Real Code
Github Example
Kafka, Keras, and Deeplearning4j.
A simplified real-world example: a data science team trains a model
in Python with Keras, you import the model into Deeplearning4j and
Java, and you deploy it to perform inference on data fed by Kafka.
Repository.
https://github.com/crockpotveggies/kafka-streams-machine-learning-examples
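
The wiring in a setup like that boils down to something like the hypothetical consumer loop below (the topic name, the featurize helper, and the serialization format are assumptions; model is an imported Keras model as shown earlier):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.nd4j.linalg.api.ndarray.INDArray;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "dl4j-inference");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("packet-logs")); // hypothetical topic
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            INDArray features = featurize(record.value()); // e.g. the vectorization sketch above
            INDArray score = model.output(features);
            // ...flag anomalies, publish results downstream, etc.
        }
    }
}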
Questions?
help@skymind.ai