Optimizing training on
Apache MXNet
Julien Simon, AI Evangelist, EMEA
@julsimon
What to expect from this session
• Techniques and tips to optimize training on Apache MXNet
• Infrastructure performance: storage and I/O, GPU throughput, distributed
training, CPU-based training, cost
• Model performance: data augmentation, initializers, optimizers, etc.
• Level 666: you should be familiar with Deep Learning and MXNet
Optimizing Infrastructure Performance
Deploying data sets to instances
• Deep Learning training sets are often very large, with a huge number of files
• How can we deploy them quickly, easily and reliably to instances?
• We strongly recommend packing the training set in a single RecordIO file (see the packing sketch after this list)
• https://mxnet.incubator.apache.org/architecture/note_data_loading.html
• https://mxnet.incubator.apache.org/how_to/recordio.html
• Only one file to move around!
• Worth the effort: pack once, train many times
• In any case, you need to copy your data set to a central location
• Let’s look at Amazon EBS, Amazon S3 and Amazon EFS
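• Below is a minimal packing sketch using mx.recordio, assuming an images/ folder with one subfolder per class; in practice the im2rec.py tool shipped with MXNet does this for you:
import os
import cv2                        # assumption: OpenCV is available for image decoding
import mxnet as mx

record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'w')
idx = 0
for label, class_dir in enumerate(sorted(os.listdir('images'))):
    for filename in sorted(os.listdir(os.path.join('images', class_dir))):
        img = cv2.imread(os.path.join('images', class_dir, filename))
        header = mx.recordio.IRHeader(flag=0, label=label, id=idx, id2=0)
        record.write_idx(idx, mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg'))
        idx += 1
record.close()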
Storing data sets in Amazon EBS
1. Prepare your data set on a dedicated EBS volume
2. Take a snapshot
3. Deploying to a new instance only takes a few seconds
a. Create a volume from the snapshot
b. Attach the volume to the instance
c. Mount the volume
• Easy to automate, including at boot time (UserData or cfn-init); see the boto3 sketch below
• Easy to scale to many instances, even in different accounts
• Large choice of EBS volume types (cost vs. performance)
• Caveat: no sharing for distributed training, copying is required
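• A boto3 sketch of steps 3a to 3c; the snapshot id, region, instance id and device name are placeholders:
import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# 3a. Create a volume from the snapshot holding the data set
vol = ec2.create_volume(SnapshotId='snap-0123456789abcdef0',
                        AvailabilityZone='eu-west-1a', VolumeType='gp2')
ec2.get_waiter('volume_available').wait(VolumeIds=[vol['VolumeId']])

# 3b. Attach the volume to the training instance
ec2.attach_volume(VolumeId=vol['VolumeId'],
                  InstanceId='i-0123456789abcdef0', Device='/dev/xvdf')

# 3c. Then, on the instance itself: sudo mount /dev/xvdf /data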
Storing data sets in Amazon S3
• MXNet has an S3 connector: build it with USE_S3=1
https://mxnet.incubator.apache.org/how_to/s3_integration.html
• Best durability (11 9’s)
• Distributed training possible
• Caveats
• Lower performance than EBS-optimized instances
• Beware of hot spots if a lot of instances are running
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
train_dataiter = mx.io.MNISTIter(
    image="s3://bucket-name/training-data/train-images-idx3-ubyte",
    label="s3://bucket-name/training-data/train-labels-idx1-ubyte", ...
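• With USE_S3=1, RecordIO files can also be read straight from S3 by the image iterator; a sketch with placeholder bucket and key names:
import mxnet as mx

train_iter = mx.io.ImageRecordIter(
    path_imgrec="s3://bucket-name/training-data/train.rec",
    data_shape=(3, 224, 224),
    batch_size=128)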
Storing data sets in Amazon EFS
1. Copy your data set on an EFS volume
2. Mount the volume on instances
• Simple way to set up distributed training (no copying required)
• Caveats
• You probably want the “Max I/O” performance mode, but I’d test both
to see if latency is an issue or not
• EFS is more expensive than S3 and EBS: use it for training only, not
for long-term storage
Maximizing GPU usage
• GPUs need a high-throughput, stable flow of training data to run at top speed
• Large datasets cannot fit in RAM
• Adding more GPUs requires more throughput
• How can we check that training is running at full speed?
• Keep track of performance indicators from previous training runs (images / sec, etc.)
• Look at performance indicators and benchmarks reported by others
• Use nvidia-smi (see the polling sketch below)
• Look at power consumption, GPU utilization and GPU RAM
• All these values should be maxed out and stable
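• A small sketch that polls nvidia-smi every 5 seconds for the values above (assumes nvidia-smi is on the PATH):
import subprocess

subprocess.run(['nvidia-smi',
                '--query-gpu=utilization.gpu,memory.used,power.draw',
                '--format=csv', '-l', '5'])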
Maximizing GPU usage: batch size
• Picking a batch size is a tradeoff between training speed and accuracy
• Larger batch size is more computationally efficient
• Smaller batch size helps find a better minimum
• Smaller data sets, few classes (MNIST, CIFAR)
• Start with 32*GPU_COUNT
• 1024 is probably the largest reasonable batch size
• Large data sets, lot of classes (ImageNet)
• Use the largest possible batch size
• Start at 32*GPU_COUNT and increase it until MXNet OOMs
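• A minimal sketch that scales the batch size with the GPU count; the network symbol `net` and the RecordIO file name are assumptions:
import mxnet as mx

gpus = [mx.gpu(i) for i in range(4)]
batch_size = 32 * len(gpus)              # start at 32 per GPU, then increase

train_iter = mx.io.ImageRecordIter(
    path_imgrec="train.rec",
    data_shape=(3, 224, 224),
    batch_size=batch_size)

mod = mx.mod.Module(net, context=gpus)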
Maximizing GPU usage: compute & I/O
• Check power consumption and GPU usage after each modification
• If they’re not maxed out, GPUs are probably stalling
• Can the Python process keep up? Loading images, pre-processing, etc.
• Use top to check load and count threads
• Use RecordIO and add more decoding threads (see the iterator sketch below)
• Can the I/O layer keep up?
• Use iostat to look at volume stats
• Use faster storage: SSD or even a ramdisk!
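• For RecordIO, decoding threads are controlled by the preprocess_threads parameter of the iterator; 8 threads is just an assumed starting point:
import mxnet as mx

train_iter = mx.io.ImageRecordIter(
    path_imgrec="train.rec",
    data_shape=(3, 224, 224),
    batch_size=128,
    preprocess_threads=8)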
Using distributed training
• MXNet scales almost linearly up to 256 GPUs
http://www.allthingsdistributed.com/2016/11/mxnet-default-framework-deep-learning-aws.html
• Easy to set up
https://mxnet.incubator.apache.org/how_to/multi_devices.html
• Blog post + AWS CloudFormation template
https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/
• Master node must have SSH access to slave nodes
• Data set must be accessible on all nodes
• Shared storage: great!
• No shared storage: automatic copy with rsync
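• Inside the training script, distributed training mostly means creating a distributed key-value store and passing it to fit(); a sketch with `net` and `train_iter` assumed, while launch.py (from the how-to above) starts the processes on each host:
import mxnet as mx

kv = mx.kvstore.create('dist_sync')
mod = mx.mod.Module(net, context=[mx.gpu(i) for i in range(8)])
mod.fit(train_iter, num_epoch=10, kvstore=kv,
        optimizer='sgd', optimizer_params={'learning_rate': 0.1})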
What about CPU training?
• Several libraries help speed up Deep Learning on CPUs
• Fast implementation of math primitives
• Dedicated instruction sets, e.g. Intel AVX or ARM NEON
• Fast memory allocation
• Intel Math Kernel Library https://software.intel.com/en-us/mkl (build option USE_MKL = 1)
• NNPACK https://github.com/Maratyszcza/NNPACK (build option USE_NNPACK = 1)
• Libjpeg-turbo https://www.libjpeg-turbo.org/ (build option USE_TURBO_JPEG = 1)
• Jemalloc http://jemalloc.net/ (build option USE_JEMALLOC = 1)
• Google Perf Tools https://github.com/gperftools (build option USE_GPERFTOOLS = 1)
Intel® MKL-DNN: Math Kernel Library for Deep Neural Networks
For developers of deep learning frameworks featuring optimized performance on Intel hardware
http://github.com/01org/mkl-dnn
• Open Source, Apache 2.0 License
• Common DNN APIs across all Intel hardware
• Rapid release cycles, iterated with the DL community, to best support industry framework integration
• Highly vectorized & threaded for maximal performance, based on the popular Intel® MKL library
• Example primitives: direct 2D convolution, rectified linear unit (ReLU) activation, maximum pooling, inner product, local response normalization (LRN)
Optimizing cost
• Use Spot instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
• Sharing is caring: it’s easy to share an
instance for multiple jobs
mod = mx.mod.Module(lenet, context=(mx.gpu(7), mx.gpu(8), mx.gpu(9)))
Example: p2.16xlarge at an 89% Spot discount
Demo: C5 + Intel MKL = ♥ ♥ ♥
Optimizing Model Performance
Using data augmentation
• Data augmentation lets you add more samples to smaller data sets
• Even a large data set may benefit from it and generalize better
• The ImageRecordIter object lets you do that easily from a RecordIO image file
• Images: crop, rotate, change colors, etc.
• https://mxnet.incubator.apache.org/api/python/io.html#mxnet.io.ImageRecordIter
• Careful: this processing is performed by the Python process: add more threads!
data_iter = mx.io.ImageRecordIter(
    path_imgrec="./data/caltech_train.rec",
    data_shape=(3, 227, 227),
    batch_size=4,
    resize=256,
    …
    # you can add more augmentation options here;
    # use help(mx.io.ImageRecordIter) to see all possible choices
)
Picking an initializer
• MXNet supports many different initializers
https://mxnet.incubator.apache.org/api/python/optimization.html
• Initial weights should be neither "too large" nor "too small"
• There seems to be some sort of consensus on:
https://www.quora.com/What-are-good-initial-weights-in-a-neural-network
• Xavier for Convolutional Neural Networks
• Random values between 0 and 1 for everything else
• I wouldn’t use anything else unless I really knew better 
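• A sketch of passing a Xavier initializer to fit(); `net` and `train_iter` are assumed, and the Xavier parameters shown are common defaults, not a recommendation:
import mxnet as mx

mod = mx.mod.Module(net, context=mx.gpu(0))
mod.fit(train_iter, num_epoch=10,
        initializer=mx.init.Xavier(rnd_type='gaussian', factor_type='in', magnitude=2),
        optimizer='sgd', optimizer_params={'learning_rate': 0.1})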
Managing the learning rate
• The learning rate is probably the most discussed parameter in Deep Learning
• Too small: your model may take forever to converge, or get stuck in a poor minimum
• Too large: your model may overshoot and never reach a minimum
• Try keeping a large learning rate for a long time, then reduce it
• Here are common techniques you could use with MXNet:
1. Use a fixed learning rate
2. Use steps: scale the learning rate
• once a number of batches have been completed,
• after each epoch,
• once specific epochs have been completed
3. Use an optimizer which automatically adapts the learning rate
Scaling the learning rate with steps
• Steps per epoch = number of samples / batch size / number of distributed workers
• FactorScheduler object: update the learning rate after ‘n’ steps
• MultiFactorScheduler object: update the learning rate after specific step counts
• MXNet scripts let you use command-line parameters (--step-epochs)
https://github.com/apache/incubator-mxnet/tree/master/example/image-classification
lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))

steps = [0, 100, 200, 250, 300, 325, 350]
lr_sch = mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=0.9)
mod.init_optimizer(..., optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))
Picking an optimizer
• MXNet supports many different optimizers
https://mxnet.incubator.apache.org/api/python/optimization.html
http://ruder.io/optimizing-gradient-descent/
• It’s unlikely that a single one will work best every time. Experiment!
• Several SGD variants adapt the learning rate during training
• Some of them even use a specific learning rate for each parameter
Example: learning MNIST with the LeNet CNN (20 epochs)

Algorithm            SGD     NAG     Adam    NAdam   AdaGrad  AdaMax
Time / epoch         2.5s    2.55s   18.5s   15.1s   5.7s     7.5s
Validation accuracy  98.5%   98.5%   98.3%   98.4%   99.2%    98.55%
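• Switching optimizers is a one-line change in fit(); a sketch with an assumed module and iterators, and an example learning rate:
mod.fit(train_iter, eval_data=val_iter, num_epoch=20,
        optimizer='adagrad', optimizer_params={'learning_rate': 0.01})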
Reducing model size
• Complex neural networks are too large for resource-constrained environments
• MXNet supports Mixed Precision Training
• Use float16 instead of float32 (see the sketch after this list)
• Almost 2x reduction in memory consumption, no loss of accuracy
• https://devblogs.nvidia.com/parallelforall/mixed-precision-training-deep-neural-networks/
• http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#mxnet
• BMXNet: Binary Neural Network Implementation
• Use binary values for weights and activations
• 20x to 30x reduction in model size, with limited loss
• https://github.com/hpi-xnor/BMXNet
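• A minimal mixed-precision sketch: cast the input to float16, build the network as usual, then cast back to float32 before the loss (see the NVIDIA links above for the full recipe):
import mxnet as mx

data = mx.sym.Variable('data')
data = mx.sym.Cast(data=data, dtype='float16')
# ... convolution / fully-connected layers go here ...
net = mx.sym.Cast(data=data, dtype='float32')
net = mx.sym.SoftmaxOutput(data=net, name='softmax')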
Monitoring the training process
• You can run callbacks at the end of each batch and at the end of each epoch.
• This allows you to display training speed…
• … and save parameters after each epoch
module.fit(iterator, num_epoch=n_epoch, ...
           batch_end_callback=mx.callback.Speedometer(64, 10))

Epoch[0] Batch [10] Speed: 1910.41 samples/sec Train-accuracy=0.200000
Epoch[0] Batch [20] Speed: 1764.83 samples/sec Train-accuracy=0.400000

module.fit(iterator, num_epoch=n_epoch, ...
           epoch_end_callback=mx.callback.do_checkpoint("mymodel", 1))
Start training with [cpu(0)]
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=0.100 Saved checkpoint to "mymodel-0001.params"
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=0.060 Saved checkpoint to "mymodel-0002.params"
Early stopping
[Figure: accuracy and loss vs. epochs. Training accuracy keeps climbing towards 100% while validation accuracy plateaus and validation loss starts rising again (overfitting); the best checkpoint is the one saved just before validation accuracy stops improving.]
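• There is no built-in early-stopping callback in the Module API; here is a hedged sketch that trains one epoch at a time, keeps the best checkpoint and stops when validation accuracy stops improving (`net`, `train_iter`, `val_iter` and the patience value are assumptions):
import mxnet as mx

mod = mx.mod.Module(net, context=mx.gpu(0))
best_acc, best_epoch, patience = 0.0, 0, 3
for epoch in range(50):
    # repeated fit() calls keep the existing parameters and optimizer state
    mod.fit(train_iter, begin_epoch=epoch, num_epoch=epoch + 1,
            optimizer='sgd', optimizer_params={'learning_rate': 0.1})
    acc = dict(mod.score(val_iter, ['acc']))['accuracy']
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
        mod.save_checkpoint("mymodel", epoch)     # best model so far
    elif epoch - best_epoch >= patience:
        break                                     # no improvement for `patience` epochs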
Conclusion
• There is a lot of literature on selecting and tweaking hyper-parameters
• You should definitely read it but please experiment with your own data
• Train 1,000 models and pick the best one
• Optimizing infrastructure is all the more important, then!
• Make sure all parts are firing on all cylinders
• Spot instances!
• I hope this was useful. Please don’t forget to send your feedback
• Go build cool stuff and let me know! Happy to share and retweet 
Resources
https://aws.amazon.com/ai
https://aws.amazon.com/blogs/ai
https://mxnet.io
https://github.com/apache/incubator-mxnet
https://github.com/gluon-api
https://aws.amazon.com/blogs/machine-learning/speeding-up-apache-mxnet-using-the-nnpack-library/
https://medium.com/@julsimon/speeding-up-apache-mxnet-part-3-lets-smash-it-with-c5-and-intel-mkl-90ab153b8cc1
https://medium.com/@julsimon/imagenet-part-1-going-on-an-adventure-c0a62976dc72
https://medium.com/@julsimon/imagenet-part-2-the-road-goes-ever-on-and-on-578f09a749f9
Thank you!
Julien Simon, AI Evangelist, EMEA
@julsimon
Editor's Notes

• ImageNet: 1.2 million files, 152 GB
• Intel® MKL-DNN (Math Kernel Library for Deep Neural Networks) is highly optimized using industry leading techniques and low level assembly code where appropriate. The API has been developed with feedback and interaction with the major framework owners, and as an open source project will track new and emerging trends in these frameworks. Intel is using this internally for our work in optimizing industry frameworks, as well as supporting the industry in their optimizations.