Deep Learning On Mobile
A PRACTITIONER'S GUIDE
@AnirudhKoul · @SiddhaGanju · @MeherKasam
Why Deep Learning on Mobile?
Privacy · Reliability · Cost · Latency
Latency Is Expensive!
100 ms of extra latency cost 1% in sales [Amazon 2008]
Latency Is Expensive!
53% of mobile site visits bounce when load time exceeds 3 seconds [Google Research, Webpagetest.org]
Power of 10
0.1 s: feels seamless · 1 s: uninterrupted flow of thought · 10 s: limit of attention
[Miller 1968; Card et al. 1991; Nielsen 1993]
High-Quality Dataset + Hardware + Efficient Model + Efficient Mobile Inference Engine = DL App
How do I train my model?
Learning to play the melodica from scratch: 3 months
Already play piano?
FINE-TUNE your skills: 3 months → 1 week
Fine-Tuning
1. Assemble a dataset
2. Find a pre-trained model
3. Fine-tune the pre-trained model
4. Run it using existing frameworks (see the sketch below)
“Don’t Be A Hero” — Andrej Karpathy
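A minimal Keras sketch of this workflow, assuming a two-class problem; train_images and train_labels are hypothetical placeholders for your assembled dataset:

import tensorflow as tf

# Start from an ImageNet-pretrained MobileNet and freeze its weights.
base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False,
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

# Attach a small classification head and fine-tune only that.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5)  # hypothetical dataset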
CustomVision.ai
Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper usage rights.
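For the Bing route, a hedged sketch using the public Image Search v7 REST endpoint; the subscription key and query are placeholders:

import requests

SUBSCRIPTION_KEY = "<your-azure-key>"  # placeholder
SEARCH_URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

headers = {"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY}
params = {"q": "melodica", "license": "Public", "count": 50}

response = requests.get(SEARCH_URL, headers=headers, params=params)
response.raise_for_status()

# Each result carries a contentUrl pointing at the full-size image.
for i, result in enumerate(response.json()["value"]):
    try:
        image = requests.get(result["contentUrl"], timeout=10)
        with open(f"image_{i:03d}.jpg", "wb") as f:
            f.write(image.content)
    except requests.RequestException:
        continue  # skip images that fail to download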
Demo
How do I run my models?
Core ML · TF Lite · ML Kit
Apple Ecosystem
Metal (2014) → BNNS + MPS (2016) → Core ML (2017) → Core ML 2 (2018) → Core ML 3 (2019)

Core ML 2 (2018):
§ Tiny models (on the order of KB)!
§ 1-bit model quantization support
§ Batch API for improved performance
§ Conversion support for MXNet, ONNX
§ tf-coreml (sketch below)
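As a sketch of the tf-coreml route; the file names, tensor names, and shapes below are placeholders for your own graph:

import tfcoreml

# Convert a frozen TensorFlow graph (.pb) to a Core ML model (.mlmodel).
tfcoreml.convert(
    tf_model_path="frozen_graph.pb",          # placeholder path
    mlmodel_path="MyModel.mlmodel",
    input_name_shape_dict={"input:0": [1, 224, 224, 3]},
    output_feature_names=["softmax:0"],       # placeholder tensor name
)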
Apple Ecosystem
Core ML 3 (2019):
§ On-device training
§ Personalization
§ Create ML UI
Core ML Benchmark
[Chart: execution time (ms) for ResNet-50, MobileNet, and SqueezeNet across iPhone 5s (2013), iPhone 6 (2014), iPhone 6s (2015), iPhone 7 (2016), iPhone X (2017), and iPhone XS (2018); each new generation runs dramatically faster. Annotation: "GPUs became a thing here!"]
https://heartbeat.fritz.ai/ios-12-core-ml-benchmarks-b7a79811aac1
TensorFlow Ecosystem
TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)
TensorFlow Lite: smaller, faster, minimal dependencies, allows running custom operators
TensorFlow Lite is small
TensorFlow Mobile: 1.5 MB
TensorFlow Lite (core interpreter + supported operations): 300 KB
TensorFlow Lite is Fast
§ Takes advantage of on-device hardware acceleration
§ FlatBuffers: reduces code footprint and memory usage, avoids CPU cycles spent on serialization and deserialization, and improves startup time
§ Pre-fused activations: combines the batch normalization layer with the previous convolution
§ Static memory and static execution plan: decreases load time
TensorFlow Ecosystem: converting a Keras model to TensorFlow Lite

$ tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite
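The same conversion via the TF 1.x Python API; paths are placeholders:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file("keras_model.h5")
# Optional post-training quantization:
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("foo.tflite", "wb") as f:
    f.write(tflite_model)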
TensorFlow Ecosystem: deployment workflow
Trained TensorFlow model → TF Lite Converter → .tflite model → Android app / iOS app
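Before wiring the .tflite file into an app, you can sanity-check it with the TF Lite interpreter in Python; the model path is a placeholder:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="foo.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference on a random input of the expected shape and dtype.
dummy = np.random.random_sample(input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))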
ML Kit
§ Easy to use
§ Abstraction over TensorFlow Lite
§ Built-in APIs for image labeling, OCR, face detection, barcode scanning, landmark detection, smart reply
§ Model management with Firebase
§ A/B testing
let vision = Vision.vision()
let faceDetector = vision.faceDetector(options: options)
let image = VisionImage(image: uiImage)
faceDetector.process(image) { faces, error in
    // Handle detected faces or the error in this callback
}
How do I keep my IP safe?
Fritz
§ Full-fledged mobile ML lifecycle support
§ Deployment, instrumentation, etc. from Python
Does my model make me look fat?
§ Apple does not allow apps over 200 MB to be downloaded over a cellular network.
§ Download models on demand, and interpret them on device instead.
What effect does hardware have on performance?
Big Things Come In Small Packages
Effect of hardware (left to right: iPhone XS, iPhone X, iPhone 5)
https://twitter.com/matthieurouif/status/1126575118812110854?s=11
TensorFlow Lite Benchmarks
Numericcal's Alpha Lab: http://alpha.lab.numericcal.com/
TensorFlow Lite Benchmarks
AI Benchmark, a crowdsourced benchmarking app by Andrey Ignatov (ETH Zurich): http://ai-benchmark.com/
Alchemy by Fritz
https://alchemy.fritz.ai/
§ Python library to analyze and estimate mobile performance
§ No need to deploy on mobile
Which devices should I support?
§ To get 95% device coverage, support phones released in the last four years.
§ For unsupported phones, offer graceful degradation (lower frame rate, cloud inference, etc.).
Could all of this result in heavy energy use?

Energy Considerations
§ You don't usually run AI models constantly; you run them for a few seconds at a time.
§ On a modern flagship phone, running MobileNet at 30 FPS will drain the battery in about 2–3 hours.
§ The bigger question: do you really need to run it at 30 FPS? Could it run at 1 FPS? (See the sketch below.)
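A minimal sketch of frame throttling; grab_frame and run_model are hypothetical stand-ins for your camera capture and inference calls:

import time

TARGET_INTERVAL = 1.0  # seconds between inferences, i.e. 1 FPS
last_run = 0.0

while True:
    frame = grab_frame()               # hypothetical camera capture
    now = time.monotonic()
    if now - last_run >= TARGET_INTERVAL:
        last_run = now
        prediction = run_model(frame)  # hypothetical, the expensive call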
Energy reduction from 30 FPS to 1 FPS
[Chart: measurements on a 2017 iPad Pro]
What exciting applications can I build?
Seeing AI
Audible barcode recognition
Aim: Help blind users identify products using barcodes
Issue: Blind users don't know where the barcode is
Solution: Guide the user toward the barcode with audio cues
AR Hand Puppets
Hart Woolery, 2020CV
Object detection (hand) + key point estimation
[https://twitter.com/2020cv_inc/status/1093219359676280832]
Remove Objects
Brian Schulman, Adventurous Co.
Object segmentation + image inpainting
https://twitter.com/smashfactory/status/1139461813710442496
Magic Sudoku App
Edge detection + classification + ARKit
https://twitter.com/braddwyer/status/910030265006923776
Snapchat Face Swap
GANs
Can I make my model even more efficient?
How To Find Efficient Pre-Trained Models
Papers with Code: https://paperswithcode.com/sota
Model Zoo: https://modelzoo.co
What you want vs. what you can afford
Model Pruning
Aim: Remove all connections with absolute weights below a threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
Pruning in Keras

# Original dense model
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Same model with magnitude-based pruning wrapped around the dense layers
# (TensorFlow Model Optimization Toolkit)
import tensorflow_model_optimization as tfmot

prune = tfmot.sparsity.keras.prune_low_magnitude

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    prune(tf.keras.layers.Dense(512, activation=tf.nn.relu)),
    tf.keras.layers.Dropout(0.2),
    prune(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
])
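Alternatively, the wrapper can be applied to the whole model at once. A sketch of how training then proceeds, assuming hypothetical train_images/train_labels; the 80% target sparsity is illustrative:

import tensorflow_model_optimization as tfmot

# Ramp sparsity from 0% to 80% over the first 2,000 training steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8,
    begin_step=0, end_step=2000)

pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)
pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# UpdatePruningStep keeps the pruning masks in sync with training steps.
pruned.fit(train_images, train_labels, epochs=4,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export so the shipped model stays lean.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)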
Model Quantization

Quantized to   Size reduction   Percent match with 32-bit results
               (approx.)        linear     linear_lut   kmeans_lut
16-bit         50%              100%       100%         100%
8-bit          75%              88.37%     80.62%       98.45%
4-bit          88%              0%         0%           81.4%
2-bit          94%              0%         0%           10.08%
1-bit          97%              0%         0%           7.75%
[Diagram: quantized value intervals for 8 bits. The continuous 0–1 range is split into 256 levels (0, 1, …, 254, 255), each step 1/255 ≈ 0.0039 wide; 0.0039, 0.0078, …, 0.9921, 0.9961 mark the interval boundaries.]
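The linear, linear_lut, and kmeans_lut columns correspond to coremltools' weight quantization modes. A hedged sketch; the model file names are placeholders:

import coremltools
from coremltools.models.neural_network import quantization_utils

model = coremltools.models.MLModel("MobileNet.mlmodel")  # placeholder

# Quantize 32-bit weights down to 8 bits with a k-means lookup table.
quantized_model = quantization_utils.quantize_weights(
    model, nbits=8, quantization_mode="kmeans_lut")
quantized_model.save("MobileNet_8bit.mlmodel")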
So many techniques — so little time!
Channel pruning
Model quantization
ThiNet (Filter pruning)
Weight sharing
Automatic Mixed Precision
Network distillation
Pocket Flow – 1 Line to Make a Model Efficient
Tencent AI Lab created PocketFlow, an Automatic Model Compression (AutoMC) framework
Can I design a better architecture myself?
Maybe. But AI can do it much better!
AutoML – Let AI Design an Efficient Architecture
§ Neural Architecture Search (NAS): an automated approach to designing models with reinforcement learning while maximizing accuracy
§ Hardware-aware NAS maximizes accuracy while minimizing run time on device
§ Incorporates latency information into the reward objective function
§ Measures real-world inference latency by executing on the target platform
§ MnasNet is 1.5x faster than MobileNetV2
§ ResNet-50 accuracy with 19x fewer parameters
§ SSD300 mAP with 35x fewer FLOPs
Evolution of Mobile NAS Methods

Method            Top-1 Acc (%)   Pixel 1 Runtime (ms)   Search Cost (GPU hours)
MobileNetV1       70.6            113                    Manual
MobileNetV2       72.0            75                     Manual
MnasNet           74.0            76                     40,000 (4+ years)
ProxylessNAS-R    74.6            78                     200
Single-Path NAS   74.9            79.5                   3.75
ProxylessNAS – Per-Hardware Tuned CNNs
Han Cai, Ligeng Zhu, and Song Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware", ICLR 2019
On-Device Training in Core ML

let updateTask = try MLUpdateTask(
    forModelAt: modelUrl,
    trainingData: trainingData,
    configuration: configuration,
    completionHandler: { [weak self] context in
        // Swap in the updated model and persist it for future launches
        self?.model = context.model
        try? context.model.write(to: newModelUrl)
    })
updateTask.resume()
§ Core ML 3 introduced on-device learning.
§ With MLUpdateTask, you never have to send training data to a server.
§ Schedule training for when the device is charging to save power.
FEDERATED LEARNING!!!
https://federated.withgoogle.com/
TensorFlow Federated
§ Train a global model using thousands of devices without access to the underlying data
§ Encryption + secure aggregation protocol
§ Waiting for aggregations to build up can take a few days
https://github.com/tensorflow/federated
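A minimal sketch of the Federated Averaging loop in TensorFlow Federated; model_fn (returning a tff.learning.Model) and federated_train_data (a list of per-client datasets) are placeholders you would define:

import tensorflow_federated as tff

process = tff.learning.build_federated_averaging_process(model_fn)
state = process.initialize()

# Each round samples clients, trains locally, and aggregates the updates.
for round_num in range(10):
    state, metrics = process.next(state, federated_train_data)
    print(f"round {round_num}: {metrics}")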
Mobile AI Development Lifecycle
1. Collect data
2. Label data
3. Train model
4. Convert model
5. Optimize performance
6. Deploy
7. Monitor
What we learned today
§ Why deep learning on mobile?
§ Building a model
§ Running a model
§ Hardware factors
§ Benchmarking
§ State-of-the-art applications
§ Making a model more efficient
§ Federated learning
How do I access the slides instantly?
http://PracticalDeepLearning.ai
@PracticalDLBook
Icons credit - Orion Icon Library - https://orioniconlibrary.com
@MeherKasam
@AnirudhKoul
@SiddhaGanju
Releases October 2019