08 neural networks

Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. Check with your system manufacturer or retailer or learn more at intel.com.
This sample source code is released under the Intel Sample Source Code License
Agreement.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2018, Intel Corporation. All rights reserved.

Neural Networks
A fancy, tunable way to get a function f when given data and a target.
• That is, f(data) → tgt

Neural Network Example: OR Logic
A logic gate takes in two Boolean (true/false or 1/0) inputs.
Returns either a 0 or 1, depending on its rule.
The truth table for a logic gate shows the outputs for each combination of inputs.

Truth Table
For example, let's look at the truth table for an OR gate:
x1  x2  x1 OR x2
0   0   0
0   1   1
1   0   1
1   1   1

OR as a Neuron
A neuron that uses the sigmoid activation function outputs a value in the interval (0, 1).
This naturally leads us to think about Boolean values.
Imagine a neuron that takes in two inputs, x1 and x2, and a bias term.
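A minimal sketch of such a neuron, in case it helps to see the numbers. The weights (20, 20) and bias (-10) are illustrative assumptions, not values from the slides; any choice where one active input pushes the weighted sum well above zero behaves the same way.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights (an assumption, not from the slides):
# a single active input is enough to make z strongly positive.
W1, W2, BIAS = 20.0, 20.0, -10.0

def or_neuron(x1, x2):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = W1 * x1 + W2 * x2 + BIAS
    return sigmoid(z)

for x1 in (0, 1):
    for x2 in (0, 1):
        # Rounding the output recovers the Boolean value.
        print(x1, x2, round(or_neuron(x1, x2)))
```

Rounding the neuron's output reproduces the OR truth table for all four input pairs.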

Nodes
Nodes are the primitive elements.
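out = activation(f(in) + bias)
$$z = b + \sum_{i=1}^{m} W_i x_i = W^T x + b, \qquad \text{out} = a(z)$$
[Figure: inputs x1, x2 and the constant +1 feed the node Z; the activation a gives out = a(Z).]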

Classic Visualization of Neurons
[Figure: inputs x1 and x2 and a bias neuron (constant +1) feed the node Z; the activation function a gives out = a(Z). Weights are shown as arrows in classical visualizations of NNs.]

Classic Visualization of Neurons
[Figure: inputs x1 and x2 and a bias neuron (constant +1) feed the node Z through arrows labeled W1, W2 and b; the activation function a gives out = a(Z).]

Training
z is a dot-product between inputs and weights of the node.
• sum-of-squares
We initialize the weights with constants and/or random values.
Learning is the process of finding good weights.

Activation Function: Sigmoid
Model inspired by biological neurons.
Biological neurons either pass no signal, full signal, or something in between.
We want a function that behaves like this and has an easy derivative.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
• Value at z ≪ 0? ≈ 0
• Value at z = 0? = 0.5
• Value at z ≫ 0? ≈ 1

[Figure: worked example. Input values 0.5, 1.0 and 1.0 (the bias input) with weights −40, 5 and 5 give z = −10, so the neuron outputs σ(z) = σ(−10) ≈ 0.0.]
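The "easy derivative" can be checked numerically. The sketch below assumes the standard identity σ'(z) = σ(z)(1 − σ(z)), which the slides do not spell out, and compares it with a finite-difference estimate; the helper names are made up for this example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-10.0, 0.0, 10.0):
    # value, analytic slope, numeric slope
    print(z, sigmoid(z), sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The slope is largest at z = 0 and nearly zero at both extremes, where the output saturates.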

Activation Function: ReLU
Many modern networks use rectified linear units (ReLU).
$$\mathrm{ReLU}(z) = \begin{cases} 0, & z < 0 \\ z, & z \ge 0 \end{cases} = \max(0, z)$$
• Value at z ≪ 0? = 0
• Value at z = 0? = 0
• Value at z ≫ 0? = z
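As a quick illustration, ReLU is a one-liner. This is only a sketch of the definition, with sample inputs chosen for the example.

```python
def relu(z):
    """Rectified linear unit: 0 for negative z, z otherwise."""
    return max(0.0, z)

for z in (-5.0, 0.0, 3.5):
    print(z, relu(z))
```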

Layers and Networks
Inputs don’t need to be limited to passing data into a single neuron.
They can pass data to as many as we like.
[Figure: inputs x1 and x2 and the bias +1 each feed two neurons.]

Layers and Networks
Typically, neurons are grouped into layers.
Each neuron in the layer receives input from the same neurons.
Weights are different for each neuron
All neurons in this layer output to the same neurons in a subsequent layer.

Layers and Networks: Input/Output Layers
Input layer depends on:
• Form of raw data
• First level of our internal network architecture
Output layer depends on:
• Last layer of our internal network architecture
• Type of prediction we want to make
• Regression versus classification

Layers and Networks: Input/Output Layers
[Figure: inputs x1 and x2 and the bias +1 feed two neurons, a1 and a2.]
a1 and a2 receive the same x1 value.
But having different weights means the a1 and a2 neurons respond differently.

Feed Forward Neural Network
[Figure: inputs x1, x2, x3 feed layers of neurons (a), which produce outputs y1, y2, y3; the arrows between them are the weights.]

Feed Forward Neural Network
[Figure: the same network with its parts labeled. Input Layer: x1, x2, x3. Hidden Layers: the neurons (a). Output Layer: y1, y2, y3.]
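A feed-forward pass can be sketched as repeated application of one layer function. The layer sizes (3 inputs, 2 hidden neurons, 3 outputs) and all weight values below are hypothetical, picked only to keep the example small; they are not the network in the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer.

    weights[j][i] is the weight from input i to neuron j, so every
    neuron sees the same inputs but has its own weights.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def feed_forward(x, layers):
    """Pass x through each (weights, biases) layer in order."""
    for weights, biases in layers:
        x = dense(x, weights, biases, sigmoid)
    return x

# Hypothetical 3 -> 2 -> 3 network with hand-picked weights.
layers = [
    ([[0.5, -0.5, 0.1], [0.3, 0.8, -0.2]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.7]], [0.0, 0.0, 0.2]),
]
y = feed_forward([1.0, 2.0, 3.0], layers)
print(y)
```

The outputs of one layer become the inputs of the next, which is all "feed forward" means here.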

Optimization and Loss: Gradient Descent
We will start with the cost function: J(x) = x^2
• Cost is what we pay for an error
• For example, an error of -3 gives a cost of 9
Take the gradient: the derivative of x^2 is 2x.
Select datapoints to generate a gradient slope line.
Plot x^2 with a given gradient slope and annotations.
We want the lowest cost.

Gradient Descent: Starting From Left Side

Gradient Descent: Starting From Right Side
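The two starting points can be compared with a short sketch of gradient descent on J(x) = x^2. The learning rate (0.1) and step count (50) are assumptions chosen for illustration.

```python
def cost(x):
    """J(x) = x^2."""
    return x ** 2

def gradient(x):
    """dJ/dx = 2x."""
    return 2 * x

def descend(x, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient, starting from x."""
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

left = descend(-3.0)    # gradient is negative here, so x moves right
right = descend(3.0)    # gradient is positive here, so x moves left
print(left, right)
```

From either side the steps move toward x = 0, where the cost is lowest.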

Process of Gradient Descent: Math
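1. Find the gradient with respect to weights over training data.
• Plug data into our derivative function and sum up over data points.
$$\Delta W = \sum_{i=1}^{n} \frac{\partial J}{\partial W}(x_i, y_i)$$
$\Delta W$ is the number we’ll use to adjust the weight.
Derivative of MSE, assuming $J = \frac{1}{2n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ with a linear prediction $\hat{y}_i = W x_i$:
$$\Delta W = \frac{1}{n} \sum_{i=1}^{n} x_i (\hat{y}_i - y_i)$$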

Process of Gradient Descent: Math
2. Adjust the weight by subtracting some amount of $\Delta W$.
$$W := W - \alpha \cdot \Delta W$$
• The minus sign adjusts W in the correct direction.
• $\alpha$ (alpha) is known as the learning rate, a hyper-parameter we choose.
3. Repeat until model is done training.
• We can also adjust the learning rate as we train.
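The three steps can be sketched for a single weight. The example assumes a model of the form y_hat = W * x with no bias, uses made-up data generated from y = 2x, and picks alpha = 0.05; none of these come from the slides.

```python
def delta_w(w, xs, ys):
    """Step 1: (1/n) * sum of x_i * (y_hat_i - y_i), with y_hat_i = w * x_i."""
    n = len(xs)
    return sum(x * (w * x - y) for x, y in zip(xs, ys)) / n

def train(xs, ys, w=0.0, alpha=0.05, steps=200):
    """Steps 2 and 3: apply W := W - alpha * delta_W repeatedly."""
    for _ in range(steps):
        w = w - alpha * delta_w(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # made-up data with y = 2x
w = train(xs, ys)
print(w)
```

Starting from W = 0 the gradient is negative, so subtracting it increases W toward the value that fits the data.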

Adjusting the Learning Rate
[Figure: cost J plotted against W at a point where ∂J/∂W < 0, with the step α · ∆W marked. Shown three times: a baseline step, a bigger α, and a smaller α.]
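The effect of the learning rate can be seen on the simple cost J(W) = W^2, starting from W = -3 where the gradient is negative. The three alpha values below are assumptions picked to show slow, fast, and diverging behavior.

```python
def run(alpha, w=-3.0, steps=20):
    """Gradient descent on J(W) = W**2; returns every W visited."""
    path = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w   # dJ/dW = 2W
        path.append(w)
    return path

small = run(alpha=0.01)   # tiny steps: still far from 0 after 20 steps
medium = run(alpha=0.3)   # reaches the minimum quickly
large = run(alpha=1.1)    # each step overshoots further: diverges

print(small[-1], medium[-1], large[-1])
```

A bigger alpha is not always better: past a certain size each update lands farther from the minimum than where it started.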

Batches
How much data do we use for one training step?
• One training step takes us from old network weights to new network weights
We could use ALL of the examples at one time.
• Terrible performance -- if it is even possible
• We'll constantly be swapping memory to slow disks
We could use one example at a time.
• But terrible performance
• It doesn't take advantage of caching, vectorized operations, and so on
• We want a good data processing size for vectorized operations

Batching
How much data do we use for one training step?
• One training step takes us from old network weights to new network weights
Options
• Full batch
• Update weights after considering all data in batch
• Mini-batch
• Update weights after considering part of batch, repeat
• Approximating the gradient
• Can help with local minima

Batching
Options continued…
• Stochastic gradient descent (SGD)
• Mini batch with size 1
• Also called online training
• Very sporadic, very easy to compute
• With a big network, performance comes from many weights

Comparing Full Batch, Mini Batch, and SGD
                Stochastic                    Mini batch                                    Full batch
Batch size      1                             between 1 and N                               N
Data per step   Use single example per step   Use small portion of training data per step   Use all of training data per step

Epoch
One epoch is one pass through the entire dataset.
• Generally, the dataset is too big for system memory.
• Can't do this all in one go
General measure of the amount of training.
• How many epochs did I perform?

Shuffling Datasets for Epochs
After each epoch, shuffle the training data.
Prevents resampling in the exact same way.
• Different epochs sample data in different ways.
So…
Shuffle, make batches, repeat.
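"Shuffle, make batches, repeat" can be sketched with the standard library. The dataset of 10 integers, the batch size of 2, and the fixed seed are stand-ins chosen for the example.

```python
import random

def batches(data, batch_size, rng):
    """Shuffle a copy of data, then yield consecutive mini-batches."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

rng = random.Random(0)      # fixed seed so runs are repeatable
data = list(range(10))      # stand-in for 10 training examples
step = 0
for epoch in range(2):
    # A fresh shuffle each epoch gives different batches each time.
    for batch in batches(data, batch_size=2, rng=rng):
        step += 1           # one weight update would happen here
        print("epoch", epoch + 1, "step", step, "batch", batch)
```

With 10 examples and a batch size of 2, each epoch is 5 steps, so the second epoch begins at step 6.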

Splitting Data Up Into Batches
[Figure: the full batch is divided into Batch 1 through Batch 5. Each batch is one training step (Step 1 to Step 5); after Step 5 the first epoch is completed.]

Shuffle Data
[Figure: the data is shuffled and split into batches again, and the next epoch begins with Step 6.]

Special Issues With Overfitting
Very simple neural network architectures can approximate arbitrarily complex functions very well.
• Consequence of the universal approximation theorem
• Three layers, finite # nodes → arbitrarily good approximation
• Although better approximations may require n → big
Even simple neural networks are, in some sense, too powerful.

Many architectures easily overfit data.
• Simply chugging through the data over-and-over leads to overfit.
• Memorizes data but doesn't learn the generality.
• Easily misled by noise.
Traditionally, we control this by monitoring the performance on a test set.
• As long as it improves, we're good.
• When it starts going the wrong way, we stop.
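The stopping rule above can be sketched as a small function over a list of held-out losses. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions added for illustration.

```python
def early_stopping_epoch(held_out_losses, patience=1):
    """Return the index of the epoch whose weights we would keep."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(held_out_losses):
        if loss < best_loss:
            # Still improving: remember this epoch and reset the counter.
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            # Going the wrong way: stop once patience runs out.
            waited += 1
            if waited >= patience:
                break
    return best_epoch

losses = [0.90, 0.60, 0.45, 0.40, 0.43, 0.50]   # made-up numbers
print(early_stopping_epoch(losses))
```

Here the loss improves through the fourth epoch (index 3) and then rises, so training would stop and keep the weights from that epoch.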

Modern method uses a technique called dropout:
• Here we randomly have nodes disappear from the network.
• Everyone else still has to perform.
The overall network has to be more robust.
• Single nodes can't be too important.
• The nodes can't all be highly correlated with one another.
• Different nodes must respond to different stimuli

Knocking Out and Rescaling Neurons
• During training, we randomly drop each neuron with probability 1 − p (each neuron is kept with probability p).
• When running the model, we keep every neuron and scale its output by p.
• This ensures that the expected value of each neuron's output at run time matches what the next layer saw during training.
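A small sketch of the knock-out and rescaling steps, assuming p is the probability of keeping a neuron as described above. The function names, the value p = 0.8, and the constant activations are made up for the example.

```python
import random

def dropout_train(activations, p, rng):
    """Training: keep each activation with probability p, else zero it."""
    return [a if rng.random() < p else 0.0 for a in activations]

def dropout_run(activations, p):
    """Run time: keep every neuron but scale its output by p."""
    return [a * p for a in activations]

rng = random.Random(0)      # fixed seed for repeatability
p = 0.8
acts = [1.0] * 10000        # stand-in activations, all equal to 1

trained = dropout_train(acts, p, rng)
mean_train = sum(trained) / len(trained)
mean_run = sum(dropout_run(acts, p)) / len(acts)
print(mean_train, mean_run)
```

The average output under random dropping is close to the scaled run-time output, which is the point of the rescaling.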

Concept of a “Pseudo-Ensemble”

Multilayer Perceptron (MLP)
[Figure: fully connected network with inputs x1, x2, x3, layers of neurons (a), and outputs y1, y2, y3.]

MLP: General Process
1. Shuffle the data and split between train and test sets
2. Flatten the data
3. Convert class vectors to binary class matrices
4. Generate network architecture
5. Display network architecture
6. Define learning procedure
7. Fit model
8. Evaluate
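Steps 1 to 3 can be sketched without any framework. The helper names and the tiny 2 x 2 "images" below are hypothetical stand-ins for the real dataset and library calls, which the slides do not show.

```python
import random

def flatten(image):
    """Step 2: turn a 2-D grid (list of rows) into one flat list."""
    return [pixel for row in image for pixel in row]

def to_one_hot(label, num_classes):
    """Step 3: class index -> binary class vector."""
    return [1 if i == label else 0 for i in range(num_classes)]

def train_test_split(examples, test_fraction, rng):
    """Step 1: shuffle, then hold out a fraction for testing."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Tiny stand-in dataset: 2 x 2 "images" with labels 0..2.
examples = [([[i, i], [i, i]], i % 3) for i in range(10)]
train, test = train_test_split(examples, 0.2, random.Random(0))
x_train = [flatten(img) for img, _ in train]
y_train = [to_one_hot(lbl, 3) for _, lbl in train]
print(x_train[0], y_train[0])
```

Flattening is needed because a fully connected input layer expects one vector per example rather than a grid.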

MLP
Trains a simple MLP with dropout on the MNIST* dataset.
Gets to 98.40 percent test accuracy after 20 epochs.
• There is a lot of margin for parameter tuning
• 0.2 seconds per epoch on a K520 GPU

Convolutional Neural Networks (CNN)
Good to use when you have:
• Translational variance
• Huge number of parameters
We need to train models on translated data

CNN
Trains a simple convnet on the MNIST* dataset.
Gets to 99.25 percent test accuracy after 12 epochs.
• There is still a lot of margin for parameter tuning
• 0.16 seconds per epoch on a GRID K520 GPU

CNN: General Process
1. Shuffle dataset and split between train and test sets
2. Maintain grid structure of data
• Add a dimension to account for the single-channel images
3. Convert class vectors to binary class matrices
4. Define architecture
5. Define learning procedure
6. Fit model
7. Evaluate

CNN: Kernels
Like our image processing kernels, but we learn their weightings
• Instead of assuming Gaussian, we let the data determine the weights.
Example: 3 x 3
Input       Kernel
3 2 1       -1 0 1
1 2 3       -2 0 2
1 1 1       -1 0 1

Kernel Math
Input       Kernel      Output
3 2 1       -1 0 1
1 2 3       -2 0 2      2
1 1 1       -1 0 1
Output = (3 · −1) + (2 · 0) + (1 · 1) + (1 · −2) + (2 · 0) + (3 · 2) + (1 · −1) + (1 · 0) + (1 · 1)
= −3 + 1 − 2 + 6 − 1 + 1
= 2
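The same arithmetic can be written as a short sketch. It follows the slides in multiplying element by element without flipping the kernel; `apply_kernel` and `convolve_valid` are names made up for this example.

```python
def apply_kernel(patch, kernel):
    """Multiply matching entries of patch and kernel, then sum."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )

def convolve_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            apply_kernel([row[c:c + k] for row in image[r:r + k]], kernel)
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

patch = [[3, 2, 1], [1, 2, 3], [1, 1, 1]]
kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
print(apply_kernel(patch, kernel))   # 2
```

A 3 x 3 input with a 3 x 3 kernel gives a single output value; larger inputs give one value per kernel position.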

CNN: Pooling Layers
Reduce neighboring pixels.
Reduce dimensions of inputs (height and width).
No parameters!

CNN: Pooling Layers
(Average pool over whole layer)
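Both kinds of pooling can be sketched in a few lines. The 4 x 4 grid is made up, and the max-pool helper assumes a 2 x 2 window with stride 2 on a grid with even height and width.

```python
def max_pool_2x2(grid):
    """2 x 2 max pooling with stride 2 (assumes even height and width)."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]

def global_average_pool(grid):
    """Average over the whole layer: one number per feature map."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)

grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(grid)
print(pooled)                      # [[4, 2], [2, 8]]
print(global_average_pool(grid))   # 2.8125
```

Neither function has anything to learn, which is why pooling layers add no parameters.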

LeNet*: Example CNN Architecture
Use convolutions to learn features on image data.
• Used on the MNIST* dataset
Input: 28 x 28, with two pixels of padding (on all sides)
Convolution size: 5 x 5

LeNet*
C1 layer depth: 6
S2 Pooling: 2 x 2
Convolution size: 5 x 5
C3 layer depth: 16
S4 Pooling: 2 x 2
Flatten from 5 x 5 x 16 to 400 x 1
Fully connected layer: from 400 to 120
Fully connected layer: from 120 to 84
Fully connected layer: from 84 to 10
Softmax

Table Description of LeNet*-5
Layer   Name                             Parameters
1       Convolution                      5 x 5, stride 1, padding 2 (‘SAME’)
2       Max pool                         2 x 2, stride 2
3       Convolution                      5 x 5, stride 1, no padding (‘VALID’), so that 14 x 14 becomes 10 x 10
4       Max pool                         2 x 2, stride 2
5       Fully connected (ReLU)           Depth: 120
6       Fully connected (ReLU)           Depth: 84
7       Output (fully connected ReLU)    Depth: 10

What’s the Point? Count Parameters
Conv1: 1*6*5*5 + 6 = 156
Pool2: 0
Conv3: 6*16*5*5 + 16 = 2416
Pool4: 0
FC1: 400*120 + 120 = 48120
FC2: 120*84 + 84 = 10164
FC3: 84*10 + 10 = 850
Total: 61706
Less than a single FC layer with [1200 x 1200] weights!
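The counts above follow two simple formulas, sketched here: a convolution has in_channels * out_channels * k * k weights plus one bias per output channel, and a fully connected layer has n_in * n_out weights plus one bias per output. The function names are made up for this example.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights for every (in, out) channel pair plus one bias per output."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def fc_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

counts = {
    "Conv1": conv_params(1, 6, 5),
    "Conv3": conv_params(6, 16, 5),
    "FC1": fc_params(400, 120),
    "FC2": fc_params(120, 84),
    "FC3": fc_params(84, 10),
}
total = sum(counts.values())
for name, n in counts.items():
    print(name, n)
print("Total", total)
```

Most of the parameters sit in the first fully connected layer; the two convolutions together contribute fewer than 3000.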

What’s the Point? CNN Learns Features!
Layers replace manual image processing, transforming, and feature
extraction!
• For example, a slightly different architecture called AlexNet has a
layer that essentially performs Sobel filtering.
• Edge detection as a layer
• See:
• http://cs231n.github.io/assets/cnnvis/filt1.jpeg

Nodes
[Figure: computation graph. Inputs and W (var) feed MATMUL; MATMUL and b (var) feed Add; Add feeds Activation, which represents the activation function a = f(z).]

Nodes
[Figure: computation graph in which Inputs and W (var) feed MATMUL, MATMUL and b (var) feed Add, and Add feeds Activation.]
• X: [m x 1] vector of inputs
• W: [m x 1] vector of weights
• Result of MATMUL is scalar
• Bias is scalar
• The add operation outputs z
• The activation function applies a non-linear transformation and passes it along to the next layer

Batched Nodes
[Figure: computation graph in which Inputs and W (var) feed MATMUL, MATMUL and b (var) feed Add, and Add feeds Activation.]
• X: [n x m] matrix of inputs (batched)
• W: [m x 1] vector of weights
• Result of MATMUL is vector (one entry for each example)
• Bias is scalar (each prediction gets same bias added)
• The add operation outputs z as a vector, one entry for each example
• Activation is a vector, one entry for each input example