Recurrent Neural Network
● Predicting the future is what we do all the time
○ Finishing a friend’s sentence
○ Anticipating the smell of coffee at breakfast, or
○ Catching a ball on the field
● In this chapter, we will cover RNNs
○ Networks which can predict the future
● Unlike all the nets we have discussed so far
○ RNNs can work on sequences of arbitrary lengths
○ Rather than on fixed-sized inputs
Recurrent Neural Network
Recurrent Neural Network - Applications
● RNNs can analyze time series data
○ Such as stock prices, and
○ Tell you when to buy or sell
Recurrent Neural Network
Recurrent Neural Network - Applications
● In autonomous driving systems, RNNs can
○ Anticipate car trajectories and
○ Help avoid accidents
Recurrent Neural Network
Recurrent Neural Network - Applications
● RNNs can take sentences, documents, or audio samples as input
○ This makes them extremely useful
○ For natural language processing (NLP) systems such as
■ Automatic translation
■ Speech-to-text or
■ Sentiment analysis
Recurrent Neural Network
Recurrent Neural Network - Applications
● RNNs’ ability to anticipate also makes them capable of surprising
creativity.
○ You can ask them to predict the most likely next notes in a melody
○ Then randomly pick one of these notes and play it
○ Then ask the net for the next most likely note, play it, and repeat the
process again and again
Here is an example melody produced by Google’s Magenta project
Recurrent Neural Network
Recurrent Neural Network
● In this chapter we will learn about
○ Fundamental concepts in RNNs
○ The main problems RNNs face
○ And the solutions to those problems
○ How to implement RNNs
● Finally, we will take a look at the
○ Architecture of a machine translation system
Recurrent Neural Network
Recurrent Neurons
Recurrent Neural Network
Recurrent Neurons
● Up to now we have mostly looked at feedforward neural networks
○ Where the activations flow only in one direction
○ From the input layer to the output layer
● An RNN looks much like a feedforward neural network
○ Except it also has connections pointing backward
Recurrent Neural Network
Recurrent Neurons
● Let’s look at the simplest possible RNN
○ Composed of just one neuron receiving inputs
○ Producing an output, and
○ Sending that output back to itself
Recurrent Neural Network
Recurrent Neurons
● At each time step t (also called a frame)
○ This recurrent neuron receives the inputs x(t)
○ As well as its own output from the previous time step y(t–1)
A recurrent neuron (left), unrolled through time (right)
Recurrent Neural Network
Recurrent Neurons
● We can represent this tiny network against the time axis (see the figure below)
● This is called unrolling the network through time
A recurrent neuron (left), unrolled through time (right)
Recurrent Neural Network
Recurrent Neurons
● We can easily create a layer of recurrent neurons
● At each time step t, every neuron receives both the
○ Input vector x(t), and the
○ Output vector from the previous time step, y(t–1)
A layer of recurrent neurons (left), unrolled through time(right)
Recurrent Neural Network
Recurrent Neurons
● Each recurrent neuron has two sets of weights
○ One for the inputs x(t), and the
○ Other for the outputs of the previous time step, y(t–1)
● Let’s call these weight vectors wx and wy
● The equation below represents the output of a single recurrent neuron
Output of a single recurrent neuron for a single instance (b is the bias term,
and ϕ() is the activation function, e.g., ReLU)
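The equation itself appeared as an image on the slide; reconstructed here in LaTeX (standard formulation, with x(t) the input vector, y(t–1) the previous output, wx and wy the weight vectors, and b the bias):

$$ y_{(t)} = \phi\left( \mathbf{x}_{(t)}^\top \cdot \mathbf{w}_x + \mathbf{y}_{(t-1)}^\top \cdot \mathbf{w}_y + b \right) $$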
Recurrent Neural Network
Recurrent Neurons
● We can compute a whole layer’s output
○ In one shot for a whole mini-batch
○ Using a vectorized form of the previous equation
Outputs of a layer of recurrent neurons for all instances in a mini-batch
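The vectorized equation also appeared as an image; reconstructed here in LaTeX from the matrix definitions explained on the following slides:

$$ \mathbf{Y}_{(t)} = \phi\left( \mathbf{X}_{(t)} \cdot \mathbf{W}_x + \mathbf{Y}_{(t-1)} \cdot \mathbf{W}_y + \mathbf{b} \right) = \phi\left( \left[ \mathbf{X}_{(t)} \;\; \mathbf{Y}_{(t-1)} \right] \cdot \mathbf{W} + \mathbf{b} \right) \quad \text{with} \quad \mathbf{W} = \begin{bmatrix} \mathbf{W}_x \\ \mathbf{W}_y \end{bmatrix} $$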
Recurrent Neural Network
Recurrent Neurons
● Y(t) is an m × nneurons matrix containing the
○ Layer’s outputs at time step t for each instance in the mini-batch
○ m is the number of instances in the mini-batch
○ nneurons is the number of neurons
Outputs of a layer of recurrent neurons for all instances in a mini-batch
Recurrent Neural Network
Recurrent Neurons
● X(t) is an m × ninputs matrix containing the inputs for all instances
○ ninputs is the number of input features
Outputs of a layer of recurrent neurons for all instances in a mini-batch
Recurrent Neural Network
Recurrent Neurons
● Wx is an ninputs × nneurons matrix containing the connection weights for the
inputs of the current time step
● Wy is an nneurons × nneurons matrix containing the connection weights for
the outputs of the previous time step
Outputs of a layer of recurrent neurons for all instances in a mini-batch
Recurrent Neural Network
Recurrent Neurons
● The weight matrices Wx and Wy are often concatenated into a single
weight matrix W of shape (ninputs + nneurons) × nneurons
● b is a vector of size nneurons containing each neuron’s bias term
Outputs of a layer of recurrent neurons for all instances in a mini-batch
Recurrent Neural Network
Memory Cells
● Since the output of a recurrent neuron at time step t is a
○ Function of all the inputs from previous time steps
○ We can say that it has a form of memory
● A part of a neural network that
○ Preserves some state across time steps is called a memory cell
Recurrent Neural Network
Memory Cells
● In general, a cell’s state at time step t, denoted h(t), is a
○ Function of some inputs at that time step and
○ Its state at the previous time step: h(t) = f(h(t–1), x(t))
● Its output at time step t, denoted y(t), is also a
○ Function of the previous state and the current inputs
Recurrent Neural Network
Memory Cells
● In the case of the basic cells we have discussed so far
○ The output is simply equal to the state
○ But in more complex cells this is not always the case
A cell’s hidden state and its output may be different
Recurrent Neural Network
Input and Output Sequences
Sequence-to-sequence Network
● An RNN can simultaneously take a
○ Sequence of inputs and
○ Produce a sequence of outputs
Recurrent Neural Network
Input and Output Sequences
Sequence-to-sequence Network
● This type of network is useful for predicting time series
○ Such as stock prices
● We feed it the prices over the last N days and
○ It must output the prices shifted by one day into the future
○ i.e., from N – 1 days ago to tomorrow
Recurrent Neural Network
Input and Output Sequences
Sequence-to-vector Network
● Alternatively we could feed the network a sequence of inputs and
○ Ignore all outputs except for the last one
Recurrent Neural Network
Input and Output Sequences
Sequence-to-vector Network
● We can feed this network a sequence of words
○ Corresponding to a movie review and
○ The network would output a sentiment score
○ e.g., from –1 [hate] to +1 [love]
Recurrent Neural Network
Input and Output Sequences
Vector-to-sequence Network
● We could feed the network a single input at the first time step and
○ Zeros for all other time steps and
○ Let it output a sequence
● For example, the input could be an image and the
○ Output could be a caption for the image
Recurrent Neural Network
Input and Output Sequences
Encoder-Decoder
● In this network, we have a
○ Sequence-to-vector network, called an encoder, followed by a
○ Vector-to-sequence network, called a decoder
Recurrent Neural Network
Input and Output Sequences
Encoder-Decoder
● This can be used for translating a sentence
○ From one language to another
● We feed the network a sentence in one language
○ The encoder converts this sentence into a single vector representation
○ Then the decoder decodes this vector into a sentence in another
language
Recurrent Neural Network
Input and Output Sequences
Encoder-Decoder
● This two-step model works much better than
○ Trying to translate on the fly with a
○ Single sequence-to-sequence RNN
● Since the last words of a sentence can affect the
○ First words of the translation
○ So we need to wait until we know the whole sentence
Recurrent Neural Network
Basic RNNs in TensorFlow
Recurrent Neural Network
Basic RNNs in TensorFlow
● Let’s implement a very simple RNN model
○ Without using any of TensorFlow’s RNN operations
○ To better understand what goes on under the hood
● Let’s create an RNN composed of a layer of five recurrent neurons
○ Using the tanh activation function,
○ Running over only two time steps, and
○ Taking input vectors of size 3 at each time step
Recurrent Neural Network
Basic RNNs in TensorFlow
● This network looks like a two-layer feedforward neural network with two
differences
○ The same weights and bias terms are shared by both layers and
○ We feed inputs at each layer, and we get outputs from each layer
Recurrent Neural Network
Basic RNNs in TensorFlow
● To run the model, we need to feed it the inputs at both time steps
● The mini-batch contains four instances
○ Each with an input sequence composed of exactly two inputs
Recurrent Neural Network
Basic RNNs in TensorFlow
● At the end, Y0_val and Y1_val contain the outputs of the network
○ At both time steps for all neurons and
○ All instances in the mini-batch
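For reference, a minimal sketch of what the “Manual RNN” notebook section contains (assuming TensorFlow 1.x; variable names and input values are illustrative):
>>> import numpy as np
>>> import tensorflow as tf
>>> n_inputs = 3   # input vector size at each time step
>>> n_neurons = 5  # recurrent neurons in the layer
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])  # inputs at t = 0
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])  # inputs at t = 1
>>> Wx = tf.Variable(tf.random_normal([n_inputs, n_neurons], dtype=tf.float32))
>>> Wy = tf.Variable(tf.random_normal([n_neurons, n_neurons], dtype=tf.float32))
>>> b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))
>>> Y0 = tf.tanh(tf.matmul(X0, Wx) + b)                      # outputs at t = 0
>>> Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)  # outputs at t = 1
>>> init = tf.global_variables_initializer()
>>> X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
>>> X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1
>>> with tf.Session() as sess:
        init.run()
        Y0_val, Y1_val = sess.run([Y0, Y1],
                                  feed_dict={X0: X0_batch, X1: X1_batch})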
Recurrent Neural Network
Check out the complete code under the “Manual RNN” section in the notebook
Recurrent Neural Network
Static Unrolling Through Time
● Let’s look at how to create the same model
○ Using TensorFlow’s RNN operations
● The static_rnn() function creates
○ An unrolled RNN network by chaining cells
● The code below creates the exact same model as the previous one
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32
)
>>> Y0, Y1 = output_seqs
Recurrent Neural Network
Static Unrolling Through Time
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32
)
>>> Y0, Y1 = output_seqs
● First we create the input placeholders
Recurrent Neural Network
Static Unrolling Through Time
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32
)
>>> Y0, Y1 = output_seqs
● Then we create a BasicRNNCell
○ It is like a factory that creates
○ Copies of the cell to build the unrolled RNN
■ One for each time step
Recurrent Neural Network
Static Unrolling Through Time
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32
)
>>> Y0, Y1 = output_seqs
● Then we call static_rnn(), giving it the cell factory and the input
tensors
● And telling it the data type of the inputs
○ This is used to create the initial state matrix
○ Which by default is full of zeros
Recurrent Neural Network
Static Unrolling Through Time
>>> X0 = tf.placeholder(tf.float32, [None, n_inputs])
>>> X1 = tf.placeholder(tf.float32, [None, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, [X0, X1], dtype=tf.float32
)
>>> Y0, Y1 = output_seqs
● The static_rnn() function returns two objects
● The first is a Python list containing the output tensors for each time step
● The second is a tensor containing the final states of the network
● When we use basic cells
○ Then the final state is equal to the last output
Recurrent Neural Network
Static Unrolling Through Time
Check out the complete code under the “Using static_rnn()” section in the notebook
Recurrent Neural Network
Static Unrolling Through Time
● In the previous example, if there were 50 time steps then
○ It would not be convenient to define
○ 50 placeholders and 50 output tensors
● Moreover, at execution time we would have to feed
○ Each of the 50 placeholders and manipulate the 50 outputs
● Let’s do it in a better way
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● The above code takes a single input placeholder of
○ shape [None, n_steps, n_inputs]
○ Where the first dimension is the mini-batch size
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● Then it extracts the list of input sequences for each time step
● X_seqs is a Python list of n_steps tensors of shape [None, n_inputs]
○ Where first dimension is the minibatch size
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● To do this, we first swap the first two dimensions
○ Using the transpose() function so that the
○ Time steps are now the first dimension
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● Then we extract a Python list of tensors along the first dimension
○ i.e., one tensor per time step
○ Using the unstack() function
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● The next two lines are the same as before
Recurrent Neural Network
Static Unrolling Through Time
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> output_seqs, states = tf.contrib.rnn.static_rnn(
basic_cell, X_seqs, dtype=tf.float32
)
>>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
● Finally, we merge all the output tensors into a single tensor
○ Using the stack() function
● And then we swap the first two dimensions to get a
○ Final outputs tensor of shape [None, n_steps, n_neurons]
Recurrent Neural Network
Static Unrolling Through Time
● Now we can run the network by
○ Feeding it a single tensor that contains
○ All the mini-batch sequences
Recurrent Neural Network
Static Unrolling Through Time
● And then we get a single outputs_val tensor for
○ All instances
○ All time steps, and
○ All neurons
Recurrent Neural Network
Static Unrolling Through Time
Check out the complete code under the “Packing sequences” section in the notebook
Recurrent Neural Network
Static Unrolling Through Time
● The previous approach still builds a graph
○ Containing one cell per time step
● If there were 50 time steps, the graph would look ugly
● It is like writing a program without using for loops
○ Y0=f(0, X0); Y1=f(Y0, X1); Y2=f(Y1, X2); ...; Y50=f(Y49, X50)
● With such a large graph
○ Since it must store all tensor values during the forward pass
○ So it can use them to compute gradients during the reverse pass
○ We may get out-of-memory (OOM) errors during backpropagation
○ Especially on GPU cards, because of their limited memory
Recurrent Neural Network
Dynamic Unrolling Through Time
Let’s look at a better solution than the previous
approach, using the dynamic_rnn() function
Recurrent Neural Network
Dynamic Unrolling Through Time
● The dynamic_rnn() function uses a while_loop() operation to
○ Run over the cell the appropriate number of times
● We can set swap_memory=True
○ If we want it to swap the GPU’s memory to the CPU’s memory
○ During backpropagation, to avoid out-of-memory errors
● It also accepts a single tensor for
○ All inputs at every time step (shape [None, n_steps, n_inputs]) and
○ It outputs a single tensor for all outputs at every time step
■ (shape [None, n_steps, n_neurons])
○ There is no need to stack, unstack, or transpose
Recurrent Neural Network
Dynamic Unrolling Through Time
RNN using dynamic_rnn
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> outputs, states = tf.nn.dynamic_rnn(basic_cell, X,
dtype=tf.float32)
Recurrent Neural Network
Dynamic Unrolling Through Time
Check out the complete code under the “Using dynamic_rnn()” section in the notebook
Recurrent Neural Network
Dynamic Unrolling Through Time
Note
● During backpropagation
○ The while_loop() operation does the appropriate magic
○ It stores the tensor values for each iteration during the forward pass
○ So it can use them to compute gradients during the reverse pass
Recurrent Neural Network
Handling Variable Length Input Sequences
● So far we have used only fixed-size input sequences
● What if the input sequences have variable lengths (e.g., sentences)?
● In this case we should set the sequence_length parameter
○ When calling the dynamic_rnn() function
○ It must be a 1D tensor indicating the length of the
○ Input sequence for each instance
Recurrent Neural Network
Handling Variable Length Input Sequences
● Suppose the second input sequence contains
○ Only one input instead of two
○ Then it must be padded with a zero vector
○ In order to fit in the input tensor X
Recurrent Neural Network
Handling Variable Length Input Sequences
● Now we need to feed values for both placeholders X and seq_length
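A minimal sketch of how this might look (assuming the same X and basic_cell as in the dynamic_rnn() example, plus an init = tf.global_variables_initializer(); the input values are illustrative):
>>> seq_length = tf.placeholder(tf.int32, [None])
>>> outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                        sequence_length=seq_length)
>>> X_batch = np.array([
        # step 0      step 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])
>>> seq_length_batch = np.array([2, 1, 2, 2])  # actual length of each sequence
>>> with tf.Session() as sess:
        init.run()
        outputs_val, states_val = sess.run(
            [outputs, states],
            feed_dict={X: X_batch, seq_length: seq_length_batch})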
Recurrent Neural Network
Handling Variable Length Input Sequences
● Now the RNN outputs zero vectors for
○ Every time step past the input sequence length
○ Look at the second instance’s output for the second time step
Recurrent Neural Network
Handling Variable Length Input Sequences
● Moreover, the states tensor contains the final state of each cell
○ Excluding the zero vectors
Recurrent Neural Network
Handling Variable Length Input Sequences
Check out the complete code under the “Setting the sequence lengths” section in the notebook
Recurrent Neural Network
Handling Variable-Length Output Sequences
● What if the output sequences have variable lengths
● If we know in advance what length each sequence will have
○ For example if we know that it will be the same length as the input
sequence
○ Then we can set the sequence_length parameter as discussed
● Unfortunately, in general this will not be possible
○ For example,
■ The length of a translated sentence is generally different from the
■ Length of the input sentence
Recurrent Neural Network
Handling Variable-Length Output Sequences
● In this case, the most common solution is to define
○ A special output called an end-of-sequence token (EOS token)
● Any output past the EOS token should be ignored - We will discuss this in
detail later
Recurrent Neural Network
So far we have learned how to build an RNN.
But how do we train it?
Recurrent Neural Network
Training RNNs
Recurrent Neural Network
Training RNNs
● To train an RNN, the trick is to unroll it through time and
then simply use regular backpropagation
● This strategy is called backpropagation through time
(BPTT)
Recurrent Neural Network
Training RNNs
Understanding how RNNs are trained
Just like in regular backpropagation, there is a first forward pass
through the unrolled network, represented by the dashed
arrows
Recurrent Neural Network
Training RNNs
Understanding how RNNs are trained
Then the output sequence is evaluated using a cost function
C(Y(tmin), ..., Y(tmax)), where tmin and tmax are the first
and last output time steps, not counting the ignored outputs
Recurrent Neural Network
Then the gradients of that cost function are propagated
backward through the unrolled network, represented by the
solid arrows
Training RNNs
Understanding how RNNs are trained
Recurrent Neural Network
And finally the model parameters are updated using the
gradients computed during BPTT
Training RNNs
Understanding how RNNs are trained
Recurrent Neural Network
Note that the gradients flow backward through all the outputs
used by the cost function, not just through the final output
Training RNNs
Understanding how RNNs are trained
Recurrent Neural Network
Here, the cost function is computed using the last three outputs
of the network, Y(2), Y(3), and Y(4), so gradients flow through
these three outputs, but not through Y(0) and Y(1)
Training RNNs
Understanding how RNNs are trained
Recurrent Neural Network
Moreover, since the same parameters W and b are used at
each time step, backpropagation will do the right thing and sum
over all time steps
Training RNNs
Understanding how RNNs are trained
Recurrent Neural Network
Training a Sequence Classifier
Let’s train an RNN to classify MNIST images
Recurrent Neural Network
Training a Sequence Classifier
● A convolutional neural network would be better suited for image
classification
● But this makes for a simple example that we are already familiar with
Recurrent Neural Network
Training a Sequence Classifier
Overview of the task
● We will treat each image as a sequence of 28 rows of 28 pixels each,
since each MNIST image is 28 × 28 pixels
● We will use cells of 150 recurrent neurons, plus a fully connected
layer containing 10 neurons, one per class, connected to the output of
the last time step
● This will be followed by a softmax layer
Recurrent Neural Network
Overview of the task
Training a Sequence Classifier
Recurrent Neural Network
Construction Phase
● The construction phase is quite straightforward
● It’s pretty much the same as the MNIST classifier we built previously,
except that an unrolled RNN replaces the hidden layers
● Note that the fully connected layer is connected to the states tensor,
which contains only the final state of the RNN i.e., the 28th output
Training a Sequence Classifier
Recurrent Neural Network
Construction Phase
>>> from tensorflow.contrib.layers import fully_connected
>>> n_steps = 28
>>> n_inputs = 28
>>> n_neurons = 150
>>> n_outputs = 10
>>> learning_rate = 0.001
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> y = tf.placeholder(tf.int32, [None])
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> outputs, states = tf.nn.dynamic_rnn(basic_cell, X,
dtype=tf.float32)
Training a Sequence Classifier
Run it on Notebook
Recurrent Neural Network
Construction Phase
>>> logits = fully_connected(states, n_outputs, activation_fn=None)
>>> xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
labels=y, logits=logits)
>>> loss = tf.reduce_mean(xentropy)
>>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
>>> training_op = optimizer.minimize(loss)
>>> correct = tf.nn.in_top_k(logits, y, 1)
>>> accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
>>> init = tf.global_variables_initializer()
Training a Sequence Classifier
Run it on Notebook
Recurrent Neural Network
Load the MNIST data and reshape it
Now we will load the MNIST data and reshape the test data to [batch_size,
n_steps, n_inputs] as is expected by the network
>>> from tensorflow.examples.tutorials.mnist import input_data
>>> mnist = input_data.read_data_sets("data/mnist/")
>>> X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
>>> y_test = mnist.test.labels
Training a Sequence Classifier
Run it on Notebook
Recurrent Neural Network
Training the RNN
We reshape each training batch before feeding it to the network
>>> n_epochs = 100
>>> batch_size = 150
>>> with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            for iteration in range(mnist.train.num_examples // batch_size):
                X_batch, y_batch = mnist.train.next_batch(batch_size)
                X_batch = X_batch.reshape((-1, n_steps, n_inputs))
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
            print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
Training a Sequence Classifier
Run it on Notebook
Recurrent Neural Network
The Output
The output should look like this:
0 Train accuracy: 0.713333 Test accuracy: 0.7299
1 Train accuracy: 0.766667 Test accuracy: 0.7977
...
98 Train accuracy: 0.986667 Test accuracy: 0.9777
99 Train accuracy: 0.986667 Test accuracy: 0.9809
Training a Sequence Classifier
Recurrent Neural Network
Conclusion
● We get over 98% accuracy — not bad!
● Plus we would certainly get a better result by
○ Tuning the hyperparameters
○ Initializing the RNN weights using He initialization
○ Training longer
○ Or adding a bit of regularization e.g., dropout
Training a Sequence Classifier
Recurrent Neural Network
Training to Predict Time Series
Now, we will train an RNN to predict the next value in a
generated time series
Recurrent Neural Network
Training to Predict Time Series
● Each training instance is a randomly selected sequence of 20 consecutive
values from the time series
● And the target sequence is the same as the input sequence, except it is
shifted by one time step into the future
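One possible way to generate such training batches, as a sketch (the synthetic series and the next_batch() helper are assumptions for illustration; the notebook may generate its data differently):
>>> import numpy as np
>>> t_min, t_max = 0, 30
>>> resolution = 0.1
>>> def time_series(t):
        # an arbitrary synthetic signal used for illustration
        return t * np.sin(t) / 3 + 2 * np.sin(t * 5)
>>> def next_batch(batch_size, n_steps):
        # random starting points, then n_steps + 1 consecutive values each
        t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)
        Ts = t0 + np.arange(0., n_steps + 1) * resolution
        ys = time_series(Ts)
        # inputs: the first n_steps values; targets: the same values shifted by one step
        return (ys[:, :-1].reshape(-1, n_steps, 1),
                ys[:, 1:].reshape(-1, n_steps, 1))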
Recurrent Neural Network
Training to Predict Time Series
Construction Phase
● It will contain 100 recurrent neurons and we will unroll it over 20
time steps since each training instance will be 20 inputs long
● Each input will contain only one feature, the value at that time
● The targets are also sequences of 20 inputs, each containing a single
value
Recurrent Neural Network
Construction Phase
>>> n_steps = 20
>>> n_inputs = 1
>>> n_neurons = 100
>>> n_outputs = 1
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
>>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
activation=tf.nn.relu)
>>> outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Construction Phase
● At each time step we now have an output vector of size 100
● But what we actually want is a single output value at each time step
● The simplest solution is to wrap the cell in an
OutputProjectionWrapper
Training to Predict Time Series
Recurrent Neural Network
Construction Phase
● A cell wrapper acts like a normal cell, proxying every method call to an
underlying cell, but it also adds some functionality
● The OutputProjectionWrapper adds a fully connected layer of linear
neurons i.e., without any activation function on top of each output,
but it does not affect the cell state
● All these fully connected layers share the same trainable weights and bias
terms.
Training to Predict Time Series
Recurrent Neural Network
RNN cells using output projections
Training to Predict Time Series
Recurrent Neural Network
Wrapping a cell is quite easy
Let’s tweak the preceding code by wrapping the BasicRNNCell into an
OutputProjectionWrapper
>>> cell = tf.contrib.rnn.OutputProjectionWrapper(
        tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
        output_size=n_outputs)
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Cost Function and Optimizer
● Now we will define the cost function
● We will use the Mean Squared Error (MSE)
● Next we will create an Adam optimizer, the training op, and the variable
initialization op
>>> learning_rate = 0.001
>>> loss = tf.reduce_mean(tf.square(outputs - y))
>>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
>>> training_op = optimizer.minimize(loss)
>>> init = tf.global_variables_initializer()
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Execution Phase
>>> n_iterations = 10000
>>> batch_size = 50
>>> with tf.Session() as sess:
        init.run()
        for iteration in range(n_iterations):
            X_batch, y_batch = [...] # fetch the next training batch
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            if iteration % 100 == 0:
                mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
                print(iteration, "\tMSE:", mse)
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Execution Phase
The program’s output should look like this
0 MSE: 379.586
100 MSE: 14.58426
200 MSE: 7.14066
300 MSE: 3.98528
400 MSE: 2.00254
[...]
Training to Predict Time Series
Recurrent Neural Network
Making Predictions
Once the model is trained, you can make predictions:
>>> X_new = [...] # New sequences
>>> y_pred = sess.run(outputs, feed_dict={X: X_new})
Training to Predict Time Series
Recurrent Neural Network
Making Predictions
Training to Predict Time Series
The figure shows the predicted sequence for an instance, after 1,000 training iterations
Recurrent Neural Network
● Using an OutputProjectionWrapper is the simplest solution to reduce the
dimensionality of the RNN’s output sequences down to just one value per
time step per instance
● But it is not the most efficient
Training to Predict Time Series
Recurrent Neural Network
● There is a trickier but more efficient solution:
○ We can reshape the RNN outputs from [batch_size, n_steps,
n_neurons] to [batch_size * n_steps, n_neurons]
○ Then apply a single fully connected layer with the appropriate output
size in our case just 1, which will result in an output tensor of shape
[batch_size * n_steps, n_outputs]
○ And then reshape this tensor to [batch_size, n_steps, n_outputs]
Training to Predict Time Series
Recurrent Neural Network
Reshape the RNN outputs from [batch_size, n_steps, n_neurons] to
[batch_size * n_steps, n_neurons]
Training to Predict Time Series
Recurrent Neural Network
Apply a single fully connected layer with the appropriate output size (in our
case just 1), which will result in an output tensor of shape
[batch_size * n_steps, n_outputs]
Training to Predict Time Series
Recurrent Neural Network
And then reshape this tensor to [batch_size, n_steps, n_outputs]
Training to Predict Time Series
Recurrent Neural Network
Let’s implement this solution
● We first revert to a basic cell, without the OutputProjectionWrapper
>>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
activation=tf.nn.relu)
>>> rnn_outputs, states = tf.nn.dynamic_rnn(cell, X,
dtype=tf.float32)
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Let’s implement this solution
● Then we stack all the outputs using the reshape() operation, apply the
fully connected linear layer without any activation function (this is just a
projection), and finally unstack all the outputs, again using reshape()
>>> stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
>>> stacked_outputs = fully_connected(stacked_rnn_outputs,
n_outputs, activation_fn=None)
>>> outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
Training to Predict Time Series
Run it on Notebook
Recurrent Neural Network
Let’s implement this solution
● The rest of the code is the same as earlier. This can provide a significant
speed boost since there is just one fully connected layer instead of one
per time step.
Training to Predict Time Series
Recurrent Neural Network
Creative RNN
Let’s use our model to generate some creative sequences
Recurrent Neural Network
Creative RNN
● All we need is to provide it a seed sequence containing n_steps
values e.g., full of zeros
● Use the model to predict the next value
● Append this predicted value to the sequence
● Feed the last n_steps values to the model to predict the next value
● And so on
This process generates a new sequence that has some resemblance to the
original time series
Recurrent Neural Network
Creative RNN
>>> sequence = [0.] * n_steps
>>> for iteration in range(300):
        X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence.append(y_pred[0, -1, 0])
Run it on Notebook
Recurrent Neural Network
Creative RNN
Creative sequences seeded with zeros
Recurrent Neural Network
Creative RNN
Creative sequences seeded with an instance
Recurrent Neural Network
Deep RNNs
Recurrent Neural Network
Deep RNNs
● It is quite common to stack multiple layers of cells.
● This gives you a Deep RNN
A Deep RNN
Recurrent Neural Network
Deep RNNs
Deep RNN unrolled through time
Recurrent Neural Network
Deep RNNs
How to implement Deep RNN in TensorFlow
Recurrent Neural Network
● To implement a deep RNN in TensorFlow
● We can create several cells and stack them into a MultiRNNCell
● In the following code we stack three identical cells
>>> n_neurons = 100
>>> n_layers = 3
>>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([basic_cell] *
n_layers)
>>> outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X,
dtype=tf.float32)
Deep RNNs - Implementation in TensorFlow
Run it on Notebook
Recurrent Neural Network
>>> outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X,
dtype=tf.float32)
● The states variable is a tuple containing one tensor per layer, each
representing the final state of that layer’s cell with shape [batch_size,
n_neurons]
● If you set state_is_tuple=False when creating the MultiRNNCell,
then states becomes a single tensor containing the states from every
layer, concatenated along the column axis i.e., its shape is [batch_size,
n_layers * n_neurons]
Deep RNNs - Implementation in TensorFlow
Recurrent Neural Network
● If you build a very deep RNN, it may end up overfitting the training set
● To prevent that, a common technique is to apply dropout
● You can simply add a dropout layer before or after the RNN as usual
● But if you also want to apply dropout between the RNN layers, you need
to use a DropoutWrapper
Deep RNNs - Applying Dropout
Recurrent Neural Network
● The following code applies dropout to the inputs of each layer in the
RNN, dropping each input with a 50% probability
>>> keep_prob = 0.5
>>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> cell_drop = tf.contrib.rnn.DropoutWrapper(cell,
input_keep_prob=keep_prob)
>>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([cell_drop] *
n_layers)
>>> rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X,
dtype=tf.float32)
Deep RNNs - Applying Dropout
Run it on Notebook
Recurrent Neural Network
● It is also possible to apply dropout to the outputs by setting
output_keep_prob
● The main problem with this code is that it will apply dropout not only
during training but also during testing, which is not what we want
● Since dropout should be applied only during training
Deep RNNs - Applying Dropout
Recurrent Neural Network
● Unfortunately, the DropoutWrapper does not support an is_training
placeholder
● So we must either write our own dropout wrapper class, or have two
different graphs:
○ One for training
○ And the other for testing
Let’s implement the second option
Deep RNNs - Applying Dropout
Recurrent Neural Network
>>> import sys
>>> is_training = (sys.argv[-1] == "train")
>>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
>>> y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
>>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
>>> if is_training:
        cell = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
>>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([cell] * n_layers)
>>> rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X,
                                            dtype=tf.float32)
[...] # build the rest of the graph
>>> init = tf.global_variables_initializer()
>>> saver = tf.train.Saver()
>>> with tf.Session() as sess:
        if is_training:
            init.run()
            for iteration in range(n_iterations):
                [...] # train the model
            save_path = saver.save(sess, "/tmp/my_model.ckpt")
        else:
            saver.restore(sess, "/tmp/my_model.ckpt")
            [...] # use the model
Run it on Notebook
Deep RNNs - Applying Dropout
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● To train an RNN on long sequences, we will need to run it over many
time steps, making the unrolled RNN a very deep network
● Just like any deep neural network it may suffer from the
vanishing/exploding gradients problem and take forever to train
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● Many of the tricks we discussed to alleviate this problem can be used for
deep unrolled RNNs as well:
○ good parameter initialization,
○ nonsaturating activation functions e.g., ReLU
○ Batch Normalization,
○ Gradient Clipping,
○ And faster optimizers
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● However, if the RNN needs to handle even moderately long sequences
e.g., 100 inputs, then training will still be very slow
● The simplest and most common solution to this problem is to unroll the
RNN only over a limited number of time steps during training
● This is called truncated backpropagation through time
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● In TensorFlow you can implement truncated backpropagation
through time simply by truncating the input sequences
● For example, in the time series prediction problem, you would simply
reduce n_steps during training
● The problem with this is that the model will not be able to learn
long-term patterns
How can we solve this problem?
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● One workaround could be to make sure that these shortened sequences
contain both old and recent data
● So that the model can learn to use both
● E.g., the sequence could contain monthly data for the last five months,
then weekly data for the last five weeks, then daily data over the last five
days
● But this workaround has its limits:
○ What if fine-grained data from last year is actually useful?
○ What if there was a brief but significant event that absolutely must be
taken into account, even years later?
○ E.g., the result of an election
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● Besides the long training time
○ A second problem faced by long-running RNNs is the fact that the
memory of the first inputs gradually fades away
○ Indeed, due to the transformations that the data goes through when
traversing an RNN, some information is lost after each time step.
● After a while, the RNN’s state contains virtually no trace of the first
inputs
Let’s understand this with an example
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● Say you want to perform sentiment analysis on a long review that starts
with the four words “I loved this movie,”
● But the rest of the review lists the many things that could have made the
movie even better
● If the RNN gradually forgets the first four words, it will completely
misinterpret the review
Deep RNNs
Recurrent Neural Network
The Difficulty of Training over Many Time Steps
● To solve this problem, various types of cells with long-term memory have
been introduced
● They have proved so successful that the basic cells are not much used
anymore
Let’s study about these long memory cells
Deep RNNs
Recurrent Neural Network
LSTM Cell
Recurrent Neural Network
● The Long Short-Term Memory (LSTM) cell was proposed in 1997 by
Sepp Hochreiter and Jürgen Schmidhuber
● And it was gradually improved over the years by several researchers,
such as Alex Graves, Haşim Sak, Wojciech Zaremba, and many more
LSTM Cell
Sepp Hochreiter Jürgen Schmidhuber
Recurrent Neural Network
LSTM Cell
● If you consider the LSTM cell as a black box, it can be used very much
like a basic cell
● Except
○ It will perform much better
○ Training will converge faster
○ And it will detect long-term dependencies in the data
In TensorFlow, you can simply use a BasicLSTMCell instead of a
BasicRNNCell
>>> lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
Recurrent Neural Network
LSTM Cell
● LSTM cells manage two state vectors, and for performance reasons they
are kept separate by default
● We can change this default behavior by setting state_is_tuple=False
when creating the BasicLSTMCell
Recurrent Neural Network
LSTM Cell
The architecture of a basic LSTM cell
Recurrent Neural Network
● The LSTM cell looks exactly like a regular cell, except that its state is
split in two vectors: h(t) and c(t) (here “c” stands for “cell”)
LSTM Cell
Recurrent Neural Network
● We can think of h(t) as the short-term state and c(t) as the long-term state
LSTM Cell
Recurrent Neural Network
Understanding the LSTM cell structure
● The key idea is that the network can learn
○ What to store in the long-term state,
○ What to throw away,
○ And what to read from it
LSTM Cell
Recurrent Neural Network
As the long-term state c(t–1) traverses the network from left to right, it first
goes through a forget gate, dropping some memories
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
Understanding the LSTM cell structure
LSTM Cell
And then it adds some new memories via the addition operation, which adds
the memories that were selected by an input gate
Recurrent Neural Network
The result c(t) is sent straight out, without any further transformation. So, at
each time step, some memories are dropped and some memories are added
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
Moreover, after the addition operation, the long-term state is copied and
passed through the tanh function, and then the result is filtered by the
output gate
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
This produces the short-term state h(t), which is equal to the cell’s output
for this time step, y(t)
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
Now let’s look at where new memories come from and how the
gates work
LSTM Cell
Recurrent Neural Network
First, the current input vector x(t) and the previous short-term state h(t–1)
are fed to four different fully connected layers. They all serve a different
purpose
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
The main layer is the one that outputs g(t). It has the usual role of analyzing
the current inputs x(t) and the previous short-term state h(t–1). In an LSTM
cell this layer’s output is partially stored in the long-term state.
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
The three other layers are gate controllers. Since they use the logistic
activation function, their outputs range from 0 to 1.
Understanding the LSTM cell structure
LSTM Cell
Recurrent Neural Network
● This summarizes how to compute the cell’s long-term state, its short-term
state, and its output at each time step for a single instance
● The equations for a whole mini-batch are very similar
Understanding the LSTM cell structure
LSTM Cell
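The equations themselves appeared as an image on the slide; here is the standard single-instance formulation (σ is the logistic function, ⊗ denotes element-wise multiplication, and the W and b terms are the weights and biases of the four fully connected layers):

$$
\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\left(\mathbf{W}_{xi}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hi}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_i\right) \\
\mathbf{f}_{(t)} &= \sigma\left(\mathbf{W}_{xf}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hf}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_f\right) \\
\mathbf{o}_{(t)} &= \sigma\left(\mathbf{W}_{xo}^\top \mathbf{x}_{(t)} + \mathbf{W}_{ho}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_o\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_g\right) \\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\left(\mathbf{c}_{(t)}\right)
\end{aligned}
$$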
Recurrent Neural Network
Conclusion
● An LSTM cell can learn to
○ Recognize an important input, that’s the role of the input gate,
○ Store it in the long-term state,
○ Learn to preserve it for as long as it is needed, that’s the role of the
forget gate,
○ And learn to extract it whenever it is needed
This explains why they have been amazingly successful at capturing
long-term patterns in time series, long texts, audio recordings, and more.
LSTM Cell
Recurrent Neural Network
Peephole Connections
● In a basic LSTM cell, the gate controllers can look only at the input x(t)
and the previous short-term state h(t–1)
● It may be a good idea to give them a bit more context by letting them
peek at the long-term state as well
● This idea was proposed by Felix Gers and Jürgen Schmidhuber in
2000
Recurrent Neural Network
● They proposed an LSTM variant with extra connections called
peephole connections:
○ The previous long-term state c(t–1) is added as an input to the
controllers of the forget gate and the input gate,
○ And the current long-term state c(t) is added as an input to the
controller of the output gate.
Peephole Connections
Recurrent Neural Network
Peephole Connections
Recurrent Neural Network
Peephole Connections
To implement peephole connections in TensorFlow, you must use the
LSTMCell instead of the BasicLSTMCell and set use_peepholes=True:
>>> lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons,
use_peepholes=True)
There are many other variants of the LSTM cell.
One particularly popular variant is the GRU cell, which we will look at now.
Recurrent Neural Network
GRU Cell
Recurrent Neural Network
GRU Cell
The Gated Recurrent Unit (GRU) cell was proposed by Kyunghyun Cho
et al. in a 2014 paper that also introduced the Encoder–Decoder network
we discussed earlier
Kyunghyun Cho
Recurrent Neural Network
GRU Cell
● The GRU cell is a simplified version of the LSTM cell
● It seems to perform just as well
● This explains its growing popularity
Recurrent Neural Network
GRU Cell
The main simplifications are:
● Both state vectors are merged into a single vector h(t)
Recurrent Neural Network
The main simplifications are:
● A single gate controller controls both the forget gate and the input gate.
If the gate controller outputs a 1, the input gate is open and the forget
gate is closed.
GRU Cell
Recurrent Neural Network
The main simplifications are:
If it outputs a 0, the opposite happens.
In other words, whenever a memory must be stored, the location where it
will be stored is erased first. This is actually a frequent variant of the LSTM
cell in and of itself.
GRU Cell
Recurrent Neural Network
The main simplifications are:
● There is no output gate; the full state vector is output at every time step.
There is a new gate controller that controls which part of the previous
state will be shown to the main layer.
GRU Cell
Recurrent Neural Network
Equations to compute the GRU cell’s state at each time step for a single
instance
GRU Cell
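The equations appeared as an image; here is a standard formulation consistent with the description above (z(t) is the single gate controller, r(t) the new gate controller that filters the previous state; the notation mirrors the LSTM equations):

$$
\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\left(\mathbf{W}_{xz}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hz}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_z\right) \\
\mathbf{r}_{(t)} &= \sigma\left(\mathbf{W}_{xr}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hr}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_r\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \left(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}\right) + \mathbf{b}_g\right) \\
\mathbf{h}_{(t)} &= \left(1 - \mathbf{z}_{(t)}\right) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
\end{aligned}
$$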
Recurrent Neural Network
Implementing GRU cell in TensorFlow
>>> gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
● LSTM or GRU cells are one of the main reasons behind the success of
RNNs in recent years
● In particular for applications in natural language processing (NLP)
GRU Cell
Recurrent Neural Network
Natural Language Processing
Recurrent Neural Network
Natural Language Processing
● Most of the state-of-the-art NLP applications, such as
○ Machine translation,
○ Automatic summarization,
○ Parsing,
○ Sentiment analysis,
○ and more, are now based on RNNs
Now we will take a quick look at what a machine translation model looks
like.
This topic is very well covered by TensorFlow’s awesome Word2Vec and
Seq2Seq tutorials, so you should definitely check them out
Recurrent Neural Network
Natural Language Processing - Word Representation
Before we start, we need to answer this important question
How do we represent a “word” ??
Recurrent Neural Network
Natural Language Processing - Word Representation
In order to apply algorithms,
we need to convert everything into numbers.
What can we do about the climate column?

temp | climate | comments
12   | Cold    | Very nice place to visit in summers
30   | Hot     | Do not visit. This is a trap
Recurrent Neural Network
Natural Language Processing - Word Representation
In order to apply algorithms,
We need to convert everything in numbers.
What can we do about climate?
We can convert it into One-Hot vector
temp climate comments
12 Cold Very nice place to
visit in summers
30 Hot Do not visit. This
is a trap
temp climate_cold climate_hot comments
12 1 0 Very nice place
to visit in
summers
30 0 1 Do not visit.
This is a trap
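As a quick sketch of this one-hot conversion (using pandas purely for illustration; it is not part of the original slides):
>>> import pandas as pd
>>> df = pd.DataFrame({
        "temp": [12, 30],
        "climate": ["Cold", "Hot"],
        "comments": ["Very nice place to visit in summers",
                     "Do not visit. This is a trap"],
    })
>>> # one-hot encode the categorical 'climate' column into climate_Cold / climate_Hot
>>> print(pd.get_dummies(df, columns=["climate"]))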
Recurrent Neural Network
Natural Language Processing - Word Representation
In order to apply algorithms,
We need to convert everything in numbers.
And what can we do about comments?
temp climate comments
12 Cold Very nice place to
visit in summers
30 Hot Do not visit. This
is a trap
Recurrent Neural Network
One option could be to represent each word using a one-hot vector.
But consider this :
● Suppose your vocabulary contains 50,000 words
● Then the nth word would be represented as a 50,000-dimensional
vector, full of 0s except for a 1 at the nth position
● However, with such a large vocabulary, this sparse representation would
not be efficient at all
Natural Language Processing - Word Representation
Recurrent Neural Network
● Ideally, we want similar words to have similar representations,
making it easy for the model to generalize what it learns about a word to
all similar words
● For example,
○ If the model is told that “I drink milk” is a valid sentence, and if it
knows that “milk” is close to “water” but far from “shoes”
○ Then it will know that “I drink water” is probably a valid sentence
as well
○ While “I drink shoes” is probably not
But how can you come up with such a meaningful representation?
Natural Language Processing - Word Representation
Recurrent Neural Network
● The most common solution is to represent each word in the vocabulary
using a fairly small and dense vector e.g., 150 dimensions, called an
Embedding
● And just let the neural network learn a good embedding for each word
during training
Natural Language Processing - Word Embedding
Recurrent Neural Network
With word embeddings, a lot of magic is possible:
king - man + woman == queen
Natural Language Processing - Word Embedding
Recurrent Neural Network
from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'],
negative=['man'], topn=1)
print(result)
Word Embedding - word2vec
● Based on the contexts in which words appear, people have trained such vectors.
● One such set of pre-trained vectors is word2vec; another is GloVe
[('queen', 0.7118192315101624)]
Recurrent Neural Network
Word Embedding - Vector space models (VSMs)
Based on the Distributional Hypothesis:
○ words that appear in the same contexts share semantic meaning.
Two Approaches:
1. Count-based methods (e.g. Latent Semantic Analysis)
2. Predictive methods (e.g. neural probabilistic language models)
Recurrent Neural Network
Word Embedding - word2vec - Approaches
1. Count-based methods (e.g. Latent Semantic Analysis)
○ Compute the statistics of how often some word co-occurs with its
neighbor words in a large text corpus
○ Map these count-statistics down to a small, dense vector for each
word
Recurrent Neural Network
2. Predictive models
○ Directly try to predict a word from its neighbors
○ in terms of learned small, dense embedding vectors
○ (considered parameters of the model).
Word Embedding - word2vec - Approaches
Recurrent Neural Network
Computationally-efficient predictive model
for learning word embeddings from raw text.
word2vec
Comes in two flavors:
1. Continuous Bag-of-Words model (CBOW)
2. Skip-Gram model
Recurrent Neural Network
Computationally-efficient predictive model
for learning word embeddings from raw text.
word2vec
Comes in two flavors:
1. Continuous Bag-of-Words model (CBOW)
○ predicts target words (e.g. 'mat') from source context words
○ e.g ('the cat sits on the'),
2. Skip-Gram model
Recurrent Neural Network
Computationally-efficient predictive model
for learning word embeddings from raw text.
word2vec
Comes in two flavors:
1. Continuous Bag-of-Words model (CBOW)
○ predicts target words (e.g. 'mat') from source context words
○ e.g ('the cat sits on the'),
2. Skip-Gram model
○ Predicts source context-words from the target words
○ Treats each context-target pair as a new observation
○ Tends to do better when we have larger datasets.
○ Will focus on this
Recurrent Neural Network
Neural probabilistic language models
● are traditionally trained using the maximum likelihood (ML) principle
● to maximize the probability of the next word wt (for "target")
● given the previous words h (for "history"), in terms of a softmax function
word2vec: Scaling up Noise-Contrastive Training
Recurrent Neural Network
Neural probabilistic language models
● are traditionally trained using the maximum likelihood (ML) principle
● to maximize the probability of the next word wt (for "target")
● given the previous words h (for "history"), in terms of a softmax function
word2vec: Scaling up Noise-Contrastive Training
where score(wt, h) computes the compatibility of word wt with the context h (a dot product is
commonly used). We train this model by maximizing its log-likelihood, i.e.
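The two equations referred to above appeared as images; reconstructed here in the form used by the TensorFlow word2vec tutorial referenced earlier (V is the vocabulary):

$$ P(w_t \mid h) = \mathrm{softmax}\big(\mathrm{score}(w_t, h)\big) = \frac{\exp\big(\mathrm{score}(w_t, h)\big)}{\sum_{w' \in V} \exp\big(\mathrm{score}(w', h)\big)} $$

$$ J_{\mathrm{ML}} = \log P(w_t \mid h) = \mathrm{score}(w_t, h) - \log\!\left(\sum_{w' \in V} \exp\big(\mathrm{score}(w', h)\big)\right) $$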
Recurrent Neural Network
Neural probabilistic language models
● are traditionally trained using the maximum likelihood (ML) principle
● to maximize the probability of the next word wt (for "target")
● given the previous words h (for "history"), in terms of a softmax function
word2vec: Scaling up Noise-Contrastive Training
where score(wt, h) computes the compatibility of word wt with the context h (a dot product is
commonly used). We train this model by maximizing its log-likelihood, i.e.
This is very expensive, because we need to compute and normalize each probability using the
score for all other V words w' in the current context, at every training step.
Recurrent Neural Network
Neural probabilistic language models
● are traditionally trained using the maximum likelihood (ML) principle
● to maximize the probability of the next word wt (for "target")
● given the previous words h (for "history"), in terms of a softmax function
word2vec: Scaling up Noise-Contrastive Training
This is very expensive, because we need to compute and normalize each probability using the
score for all other V words w' in the current context, at every training step.
Recurrent Neural Network
Instead, models are trained using a binary classification objective (logistic regression)
to discriminate the real target words wt from k imaginary (noise) words w, in the same context.
word2vec: Scaling up Noise-Contrastive Training
1. Computing the loss function now scales only with the number of noise words that we select,
and not with all words in the vocabulary
2. This makes it much faster to train.
3. We will use a similar noise-contrastive estimation (NCE) loss: tf.nn.nce_loss().
Recurrent Neural Network
the quick brown fox jumped over the lazy dog
Word2vec: Context Example
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
Context: word to the left and word to the right.
Recurrent Neural Network
the quick brown fox jumped over the lazy dog
Word2vec: Skip Gram Model
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
Task becomes to predict 'the' and 'brown' from 'quick', 'quick' and 'fox' from 'brown', etc.
Skip-gram
● inverts contexts and targets, and
● tries to predict each context word from its target word
Recurrent Neural Network
Natural Language Processing - Word Embedding
Let’s imagine that at training step t
● For the first case above, the goal is to predict 'the' from 'quick'
● We select num_noise number
○ of noisy (contrastive) examples
○ by drawing from some noise distribution,
○ typically the unigram distribution
● For simplicity let’s say num_noise=1 and we select 'sheep' as a noisy
example. Next we compute the loss for this pair of observed and noisy
examples
Recurrent Neural Network
Natural Language Processing - Word Embedding
The objective at time step t becomes:
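The objective itself appeared as an image; in the form used by the TensorFlow word2vec tutorial, where Qθ(D = 1 | w, h) is the model’s probability that word w seen in context h came from the real data:

$$ J^{(t)}_{\mathrm{NEG}} = \log Q_\theta\big(D = 1 \mid \text{the}, \text{quick}\big) + \log\big( Q_\theta\big(D = 0 \mid \text{sheep}, \text{quick}\big) \big) $$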
Recurrent Neural Network
Natural Language Processing - Word Embedding
● The goal is to make an update to the embedding parameters
● to improve (in this case, maximize) the objective function
● We do this by deriving the gradient of the loss with respect to the
embedding parameters (luckily TensorFlow provides easy helper
functions for doing this!)
● We then perform an update to the embeddings by taking a small step in
the direction of the gradient. When this process is repeated over the
entire training set, this has the effect of 'moving' the embedding vectors
around for each word until the model is successful at discriminating real
words from noise words.
Recurrent Neural Network
Natural Language Processing - Word Embedding
● At the beginning of training, embeddings are simply chosen randomly,
● But during training, backpropagation automatically moves the
embeddings around in a way that helps the neural network perform its
task
Recurrent Neural Network
Natural Language Processing - Word Embedding
● Typically this means that similar words will gradually cluster close to one
another, and even end up organized in a rather meaningful way.
● For example, embeddings may end up placed along various axes that
represent
○ gender,
○ singular/plural,
○ adjective/noun,
○ and so on
Recurrent Neural Network
Natural Language Processing - Word Embedding
How to do it in TensorFlow
In TensorFlow, we first need to create the variable representing the
embeddings for every word in our vocabulary, initialized randomly:
>>> vocabulary_size = 50000
>>> embedding_size = 150
>>> embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
Recurrent Neural Network
How to do it in TensorFlow - Preprocessing
Suppose we want to feed the sentence "I drink milk" to our neural
network.
● We should first preprocess the sentence and break it into a list of
known words
● For example
○ We may remove unnecessary characters, replace unknown words by a
predefined token word such as “[UNK]”,
○ Replace numerical values by “[NUM]”,
○ Replace URLs by “[URL]”,
○ And so on
Natural Language Processing - Word Embedding
Recurrent Neural Network
How to do it in TensorFlow
● Once we have a list of known words, we can look up each word’s integer
identifier from 0 to 49999 in a dictionary, for example [72, 3335, 288]
● At that point, we are ready to feed these word identifiers to TensorFlow
using a placeholder, and apply the embedding_lookup() function to get
the corresponding embeddings
>>> train_inputs = tf.placeholder(tf.int32, shape=[None])     # from ids...
>>> embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings
Natural Language Processing - Word Embedding
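Putting the two snippets above together, here is a minimal runnable sketch (TensorFlow 1.x) of looking up the embeddings for the hypothetical identifiers [72, 3335, 288] of "I drink milk"; the session code is an illustration, not part of the original slides.

import tensorflow as tf

vocabulary_size = 50000
embedding_size = 150
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
train_inputs = tf.placeholder(tf.int32, shape=[None])
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # "I drink milk" -> word ids [72, 3335, 288] (example from the slides)
    vectors = sess.run(embed, feed_dict={train_inputs: [72, 3335, 288]})
    print(vectors.shape)   # (3, 150): one 150-dimensional vector per word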
Recurrent Neural Network
● Once our model has learned good word embeddings, it can actually be
reused fairly efficiently in any NLP application
● In fact, instead of training our own word embeddings, we may want to
download pre-trained word embeddings
● Just like when reusing pretrained layers, we can choose to
○ Freeze the pretrained embeddings
○ Or let backpropagation tweak them for our application
● The first option will speed up training, but the second may lead to slightly
higher performance
Natural Language Processing - Word Embedding
Recurrent Neural Network
Machine Translation
Recurrent Neural Network
Machine Translation
We now have almost all the tools we need to implement a
machine translation system
Let’s look at this now
Recurrent Neural Network
Machine Translation
An Encoder–Decoder Network for Machine Translation
Let’s take a look at a simple machine translation model that will translate
English sentences to French
Recurrent Neural Network
Machine Translation
An Encoder–Decoder Network for Machine Translation
A simple machine translation model
Recurrent Neural Network
Machine Translation
Let’s learn how this Encoder–Decoder Network for Machine
Translation is trained
Recurrent Neural Network
The English
sentences are fed to
the encoder, and
the decoder
outputs the French
translations
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
Note that the
French translations
are also used as
inputs to the
decoder, but
pushed back by one
step
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
Machine Translation
An Encoder–Decoder Network for Machine Translation
In other words, the
decoder is given as input
the word that it should
have output at the
previous step, regardless
of what it actually output
at that step (see the
sketch after this slide)
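Below is a minimal sketch of this "teacher forcing" idea with hypothetical token ids: during training, the decoder's input sequence is simply the target sequence shifted right by one step and prefixed with the <go> token.

GO, EOS = 1, 2                            # assumed ids for "<go>" and "<eos>"
target_ids = [45, 12, 873, EOS]           # a hypothetical French target sentence

decoder_inputs  = [GO] + target_ids[:-1]  # what the decoder reads:  [1, 45, 12, 873]
decoder_targets = target_ids              # what it should output:   [45, 12, 873, 2]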
Recurrent Neural Network
For the very first word,
the decoder is given a
token that represents the
beginning of the sentence
(here, “<go>”)
The decoder is expected
to end the sentence with
an end-of-sequence (EOS)
token (here, “<eos>”)
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
Question:
Why are the English
sentences reversed before
feeding them to the encoder?
Here “I drink
milk” is reversed to
“milk drink I”
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
Answer:
This ensures that the
beginning of the English
sentence will be fed last
to the encoder, which is
useful because that’s
generally the first thing
that the decoder needs to
translate
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● Each word is initially
represented by a
simple integer identifier
● e.g., 288 for the word
“milk”
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● Next, an embedding
lookup returns the
word embedding
● This is a dense, fairly
low-dimensional vector
● These word
embeddings are what is
actually fed to the
encoder and the
decoder
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● At each step, the
decoder outputs a
score for each word in
the output vocabulary
(i.e., French)
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● And then the Softmax
layer turns these
scores into
probabilities
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● For example, at the
first step the word “Je”
may have a probability
of 20%, “Tu” may have
a probability of 1%, and
so on
● The word with the
highest probability is
output
Machine Translation
An Encoder–Decoder Network for Machine Translation
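The last two slides can be illustrated with a small numerical sketch; the toy vocabulary and scores below are made up for illustration.

import numpy as np

french_vocab = ["Je", "Tu", "bois", "lait"]          # toy output vocabulary
scores = np.array([2.0, -1.0, 0.5, 0.3])             # decoder scores at the first step

probs = np.exp(scores) / np.sum(np.exp(scores))      # softmax: scores -> probabilities
print(dict(zip(french_vocab, np.round(probs, 2))))   # {'Je': 0.69, 'Tu': 0.03, ...}
print(french_vocab[int(np.argmax(probs))])           # 'Je' has the highest probability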
Recurrent Neural Network
How can we use this Encoder–Decoder Network for Machine Translation
at inference time, since we will not have the target sentence to feed to
the decoder?
Machine Translation
An Encoder–Decoder Network for Machine Translation
Recurrent Neural Network
● We will simply feed the decoder the word that it output at the previous
step (see the sketch below)
● This will require an embedding lookup that is not shown on the diagram
Machine Translation
An Encoder–Decoder Network for Machine Translation
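A minimal sketch of this greedy decoding loop, assuming a hypothetical decoder_step(state, word_id) helper that embeds word_id, runs one decoder step and returns the scores over the French vocabulary (none of these names come from the original slides):

import numpy as np

GO, EOS = 1, 2      # assumed ids for the "<go>" and "<eos>" tokens
MAX_LEN = 50        # safety limit on the length of the generated sentence

def greedy_decode(encoder_state, decoder_step):
    state, word_id, output = encoder_state, GO, []
    for _ in range(MAX_LEN):
        state, scores = decoder_step(state, word_id)  # embedding lookup happens inside
        word_id = int(np.argmax(scores))              # feed back the most probable word
        if word_id == EOS:                            # stop at the end-of-sequence token
            break
        output.append(word_id)
    return output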
Questions?
https://discuss.cloudxlab.com
reachus@cloudxlab.com
More Related Content

What's hot

Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Edureka!
 
LSTM Basics
LSTM BasicsLSTM Basics
LSTM Basics
Akshay Sehgal
 
Long Short Term Memory
Long Short Term MemoryLong Short Term Memory
Long Short Term Memory
Yan Xu
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
Hakky St
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
Larry Guo
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
RahulKumar854607
 
Neural networks introduction
Neural networks introductionNeural networks introduction
Neural networks introduction
آيةالله عبدالحكيم
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Simplilearn
 
Recurrent Neural Network
Recurrent Neural NetworkRecurrent Neural Network
Recurrent Neural Network
Mohammad Sabouri
 
Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Deep learning tutorial 9/2019
Deep learning tutorial 9/2019
Amr Rashed
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
Rakuten Group, Inc.
 
[기초개념] Recurrent Neural Network (RNN) 소개
[기초개념] Recurrent Neural Network (RNN) 소개[기초개념] Recurrent Neural Network (RNN) 소개
[기초개념] Recurrent Neural Network (RNN) 소개
Donghyeon Kim
 
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Databricks
 
Quantum neural network
Quantum neural networkQuantum neural network
Quantum neural network
surat murthy
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
Ralph Schlosser
 
Spiking neural network: an introduction I
Spiking neural network: an introduction ISpiking neural network: an introduction I
Spiking neural network: an introduction I
Dalin Zhang
 
Attention in Deep Learning
Attention in Deep LearningAttention in Deep Learning
Attention in Deep Learning
健程 杨
 
Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.
Mohd Faiz
 
Soft Computing
Soft ComputingSoft Computing
Soft Computing
MANISH T I
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep Learning
Khang Pham
 

What's hot (20)

Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
 
LSTM Basics
LSTM BasicsLSTM Basics
LSTM Basics
 
Long Short Term Memory
Long Short Term MemoryLong Short Term Memory
Long Short Term Memory
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
Neural networks introduction
Neural networks introductionNeural networks introduction
Neural networks introduction
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
 
Recurrent Neural Network
Recurrent Neural NetworkRecurrent Neural Network
Recurrent Neural Network
 
Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Deep learning tutorial 9/2019
Deep learning tutorial 9/2019
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
[기초개념] Recurrent Neural Network (RNN) 소개
[기초개념] Recurrent Neural Network (RNN) 소개[기초개념] Recurrent Neural Network (RNN) 소개
[기초개념] Recurrent Neural Network (RNN) 소개
 
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
 
Quantum neural network
Quantum neural networkQuantum neural network
Quantum neural network
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
 
Spiking neural network: an introduction I
Spiking neural network: an introduction ISpiking neural network: an introduction I
Spiking neural network: an introduction I
 
Attention in Deep Learning
Attention in Deep LearningAttention in Deep Learning
Attention in Deep Learning
 
Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.
 
Soft Computing
Soft ComputingSoft Computing
Soft Computing
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep Learning
 

Similar to Recurrent Neural Networks

Convolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural NetworksConvolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural Networks
Ramesh Ragala
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
Sharath TS
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
Junaid Bhat
 
Artificial Neural Networks - An Introduction.pptx
Artificial Neural Networks - An Introduction.pptxArtificial Neural Networks - An Introduction.pptx
Artificial Neural Networks - An Introduction.pptx
Tharaka Devinda
 
RNN and LSTM model description and working advantages and disadvantages
RNN and LSTM model description and working advantages and disadvantagesRNN and LSTM model description and working advantages and disadvantages
RNN and LSTM model description and working advantages and disadvantages
AbhijitVenkatesh1
 
Neural Turing Machines
Neural Turing MachinesNeural Turing Machines
Neural Turing Machines
Kato Yuzuru
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
hirokazutanaka
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural Networks
Andrew Ferlitsch
 
DEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdfDEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdf
AamirMaqsood8
 
Deep Learning - RNN and CNN
Deep Learning - RNN and CNNDeep Learning - RNN and CNN
Deep Learning - RNN and CNN
Pradnya Saval
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
PyData
 
Aaa ped-22-Artificial Neural Network: Introduction to ANN
Aaa ped-22-Artificial Neural Network: Introduction to ANNAaa ped-22-Artificial Neural Network: Introduction to ANN
Aaa ped-22-Artificial Neural Network: Introduction to ANN
AminaRepo
 
Icon18revrec sudeshna
Icon18revrec sudeshnaIcon18revrec sudeshna
Icon18revrec sudeshna
Muthusamy Chelliah
 
Recurrent and Recursive Networks (Part 1)
Recurrent and Recursive Networks (Part 1)Recurrent and Recursive Networks (Part 1)
Recurrent and Recursive Networks (Part 1)
sohaib_alam
 
Digital Signal Processing
Digital Signal ProcessingDigital Signal Processing
Digital Signal Processing
PRABHAHARAN429
 
Introduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep LearningIntroduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep Learning
ali alemi
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
Simplilearn
 
UNIT III (8).pptx
UNIT III (8).pptxUNIT III (8).pptx
UNIT III (8).pptx
DrDhivyaaCRAssistant
 
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering CollegeArtificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
Dhivyaa C.R
 
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
Universitat Politècnica de Catalunya
 

Similar to Recurrent Neural Networks (20)

Convolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural NetworksConvolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural Networks
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Artificial Neural Networks - An Introduction.pptx
Artificial Neural Networks - An Introduction.pptxArtificial Neural Networks - An Introduction.pptx
Artificial Neural Networks - An Introduction.pptx
 
RNN and LSTM model description and working advantages and disadvantages
RNN and LSTM model description and working advantages and disadvantagesRNN and LSTM model description and working advantages and disadvantages
RNN and LSTM model description and working advantages and disadvantages
 
Neural Turing Machines
Neural Turing MachinesNeural Turing Machines
Neural Turing Machines
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
 
Machine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural NetworksMachine Learning - Introduction to Recurrent Neural Networks
Machine Learning - Introduction to Recurrent Neural Networks
 
DEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdfDEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdf
 
Deep Learning - RNN and CNN
Deep Learning - RNN and CNNDeep Learning - RNN and CNN
Deep Learning - RNN and CNN
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Aaa ped-22-Artificial Neural Network: Introduction to ANN
Aaa ped-22-Artificial Neural Network: Introduction to ANNAaa ped-22-Artificial Neural Network: Introduction to ANN
Aaa ped-22-Artificial Neural Network: Introduction to ANN
 
Icon18revrec sudeshna
Icon18revrec sudeshnaIcon18revrec sudeshna
Icon18revrec sudeshna
 
Recurrent and Recursive Networks (Part 1)
Recurrent and Recursive Networks (Part 1)Recurrent and Recursive Networks (Part 1)
Recurrent and Recursive Networks (Part 1)
 
Digital Signal Processing
Digital Signal ProcessingDigital Signal Processing
Digital Signal Processing
 
Introduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep LearningIntroduction To Using TensorFlow & Deep Learning
Introduction To Using TensorFlow & Deep Learning
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
 
UNIT III (8).pptx
UNIT III (8).pptxUNIT III (8).pptx
UNIT III (8).pptx
 
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering CollegeArtificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
Artificial Neural Network by Dr.C.R.Dhivyaa Kongu Engineering College
 
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
Convolutional Neural Networks (D1L3 2017 UPC Deep Learning for Computer Vision)
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 

Recently uploaded

Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
Sally Laouacheria
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
Toru Tamaki
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
Matthew Sinclair
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Lidia A.
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
Larry Smarr
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Chris Swan
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
Andrey Yasko
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 

Recently uploaded (20)

Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf20240702 Présentation Plateforme GenAI.pdf
20240702 Présentation Plateforme GenAI.pdf
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 

Recurrent Neural Networks

  • 2. Recurrent Neural Network Recurrent Neural Network ● Predicting the future is what we do all the time ○ Finishing a friend’s sentence ○ Anticipating the smell of coffee at the breakfast or ○ Catching the ball in the field ● In this chapter, we will cover RNN ○ Networks which can predict future ● Unlike all the nets we have discussed so far ○ RNN can work on sequences of arbitrary lengths ○ Rather than on fixed-sized inputs
  • 3. Recurrent Neural Network Recurrent Neural Network - Applications ● RNN can analyze time series data ○ Such as stock prices, and ○ Tell you when to buy or sell
  • 4. Recurrent Neural Network Recurrent Neural Network - Applications ● In autonomous driving systems, RNN can ○ Anticipate car trajectories and ○ Help avoid accidents
  • 5. Recurrent Neural Network Recurrent Neural Network - Applications ● RNN can take sentences, documents, or audio samples as input and ○ Make them extremely useful ○ For natural language processing (NLP) systems such as ■ Automatic translation ■ Speech-to-text or ■ Sentiment analysis
  • 6. Recurrent Neural Network Recurrent Neural Network - Applications ● RNNs’ ability to anticipate also makes them capable of surprising creativity. ○ You can ask them to predict which are the most likely next notes in a melody ○ Then randomly pick one of these notes and play it. ○ Then ask the net for the next most likely notes, play it, and repeat the process again and again. Here is an example melody produced by Google’s Magenta project
  • 7. Recurrent Neural Network Recurrent Neural Network ● In this chapter we will learn about ○ Fundamental concepts in RNNs ○ The main problem RNNs face ○ And the solution to the problems ○ How to implement RNNs ● Finally, we will take a look at the ○ Architecture of a machine translation system
  • 9. Recurrent Neural Network Recurrent Neurons ● Up to now we have mostly looked at feedforward neural networks ○ Where the activations flow only in one direction ○ From the input layer to the output layer ● RNN looks much like a feedforward neural network ○ Except it also has connections pointing backward
  • 10. Recurrent Neural Network Recurrent Neurons ● Let’s look at the simplest possible RNN ○ Composed of just one neuron receiving inputs ○ Producing an output, and ○ Sending that output back to itself Input Output Sending output back to itself
  • 11. Recurrent Neural Network Recurrent Neurons ● At each time step t (also called a frame) ○ This recurrent neuron receives the inputs x(t) ○ As well as its own output from the previous time step y(t–1) A recurrent neuron (left), unrolled through time (right)
  • 12. Recurrent Neural Network Recurrent Neurons ● We can represent this tiny network against the time axis (See below figure) ● This is called unrolling the network through time A recurrent neuron (left), unrolled through time (right)
  • 13. Recurrent Neural Network Recurrent Neurons ● We can easily create a layer of recurrent neurons ● At each time step t, every neuron receives both the ○ Input vector x(t) and ○ Output vector from the previous time step y(t–1) A layer of recurrent neurons (left), unrolled through time(right)
  • 14. Recurrent Neural Network Recurrent Neurons ● Each recurrent neuron has two sets of weights ○ One for the inputs x(t) and the ○ Other for the outputs of the previous time step, y(t–1) ● Let’s call these weight vectors wx and wy ● Below equation represents the output of a single recurrent neuron Output of a single recurrent neuron for a single instance bias ϕ() is the activation function like ReLU
  • 15. Recurrent Neural Network Recurrent Neurons ● We can compute a whole layer’s output ○ In one shot for a whole mini-batch ○ Using a vectorized form of the previous equation Outputs of a layer of recurrent neurons for all instances in a mini-batch
  • 16. Recurrent Neural Network Recurrent Neurons ● Y(t) is an m x nneurons matrix containing the ○ Layer’s outputs at time step t for each instance in the minibatch ○ m is the number of instances in the mini-batch ○ nneurons is the number of neurons Outputs of a layer of recurrent neurons for all instances in a mini-batch
  • 17. Recurrent Neural Network Recurrent Neurons ● X(t) is an m × ninputs matrix containing the inputs for all instances ○ ninputs is the number of input features Outputs of a layer of recurrent neurons for all instances in a mini-batch
  • 18. Recurrent Neural Network Recurrent Neurons ● Wx is an ninputs × nneurons matrix containing the connection weights for the inputs of the current time step ● Wy is an nneurons × nneurons matrix containing the connection weights for the outputs of the previous time step Outputs of a layer of recurrent neurons for all instances in a mini-batch
  • 19. Recurrent Neural Network Recurrent Neurons ● The weight matrices Wx and Wy are often concatenated into a single weight matrix W of shape (ninputs + nneurons ) × nneurons ● b is a vector of size nneurons containing each neuron’s bias term Outputs of a layer of recurrent neurons for all instances in a mini-batch
  • 20. Recurrent Neural Network Memory Cells ● Since the output of a recurrent neuron at time step t is a ○ Function of all the inputs from previous time steps ○ We can say that it has a form of memory ● A part of a neural network that ○ Preserves some state across time steps is called a memory cell
  • 21. Recurrent Neural Network Memory Cells ● In general a cell’s state at time step t, denoted h(t) is a ○ Function of some inputs at that time step and ○ Its state at the previous time step h(t) = f(h(t–1) , x(t) ) ● Its output at time step t, denoted y(t) is also a ○ Function of the previous state and the current inputs
  • 22. Recurrent Neural Network Memory Cells ● In the case of basics cells we have discussed so far ○ The output is simply equal to the state ○ But in more complex cells this is not always the case A cell’s hidden state and its output may be different
  • 23. Recurrent Neural Network Input and Output Sequences Sequence-to-sequence Network ● An RNN can simultaneously take a ○ Sequence of inputs and ○ Produce a sequence of outputs
  • 24. Recurrent Neural Network Input and Output Sequences Sequence-to-sequence Network ● This type of network is useful for predicting time series ○ Such as stock prices ● We feed it the prices over the last N days and ○ It must output the prices shifted by one day into the future ○ i.e., from N – 1 days ago to tomorrow
  • 25. Recurrent Neural Network Input and Output Sequences Sequence-to-vector Network ● Alternatively we could feed the network a sequence of inputs and ○ Ignore all outputs except for the last one
  • 26. Recurrent Neural Network Input and Output Sequences Sequence-to-vector Network ● We can feed this network a sequence of words ○ Corresponding to a movie review and ○ The network would output a sentiment score ○ e.g., from –1 [hate] to +1 [love]
  • 27. Recurrent Neural Network Input and Output Sequences Vector-to-sequence Network ● We could feed the network a single input at the first time step and ○ Zeros for all other time steps and ○ Let is output a sequence ● For example, the input could be an image and the ○ Output could be a caption for the image
  • 28. Recurrent Neural Network Input and Output Sequences Encoder-Decoder ● In this network, we have ○ sequence-to-vector network, called an encoder followed by ○ vector-to-sequence network, called a decoder
  • 29. Recurrent Neural Network Input and Output Sequences Encoder-Decoder ● This can be used for translating a sentence ○ From one language to another ● We feed the network sentence in one language ○ The encoder converts this sentence into single vector representation ○ Then the decoder decodes this vector into a sentence in another language
  • 30. Recurrent Neural Network Input and Output Sequences Encoder-Decoder ● This two step model works much better than ○ Trying to translate on the fly with a ○ Single sequence-to-sequence RNN ● Since the last words of a sentence can affect the ○ First words of the translation ○ So we need to wait until we know the whole sentence
  • 31. Recurrent Neural Network Basic RNNs in TensorFlow
  • 32. Recurrent Neural Network Basic RNNs in TensorFlow ● Let’s implement a very simple RNN model ○ Without using any of the TensorFlow’s RNN operations ○ To better understand what goes on under the hood ● Let’s create an RNN composed of a layer of five recurrent neurons ○ Using the tanh activation function and ○ Assume that the RNN runs over only two time steps and ○ Taking input vectors of size 3 at each time step
  • 33. Recurrent Neural Network Basic RNNs in TensorFlow ● This network looks like a two-layer feedforward neural network with two differences ○ The same weights and bias terms are shared by both layers and ○ We feed inputs at each layer, and we get outputs from each layer
  • 34. Recurrent Neural Network Basic RNNs in TensorFlow ● To run the model, we need to feed it the inputs at both time steps ● Mini-batch contains four instances ○ Each with an input sequence composed of exactly two inputs
  • 35. Recurrent Neural Network Basic RNNs in TensorFlow ● At the end, Y0_val and Y1_val contain the outputs of the network ○ At both time steps for all neurons and ○ All instances in the mini-batch
  • 36. Recurrent Neural Network Checkout the complete code under “Manual RNN” section in notebook
  • 37. Recurrent Neural Network Static Unrolling Through Time ● Let’s look at how to create the same model ○ Using TensorFlow’s RNN operations ● The static_rnn() function creates ○ An unrolled RNN network by chaining cells ● The below code creates the exact same model as the previous one >>> X0 = tf.placeholder(tf.float32, [None, n_inputs]) >>> X1 = tf.placeholder(tf.float32, [None, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, [X0, X1], dtype=tf.float32 ) >>> Y0, Y1 = output_seqs
  • 38. Recurrent Neural Network Static Unrolling Through Time >>> X0 = tf.placeholder(tf.float32, [None, n_inputs]) >>> X1 = tf.placeholder(tf.float32, [None, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, [X0, X1], dtype=tf.float32 ) >>> Y0, Y1 = output_seqs ● First we create the input placeholders
  • 39. Recurrent Neural Network Static Unrolling Through Time >>> X0 = tf.placeholder(tf.float32, [None, n_inputs]) >>> X1 = tf.placeholder(tf.float32, [None, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, [X0, X1], dtype=tf.float32 ) >>> Y0, Y1 = output_seqs ● Then we create a BasicRNNCell ○ It is like a factory that creates ○ Copies of the cell to build the unrolled RNN ■ One for each time step
  • 40. Recurrent Neural Network Static Unrolling Through Time >>> X0 = tf.placeholder(tf.float32, [None, n_inputs]) >>> X1 = tf.placeholder(tf.float32, [None, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, [X0, X1], dtype=tf.float32 ) >>> Y0, Y1 = output_seqs ● Then we call static_rnn(), giving it the cell factory and the input tensors ● And telling it the data type of the inputs ○ This is used to create the initial state matrix ○ Which by default is full of zeros
  • 41. Recurrent Neural Network Static Unrolling Through Time >>> X0 = tf.placeholder(tf.float32, [None, n_inputs]) >>> X1 = tf.placeholder(tf.float32, [None, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, [X0, X1], dtype=tf.float32 ) >>> Y0, Y1 = output_seqs ● The static_rnn() function returns two objects ● The first is a Python list containing the output tensors for each time step ● The second is a tensor containing the final states of the network ● When we use basic cells ○ Then the final state is equal to the last output
  • 42. Recurrent Neural Network Static Unrolling Through Time Checkout the complete code under “Using static_rnn()” section in notebook
  • 43. Recurrent Neural Network Static Unrolling Through Time ● In the previous example, if there were 50 time steps then ○ It would not be convenient to define ○ 50 place holders and 50 output tensors ● Moreover, at execution time we would have to feed ○ Each of the 50 placeholders and manipulate the 50 outputs ● Let’s do it in a better way
  • 44. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● The above code takes a single input placeholder of ○ shape [None, n_steps, n_inputs] ○ Where the first dimension is the mini-batch size
  • 45. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● Then it extracts the list of input sequences for each time step ● X_seqs is a Python list of n_steps tensors of shape [None, n_inputs] ○ Where first dimension is the minibatch size
  • 46. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● To do this, we first swap the first two dimensions ○ Using the transpose() function so that the ○ Time steps are now the first dimension
  • 47. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● Then we extract a Python list of tensors along the first dimension ○ i.e., one tensor per time step ○ Using the unstack() function
  • 48. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● The next two lines are same as before
  • 49. Recurrent Neural Network Static Unrolling Through Time >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2])) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> output_seqs, states = tf.contrib.rnn.static_rnn( basic_cell, X_seqs, dtype=tf.float32 ) >>> outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]) ● Finally, we merge all the output tensors into a single tensor ○ Using the stack() function ● And then we swap the first two dimensions to get a ○ Final outputs tensor of shape [None, n_steps, n_neurons]
  • 50. Recurrent Neural Network Static Unrolling Through Time ● Now we can run the network by ○ Feeding it a single tensor that contains ○ All the mini-batch sequences
  • 51. Recurrent Neural Network Static Unrolling Through Time ● And then we get a single outputs_val tensor for ○ All instances ○ All time steps, and ○ All neurons
  • 52. Recurrent Neural Network Static Unrolling Through Time Checkout the complete code under “Packing sequences” section in notebook
  • 53. Recurrent Neural Network Static Unrolling Through Time ● The previous approach still builds a graph ○ Containing one cell per time step ● If there were 50 time steps, the graph would look ugly ● It is like writing a program without using for loops ○ Y0=f(0,X0); Y1=f(Y0, X1); Y2=f(Y1, X2); ...; Y50=f(Y49, X50)) ● With such a large graph ○ Since it must store all tensor values during the forward pass ○ So it can use them to compute gradients during the reverse pass ○ We may get out-of-memory (OOM) errors ○ During backpropagation (in GPU cards because of limited memory)
  • 54. Recurrent Neural Network Dynamic Unrolling Through Time Let’s look at the better solution than previous approach using the dynamic_rnn() function
  • 55. Recurrent Neural Network Dynamic Unrolling Through Time ● The dynamic_rnn() function uses a while_loop() operation to ○ Run over the cell the appropriate number of times ● We can set swap_memory=True ○ If we want it to swap the GPU’s memory to the CPU’s ○ Memory during backpropagation to avoid out of memory errors ● It also accepts a single tensor for ○ All inputs at every time step (shape [None, n_steps, n_inputs]) and ○ It outputs a single tensor for all outputs at every time step ■ (shape [None, n_steps, n_neurons]) ○ There is no need to stack, unstack, or transpose
  • 56. Recurrent Neural Network Dynamic Unrolling Through Time RNN using dynamic_rnn >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
  • 57. Recurrent Neural Network Dynamic Unrolling Through Time Check out the complete code under the “Using dynamic_rnn()” section in the notebook
  • 58. Recurrent Neural Network Dynamic Unrolling Through Time Note ● During backpropagation ○ The while_loop() operation does the appropriate magic ○ It stores the tensor values for each iteration during the forward pass ○ So it can use them to compute gradients during the reverse pass
  • 59. Recurrent Neural Network Handling Variable Length Input Sequences ● So far we have used only fixed-size input sequences ● What if the input sequences have variable lengths (e.g., like sentences) ● In this case we should set the sequence_length parameter ○ When calling the dynamic_rnn() function ○ It must be a 1D tensor indicating the length of the ○ Input sequence for each instance
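For instance, the graph could be set up like this (a minimal sketch, in the spirit of the notebook; X is the [None, n_steps, n_inputs] placeholder from before):
seq_length = tf.placeholder(tf.int32, [None])   # one length per instance
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)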
  • 60. Recurrent Neural Network Handling Variable Length Input Sequences ● Suppose the second input sequence contains ○ Only one input instead of two ○ Then it must be padded with a zero vector ○ In order to fit in the input tensor X
  • 61. Recurrent Neural Network Handling Variable Length Input Sequences ● Now we need to feed values for both placeholders X and seq_length
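A hedged sketch of the execution phase, again assuming n_steps = 2 and n_inputs = 3; note that instance 1 is padded with a zero vector and its length is given as 1:
X_batch = np.array([
    #   t = 0       t = 1
    [[0, 1, 2], [9, 8, 7]],   # instance 0, length 2
    [[3, 4, 5], [0, 0, 0]],   # instance 1, length 1 (padded with a zero vector)
    [[6, 7, 8], [6, 5, 4]],   # instance 2, length 2
    [[9, 0, 1], [3, 2, 1]],   # instance 3, length 2
])
seq_length_batch = np.array([2, 1, 2, 2])

init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states],
        feed_dict={X: X_batch, seq_length: seq_length_batch})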
  • 62. Recurrent Neural Network Handling Variable Length Input Sequences ● Now the RNN outputs zero vectors for ○ Every time step past the input sequence length ○ Look at the second instance’s output for the second time step
  • 63. Recurrent Neural Network Handling Variable Length Input Sequences ● Moreover, the states tensor contains the final state of each cell ○ Excluding the zero vectors
  • 64. Recurrent Neural Network Handling Variable Length Input Sequences Check out the complete code under the “Setting the sequence lengths” section in the notebook
  • 65. Recurrent Neural Network Handling Variable-Length Output Sequences ● What if the output sequences have variable lengths ● If we know in advance what length each sequence will have ○ For example if we know that it will be the same length as the input sequence ○ Then we can set the sequence_length parameter as discussed ● Unfortunately, in general this will not be possible ○ For example, ■ The length of a translated sentence is generally different from the ■ Length of the input sentence
  • 66. Recurrent Neural Network Handling Variable-Length Output Sequences ● In this case, the most common solution is to define ○ A special output called an end-of-sequence token (EOS token) ● Any output past the EOS token should be ignored - We will discuss this later in detail
  • 67. Recurrent Neural Network So far we have learned how to build an RNN. But how do we train it?
  • 69. Recurrent Neural Network Training RNNs ● To train an RNN, the trick is to unroll it through time and then simply use regular backpropagation ● This strategy is called backpropagation through time (BPTT)
  • 70. Recurrent Neural Network Training RNNs Understanding how RNNs are trained Just like in regular backpropagation, there is a first forward pass through the unrolled network, represented by the dashed arrows
  • 71. Recurrent Neural Network Training RNNs Understanding how RNNs are trained Then the output sequence is evaluated using a cost function where tmin and tmax are the first and last output time steps, not counting the ignored outputs
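Written out, the cost is a function of only the non-ignored outputs (a standard way to state it; the exact symbols may differ from the original slide's figure):
C\big(\mathbf{Y}_{(t_{\min})}, \mathbf{Y}_{(t_{\min}+1)}, \dots, \mathbf{Y}_{(t_{\max})}\big)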
  • 72. Recurrent Neural Network Then the gradients of that cost function are propagated backward through the unrolled network, represented by the solid arrows Training RNNs Understanding how RNNs are trained
  • 73. Recurrent Neural Network And finally the model parameters are updated using the gradients computed during BPTT Training RNNs Understanding how RNNs are trained
  • 74. Recurrent Neural Network Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output Training RNNs Understanding how RNNs are trained
  • 75. Recurrent Neural Network Here, the cost function is computed using the last three outputs of the network, Y(2) , Y(3) , and Y(4) , so gradients flow through these three outputs, but not through Y(0) and Y(1) Training RNNs Understanding how RNNs are trained
  • 76. Recurrent Neural Network Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps Training RNNs Understanding how RNNs are trained
  • 77. Recurrent Neural Network Training a Sequence Classifier Let’s train an RNN to classify MNIST images
  • 78. Recurrent Neural Network Training a Sequence Classifier ● A convolutional neural network would be better suited for image classification ● But this makes for a simple example that we are already familiar with
  • 79. Recurrent Neural Network Training a Sequence Classifier Overview of the task ● We will treat each image as a sequence of 28 rows of 28 pixels each, since each MNIST image is 28 × 28 pixels ● We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons, one per class, connected to the output of the last time step ● This will be followed by a softmax layer
  • 80. Recurrent Neural Network Overview of the task Training a Sequence Classifier
  • 81. Recurrent Neural Network Construction Phase ● The construction phase is quite straightforward ● It’s pretty much the same as the MNIST classifier we built previously, except that an unrolled RNN replaces the hidden layers ● Note that the fully connected layer is connected to the states tensor, which contains only the final state of the RNN i.e., the 28th output Training a Sequence Classifier
  • 82. Recurrent Neural Network Construction Phase >>> from tensorflow.contrib.layers import fully_connected >>> n_steps = 28 >>> n_inputs = 28 >>> n_neurons = 150 >>> n_outputs = 10 >>> learning_rate = 0.001 >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> y = tf.placeholder(tf.int32, [None]) >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32) Training a Sequence Classifier Run it on Notebook
  • 83. Recurrent Neural Network Construction Phase >>> logits = fully_connected(states, n_outputs, activation_fn=None) >>> xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits( labels=y, logits=logits) >>> loss = tf.reduce_mean(xentropy) >>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) >>> training_op = optimizer.minimize(loss) >>> correct = tf.nn.in_top_k(logits, y, 1) >>> accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) >>> init = tf.global_variables_initializer() Training a Sequence Classifier Run it on Notebook
  • 84. Recurrent Neural Network Load the MNIST data and reshape it Now we will load the MNIST data and reshape the test data to [batch_size, n_steps, n_inputs] as is expected by the network >>> from tensorflow.examples.tutorials.mnist import input_data >>> mnist = input_data.read_data_sets("data/mnist/") >>> X_test = mnist.test.images.reshape((-1, n_steps, n_inputs)) >>> y_test = mnist.test.labels Training a Sequence Classifier Run it on Notebook
  • 85. Recurrent Neural Network Training the RNN We reshape each training batch before feeding it to the network >>> n_epochs = 100 >>> batch_size = 150 >>> with tf.Session() as sess: init.run() for epoch in range(n_epochs): for iteration in range(mnist.train.num_examples // batch_size): X_batch, y_batch = mnist.train.next_batch(batch_size) X_batch = X_batch.reshape((-1, n_steps, n_inputs)) sess.run(training_op, feed_dict={X: X_batch, y: y_batch}) acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch}) acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test}) print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test) Training a Sequence Classifier Run it on Notebook
  • 86. Recurrent Neural Network The Output The output should look like this: 0 Train accuracy: 0.713333 Test accuracy: 0.7299 1 Train accuracy: 0.766667 Test accuracy: 0.7977 ... 98 Train accuracy: 0.986667 Test accuracy: 0.9777 99 Train accuracy: 0.986667 Test accuracy: 0.9809 Training a Sequence Classifier
  • 87. Recurrent Neural Network Conclusion ● We get over 98% accuracy — not bad! ● Plus we would certainly get a better result by ○ Tuning the hyperparameters ○ Initializing the RNN weights using He initialization ○ Training longer ○ Or adding a bit of regularization e.g., dropout Training a Sequence Classifier
  • 88. Recurrent Neural Network Training to Predict Time Series Now, we will train an RNN to predict the next value in a generated time series
  • 89. Recurrent Neural Network Training to Predict Time Series ● Each training instance is a randomly selected sequence of 20 consecutive values from the time series ● And the target sequence is the same as the input sequence, except it is shifted by one time step into the future
  • 90. Recurrent Neural Network Training to Predict Time Series Construction Phase ● It will contain 100 recurrent neurons and we will unroll it over 20 time steps since each training instance will be 20 inputs long ● Each input will contain only one feature, the value at that time ● The targets are also sequences of 20 inputs, each containing a single value
  • 91. Recurrent Neural Network Construction Phase >>> n_steps = 20 >>> n_inputs = 1 >>> n_neurons = 100 >>> n_outputs = 1 >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> y = tf.placeholder(tf.float32, [None, n_steps, n_outputs]) >>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu) >>> outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32) Training to Predict Time Series Run it on Notebook
  • 92. Recurrent Neural Network Construction Phase ● At each time step we now have an output vector of size 100 ● But what we actually want is a single output value at each time step ● The simplest solution is to wrap the cell in an OutputProjectionWrapper Training to Predict Time Series
  • 93. Recurrent Neural Network Construction Phase ● A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality ● The OutputProjectionWrapper adds a fully connected layer of linear neurons i.e., without any activation function on top of each output, but it does not affect the cell state ● All these fully connected layers share the same trainable weights and bias terms. Training to Predict Time Series
  • 94. Recurrent Neural Network RNN cells using output projections Training to Predict Time Series
  • 95. Recurrent Neural Network Wrapping a cell is quite easy Let’s tweak the preceding code by wrapping the BasicRNNCell into an OutputProjectionWrapper >>> cell = tf.contrib.rnn.OutputProjectionWrapper( tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),output_size=n_outputs) Training to Predict Time Series Run it on Notebook
  • 96. Recurrent Neural Network Cost Function and Optimizer ● Now we will define the cost function ● We will use the Mean Squared Error (MSE) ● Next we will create an Adam optimizer, the training op, and the variable initialization op ● >>> learning_rate = 0.001 >>> loss = tf.reduce_mean(tf.square(outputs - y)) >>> optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) >>> training_op = optimizer.minimize(loss) >>> init = tf.global_variables_initializer() Training to Predict Time Series Run it on Notebook
  • 97. Recurrent Neural Network Execution Phase >>> n_iterations = 10000 >>> batch_size = 50 >>> with tf.Session() as sess: init.run() for iteration in range(n_iterations): X_batch, y_batch = [...] # fetch the next training batch sess.run(training_op, feed_dict={X: X_batch, y: y_batch}) if iteration % 100 == 0: mse = loss.eval(feed_dict={X: X_batch, y: y_batch}) print(iteration, "\tMSE:", mse) Training to Predict Time Series Run it on Notebook
  • 98. Recurrent Neural Network Execution Phase The program’s output should look like this 0 MSE: 379.586 100 MSE: 14.58426 200 MSE: 7.14066 300 MSE: 3.98528 400 MSE: 2.00254 [...] Training to Predict Time Series
  • 99. Recurrent Neural Network Making Predictions Once the model is trained, you can make predictions: >>> X_new = [...] # New sequences >>> y_pred = sess.run(outputs, feed_dict={X: X_new}) Training to Predict Time Series
  • 100. Recurrent Neural Network Making Predictions Training to Predict Time Series Shows the predicted sequence for the instances, after 1,000 training iterations
  • 101. Recurrent Neural Network ● Using an OutputProjectionWrapper is the simplest solution to reduce the dimensionality of the RNN’s output sequences down to just one value per time step per instance ● But it is not the most efficient Training to Predict Time Series
  • 102. Recurrent Neural Network ● There is a trickier but more efficient solution: ○ We can reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons] ○ Then apply a single fully connected layer with the appropriate output size in our case just 1, which will result in an output tensor of shape [batch_size * n_steps, n_outputs] ○ And then reshape this tensor to [batch_size, n_steps, n_outputs] Training to Predict Time Series
  • 103. Recurrent Neural Network Reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons] Training to Predict Time Series
  • 104. Recurrent Neural Network Apply a single fully connected layer with the appropriate output size in our case just 1, which will result in an output tensor of shape [batch_size * n_steps, n_outputs] Training to Predict Time Series
  • 105. Recurrent Neural Network And then reshape this tensor to [batch_size, n_steps, n_outputs] Training to Predict Time Series
  • 106. Recurrent Neural Network Let’s implement this solution ● We first revert to a basic cell, without the OutputProjectionWrapper >>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu) >>> rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32) Training to Predict Time Series Run it on Notebook
  • 107. Recurrent Neural Network Let’s implement this solution ● Then we stack all the outputs using the reshape() operation, apply the fully connected linear layer without using any activation function; this is just a projection, and finally unstack all the outputs, again using reshape() >>> stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons]) >>> stacked_outputs = fully_connected(stacked_rnn_outputs, n_outputs, activation_fn=None) >>> outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs]) Training to Predict Time Series Run it on Notebook
  • 108. Recurrent Neural Network Let’s implement this solution ● The rest of the code is the same as earlier. This can provide a significant speed boost since there is just one fully connected layer instead of one per time step. Training to Predict Time Series
  • 109. Recurrent Neural Network Creative RNN Let’s use our model to generate some creative sequences
  • 110. Recurrent Neural Network Creative RNN ● All we need is to provide it a seed sequence containing n_steps values e.g., full of zeros ● Use the model to predict the next value ● Append this predicted value to the sequence ● Feed the last n_steps values to the model to predict the next value ● And so on This process generates a new sequence that has some resemblance to the original time series
  • 111. Recurrent Neural Network Creative RNN >>> sequence = [0.] * n_steps >>> for iteration in range(300): X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1) y_pred = sess.run(outputs, feed_dict={X: X_batch}) sequence.append(y_pred[0, -1, 0]) Run it on Notebook
  • 112. Recurrent Neural Network Creative RNN Creative sequences seeded with zeros
  • 113. Recurrent Neural Network Creative RNN Creative sequences seeded with an instance
  • 115. Recurrent Neural Network Deep RNNs ● It is quite common to stack multiple layers of cells. ● This gives you a Deep RNN A Deep RNN
  • 116. Recurrent Neural Network Deep RNNs Deep RNN unrolled through time
  • 117. Recurrent Neural Network Deep RNNs How to implement Deep RNN in TensorFlow
  • 118. Recurrent Neural Network ● To implement a deep RNN in TensorFlow ● We can create several cells and stack them into a MultiRNNCell ● In the following code we stack three identical cells >>> n_neurons = 100 >>> n_layers = 3 >>> basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([basic_cell] * n_layers) >>> outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32) Deep RNNs - Implementation in TensorFlow Run it on Notebook
  • 119. Recurrent Neural Network >>> outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32) ● The states variable is a tuple containing one tensor per layer, each representing the final state of that layer’s cell, with shape [batch_size, n_neurons] ● If you set state_is_tuple=False when creating the MultiRNNCell, then states becomes a single tensor containing the states from every layer, concatenated along the column axis, i.e., its shape is [batch_size, n_layers * n_neurons] Deep RNNs - Implementation in TensorFlow
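For example, with the default state_is_tuple=True you could pull the per-layer states apart like this (a small sketch):
# states is a tuple with one tensor per layer, each of shape [batch_size, n_neurons]
top_layer_state = states[-1]             # final state of the top layer
all_states = tf.concat(states, axis=1)   # shape [batch_size, n_layers * n_neurons]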
  • 120. Recurrent Neural Network ● If you build a very deep RNN, it may end up overfitting the training set ● To prevent that, a common technique is to apply dropout ● You can simply add a dropout layer before or after the RNN as usual ● But if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper Deep RNNs - Applying Dropout
  • 121. Recurrent Neural Network ● The following code applies dropout to the inputs of each layer in the RNN, dropping each input with a 50% probability >>> keep_prob = 0.5 >>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> cell_drop = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob) >>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([cell_drop] * n_layers) >>> rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32) Deep RNNs - Applying Dropout Run it on Notebook
  • 122. Recurrent Neural Network ● It is also possible to apply dropout to the outputs by setting output_keep_prob ● The main problem with this code is that it will apply dropout not only during training but also during testing, which is not what we want ● Since dropout should be applied only during training Deep RNNs - Applying Dropout
  • 123. Recurrent Neural Network ● Unfortunately, the DropoutWrapper does not support an is_training placeholder ● So we must either write our own dropout wrapper class, or have two different graphs: ○ One for training ○ And the other for testing Let’s implement the second option Deep RNNs - Applying Dropout
  • 124. Recurrent Neural Network >>> import sys >>> is_training = (sys.argv[-1] == "train") >>> X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) >>> y = tf.placeholder(tf.float32, [None, n_steps, n_outputs]) >>> cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons) >>> if is_training: cell = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob) >>> multi_layer_cell = tf.contrib.rnn.MultiRNNCell([cell] * n_layers) >>> rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32) [...] # build the rest of the graph >>> init = tf.global_variables_initializer() >>> saver = tf.train.Saver() >>> with tf.Session() as sess: >>> if is_training: init.run() for iteration in range(n_iterations): [...] # train the model save_path = saver.save(sess, "/tmp/my_model.ckpt") else: saver.restore(sess, "/tmp/my_model.ckpt") [...] # use the model Run it on Notebook Deep RNNs - Applying Dropout
  • 125. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● To train an RNN on long sequences, we will need to run it over many time steps, making the unrolled RNN a very deep network ● Just like any deep neural network it may suffer from the vanishing/exploding gradients problem and take forever to train Deep RNNs
  • 126. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● Many of the tricks we discussed to alleviate this problem can be used for deep unrolled RNNs as well: ○ good parameter initialization, ○ nonsaturating activation functions e.g., ReLU ○ Batch Normalization, ○ Gradient Clipping, ○ And faster optimizers Deep RNNs
  • 127. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● However, if the RNN needs to handle even moderately long sequences e.g., 100 inputs, then training will still be very slow ● The simplest and most common solution to this problem is to unroll the RNN only over a limited number of time steps during training ● This is called truncated backpropagation through time Deep RNNs
  • 129. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● In TensorFlow you can implement truncated backpropagation through time simply by truncating the input sequences ● For example, in the time series prediction problem, you would simply reduce n_steps during training ● The problem with this is that the model will not be able to learn long-term patterns How can we solve this problem? Deep RNNs
  • 130. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● One workaround could be to make sure that these shortened sequences contain both old and recent data ● So that the model can learn to use both ● E.g., the sequence could contain monthly data for the last five months, then weekly data for the last five weeks, then daily data over the last five days ● But this workaround has its limits: ○ What if fine-grained data from last year is actually useful? ○ What if there was a brief but significant event that absolutely must be taken into account, even years later ○ E.g., the result of an election Deep RNNs
  • 131. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● Besides the long training time ○ A second problem faced by long-running RNNs is the fact that the memory of the first inputs gradually fades away ○ Indeed, due to the transformations that the data goes through when traversing an RNN, some information is lost after each time step. ● After a while, the RNN’s state contains virtually no trace of the first inputs Let’s understand this with an example Deep RNNs
  • 132. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● Say you want to perform sentiment analysis on a long review that starts with the four words “I loved this movie,” ● But the rest of the review lists the many things that could have made the movie even better ● If the RNN gradually forgets the first four words, it will completely misinterpret the review Deep RNNs
  • 133. Recurrent Neural Network The Difficulty of Training over Many Time Steps ● To solve this problem, various types of cells with long-term memory have been introduced ● They have proved so successful that the basic cells are not much used anymore Let’s study about these long memory cells Deep RNNs
  • 135. Recurrent Neural Network ● The Long Short-Term Memory (LSTM) cell was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber ● And it was gradually improved over the years by several researchers, such as Alex Graves, Haşim Sak, Wojciech Zaremba, and many more LSTM Cell Sepp Hochreiter Jürgen Schmidhuber
  • 136. Recurrent Neural Network LSTM Cell ● If you consider the LSTM cell as a black box, it can be used very much like a basic cell ● Except ○ It will perform much better ○ Training will converge faster ○ And it will detect long-term dependencies in the data In TensorFlow, you can simply use a BasicLSTMCell instead of a BasicRNNCell >>> lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
  • 137. Recurrent Neural Network LSTM Cell ● LSTM cells manage two state vectors, and for performance reasons they are kept separate by default ● We can change this default behavior by setting state_is_tuple=False when creating the BasicLSTMCell
  • 138. Recurrent Neural Network LSTM Cell The architecture of a basic LSTM cell
  • 139. Recurrent Neural Network ● The LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: h(t) and c(t), where “c” stands for “cell” LSTM Cell
  • 140. Recurrent Neural Network ● We can think of h(t) as the short-term state and c(t) as the long-term state LSTM Cell
  • 141. Recurrent Neural Network Understanding the LSTM cell structure ● The key idea is that the network can learn ○ What to store in the long-term state, ○ What to throw away, ○ And what to read from it LSTM Cell
  • 142. Recurrent Neural Network As the long-term state c(t–1) traverses the network from left to right, it first goes through a forget gate, dropping some memories Understanding the LSTM cell structure LSTM Cell
  • 143. Recurrent Neural Network Understanding the LSTM cell structure LSTM Cell And then it adds some new memories via the addition operation, which adds the memories that were selected by an input gate
  • 144. Recurrent Neural Network The result c(t) is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added Understanding the LSTM cell structure LSTM Cell
  • 145. Recurrent Neural Network Moreover, after the addition operation, the long term state is copied and passed through the tanh function, and then the result is filtered by the output gate. Understanding the LSTM cell structure LSTM Cell
  • 146. Recurrent Neural Network This produces the short-term state h(t) , which is equal to the cell’s output for this time step y(t) Understanding the LSTM cell structure LSTM Cell
  • 148. Recurrent Neural Network Now let’s look at where new memories come from and how the gates work LSTM Cell
  • 149. Recurrent Neural Network First, the current input vector x(t) and the previous short-term state h(t–1) are fed to four different fully connected layers. They all serve a different purpose Understanding the LSTM cell structure LSTM Cell
  • 150. Recurrent Neural Network The main layer is the one that outputs g(t) . It has the usual role of analyzing the current inputs x(t) and the previous short-term state h(t–1) . In an LSTM cell this layer’s output is partially stored in the long-term state. Understanding the LSTM cell structure LSTM Cell
  • 151. Recurrent Neural Network The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. Understanding the LSTM cell structure LSTM Cell
  • 152. Recurrent Neural Network ● This summarizes how to compute the cell’s long-term state, its short-term state, and its output at each time step for a single instance ● The equations for a whole mini-batch are very similar Understanding the LSTM cell structure LSTM Cell
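For reference, the per-instance LSTM computations are commonly written as below (σ is the logistic function, ⊗ is element-wise multiplication; the notation is a standard choice and may differ from the slide's figure):
\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\big(\mathbf{W}_{xi}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hi}^{T}\,\mathbf{h}_{(t-1)} + \mathbf{b}_{i}\big)\\
\mathbf{f}_{(t)} &= \sigma\big(\mathbf{W}_{xf}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hf}^{T}\,\mathbf{h}_{(t-1)} + \mathbf{b}_{f}\big)\\
\mathbf{o}_{(t)} &= \sigma\big(\mathbf{W}_{xo}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{ho}^{T}\,\mathbf{h}_{(t-1)} + \mathbf{b}_{o}\big)\\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hg}^{T}\,\mathbf{h}_{(t-1)} + \mathbf{b}_{g}\big)\\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)}\\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\big(\mathbf{c}_{(t)}\big)
\end{aligned}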
  • 153. Recurrent Neural Network Conclusion ● An LSTM cell can learn to ○ Recognize an important input, that’s the role of the input gate, ○ Store it in the long-term state, ○ Learn to preserve it for as long as it is needed, that’s the role of the forget gate, ○ And learn to extract it whenever it is needed This explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more. LSTM Cell
  • 154. Recurrent Neural Network Peephole Connections ● In a basic LSTM cell, the gate controllers can look only at the input x(t) and the previous short-term state h(t–1) ● It may be a good idea to give them a bit more context by letting them peek at the long-term state as well ● This idea was proposed by Felix Gers and Jürgen Schmidhuber in 2000
  • 155. Recurrent Neural Network ● They proposed an LSTM variant with extra connections called peephole connections: ○ The previous long-term state c(t–1) is added as an input to the controllers of the forget gate and the input gate, ○ And the current long-term state c(t) is added as input to the controller of the output gate. Peephole Connections
  • 157. Recurrent Neural Network Peephole Connections To implement peephole connections in TensorFlow, you must use the LSTMCell instead of the BasicLSTMCell and set use_peepholes=True: >>> lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True) There are many other variants of the LSTM cell. One particularly popular variant is the GRU cell, which we will look at now.
  • 159. Recurrent Neural Network GRU Cell The Gated Recurrent Unit (GRU) cell was proposed by Kyunghyun Cho et al. in a 2014 paper that also introduced the Encoder–Decoder network we discussed earlier Kyunghyun Cho
  • 160. Recurrent Neural Network GRU Cell ● The GRU cell is a simplified version of the LSTM cell ● It seems to perform just as well ● This explains its growing popularity
  • 161. Recurrent Neural Network GRU Cell The main simplifications are: ● Both state vectors are merged into a single vector h(t)
  • 162. Recurrent Neural Network The main simplifications are: ● A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. GRU Cell
  • 163. Recurrent Neural Network The main simplifications are: If it outputs a 0, the opposite happens In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequently used variant of the LSTM cell in its own right GRU Cell
  • 164. Recurrent Neural Network The main simplifications are: ● There is no output gate; the full state vector is output at every time step. There is a new gate controller that controls which part of the previous state will be shown to the main layer. GRU Cell
  • 165. Recurrent Neural Network Equations to compute the cell’s state at each time step for a single instance GRU Cell
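One common formulation of the GRU equations, consistent with the description above (z(t) = 1 keeps the new candidate g(t) and drops the old state; sign conventions vary across sources):
\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\big(\mathbf{W}_{xz}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hz}^{T}\,\mathbf{h}_{(t-1)}\big)\\
\mathbf{r}_{(t)} &= \sigma\big(\mathbf{W}_{xr}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hr}^{T}\,\mathbf{h}_{(t-1)}\big)\\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^{T}\,\mathbf{x}_{(t)} + \mathbf{W}_{hg}^{T}\,(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)})\big)\\
\mathbf{h}_{(t)} &= (1 - \mathbf{z}_{(t)}) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
\end{aligned}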
  • 166. Recurrent Neural Network Implementing GRU cell in TensorFlow >>> gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons) ● LSTM or GRU cells are one of the main reasons behind the success of RNNs in recent years ● In particular for applications in natural language processing (NLP) GRU Cell
  • 167. Recurrent Neural Network Natural Language Processing
  • 168. Recurrent Neural Network Natural Language Processing ● Most of the state-of-the-art NLP applications, such as ○ Machine translation, ○ Automatic summarization, ○ Parsing, ○ Sentiment analysis, ○ and more, are now based on RNNs Now we will take a quick look at what a machine translation model looks like. This topic is very well covered by TensorFlow’s awesome Word2Vec and Seq2Seq tutorials, so you should definitely check them out
  • 169. Recurrent Neural Network Natural Language Processing - Word Representation Before we start, we need to answer this important question How do we represent a “word” ??
  • 170. Recurrent Neural Network Natural Language Processing - Word Representation In order to apply algorithms, we need to convert everything into numbers. What can we do about climate?
    temp | climate | comments
    12   | Cold    | Very nice place to visit in summers
    30   | Hot     | Do not visit. This is a trap
  • 171. Recurrent Neural Network Natural Language Processing - Word Representation In order to apply algorithms, we need to convert everything into numbers. What can we do about climate? We can convert it into a One-Hot vector
    temp | climate | comments
    12   | Cold    | Very nice place to visit in summers
    30   | Hot     | Do not visit. This is a trap
    temp | climate_cold | climate_hot | comments
    12   | 1            | 0           | Very nice place to visit in summers
    30   | 0            | 1           | Do not visit. This is a trap
  • 172. Recurrent Neural Network Natural Language Processing - Word Representation In order to apply algorithms, we need to convert everything into numbers. And what can we do about the comments?
    temp | climate | comments
    12   | Cold    | Very nice place to visit in summers
    30   | Hot     | Do not visit. This is a trap
  • 173. Recurrent Neural Network One option could be to represent each word using a one-hot vector. But consider this : ● Suppose your vocabulary contains 50,000 words ● Then the nth word would be represented as a 50,000-dimensional vector, full of 0s except for a 1 at the nth position ● However, with such a large vocabulary, this sparse representation would not be efficient at all Natural Language Processing - Word Representation
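A tiny sketch of what such a one-hot vector looks like (the id 288 is just the illustrative id used for “milk” later in these slides):
import numpy as np

vocabulary_size = 50000
word_id = 288                            # e.g. the id of the word "milk"
one_hot = np.zeros(vocabulary_size)
one_hot[word_id] = 1.0                   # 49,999 zeros and a single 1: very sparse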
  • 174. Recurrent Neural Network ● Ideally, we want similar words to have similar representations, making it easy for the model to generalize what it learns about a word to all similar words ● For example, ○ If the model is told that “I drink milk” is a valid sentence, and if it knows that “milk” is close to “water” but far from “shoes” ○ Then it will know that “I drink water” is probably a valid sentence as well ○ While “I drink shoes” is probably not But how can you come up with such a meaningful representation? Natural Language Processing - Word Representation
  • 175. Recurrent Neural Network ● The most common solution is to represent each word in the vocabulary using a fairly small and dense vector e.g., 150 dimensions, called an Embedding ● And just let the neural network learn a good embedding for each word during training Natural Language Processing - Word Embedding
  • 176. Recurrent Neural Network With word embeddings a lot of magic is possible: king - man + woman == queen Natural Language Processing - Word Embedding
  • 177. Recurrent Neural Network from gensim.models import KeyedVectors # load the google word2vec model filename = 'GoogleNews-vectors-negative300.bin' model = KeyedVectors.load_word2vec_format(filename, binary=True) # calculate: (king - man) + woman = ? result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1) print(result) Word Embedding - word2vec ● Based on the context in which words appear, people have trained such word vectors. ● One such set of pretrained vectors is word2vec and another is GloVe [('queen', 0.7118192315101624)]
  • 178. Recurrent Neural Network Word Embedding - Vector space models (VSMs) Based on the Distributional Hypothesis: ○ words that appear in the same contexts share semantic meaning. Two Approaches: 1. Count-based methods (e.g. Latent Semantic Analysis) 2. Predictive methods (e.g. neural probabilistic language models)
  • 179. Recurrent Neural Network Word Embedding - word2vec - Approaches 1. Count-based methods (e.g. Latent Semantic Analysis) ○ Compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus ○ Map these count-statistics down to a small, dense vector for each word
  • 180. Recurrent Neural Network 2. Predictive models ○ Directly try to predict a word from its neighbors ○ in terms of learned small, dense embedding vectors ○ (considered parameters of the model). Word Embedding - word2vec - Approaches
  • 181. Recurrent Neural Network Computationally-efficient predictive model for learning word embeddings from raw text. word2vec Comes in two flavors: 1. Continuous Bag-of-Words model (CBOW) 2. Skip-Gram model
  • 182. Recurrent Neural Network Computationally-efficient predictive model for learning word embeddings from raw text. word2vec Comes in two flavors: 1. Continuous Bag-of-Words model (CBOW) ○ predicts target words (e.g. 'mat') from source context words ○ e.g ('the cat sits on the'), 2. Skip-Gram model
  • 183. Recurrent Neural Network Computationally-efficient predictive model for learning word embeddings from raw text. word2vec Comes in two flavors: 1. Continuous Bag-of-Words model (CBOW) ○ predicts target words (e.g. 'mat') from source context words ○ e.g ('the cat sits on the'), 2. Skip-Gram model ○ Predicts source context-words from the target words ○ Treats each context-target pair as a new observation ○ Tends to do better when we have larger datasets. ○ Will focus on this
  • 184. Recurrent Neural Network Neural probabilistic language models ● are traditionally trained using the maximum likelihood (ML) principle ● to maximize the probability of the next word wt (for "target") ● given the previous words h (for "history") in terms of a softmax function, word2vec: Scaling up Noise-Contrastive Training
  • 185. Recurrent Neural Network Neural probabilistic language models ● are traditionally trained using the maximum likelihood (ML) principle ● to maximize the probability of the next word wt (for "target") ● given the previous words h (for "history") in terms of a softmax function, word2vec: Scaling up Noise-Contrastive Training where score(wt , h) computes the compatibility of word wt with the context h (a dot product is commonly used). We train this model by maximizing its log-likelihood, i.e.
  • 186. Recurrent Neural Network Neural probabilistic language models ● are traditionally trained using the maximum likelihood (ML) principle ● to maximize the probability of the next word wt (for "target") ● given the previous words h (for "history") in terms of a softmax function, word2vec: Scaling up Noise-Contrastive Training where score(wt , h) computes the compatibility of word wt with the context h (a dot product is commonly used). We train this model by maximizing its log-likelihood, i.e. This is very expensive, because we need to compute and normalize each probability using the score for all the other words w' in the vocabulary V, in the current context, at every training step.
  • 187. Recurrent Neural Network Neural probabilistic language models ● are traditionally trained using the maximum likelihood (ML) principle ● to maximize the probability of the next word wt (for "target") ● given the previous words h (for "history") in terms of a softmax function, word2vec: Scaling up Noise-Contrastive Training This is very expensive, because we need to compute and normalize each probability using the score for all the other words w' in the vocabulary V, in the current context, at every training step.
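Up to notation, the softmax and the log-likelihood referred to on these slides can be written as in TensorFlow's word2vec tutorial:
P(w_t \mid h) = \operatorname{softmax}\big(\operatorname{score}(w_t, h)\big)
             = \frac{\exp\big(\operatorname{score}(w_t, h)\big)}{\sum_{w' \in V} \exp\big(\operatorname{score}(w', h)\big)}

J_{\text{ML}} = \log P(w_t \mid h)
             = \operatorname{score}(w_t, h) - \log \sum_{w' \in V} \exp\big(\operatorname{score}(w', h)\big)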
  • 188. Recurrent Neural Network Instead, these models are trained using a binary classification objective (logistic regression) to discriminate the real target words wt from k imaginary (noise) words w̃, in the same context. word2vec: Scaling up Noise-Contrastive Training 1. Computing the loss function now scales only with the number of noise words that we select, and not with all the words in the vocabulary 2. This makes it much faster to train. 3. We will use the similar noise-contrastive estimation (NCE) loss, tf.nn.nce_loss().
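A minimal sketch of an NCE-based embedding model, in the spirit of TensorFlow's word2vec tutorial (the hyperparameter values such as num_sampled are illustrative, not prescribed by the slides):
import math
import tensorflow as tf

vocabulary_size = 50000
embedding_size = 150
num_sampled = 64                         # noise words drawn per positive example

train_inputs = tf.placeholder(tf.int32, shape=[None])      # source word ids
train_labels = tf.placeholder(tf.int32, shape=[None, 1])   # target word ids

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# The loss draws negative (noise) samples automatically for each training example
loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_weights, biases=nce_biases,
    labels=train_labels, inputs=embed,
    num_sampled=num_sampled, num_classes=vocabulary_size))
training_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)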
  • 189. Recurrent Neural Network the quick brown fox jumped over the lazy dog Word2vec: Context Example ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ... Context: word to the left and word to the right.
  • 190. Recurrent Neural Network the quick brown fox jumped over the lazy dog Word2vec: Skip Gram Model (quick, the), (quick, brown), (brown, quick), (brown, fox), ... Task becomes to predict 'the' and 'brown' from 'quick', 'quick' and 'fox' from 'brown', etc. Skip-gram ● inverts contexts and targets, and ● tries to predict each context word from its target word
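A small (hypothetical) helper that generates such skip-gram pairs with a window of one word on each side; it produces the pairs listed above, plus the pairs for the first and last words:
def skip_gram_pairs(words, window=1):
    """Generate (target, context) pairs for the skip-gram model."""
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skip_gram_pairs(sentence)[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox')]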
  • 191. Recurrent Neural Network Natural Language Processing - Word Embedding Let's imagine what happens at training step t ● For the first case above, the goal is to predict 'the' from 'quick'. ● We select num_noise noisy (contrastive) examples ○ by drawing from some noise distribution, ○ typically the unigram distribution ● For simplicity let's say num_noise=1 and we select 'sheep' as the noisy example. Next we compute the loss for this pair of observed and noisy examples
  • 192. Recurrent Neural Network Natural Language Processing - Word Embedding The objective at time step t becomes:
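Following the formulation in TensorFlow's word2vec tutorial (restated here up to notation), with num_noise=1 and the noise word 'sheep', the per-step objective is:
J^{(t)}_{\text{NEG}} = \log Q_{\theta}(D = 1 \mid \text{the}, \text{quick})
                     + \log\big(Q_{\theta}(D = 0 \mid \text{sheep}, \text{quick})\big)
where Q_{\theta}(D = 1 \mid w, h) is the model's probability, under the learned embeddings θ, that the word w seen in context h came from the real data rather than from the noise distribution.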
  • 193. Recurrent Neural Network Natural Language Processing - Word Embedding ● The goal is to make an update to the embedding parameters ● to improve (in this case, maximize) the objective function ● We do this by deriving the gradient of the loss with respect to the embedding parameters θ (luckily TensorFlow provides easy helper functions for doing this!). ● We then perform an update to the embeddings by taking a small step in the direction of the gradient. When this process is repeated over the entire training set, this has the effect of 'moving' the embedding vectors around for each word until the model is successful at discriminating real words from noise words.
  • 194. Recurrent Neural Network Natural Language Processing - Word Embedding ● At the beginning of training, embeddings are simply chosen randomly, ● But during training, backpropagation automatically moves the embeddings around in a way that helps the neural network perform its task
  • 195. Recurrent Neural Network Natural Language Processing - Word Embedding ● Typically this means that similar words will gradually cluster close to one another, and even end up organized in a rather meaningful way. ● For example, embeddings may end up placed along various axes that represent ○ gender, ○ singular/plural, ○ adjective/noun, ○ and so on
  • 196. Recurrent Neural Network Natural Language Processing - Word Embedding How to do it in TensorFlow In TensorFlow, we first need to create the variable representing the embeddings for every word in our vocabulary which is initialized randomly >>> vocabulary_size = 50000 >>> embedding_size = 150 >>> embeddings = tf.Variable( tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  • 197. Recurrent Neural Network How to do it in TensorFlow - Preprocessing Suppose we want to feed the sentence “I drink milk” to your neural network. ● We should first preprocess the sentence and break it into a list of known words ● For example ○ We may remove unnecessary characters, replace unknown words by a predefined token word such as “[UNK]”, ○ Replace numerical values by “[NUM]”, ○ Replace URLs by “[URL]”, ○ And so on Natural Language Processing - Word Embedding
  • 198. Recurrent Neural Network How to do it in TensorFlow ● Once we have a list of known words, we can look up each word’s integer identifier from 0 to 49999 in a dictionary, for example [72, 3335, 288] ● At that point, you are ready to feed these word identifiers to TensorFlow using a placeholder, and apply the embedding_lookup() function to get the corresponding embeddings >>> train_inputs = tf.placeholder(tf.int32, shape=[None]) # from ids... >>> embed = tf.nn.embedding_lookup(embeddings, train_inputs) # ...to embeddings Natural Language Processing - Word Embedding
  • 199. Recurrent Neural Network ● Once our model has learned good word embeddings, they can actually be reused fairly efficiently in any NLP application ● In fact, instead of training our own word embeddings, we may want to download pre-trained word embeddings ● Just like when reusing pretrained layers, we can choose to ○ Freeze the pretrained embeddings ○ Or let backpropagation tweak them for our application ● The first option will speed up training, but the second may lead to slightly higher performance Natural Language Processing - Word Embedding
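A hedged sketch of both options, assuming pretrained is a NumPy array of shape [vocabulary_size, embedding_size] loaded from a pretrained model and train_inputs is the word-id placeholder from before:
# Frozen pretrained embeddings: backpropagation will not touch them
embeddings = tf.Variable(pretrained, dtype=tf.float32, trainable=False)
# ...or set trainable=True to let backpropagation fine-tune them for our task
embed = tf.nn.embedding_lookup(embeddings, train_inputs)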
  • 201. Recurrent Neural Network Machine Translation We now have almost all the tools we need to implement a machine translation system Let’s look at this now
  • 202. Recurrent Neural Network Machine Translation An Encoder–Decoder Network for Machine Translation Let’s take a look at a simple machine translation model that will translate English sentences to French
  • 203. Recurrent Neural Network Machine Translation An Encoder–Decoder Network for Machine Translation A simple machine translation model
  • 204. Recurrent Neural Network Machine Translation Let’s learn how this Encoder–Decoder Network for Machine Translation is trained
  • 205. Recurrent Neural Network The English sentences are fed to the encoder, and the decoder outputs the French translations Machine Translation An Encoder–Decoder Network for Machine Translation
  • 206. Recurrent Neural Network Note that the French translations are also used as inputs to the decoder, but pushed back by one step Machine Translation An Encoder–Decoder Network for Machine Translation
  • 207. Recurrent Neural Network Machine Translation An Encoder–Decoder Network for Machine Translation In other words, the decoder is given as input the word that it should have output at the previous step, regardless of what it actually output at that step
  • 208. Recurrent Neural Network For the very first word, the decoder is given a token that represents the beginning of the sentence (here, “<go>”) The decoder is expected to end the sentence with an end-of-sequence (EOS) token (here, “<eos>”) Machine Translation An Encoder–Decoder Network for Machine Translation
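For example, with hypothetical integer ids for the target “Je bois du lait”, the decoder's training input is the target sequence shifted right by one and prefixed with the <go> token:
GO, EOS = 0, 1                               # hypothetical ids for the special tokens
target_ids    = [57, 163, 12, 88, EOS]       # what the decoder should output
decoder_input = [GO, 57, 163, 12, 88]        # what the decoder is fed during training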
  • 209. Recurrent Neural Network Question: Why are the English sentences reversed before being fed to the encoder? Here “I drink milk” is reversed to “milk drink I” Machine Translation An Encoder–Decoder Network for Machine Translation
  • 210. Recurrent Neural Network Answer: This ensures that the beginning of the English sentence will be fed last to the encoder, which is useful because that’s generally the first thing that the decoder needs to translate Machine Translation An Encoder–Decoder Network for Machine Translation
  • 211. Recurrent Neural Network ● Each word is initially represented by a simple integer identifier ● e.g., 288 for the word “milk” Machine Translation An Encoder–Decoder Network for Machine Translation
  • 212. Recurrent Neural Network ● Next, an embedding lookup returns the word embedding ● This is a dense, fairly low-dimensional vector ● These word embeddings are what is actually fed to the encoder and the decoder Machine Translation An Encoder–Decoder Network for Machine Translation
  • 213. Recurrent Neural Network ● At each step, the decoder outputs a score for each word in the output vocabulary i.e., French, Machine Translation An Encoder–Decoder Network for Machine Translation
  • 214. Recurrent Neural Network ● And then the Softmax layer turns these scores into probabilities Machine Translation An Encoder–Decoder Network for Machine Translation
  • 215. Recurrent Neural Network ● For example, at the first step the word “Je” may have a probability of 20%, “Tu” may have a probability of 1%, and so on ● The word with the highest probability is output Machine Translation An Encoder–Decoder Network for Machine Translation
  • 216. Recurrent Neural Network How can we use this Encoder–Decoder Network for Machine Translation at inference time, since we will not have the target sentence to feed to the decoder? Machine Translation An Encoder–Decoder Network for Machine Translation
  • 217. Recurrent Neural Network ● We will simply feed the decoder the word that it output at the previous step ● This will require an embedding lookup that is not shown on the diagram Machine Translation An Encoder–Decoder Network for Machine Translation