
I'm trying to understand how LSTM units can diverge over time when they start from zero initialization and share the same weights. Here are the key points behind my confusion:

  • Initialization: All units in an LSTM layer typically start with the same initial hidden state (h_0) and cell state (c_0), both of which are usually initialized to zero.
  • Shared Weights: The weights in an LSTM layer are shared across all units. This means each unit applies the same transformations to the input and previous states.
  • Input Sequence: Each unit in the LSTM layer processes the same input sequence at each time step.

Given these points, I don't understand how the hidden states and cell states of each unit can diverge over time. Intuitively, it seems that if all units start with the same states, use the same weights, and process the same input, they should produce identical states and outputs.
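
To make the setup concrete, here is a minimal sketch of the kind of layer I have in mind (assuming PyTorch's nn.LSTM; the framework choice is only for illustration and not part of my question). It just prints the layer's parameter shapes:

```python
# Minimal sketch (assuming PyTorch's nn.LSTM): inspect the parameters
# of a layer with 3 input features and 2 units (hidden_size=2).
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=2)

for name, param in lstm.named_parameters():
    print(name, tuple(param.shape))

# Prints:
#   weight_ih_l0 (8, 3)  -- input-to-hidden weights, 4 gates x 2 units
#   weight_hh_l0 (8, 2)  -- hidden-to-hidden weights, 4 gates x 2 units
#   bias_ih_l0 (8,)
#   bias_hh_l0 (8,)
```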

Example Calculation for Time Step t=1

Consider an LSTM layer with 2 units and the following setup:

  • Initial states for both units: h_0 = 0 and c_0 = 0.
  • Shared weights for input, forget, cell, and output gates.
  • Input sequence: x_1 = [0.2, 0.3, 0.8] at time step t=1.

The weights and biases are as follows for simplicity:

  • W_i = W_f = W_c = W_o = [0.1, 0.2, 0.3]
  • U_i = U_f = U_c = U_o = 0.5
  • b_i = b_f = b_c = b_o = 0.1

For t=1, both units will process the input x_1 with the same initial states and shared weights:

Unit 1 and Unit 2

  • Input Gate: i_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Forget Gate: f_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Cell Gate: 𝑐̃_1 = tanh(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = tanh(0.42) ≈ 0.397
  • Output Gate: o_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Cell State: c_1 = f_1 * c_0 + i_1 * 𝑐̃_1 = 0.603 * 0 + 0.603 * 0.397 ≈ 0.239
  • Hidden State: h_1 = o_1 * tanh(c_1) = 0.603 * tanh(0.239) ≈ 0.603 * 0.235 ≈ 0.142

After the first time step, both units therefore end up with the same hidden state and cell state, because they processed the same input with the same initial states and the same weights.
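
To rule out an arithmetic slip, here is a short NumPy check of the t=1 numbers above; it uses exactly the example weights, biases, and gate equations from the setup, nothing new:

```python
# Reproduce the t=1 hand calculation with the example parameters:
# W = [0.1, 0.2, 0.3], U = 0.5, b = 0.1 for every gate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([0.2, 0.3, 0.8])
W = np.array([0.1, 0.2, 0.3])   # shared by i, f, c~, o in this example
U, b = 0.5, 0.1
h0, c0 = 0.0, 0.0

pre = W @ x1 + U * h0 + b       # 0.42, identical pre-activation for every gate
i1 = sigmoid(pre)               # input gate      ~ 0.603
f1 = sigmoid(pre)               # forget gate     ~ 0.603
c_tilde = np.tanh(pre)          # cell candidate  ~ 0.397
o1 = sigmoid(pre)               # output gate     ~ 0.603
c1 = f1 * c0 + i1 * c_tilde     # cell state      ~ 0.239
h1 = o1 * np.tanh(c1)           # hidden state    ~ 0.142

print(i1, c_tilde, c1, h1)
```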

I would appreciate any insights, examples, or references that can help clarify how LSTM units achieve diversity in their states and outputs despite the same initial conditions and shared weights.

Specific Questions:

  1. How can LSTM units initialized with zeros and using shared weights develop different states over time?
  2. Is there any mechanism within the LSTM structure that introduces diversity among units?
  3. Are there practical examples or intuitive explanations that demonstrate this divergence?