
I'm trying to understand how LSTM units can diverge over time when they start from zero initialization and share the same weights. Here are the key points behind my confusion:

  • Initialization: All units in an LSTM layer typically start with the same initial hidden state (h_0) and cell state (c_0), both of which are usually initialized to zero.
  • Shared Weights: The weights in an LSTM layer are shared across all units. This means each unit applies the same transformations to the input and previous states.
  • Input Sequence: Each unit in the LSTM layer processes the same input sequence at each time step.

Given these points, I don't understand how the hidden states and cell states of each unit can diverge over time. Intuitively, it seems that if all units start with the same states, use the same weights, and process the same input, they should produce identical states and outputs.
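
To make the setup concrete, here is a minimal sketch of the kind of layer I have in mind (assuming PyTorch's nn.LSTM; the framework choice is only for illustration and not part of my question). It just prints the layer's parameter shapes:

```python
# Minimal sketch (assuming PyTorch's nn.LSTM): inspect the parameters
# of a layer with 3 input features and 2 units (hidden_size=2).
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=2)

for name, param in lstm.named_parameters():
    print(name, tuple(param.shape))

# Prints:
#   weight_ih_l0 (8, 3)  -- input-to-hidden weights, 4 gates x 2 units
#   weight_hh_l0 (8, 2)  -- hidden-to-hidden weights, 4 gates x 2 units
#   bias_ih_l0 (8,)
#   bias_hh_l0 (8,)
```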

Example Calculation for Time Step t=1

Consider an LSTM layer with 2 units and the following setup:

  • Initial states for both units: h_0 = 0 and c_0 = 0.
  • Shared weights for input, forget, cell, and output gates.
  • Input sequence: x_1 = [0.2, 0.3, 0.8] at time step t=1.

The weights and biases are as follows for simplicity:

  • W_i = W_f = W_c = W_o = [0.1, 0.2, 0.3]
  • U_i = U_f = U_c = U_o = 0.5
  • b_i = b_f = b_c = b_o = 0.1

For t=1, both units will process the input x_1 with the same initial states and shared weights:

Unit 1 and Unit 2

  • Input Gate: i_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Forget Gate: f_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Cell Gate: 𝑐̃_1 = tanh(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = tanh(0.42) ≈ 0.397
  • Output Gate: o_1 = σ(0.1*0.2 + 0.2*0.3 + 0.3*0.8 + 0.5*0 + 0.1) = σ(0.42) ≈ 0.603
  • Cell State: c_1 = f_1 * c_0 + i_1 * 𝑐̃_1 = 0.603 * 0 + 0.603 * 0.397 ≈ 0.239
  • Hidden State: h_1 = o_1 * tanh(c_1) = 0.603 * tanh(0.239) ≈ 0.603 * 0.235 ≈ 0.142

After the first time step, both units therefore end up with the same hidden state and cell state, because they processed the same input with the same initial states and the same weights.
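
To rule out an arithmetic slip, here is a short NumPy check of the t=1 numbers above; it uses exactly the example weights, biases, and gate equations from the setup, nothing new:

```python
# Reproduce the t=1 hand calculation with the example parameters:
# W = [0.1, 0.2, 0.3], U = 0.5, b = 0.1 for every gate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([0.2, 0.3, 0.8])
W = np.array([0.1, 0.2, 0.3])   # shared by i, f, c~, o in this example
U, b = 0.5, 0.1
h0, c0 = 0.0, 0.0

pre = W @ x1 + U * h0 + b       # 0.42, identical pre-activation for every gate
i1 = sigmoid(pre)               # input gate      ~ 0.603
f1 = sigmoid(pre)               # forget gate     ~ 0.603
c_tilde = np.tanh(pre)          # cell candidate  ~ 0.397
o1 = sigmoid(pre)               # output gate     ~ 0.603
c1 = f1 * c0 + i1 * c_tilde     # cell state      ~ 0.239
h1 = o1 * np.tanh(c1)           # hidden state    ~ 0.142

print(i1, c_tilde, c1, h1)
```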

I would appreciate any insights, examples, or references that can help clarify how LSTM units achieve diversity in their states and outputs despite the same initial conditions and shared weights.

Specific Questions:

  1. How can LSTM units initialized with zeros and using shared weights develop different states over time?
  2. Is there any mechanism within the LSTM structure that introduces diversity among units?
  3. Are there practical examples or intuitive explanations that demonstrate this divergence?