Optimizing LSTM Autoencoder Latent Dimension using Mutual Information and BO

Shashank Jain · Published in GoPenAI · 5 min read · Jul 10, 2024

Introduction:

In the area of time series analysis and deep learning, determining the optimal architecture for neural networks often involves a degree of guesswork. One crucial parameter in autoencoder architectures is the size of the latent dimension — the compressed representation of our data. But how can we systematically determine the best latent dimension size?

In this blog post, we’ll explore an innovative approach to this problem by combining three powerful concepts: LSTM Autoencoders, Mutual Information (MI), and Bayesian Optimization (BO). Our goal is to find the optimal latent dimension for an LSTM Autoencoder that maximizes the mutual information between the input data and its latent representation.
Here is the Colab link for anyone who wants to experiment with this: https://colab.research.google.com/drive/1GhY3p4xoaH2pTIOndofLIXChLLzda0Ef?usp=sharing

Here’s our strategy:

1. We’ll use an LSTM Autoencoder to encode our time series data into a latent space and then reconstruct it.

2. We’ll calculate the Mutual Information between the input data and its latent representation. This serves as our objective function — we want to maximize this value.

3. We’ll employ Bayesian Optimization to efficiently search for the latent dimension size that maximizes our MI-based objective function.

By the end of this post, you’ll understand how to implement this approach, which combines deep learning, information theory, and optimization techniques to automatically find the best latent dimension for your LSTM Autoencoder.

Let’s dive into the mathematical concepts, implement the algorithm, and apply it to a real-world time series dataset — the classic airline passengers dataset. This approach not only eliminates guesswork but also ensures that our latent representation captures the most informative aspects of our data.

Prerequisites:

Before we dive in, let’s quickly review the key concepts you should be familiar with:

1. LSTM (Long Short-Term Memory) networks

2. Autoencoders

3. Basic probability theory and statistics

4. Python programming and PyTorch library

If you’re comfortable with these concepts, you’re ready to explore the more advanced topics we’ll cover in this post.

Key Concepts:

1. Radial Basis Function (RBF):

The RBF is a real-valued function whose value depends only on the distance from the origin or some other center point. In our context, we use the Gaussian RBF, which is defined as:

K(x, y) = exp(-||x - y||² / (2 * sigma²))

Where x and y are input vectors, ||x - y|| is the Euclidean distance between x and y, and sigma is a parameter that controls the width of the Gaussian.
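As a quick illustration (my own toy snippet, not from the post), here is the Gaussian RBF evaluated for a pair of vectors; nearby points give values close to 1, distant points give values close to 0:

import torch

def rbf_kernel(x, y, sigma=1.0):
    # Squared Euclidean distance scaled by the kernel width
    sq_dist = torch.sum((x - y) ** 2)
    return torch.exp(-sq_dist / (2 * sigma ** 2))

x = torch.tensor([1.0, 2.0])
y = torch.tensor([1.5, 1.0])
print(rbf_kernel(x, y))  # ~0.54 for these moderately close points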

2. Gram Matrix:

The Gram matrix, also known as the kernel matrix, is a symmetric matrix of inner products. For our RBF kernel, the Gram matrix K is computed as:

K[i,j] = exp(-||xi - xj||² / (2 * sigma²))

Where xi and xj are input vectors.

3. Entropy (Shannon vs. Rényi):

Entropy is a measure of uncertainty or information content in a probability distribution. The two main types we consider are:

Shannon Entropy: H = -sum(p_i * log(p_i))

Rényi Entropy: H_alpha = 1 / (1 - alpha) * log(sum(p_i^alpha))

We choose Rényi entropy over Shannon entropy for two main reasons:

a) It generalizes Shannon entropy (as alpha approaches 1, Rényi entropy converges to Shannon entropy).

b) It allows for a more flexible characterization of the probability distribution, which can be beneficial in certain machine learning tasks.
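To make the connection concrete, here is a toy computation (my own example, with an arbitrary distribution) showing both entropies and the convergence of Rényi entropy to the Shannon value as alpha approaches 1:

import numpy as np

p = np.array([0.5, 0.25, 0.25])  # toy probability distribution

shannon = -np.sum(p * np.log(p))

def renyi(p, alpha):
    return np.log(np.sum(p ** alpha)) / (1 - alpha)

print(shannon)          # ~1.040 nats
print(renyi(p, 2.0))    # ~0.981: collision entropy, no greater than Shannon
print(renyi(p, 1.001))  # ~1.040: approaches the Shannon value as alpha -> 1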

4. Mutual Information:

Mutual Information (MI) quantifies the amount of information obtained about one random variable by observing another random variable. It’s defined as:

MI(X; Y) = H(X) + H(Y) - H(X, Y)

Where H(X) and H(Y) are the marginal entropies, and H(X, Y) is the joint entropy.
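As a tiny worked example (my own, with made-up probabilities), the identity can be checked directly for two binary variables:

import numpy as np

# Hypothetical joint distribution p(x, y) for two binary variables
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

H = lambda p: -np.sum(p * np.log(p))  # Shannon entropy in nats
print(H(p_x) + H(p_y) - H(p_xy.flatten()))  # ~0.19 nats of shared information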

Implementation Flow:

1. Input Vectors: We start with our input time series data, transformed into sliding window samples.

2. RBF Calculation: We compute the RBF kernel between all pairs of input vectors.

3. Gram Matrix: The RBF calculations form our Gram matrix.

4. Eigenvalue Decomposition: We perform eigenvalue decomposition on the Gram matrix. Intuitively, the eigenvalues represent the amount of variation in the data along different directions in the feature space. Larger eigenvalues correspond to more important features or patterns in the data.

5. Entropy Calculation: We use the eigenvalues to compute the Rényi entropy. The distribution of eigenvalues gives us insight into the complexity or information content of our data.

6. Mutual Information: We calculate the MI between the input data and the latent representation (Z) produced by our LSTM Autoencoder. To do this, we need to compute:

a) H(X): Entropy of the input data

b) H(Z): Entropy of the latent representation

c) H(X, Z): Joint entropy of X and Z

The joint entropy H(X, Z) is computed in the kernel domain: we take the Hadamard (element-wise) product of the normalized Gram matrices of X and Z, re-normalize the result by its trace, and apply the same eigenvalue-based entropy calculation.

Code Implementation:

Let’s look at the key parts of our implementation:

1. LSTM Autoencoder:

import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim, num_layers=1):
        super(LSTMAutoencoder, self).__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.latent = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.LSTM(latent_dim, hidden_dim, num_layers, batch_first=True)
        self.output = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        # Encode: keep the final hidden state of the last LSTM layer
        _, (h, _) = self.encoder(x)
        h = h[-1]
        # Project to the latent space and repeat across time steps for the decoder
        z = self.latent(h)
        z = z.unsqueeze(1).repeat(1, x.size(1), 1)
        # Decode back to the original feature dimension
        out, _ = self.decoder(z)
        out = self.output(out)
        return out, z

This LSTM Autoencoder takes in time series data, encodes it into a latent representation, and then decodes it back to the original space.
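A quick shape check (my own snippet, with illustrative hyperparameters) shows what the forward pass returns for a batch of 32 windows of length 10 with a single feature per step:

model = LSTMAutoencoder(input_dim=1, hidden_dim=64, latent_dim=8)
x = torch.randn(32, 10, 1)
recon, z = model(x)
print(recon.shape)  # torch.Size([32, 10, 1]): reconstruction matches the input shape
print(z.shape)      # torch.Size([32, 10, 8]): the latent vector repeated across time steps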

2. Gram Matrix and Entropy Calculation:

def compute_gram_matrix(X, sigma=1.0):
    # Flatten each sample and compute all pairwise Euclidean distances
    X_flat = X.view(X.size(0), -1)
    pairwise_dists = torch.cdist(X_flat, X_flat)
    # Gaussian RBF kernel
    K = torch.exp(-pairwise_dists ** 2 / (2 * sigma ** 2))
    return K

def compute_renyi_entropy(K, alpha=2):
    # The eigenvalues of the normalized Gram matrix play the role of probabilities
    eigvals = torch.linalg.eigvalsh(K)
    return 1 / (1 - alpha) * torch.log(torch.sum(eigvals ** alpha))

These functions compute the Gram matrix using the RBF kernel and then calculate the Rényi entropy using the eigenvalues of the Gram matrix.
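One note: the mutual information code in the next section also calls a normalize_gram_matrix helper that the post does not show. Based on how the joint matrix is normalized there (division by its trace), a plausible definition is the following trace normalization, which makes the eigenvalues sum to 1 so they can be treated like probabilities:

def normalize_gram_matrix(K):
    # Assumed implementation: trace normalization, consistent with the joint-matrix step below
    return K / torch.trace(K)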

3. Mutual Information Calculation:

def compute_mutual_information(X, Z, alpha=2, sigma=1.0):
    K_X = compute_gram_matrix(X, sigma)
    K_Z = compute_gram_matrix(Z, sigma)
    A_X = normalize_gram_matrix(K_X)
    A_Z = normalize_gram_matrix(K_Z)
    H_X = compute_renyi_entropy(A_X, alpha)
    H_Z = compute_renyi_entropy(A_Z, alpha)
    # Joint entropy from the Hadamard (element-wise) product of the normalized Gram matrices
    A_XZ = A_X * A_Z
    A_XZ_normalized = A_XZ / torch.trace(A_XZ)
    H_XZ = compute_renyi_entropy(A_XZ_normalized, alpha)
    return H_X + H_Z - H_XZ

This function computes the mutual information between X and Z using their individual entropies and their joint entropy.
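As a quick sanity check (my own illustrative snippet, not from the post), the function can be called on a batch of random windows and the latent codes of an untrained model:

X = torch.randn(64, 10, 1)  # 64 windows of length 10 with one feature
model = LSTMAutoencoder(input_dim=1, hidden_dim=64, latent_dim=8)
with torch.no_grad():
    _, Z = model(X)
mi = compute_mutual_information(X, Z, alpha=2, sigma=1.0)
print(mi.item())  # higher values mean the latent codes retain more information about X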

Dataset:

For this implementation, we’re using the classic airline passengers dataset. This dataset contains monthly totals of international airline passengers from 1949 to 1960. We create sliding window samples of length 10 from this time series data:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, usecols=[1])
data = df.values.astype('float32')

# Scale passenger counts to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
data_normalized = scaler.fit_transform(data)

def create_sequences(data, seq_length):
    # Build overlapping sliding windows of length seq_length
    sequences = []
    for i in range(len(data) - seq_length + 1):
        sequence = data[i:i+seq_length]
        sequences.append(sequence)
    return np.array(sequences)

seq_length = 10
sequences = create_sequences(data_normalized, seq_length)

This preprocessing step allows our LSTM Autoencoder to learn patterns in the passenger numbers over 10-month periods.
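The post does not show the Bayesian Optimization loop itself, so here is a minimal sketch of how the pieces could be wired together. The choice of scikit-optimize's gp_minimize and the train_autoencoder helper are my assumptions, not taken from the post; gp_minimize minimizes its objective, so we return the negative mutual information.

from skopt import gp_minimize
from skopt.space import Integer

X_tensor = torch.tensor(sequences, dtype=torch.float32)  # (num_windows, 10, 1)

def objective(params):
    latent_dim = int(params[0])
    model = LSTMAutoencoder(input_dim=1, hidden_dim=64, latent_dim=latent_dim)
    train_autoencoder(model, X_tensor)  # hypothetical training helper (MSE reconstruction loss)
    with torch.no_grad():
        _, Z = model(X_tensor)
    mi = compute_mutual_information(X_tensor, Z)
    return -mi.item()  # maximize MI by minimizing its negative

result = gp_minimize(objective, [Integer(2, 64, name="latent_dim")], n_calls=20, random_state=0)
print("Optimal latent dimension:", result.x[0])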

In my run, the output was:

Optimal latent dimension: 42
Final model loss: 0.008036612439900637
Conclusion:

By choosing the latent dimension of our LSTM Autoencoder to maximize the Mutual Information between the input and its latent representation, and letting Bayesian Optimization handle the search, we encourage the model to learn a representation that captures the most informative aspects of our time series data. This approach combines the power of deep learning with the insights of information theory, eliminating guesswork about the latent dimension and potentially leading to more robust and interpretable models for time series analysis.

If you like my work, you can support it here: https://www.buymeacoffee.com/smjain

Colab link : https://colab.research.google.com/drive/1GhY3p4xoaH2pTIOndofLIXChLLzda0Ef?usp=sharing
