
The ELBO is a lower bound, and it only matches the true likelihood when the q-distribution/encoder we choose equals the true posterior distribution. Are there any guarantees that maximizing the ELBO indeed helps us find better parameters when the ELBO does not match the true likelihood?

In EM, I believe there is a proof that the likelihood monotonically increases (so the parameters we obtain get "better and better") and eventually converges, but are we sure there won't be a huge gap between the ELBO and the likelihood?
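
To spell out the EM argument I have in mind (the standard decomposition, with $q$ any distribution over the latent $z$):

$$\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{z\sim q}\!\left[\ln \frac{p_\theta(x,z)}{q(z)}\right]}_{\text{ELBO }\mathcal{L}(q,\theta)} \;+\; D_{KL}\big(q(z)\,\|\,p_\theta(z|x)\big).$$

The E-step sets $q(z)=p_{\theta_t}(z|x)$, which makes the KL term zero and the bound tight, $\log p_{\theta_t}(x)=\mathcal{L}(q,\theta_t)$; the M-step then picks $\theta_{t+1}$ to maximize $\mathcal{L}(q,\theta)$, so $\log p_{\theta_{t+1}}(x)\ge\mathcal{L}(q,\theta_{t+1})\ge\mathcal{L}(q,\theta_t)=\log p_{\theta_t}(x)$, which is where the monotone increase comes from.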

In VAEs, the q-distribution is determined by the encoder network, and we optimize the ELBO over both encoder and decoder parameters. I guess the encoder family is rich enough (in the sense that, with optimal encoder parameters, the ELBO approximates the true likelihood fairly well) because of the neural nets. But as we take gradient steps, although the ELBO is increasing, do we know that the likelihood is also increasing?

In diffusion models, the latent-variable model we use for calculating the ELBO is the reverse decoder, and the q-distribution we choose is the predefined forward encoder/diffusion kernel. However, this q-distribution is not exactly the posterior of the model defined by the decoder. Both questions from EM and VAEs arise here: can we really ignore the gap between the ELBO and the likelihood, and are we sure the likelihood increases as we maximize the ELBO?
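
To be explicit about the diffusion case I mean (standard notation, with $q$ the fixed forward kernel and $p_\theta$ the learned reverse process):

$$\log p_\theta(x_0) \;\ge\; \mathbb{E}_{q(x_{1:T}\mid x_0)}\!\left[\ln \frac{p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t)}{\prod_{t=1}^{T} q(x_t\mid x_{t-1})}\right],$$

where the gap is $D_{KL}\big(q(x_{1:T}\mid x_0)\,\|\,p_\theta(x_{1:T}\mid x_0)\big)$; since $q$ is fixed rather than optimized, it is not obvious that this gap ever becomes small.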


2 Answers


I agree with your assessment. It is not clear that maximizing the ELBO also maximizes the true log-likelihood; this is only certain when the variational distribution exactly matches the true posterior. In the case of VAEs, since the variational distribution is restricted to the family of normal distributions, an exact match is rare.

However, there is no efficient alternative to the ELBO. Empirically, maximizing the ELBO does seem to increase the true log-likelihood to some extent, as the successful results of VAEs and diffusion models suggest.


In VAEs the ELBO is defined as $L_{\theta,\phi}(x)=\mathbb{E}_{z\sim q_{\phi}(z|x)}\left[\ln p_{\theta}(x|z)\right]-D_{KL}\big(q_{\phi}(z|x)\parallel p_{\theta}(z)\big)$, and this lower bound relates to the true log-likelihood as $\log p_{\theta}(x) \ge L_{\theta,\phi}(x)$. It can be proved that equality holds if and only if the variational posterior $q_{\phi}(z|x)$ matches the true posterior $p_{\theta}(z|x)$. By maximizing the ELBO we are effectively finding parameters $(\theta,\phi)$ that push the variational posterior toward the true posterior. However, as you suspected, there is no guarantee that maximizing the ELBO always increases the true likelihood, both because of the non-convex optimization landscape of neural networks and because of the approximate nature of variational inference.
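
To see where the gap can bite, write the exact decomposition (a short, standard derivation, not specific to any particular architecture):

$$\log p_{\theta}(x) \;=\; L_{\theta,\phi}(x) + D_{KL}\big(q_{\phi}(z|x)\parallel p_{\theta}(z|x)\big),$$

so the gap between the log-likelihood and the ELBO is exactly the KL divergence from the variational posterior to the true posterior, and it vanishes (the bound is tight) precisely when the two coincide. A gradient step on $\phi$ with $\theta$ fixed leaves $\log p_{\theta}(x)$ unchanged, so any increase in the ELBO comes entirely from shrinking this KL gap. A step on $\theta$, however, can increase the ELBO while shrinking the KL term by an even larger amount, in which case their sum, the true log-likelihood, actually decreases. This is the precise sense in which ELBO ascent does not imply likelihood ascent.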

And as you have also correctly pointed out, in generative models such as VAEs the encoder network for $q_{\phi}$ belongs to the rich family of modern deep neural networks, which are extremely powerful and flexible universal function approximators. So even though there is no guarantee that increasing the ELBO monotonically increases the true likelihood, the variational family $q_{\phi}$ is often expressive enough to match the true posterior quite well. In practice, improvements in the likelihood via the ELBO are validated empirically, by observing better performance on held-out data or higher-quality generated samples; there is no theoretical convergence guarantee here. The case of diffusion models' ELBO optimization is similar.
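
For illustration only, here is a minimal sketch of what "optimizing the ELBO over both encoder and decoder parameters" looks like in code. The Gaussian encoder, Bernoulli decoder, layer sizes, and the random stand-in batch are all assumptions made for the sketch, not anything from the question:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE sketch: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p_theta(x|z)

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # E_q[log p_theta(x|z)] estimated with a single Monte Carlo sample
        recon = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1)
        return (recon - kl).mean()  # ELBO averaged over the batch

# One gradient ascent step on the ELBO, jointly over encoder and decoder parameters.
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)          # stand-in for a batch of data in [0, 1]
loss = -model.elbo(x)            # maximize the ELBO == minimize its negative
opt.zero_grad()
loss.backward()
opt.step()
```

Each optimizer step increases a stochastic estimate of the lower bound; by the decomposition above, that by itself does not certify that $\log p_{\theta}(x)$ went up.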
