The ELBO is a lower bound, and it only matches the true likelihood when the q-distribution/encoder we choose equals the true posterior. Are there any guarantees that maximizing the ELBO actually helps us find better parameters when the ELBO does not match the true likelihood?
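To make the gap concrete, the decomposition I have in mind (writing $q(z)$ for the variational distribution and $\theta$ for the model parameters) is

$$\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\big[\log p_\theta(x, z) - \log q(z)\big]}_{\text{ELBO}} \;+\; \mathrm{KL}\big(q(z)\,\|\,p_\theta(z \mid x)\big),$$

so the bound is tight exactly when $q(z) = p_\theta(z \mid x)$, and otherwise the gap is the KL term.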
In EM, I believe there is a proof that the likelihood monotonically increases (so the parameters we obtain get better and better) and eventually converges, but are we sure there won't be a large gap between the ELBO and the likelihood along the way?
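My understanding of the EM argument (sketched, with $q_t(z) = p_{\theta_t}(z \mid x)$ chosen in the E-step so the bound is tight) is

$$\log p_{\theta_{t+1}}(x) \;\ge\; \mathrm{ELBO}(\theta_{t+1}, q_t) \;\ge\; \mathrm{ELBO}(\theta_t, q_t) \;=\; \log p_{\theta_t}(x),$$

where the middle inequality comes from the M-step. This is why I believe the likelihood is monotone when the E-step is exact, and why I'm unsure what happens when it is only approximate.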
In VAEs, the q-distribution is given by the encoder network, and we optimize the ELBO over both encoder and decoder parameters. I would guess the encoder family is rich enough (in the sense that, with optimal encoder parameters, the ELBO approximates the true likelihood fairly well) because it is a neural network. But as we take gradient steps, although the ELBO increases, do we know that the likelihood also increases?
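To be explicit about what is optimized in the VAE case, with encoder parameters $\phi$ and decoder parameters $\theta$:

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \;=\; \log p_\theta(x) - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big).$$

A single gradient step could increase $\mathcal{L}$ by shrinking the posterior KL rather than by increasing $\log p_\theta(x)$, which is exactly the case I'm worried about.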
In diffusion models, the latent-variable model we use to compute the ELBO is the reverse process (decoder), and the q-distribution we choose is the predefined forward diffusion kernel. However, this q-distribution is not exactly the posterior of the model defined by the decoder. Both of the questions above arise again: can we really ignore the gap between the ELBO and the likelihood, and are we sure the likelihood increases as we maximize the ELBO?
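Writing the same decomposition for the diffusion case, with the forward process $q(x_{1:T} \mid x_0)$ fixed:

$$\log p_\theta(x_0) \;=\; \mathrm{ELBO}(\theta; x_0) \;+\; \mathrm{KL}\big(q(x_{1:T} \mid x_0)\,\|\,p_\theta(x_{1:T} \mid x_0)\big),$$

so increasing the ELBO over $\theta$ only increases $\log p_\theta(x_0)$ if the KL term does not grow by more, and here $q$ can never be adapted to close the gap.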