
It's commonly said that in VAEs we use the reparameterization trick because "we can't backpropagate through a stochastic node."

[figure: computation graph illustrating backpropagation through a stochastic node]

It makes sense from the picture, but I find it hard to understand exactly what it means and why. Let's say $X \sim N(u, 1)$.

Suppose we want to compute $$\frac{d X}{d u},$$ which is not possible because the sampling operation is non-differentiable. That is, we don't know how changing $u$ a little bit would affect the sample $X$ we drew.
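
To make this concrete, here is a minimal PyTorch sketch (my own illustration, not part of the VAE setup) of how an ordinary draw breaks the gradient path to $u$, while a reparameterized draw keeps it:

```python
import torch

# u is the mean we would like to learn, so autograd tracks it.
u = torch.tensor(0.5, requires_grad=True)

# Drawing X ~ N(u, 1) with .sample() happens outside the autograd graph,
# so there is no gradient path from X back to u.
X = torch.distributions.Normal(u, 1.0).sample()
print(X.requires_grad, X.grad_fn)  # False None -> dX/du is unavailable

# With the reparameterization X = u + eps, eps ~ N(0, 1) (what .rsample() does),
# the path exists and dX/du = 1.
X_r = torch.distributions.Normal(u, 1.0).rsample()
X_r.backward()
print(u.grad)  # tensor(1.)
```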

However, consider maximum likelihood estimation for a Gaussian, where we are trying to maximize the following quantity:

$$\sum_{i=1}^N \log p(X_i;u)$$ for which the derivative $$\frac{d \log p(X_i ; u)}{d u}$$ can be easily calculated. My confusion comes from the fact that $$\frac{d \log p(X_i ; u)}{d u} = \frac{d \log p(X_i ; u)}{d X_i} \frac{d X_i}{d u}$$ by the chain rule. If we can't compute $\frac{d X_i}{d u}$, why can we compute $\frac{d \log p(X_i ; u)}{d u}$ ?

Comment – Jean Marie (Jan 30, 2022 at 8:09): Do you really think that Variational AutoEncoders (VAE) are commonly known? Surely not by $99.9\%$ of people here. You should take some time explaining the context of your study.

2 Answers

Answer 1 (score: 3)

I think you are confused about backpropagation. You never need to take the gradient of the input with respect to anything (because there is no layer before the input), nor do you need to make assumptions about the distribution of the input.

The 'reparametrization trick' makes some assumption about the parametric form of the distribution of the latent vector $z$, and represents sampling from that latent space as the output of some function of the parameter values and a noise vector. That allows you to backprop through the latent vector $z$ by taking the gradient of it with respect to the parameter values.

For example, if $z$ is assumed to be multivariate Gaussian, then $z_i = \mu_i + \sigma_i \epsilon_i$, where $\epsilon_i \sim N(0,1)$, and $$\frac{\partial z_i}{\partial \mu_i} = 1$$ $$\frac{\partial z_i}{\partial \sigma_i} = \epsilon_i$$ The vectors $\mu$ and $\sigma$ are learned, i.e. they are connected to the previous layer in the network, and you can backprop through them in the usual way. The random noise $\epsilon$ is drawn from a fixed distribution, not learned, so you do not backprop through those nodes (which is why it is orange in your diagram).
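
As a hypothetical sketch of this in PyTorch (the toy loss and the concrete numbers are just for illustration), backprop reaches $\mu$ and $\sigma$ through $z$, while $\epsilon$ stays outside the autograd graph:

```python
import torch

# mu and sigma are learned (in a VAE they come from the encoder network).
mu = torch.tensor([0.0, 1.0], requires_grad=True)
sigma = torch.tensor([1.0, 0.5], requires_grad=True)

# epsilon is fixed noise from N(0, 1): not learned, not tracked by autograd.
eps = torch.randn(2)

# Reparameterized sample: z_i = mu_i + sigma_i * eps_i
z = mu + sigma * eps

# Any downstream scalar loss; here a toy one just to have something to backprop.
loss = (z ** 2).sum()
loss.backward()

print(mu.grad)            # dloss/dmu_i    = 2 * z_i * (dz_i/dmu_i = 1)
print(sigma.grad)         # dloss/dsigma_i = 2 * z_i * (dz_i/dsigma_i = eps_i)
print(eps.requires_grad)  # False: gradients never flow into the noise node
```

The key point is that the randomness enters only through $\epsilon$, which is sampled once per forward pass and treated as a constant by autograd.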

Answer 2 (score: 0)

I think your maximum likelihood equation is not correct. In particular, if I understand your setting correctly, $X_i$ is your data, which does not depend on $u$. The log-likelihood does depend on $u$: for a constant variance $\sigma^2$ it is, up to terms that don't depend on $u$, just $-\frac{1}{2\sigma^2}\sum_i (X_i-u)^2$. Differentiating gives $\sum_i \frac{d\log p(X_i;u)}{du} = \frac{1}{\sigma^2}\sum_i (X_i-u)$, with no term involving a derivative of $X_i$.

Notice that the final expression depends on $X_i$, but not because we differentiated through it: $dX_i/du$ is zero, since $X_i$ is the data and does not depend on $u$.
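
As a quick numerical sanity check (a sketch with unit variance, as in the question, and made-up data), autograd recovers exactly $\sum_i (X_i - u)$ while treating the $X_i$ as constants:

```python
import torch

# Fixed data: the X_i are constants, so dX_i/du = 0.
X = torch.tensor([1.2, -0.3, 0.7, 2.1])

# The mean parameter we differentiate with respect to.
u = torch.tensor(0.5, requires_grad=True)

# Log-likelihood of the data under N(u, 1); only u is in the autograd graph.
log_lik = torch.distributions.Normal(u, 1.0).log_prob(X).sum()
log_lik.backward()

# The gradient matches the closed form sum_i (X_i - u): it depends on the
# values of X_i, but not via any derivative dX_i/du.
print(u.grad)                  # tensor(1.7000)
print((X - u.detach()).sum())  # tensor(1.7000)
```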
