It's commonly said that in VAEs, we use the reparameterization trick because "we can't backpropagate through a stochastic node."
This makes sense from the picture, but I found it hard to understand exactly what it means and why. Say $X \sim \mathcal{N}(u, 1)$,
and we want to compute $$\frac{d X}{d u},$$ which is not possible because the sampling operation is non-differentiable. That is, we don't know how changing $u$ a little bit would have affected the sample $X$ we got.
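For concreteness, here is a minimal numeric sketch (my own example, with made-up values) of what the reparameterization trick does: writing $X = u + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ makes $X$ a deterministic function of $u$, so $dX/du$ is suddenly well-defined (and equal to 1):

```python
import numpy as np

# Reparameterization: X = u + eps, eps ~ N(0, 1), so X ~ N(u, 1)
# but X is now a deterministic function of u given eps.
rng = np.random.default_rng(0)
u = 2.0
eps = rng.standard_normal()  # noise sampled once, independent of u

# Finite-difference estimate of dX/du with eps held fixed
h = 1e-6
dx_du = ((u + h + eps) - (u + eps)) / h
print(dx_du)  # close to 1.0
```

Without the reparameterization, "perturb $u$ and re-sample" gives a completely different $X$ each time, and no finite-difference quotient converges.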
However, consider MLE for a Gaussian, where we estimate $u$ by maximizing
$$\sum_{i=1}^N \log p(X_i;u),$$ for which the derivative $$\frac{d \log p(X_i ; u)}{d u}$$ can be easily calculated. My confusion comes from the fact that, by the chain rule, $$\frac{d \log p(X_i ; u)}{d u} = \frac{d \log p(X_i ; u)}{d X_i} \frac{d X_i}{d u}.$$ If we can't compute $\frac{d X_i}{d u}$, why can we compute $\frac{d \log p(X_i ; u)}{d u}$?
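To make the MLE side of the question concrete, here is a small check (my own illustrative numbers): for $X \sim \mathcal{N}(u, 1)$ we have $\log p(X; u) = -\tfrac{1}{2}(X - u)^2 - \tfrac{1}{2}\log 2\pi$, so $\frac{d \log p(X; u)}{d u} = X - u$, computed while treating the observed $X$ as a fixed number:

```python
import numpy as np

def log_p(x, u):
    # Log-density of N(u, 1) evaluated at x
    return -0.5 * (x - u) ** 2 - 0.5 * np.log(2 * np.pi)

x = 1.7  # an observed sample, held fixed while differentiating w.r.t. u
u = 0.3

# Central finite difference vs. the analytic gradient (x - u)
h = 1e-6
grad_fd = (log_p(x, u + h) - log_p(x, u - h)) / (2 * h)
grad_analytic = x - u
print(grad_fd, grad_analytic)  # both close to 1.4
```

Note that nothing here ever needs $\frac{dX}{du}$: the data point $x$ is a constant in this derivative.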