
After a short encounter with deep neural networks, I have the impression that training them boils down to determining the values of a vast number of parameters.

The expectation-maximization (EM) algorithm, which I only learned about recently, deals with a similar problem, namely determining the values of a vast number of latent variables so as to maximize the expected log-likelihood.

Question:

What, if any, are the essential differences between deep neural networks and expectation maximization, with respect to

  • the problems they solve, and
  • the methods used for optimizing the values of a vast number of parameters or latent variables, respectively?

1 Answer


A high-level view of neural networks is that they are just heuristics for approximating multivariate functions: you pick a class of parametrizable functions and optimize their parameters, and whoever picks the best heuristic wins. Apart from the fact that these heuristics work surprisingly well, some are also accompanied by theorems guaranteeing that they are universal function approximators: essentially any function can be approximated with them, from PDE solutions to probability distributions on discrete sets.
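
As a toy illustration of "pick a parametrizable family and optimize its parameters" (my own sketch, not part of the original answer; the architecture and hyperparameters are arbitrary choices), here is a one-hidden-layer network fitted to $\sin(x)$ with plain gradient descent in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: noisy samples of the target function we want to approximate.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x) + 0.05 * rng.normal(size=x.shape)

# One-hidden-layer network: f(x) = W2 @ tanh(W1 x + b1) + b2.
H = 16                                   # hidden width (arbitrary choice)
W1 = rng.normal(scale=0.5, size=(1, H))
b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1))
b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass.
    h = np.tanh(x @ W1 + b1)             # (N, H)
    pred = h @ W2 + b2                   # (N, 1)
    err = pred - y

    # Backward pass: gradients of the mean squared error.
    N = len(x)
    dpred = 2 * err / N
    dW2 = h.T @ dpred
    db2 = dpred.sum(axis=0)
    dh = dpred @ W2.T * (1 - h**2)       # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ dh
    db1 = dh.sum(axis=0)

    # Gradient-descent update of all parameters.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float(np.mean(err**2)))
```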

The EM algorithm, on the other hand, tries to approximate the (parametric) probability distribution of a very specific class of problems: hidden variable models. Suppose I have multivariate random variables $X, Y$, where $Y$ is observable and $X$ is unobservable, but I know the form of $p(X \mid Y, \theta)$ and this relationship is sufficiently constraining. Then the EM algorithm estimates $X$ and $\theta$ from the observed values of $Y$.
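
To make the hidden variable setup concrete, here is a minimal EM sketch (my example, not the answer's) for a two-component 1-D Gaussian mixture, where the component label plays the role of the unobserved $X$, the measurement is the observed $Y$, and $\theta$ collects the weights, means, and variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data Y: a mixture of two Gaussians; the component labels X are hidden.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guess for theta = (mixing weights, means, variances).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(100):
    # E-step: posterior responsibilities p(X = k | Y = y_i, theta).
    r = w * gauss(y[:, None], mu, var)        # (N, 2), unnormalized
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate theta by maximizing the expected log-likelihood.
    nk = r.sum(axis=0)
    w = nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / nk
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:", w, "means:", mu, "variances:", var)
```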

The optimization procedures for the two methods also differ a lot. Neural networks need large datasets to train (because of their size), and therefore the only method that can work reasonably fast is mini-batch stochastic gradient descent: essentially, break the dataset into pieces (batches) and at each step take a gradient direction computed just from the data in the current batch.
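
A bare-bones version of that loop, assuming a toy linear-regression model in NumPy (the model, learning rate, and batch size are placeholder choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; in practice the dataset is far too large to use all at once.
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=len(X))

w = np.zeros(5)                              # parameters to learn
lr, batch_size = 0.05, 64

for epoch in range(5):
    idx = rng.permutation(len(X))            # reshuffle on each pass over the data
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]    # one small batch
        Xb, yb = X[b], y[b]
        # Gradient of the squared error, computed on the batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(b)
        w -= lr * grad                       # one SGD step per batch

print("estimated weights:", np.round(w, 2))
```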

The EM algorithm is a specific case of the MM (minorize-maximize) algorithm, and its optimization step works as follows. The target function to be maximized (a log-likelihood) is in general non-concave. However, at every point that the optimization visits, I approximate the target with a concave function that touches it at that point but lies below it everywhere else. I can maximize this lower-bound surrogate with any standard convex optimization algorithm, and the target function is guaranteed to increase by at least as much as the surrogate does.
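
Spelled out for EM itself (a standard derivation, not given explicitly above): for any distribution $q(X)$ over the hidden variables, Jensen's inequality yields a lower bound on the log-likelihood,

$$\log p(Y \mid \theta) \;=\; \log \sum_X q(X)\,\frac{p(X, Y \mid \theta)}{q(X)} \;\ge\; \sum_X q(X)\,\log \frac{p(X, Y \mid \theta)}{q(X)},$$

with equality at the current iterate $\theta_t$ when $q(X) = p(X \mid Y, \theta_t)$ (the E-step). The M-step maximizes the right-hand side over $\theta$, so the log-likelihood can only go up.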

Finally, both methods can converge to local maxima/minima, but the MM algorithm is extremely stable, whereas for many neural network heuristics stability is far from guaranteed.

The comparison of the EM algorithm with neural networks is best seen through the takeover of natural language processing by neural networks, a field that in the early days was dominated by hidden Markov models. If you want to pursue this, I recommend reading Bengio's 2003 paper "A Neural Probabilistic Language Model" and looking up some of the early methods mentioned there in Manning and Schütze's "Foundations of Statistical Natural Language Processing".

Bengio's paper is important for the following reason: it is an easy-to-understand example of what a good heuristic looks like. It consists of both a large, expressive function part and a "dimensionality reduction" part (the learned word representations), and it manages to dominate the traditional probabilistic methods.

Finally, for examples of the MM algorithm, almost any Kenneth Lange paper you pick will contain a bunch of worked cases.

  • a very clear and detailed explanation! – Commented Apr 14 at 4:43
