
TL;DR: I believe MDL using NML is a special case of joint MAP over model and parameters. I need to verify this and find sources that have acknowledged it.

This is how I understand Minimum Description Length (MDL): We have a set of models $M_i\in M$ with prior $P(M_i)$, each parameterized by $\Theta_i$, i.e. the pair $(M_i,\theta\in\Theta_i)$ fully specifies the likelihood function $P(z|M_i,\theta)$, where $z$ is the data. For every $M_i$, we define a function $f_i(z)$ and then choose $\arg\max_{M_i\in M}P(M_i)f_i(z)$. If we define $f_i(z):=P(z|M_i)=\int_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)d\theta$, this is equivalent to MAP over models.
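To make this rule concrete, here is a minimal sketch (the two models and the coin-flip data are my own toy example, not from any source): $M_0$ fixes $\theta=0.5$ with no free parameter, while $M_1$ puts a uniform prior on $\theta$, so its marginal likelihood has a closed form via the Beta function.

```python
import numpy as np
from scipy.special import betaln

# Toy example: two Bernoulli models for a sequence z of n flips with k heads.
# M0 fixes theta = 0.5 (no free parameter); M1 takes theta ~ Uniform(0, 1),
# i.e. P(theta | M1) = 1 on [0, 1].
n, k = 20, 16                       # observed data z: 16 heads in 20 flips
prior = {"M0": 0.5, "M1": 0.5}      # prior P(M_i) over the two models

# f_i(z) = P(z | M_i), the marginal likelihood of the observed sequence.
log_f = {"M0": n * np.log(0.5)}     # M0 has no parameter to integrate out
# For M1: int_0^1 theta^k (1 - theta)^(n - k) d theta = B(k + 1, n - k + 1).
log_f["M1"] = betaln(k + 1, n - k + 1)

scores = {m: np.log(prior[m]) + log_f[m] for m in prior}
print(max(scores, key=scores.get))  # argmax_i P(M_i) f_i(z)  ->  "M1"
```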

However, we can define $f_i$ in other ways. One way is Normalized Maximum Likelihood (NML):

$$f_i(z):=g^{NML}_i(z):=\frac{\max_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)}{\int\max_{\theta\in\Theta_i}P(\theta|M_i)P(z'|M_i,\theta)dz'}$$ (In general, $P(\theta|M_i)$ is replaced by a more general "luckiness function", but in this question let's stick to this formula.)

The denominator here (call it $I_i$) normalizes the values to prevent overfitting. It is larger for more complex models, so it essentially acts as a measure of model complexity (although I am not sure whether that is how complexity should be defined, or just a consequence of some independent definition of model complexity).
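Here is a small sketch computing $I_i$ for a concrete case (my own toy setup): a Bernoulli model on binary sequences $z'$ of length $n$, with the luckiness term taken as the uniform density $P(\theta|M_i)=1$, so the integral over $z'$ becomes a finite sum over all $2^n$ sequences.

```python
import numpy as np
from scipy.special import gammaln

# Sketch of the NML denominator I_i for a Bernoulli model on binary
# sequences of length n, with P(theta | M_i) = 1 (uniform on [0, 1]).
# Grouping the 2^n sequences by their number of heads k, the maximized
# likelihood of any sequence with k heads is theta_hat^k (1-theta_hat)^(n-k)
# with theta_hat = k / n.
def log_I_bernoulli(n: int) -> float:
    logs = []
    for k in range(n + 1):
        # log of the number of sequences with k heads: log C(n, k)
        log_count = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
        # log of the maximized likelihood at theta_hat = k / n
        log_max_lik = 0.0
        if k > 0:
            log_max_lik += k * np.log(k / n)
        if k < n:
            log_max_lik += (n - k) * np.log(1 - k / n)
        logs.append(log_count + log_max_lik)
    return float(np.logaddexp.reduce(logs))

for n in (10, 100, 1000):
    # log I_n grows with n, roughly like (1/2) log n plus a constant,
    # matching the idea that I_i measures model complexity.
    print(n, log_I_bernoulli(n))
```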

One way people have looked at the complexity of a model is through its prior $P(M_i)$. In this view, a smaller prior means a more complex model, because an optimal code requires a longer message to encode a model with a smaller prior. Now, take whatever prior we already had on the models, i.e. $P(M_i)$, and define $P_{new}(M_i)=\frac{P(M_i)}{I_i}$ as the new prior over the models (strictly, $P_{new}$ should be renormalized to sum to one, but the normalization constant is the same for all $M_i$ and does not affect the argmax below). Also let's define $f_i(z):=\max_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)$. As in MDL, we proceed to find:

\begin{align} \arg\max_{M_i\in M}P_{new}(M_i)f_i(z)&=\arg\max_{M_i\in M}\frac{P(M_i)}{I_i}\max_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)\\&=\arg\max_{M_i\in M}P(M_i)\frac{\max_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)}{I_i}\\&=\arg\max_{M_i\in M}P(M_i)g_i^{NML}(z) \end{align}

which is equivalent to the NML solution with priors $P(M_i)$. However, we can also think of it as finding: \begin{align} \arg\max_{M_i\in M}P_{new}(M_i)f_i(z)&=\arg\max_{M_i\in M}P_{new}(M_i)\max_{\theta\in\Theta_i}P(\theta|M_i)P(z|M_i,\theta)\\&=\arg\max_{M_i\in M}\max_{\theta\in\Theta_i}P_{new}(M_i)P(\theta|M_i)P(z|M_i,\theta)\\&=\arg\max_{M_i\in M}\max_{\theta\in\Theta_i}P_{new}(M_i, \theta)P(z|M_i,\theta)\\&=\arg\max_{M_i\in M}\max_{\theta\in\Theta_i}P_{new}(M_i, \theta|z)\\&=\arg\max_{(M_i\in M,\theta\in\Theta_i)}P_{new}(M_i, \theta|z) \end{align} which is just the MAP solution over the pairs $(M_i,\theta\in\Theta_i)$ instead of over the $M_i$ alone, with the new priors $P_{new}(M_i)$. (The fourth line uses Bayes' rule, $P_{new}(M_i,\theta)P(z|M_i,\theta)\propto P_{new}(M_i,\theta|z)$; the evidence term is constant over the argmax.)
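As a sanity check on this equivalence, here is a toy numerical sketch (everything discrete and made up, so the integral over $z'$ becomes a sum and $P(\theta|M_i)$ is a pmf over a finite parameter grid): scoring models by $P(M_i)\,g_i^{NML}(z)$ and taking the joint MAP with $P_{new}$ pick the same model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a finite data space Z and, for each model M_i, a finite
# parameter grid with a uniform pmf P(theta | M_i).
Z = 8                                    # |data space|
models = []
for n_params in (1, 3, 6):               # models of increasing flexibility
    lik = rng.dirichlet(np.ones(Z), size=n_params)  # rows: P(z | M_i, theta)
    p_theta = np.full(n_params, 1.0 / n_params)     # P(theta | M_i)
    models.append((p_theta, lik))
P_M = np.array([0.5, 0.3, 0.2])           # prior P(M_i)

z = 4                                     # observed data point
I = np.array([(p[:, None] * lik).max(axis=0).sum()  # denominator I_i
              for p, lik in models])
g_nml = np.array([(p * lik[:, z]).max() for p, lik in models]) / I

# Route 1: argmax_i P(M_i) * g_i^NML(z)
route1 = np.argmax(P_M * g_nml)

# Route 2: joint MAP over (M_i, theta) with P_new(M_i) = P(M_i) / I_i
P_new = P_M / I
route2 = np.argmax([P_new[i] * (p * lik[:, z]).max()
                    for i, (p, lik) in enumerate(models)])

assert route1 == route2
print("both routes pick model", route1)
```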

However, this is actually a more general solution than NML, because now we are not required to define the complexity of $M_i$ by referring to $I_i$ at all. So as long as $P_{new}(M_i)$ captures the complexity correctly, we can choose the MAP solution over the pair of model and parameters as the solution that describes the data with the shortest message.

My question is: is there a source that has acknowledged this, used it in some way for model selection, and asserted that it is a valid MDL solution (i.e. that it in some sense minimizes the description length of the data)? In my work I would like to describe my procedure this way, but if there is no literature on it, it will be harder to convince people that it makes sense.
