
I'm confused about why the principle of maximum entropy is needed: it seems like we could simply use Bayesian inference in any case where we have some prior information, and it seems like both fail if there is no prior information. Since Jaynes advocates so strongly for both, I feel like I must be misunderstanding.

As an example, consider the German tank problem. From a maximum entropy point of view, if we assume some maximum possible number of tanks $\Omega$, and the largest serial number we have seen is $m$, then the distribution that maximizes entropy while staying consistent with our information about the problem is the uniform distribution on $[m,\Omega]$. This is certainly not the Bayesian answer.
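To make the contrast concrete (with made-up numbers: $k = 4$ observed serials, largest $m = 60$, assumed cap $\Omega = 200$): if we put a uniform prior on $N$ over $[m, \Omega]$, the likelihood of any particular sample of $k$ distinct serials is $1/\binom{N}{k}$, so the Bayesian posterior decreases in $N$ rather than being flat.

```python
from math import comb

# Hypothetical numbers: k = 4 serials observed, largest serial m = 60,
# assumed cap Omega = 200. With a uniform prior on N over [m, Omega],
# the likelihood of the observed sample is 1 / C(N, k), so the posterior
# is proportional to 1 / C(N, k) -- decreasing in N, unlike the flat
# maximum-entropy assignment on [m, Omega].
k, m, Omega = 4, 60, 200
post = {N: 1 / comb(N, k) for N in range(m, Omega + 1)}
z = sum(post.values())
post = {N: v / z for N, v in post.items()}
```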

Maybe we could use maximum entropy to derive a prior distribution and then update it with Bayesian inference. But that seems like a strange distinction: when we evaluate a situation, which information counts toward deriving the maximum-entropy prior, and which counts toward the Bayesian inference? There doesn't seem to be a principled way to decide.


1 Answer


Before I answer, two caveats. First, I am by no means an expert on the subject: I have an M.S. in Applied Economics and Finance and have not taken nearly as many courses on probability theory and entropy as someone with a Ph.D. in the discipline; I am an enthusiast and an up-and-coming practitioner. Second, I will only address the first question, not the German tank problem.

Now to begin.

What is important to understand is that the Maximum Entropy Principle (MEP) and Bayes's theorem answer two different questions.

The MEP tells us how to assign probabilities to statements or uncertain quantities when we inject a set of information, which we will call $M_k$, into the MEP algorithm. Bayes's theorem tells us how to update probabilities once we condition on data.

Here is the MEP algorithm for reference.

$$ P(X=x_i\mid M_k) = Q_i = \frac{\exp\left(\sum_{k=1}^{m}\lambda_k f_k(x_i)\right)}{\sum_{j=1}^{n}\exp\left(\sum_{k=1}^{m}\lambda_k f_k(x_j)\right)} $$

The above is David Blower's version from Volume II of his book series on information processing. The biggest difference between his notation and Jaynes's is that Blower's Lagrange multipliers are the negatives of Jaynes's.
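As a concrete numeric sketch of this algorithm (my example, not Blower's): Jaynes's Brandeis dice problem, a six-sided die with a single constraint function $f(x) = x$ whose average is prescribed to be $4.5$. Because there is only one multiplier, a simple bisection recovers the maximum-entropy assignment.

```python
import math

# Sketch of the MEP formula above for one constraint:
#   Q_i = exp(lam * f(x_i)) / sum_j exp(lam * f(x_j)),  f(x) = x.
# Hypothetical setup: die faces 1..6, prescribed mean 4.5 (Brandeis dice).
xs = [1, 2, 3, 4, 5, 6]
target_mean = 4.5

def maxent_probs(lam):
    weights = [math.exp(lam * x) for x in xs]
    z = sum(weights)
    return [w / z for w in weights]

def mean_under(lam):
    return sum(x * q for x, q in zip(xs, maxent_probs(lam)))

# Bisection on the Lagrange multiplier: mean_under is monotone in lam.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = (lo + hi) / 2
Q = maxent_probs(lam)
```

With a target mean above the uniform value of $3.5$, the multiplier comes out positive and the probabilities tilt toward the higher faces.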

I recommend all three of Blower's books. They provide an excellent explanation of information processing, the MEP, and the link to information geometry.

Now, there are many legitimate ways to assign probabilities. In my experience, practitioners select a well-known probability distribution (normal, gamma, Cauchy, etc.), check whether the properties of the data align with the assumptions of that distribution, and work from there.

What is desirable about the MEP is that you can inject information into the algorithm (whether it is something you know, such as a marginal probability, that outcomes can only be integers, or that there are infinitely many possible outcomes, or something you would like to test), and when the algorithm processes that information it simultaneously maximizes the Shannon entropy (the missing information). You are guaranteed that the information inserted into the MEP is the only information used to assign probabilities. A more generic probability distribution, while legitimate, can smuggle in information the practitioner never intended.

Now, under a "fair" model $M_1$ that has no information to inject into the MEP, we can operationalize this lack of information by setting all Lagrange multipliers to $0$. Running the numbers through the MEP algorithm above, the probability assigned to each uncertain quantity is $P(X=x_i\mid M_1) = Q_i = 1/n$. These are the probabilities the MEP assigns conditioned on the "fair" model.

That last statement is vital: the MEP assigns probabilities once we condition on a model that injects information. The MEP is NOT a method of assigning prior probabilities. A prior is an unconditional probability of a model, $P(M_k)$, and represents our uncertainty about the models before any data or information is used. That is not where the MEP is to be used.
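A minimal sketch of the "fair" model (with a hypothetical $n = 4$): every multiplier is $0$, so every exponent in the MEP formula is $\exp(0) = 1$ and each outcome gets $1/n$.

```python
import math

# Fair model M_1: no information injected, all Lagrange multipliers 0.
# Every weight in the MEP formula is exp(0) = 1, so Q_i = 1/n.
n = 4  # hypothetical number of outcomes
weights = [math.exp(0.0) for _ in range(n)]
Q = [w / sum(weights) for w in weights]
```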

Understand, the MEP is neither "right" nor "wrong": it simply assigns probabilities that represent the state of knowledge given by the information we inject into it. If you are surprised by an outcome, consider what information you excluded and did not inject into the MEP.

So how do we assign priors and how do we update that uncertainty when we condition on data?

Well, Bayes's theorem tells us exactly how to update probabilities when we condition on data, $D = \{X_1 = x_r, X_2 = x_y, \ldots, X_N = x_l\}$:

$$ P(M_k|D) = \frac{P(D|M_k) P(M_k)}{\sum_{k=1}^m P(D|M_k) P(M_k)} $$

We can even assign probabilities to future data we are uncertain of:

$$ P(X_{N+1} = x_i |D) = \sum_{k=1}^m P(X_{N+1} = x_i |M_k) P(M_k|D) $$
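A small numeric sketch of these two formulas, with made-up models (three candidate biases for a coin) and made-up data:

```python
# Hypothetical setup: three candidate models M_k for a biased coin,
# each a different probability of heads, with a uniform prior over them.
models = [0.25, 0.5, 0.75]        # P(X = heads | M_k)
prior = [1 / 3, 1 / 3, 1 / 3]     # P(M_k), uniform before any data
data = [1, 1, 0, 1]               # observed flips, 1 = heads

def likelihood(p, flips):
    out = 1.0
    for x in flips:
        out *= p if x == 1 else 1 - p
    return out

# Bayes's theorem: P(M_k | D) proportional to P(D | M_k) P(M_k)
joint = [likelihood(p, data) * pr for p, pr in zip(models, prior)]
posterior = [j / sum(joint) for j in joint]

# Predictive probability of heads on flip N+1:
# P(X_{N+1} = heads | D) = sum_k P(heads | M_k) P(M_k | D)
p_next_heads = sum(p * post for p, post in zip(models, posterior))
```

The posterior concentrates on the bias closest to the observed frequency, and the predictive probability averages the models' predictions by their posterior weights.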

But how do we assign a prior, you may ask? This particular answer may stir some controversy.

To assign a prior probability to a model (a set of information), you must be willing to consider all possible models that inject information. Since you have already injected all the information you have into the MEP, and unless you have information about that information (meta-information? If you have more information, put it in the MEP), you must treat all possible models with equal prior probability. That is the point of a prior: it is the probability of a model before any data or information is included, a state of complete ignorance.

If you hold back additional information and fold it into the prior instead, the MEP cannot assign probabilities that represent everything you know, and you will get probabilities that are inconsistent with your knowledge of the problem.

Laplace's rule of succession tells us how to assign probabilities to causes (in our case, "causes" = "models") when we know nothing about the causes: you have "insufficient reason" to favor one model over another. We then let the data decide which models are best supported and redistribute the model probabilities accordingly.
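A sketch of the rule itself (my numbers, not the answer's): with a uniform prior over an unknown success probability, after $k$ successes in $n$ trials the predictive probability of the next success is $(k+1)/(n+2)$, which a crude numerical integration over $p$ reproduces.

```python
# Laplace's rule of succession, reproduced by brute-force integration:
# predictive = integral of p * p^k (1-p)^(n-k) dp
#            / integral of     p^k (1-p)^(n-k) dp  =  (k + 1) / (n + 2)
def predictive(k, n, grid=20000):
    num = den = 0.0
    for i in range(1, grid):
        p = i / grid
        w = p ** k * (1 - p) ** (n - k)
        num += p * w
        den += w
    return num / den
```

For example, with no trials at all (`k = 0, n = 0`) the rule gives $1/2$: total ignorance about the cause.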

So at a high level:

MEP is for $P(X = x_i |M_k)$, Bayes's Theorem is for $P(M_k|D)$.

My sources are the first two volumes of David Blower's three-part series on information processing.


I am adding my response to your comment about information and data here because it doesn't fit in a comment below. Apologies in advance if this is not the done thing on these forums; I am new to them.

Yes, I will elaborate on the difference. For starters, we need to define something called the "state space": the space of what could happen, and the space over which we would like to assign probabilities.

As a toy example, let's consider Jaynes's favorite: Fosters- and Corona-drinking, left- and right-handed kangaroos. Imagine a 2×2 grid where the rows, descending, represent Fosters and Corona, while the columns, left to right, represent left- and then right-handed kangaroos. The dimension of the state space is 4, and we would like to assign a probability to each cell, labeled $Q_1$ to $Q_4$. To make the notation very clear: we want, say, $Q_1 = P(X = x_1 \mid M_k)$, where $x_1$ is a left-handed, Fosters-drinking kangaroo, i.e. the probability of that statement conditioned on some model. Data, in this sort of scenario, would be something like past frequency counts for each of those cells.

Now, we can inject information in two ways: either through the Lagrange multipliers (which I find incredibly unintuitive to do) or through the dual parameter, the constraint-function average. We define the constraint-function average as the dot product of the constraint function and the probabilities: $\langle f_i \rangle = \vec f_i \cdot \vec Q$. To inject information, you can define the constraint function (really, this function exists so you can operationalize information injection) and set the dot product equal to, say, some made-up number you declare to be the marginal probability of being left-handed. The constraint function would look like $f_1(x) = (1, 0, 1, 0)$ and the constraint-function average like $\langle f_1 \rangle = Q_1 + Q_3$.
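A sketch of this exact example, assuming a made-up marginal of $0.4$ for left-handedness: with the single constraint $Q_1 + Q_3 = 0.4$, the MEP spreads probability evenly within each group, giving $Q_1 = Q_3 = 0.2$ and $Q_2 = Q_4 = 0.3$.

```python
import math

# Kangaroo state space: 4 cells, constraint function f1 = (1, 0, 1, 0)
# picking out the left-handed cells, with hypothetical <f1> = 0.4.
f1 = [1, 0, 1, 0]
target = 0.4

def probs(lam):
    w = [math.exp(lam * f) for f in f1]
    z = sum(w)
    return [wi / z for wi in w]

def avg(lam):
    # constraint-function average <f1> = f1 . Q
    return sum(f * q for f, q in zip(f1, probs(lam)))

# Bisection on the single Lagrange multiplier: avg is monotone in lam.
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    if avg(mid) < target:
        lo = mid
    else:
        hi = mid
Q = probs((lo + hi) / 2)
```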

You can have more complex or simpler constraint functions depending on what sort of information you want to inject. And you can inject any information you like (as long as it is not contradictory: the MEP cannot process a contradiction, nor would we want it to), and the MEP will give you the corresponding probabilities.

Say you are conducting inference over a continuous interval and would like to say something about the expected value of $\ln(x!)$: you can, and then you can see how well the data supports this model using Bayes's theorem.

So the MEP does not require data in order to set a constraint function average. You can set the constraint function average arbitrarily and see how well the data supports the model.

But the natural question we may ask is, "Well, why not just set the constraint-function average to the data average?" Once again, you can (but are not required to; the MEP is still consistent either way). As a matter of fact, the probabilities the MEP produces in this case are the probabilities the maximum-likelihood estimation method produces as well. They are connected! Not only that, but that particular model will be the model with the highest posterior probability.
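A numeric sketch of that connection (my example: a die with the single constraint $f(x) = x$ and hypothetical rolls). The multiplier the MEP finds by matching the data average is the same multiplier maximum likelihood finds in the corresponding exponential family, because the likelihood gradient vanishes exactly when the model mean equals the data mean.

```python
import math

# Hypothetical die-roll sample and the single constraint f(x) = x.
xs = [1, 2, 3, 4, 5, 6]
data = [6, 5, 6, 4, 3, 6, 5, 2]
data_mean = sum(data) / len(data)

def probs(lam):
    w = [math.exp(lam * x) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

def model_mean(lam):
    return sum(x * q for x, q in zip(xs, probs(lam)))

# 1) MEP: choose lam so the constraint average equals the data average.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if model_mean(mid) < data_mean:
        lo = mid
    else:
        hi = mid
lam_mep = (lo + hi) / 2

# 2) MLE in the same exponential family: per-observation log-likelihood
#    gradient is (data_mean - model_mean(lam)), so simple gradient
#    ascent converges to the same multiplier.
lam_mle, step = 0.0, 0.01
for _ in range(20000):
    lam_mle += step * (data_mean - model_mean(lam_mle))
```

The two multipliers coincide to numerical precision: the MEP fit under the data-average constraint *is* the maximum-likelihood fit.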

So why not always set the constraint-function averages to the data averages? Just the other day I was comparing such a model, call it $M_{ML}$, which included a constraint for a correlation I was examining, against a model without that constraint. The data did support $M_{ML}$ best, but only barely better than the no-correlation model. That tells me the correlation the data shows evidence for could easily just be noise in this particular sample.

So data and information are two different things, but we can use data to define the injected information if we choose to; doing so does not give us any special guarantee. Jaynes and Blower go much deeper, framing the difference as one between epistemology and ontology, but read their works, because I can't possibly do it justice.

  • Thanks for the very thorough answer, but I'm still wondering what separates a "set of information" $M_k$ from "data" $D$. It looks like the information used for the MEP is a series of functions of the variable $X$, so why wouldn't that be considered data?
    – Sam Jaques
    Commented Jul 13, 2020 at 8:28
  • I added a response to your question in the original post above.
    – Alfonso
    Commented Jul 14, 2020 at 5:24
  • 1) Do you remember in which chapter/paper Jaynes and/or Blower talk about this distinction? 2) I'm still confused about why a constraint function average doesn't count as "data", given that presumably the only reason you know the average is from some other data. 3) MLE doesn't precisely follow the rules of Bayesian inference, right? And contradicts them in certain cases? So I'm still left wondering why Jaynes - a stickler for deriving everything from Bayesian estimation - would suddenly resort to MEP.
    – Sam Jaques
    Commented Jul 14, 2020 at 10:54
