21
$\begingroup$

I've been trying to understand the motivation for the use of the Jeffreys prior in Bayesian statistics. Most texts I've read online make some comment to the effect that the Jeffreys prior is "invariant with respect to transformations of the parameters", and then go on to state its definition in terms of the Fisher information matrix without further motivation. However, none of them then go on to show that such a prior is indeed invariant, or even to properly define what was meant by "invariant" in the first place.

I like to understand things by approaching the simplest example first, so I'm interested in the case of a binomial trial, i.e. the case where the support is $\{1,2\}$. In this case the Jeffreys prior is given by $$ \rho(\theta) = \frac{1}{\pi\sqrt{\theta(1-\theta)}}, \qquad\qquad(i) $$ where $\theta$ is the parameterisation given by $p_1 = \theta$, $p_2 = 1-\theta$.

What I would like is to understand the sense in which this is invariant with respect to a coordinate transformation $\theta \to \varphi(\theta)$. To me the term "invariant" would seem to imply something along the lines of $$ \int_{\theta_1}^{\theta_2} \rho(\theta) d \theta = \int_{\varphi(\theta_1)}^{\varphi(\theta_2)} \rho(\varphi(\theta)) d \varphi \qquad\qquad(ii) $$ for any (smooth, differentiable) function $\varphi$ -- but it's easy enough to see that this is not satisfied by the distribution $(i)$ above (and indeed, I doubt there can be any density function that does satisfy this kind of invariance for any transformation). So there must be some other sense intended by "invariant" in this context. I would like to understand this sense in the form of a functional equation similar to $(ii)$, so that I can see how it's satisfied by $(i)$.
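For concreteness, a quick numerical sketch (in Python, with the arbitrary choice $\varphi(\theta)=\theta^2$; this is just an illustration, not part of the argument) confirms that the two sides of $(ii)$ differ for the density $(i)$:

```python
# Quick check (an illustration only): does (ii) hold for the density (i)
# under phi(theta) = theta^2?  It does not.
import numpy as np
from scipy.integrate import quad

rho = lambda t: 1.0 / (np.pi * np.sqrt(t * (1.0 - t)))   # the Jeffreys density (i)
phi = lambda t: t**2                                      # an arbitrary smooth monotonic map

theta1, theta2 = 0.1, 0.5
lhs, _ = quad(rho, theta1, theta2)             # integral of rho(theta) over [theta1, theta2]
rhs, _ = quad(rho, phi(theta1), phi(theta2))   # integral of rho(phi) over [phi(theta1), phi(theta2)]
print(lhs, rhs)   # approximately 0.295 vs 0.270 -- not equal, so (ii) fails
```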

Progress

As did points out, the Wikipedia article gives a hint about this, by starting with $$ p(\theta)\propto\sqrt{I(\theta)} $$ and deriving $$ p(\varphi)\propto\sqrt{I(\varphi)} $$ for any smooth function $\varphi(\theta)$. (Note that these equations omit taking the determinant of $I$ because they refer to a single-variable case.) Clearly something is invariant here, and it seems like it shouldn't be too hard to express this invariance as a functional equation. However, the more I try to do this the more confused I get. Partly this is because there's just a lot left out of the Wikipedia sketch (e.g. are the constants of proportionality the same in the two equations above, or different? Where is the proof of uniqueness?) but mostly it's because it's really unclear exactly what's being sought, which is why I wanted to express it as a functional equation in the first place.

To reiterate my question, I understand the above equations from Wikipedia, and I can see that they demonstrate an invariance property of some kind. However, I can't see how to express this invariance property in the form of a functional equation similar to $(ii)$, which is what I'm looking for as an answer to this question. I want to first understand the desired invariance property, and then see that the Jeffreys prior (hopefully uniquely) satisfies it, but the above equations mix up those two steps in a way that I can't see how to separate.

$\endgroup$
8
  • $\begingroup$ Which part of the question is not dealt with here? $\endgroup$
    – Did
    Commented Oct 12, 2012 at 22:03
  • $\begingroup$ This should be posted as a comment rather than an answer, since it is not an answer. However, the link is helpful. To answer your question, the missing bit is the bit where I said "I'd like to understand this sense [of invariance] in the form of a functional equation similar to (ii), so that I can see how it's satisfied by (i)." Perhaps I can answer this myself now, but if you'd like to post a proper answer detailing it then I'd be happy to award you the bounty. $\endgroup$
    – N. Virgo
    Commented Oct 12, 2012 at 22:27
  • $\begingroup$ Sorry but I absolutely completely do not care the least about bounties and points. My answer is written as it is because yes, I believe you can answer this (your)self now. $\endgroup$
    – Did
    Commented Oct 12, 2012 at 22:33
  • 3
    $\begingroup$ Perhaps I can, but it seems not at all trivial to me. $\endgroup$
    – N. Virgo
    Commented Oct 12, 2012 at 23:01
  • 3
    $\begingroup$ The comments on this question make no sense if you don't already know that @did's comment was originally an answer, which was deleted by a moderator and made into a comment, and that the following two comments were originally comments on did's answer. $\endgroup$ Commented Oct 18, 2012 at 9:14

6 Answers

19
$\begingroup$

Having come back to this question and thought about it a bit more, I believe I have finally worked out how to formally express the sense of "invariance" that applies to Jeffreys' priors, as well as the logical issue that prevented me from seeing it before.

The following lecture notes were helpful in coming to this conclusion, as they contain an explanation that is clearer than anything I could find at the time of writing the question: https://www2.stat.duke.edu/courses/Fall11/sta114/jeffreys.pdf

My key stumbling point seems to be that the phrase "the Jeffreys prior is invariant" is incorrect - the invariance in question is not a property of any given prior, but rather it's a property of a method of constructing priors from likelihood functions.

That is, we want something that will take a likelihood function and give us a prior for the parameters, and will do it in such a way that if we take that prior and then transform the parameters, we will get the same result as if we first transform the parameters and then use the same method to generate the prior. I was looking for an invariance property that would apply to a particular prior generated using Jeffreys' method, whereas the desired invariance principle in fact applies to Jeffreys' method itself.

To give an attempt at fleshing this out, let's say that a "prior construction method" is a functional $M$, which maps the function $f(x \mid \theta)$ (the conditional probability density function of some data $x$ given some parameters $\theta$, considered a function of both $x$ and $\theta$) to another function $\rho(\theta)$, which is to be interpreted as a prior probability density function for $\theta$. That is, $\rho(\theta) = M\{ f(x\mid \theta) \}$.

What we seek is a construction method $M$ with the following property: $$ M\{ f(x\mid h(\theta)) \} = \big( M\{ f(x \mid \theta) \}\circ h \big)\cdot |h'|, $$ for any arbitrary smooth monotonic transformation $h$, where the factor $|h'|$ is the usual Jacobian that appears whenever a density is rewritten in terms of a new parameter. That is, we can either apply $h$ to transform the likelihood function and then use $M$ to obtain a prior, or we can first use $M$ on the original likelihood function and then transform the resulting prior (as a density, which is where the Jacobian comes in), and the end result will be the same.

What Jeffreys provides is a prior construction method $M$ which has this property. My problem arose from looking at a particular example of a prior constructed by Jeffreys' method (i.e. the function $M\{ f(x\mid \theta )\}$ for some particular likelihood function $f(x \mid \theta)$) and trying to see that it has some kind of invariance property. In fact the desired invariance is a property of $M$ itself, rather than of the priors it generates.
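To make this concrete, here is a small numerical sketch (my own illustration, not anything taken from Jeffreys) for the Bernoulli likelihood and the arbitrary transformation $h(\theta)=\theta^2$: applying the construction to the reparametrised likelihood gives the same density as applying it to the original likelihood and then transforming the result.

```python
# Numerical sketch (my own illustration) of the property above, for the Bernoulli
# likelihood f(x|theta) = theta^x (1-theta)^(1-x) and the arbitrary map h(theta) = theta^2.
import numpy as np

def jeffreys(loglik, theta, eps=1e-4):
    """M{f}: the (unnormalised) Jeffreys density sqrt(I(theta)), where the Fisher
    information I(theta) is minus the expected second derivative of log f, the
    expectation being over x in {0, 1} under f(.|theta). Derivatives are taken
    by central finite differences."""
    I = 0.0
    for x in (0, 1):
        d2 = (loglik(x, theta + eps) - 2 * loglik(x, theta) + loglik(x, theta - eps)) / eps**2
        I += -d2 * np.exp(loglik(x, theta))
    return np.sqrt(I)

loglik = lambda x, t: x * np.log(t) + (1 - x) * np.log(1 - t)   # Bernoulli log-likelihood
h, dh  = (lambda t: t**2), (lambda t: 2 * t)                    # the transformation and its derivative

theta = 0.3
lhs = jeffreys(lambda x, t: loglik(x, h(t)), theta)   # M applied to the transformed likelihood
rhs = jeffreys(loglik, h(theta)) * abs(dh(theta))     # M applied first, then the prior transformed
print(lhs, rhs)                                       # agree up to finite-difference error
```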

I do not currently know whether the particular prior construction method supplied by Jeffreys is unique in having this property. This seems to be rather an important question: if there is some other functional $M'$ that is also invariant and which gives a different prior for the parameter of a binomial distribution, then there doesn't seem to be anything that picks out the Jeffreys distribution for a binomial trial as particularly special. On the other hand, if this is not the case, then the Jeffreys prior does have a special property, in that it's the only prior that can be produced by a prior-generating method that is invariant under parameter transformations. It would therefore seem rather valuable to find a proof that Jeffreys' prior construction method is unique in having this invariance property, or an explicit counterexample showing that it is not.

$\endgroup$
6
  • 1
    $\begingroup$ (+1) Your answer is perhaps one of the clearest I've found so far, together with the lecture notes you mention. $\endgroup$
    – SiXUlm
    Commented Apr 4, 2017 at 16:27
  • $\begingroup$ Your answer is really clear, but I think it is not quite there yet. It is trivial to define an M that satisfies the condition but is not correct. For example, define M{Binom(x|theta)} = 1 (i.e. the uniform prior) and everything else as the reparameterization of that. This kind of example shows that there is some missing constraint from your mathematical formulation of invariance. $\endgroup$
    – thc
    Commented Oct 11, 2020 at 7:23
  • $\begingroup$ P.S.: your link is broken, I think you mean this one: www2.stat.duke.edu/courses/Fall11/sta114/jeffreys.pdf $\endgroup$
    – thc
    Commented Oct 11, 2020 at 7:24
  • $\begingroup$ @thc I've fixed the link. I'm not sure I understand what you mean in your other comment, though - could you spell your counterexample out in more detail? Note that if I start with a uniform prior and then transform the parameters, I will in general end up with something that's not a uniform prior over the new parameters. $\endgroup$
    – N. Virgo
    Commented Oct 11, 2020 at 11:24
  • 1
    $\begingroup$ Hi! Amazing answer. One thing I would like to note is that if you look at the proof for this invariance, it is only important that we have the variance of a (differentiable) function of the density function of the sampling distribution. It's not important that this function is the logarithm of this pdf, so indeed there are infinitely many of these kinds of methods. $\endgroup$ Commented Aug 18, 2022 at 15:51
2
+50
$\begingroup$

Maybe the problem is that you are forgetting the Jacobian of the transformation in (ii).

I suggest that you check the formulas here carefully (hint: $\left| \frac{d \Phi^{- 1}}{d y} \right|$ is the Jacobian, where $\Phi^{- 1}$ is the inverse transformation). Then, start with some simple examples of monotonic transformations in order to see the invariance. I suggest starting with $\varphi(\theta)=2\theta$ and $\varphi(\theta)=1-\theta$.

Also, to answer your question, the constants of proportionality (the normalising constants) do not matter here. In (i), it is $\pi$. Do the calculations with $\pi$ in there to see that point. Let me know if you are stuck somewhere.

Edit: The dependence on the likelihood is essential for the invariance to hold, because the information is a property of the likelihood and because the object of interest is ultimately the posterior. However, whichever likelihood you use, the invariance will still hold. This happens through the relationship $ \sqrt{I (\theta)} = \sqrt{I (\varphi (\theta))}\, | \varphi' (\theta) | $, which links the information under the original parametrisation to the information under the transformed parametrisation. Here $| \varphi' (\theta) |$ is the Jacobian of the transformation $\theta \mapsto \varphi(\theta)$; its reciprocal is the Jacobian of the inverse transformation. (I will let you verify the relationship by deriving the information from the likelihood: just use the chain rule after applying the definition of the information as the expected value of the square of the score.)

Now, for the prior: \begin{eqnarray*} p (\varphi (\theta) ) & = & \frac{1}{| \varphi' (\theta) |} p (\theta )\\ & = & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)} \\ & = & \sqrt{I (\varphi (\theta))} \\ & = & p (\varphi (\theta)) \end{eqnarray*} The first line only applies the Jacobian (change-of-variables) formula for transforming a density. The second line applies the definition of Jeffreys prior. The third line applies the relationship between the information matrices. The final line applies the definition of Jeffreys prior on $\varphi{(\theta)}$, so the transformed prior is again of Jeffreys form. You can see that the use of Jeffreys prior was essential for $\frac{1}{| \varphi' (\theta) |}$ to cancel out.
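If it helps, the relationship $\sqrt{I(\theta)} = \sqrt{I(\varphi(\theta))}\,|\varphi'(\theta)|$ is also easy to check numerically. The sketch below (my own illustration) uses the binomial-trial likelihood from the question and the log-odds map $\varphi(\theta)=\log\frac{\theta}{1-\theta}$, with the closed-form informations assumed rather than derived.

```python
# Sketch (my own illustration): check sqrt(I(theta)) = sqrt(I(phi(theta))) * |phi'(theta)|
# for the binomial-trial likelihood, with phi(theta) = log(theta/(1-theta)).
# Closed forms assumed: I(theta) = 1/(theta(1-theta)), and in the log-odds
# parametrisation I(phi) = theta(1-theta) with theta = 1/(1+exp(-phi)).
import numpy as np

theta   = np.linspace(0.05, 0.95, 19)
I_theta = 1.0 / (theta * (1.0 - theta))          # information in the theta parametrisation

phi     = np.log(theta / (1.0 - theta))          # the transformed parameter
t_back  = 1.0 / (1.0 + np.exp(-phi))             # inverting the map recovers theta
I_phi   = t_back * (1.0 - t_back)                # information in the phi parametrisation
dphi    = 1.0 / (theta * (1.0 - theta))          # |phi'(theta)|

print(np.allclose(np.sqrt(I_theta), np.sqrt(I_phi) * dphi))   # True
```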

Look again at what happens to the posterior ($y$ is obviously the observed sample here) \begin{eqnarray*} p (\varphi (\theta) |y) & = & \frac{1}{| \varphi' (\theta) |} p (\theta |y)\\ & \propto & \frac{1}{| \varphi' (\theta) |} p (\theta) p (y| \theta)\\ & \propto & \frac{1}{| \varphi' (\theta) |} \sqrt{I (\theta)} p (y| \theta)\\ & \propto & \sqrt{I (\varphi (\theta))} p (y| \theta)\\ & \propto & p (\varphi (\theta)) p (y| \theta) \end{eqnarray*} The only difference is that the second line applies Bayes rule.

As I explained earlier in the comments, it is essential to understand how Jacobians work (or differential forms).

$\endgroup$
7
  • $\begingroup$ Thanks for the hints. Those equations (quoted from Wikipedia) omit the determinant because they refer to the case of a binomial trial, where there is only one variable and the determinant of $I$ is just $I$. The problem is not that I don't understand those equations. What I want is to see a definition of the sought invariance property that doesn't refer to the specific form of the Jeffreys prior, and then see that $\sqrt{|I(\theta)|}$ (hopefully uniquely) satisfies it - and I can't see how to do that. $\endgroup$
    – N. Virgo
    Commented Oct 16, 2012 at 15:55
  • $\begingroup$ I still think that your problem is with Jacobians, and the fact that the formula (ii) is correct in a special case does not make it correct in general. $\endgroup$
    – Per
    Commented Oct 17, 2012 at 3:41
  • $\begingroup$ Formula (ii) is not correct in either the special case or in general. What I'm looking for is something like formula (ii), but correct. $\endgroup$
    – N. Virgo
    Commented Oct 17, 2012 at 11:26
  • $\begingroup$ zyx's answer is excellent but it uses differential forms. I will add some clarifications to my answer regarding your question about the invariance depending on the likelihood. $\endgroup$
    – Per
    Commented Oct 18, 2012 at 7:56
  • 1
    $\begingroup$ By the way, I don't want to seem obstinate. This is genuinely very helpful, and I'll go through it very carefully later, as well as brushing up on my knowledge of Jacobians in case there's something I've misunderstood. But still, it seems like having a better understanding of how to go from $p(\theta)$ to $p(\varphi(\theta))$ isn't automatically giving me a grasp of what the "XXX" is. I'm fairly certain it's a logical point that I'm missing, rather than something to do with the formal details of the mathematics. $\endgroup$
    – N. Virgo
    Commented Oct 18, 2012 at 10:53
2
$\begingroup$

What is invariant is the volume density $|p_{L_{\theta}}(\theta)\, dV_{\theta}|$, where $dV_\theta$ is the volume form in coordinates $\theta_1, \theta_2, \dots, \theta_n$ and $L_\theta$ is the likelihood parametrized by $\theta$. Locally the Fisher matrix $F$ transforms to $(J^{-1})^TFJ^{-1}$ under a change of coordinates with Jacobian $J$, and $\sqrt{\det}$ of this cancels the multiplication of volume forms by $\det J$.
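To see this rule in action, here is a small numerical sketch (my own addition) for the two-parameter family $N(\mu,\sigma^2)$, whose Fisher matrix in the coordinates $(\mu,\sigma)$ is $\operatorname{diag}(1/\sigma^2,\,2/\sigma^2)$ (a standard result, assumed here). It checks both the transformation rule under $(\mu,\sigma)\mapsto(\mu,\log\sigma)$ and the cancellation of $\det J$ in the volume density.

```python
# Sketch (my own addition): the two-parameter family N(mu, sigma^2). In theta = (mu, sigma)
# the Fisher matrix is F = diag(1/sigma^2, 2/sigma^2) (standard result, assumed here).
# Reparametrise to eta = (mu, log sigma), with J the Jacobian of theta -> eta.
import numpy as np

mu, sigma = 1.3, 0.7

F_theta = np.diag([1.0 / sigma**2, 2.0 / sigma**2])   # Fisher matrix in (mu, sigma)
J       = np.diag([1.0, 1.0 / sigma])                 # Jacobian of (mu, sigma) -> (mu, log sigma)
Jinv    = np.linalg.inv(J)

# The transformation rule: F becomes (J^{-1})^T F J^{-1} in the new coordinates.
F_eta = Jinv.T @ F_theta @ Jinv
print(np.allclose(F_eta, np.diag([1.0 / sigma**2, 2.0])))   # True (direct computation in (mu, log sigma))

# The volume density sqrt(det F) dV is invariant: sqrt(det F) picks up a factor 1/|det J|
# while the volume form picks up |det J|, and the two cancel.
lhs = np.sqrt(np.linalg.det(F_eta)) * abs(np.linalg.det(J))
rhs = np.sqrt(np.linalg.det(F_theta))
print(np.isclose(lhs, rhs))                                  # True
```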

The presentation in Wikipedia is confusing, because

  • the equations are between densities $p(x) dx$, but written as though for the density functions $p()$ that define the priors,

  • the first equality is a claim still to be proven. The following ones are the derivation of that equation.

To read the Wikipedia argument as a chain of equalities of unsigned volume forms, multiply every line by $|d\varphi|$, and use absolute value of all determinants, not the usual signed determinant. Then $$p_{L_{\varphi}}(\varphi)\, d\varphi \ (\text{claimed}) = p_{L_{\theta}}(\theta)\, d\theta = (\text{...Fisher } I \text{ quantities...})\, d\varphi = \sqrt{I(\varphi)}\, d\varphi .$$


To answer some of the other questions,

The invariance of $|p dV|$ is the definition of "invariance of prior". Because changes of coordinate alter $dV$, an invariant prior has to depend on more than $p(\theta)$. It is natural to ask for something local on the parameter space, so the invariant prior will be built from a finite number of derivatives of the likelihood evaluated at $\theta$. This means some local finite-dimensional linear space of differential quantities at each point, with linear maps between the before- and after-coordinate-change spaces.

Determinants appear because there is a factor of $\det J$ to be killed from the change in $dV$, and because we will want the changes of the local quantities to multiply and cancel each other, as is the case in Jeffreys prior; this practically requires a reduction to one dimension, where the coordinate change can act on each factor by multiplication by a single number. The Jeffreys prior is a product of two locally defined quantities, one of which scales by $\sqrt{A^{-2}}$ and the other by $A$, where $A(\theta)$ is a local factor that depends on $\theta$ and on the coordinate transformation. Computationally it is expressed by Jacobians, but only the power-of-$A$ dependences matter, and having those cancel out on multiplication.

This shows that the invariant prior is very non-unique, as there are many other ways to achieve the cancellation. The preference for Jeffreys' form of invariant prior is based on other considerations.

$\endgroup$
5
  • $\begingroup$ In the univariate case, does the expression in your first sentence reduce to $p(\theta) d\theta$? If so I don't think that can be the thing that's invariant. Since, as you say, $p(\varphi)d\varphi \equiv p(\theta)d\theta$ is an identity, it holds for every pdf $p(\theta)$, not just the Jeffreys prior. $\endgroup$
    – N. Virgo
    Commented Oct 17, 2012 at 20:51
  • $\begingroup$ Also, it would help me a lot if you could expand on the distinction you make between "densities $p(x) dx$" and "the density functions $p()$ that define the priors" - I can sort-of see what you mean, but it's not quite clear to me yet. $\endgroup$
    – N. Virgo
    Commented Oct 17, 2012 at 20:53
  • $\begingroup$ Finally, whatever the thing that's invariant is, it must surely depend in some way on the likelihood function! $\endgroup$
    – N. Virgo
    Commented Oct 17, 2012 at 20:55
  • $\begingroup$ re the second comment, the distinction is between functions and differential forms. What you need for Bayesian statistics (resp., likelihood-based methods) is the ability to integrate against a prior (likelihood), so really $p(x) dx$ is the object of interest. $\endgroup$
    – zyx
    Commented Oct 17, 2012 at 21:18
  • $\begingroup$ I made some edits, I think it explains clearly now why the Wikipedia link is not a real answer. $\endgroup$
    – zyx
    Commented Oct 18, 2012 at 0:50
2
$\begingroup$

The property of "Invariance" does not necessarily mean that the prior distribution is Invariant under "any" transformation. To make sure that we are on the same page, let us take the example of the "Principle of Indifference" used in the problem of Birth rate analysis given by Laplace. The link given by the OP contains the problem statement in good detail. Here the argument used by Laplace was that he saw no difference in considering any value p$_1$ over p$_2$ for the probability of the birth of a girl.

Suppose there was an alien race that wanted to do the same analysis as done by Laplace. But let us say they were using some log scaled parameters instead of ours. (Say they were reasoning in terms of log-odds ratios). It is perfectly alright for them to do so because each and every problem of ours can be translated to their terms and vice-versa as long as the transform is a bijection.

The problem here is about the apparent "Principle of Indifference" considered by Laplace. Though his prior was perfectly alright, the reasoning used to arrive at it was at fault. Say if the aliens used the same principle, they would definitely arrive at a different answer than ours. But whatever we estimate from our priors and the data must necessarily lead to the same result. This "Invariance" is what is expected of our solutions. But using the "Principle of Indifference" violates this.

In the above case, the prior is telling us that "I don't want to give one value $p_1$ more preference than another value $p_2$" and it continues to say the same even on transforming the prior. The prior does not lose the information. In other words, on transforming the prior to a log-odds scale, the prior still says "See, I still consider no value of $p_1$ to be preferable over another $p_2$" and that is why the log-odds transform is not going to be flat. It says that there is some prior information which is why this transformed pdf is not flat.

Now how do we define a completely "uninformative" prior? That seems to be an open-ended question full of debates. But nonetheless, we can make sure that our priors are at least uninformative in some sense. That is where this "Invariance" comes into the picture.

Say that we have 2 experimenters who aim to find out the number of events that occurred in a specific time (Poisson dist.). But unfortunately, if their clocks were running at different speeds (say, $t' = qt$), then their results will definitely be conflicting if they did not consider this difference in time scales. Whatever priors they use must be completely uninformative about the scaling of time between the events. This is ensured by the use of the Jeffreys prior, which is completely scale- and location-invariant. So they will use the $\lambda^{-1}d\lambda$ prior, the Jeffreys prior (because it is the only general solution in the one-parameter case for scale-invariance). The Jeffreys prior has only this type of invariance in it, not invariance under all transforms (maybe some others too, but not all for sure). To use any other prior than this will have the consequence that a change in the time scale will lead to a change in the form of the prior, which would imply a different state of prior knowledge; but if we are completely ignorant of the time scale, then all time scales should appear equivalent. The use of these "Uninformative priors" is completely problem-dependent and not a general method of forming priors. When this property of "Uninformativeness" is needed, we seek priors that have invariance of a certain type associated with that problem.

(More info on this scale and location invariance can be found in Probability Theory: The Logic of Science by E. T. Jaynes. The time-scale invariance problem is also mentioned there.)
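As a tiny illustration of the scale invariance just described (a sketch of my own, not taken from Jaynes' book): the $\lambda^{-1}\,d\lambda$ prior assigns the same mass to a rate interval and to its image under a change of time scale.

```python
# Sketch (my own addition): the lambda^{-1} d(lambda) prior assigns the same (improper)
# mass to a rate interval [a, b] and to its rescaled image [qa, qb]; both equal log(b/a).
import numpy as np
from scipy.integrate import quad

prior = lambda lam: 1.0 / lam
a, b, q = 2.0, 5.0, 3.7          # an arbitrary interval and clock-speed factor

m1, _ = quad(prior, a, b)
m2, _ = quad(prior, q * a, q * b)
print(np.isclose(m1, m2), np.isclose(m1, np.log(b / a)))   # True True
```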

$\endgroup$
3
  • $\begingroup$ This notion of "uninformative prior" is a different thing from Jeffreys priors though, isn't it? I've read Jaynes' book, and quite a few of his papers on this topic, and I seem to remember him arguing against Jeffreys priors, on the grounds that they're not uninformative in the sense you describe. $\endgroup$
    – N. Virgo
    Commented Oct 18, 2020 at 14:46
  • $\begingroup$ In particular, I remember him arguing in favour of an "uninformative" prior for a binomial distribution that's an improper prior proportional to $1/(p(1-p))$. That's different from the Jeffreys prior, which is proportional to $1/\sqrt{p(1-p)}$. $\endgroup$
    – N. Virgo
    Commented Oct 18, 2020 at 14:49
  • 1
    $\begingroup$ I think I found out why I considered them the same: Jaynes in his book refers only to the $(dv/v)$ rule and its consequences as Jeffreys' priors. On applying the $(dv/v)$ rule on the positive semi-infinite interval, we get the $1/(p(1-p))$ dependence, which Jeffreys accepts only for the semi-infinite interval. For the $[0,1]$ interval he supports the square-root-dependent term, stating that the weights over 0 and 1 are too high in the former distribution, making the population biased over these 2 points only. Yes, I think they are different. $\endgroup$
    – user666669
    Commented Oct 18, 2020 at 16:27
2
$\begingroup$

Your answer, @N. Virgo, has greatly improved my understanding of what the Jeffreys prior is and in what sense the word "invariant" is used. The goal of this answer is to provide a rigorous mathematical framework of the "invariance" property and to show that the prior obtained by Jeffreys method is not unique. In fact, I will show that for any desired prior one can construct an "invariant" method that produces this prior. Henceforth I will use the word equivariant instead of invariant since it is a better fit in my opinion.

Throughout this answer we fix a measurable space $(\Omega,\mathcal A)$, as well as a parameter space $\Theta\subset\mathbb R$, that, for simplicity, I assume to be an interval (the arguments here should also work for more general parameter spaces and the reader is invited to repeat them in a more general setting).

Notation. We denote by $\mathrm M^1(\Omega,\mathcal A)$ the set of probability measures on $(\Omega,\mathcal A)$ and by $\mathrm M^\sigma(\Omega,\mathcal A)$ the space of all $\sigma$-finite measures on $(\Omega,\mathcal A)$. We denote by $\mathrm M^1(\Omega,\mathcal A)^\Theta$ the space of all families $(\mathsf P_\theta)_{\theta\in\Theta}$ where $\mathsf P_\theta\in\mathrm M^1(\Omega,\mathcal A)$. We denote the Borel-measurable sets on $\Theta$ by $\mathcal B(\Theta)$. For a measure $\mu$ on a measurable space $X_1$ and a measurable map $h:X_1\to X_2$ for a measurable space $X_2$, we denote by $h_\#\mu$ the pushforward measure defined by $h_\#\mu(A)=\mu(h^{-1}(A))$ for all measurable $A\subset X_2$.

Definition. An equivariant method for constructing prior distributions is a set $X\subset \mathrm M^1(\Omega,\mathcal A)^\Theta$ satisfying $(\mathsf P_\theta)_{\theta\in\Theta}\in X\implies (\mathsf P_{h(\theta)})_{\theta\in\Theta}\in X$ together with a mapping \begin{align*}\rho: X&\to \mathrm M^\sigma(\Theta, \mathcal B(\Theta))\\ (\mathsf P_\theta)_{\theta\in\Theta}&\mapsto\rho[(\mathsf P_\theta)_{\theta\in\Theta}]\end{align*} satisfying the equivariance property $$h_\# \rho[(\mathsf P_{h(\theta)})_{\theta\in\Theta}] = \rho[(\mathsf P_\theta)_{\theta\in\Theta}]$$ for all bijective $h\in C^\infty(\Theta;\Theta)$.

Example. A trivial choice is $X=\mathrm M^1(\Omega,\mathcal A)^\Theta$ and $\rho=0$, because the measure assigning $0$ to all measurable sets is invariant under push-forward by any map. This choice is not at all useful or interesting.

Another trivial choice is $X=\emptyset$ and $\rho$ to be the empty map, however, this choice is also not at all useful or interesting.

Jeffreys method is also an equivariant method for constructing prior distributions, and the first "non-trivial" method mentioned here: We first fix a $\sigma$-finite measure $\nu$ on $(\Omega,\mathcal A)$ and then define $X$ to be the set of all families of probability distributions $(\mathsf P_\theta)_{\theta\in\Theta}$ such that

  1. $\mathsf P_\theta\ll\nu$ for all $\theta\in\Theta$,
  2. the likelihood function $f_\theta=\frac{\mathrm d\mathsf P_\theta}{\mathrm d\nu}$ can be chosen so that $\theta\mapsto\ln f_\theta$ is $C^\infty$,
  3. $\frac{\partial^2}{\partial\theta^2}\ln f_\theta\in L^1(\Omega,\mathcal A, \mathsf P_\theta)$ for all $\theta\in\Theta$,
  4. Jeffreys prior defined below is indeed in $\mathrm M^\sigma(\Theta, \mathcal B(\Theta))$ (I am not sure if this condition is redundant).

We then define Jeffreys prior (not-normalized) $\rho[(\mathsf P_\theta)_{\theta\in\Theta}]$ as the measure over $\Theta$ whose density with respect to the Lebesgue measure $\lambda$ is the square root of the Fisher information, i.e.

$$\frac{\mathrm d\rho[(\mathsf P_\theta)_{\theta\in\Theta}]}{\mathrm d\lambda} =\sqrt{-\int_{\Omega} \frac{\partial^2}{\partial\theta^2}\ln f_\theta(x)\,\mathrm d\mathsf P_\theta(x)}.$$

The fact that Jeffreys method is equivariant has been proven in many places, such as the Wikipedia article linked in this discussion (however, the treatment there is not quite as formal as here).
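As a sanity check (a numerical sketch of my own, not part of the formal argument), the equivariance property from the definition can be verified directly for $\Omega=\{0,1\}$, $\Theta=(0,1)$, $\mathsf P_\theta=\mathrm{Bernoulli}(\theta)$ and the bijection $h(\theta)=\theta^2$, by comparing the two measures on an interval. The closed-form Fisher informations used below are standard but assumed rather than derived here.

```python
# Numerical sanity check (my own sketch) of the equivariance property for Jeffreys'
# method, with Theta = (0, 1), P_theta = Bernoulli(theta) and h(theta) = theta^2.
# Closed forms assumed: the unnormalised Jeffreys density of (P_theta) is
# 1/sqrt(theta(1-theta)); that of the reparametrised family (P_{h(theta)}) is
# 2/sqrt(1-theta^2) (Fisher information 4/(1-theta^2), via the chain rule).
import numpy as np
from scipy.integrate import quad

rho_orig  = lambda t: 1.0 / np.sqrt(t * (1.0 - t))   # density of rho[(P_theta)]
rho_repar = lambda t: 2.0 / np.sqrt(1.0 - t**2)      # density of rho[(P_{h(theta)})]
h_inv     = lambda t: np.sqrt(t)                     # inverse of h(theta) = theta^2

a, b = 0.2, 0.7                                      # an arbitrary interval in Theta
lhs, _ = quad(rho_repar, h_inv(a), h_inv(b))         # (h_# rho[(P_{h(theta)})]) ([a, b])
rhs, _ = quad(rho_orig, a, b)                        # rho[(P_theta)] ([a, b])
print(np.isclose(lhs, rhs))                          # True
```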


I now want to show that, given any desired prior, there exists an equivariant method on a very large set $X$ producing this prior.

First, it should be noted that if, for example, $\mathsf P_{\theta}=\mathsf P_{\vartheta}$ for all $\theta,\vartheta\in\Theta$, then we must have $\rho[(\mathsf P_\theta)_{\theta\in\Theta}]=0$. This is simply because $0$ is the only $\sigma$-finite measure that remains unchanged when pushed forward by any smooth bijective map (actually I should prove this statement but I believe this is true).

This is in particular true for Jeffreys method: If $\mathsf P_\theta$ doesn't depend on $\theta$, then neither does $f_\theta$ and therefore the Fisher information is always equal to $0$.

This is of course undesired (we want to generate any desired prior, not just $0$), and it doesn't seem very useful in practical problems to have multiple distinct parameters with the same probability distribution assigned to them. We therefore restrict our attention to the set $X\subset\mathrm M^1(\Omega,\mathcal A)^\Theta$ consisting of all $(\mathsf P_\theta)_{\theta\in\Theta}$ such that $\theta\mapsto\mathsf P_\theta$ is an injective map.

Fix now any "privileged" family of distributions $(\mathrm Q_\theta)_{\theta\in\Theta}$ (in the language of Bayesians this would be a "privileged parametrization") and the "privileged" prior $p\in\mathrm M^\sigma(\Theta,\mathcal B(\Theta))$ that you want to obtain.

We now define $$\rho:X\to\mathrm M^{\sigma}(\Theta,\mathcal B(\Theta))$$ as $$\rho[(\mathsf P_\theta)_{\theta\in\Theta}] =\begin{cases}h^{-1}_\# p, &\text{ if }(\mathsf P_{\theta})_{\theta\in\Theta}=(\mathrm Q_{h(\theta)})_{\theta\in\Theta} \text{ for some bijective }h\in C^\infty(\Theta;\Theta)\\0,&\text{otherwise}. \end{cases}$$

Note that by the definition of $X$, $\rho$ is well-defined, since the $h$ in the first case is unique if it exists. Moreover, $\rho$ satisfies the equivariance property by construction: replacing $(\mathsf P_\theta)_{\theta\in\Theta}$ by $(\mathsf P_{h'(\theta)})_{\theta\in\Theta}$ replaces $h$ by $h\circ h'$, and $h'_\#\big((h\circ h')^{-1}_\# p\big)=h^{-1}_\# p$.
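Finally, a toy numerical sketch of this construction (my own illustration; the privileged family, privileged prior and the maps are arbitrary choices): take $\mathrm Q_t=\mathrm{Bernoulli}(t)$ on $\Theta=(0,1)$ and $p=\mathrm{Beta}(2,2)$ as the desired prior. A family presented as $\theta\mapsto\mathrm Q_{m(\theta)}$ is assigned the prior $m^{-1}_\# p$, whose density is $p(m(\theta))\,|m'(\theta)|$, and the equivariance can be checked for a further reparametrisation $k(\theta)=\theta^3$.

```python
# Toy instance (my own sketch) of the construction above, at the level of densities.
# Privileged family: Q_t = Bernoulli(t) on Theta = (0,1); privileged prior p = Beta(2,2).
# A family presented as theta |-> Q_{m(theta)} is assigned the prior m^{-1}_# p,
# with density p(m(theta)) |m'(theta)|. We check the equivariance for k(theta) = theta^3.
import numpy as np
from scipy.stats import beta

p_dens = beta(2, 2).pdf                 # density of the privileged prior p

def method_density(m, m_prime):
    """Density of m^{-1}_# p, i.e. of the prior assigned to the family Q_{m(theta)}."""
    return lambda t: p_dens(m(t)) * np.abs(m_prime(t))

h, h_prime = (lambda t: t**2), (lambda t: 2 * t)        # the family at hand: Q_{h(theta)}
k, k_prime = (lambda t: t**3), (lambda t: 3 * t**2)     # a further smooth bijection of (0,1)
k_inv       = lambda t: t ** (1.0 / 3.0)
k_inv_prime = lambda t: (1.0 / 3.0) * t ** (-2.0 / 3.0)

rho1 = method_density(h, h_prime)                       # prior assigned to Q_{h(theta)}
rho2 = method_density(lambda t: h(k(t)),                # prior assigned to Q_{h(k(theta))}
                      lambda t: h_prime(k(t)) * k_prime(t))

# k_# rho2 should equal rho1; compare densities on a grid via the change-of-variables formula.
ts = np.linspace(0.1, 0.9, 9)
pushforward = rho2(k_inv(ts)) * np.abs(k_inv_prime(ts))
print(np.allclose(pushforward, rho1(ts)))               # True
```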

$\endgroup$
1
$\begingroup$

The clearest answer I have found (i.e., the most blunt "definition" of invariance) was a comment in this Cross-Validated thread, which I combined with the discussion in "Bayesian Data Analysis" by Gelman et al. to finally come to an understanding.

The key point is we want the following: If $\phi = h(\theta)$ for a monotone transformation $h$, then:

$$P(a \le \theta \le b) = P(h(a) \le \phi \le h(b))$$

Proof

First we show a probability density for which this is satisfied.

Let $p_{\theta}(\theta)$ be the prior on $\theta$. We will derive the prior on $\phi$, which we'll call $p_{\phi}(\phi)$. By the transformation of variables formula,

$$p_{\phi}(\phi) = p_{\theta}( h^{-1} (\phi)) \Bigg| \frac{d}{d\phi} h^{-1}(\phi) \Bigg| $$

Now, according to this Wikipedia page, the derivative of the inverse gives:

$$p_{\phi}(\phi) = p_{\theta}( h^{-1} (\phi)) \Bigg| h'(h^{-1}(\phi)) \Bigg|^{-1} $$

We will write this in another way to make the next step clearer. Recalling that $\phi = h(\theta)$, we can write this as

$$p_{\phi}(h(\theta)) = p_{\theta}(\theta) \Bigg| h'(\theta) \Bigg|^{-1}.$$

Now we get to the good part.

\begin{aligned} P(h(a)\le \phi \le h(b)) &= \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi\\ \end{aligned}

using the substitution formula from Wikipedia with $\phi = h(\theta)$

\begin{aligned} \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi &= \int_{a}^{b} p_{\phi}(h(\theta)) h'(\theta) d\theta\\ &= \int_{a}^{b} p_{\theta}(\theta) \Bigg| h'(\theta) \Bigg|^{-1} h'(\theta) d\theta, \end{aligned}

where we have used our result above.

Now, we can drop the absolute value bars around $h'(\theta)$. If $h$ is increasing, then $h'$ is positive and the bars change nothing. If $h$ is decreasing, then $|h'(\theta)|^{-1} h'(\theta) = -1$; but in that case we also have $h(b) < h(a)$, so writing the probability of the $\phi$-interval with its endpoints in increasing order contributes a second minus sign, and the two cancel. Either way, cancelling $h'^{-1}$ against $h'$ gives

$$ \int_{h(a)}^{h(b)} p_{\phi}(\phi) d\phi = \int_{a}^{b}p_{\theta}(\theta) d\theta$$

and hence

$$ P(a \le \theta \le b) = P(h(a) \le \phi \le h(b))$$

Now, we need to show that this holds when the prior in each parametrisation is chosen as the square root of the corresponding Fisher information; that is the step where the specific form of the Jeffreys prior enters. This proof is clearly laid out in these lecture notes.
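For concreteness, here is a quick numerical check of my own for the binomial-trial case from the original question, using the log-odds map $h(\theta)=\log\frac{\theta}{1-\theta}$. The density $e^{\phi/2}/\big(\pi(1+e^{\phi})\big)$ used below is what the square-root-of-Fisher-information recipe gives directly in the $\phi$ parametrisation (using $I(\phi)=\theta(1-\theta)$; the normalising constant also works out to $\pi$). It assigns the same probability to $[h(a),h(b)]$ as the prior $(i)$ assigns to $[a,b]$.

```python
# Quick numerical check (my own sketch) for the binomial-trial case:
# p_theta is the Jeffreys prior (i); p_phi is the sqrt-Fisher-information prior
# written directly in the log-odds parametrisation (closed forms assumed above).
import numpy as np
from scipy.integrate import quad

p_theta = lambda t: 1.0 / (np.pi * np.sqrt(t * (1.0 - t)))
p_phi   = lambda f: np.exp(f / 2.0) / (np.pi * (1.0 + np.exp(f)))
h       = lambda t: np.log(t / (1.0 - t))          # the monotone transformation

a, b = 0.1, 0.6
P_theta, _ = quad(p_theta, a, b)                   # P(a <= theta <= b)
P_phi,   _ = quad(p_phi, h(a), h(b))               # P(h(a) <= phi <= h(b))
print(np.isclose(P_theta, P_phi))                  # True
```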

Hope this helps.

$\endgroup$
