
Let $X$ be an $\mathcal{X}$-valued random variable. We are doing Bayesian statistics. Suppose that $\theta$ is a $\Theta$-valued random variable with known prior distribution $\Pi$, and that the regular conditional probability $P_{X \mid \theta}$ is known. If $\Pi$ is proper (i.e. $\Pi(\Theta)=1$), then we consider $X$ as a random variable whose distribution $P_X$ is, for any measurable $A \subseteq \mathcal{X}$:

\begin{equation} P_X(A) = \int_{\Theta} P_{X \mid \theta = u}(A) \, d\Pi(u) \tag{1} \end{equation}

That is, rather than the frequentist approach where $P_X$ is determined by one true $\theta$, here we model $P_X$ as a mixture of the conditional distributions $P_{X \mid \theta = u}$ over all possible values of $\theta$, weighted by the chosen prior.
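To get a concrete feel for this mixture, here is a minimal simulation sketch; the Beta(2, 5) prior and Bernoulli likelihood are illustrative assumptions, not part of the question.

```python
# Minimal sketch of the generative view above, assuming (illustratively)
# theta ~ Beta(2, 5) and X | theta ~ Bernoulli(theta).
import numpy as np

rng = np.random.default_rng(42)
n_draws = 100_000

theta = rng.beta(2.0, 5.0, size=n_draws)       # theta ~ Pi (a proper prior)
x = (rng.random(n_draws) < theta).astype(int)  # X | theta ~ Bernoulli(theta)

# Monte Carlo estimate of P_X({1}) = integral of P_{X|theta=u}({1}) dPi(u) = E[theta]
print(x.mean())  # approx 2 / (2 + 5) = 0.286
```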


Edit:

If the sample size is $n$, then the usual assumption is that $X_1, \ldots, X_n$ are independent and identically distributed when conditioned on $\theta$. So we calculate, for any rectangle $A = \prod_{i=1}^n A_i$ with measurable $A_i \subseteq \mathcal{X}$:

\begin{equation*} \begin{split} P_{X_1, \ldots, X_n}(A) &= \int_{\Theta} P_{X_1, \ldots, X_n \mid \theta = u}(A) \, d\Pi(u) = \int_{\Theta} \prod_{i=1}^n P_{X_i \mid \theta = u}(A_i) \, d\Pi(u) \\ &= \int_{\Theta} \prod_{i=1}^n P_{X_1 \mid \theta = u}(A_i) \, d\Pi(u) \end{split} \end{equation*}
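As a hedged concrete instance of this formula (the Bernoulli likelihood and uniform prior are illustrative choices, not from the post): if $X_i \mid \theta \sim \text{Bernoulli}(\theta)$ and $\Pi$ is uniform on $(0,1)$, then taking $A_i = \{1\}$ for all $i$ gives
\begin{equation*} P_{X_1, \ldots, X_n}(\{1\}^n) = \int_0^1 u^n \, du = \frac{1}{n+1}, \end{equation*}
which for $n \ge 2$ already differs from the product of the marginals, $\prod_{i=1}^n P_{X_i}(\{1\}) = 2^{-n}$.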


So the question is: how is $X$ generated if we are using an improper prior? Given the discussion above, this does not feel like a natural generalization, because $P_X$ would not be a probability distribution. Do we just think of $X$ as a generic measurable function? I know that for practical purposes we need to consider improper priors such as noninformative priors. But is there any theoretical motivation for improper priors?

  • I think it would be helpful for future answerers to understand what kind of answer you are looking for. Some of the theoretical motivation for improper priors comes from the fact that Bayes estimators have nice properties (admissibility, minimaxity, etc.), so if we can express an estimator as a limit of Bayes estimators (even though there is no true prior our estimate can come from), then we inherit some of those properties. There is also the notion of the "Jeffreys prior", which can sometimes be proper and sometimes not. There is a lot of Bayes theory, but it is hard to know exactly what you are asking for right now.
    – PhysicsKid
    Commented Mar 18 at 3:35
  • An expected answer would address how $X$ is generated when an improper prior is posed, and why any theory would require $X$ to be generated in such an unnatural way. Importantly, if there is no satisfactory answer on how $X$ is generated when using an improper prior, should one treat improper priors strictly as a technique or a tool?
    – 温泽海
    Commented Mar 18 at 15:07

2 Answers


You cannot; that is part of what makes a prior improper.

The motivation is usually in the context of conjugate priors. Take as an illustration

  • an exponentially-distributed likelihood with unknown rate $\lambda$
  • with a gamma-distributed conjugate prior distribution with shape $\alpha$ and rate $\beta$
  • so the posterior distribution after $n$ observations is also gamma-distributed but with shape $\alpha+n$ and rate $\beta +\sum x_i$

You might choose to start with $\alpha=0$ and $\beta=0$ to reduce the prior's influence on the later posterior distributions, though your choice of a conjugate family will still have an influence. This prior is clearly improper, as there is no such gamma distribution (many of the relevant calculations would be $\frac{0}{0}$), but in a sense this may be a good thing: for example, you are making no assumptions about the scale of the units used to measure $\lambda$. As soon as you make a single observation, you have a proper posterior distribution, which then works as a proper prior for subsequent observations, and all is right with the world. There is nothing else special about this first observation: once you have a second observation, you get a new posterior distribution which can then act as a new prior, and it is unaffected by which observation was seen first.
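As a minimal numerical sketch of this update (the true rate, sample size, and seed are of course illustrative assumptions):

```python
# Sketch of the gamma-exponential update above, starting from the improper
# "Gamma(0, 0)" prior, i.e. just the pair alpha = beta = 0, which is not a
# distribution one can sample from.
import numpy as np

rng = np.random.default_rng(7)
true_rate = 2.5  # illustrative "true" lambda
data = rng.exponential(scale=1.0 / true_rate, size=50)

alpha, beta = 0.0, 0.0  # improper starting point
for x in data:
    alpha += 1.0   # posterior shape: alpha + n
    beta += x      # posterior rate: beta + sum of observations
    # After the very first observation (alpha = 1, beta = x_1), the
    # posterior Gamma(alpha, beta) is already a proper distribution.

print(alpha / beta)  # posterior mean of lambda, close to true_rate
```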

  • I guess this would be the answer. Since priors are meant to be updated, improper priors can act as an intermediate step. I should most likely consider improper priors only as tools, rather than as a subject in their own right.
    – 温泽海
    Commented Mar 16 at 20:14

Just to say it again, if the sample size is $n$, then we will consider the sample $X_1, \ldots, X_n$ as independent identically distributed according to $P_X$ defined in equation (1) above.

This is incorrect: if the $X_i$'s are i.i.d. conditional on $\theta$, they are not unconditionally so, since $$ (X_1, \ldots, X_n) \sim \int \prod_{i=1}^n f(x_i|\theta)\,\text d\Pi(\theta) \ne \prod_{i=1}^n \int f(x_i|\theta)\,\text d\Pi(\theta)$$ In other words, the $X_i$'s become dependent when integrating out $\theta$, because they all contain information about the same $\theta$.
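A quick Monte Carlo check of this dependence, under illustrative assumptions (a Beta(2, 2) prior with Bernoulli observations, neither taken from the answer), where marginally $\operatorname{Cov}(X_1, X_2) = \operatorname{Var}(\theta) = 0.05 \ne 0$:

```python
# Conditionally i.i.d. draws become marginally dependent once theta is
# integrated out. Illustrative model: theta ~ Beta(2, 2),
# X_i | theta ~ Bernoulli(theta), so Cov(X_1, X_2) = Var(theta) = 0.05.
import numpy as np

rng = np.random.default_rng(0)
n_reps = 200_000

theta = rng.beta(2.0, 2.0, size=n_reps)        # theta ~ Pi
x1 = (rng.random(n_reps) < theta).astype(int)  # X_1 | theta
x2 = (rng.random(n_reps) < theta).astype(int)  # X_2 | theta, indep. of X_1 given theta

print(np.cov(x1, x2)[0, 1])  # approx 0.05, not 0
```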

To address the question more directly: the prior construction is an inferential, post-observational choice that logically does not impact the way the $X_i$'s were generated. (Think of the usual case where the Bayesian analyst comes upon the scene after the data was collected.) This prior construction is adopted (only) for Bayesian inference about the (true) parameter $\theta$ driving the generation of those $X_i$'s. Thus the generative model on the $X_i$'s per se is unrelated to $\Pi$: it is either the postulated model attached to $f(\cdot|\theta)$ for an unknown value of $\theta$, or an unknown "true model" in case of model misspecification. This is why different Bayesians will adopt different priors for the same dataset, with none being "wrong". In other words, there is no such thing as the "true" prior, because otherwise we would not find ourselves in a Bayesian setting but facing a genuine generation mechanism (as in errors-in-variables models).

As to why one would resort to noninformative priors, there are many entries on X validated about it, like this one. But there are also limitations to their use, for instance for model choice, since the marginal density of the sample $$p(\mathbf x) = \int \prod_{i=1}^n f(x_i|\theta)\,\text d\Pi(\theta)$$ is not a probability density and thus cannot be scaled (or compared with a genuine probability density). This strong limitation (DeGroot, 1973) is directly connected to the impossibility of generating samples from the prior predictive, and hence to the question. It is also related to the impossibility of running ABC with an improper prior (Sisson et al., 2019), since pseudo-samples cannot be produced from the prior predictive.
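To make the non-normalizability concrete, here is a hedged worked instance (the exponential likelihood and Lebesgue prior measure are illustrative choices): with $f(x|\lambda) = \lambda e^{-\lambda x}$ and $\text d\Pi(\lambda) = \text d\lambda$ on $(0, \infty)$, a single observation has marginal $$p(x) = \int_0^\infty \lambda e^{-\lambda x}\,\text d\lambda = \frac{1}{x^2},$$ and $\int_0^\infty x^{-2}\,\text dx = \infty$ (the integral diverges at $0$), so no rescaling can turn $p$ into a probability density.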

  • If the prior we pose has no impact on the way $X$ is generated, what exact distribution generates $X$ in Bayesian statistics? There won't be a true $\theta$, for that would be frequentist.
    – 温泽海
    Commented Mar 16 at 15:14
  • I forgot that the usual assumption in Bayesian statistics is that the $X_i$'s are conditionally independent. I have edited the post to correct this. Thank you for pointing it out. The central problem, however, remains.
    – 温泽海
    Commented Mar 16 at 15:34
  • If you want to generate samples from a Bayesian model that includes the prior, you need to generate a parameter from the prior. In that case the prior does impact the way the $X_i$'s are generated, indeed. Commented Mar 16 at 18:02
  • But how do I generate samples from a Bayesian model without using the prior? This does not make a lot of sense to me. On the other point, suppose that I want to generate from a Bayesian model using a given prior: do you mean that I should generate one $\theta$ from $\Pi$ and then generate $X_1, \ldots, X_n$ i.i.d. using this single $\theta$? But how is this any different from what I proposed in the post?
    – 温泽海
    Commented Mar 16 at 19:21
  • @温泽海 My comment was addressing Xi'an, not your posting. I'm basically saying what you are saying. I just think Xi'an has a different interpretation of "generating data": you and I are talking about generating data from the full model including the prior, whereas Xi'an refers to a Bayesian who thinks of the sampling model given $\theta$ as "generating the data", and $\theta$ is just something we are interested in expressing uncertainty about. Commented Mar 17 at 11:48
