
What is the intuition behind conditional expectation in a measure-theoretic sense, as opposed to a non-measure-theoretic treatment?

You may assume I know:

  • what a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ refers to
  • probability without measure theory really well (i.e., discrete and continuous random variables)
  • what the measure-theoretic definition of a random variable is
  • that a Lebesgue integral has something to do with linear combinations of step functions, and intuitively, it involves partitioning the $y$-axis (as opposed to the Riemann integral, which partitions the $x$-axis).

I haven't had time to learn measure-theoretic probability lately, between graduate school and other commitments, and conditional expectation tends to be one of the last topics covered in every measure-theoretic probability text I've seen.

I have seen notations such as $\mathbb{E}[X \mid \mathcal{F}]$, where I assume $\mathcal{F}$ is some sort of $\sigma$-algebra - but of course, this looks very different from, say, the $\mathbb{E}[X \mid Y]$ I saw in my non-measure-theoretic treatment of probability, where $X$ and $Y$ are random variables.

I was also surprised to see that one book I have (Essentials of Probability Theory for Statisticians by Proschan and Shaw (2016)), if I recall correctly, explicitly states that conditional expectation is defined as a conditional expectation, rather than the conditional expectation, which suggests to me that there can be more than one possible conditional expectation for a given pair of random variables. (Unfortunately, I don't have the book on me right now, but I can update this post later.)

The Wikipedia article is quite dense, and I see terms such as "Radon–Nikodym" which I haven't learned yet, but I would at least like to get an idea of what the intuition behind conditional expectation is in a measure-theoretic sense.


2 Answers


Intuitively, you can think of $E(X|\mathcal{F})$ as the "best guess" of the value of $X$ given the information contained in the events of $\mathcal{F}$. More formally, $E(X|\mathcal{F})$ is a random variable that satisfies the following:

  1. $E(X|\mathcal{F})$ is an $\mathcal{F}$-measurable function
  2. $\int_A E(X|\mathcal{F}) dP = \int_A X dP$ for every $A \in \mathcal{F}$
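
To see how this abstract definition recovers the elementary notion, here is a small worked special case (my own, just for illustration): take $\mathcal{F} = \{\emptyset, B, B^c, \Omega\}$ for a single event $B$ with $0 < P(B) < 1$. An $\mathcal{F}$-measurable function must be constant on $B$ and constant on $B^c$, and applying condition 2 with $A = B$ and $A = B^c$ forces $$ E(X|\mathcal{F})(\omega) = \begin{cases} \dfrac{1}{P(B)} \displaystyle\int_B X \, dP, & \omega \in B, \\ \dfrac{1}{P(B^c)} \displaystyle\int_{B^c} X \, dP, & \omega \in B^c, \end{cases} $$ which is exactly the elementary conditional expectation $E(X \mid B)$ on $B$ and $E(X \mid B^c)$ on $B^c$.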

Now, the theorem that guarantees the existence of $E(X|\mathcal{F})$ is precisely the Radon–Nikodym theorem (which is why those words appear in this context). To fully understand this definition of conditional expectation, it really helps to first cement some measure theory. Having said that, with the intuition above we can already "guess" the conditional expectation in some cases:

i) If $X$ itself is an $\mathcal{F}$-measurable function, then $E(X|\mathcal{F}) = X$. That is, since $\mathcal{F}$ already contains all the information about $X$, our best guess is $X$ itself.

ii) If $X$ is independent of $\mathcal{F}$, i.e., $P(\{X \in B\} \cap A) = P(X \in B)P(A)$ for all $A \in \mathcal{F}$ and $B \in \mathcal{B}_\mathbb{R}$, then $E(X|\mathcal{F}) = E(X)$. In other words, if $\mathcal{F}$ gives us no information about $X$, then our best guess of the value of $X$ is $E(X)$.
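
As a quick check of (ii) against the definition above: the constant $E(X)$ is trivially $\mathcal{F}$-measurable, and for any $A \in \mathcal{F}$, independence of $X$ and $I_A$ gives $$ \int_A X \, dP = E(X I_A) = E(X)\,P(A) = \int_A E(X) \, dP, $$ so both defining conditions hold.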

Finally, there are other useful ways to build intuition for this conditional expectation. For instance, if you are familiar with Hilbert spaces and $X \in L^2(\mathcal{G})$ (for the probability space $(\Omega, \mathcal{G}, P)$), then for $\mathcal{F} \subset \mathcal{G}$, $E(X|\mathcal{F})$ is the orthogonal projection of $X$ onto $L^2(\mathcal{F})$. This is a commonly used geometric intuition.
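
Concretely, the projection description says that among all $\mathcal{F}$-measurable $Z$ with $E[Z^2] < \infty$, the conditional expectation is the (almost surely unique) minimizer of the mean squared error, $$ E(X|\mathcal{F}) = \underset{Z \in L^2(\mathcal{F})}{\arg\min} \; E\big[(X - Z)^2\big], $$ which is another way of making the "best guess" language precise.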


As Oscar suggested, the most common intuition for $\mathbb{E}[X | \mathcal{F}]$ is that it is the best guess of $X$ given the information in $\mathcal{F}$. However, I find that the alternative intuition that it is the orthogonal projection of $X$ onto a subspace makes it clearer why it is defined the way it is.


First, just for clarity's sake, let me set up the orthogonal projection. Let's say you have an inner product space $V$ and $v \in V$. Then for a subspace $W \subseteq V$, we can define the orthogonal projection of $v$ onto $W$ as the unique $p_{W}(v) \in W$ such that $$ \left< v, w \right> = \left< p_{W}(v), w \right>$$ for all $w \in W$.

This is based on the idea that a vector is determined entirely by its inner products with other vectors. That is, the projection $p_{W}(v)$ is unique because if there were some other candidate $z \in W$, we would have $\left< p_{W}(v) - z, w \right> = 0$ for all $w \in W$; taking $w = p_{W}(v) - z$ gives $\left\| p_{W}(v) - z \right\|^{2} = 0$, so $z = p_{W}(v)$.
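
As a quick sanity check in a familiar setting (a toy example of my own): take $V = \mathbb{R}^{2}$ with the usual dot product and let $W$ be the $x$-axis. For $v = (3, 4)$ the projection is $p_{W}(v) = (3, 0)$: every $w \in W$ has the form $(t, 0)$, and indeed $$ \left< v, w \right> = 3t = \left< (3, 0), w \right>. $$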


Before we move back to random variables, let's consider a space of functions, specifically the square-integrable functions $L^{2}(\mathbb{R})$. On this space, we have an inner product $$ \left< f, g \right> = \int_{-\infty}^{\infty} f(x) g(x) \ dx. $$ Now if we consider specifically the indicator function $I_{A}(x) = I[x \in A]$ of a set $A$, we get that $$ \left< f, I_{A} \right> = \int_{A} f(x) \ dx, $$ assuming that $A$ is appropriately chosen (e.g. a bounded interval, so that $I_{A}$ is itself square-integrable). This in particular will be very convenient for re-interpreting the standard definition of the conditional expectation as a projection.

Now, a function $f$ is not determined entirely by its inner products with other functions $g$. Specifically, if $\tilde{f}$ is another function such that $f(x) = \tilde{f}(x)$ except on a very small set of points (a set of measure zero), the inner products of $\tilde{f}$ with other $g$ will be the same as for $f$. We skirt around this by considering two such functions to be equivalent, i.e., working with the set of square-integrable functions up to almost-everywhere equivalence.


Now, back to random variables. First, like with $L^{2}$, we consider two random variables $X$ and $Y$ the same if they are equal with probability one. In other words, we can have $X(\omega) \not = Y(\omega)$, but only on a small set of outcomes $\omega$ (small meaning probability zero).

The inner product space here is now the set of square-integrable random variables measurable with respect to our ambient $\sigma$-algebra, say $\Sigma$. Notice that this is again a space of functions, just with the constraint that they must be $\Sigma$-measurable. The inner product is $$ \left< X, Y \right> = \mathbb{E}[XY] = \int XY dP. $$

The subspace $W_{\mathcal{F}}$ we project onto is the subset of (square-integrable) $\mathcal{F}$-measurable random variables. Then for any $X$, the projection $\mathbb{E}[X|\mathcal{F}]$ is defined by

  1. $\mathbb{E}[X|\mathcal{F}] \in W_{\mathcal{F}}$
  2. $\int XY dP = \int \mathbb{E}[X|\mathcal{F}] Y dP$ for all $Y \in W_{\mathcal{F}}$.

Now compare this to the traditional definition of $\mathbb{E}[X|\mathcal{F}]$.

  1. $\mathbb{E}[X|\mathcal{F}]$ is $\mathcal{F}$-measurable
  2. $\int_{A} \mathbb{E}[X|\mathcal{F}] dP = \int_{A} X dP$ for all $A \in \mathcal{F}$.

The first conditions correspond to each other, since $W_{\mathcal{F}}$ is just the set of $\mathcal{F}$-measurable random variables. The second conditions look slightly different: the traditional definition only checks the equality against $Y$ of the form $$ Y(\omega) = I_{A}(\omega) = I[\omega \in A] $$ where $A \in \mathcal{F}$. However, by the way the Lebesgue integral is built up, checking the indicator functions is enough to ensure equality against all $\mathcal{F}$-measurable functions. In other words, the two definitions are equivalent in this context.
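
To spell out that last step: if $\int_{A} \mathbb{E}[X|\mathcal{F}] \ dP = \int_{A} X \ dP$ for every $A \in \mathcal{F}$, then by linearity $$ \int \mathbb{E}[X|\mathcal{F}] \, Y \ dP = \int X Y \ dP $$ for every simple $\mathcal{F}$-measurable $Y = \sum_{i} c_{i} I_{A_{i}}$, and since every $Y \in W_{\mathcal{F}}$ is a limit of such simple functions, a standard convergence argument (monotone or dominated convergence) extends the equality to all of $W_{\mathcal{F}}$.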


Basically, this shows that $\mathbb{E}[X|\mathcal{F}]$ is the closest approximation of $X$ that is $\mathcal{F}$-measurable. Put briefly: if we know which events in $\mathcal{F}$ occurred, but not the specific outcome $\omega$, then $\mathbb{E}[X|\mathcal{F}]$ gives us a concrete guess for the value of $X$ using all the information we do have.
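
If a concrete computation helps, here is a minimal numerical sketch of this "best guess" / projection picture on a toy finite space (the setup, values, and variable names are all mine, chosen purely for illustration; it assumes numpy is available): the conditional expectation is obtained by averaging $X$ over each atom of the partition generating $\mathcal{F}$, it satisfies the defining integral identity, and no other $\mathcal{F}$-measurable guess has smaller mean squared error.

```python
import numpy as np

# Toy finite probability space: six equally likely outcomes omega = 0, ..., 5.
outcomes = np.arange(6)
p = np.full(6, 1 / 6)                          # uniform probability measure
X = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0])   # an arbitrary random variable

# F is the sigma-algebra generated by the partition {even outcomes, odd outcomes}.
atoms = [outcomes % 2 == 0, outcomes % 2 == 1]

# E[X | F]: on each atom, replace X by its probability-weighted average over that atom.
cond_exp = np.empty_like(X)
for atom in atoms:
    cond_exp[atom] = np.sum(X[atom] * p[atom]) / np.sum(p[atom])

# Defining property 2: integrals over each atom (hence over every A in F) agree.
for atom in atoms:
    assert np.isclose(np.sum(cond_exp[atom] * p[atom]), np.sum(X[atom] * p[atom]))

# Projection / "best guess" property: among F-measurable Z (i.e. Z constant on each
# atom), the mean squared error E[(X - Z)^2] is minimized by E[X | F].
def mse(a, b):
    Z = np.where(atoms[0], a, b)               # the F-measurable guess: a on evens, b on odds
    return np.sum((X - Z) ** 2 * p)

best = mse(cond_exp[0], cond_exp[1])           # cond_exp[0]: even-atom value, cond_exp[1]: odd-atom value
grid = np.linspace(0.0, 10.0, 21)
print(all(best <= mse(a, b) + 1e-12 for a in grid for b in grid))   # True
```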

There is a lot more to be said about this. One great resource for learning more is David Williams' "Probability with Martingales", which I think anyone even slightly interested in the theory of statistics should own. It has a chapter on conditional expectation that goes into detail. If this weren't already a long answer, I would also go into the idea that $\mathbb{E}[X|\mathcal{F}]$ can be thought of as a (least-squares) regression estimate of $X$. But hopefully this gives a start.

  • $A$ is a set; it doesn't belong to the same vector space as $X$, so your use of the inner product notation is meaningless.
  • @JoseAvilez clearly, I didn't explain well enough. I will make edits to clarify why it does make sense.
