
I see this expectation in a lot of machine learning literature:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})] = \int p(\mathbf{x};\mathbf{\theta}) f(\mathbf{x};\mathbf{\phi}) d\mathbf{x}$$

For example, in the context of neural networks, a slightly different version of this expectation is used as a cost function that is computed using Monte Carlo integration.
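For concreteness, here is a minimal sketch of how such an expectation can be estimated by Monte Carlo; the Gaussian form of $p(\mathbf{x};\mathbf{\theta})$ and the particular $f(\mathbf{x};\mathbf{\phi})$ below are purely illustrative assumptions, not taken from any specific paper:

```python
import numpy as np

# Illustrative choices only:
# p(x; theta) is taken to be a Gaussian with mean theta and unit variance,
# and f(x; phi) = (x - phi)^2 is an arbitrary function with parameter phi.
rng = np.random.default_rng(0)
theta, phi = 1.0, 0.5

# Monte Carlo estimate: draw x_i ~ p(x; theta) and average f(x_i; phi).
x = rng.normal(loc=theta, scale=1.0, size=100_000)
mc_estimate = np.mean((x - phi) ** 2)

# For this choice the exact value is Var(x) + (theta - phi)^2 = 1 + 0.25.
print(mc_estimate)  # approximately 1.25
```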

However, I am a bit confused about the notation that is used, and would highly appreciate some clarity. In classical probability theory, the expectation:

$$\mathbb{E}[X] = \int_x x \cdot p(x) \ dx$$

indicates the "average" value of the random variable $X$. Taking it a step further, the expectation:

$$\mathbb{E}[g(X)]=\int_x g(x) \cdot p(x) \ dx$$

indicates the "average" value of the random variable $Y=g(X)$. From this, it seems that the expectation:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})]$$

is shorthand for, and the same as:

$$\mathbb{E}_{\mathbf{x}}[f(\mathbf{x};\mathbf{\phi})]$$

where

$$ \mathbf{x} \sim p(\mathbf{x};\mathbf{\theta})$$

and this indicates the "average" value of the random vector $\mathbf{y} = f(\mathbf{x};\mathbf{\phi})$. Is this correct?

By this logic, would this statement be correct too?

$$\mathbb{E}[X] = \mathbb{E}_{p(X)}[X]$$

  • Re "Is shorthand for and the same as": Not quite. Notice that the original expression explicitly mentions $\theta$ while the subsequent one does not.
    – whuber
    Commented Sep 11, 2020 at 18:35
  • You got it right! This is quite a confusing notation. I prefer to use the notation $$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\theta)}[X].$$
    Commented Sep 11, 2020 at 19:59
  • I think you need to rely on the conventions and context established by the author. There is no universal notation.
    – whuber
    Commented Sep 11, 2020 at 20:12
  • $\mathbb E[\mathbf X]$ is ambiguous, while $$\mathbb{E}_{\mathbf{X} \sim p(\mathbf{x}|\theta)}[X]$$ and $$\mathbb{E}_{p(\cdot|\theta)}[X]$$ and $$\mathbb{E}_{p(\mathbf{x}|\theta)}[X]$$ are not. This is particularly true when considering varying values of a parameter $\theta$, such as $$\mathbb{E}_{p(\cdot;\mathbf{\theta})}[\log p(\mathbf{X};\mathbf{\phi})],$$ found, e.g., in the EM algorithm.
    – Xi'an
    Commented Sep 12, 2020 at 8:08
  • Hi @jbuddy_13, in a classical neural network architecture, the posterior probability of classes $\mathbf{y}=[y_1,y_2,...,y_K]$ given an input feature vector $\mathbf{x}$ is $p(\mathbf{y}|\mathbf{x};\mathbf{w})$, where $\mathbf{w}$ are the parameters of the network. Note that $\mathbf{y}$ is in one-hot encoding. This posterior probability is estimated using maximum likelihood estimation, and therefore the objective is to maximize $\mathbb{E}_{p(\mathbf{x},\mathbf{y})}[\log p(\mathbf{y}|\mathbf{x};\mathbf{w})]$.
    – mhdadk
    Commented Sep 13, 2020 at 12:09

1 Answer


The expression

$$\mathbb E[g(x;y;\theta;h(x,z),...)]$$

always means "the expected value with respect to the joint distribution of all things having a non-degenerate distribution inside the brackets."

Once you start putting subscripts on $\mathbb E$, you specify a (perhaps "narrower") joint distribution over which you want, for your own reasons, to average. For example, if you wrote $$\mathbb E_{\theta, z}[g(x;y;\theta;h(x,z),...)]$$ I would be inclined to believe that you mean only

$$\mathbb E_{\theta, z}\big[g(x;y;\theta;h(x,z),...)\big] = \int_{S_z}\int_{S_\theta}f_{\theta,z}(\theta, z)\,g(x;y;\theta;h(x,z),...)\, d\theta\, dz$$

and not $$\int_{S_z}\int_{S_\theta}\int_{S_x}\int_{S_y}f_{\theta,z,x,y}(\theta, z,x,y)g(x;y;\theta;h(x,z),...) d\theta\, dz\,dx \,dy$$

But it could also mean something else; on this matter, see also https://stats.stackexchange.com/a/72614/28746
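To make the distinction tangible, here is a small numerical sketch; the choice $g(x,\theta) = x\,\theta^2$ with independent $x$ and $\theta$ is purely hypothetical, chosen only to show that the subscripted expectation is still a function of the variables not averaged over:

```python
import numpy as np

# Hypothetical setup, only to illustrate the distinction made above:
# x ~ N(2, 1), theta ~ N(0, 1), independent, and g(x, theta) = x * theta^2.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=200_000)

# E_theta[g(x, theta)] averages only over theta; it is still a function of x.
x_fixed = 3.0
partial_expectation = np.mean(x_fixed * theta**2)  # approx x_fixed * E[theta^2] = 3

# E[g(x, theta)] with no subscript averages over the joint distribution of (x, theta).
x = rng.normal(2.0, 1.0, size=200_000)
full_expectation = np.mean(x * theta**2)           # approx E[x] * E[theta^2] = 2

print(partial_expectation, full_expectation)
```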

