
Dear statistics experts

I need your help with something that has bothered me for a while now. My problem revolves around optimal prediction and essentially boils down to:

Why would we ever set up and estimate a statistical prediction model if we could just use the conditional distribution instead (given that we have enough data)?

An example:

Let's assume our goal is to predict a person $ i $'s salary $ y_i $ (dependent variable) as accurately as possible and, to keep matters simple, we have three independent variables $ X_j $: Age ($ X_1 $), Education Level ($ X_2 $), and Gender ($ X_3 $). The sample we gathered is large and the people were randomly selected (i.i.d.), so this is a classical regression problem. We evaluate prediction accuracy by mean squared error (MSE), hence our loss metric is:

$$ \frac{1}{N} \sum_{i=1}^{N} ( \hat{y}_i - y_i )^2 $$

The most common way to solve this prediction problem would be to fit a function predicting salary $ y_i $ from our three independent variables on a training sample. We then compare our predicted salaries $ \hat{y}_i $ to the actual salaries $ y_i $ in a (previously unseen) test sample to compute the prediction errors and evaluate how far off we were. Formally, this most commonly amounts to:

$$ \hat{y}_i = E[y | X_j] = f(X_j)$$

Once we have formulated the prediction problem as fitting a function for estimating $ y_i $, we could restrict the function to be a linear regression model, a neural network, or any other machine learning algorithm. If we simply go for a linear regression model, then

$$ E[y | X_j] = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 $$
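To make this concrete, here is a minimal sketch of this standard train/test workflow. All variable names, coefficients, and noise levels below are invented stand-ins, and statsmodels' OLS is just one possible choice:

```python
# Minimal sketch of the fit-on-train / evaluate-on-test workflow described above.
# All data are simulated stand-ins, not real salary data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 65, n)
educ = rng.integers(0, 3, n)                      # ordinal education code
gender = rng.integers(0, 2, n)                    # 0/1 dummy
salary = 20000 + 800*age + 12000*educ + 4000*gender + rng.normal(0, 8000, n)

X = sm.add_constant(np.column_stack([age, educ, gender]))
train, test = slice(0, 1000), slice(1000, None)   # training vs. unseen test sample

fit = sm.OLS(salary[train], X[train]).fit()       # linear model for E[y | X]
y_hat = fit.predict(X[test])                      # point predictions
mse = np.mean((y_hat - salary[test])**2)          # the MSE loss defined above
```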

Now for my problem:

Regardless of which statistical prediction model we choose, even the most sophisticated machine learning model, fitting a prediction function will always yield only a point prediction, which in our case is the conditional expected value.

Why do we not simply settle for the conditional probability $ P(y | X_j) $ instead of the conditional expected value $ E[y | X_j] $ ?

The advantage of the conditional probability would be that it gives us much more information about the relationship between dependent and independent variables than the simple point prediction does. It converges in the limit to the true population distribution because it takes all the information in the data into account, and we do not need to make any further assumptions about the kind of model or function we fit. So we cannot make the error of, e.g., fitting a linear function to a heavily skewed and heavy-tailed distribution, nor do we need to calculate any confidence intervals, because we already have the entire conditional distribution. In essence, all the information we need about $ P(y | X_j) $ is already present in the data (sample). By fitting a function / statistical prediction model, we only impose additional assumptions, such as a joint normal distribution, which need not hold in the population and for which we have no more knowledge than what is reflected in the data in the first place.

Clearly, to evaluate prediction performance we must in the end still restrict ourselves to a point prediction, since we otherwise cannot compute the MSE. But instead of directly going for the point prediction, we could derive any moment or summary we wish from the conditional distribution. And the conditional distribution shows us all historically recorded values of the dependent variable salary given specific values of the independent variables, not just the first moment (the expected value). This has the advantage that we would be aware of heavy tails or skewed prediction distributions, something we cannot see from point predictions alone.

The disadvantage of using the conditional distribution would be the curse of dimensionality, and that we cannot make predictions outside the range of values present in the historical data sample. Hence, we need a lot of data (ideally also at the boundaries). The curse of dimensionality implies that the number of data points in our sample must grow exponentially with the number of independent variables. This is a major drawback of the conditional distribution approach. On the flip side, it then also forces us to restrict ourselves to, e.g., the 3-5 or 5-10 most important independent variables only.

In practice and with real data, I guess using the conditional distribution for prediction problems would mean going for nonparametric kernel density estimation. Whenever we receive new data points for the independent variables $ X_j $ and need to make a prediction, we could look at the conditional distribution of the dependent variable $ P(y | X_j) $ given the specific values of the independent variables. Then we could, for example, choose the value with the highest density (the mode) as our point prediction, or the expected value (or basically any other point value). The only parameter we would need to estimate is the bandwidth of the kernels for the conditional density estimate (or the bin width or number of bins if we use histograms for the conditional distribution instead). I guess the process might be rather cumbersome compared to more established methods, such as linear regression.
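A minimal sketch of what I have in mind, using statsmodels' conditional KDE (all data and names below are simulated stand-ins, and 'normal_reference' is just one bandwidth rule):

```python
# Conditional density p(salary | age, education, gender) via kernel smoothing,
# then a point prediction read off as the mode or mean of that density.
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariateConditional

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 65, n)
educ = rng.integers(0, 3, n)                       # ordered: school/college/graduate
gender = rng.integers(0, 2, n)                     # unordered 0/1
salary = 20000 + 800*age + 12000*educ + 4000*gender + rng.gamma(2, 5000, n)

# 'c' = continuous, 'o' = ordered discrete, 'u' = unordered discrete
kde = KDEMultivariateConditional(
    endog=salary, exog=np.column_stack([age, educ, gender]),
    dep_type='c', indep_type='cou', bw='normal_reference')

# Density of salary at the conditioning point Age=43, Education=college, Gender=0
y_grid = np.linspace(salary.min(), salary.max(), 500)
x_new = np.tile([43.0, 1.0, 0.0], (len(y_grid), 1))
dens = kde.pdf(endog_predict=y_grid, exog_predict=x_new)

mode_pred = y_grid[np.argmax(dens)]                                  # highest-density value
mean_pred = np.trapz(y_grid*dens, y_grid) / np.trapz(dens, y_grid)   # conditional mean
```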

However, would it not in theory be more accurate, that is, closer to the true values we want to predict? Does this approach make any sense to you, and is my thought process for making an ideal prediction correct, or am I blind to some mistaken assumptions or practical limitations?

  • My first thought is: in the context of predicting a continuous variable like salary, what do you mean exactly by conditional probability in "why do we not simply settle for the conditional probability instead of the expected value"?
    – Adrià Luz
    Commented Nov 19, 2022 at 15:05
  • @AdriàLuz Thanks for your remark. You're right, conditional probability density would formally be more accurate. I meant the conditional probability density function $ f_{y | X_j}(y | X_j) $ of salary given the independent variables, so, e.g., $ P( Salary | Age = 43, Education Level = College, Gender = F ) $. However, in case the data are sparse with regard to some combinations of variable values, this clearly limits the application of conditional probability density functions. Hope this answers your question.
    – This_is_it
    Commented Nov 19, 2022 at 15:22
  • You almost never have the true conditional distribution available, so you are forced to use a statistical model. (Whether it is a parametric one like linear regression or a nonparametric one like kernel density estimation is a secondary issue.)
    – Richard Hardy
    Commented Nov 19, 2022 at 15:50
  • @RichardHardy I agree. However, we can still estimate the conditional probability distribution from data and samples, instead of directly going for the conditional expected value (or another point prediction). Nor do we have the true conditional expectation; we only estimate it using, e.g., a regression model. However, I assume you're implying that we usually cannot model the conditional distribution because we don't have enough data to reliably cover all combinations of independent variables. I nevertheless wonder if there are also other reasons not to use the conditional distribution.
    – This_is_it
    Commented Nov 20, 2022 at 10:10
  • I was not really trying to imply anything about dimensionality. Rather, I was making a point about terminology. It does not answer your question, however, so I posted it as a comment.
    – Richard Hardy
    Commented Nov 20, 2022 at 11:44

3 Answers

---

Before getting to your key question, I think you might be labouring under some misapprehensions about what a regression model does. A standard regression model does in fact give a full stipulation of the distribution of the response variable conditional on the explanatory variables. Often the model is decomposed into a regression function giving the conditional expectation plus an error term measuring the deviation of the response from its expected value under the model. This is still a full stipulation of the conditional distribution.

As an example, consider the Gaussian linear regression model:

$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k} + \varepsilon_i, \qquad \varepsilon_i \sim \text{IID N}(0, \sigma^2).$$

This is just a decomposed way of presenting the joint conditional distribution:

$$\mathbf{y} | \mathbf{X} \sim \text{N} ( \mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n ).$$

So as you can see, it is not accurate to assert that a regression model only models the conditional expectation and not the full conditional distribution. It models the full conditional distribution, but of course it may impose constraints on the form of that distribution such that it does not give the true conditional distribution. In a Gaussian linear regression model, the form of the conditional distribution is set as the Gaussian distribution, so if the true conditional distribution is non-Gaussian then the model will not adapt to this, irrespective of the amount of data you use. That is one of the drawbacks of using a narrow parametric model.
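To see concretely that a fitted Gaussian regression delivers an entire estimated conditional distribution, not just a point, here is a small sketch (the data are simulated purely for illustration):

```python
# After fitting, the model implies a full conditional distribution N(mu, sigma^2)
# at any conditioning point, from which any functional can be extracted.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = 2 + 3*x + rng.normal(0, 1.5, 500)

fit = sm.OLS(y, sm.add_constant(x)).fit()
x0 = 4.0
mu = fit.params[0] + fit.params[1]*x0       # estimated conditional mean at x0
sigma = np.sqrt(fit.scale)                  # estimated error standard deviation

cond = stats.norm(mu, sigma)                # the full estimated conditional distribution
print(cond.mean(), cond.ppf([0.05, 0.95]))  # mean, quantiles, or any other functional
```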


It is of course possible to use broader models with parametric distributional families that have a larger number of parameters, and these are more flexible in adapting to the true conditional distribution as $n \rightarrow \infty$. For example, if you were to use a scaled Student's t-distribution for the error term in a regression model, it would be more flexible than the Gaussian distribution and could capture variation in the kurtosis of the error terms. You can generalise parametric models essentially without limit, so it is possible to add more parameters to give greater flexibility to the distributional form if you wish.
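A minimal sketch of such a t-error regression, fitted by maximum likelihood, might look as follows (the data and starting values are invented for illustration; this is one simple way to do it, not the only one):

```python
# Linear regression with scaled Student-t errors, fitted by maximum likelihood.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = 2 + 3*x + 1.5 * stats.t(df=3).rvs(500, random_state=rng)   # heavy-tailed errors

def neg_loglik(theta):
    b0, b1, log_scale, log_df = theta          # log-parameters keep scale, df > 0
    resid = y - b0 - b1*x
    return -np.sum(stats.t.logpdf(resid, df=np.exp(log_df), scale=np.exp(log_scale)))

b1_0, b0_0 = np.polyfit(x, y, 1)               # OLS-type starting values
res = optimize.minimize(neg_loglik, x0=[b0_0, b1_0, 0.0, np.log(5.0)],
                        method='Nelder-Mead')
b0, b1 = res.x[:2]
scale, df = np.exp(res.x[2:])                  # error scale and degrees of freedom
```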

If you would like a model that adapts well to the data when the sample size is extremely large, it is useful to consider switching to a "nonparametric" model. In particular, it is possible to use nonparametric regression models where the conditional distribution adapts to the data as the amount of data within a small neighbourhood of a conditioning point tends to infinity. These models generally use distributional forms that are mixtures of parametric distributions, with the effective number of parameters growing without bound as $n \rightarrow \infty$. This lets them adapt to the true conditional distribution in the limit, so long as you have a large amount of data within a small "neighbourhood" of a conditioning point.
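As one example from this family, here is a sketch of a local-linear kernel regression using statsmodels (simulated data; the fixed bandwidth is an arbitrary choice for speed, and cross-validated selection via bw='cv_ls' would be the more principled default):

```python
# Local-linear kernel regression: the fit near each point is driven by the data
# in a neighbourhood of that point, so it adapts as local data accumulate.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1*x**2 + rng.normal(0, 0.5, 300)   # nonlinear ground truth

kr = KernelReg(endog=y, exog=x, var_type='c', reg_type='ll', bw=[0.4])
y_hat, _ = kr.fit(np.linspace(0, 10, 50))            # conditional-mean estimate on a grid
```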

Finally, it is worth noting that in regression problems you are not just trying to find a single conditional distribution (i.e., for a single conditioning point). You are trying to find the conditional distribution over a large space of possible values of the conditioning variables. Usually this space is about the size of some finite-dimensional Euclidean space, which is huge compared to the size of your dataset. This means that it is infeasible to obtain good nonparametric inferences over the entire space. Even if you had a countably infinite dataset, this would still be tiny with respect to the size of the conditioning space; it might only give 0 or 1 data points in the neighbourhood of any given conditioning point.
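A back-of-the-envelope calculation illustrates how quickly neighbourhoods empty out as the dimension of the conditioning space grows (the neighbourhood size and sample size are arbitrary choices):

```python
# Expected number of points from a sample of one million uniform points in [0,1]^d
# that land in an axis-aligned cube of side 0.1 around a query point: n * 0.1**d.
n = 1_000_000
for d in (1, 3, 5, 10):
    print(d, n * 0.1**d)   # 100000, 1000, 10, then 0.0001 expected neighbours
```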

  • Thanks for your explanations. I guess we are talking about two different things here, which I'll quickly try to disentangle: With a standard linear regression model, we simply model and predict the expected value $ E[y | X_j] $, a point prediction, not the full conditional probability distribution $ P( y | X_j ) $. Otherwise, you'd get multiple prediction values when having $ (Salary | Age=43, Education Level=College, Gender=F) $, which clearly you do not.
    – This_is_it
    Commented Nov 20, 2022 at 8:43
  • It is true that for Gaussian joint probability distributions this still models the full conditional distribution (via the right-hand side of the equation, while the response prediction itself is only the expected value). But this works solely because joint normal distributions are fully described by their first two moments, mean and variance (including the covariance between response and predictors). This is what I meant by additional assumptions imposed by the statistical model, for which we do not necessarily have prior information from the data alone.
    – This_is_it
    Commented Nov 20, 2022 at 8:48
  • I agree with you that we'd need a very large data set, growing exponentially with the number of additional predictors, to estimate the true conditional probability distribution of the response. I tried to mention this in my question above when I referred to the curse of dimensionality. And in practice, we may indeed often only have data that is too sparse when looking at more than a few dimensions.
    – This_is_it
    Commented Nov 20, 2022 at 8:58
  • @This_is_it, "Otherwise, you'd get multiple prediction values <...>, which clearly you do not." I think you might have misunderstood the answer. The first few paragraphs explain that you actually do get an entire conditional distribution. Now if you choose to extract a single scalar-valued function of it, you get a single scalar prediction. But you do not have to.
    Commented Nov 20, 2022 at 11:39
  • @Ben Ok, I guess I see your point that the standard linear regression model does still model the conditional distribution due to the stochastic nature of the error terms, and only taking the conditional expectation makes it a point prediction. So we would theoretically get multiple values for the same realisations of the independent variables due to the stochastic error terms. But as said previously, this imposes strong assumptions on the nature of the data we have and its distribution, as well as on the distribution of the error terms.
    – This_is_it
    Commented Nov 20, 2022 at 20:47
---

Ultimately the question is how to optimally use the available information. In particular, if we want to estimate the "predictive distribution", i.e., the conditional distribution you are asking about at a certain multivariate $X$, what information is relevant for this? Your suggestion in its simplest form seems to amount to using only the available observations at the same $X$, i.e., to estimating $P(y|X)$ by the empirical distribution of $y$ at $X$, or, say, a kernel density estimator. This (a) requires that there are enough observations at $X$ already to estimate the conditional distribution accurately, and (b) assumes that involving observations at other values of $X$ would be more harmful than useful.

(a) can obviously only work if the number of $X$-values at which observations occur is very limited: you can't do this if a previously unseen $X$ comes up, and you won't do very well if a given $X$ has been seen only a very small number of times. If you're implicitly assuming that for every $X$ there are already many observations, that may not be so much of an issue. But note that even in big data sets, if there are many (even discrete) variables, the number of possible combinations of outcomes that completely define an $X$-vector can be huge, and much larger than the number of observations, even if the latter is pretty large as well.
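A quick illustrative calculation (all numbers are made up) shows the scale of the problem:

```python
# 20 discrete predictors with 10 levels each define 10**20 possible X-vectors;
# even a billion observations can visit at most a vanishing fraction of them.
n_levels, n_vars, n_obs = 10, 20, 10**9
cells = n_levels ** n_vars
print(n_obs / cells)       # 1e-11: almost every X-vector has zero observations
```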

In any case, the idea of regression modelling is that you may want to use more information, namely the observations at other $X$-values, because the overall pattern of dependence of $y$ on all $X$ is also a good thing to consider if you want to predict (or even estimate the full conditional distribution) at a specific $X$. If you don't have enough observations to apply your idea at every $X$ at which you may want to predict (which is the rule rather than the exception), it is clear that there is no way around such modelling.

If however there is enough information, there are pros and cons. Regression modelling allows you to take into account the information from all observations and will therefore improve your prediction and/or estimation of the conditional distribution if the regression model is good enough; it may do harm otherwise, if model assumptions are imposed that are not appropriate. If you really have enough observations at every single $X$, you are in a situation where you can empirically compare your approach to a regression modelling approach, and what you suggest may or may not do better, depending on the exact situation (a toy comparison is sketched below). However, as I wrote above, I believe that this is rather an exceptional situation, and more often than not you'll need modelling because you want to predict at some $X$ where your training data have no or only very few observations.
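As a toy version of such a comparison, here is a sketch on simulated data, pitting per-cell empirical means against a linear regression (every name and number below is invented; with roughly 2.5 training observations per cell, the cell means are very noisy and OLS typically wins here, because the linear model happens to be correct):

```python
# Compare the "empirical conditional distribution" predictor (per-cell means)
# with OLS pooling information across all cells.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({'x1': rng.integers(0, 20, n),
                   'x2': rng.integers(0, 20, n)})       # 400 possible X-cells
df['y'] = 1.0*df['x1'] + 2.0*df['x2'] + rng.normal(0, 3, n)
train, test = df.iloc[:1000], df.iloc[1000:]

# Predictor 1: empirical mean of y in each (x1, x2) cell, overall mean as fallback
cell_means = train.groupby(['x1', 'x2'])['y'].mean().rename('cell_mean')
merged = test.merge(cell_means.reset_index(), on=['x1', 'x2'], how='left')
pred_cells = merged['cell_mean'].fillna(train['y'].mean())

# Predictor 2: ordinary least squares using all training observations at once
ols = sm.OLS(train['y'], sm.add_constant(train[['x1', 'x2']])).fit()
pred_ols = ols.predict(sm.add_constant(test[['x1', 'x2']]))

print('cell-mean MSE:', np.mean((pred_cells.values - merged['y'].values)**2))
print('OLS MSE:      ', np.mean((pred_ols.values - test['y'].values)**2))
```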

PS: In some situations, for example in designed experiments with only two or a few available levels per variable and few variables, enough observations at every single $X$-vector may indeed be available. However, modelling may still be of interest because the model output gives you additional information of interest, particularly regarding separating main effects from interactions, and it defines interpretable parameters for these. You don't get this if you only look at every single $X$ separately. Modelling also gives you a handle on a potential dependence structure between observations.

---

> Why would we ever set up and estimate a statistical prediction model if we could just use the conditional distribution instead (given that we have enough data)?

You are assuming here that you know the distribution. This is usually not true.

For a start, let's assume that you know the distribution. The expected value is a prediction for $y$ (conditional on $X$); the probability distribution tells you about the probability of observing different values of $y$. To make a prediction you would need to calculate the expected value, the mode, or something else. There usually won't be a closed-form solution for this value, so even when you know the distribution, you need some way to estimate or approximate the answer from it.
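For instance, even with a fully specified density in hand, extracting the mean or mode is typically a numerical exercise (the gamma density below is just an arbitrary stand-in for some known skewed $p(y|x)$):

```python
# Computing point predictions (mean, mode) numerically from a known density.
import numpy as np
from scipy import integrate, optimize, stats

pdf = stats.gamma(a=2.0, scale=1.5).pdf        # stand-in for a known p(y | x)

mean, _ = integrate.quad(lambda y: y * pdf(y), 0, np.inf)   # conditional mean
mode = optimize.minimize_scalar(lambda y: -pdf(y),
                                bounds=(0, 20), method='bounded').x
```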

But, as said before, usually you don't know the distribution. This means that you would first need to estimate the distribution and then use it to approximate the value you want. That is two layers of approximation (linear regression estimates its point prediction directly, in one step).

The next problem is that you want to use a nonparametric method to estimate the distribution. Such methods are very flexible, so they can approximate almost any distribution, but they can also easily overfit the training data and fail to generalize beyond it. To avoid this, you need much more data than simpler models require. And since we are talking about a multivariate distribution, the curse of dimensionality means much more data still.

Finally, you want to use kernel density estimation. You say that this "only" requires finding the bandwidth. In fact, bandwidth selection is a very hard problem, even harder for multivariate distributions. Because of this, people usually rely on rules of thumb, which do not guarantee optimal results, and the kernel density estimator is very sensitive to the choice. Taken together, kernel density estimation may give you only a very rough approximation of the true distribution. Another problem is that vanilla kernel density estimation does not scale well with data size, so with more data you usually need methods that approximate it, leading to a third layer of approximation.
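A small sketch of that sensitivity (the data and bandwidth choices are invented; note that scalar bw_method values in scipy act as scale factors on the data's standard deviation, not as raw bandwidths):

```python
# The estimated density, and hence its mode, moves around as the bandwidth changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 0.5, 200)])  # bimodal

grid = np.linspace(-4, 9, 400)
for bw in ('silverman', 'scott', 0.05, 1.0):
    kde = stats.gaussian_kde(y, bw_method=bw)
    print(bw, grid[np.argmax(kde(grid))])      # location of the estimated mode
```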

So you are correct that if you knew the full distribution, it would give you much richer information than the point estimate alone. This is one of the upsides of Bayesian models, which aim to find distributions for the predictions rather than point estimates. The problem is that you usually don't know the distribution, and estimating it is a non-trivial problem, much harder than finding the expected value.

So it's like asking "should I go there in a Lamborghini or by bike?" Sure, if you have a Lamborghini, use it, but do you? It also has its costs, like fuel and maintenance, so in some cases the bike would still be just enough.

