Dear statistics experts,
I need your help with something that has bothered me for a while now. My problem revolves around perfect prediction and essentially boils down to:
Why would we ever set up and estimate a statistical prediction model if we could just use the conditional distribution instead (given that we have enough data)?
An example:
Let's assume our goal is to predict a person $ i $'s salary $ y_i $ (dependent variable) as accurately as possible and, to keep matters simple, we have three independent variables $ X_j $: age ($ X_{j=1} $), education level ($ X_{j=2} $), and gender ($ X_{j=3} $). The sample we gathered is large and the people were randomly selected (i.i.d.). So, a classical regression problem. We evaluate prediction accuracy by the mean squared error (MSE), hence our loss metric is:
$$ \frac{1}{N} \sum_{i=1}^{N} ( \hat{y}_i - y_i )^2 $$
The most common way to solve this prediction problem would be to fit a function predicting salary $ y_i $ from our three independent variables on a training sample. We then compare our predicted salaries $ \hat{y}_i $ to actual salaries $ y_i $ in a (previously unseen) test sample to compute the prediction errors and evaluate how far off we were. Formally, this usually amounts to:
$$ \hat{y}_i = E[y | X_j] = f(X_j)$$
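As a side note on why the squared-error loss singles out the conditional mean: for any candidate point prediction $ c $ at fixed values of the independent variables, the expected squared error splits into two parts. This is the standard variance decomposition; the symbol $ c $ is just my notation for this sketch.

$$ E[(y - c)^2 | X_j] = \mathrm{Var}(y | X_j) + ( E[y | X_j] - c )^2 $$

The first term does not depend on $ c $, so the expression is smallest at $ c = E[y | X_j] $. Under a different loss the best point prediction changes; under absolute error, for example, it would be the conditional median.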
Once we have formulated the prediction problem as fitting a function for estimating $ y_i $, we could restrict the function to be a linear regression model, a neural network, or any other machine learning algorithm. If we simply go for a linear regression model, then
$$ E[y | X_j] = \alpha + \beta_1 X_{j=1} + \beta_2 X_{j=2} + \beta_3 X_{j=3} $$
Now for my problem:
Regardless of what statistical prediction model we choose, even the most sophisticated machine learning model, by fitting a prediction function we will always only make a point prediction, which in our case is the conditional expected value.
Why do we not simply settle for the conditional probability $ P(y | X_j) $ instead of the conditional expected value $ E[y | X_j] $ ?
The advantage of the conditional distribution would be that it gives us much more information about the relationship between the dependent and independent variables than a single point prediction. It converges in the limit to the true population distribution because it takes all information in the data into account, and we do not need to make any further assumptions about the kind of model or function we fit. So we cannot make the error of, e.g., fitting a linear function to a heavily skewed, heavy-tailed distribution. Nor do we need to calculate any confidence intervals, because we already have the entire conditional distribution. In essence, all the information we need about $ P(y | X_j) $ is already present in the data (sample). By fitting a function or statistical prediction model, we only impose additional assumptions, for example a joint normal distribution, which need not hold in the population and for which we have no more knowledge than what is reflected in the data in the first place.
Clearly, to evaluate prediction performance, we must in the end still restrict ourselves to a point prediction, since we otherwise cannot compute the MSE. But instead of going directly for the point prediction, we could derive any moment or other summary statistic we wish from the conditional distribution. The conditional distribution shows us all historically recorded values of the dependent variable salary given specific values of the independent variables, and not just the first moment (expected value). This has the advantage that we would be aware of heavy tails or skewness around our prediction, something we cannot see from point predictions alone.
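To make this concrete, here is a minimal sketch of what I have in mind, assuming the independent variables are coarse enough that exact matches exist in the sample. The records, category labels, and function names are all invented for illustration; nothing here is real data.

```python
import statistics
from collections import defaultdict

# Hypothetical toy records: (age_group, education_level, gender, salary).
# All values are made up purely for illustration.
records = [
    ("30-39", "MSc", "F", 52000),
    ("30-39", "MSc", "F", 55000),
    ("30-39", "MSc", "F", 58000),
    ("30-39", "MSc", "F", 61000),
    ("30-39", "MSc", "F", 120000),  # one high earner creates a right tail
    ("30-39", "BSc", "M", 45000),
    ("30-39", "BSc", "M", 47000),
]


def conditional_sample(records, key):
    """Return all salaries observed for one exact covariate combination."""
    groups = defaultdict(list)
    for age, edu, gender, salary in records:
        groups[(age, edu, gender)].append(salary)
    return groups.get(key, [])


def summarize(salaries):
    """Summaries of the empirical conditional distribution (None if empty)."""
    if not salaries:
        return None  # combination never observed: no prediction possible
    return {
        "n": len(salaries),
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "min": min(salaries),
        "max": max(salaries),
    }


summary = summarize(conditional_sample(records, ("30-39", "MSc", "F")))
print(summary)
```

In this toy group the mean (69200) sits well above the median (58000), which hints at the right tail that a single point prediction would hide. A combination that never occurs in the sample returns an empty list, so no prediction can be made for it.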
The disadvantage of using the conditional distribution would be the curse of dimensionality and the fact that we cannot make predictions outside the ranges covered by the historical data sample. Hence, we need a lot of data (ideally also at the boundaries). The curse of dimensionality implies that the number of data points in our sample must grow exponentially as the number of independent variables grows linearly. This is a major drawback of the conditional distribution approach. On the flip side, it also forces us to restrict ourselves to, e.g., the 3-5 or 5-10 most important independent variables.
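A rough back-of-the-envelope sketch of that growth, assuming every independent variable is split into the same number of bins and the observations are spread evenly over the cells (both simplifications). The sample size of one million and the choice of 10 bins are arbitrary, hypothetical numbers.

```python
def n_cells(bins_per_variable, n_variables):
    """Number of distinct covariate cells when every variable is binned."""
    return bins_per_variable ** n_variables


def avg_obs_per_cell(n_obs, bins_per_variable, n_variables):
    """Average observations per cell if the data were spread evenly."""
    return n_obs / n_cells(bins_per_variable, n_variables)


n_obs = 1_000_000  # hypothetical sample size
bins = 10          # arbitrary number of bins per variable

for d in (1, 3, 5, 10):
    print(f"{d:>2} variables: {n_cells(bins, d):>14,} cells, "
          f"{avg_obs_per_cell(n_obs, bins, d):,.4f} observations per cell")
```

With these numbers, three variables leave about 1,000 observations per cell, five variables leave about 10, and ten variables leave far less than one, so most cells would be empty.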
In practice and with real data, I guess using the conditional distribution for prediction problems would mean using non-parametric kernel density estimation. Whenever we receive new data points for the independent variables $ X_j $ and need to make a prediction, we could look at the conditional distribution of the dependent variable $ P(y | X_j) $ given the specific values of the independent variables. Then we could, for example, choose the value with the highest density (mode) as our point prediction, or the expected value (or basically any other point value). The only parameter we need to estimate would be the bandwidth of the kernels to get a conditional density estimate (or the bin width or number of bins if we choose to use histograms for the conditional distribution instead). I guess the process of doing this might be rather cumbersome in comparison to more established methods, such as linear regression.
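Here is a minimal sketch of how I imagine this, reduced to a single independent variable (age) and Gaussian kernels to keep it short. The toy data, the bandwidths `h_x` and `h_y`, and the function names are assumptions of mine; in a real application the bandwidths would have to be chosen from the data, for example by cross-validation.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def conditional_density(x_train, y_train, x0, y_grid, h_x, h_y):
    """Kernel estimate of p(y | x = x0), evaluated on y_grid.

    Each training point gets a weight based on how close its x is to x0;
    the density in y is the weighted sum of kernels centred at each y_i.
    """
    raw = [gaussian_kernel((x - x0) / h_x) for x in x_train]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no training points near x0; increase h_x")
    weights = [w / total for w in raw]
    density = [
        sum(w * gaussian_kernel((y - yi) / h_y) / h_y
            for w, yi in zip(weights, y_train))
        for y in y_grid
    ]
    return weights, density


def point_predictions(weights, y_train, y_grid, density):
    """Conditional mean (weighted average of y_i) and mode on the grid."""
    mean = sum(w * yi for w, yi in zip(weights, y_train))
    mode = y_grid[max(range(len(y_grid)), key=density.__getitem__)]
    return mean, mode


# Hypothetical toy data: age in years, salary in thousands.
x_train = [30, 30, 30, 30, 30, 50, 50]
y_train = [50, 52, 54, 56, 120, 80, 82]
y_grid = list(range(40, 131))  # salary values at which to evaluate

weights, density = conditional_density(
    x_train, y_train, x0=30, y_grid=y_grid, h_x=2.0, h_y=3.0
)
mean, mode = point_predictions(weights, y_train, y_grid, density)
print(f"conditional mean: {mean:.1f}, conditional mode: {mode}")
```

In this toy example the conditional mean at age 30 is pulled up to about 66.4 by the single salary of 120, while the mode stays at 53, so the two point predictions differ noticeably. The weighted average used for the mean is the same quantity as the Nadaraya-Watson kernel regression estimate, so at least for the expected value this approach coincides with an existing non-parametric regression method.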
However, would it not in theory be more accurate, that is, closer to the true values we want to predict? Does this approach make sense to you? Is my thought process for making an ideal prediction correct, or am I blind to some mistaken assumptions or practical limitations?