In ML, we often talk about the bias-variance tradeoff, and how increasing model complexity both reduces bias and increases variance. I understand why increasing model complexity reduces bias at first, but it's less clear to me why this should continue once you get into overfitting territory.

Here is the formula for the bias: $\operatorname{Bias}_D\big[\hat{f}(x;D)\big] = \operatorname{E}_D\big[\hat{f}(x;D)- f(x)\big]$. As you increase model complexity past a certain point (ignoring double descent), the model starts to heavily overfit its training data, and starts to make wild, increasingly incorrect predictions on much of the rest of the data distribution. Why would this reduce the bias on the entire data distribution?
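
To make concrete what I mean by the expectation over $D$, here is a rough sketch of how I picture estimating the bias at a single point by resampling training sets (a toy setup: the sine true function, noise level, and use of numpy.polyfit are just illustrative choices):

```python
# Toy sketch: estimate Bias_D[f_hat(x0; D)] by averaging the fitted model's
# prediction at x0 over many freshly sampled training sets D.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed "true" function
x0, degree = 0.37, 9                         # a test point and a fairly complex model

preds = []
for _ in range(2000):
    X = rng.uniform(0, 1, 25)                # a fresh training set D
    y = f(X) + rng.normal(0, 0.3, X.size)    # noisy labels
    coefs = np.polyfit(X, y, degree)         # fit f_hat(.; D)
    preds.append(np.polyval(coefs, x0))      # its prediction at x0

print(np.mean(preds) - f(x0))                # estimated bias at x0
```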

In the post "Does bias eventually increase with model complexity?", the accepted answer claims that the average of the model's predictions across training sets sampled from the data distribution will average to something close to the true values. But it's not clear to me why this is true, and why it can't average to something else.

1 Answer


> claims that the average of the model's predictions across training sets sampled from the data distribution will average to something close to the true values.

This is essentially the definition of model bias. With more "complex" models, we widen the set of functions the fitted model could return. Think of polynomial degree, for example: moving from a linear function to a quadratic one, it becomes more likely that you can get a fitted function that matches the training data well.

As for why the average prediction must be close to the true values: imagine your fitted prediction function were "off" by some amount on average (it "averaged to something else"). The amount by which you are off can itself be expressed as some function of the input $X$. You could then add/subtract that function from your current prediction function; you would no longer be "off", and your model's predictions across training sets would average to the true values.
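
Spelling that argument out as a short sketch: if the average prediction misses $f$ by some fixed function $b$ of the input, then subtracting $b$ gives an unbiased predictor,
$$
\operatorname{E}_D\big[\hat{f}(x;D)\big] = f(x) + b(x)
\quad\Longrightarrow\quad
\operatorname{E}_D\big[\hat{f}(x;D) - b(x)\big] = f(x) \ \text{ for every } x,
$$
so any systematic error that can itself be written as a function of $x$ can, in principle, be corrected for.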

But usually you will have to increase model complexity to do this, e.g. add an additional parameter.
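
A rough simulation sketch of the overall picture (same toy setup as the one sketched in the question: a sine true function, noisy samples, numpy.polyfit; all of it illustrative rather than anything from the linked post): as the degree grows, the individual fits get wilder, yet their average across training sets stays near the true function, so the estimated bias stays small while the variance blows up.

```python
# Rough sketch: estimate bias^2 and variance of polynomial fits as the degree grows,
# averaging over many resampled training sets.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)                  # assumed true function
x_grid = np.linspace(0.05, 0.95, 40)                 # test points from the distribution

def bias_var(degree, n_sets=1000, n_train=25, noise=0.3):
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        X = rng.uniform(0, 1, n_train)               # a fresh training set D
        y = f(X) + rng.normal(0, noise, n_train)
        preds[i] = np.polyval(np.polyfit(X, y, degree), x_grid)
    avg = preds.mean(axis=0)                          # E_D[f_hat(x; D)] on the grid
    return np.mean((avg - f(x_grid)) ** 2), preds.var(axis=0).mean()

for d in (1, 3, 9, 15):
    b2, var = bias_var(d)
    print(f"degree {d:2d}: bias^2 ~ {b2:.4f}  variance ~ {var:.4f}")
```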

  • I understand that is the definition of model bias. This argument about how you could just add/subtract that function applies to the training dataset, but not necessarily to the rest of the distribution. I don't understand why it needs to average out to the right value in the rest of the distribution.
    – user35734
    Commented Jun 10 at 21:55
