$\begingroup$

In my experience, binary classifiers tend to do better in terms of F1 score when the class imbalance is at least reduced. However, this leads to over-predicting the minority class in the test data.

Example: if both the training and the test data have about 20% dogs and 80% cats, and the dog/cat classifier is trained with an adjustment for class imbalance, it will tend to predict the proportion of dogs in the test data to be well above 20%, perhaps even ~40%.

In applications where the interest is not just each individual data point's prediction but the aggregate number predicted positive, this over-prediction seems to create problems. For example, if "defaulting on a loan" is both the positive label and the minority class at ~5% of the data, a classifier that predicts a 15% default rate on the test data seems unusable.

Question: Is it possible to address class-imbalance in a way that on the test data: % positive ~ % predicted positive?

If this trade-off (better F1 with class-imbalance adjustment vs. matching the % positive on the test data) is unavoidable, are there references / guides to think this through?

$\endgroup$

1 Answer

$\begingroup$

"binary classifiers tend to do better in terms of F1 score when the class imbalance is at least reduced. However, this leads to over-predicting in the test data"

This suggests that you need to work out which performance metric(s) are relevant for your application. F1 and accuracy (cf. "over-predicting") do not measure the same thing, and optimising one will generally make the other worse. Both accuracy and F1 depend on the threshold probability at which a pattern is assigned to the positive class, so you have directly conflicting performance measures.

Regarding the "overpredicting" issue, this is indeed expected behaviour, see my answer to a related question here for an explanation as to why this is the case.

My recommendation would be to use a probabilistic classifier; you can then apply different thresholds depending on whether you are optimising accuracy or F1. Also use the log-loss or Brier score to assess the calibration of the probabilities, and the Area Under the Receiver Operating Characteristic curve (AUROC) to assess the ranking of patterns according to their tendency to belong to the positive class. It is a good idea to use metrics that assess different aspects of the model's behaviour, not just ones that depend heavily on the threshold (like F1, accuracy, or expected loss).
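The contrast can be made concrete with a small sketch (toy labels and probabilities invented for illustration, standing in for any probabilistic classifier's output): the Brier score and log-loss are computed directly from the probabilities, while F1 changes as the decision threshold moves.

```python
import numpy as np

# Toy true labels (~20% positive) and predicted P(positive|x) -- hypothetical
# values standing in for the output of a probabilistic classifier.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_pos = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.40, 0.25, 0.35, 0.70, 0.55])

# Threshold-free metrics assess the probabilities themselves:
brier = np.mean((p_pos - y_true) ** 2)                      # calibration + refinement
log_loss = -np.mean(y_true * np.log(p_pos)
                    + (1 - y_true) * np.log(1 - p_pos))     # calibration-sensitive

# A threshold-dependent metric changes as the threshold moves:
def f1_at(threshold):
    y_hat = (p_pos >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(f1_at(0.5), f1_at(0.3))  # same probabilities, different F1 per threshold
```

The probabilities (and hence Brier score and log-loss) are fixed once the model is fitted; only the decision rule on top of them moves when you trade F1 against accuracy.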

As to imbalance: if you have a very small dataset, then imbalance can lead to an undue bias against the minority class. However, in that case the underlying problem is that you have too little data to adequately characterise the distribution of the minority class, so the only real fix is to gather more (real) data. As you provide more data, this undue bias rapidly disappears. In most problems, class imbalance is a non-problem and you don't need to do anything about it. Often the real issue is cost-sensitive learning: the costs of false-positive and false-negative errors are not the same, and they should be built into the decision rule (and the performance metric).

Note that the optimal information for classification problems is the posterior probability of class membership, $P(C|x) \propto P(x|C)P(C)$. The prior probabilities (class frequencies), $P(C)$, are therefore a component of the optimal decision rule, so if you alter them (e.g. by balancing) you are throwing away useful information.
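This also suggests a way to get the aggregate % predicted positive back in line: if you do train on artificially balanced data, the standard Bayes-rule correction re-weights each posterior by the ratio of true to training priors and renormalises. A minimal sketch, assuming a model trained with a 50/50 balanced prior deployed where the true positive rate is 5% (`correct_prior` is an illustrative helper name, not a library function):

```python
def correct_prior(p, pi_train, pi_true):
    """Adjust an estimated P(C=1|x) for a mismatch between the class prior
    in the training data (pi_train) and the true deployment prior (pi_true),
    by re-weighting with pi_true/pi_train and renormalising."""
    num = p * pi_true / pi_train
    den = num + (1.0 - p) * (1.0 - pi_true) / (1.0 - pi_train)
    return num / den

# A pattern scored at 0.5 by a model trained on balanced (50/50) data
# corresponds to a much smaller probability once the true 5% prior is restored:
print(correct_prior(0.5, pi_train=0.5, pi_true=0.05))  # → 0.05
```

Averaging the corrected probabilities over the test set then tracks the true % positive far better than thresholded predictions from the balanced model.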

Whatever you do, don't use SMOTE.

$\endgroup$
