
In binary classification problems, it seems the F1 score is often used as a performance measure. As far as I've understood, the idea is to find the best tradeoff between precision and recall. The formula for the F1 score is symmetric in precision and recall. However (and that's what bothers me), there is an asymmetry between the two quantities: recall is a property of the classifier that is independent of the prior probabilities (class prevalence), whereas precision does depend on them.

Can anyone tell me what's so special about the combination of precision and recall? Why don't we use precision (i.e. the positive predictive value) together with the negative predictive value?


2 Answers


The F1 score weights precision and recall equally, but there are easy generalizations to any case where you consider recall $\beta$ times more important than precision. See https://en.wikipedia.org/wiki/F1_score:

$F_\beta = (1 + \beta^2) \, \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$

F1 is just a harmonic mean. A simple (arithmetic) mean doesn't quite make sense because precision and recall have the same numerator (true positives) but different denominators (test positive vs. condition positive), so only a harmonic mean makes sense. I don't know if there's more theory to it than that -- it's the simplest weighted mean that makes sense.
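As a minimal sketch (plain Python, no libraries; the `f_beta` helper is my own illustration, not a standard API), here is the formula above in code, contrasted with the arithmetic mean:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta > 1 weights recall more heavily; beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.9, 0.5
print(f_beta(precision, recall))          # F1 ~ 0.643 (harmonic mean, pulled toward the weaker value)
print((precision + recall) / 2)           # arithmetic mean = 0.7, more forgiving of the low recall
print(f_beta(precision, recall, beta=2))  # F2 ~ 0.549, beta=2 moves the score closer to recall
```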

I think I get the gist of your point, which I paraphrase as: precision has "test positive" in the denominator, so it is quite sensitive to how much the classifier marks as positive. For this reason you don't so often see, for example, precision-recall curves. You see ROC curves, which are recall-specificity curves (true positive rate vs. false positive rate).

That's closer to what you suggest, but you're suggesting PPV vs. NPV. Sure, that could be valid depending on your use case, but I think the argument tends to cut the other way: toward recall-specificity, not precision-NPV.
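To make the prevalence dependence concrete, here is a small sketch (plain Python; the helper function is mine, not from any library) that holds recall and specificity fixed and shows how precision shifts as the class prior changes, via Bayes' rule:

```python
def precision_from_rates(recall, specificity, prevalence):
    """Precision (PPV) implied by fixed recall and specificity at a given prevalence:
    PPV = TPR*p / (TPR*p + FPR*(1-p))."""
    fpr = 1 - specificity
    return recall * prevalence / (recall * prevalence + fpr * (1 - prevalence))

recall, specificity = 0.8, 0.9   # fixed properties of the classifier
for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_from_rates(recall, specificity, prevalence)
    print(f"prevalence={prevalence:.2f}  precision={ppv:.3f}")
# prevalence=0.50  precision=0.889
# prevalence=0.10  precision=0.471
# prevalence=0.01  precision=0.075
```

Recall and specificity stay at 0.8 and 0.9 throughout; only precision moves, which is exactly the asymmetry raised in the question.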


If it is purely a binary classification problem (class A vs. class B), then the benefit of the F-score lies primarily in characterizing performance on an imbalanced data set (more instances of one class than the other), and your question/concern is more relevant. The Wikipedia page for the F-score states:

"Note, however, that the F-measures do not take the true negatives into account, and that measures such as the Phi coefficient, Matthews correlation coefficient, Informedness or Cohen's kappa may be preferable to assess the performance of a binary classifier."

But if the classifier is intended to be a detector, one is usually more interested in performance with respect to the target class (positive) than the non-target class (negative). Furthermore, the target class is often the one that is under-represented in the data set. In that context, I think it is more intuitive to want to know what fraction of targets are detected (recall) and how reliable/confident each detection is (precision). While knowing how good the detector is at not flagging non-targets (negative predictive value) can have value, it is not a very insightful quantity when trying to characterize the performance of a target detector on an imbalanced data set.

In short, the F-score tuning parameter ($\beta$) provides a more intuitive way to balance the importance of detecting all the targets (high recall) against the importance of having high-confidence detections (high precision). Note also that the F-score can be written in terms of Type I and Type II errors (see the Wikipedia link above).
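As a rough illustration (the counts and helper function below are made up for this sketch, not taken from the answer), consider an imbalanced detection problem: with few targets, NPV is close to 1 almost by default, while precision, recall, and $F_\beta$ still discriminate between detectors:

```python
# Hypothetical imbalanced detector: 100 targets among 10,000 cases.
tp, fn = 70, 30      # 70 of the 100 targets are detected
fp, tn = 50, 9850    # 50 false alarms among the 9,900 non-targets

precision = tp / (tp + fp)   # 0.583 -- how reliable each detection is
recall    = tp / (tp + fn)   # 0.700 -- what fraction of targets are found
npv       = tn / (tn + fn)   # 0.997 -- nearly 1 simply because negatives dominate

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f"precision={precision:.3f}  recall={recall:.3f}  NPV={npv:.3f}")
for beta in (0.5, 1.0, 2.0):   # beta > 1 emphasizes recall, beta < 1 emphasizes precision
    print(f"F_{beta} = {f_beta(precision, recall, beta):.3f}")
```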
