
I have a binary text classification model, and I would like to test how well it works, in terms of precision and recall, on a new dataset of 2 million text documents that have not been annotated yet. Given my background knowledge about this new dataset, I expect Positive-class documents to be relatively rare.

I applied the model to the dataset, and it predicted that only 0.5% of the documents belong to the Positive class. I am now wondering how to estimate precision and recall without taking a large random sample, given the potential rarity of the Positive class. For precision, I suppose I could take a sample from the portion of documents the model predicted to be Positive, but I am not sure how to go about estimating recall.
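To make the precision part concrete, here is a minimal sketch of what I have in mind (the document ids and the annotation labels below are hypothetical and simulated, not from my actual pipeline): draw a random sample only from the predicted-Positive documents, have it annotated, and report precision with a binomial confidence interval. This only covers precision; recall is still the open question.

```python
import math
import random

# Hypothetical stand-in for the ids of the ~10,000 documents (0.5% of 2M)
# that the model predicted Positive; in practice this comes from the model.
predicted_positive_ids = list(range(10_000))

# Draw the random sample that would be sent to human annotators.
random.seed(0)
sample_size = 400
sample_ids = random.sample(predicted_positive_ids, sample_size)

# In practice these 0/1 labels come from manually annotating `sample_ids`;
# here they are simulated with an assumed true precision of 0.7.
annotated = [1 if random.random() < 0.7 else 0 for _ in sample_ids]

# Point estimate of precision and a normal-approximation 95% interval.
p_hat = sum(annotated) / sample_size
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
lo, hi = max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)
print(f"precision ~ {p_hat:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```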

  • If you trust the $0.5\%$ figure to be accurate, perhaps from a different study, then you can use that. But if it is only a rough order of magnitude, then large samples may be the way to go. – Henry, Jul 9 at 0:21
  • I would be very careful about metrics like these, especially (but not only) in this "unbalanced" case; they all suffer from the same issues as accuracy (stats.stackexchange.com/q/312119/1352). Both the "hard" classifications and the metrics will be highly sensitive to how you set the threshold. Do consider proper scoring rules. And I'm with @Henry here: if you have a very weak signal, you will likely need a large sample size. – Jul 9 at 6:59
  • Is your classification model only capable of class predictions, or can you retrieve probabilities or some other measure of confidence? – Jul 9 at 12:45
  • @Henry, the 0.5% is just the positive prediction rate, so AFAICT it doesn't say anything about either precision or recall? – Jul 9 at 12:47

