I have a binary text classification model, and I would like to test how well it works, in terms of precision and recall, on a new dataset of 2 million text documents that have not been annotated yet. Given my background knowledge about this new dataset, I expect Positive-class documents to be relatively rare.
I applied the model to the dataset, and it predicted that only 0.5% of the documents belong to the Positive class. Now I wonder what a good way might be to estimate precision and recall without having to take a large random sample, given the potential rarity of the Positive class. For precision, I suppose I could take a sample from the documents the model predicted to be Positive, but I am not sure how to go about estimating recall.
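To make the precision part of my idea concrete, here is a minimal sketch of what I have in mind (the labels below are simulated stand-ins for a manual annotation pass over a random sample of the predicted-Positive documents; the sample size of 500 is just an example):

```python
import math
import random

def estimate_precision(sampled_labels, z=1.96):
    """Estimate precision from a manually annotated random sample of the
    model-predicted-Positive documents, with a normal-approximation 95% CI.

    sampled_labels: list of 0/1, where 1 means the document is truly Positive.
    """
    n = len(sampled_labels)
    p_hat = sum(sampled_labels) / n          # fraction of sampled predictions that are correct
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Simulated annotation of 500 docs drawn from the ~10,000 predicted Positives;
# here 80% of the model's Positive predictions happen to be correct.
random.seed(0)
sample = [1 if random.random() < 0.8 else 0 for _ in range(500)]
p, (lo, hi) = estimate_precision(sample)
print(f"precision ~ {p:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

This only covers precision, of course; recall would additionally require an estimate of how many true Positives the model missed among the 99.5% it predicted Negative, which is the part I am unsure about.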