
I tried to read the docs for RepeatedStratifiedKFold and StratifiedKFold, but couldn't tell the difference between the two methods except that RepeatedStratifiedKFold repeats StratifiedKFold n times with different randomization in each repetition.

My question is: Do these two methods return the same results? Which one should I use to split an imbalanced dataset when doing GridSearchCV and what is the rationale for choosing that method?

1 Answer

Both StratifiedKFold and RepeatedStratifiedKFold can be very effective when used on classification problems with a severe class imbalance. They both stratify the sampling by the class label; that is, they split the dataset in such a way that approximately the same class distribution (i.e., the same percentage of samples of each class) is preserved in each subset/fold as in the original dataset. However, a single run of StratifiedKFold might yield a noisy estimate of the model's performance, as different splits of the data can produce very different scores. That is where RepeatedStratifiedKFold comes into play.
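For illustration, here is a minimal sketch showing that each StratifiedKFold test fold keeps roughly the same class ratio as the full dataset. The synthetic imbalanced dataset and the parameter values are assumptions made for this example, not part of the question:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    # Synthetic imbalanced dataset: roughly 95% class 0, 5% class 1
    X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        # Each test fold preserves approximately the original class ratio (~5% positives)
        print(f"fold {fold}: positive rate = {y[test_idx].mean():.3f}")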

RepeatedStratifiedKFold improves the estimate of a machine learning model's performance simply by repeating the cross-validation procedure multiple times (according to the n_repeats value) and reporting the mean result across all folds from all runs. This mean is expected to be a more accurate estimate of the model's performance (see this article).
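A minimal sketch of that procedure is shown below; the LogisticRegression estimator, the F1 scoring choice, and the synthetic data are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)

    # 5 repeats of 10-fold stratified CV -> 50 fits in total; the mean over
    # all 50 scores is a less noisy estimate than a single 10-fold run
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring="f1", cv=cv)
    print(f"mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")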

Thus, to answer your question: no, these two methods would not provide the same results. With RepeatedStratifiedKFold, each repetition of the procedure splits the dataset into a different set of stratified k-folds, and hence the performance results differ from run to run.

RepeatedStratifiedKFold has the benefit of improving the estimate of the model's performance at the cost of fitting and evaluating many more models. If, for example, 5 repeats (i.e., n_repeats=5) of 10-fold cross-validation were used for estimating the model's performance, 50 different models would need to be fitted (trained) and evaluated, which might be computationally expensive, depending on the dataset's size, the type of machine learning algorithm, device specifications, etc. However, the repeated cross-validation process can be executed on different cores or different machines, which can dramatically speed it up. For instance, setting n_jobs=-1 (in cross_val_score or GridSearchCV) would use all the cores available on your system (have a look here).
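A possible GridSearchCV setup along these lines is sketched below; the RandomForestClassifier, the parameter grid, and the F1 scoring are illustrative assumptions, while n_jobs=-1 parallelizes the fits across all available cores:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

    X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=42)

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
    grid = GridSearchCV(
        estimator=RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
        scoring="f1",   # choose a metric appropriate for imbalanced data
        cv=cv,
        n_jobs=-1,      # run the 50 fits per candidate in parallel on all cores
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)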

When it comes to evaluation, make sure to use appropriate metrics, as described in this answer.

  • Thanks Chris. But if we set the argument shuffle=True in StratifiedKFold, that might reduce the chance of a noisy estimate of the model, as per your reply. Moreover, GridSearchCV(estimator(200), cv=StratifiedKFold) might be as comprehensive as GridSearchCV(estimator(100), cv=RepeatedStratifiedKFold), as it has twice as many iterations/epochs?
    – Nemo
    Commented Mar 24, 2022 at 0:10
