Both StratifiedKFold and RepeatedStratifiedKFold can be very effective when used on classification problems with a severe class imbalance. They both stratify the sampling by the class label; that is, they split the dataset in such a way that preserves approximately the same class distribution (i.e., the same percentage of samples of each class) in each subset/fold as in the original dataset. However, a single run of StratifiedKFold might result in a noisy estimate of the model's performance, as different splits of the data might result in very different results. That is where RepeatedStratifiedKFold comes into play.
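
For illustration, here is a minimal sketch (assuming scikit-learn and a synthetic, severely imbalanced dataset built with make_classification; the exact numbers are arbitrary) showing that each fold produced by StratifiedKFold keeps roughly the same class proportions as the full dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic binary dataset with a severe class imbalance (about 95% / 5%)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95], flip_y=0, random_state=1)
print("Overall class distribution:", np.bincount(y) / len(y))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Each test fold preserves (approximately) the original class proportions
    fold_dist = np.bincount(y[test_idx]) / len(test_idx)
    print(f"Fold {i}: class distribution = {fold_dist}")
```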

RepeatedStratifiedKFold improves the estimate of a machine learning model's performance by simply repeating the cross-validation procedure multiple times (according to the n_repeats value) and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the model's performance (see this article).

Thus, to answer your question: no, these two methods would not provide the same results. With RepeatedStratifiedKFold, each run of the procedure produces a different split of the dataset into stratified k-folds, and hence the performance results would be different.
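
As a rough sketch of the last two paragraphs (reusing the synthetic imbalanced dataset X, y from the snippet above, with a LogisticRegression model chosen purely for illustration), you can compare a single StratifiedKFold run with RepeatedStratifiedKFold and report the mean across all folds from all runs:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RepeatedStratifiedKFold, StratifiedKFold,
                                     cross_val_score)

model = LogisticRegression(max_iter=1000)

# A single 10-fold run: the estimate depends on this particular split
single_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
single_scores = cross_val_score(model, X, y, scoring="accuracy", cv=single_cv)
print("Single run mean: %.4f" % single_scores.mean())

# 5 repeats of 10-fold CV: 50 scores, averaged into a more stable estimate
repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
repeated_scores = cross_val_score(model, X, y, scoring="accuracy", cv=repeated_cv)
print("Repeated mean:   %.4f (+/- %.4f)"
      % (repeated_scores.mean(), repeated_scores.std()))
```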

RepeatedStratifiedKFold has the benefit of improving the estimate of the model's performance, at the cost of fitting and evaluating many more models. If, for example, 5 repeats (i.e., n_repeats=5) of 10-fold cross-validation were used to estimate the model's performance, 50 different models would need to be fitted (trained) and evaluated, which might be computationally expensive depending on the dataset's size, the type of machine learning algorithm, the device specifications, etc. However, the RepeatedStratifiedKFold procedure can be executed on different cores or different machines, which could dramatically speed up the process. For instance, setting n_jobs=-1 would use all the cores available on your system (have a look here).
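
A short sketch of the parallel execution (reusing the model and repeated_cv from above): passing n_jobs=-1 to cross_val_score spreads the 5 x 10 = 50 fit-and-evaluate jobs over all available cores:

```python
# 5 repeats x 10 folds = 50 model fits; n_jobs=-1 uses all available cores
scores = cross_val_score(model, X, y, scoring="accuracy",
                         cv=repeated_cv, n_jobs=-1)
print(len(scores))    # 50
print(scores.mean())  # mean result across all folds from all runs
```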

When it comes to evaluation, make sure to use appropriate metrics, as described in this answer.
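
For example (again a sketch reusing the setup above), accuracy can be misleading under a severe class imbalance, so you might pass an imbalance-aware scorer such as 'f1', 'balanced_accuracy', or 'roc_auc' to cross_val_score:

```python
# Accuracy is misleading under severe imbalance; compare imbalance-aware metrics
for metric in ("accuracy", "f1", "balanced_accuracy", "roc_auc"):
    scores = cross_val_score(model, X, y, scoring=metric,
                             cv=repeated_cv, n_jobs=-1)
    print(f"{metric:>17}: {scores.mean():.4f}")
```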
