
I am comparing KFold and RepeatedKFold using sklearn version 0.22. According to the documentation, RepeatedKFold "Repeats K-Fold n times with different randomization in each repetition." One would expect the results from running RepeatedKFold with only a single repeat (n_repeats=1) to be pretty much identical to KFold.

I ran a simple comparison:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics

X, y = load_digits(return_X_y=True)

classifier = SGDClassifier(loss='hinge', penalty='elasticnet', fit_intercept=True)
scorer = metrics.accuracy_score
results = []
n_splits = 5
kf = KFold(n_splits=n_splits)
for train_index, test_index in kf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('KFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))
print()

results = []
n_repeats = 1
rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats)
for train_index, test_index in rkf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('RepeatedKFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))

The output is

KFold
mean =  0.9082079851439182
std =  0.04697225962068869

RepeatedKFold
mean =  0.9493562364593006
std =  0.017732595698953055

I repeated this experiment enough times to see that the difference is statistically significant.
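For reference, a sketch of how such a repeated comparison could look (my own reconstruction, not the original code; the `cv_scores` helper and the trial count of 10 are assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, RepeatedKFold

X, y = load_digits(return_X_y=True)

def cv_scores(cv):
    # One accuracy score per fold for the given CV splitter.
    clf = SGDClassifier(loss='hinge', penalty='elasticnet',
                        fit_intercept=True, random_state=0)
    return [accuracy_score(y[test], clf.fit(X[train], y[train]).predict(X[test]))
            for train, test in cv.split(X, y)]

kf_scores, rkf_scores = [], []
for trial in range(10):
    kf_scores += cv_scores(KFold(n_splits=5))
    rkf_scores += cv_scores(RepeatedKFold(n_splits=5, n_repeats=1,
                                          random_state=trial))

print('KFold mean:', np.mean(kf_scores))
print('RepeatedKFold mean:', np.mean(rkf_scores))
# A two-sample t-test on the two score collections quantifies the gap.
print(ttest_ind(kf_scores, rkf_scores))
```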

I have read and reread the documentation to see if I'm missing something, but to no avail.

Btw, the same holds true for StratifiedKFold and RepeatedStratifiedKFold:

StratifiedKFold
mean =  0.9159935004642525
std =  0.026687786392525545

RepeatedStratifiedKFold
mean =  0.9560476632621479
std =  0.014405630805910506

For this data set, StratifiedKFold agrees with KFold; RepeatedStratifiedKFold agrees with RepeatedKFold.

UPDATE: Following the suggestions from @Dan and @SergeyBushmanov, I included shuffle and random_state:

def run_nfold(X, y, classifier, scorer, cv, n_repeats):
    # Run the CV splitter n_repeats times and collect one score per fold.
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))    
    return results
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))

kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))

rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats, random_state = 111)
results_kf_repeated = run_nfold(X,y, classifier, scorer, rkf, 10)
print('RepeatedKFold mean = ', np.mean(results_kf_repeated))

produces

KFold mean =  0.9119255648406066
KFold Shuffled mean =  0.9505304859176724
RepeatedKFold mean =  0.950754100897555

Moreover, using the Kolmogorov-Smirnov test:

from scipy.stats import ks_2samp

print('Compare KFold with KFold shuffled results')
print(ks_2samp(results_kf, results_kf_shuffle))
print('Compare RepeatedKFold with KFold shuffled results')
print(ks_2samp(results_kf_repeated, results_kf_shuffle))

shows that shuffled KFold and RepeatedKFold (which is indeed shuffled by default; you are right, @Dan) are statistically the same, whereas the default non-shuffled KFold produces statistically significantly lower results:

Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)

Compare RepeatedKFold with KFold shuffled results
Ks_2sampResult(statistic=0.14, pvalue=0.7166468440414822)

Now, note that I used different random_state values for KFold and RepeatedKFold. So the answer, or rather a partial answer, is that the difference in results is due to shuffling vs. not shuffling. This makes sense: a different random_state changes the exact split, but it shouldn't change statistical properties such as the mean over multiple runs.

I'm now confused about why shuffling causes this effect. I've changed the title of the question to reflect this confusion (I hope it doesn't break any Stack Overflow rules, but I don't want to create another question).
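A minimal sketch of what shuffle actually changes (my own toy illustration, not part of the original experiment): without shuffling, KFold's test folds are contiguous blocks in data order, so any ordering in the data set leaks into the folds; with shuffling, the indices are permuted once before being cut into folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

# shuffle=False (the default): test folds are contiguous index blocks.
kf = KFold(n_splits=5)
print([test.tolist() for _, test in kf.split(X)])
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True: indices are permuted before being split into folds.
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
print([test.tolist() for _, test in kf_shuffled.split(X)])
```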

UPDATE: I agree with @SergeyBushmanov's suggestion and posted it as a new question.

Comments:
  • Did you try giving them both the same random_state seed?
    – Dan
    Commented Mar 2, 2020 at 16:19
  • No, I hadn't. I tried now and I'm getting a warning "FutureWarning: Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True." Commented Mar 2, 2020 at 16:23
  • So I would guess that RepeatedKFold has shuffle forced to True, so I would suggest the fairest test would be to set the same random_state for each and to set shuffle=True for Kfold. In your repeated experiments, did you get the same result for KFold each time and a different result for RepeatedKFold?
    – Dan
    Commented Mar 2, 2020 at 16:29
  • @DavidMakovoz RepeatedKFold uses KFold underneath to generate folds. See the link to the code below in my answer. They produce the same splits as long as random_state is the same. Commented Mar 2, 2020 at 17:44
  • @SergeyBushmanov, yes, I implemented it above. The question now is why the default unshuffled KFold produces results that are statistically significantly different than shuffled KFold. Commented Mar 2, 2020 at 19:14

1 Answer


To make RepeatedKFold reproduce KFold's results, you have to make KFold shuffle and give both the same random_state:

import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

np.random.seed(42)
n = np.random.choice([0, 1], 10, p=[.5, .5])
kf = KFold(2, shuffle=True, random_state=42)
list(kf.split(n))
[(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
 (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]
kfr = RepeatedKFold(n_splits=2, n_repeats=1, random_state=42)
list(kfr.split(n))
[(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
 (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]

RepeatedKFold uses KFold to generate folds; you only need to make sure both have the same random_state (and that KFold shuffles).
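A quick way to check this claim programmatically (a sketch of my own, not from the answer) is to compare the splits pair by pair:

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

X = np.arange(30)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
rkf = RepeatedKFold(n_splits=3, n_repeats=1, random_state=42)

# With the same random_state, the two iterators emit identical folds.
for (tr_kf, te_kf), (tr_rkf, te_rkf) in zip(kf.split(X), rkf.split(X)):
    assert np.array_equal(tr_kf, tr_rkf)
    assert np.array_equal(te_kf, te_rkf)
print('splits are identical')
```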
