I'm trying to understand what the shuffle parameter does in StratifiedKFold from sklearn.model_selection.

I've read the documentation but still don't understand it. Can someone please explain, in plain English, what shuffle=True does?

From the documentation:

shuffle: bool, default=False
Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

The implementation is designed to:

  • Generate test sets such that all contain the same distribution of classes, or as close as possible.
  • Be invariant to class label: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
  • Preserve order dependencies in the dataset ordering, when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
  • Generate test sets where the smallest and largest differ by at most one sample.

1 Answer

It's been a while since the question was asked, but I'll try to answer it in case someone else finds it useful.

Maybe this part of the documentation is a bit confusing:

"Note that the samples within each split will not be shuffled"

Here is an example of what it means (without going into details of how it is implemented):

Suppose you have the following arrays:

  • X = [0, 1, 2, 3, 4, 5, 6, 7, 8]
  • y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

When you call split(X, y) on a StratifiedKFold created with shuffle=True, the effect is something like the following (suppose n_splits=3):

  1. First, X and y are randomly shuffled together (with the same permutation), giving for example:
    • X = [1, 4, 2, 8, 6, 5, 7, 3, 0]
    • y = [1, 2, 1, 3, 3, 2, 3, 2, 1]
  2. Now the 3 splits are created by walking through the shuffled arrays from beginning to end and picking examples so that each split has the same proportion of classes. So we'll get:
    • Split 1: X = [1, 4, 8], y = [1, 2, 3]
    • Split 2: X = [2, 6, 5], y = [1, 3, 2]
    • Split 3: X = [7, 3, 0], y = [3, 2, 1]
  3. Now comes the "Note that the samples within each split will not be shuffled" part:
    • For the first fold, you might expect test = [1, 4, 8] and train = [2, 6, 5, 7, 3, 0]. Instead, the indices are returned in their original (ascending) order, so we end up with test = [1, 4, 8] and train = [0, 2, 3, 5, 6, 7]
    • In the same way, for the second fold we have test = [2, 5, 6] and train = [0, 1, 3, 4, 7, 8]
    • And for the third fold, test = [0, 3, 7] and train = [1, 2, 4, 5, 6, 8]
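
If you want to check this yourself, here is a minimal sketch using the same toy arrays. The random_state=0 value and the variable names are my own choices for illustration. The exact folds you get depend on the seed and on your scikit-learn version, so I'm not listing an expected output.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data from the example above; the X values double as row indices.
X = np.arange(9).reshape(-1, 1)
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

# random_state is an arbitrary choice so that the run is repeatable.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Both index arrays come back in ascending order, and each test
    # fold holds one sample from every class.
    print(f"Fold {fold}: test={test_idx.tolist()} "
          f"(classes {y[test_idx].tolist()}), train={train_idx.tolist()}")
```

Changing random_state can change which samples end up together in a test fold, whereas with shuffle=False you get the same folds on every run. In both cases the indices inside each fold are printed in ascending order, which is what the "will not be shuffled" note refers to.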
