I am having a hard time understanding scikit-learn's StratifiedKFold from https://scikit-learn.org/stable/modules/cross_validation.html#stratification

so I implemented the example from that page, adding RandomOverSampler:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

# Oversample the minority class until both classes have 45 samples
ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_resample replaces the older fit_sample

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X_ros, y_ros):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test  {y_ros[test]}")


train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]

My questions:

  1. Where do we define the train and test split (the 80%/20% thing) in StratifiedKFold? I can see that n_splits defines the number of folds, but not the split, I think. This part confuses me.

  2. Why am I getting y_ros_test with nine 0's and nine 1's when I have n_splits=5? According to the explanations it should be 50/5 = 10, so shouldn't there be five 1's and five 0's in each split?


Regarding your first question: cross-validation (CV) involves no separate train-test split; in each CV round, one fold serves as the test set and the remaining folds as the training set. As a result, with n_splits=5, as here, each round uses 1/5 (i.e. 20%) of the data for testing and the remaining 4/5 (i.e. 80%) for training. So yes, the n_splits argument alone determines the split, and nothing further needs to be specified (for n_splits=4 you would get a 75-25 split).

Regarding your second question, you seem to forget that you oversampled your data before splitting. Running your code with the initial X and y (i.e. without oversampling) indeed gives a y_test of size 50/5 = 10, although it is not balanced (balancing is the result of oversampling) but stratified (each fold retains the class proportions of the original data):

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
    print(f"y_test  {y[test]}")


train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]

Since oversampling the minority class increases the size of the dataset (here from 50 to 45 + 45 = 90 samples), it is only to be expected that y_ros_test is larger than y_test (here 90/5 = 18 samples instead of 10).

Methodologically speaking, you don't actually need stratified sampling if you have already oversampled your data to balance the class representation.

  • Thanks for your answer. One thing that still has me scratching my head is the purpose of StratifiedKFold. So StratifiedKFold just makes sure that each class is represented in each fold. It has nothing to do with helping with an imbalanced dataset? Can you clarify that?
    – Alexander
    Commented Jan 7, 2021 at 0:51
  • @Alexander nope; as I have implied in the answer, its role is only to ensure that each fold will maintain the class representation as present in the original data. If the original data are imbalanced, so will be the stratified folds, too.
    – desertnaut
    Commented Jan 7, 2021 at 0:59
  • OK, that's what I was asking. If I have imbalanced data, I first need to do some oversampling or undersampling, and then either use StratifiedKFold or not. But if I use StratifiedKFold, it is a kind of helper that makes sure the class representation of the new dataset is maintained. If I don't use StratifiedKFold, can I get the same result?
    – Alexander
    Commented Jan 7, 2021 at 5:38
  • @Alexander you don't need stratified sampling when your binary data is already balanced (naturally or artificially), although it never hurts to use it; simple random sampling will give practically the same result. Stratification is needed either when you need to maintain the imbalance, or in multi-class settings, to ensure that all classes are represented in the folds in their original proportions (to the extent possible, of course).
    – desertnaut
    Commented Jan 7, 2021 at 10:44

