I'm having a hard time understanding scikit-learn's StratifiedKFold, as described at
https://scikit-learn.org/stable/modules/cross_validation.html#stratification
I implemented the example from that section and added RandomOverSampler:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: 45 samples of class 0, 5 of class 1
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
# Oversample the minority class (fit_sample in older imbalanced-learn releases)
ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)

skf = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in skf.split(X_ros, y_ros):
    print('train - {} | test - {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test {y_ros[test]}")
Output:
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train - [36 36] | test - [9 9]
y_ros_test [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
My questions:
Where do we define the train/test split (the 80%/20% thing) in StratifiedKFold? I can see from the StratifiedKFold documentation that n_splits defines the number of folds, but I don't think it defines the split ratio. This part confuses me.
Why am I getting y_ros_test with 9 0's and 9 1's when I have n_splits=5? According to my explorations it should be 50/5 = 10, so shouldn't there be 5 1's and 5 0's in each split?
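To make the arithmetic behind both questions concrete, here is a minimal sketch that checks the sizes involved. It assumes the same toy data as above. To keep the snippet dependent only on NumPy and scikit-learn, the oversampling step is emulated by hand (resampling the minority class with replacement) rather than done with RandomOverSampler; the variable names are just for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same toy data as above: 45 samples of class 0 and 5 of class 1.
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

# Stand-in for RandomOverSampler(sampling_strategy='minority') (an assumption,
# not the imblearn implementation): draw minority rows with replacement
# until the minority class is as large as the majority class.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
extra_idx = rng.choice(minority_idx, size=45 - 5, replace=True)
X_ros = np.vstack([X, X[extra_idx]])
y_ros = np.hstack([y, y[extra_idx]])

class_counts = np.bincount(y_ros)  # samples per class after oversampling
n_samples = len(y_ros)             # total samples after oversampling

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
test_sizes = [len(test) for _, test in skf.split(X_ros, y_ros)]
test_fraction = test_sizes[0] / n_samples

print("samples after oversampling:", n_samples, class_counts)
print("test size per fold:", test_sizes)
print("test fraction per fold:", test_fraction)
```

If this sketch is right, it suggests two things. First, the split ratio is not set separately: each of the n_splits folds is held out once as the test set, so the test share is roughly 1/n_splits (20% for n_splits=5). Second, the folds are cut from the oversampled data, which has 90 rows rather than the original 50, so each test fold holds 90/5 = 18 samples, 9 per class.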