When training my model, I'm getting very different results from sklearn.model_selection.train_test_split(X, y, stratify=y, train_size=0.9) than from sklearn.model_selection.StratifiedKFold(n_splits=10), and I was wondering whether the two stratify the data differently. I'm fairly certain I implemented everything according to the docs, but strangely, the latter gives much worse test accuracy than the former.
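For reference, a minimal sketch of the two setups being compared might look like the following (the make_classification dataset and LogisticRegression classifier are illustrative stand-ins, not the original code):

    # Illustrative sketch only: dataset and classifier are assumptions,
    # not the original poster's code.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # Approach 1: a single stratified 90/10 split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, stratify=y, train_size=0.9, random_state=0
    )
    acc_split = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print("train_test_split accuracy:", acc_split)

    # Approach 2: 10-fold stratified cross-validation, averaged over folds.
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print("StratifiedKFold mean accuracy:", np.mean(scores))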
Can you post a minimal, complete example so we can try to duplicate your behaviour? – Vivek Kumar, Jun 15, 2017 at 0:55
1 Answer
When stratify is not None, train_test_split uses StratifiedShuffleSplit internally, not StratifiedKFold. So yeah, there is a big difference.
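As a rough illustration of that internal behaviour, the following sketch compares a stratified train_test_split with a single split drawn directly from StratifiedShuffleSplit (the parameter values are illustrative, and the indices matching on the same random_state is an implementation detail of current scikit-learn):

    # Sketch: with stratify=y, train_test_split draws one split from
    # StratifiedShuffleSplit under the hood.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

    X, y = make_classification(n_samples=100, random_state=0)

    # One stratified 90/10 split via train_test_split...
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, stratify=y, train_size=0.9, random_state=42
    )

    # ...and the equivalent split taken directly from the splitter.
    sss = StratifiedShuffleSplit(n_splits=1, train_size=0.9, random_state=42)
    train_idx, test_idx = next(sss.split(X, y))

    print(np.array_equal(X_tr, X[train_idx]))  # True on current scikit-learn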
@hyperdo In addition, an obvious difference is that StratifiedKFold will give 10 folds of different train and test data, whereas train_test_split will give only one split. – Commented Jun 15, 2017 at 5:27
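A quick sketch of that point: iterating over StratifiedKFold yields 10 distinct test folds, each preserving the overall class proportions (the imbalanced make_classification dataset is an assumed example):

    # Sketch: StratifiedKFold produces 10 distinct train/test partitions,
    # whereas train_test_split produces exactly one.
    from collections import Counter
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=100, weights=[0.7, 0.3], random_state=0)

    for i, (train_idx, test_idx) in enumerate(StratifiedKFold(n_splits=10).split(X, y)):
        # Each test fold keeps roughly the 70/30 class balance of the full data.
        print(f"fold {i}: test size={len(test_idx)}, classes={Counter(y[test_idx])}")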