
I just built my first random forest classifier today and I am trying to improve its performance. I was reading about how cross-validation is important for avoiding overfitting and hence obtaining better results. I implemented StratifiedKFold using sklearn; however, surprisingly, this approach turned out to be less accurate. I have read numerous posts suggesting that cross-validation is much more effective than train_test_split.

Estimator:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

from sklearn.model_selection import StratifiedKFold

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    # features and labels are NumPy arrays, so integer-array indexing works here
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS:

from sklearn.model_selection import train_test_split

train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)

Below are the results:

CV:

AUROC:  0.74
Accuracy Score:  74.74 %.
Specificity:  0.69
Precision:  0.75
Sensitivity:  0.79
Matthews correlation coefficient (MCC):  0.49
F1 Score:  0.77

TTS:

AUROC:  0.76
Accuracy Score:  76.23 %.
Specificity:  0.77
Precision:  0.79
Sensitivity:  0.76
Matthews correlation coefficient (MCC):  0.52
F1 Score:  0.77

Is this actually possible? Or have I set up my models incorrectly?

Also, is this the correct way of using cross-validation?

2 Answers


Glad to see you did your reading on this!

The reason for the difference is that the TTS approach introduces bias, since you are not using all of your observations for testing.

In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

And the results can vary quite a lot:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

Cross-validation deals with this problem by using all of the available data, thus eliminating that bias.
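
For instance, a minimal sketch with your estimator (assuming features and labels are the same arrays as in your question) would be:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# every observation ends up in a test fold exactly once across the 10 folds
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(rf, features, labels, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())  # average fold accuracy and its spread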

Here, your results for the TTS approach carry more bias, and this should be kept in mind when analysing them. You may also simply have been lucky with the test/validation set that was sampled.
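
You can get a feel for how much luck is involved by repeating the split with a few different seeds; a rough sketch (again assuming your features/labels arrays):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# the single-split estimate moves around depending on which rows land in the test set
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, model.predict(X_te)))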

Again, there is more on that topic in this great, beginner-friendly article: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, refer to the "Model Assessment and Selection" chapter here (the source of the quoted content):

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

  • Thanks for your help, again! So basically, using cross-validation is better in the sense of ensuring that there is no statistical bias (or lower bias) in the obtained results? And also, is this the correct way of using cross-validation techniques?
    – David
    Commented Mar 6, 2018 at 15:43
  • Thanks for the references, as well.
    – David
    Commented Mar 6, 2018 at 15:50
  • By using CV you ensure that there is no bias in the test error, and you lower the variability of your results. As for the code, I use R so I'm not entirely sure, but this seems quite alright. Commented Mar 6, 2018 at 15:51
  • Hi, I just read that in order to avoid overfitting, you should reduce the variance and increase the bias. Does that mean that TTS would then be preferred?
    – David
    Commented Mar 6, 2018 at 16:56
  • Yes, you split your data into K equal sets; you then train on K-1 sets and test on the remaining set. You do that K times, changing the test set each time, so that in the end every set has been the test set once and a training set K-1 times. You then average the K results to get the K-fold CV result (see the sketch just below these comments). Commented Mar 7, 2018 at 9:18
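
A rough sketch of the loop described in that last comment, assuming the rf, features and labels from the question, might look like this:

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

fold_scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(features, labels):
    # train on the K-1 folds, test on the held-out fold
    rf.fit(features[train_index], labels[train_index])
    preds = rf.predict(features[test_index])
    fold_scores.append(accuracy_score(labels[test_index], preds))

print(np.mean(fold_scores))  # the K-fold CV estimate is the average of the K fold scores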

Cross-validation tends to correct for selection bias in your data. So, for example, if you focus on the AUC metric and get a lower AUC score with the TTS approach, it means there is a bias in your TTS split.

You might want to conduct an analysis to figure this bias out (e.g. pay more attention to date features, making sure you don't use the future to predict the past, or try to find any kind of leakage in your data associated with business logic).
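
As a purely illustrative example, one quick sanity check (assuming binary 0/1 labels and the train_label/test_label arrays from your TTS snippet) is to compare the class balance on each side of the split:

import numpy as np

# a noticeably different positive rate on the two sides hints at a biased split
print("train positive rate:", np.mean(train_label))
print("test positive rate:", np.mean(test_label))
# if you have date columns, also check that test rows don't predate the training rows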

Overall, the difference in scores doesn't seem big enough to me to worry about much. So the code seems OK, and such a difference in scores is possible.

By the way, you didn't describe the problem or the data at all. However, since you used Stratified KFold CV, I assume you have an imbalanced dataset; if not, ordinary KFold CV might be worth a try. In your TTS you don't have class balancing implemented, whereas Stratified CV does it for you.
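
If the classes turn out to be roughly balanced, a sketch of both options (plain KFold vs StratifiedKFold, plus a stratified single split) could look like:

from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

kf = KFold(n_splits=10, shuffle=True, random_state=42)             # ignores class proportions
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # keeps class proportions per fold

# train_test_split can also stratify, which makes the single split comparable to Stratified CV
train_feature, test_feature, train_label, test_label = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)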
