I am currently working on an ML experiment where I use a nested 5-fold cross-validation procedure and obtain an NDCG@10 score for each test user. I am comparing 6 different ML algorithms and have data for around 10,000 users.
My cross-validation process involves training 6 different models whose training sets naturally overlap. I end up with evaluation scores from 5 folds, each containing a different set of test users. Essentially, I have one dataset, but because of the cross-validation procedure, the scores in each test fold come from models trained on partially overlapping training data.
Here are the two approaches I am considering for testing statistical significance between the 6 ML models:
1. Performing Statistical Tests for Each Fold Separately:
- Conduct statistical tests comparing the 6 models for each of the 5 test folds separately.
- Use the evaluation scores of the users in each fold to perform a Friedman test, followed by a Nemenyi post-hoc test for all pairwise model comparisons.
- Combine the resulting p-values of the post-hoc test using Stouffer’s method (Z-transform test).
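A minimal sketch of approach 1, using only `scipy.stats` and synthetic stand-in scores (the fold sizes and score ranges below are placeholders, not my actual data). The Nemenyi post-hoc step is omitted to keep the sketch dependency-free; it could be done with e.g. `scikit_posthocs.posthoc_nemenyi_friedman`, and each pairwise p-value would then be combined across folds the same way as the omnibus p-values here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: one array per test fold with shape (n_users, 6 models),
# holding each user's NDCG@10 under each model.
folds = [rng.uniform(0.2, 0.8, size=(2000, 6)) for _ in range(5)]

fold_pvalues = []
for scores in folds:
    # Friedman test within one fold: each user is a "block",
    # each of the 6 models a "treatment".
    stat, p = stats.friedmanchisquare(*[scores[:, m] for m in range(6)])
    fold_pvalues.append(p)

# Combine the five per-fold p-values with Stouffer's Z-transform method.
z, p_combined = stats.combine_pvalues(fold_pvalues, method="stouffer")
print(f"per-fold p-values: {np.round(fold_pvalues, 3)}")
print(f"combined p (Stouffer): {p_combined:.3f}")
```

With pairwise Nemenyi p-values the combination step would run once per model pair (15 pairs for 6 models), so some multiplicity correction on the combined p-values would still be needed.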
2. Treating All Test Folds as Independent Datasets:
- Treat the 5 test folds as independent datasets and conduct statistical tests on the average evaluation score per dataset.
- For each ML model, average the users' evaluation scores within each fold, giving one score per model per fold.
- Use the Friedman test and Nemenyi post-hoc test on these fold-level averages; no p-value combination is needed.
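A minimal sketch of approach 2, again with synthetic stand-in data (sizes and score ranges are placeholders). Each fold contributes one mean score per model, so the Friedman test runs on a 5×6 matrix with folds as blocks. One caveat I am aware of: with only 5 blocks the test has very little power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Mean NDCG@10 per (fold, model): 5 folds treated as "datasets", 6 models.
fold_means = np.array([
    rng.uniform(0.2, 0.8, size=(2000, 6)).mean(axis=0)  # average over users
    for _ in range(5)
])  # shape (5, 6)

# Friedman test with the 5 folds as blocks and the 6 models as treatments.
stat, p = stats.friedmanchisquare(*[fold_means[:, m] for m in range(6)])
print(f"Friedman statistic = {stat:.3f}, p = {p:.3f}")
```

If the omnibus test rejects, the Nemenyi post-hoc step would run on the same 5×6 matrix (e.g. via `scikit_posthocs.posthoc_nemenyi_friedman`).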
I have read various papers on this topic, but as I am new to this field, I am confused about which approach is best.
Could someone provide guidance for my nested cross-validation procedure? Any advice or references to relevant literature would be greatly appreciated; I am a beginner!
In short:
- Nested 5-fold CV resulting in evaluation scores for 5 test folds, where each test fold contains different users.
- Inner CV trains on partially overlapping training instances.
- 6 ML models
- All-pairwise multiple comparison of the ML models
- How should I handle the users' evaluation scores from the 5 test folds when testing statistical significance between the 6 ML algorithms?