I am currently working on an ML experiment where I use a nested 5-fold cross-validation procedure and obtain an NDCG@10 score for each test user. I am comparing 6 different ML algorithms and have data for around 10,000 users.

My cross-validation process involves training 6 different models that naturally share some training data. I end up with evaluation scores from 5 folds, each containing different test users. Essentially, I have one dataset, but due to the cross-validation procedure, I obtain evaluation scores from models trained on partially overlapping training datasets for each test fold.

Here are two approaches I am considering for testing the statistical significance of differences between the 6 ML models:

1. Performing Statistical Tests for Each Fold Separately:

  • Conduct statistical tests comparing the 6 models for each of the 5 test folds separately.
  • Use the evaluation scores of the users in each fold to perform a Friedman test, followed by a Nemenyi post-hoc test for all pairwise model comparisons.
  • Combine the resulting p-values of the post-hoc test across folds using Stouffer’s method (Z-transform test); see the sketch after this list.
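
For concreteness, here is a minimal sketch of approach 1 in Python. The data layout (`scores_per_fold` as one users × models array per fold) and the model names are placeholders, and the Nemenyi step assumes the third-party `scikit-posthocs` package; this is only an illustration of the procedure, not your exact pipeline:

```python
import numpy as np
from scipy.stats import friedmanchisquare, combine_pvalues
import scikit_posthocs as sp

# Placeholder data: 5 folds, each an (n_users_in_fold, 6) array of NDCG@10
# scores, one column per ML model and one row per test user.
rng = np.random.default_rng(0)
scores_per_fold = [rng.uniform(0, 1, size=(2000, 6)) for _ in range(5)]
model_names = [f"model_{i}" for i in range(6)]

# Collect the per-fold Nemenyi p-values for every model pair.
pairwise_pvals = {}  # (i, j) -> list of 5 per-fold p-values
for k, fold_scores in enumerate(scores_per_fold):
    # Friedman omnibus test: each user is a block, each model a treatment.
    stat, p_omnibus = friedmanchisquare(*[fold_scores[:, m] for m in range(6)])
    print(f"fold {k}: Friedman p = {p_omnibus:.4f}")
    # Nemenyi post-hoc on the same users x models matrix (6x6 p-value matrix).
    nemenyi = sp.posthoc_nemenyi_friedman(fold_scores)
    for i in range(6):
        for j in range(i + 1, 6):
            pairwise_pvals.setdefault((i, j), []).append(nemenyi.iloc[i, j])

# Combine the 5 per-fold p-values for each model pair with Stouffer's method.
for (i, j), pvals in pairwise_pvals.items():
    z, p_combined = combine_pvalues(pvals, method="stouffer")
    print(f"{model_names[i]} vs {model_names[j]}: combined p = {p_combined:.4f}")
```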

2. Treating All Test Folds as Independent Datasets:

  • Treat the 5 test folds as independent datasets and conduct statistical tests on the average evaluation score per dataset.
  • For each ML model, compute the average of the users’ evaluation scores in each of the 5 datasets, then test for significant differences between models across these averages.
  • Use the Friedman test and Nemenyi post-hoc test; no p-value combination is needed (see the sketch after this list).
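
And a corresponding sketch of approach 2, again with placeholder data and the same assumed `scikit-posthocs` dependency. Each fold contributes a single mean NDCG@10 per model, so the Friedman test treats the 5 folds as blocks and the 6 models as treatments:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Placeholder data: same layout as before, 5 folds of (n_users, 6) scores.
rng = np.random.default_rng(0)
scores_per_fold = [rng.uniform(0, 1, size=(2000, 6)) for _ in range(5)]

# Collapse each fold to one mean NDCG@10 per model -> 5 (folds) x 6 (models).
mean_scores = np.vstack([fold.mean(axis=0) for fold in scores_per_fold])

# Friedman omnibus test with folds as blocks and models as treatments.
stat, p_omnibus = friedmanchisquare(*[mean_scores[:, m] for m in range(6)])
print(f"Friedman: chi2 = {stat:.3f}, p = {p_omnibus:.4f}")

# Nemenyi post-hoc on the 5 x 6 matrix; no p-value combination is needed.
nemenyi = sp.posthoc_nemenyi_friedman(mean_scores)
print(nemenyi.round(4))
```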

I have read various literature on this topic, but as I am new to this field, I am confused about the best approach to take.

Could someone provide guidance for my nested cross-validation procedure? Any advice or references to relevant literature would be greatly appreciated; I am a beginner!


In short:

  • Nested 5-Fold CV resulting in evaluation scores for 5 test folds, where each test fold contains different users.
  • Inner CV trains partly on overlapping training instances.
  • 6 ML models
  • All-Pairwise Multiple Comparison of ML models
  • How should I handle the users' evaluation scores from 5 test folds when testing for statistically significant differences between the 6 ML algorithms?