I am currently working on an ML experiment where I use a nested 5-fold cross-validation procedure and obtain an NDCG@10 score for each test user. I am comparing 6 different ML algorithms and have data for around 10,000 users.
My cross-validation process involves training 6 different models whose training sets naturally overlap. I end up with evaluation scores from 5 folds, each containing a different set of test users. Essentially, I have one dataset, but because of the cross-validation procedure, the scores in each test fold come from models trained on partially overlapping training data.
Here are the two approaches I am considering for testing statistical significance between the 6 ML models:
1. Performing Statistical Tests for Each Fold Separately:
- Conduct statistical tests comparing the 6 models for each of the 5 test folds separately.
- Use the evaluation scores of the users in each fold to perform a Friedman test, followed by a Nemenyi post-hoc test for all pairwise model comparisons.
- Combine the resulting p-values of the post-hoc test using Stouffer’s method (Z-transform test).
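A minimal sketch of approach 1, using only `scipy.stats` and synthetic stand-in scores (the fold sizes and score ranges below are placeholders, not my actual data). The Nemenyi post-hoc step is omitted to keep the sketch dependency-free; it could be done with e.g. `scikit_posthocs.posthoc_nemenyi_friedman`, and each pairwise p-value would then be combined across folds the same way as the omnibus p-values here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: one array per test fold with shape (n_users, 6 models),
# holding each user's NDCG@10 under each model.
folds = [rng.uniform(0.2, 0.8, size=(2000, 6)) for _ in range(5)]

fold_pvalues = []
for scores in folds:
    # Friedman test within one fold: each user is a "block",
    # each of the 6 models a "treatment".
    stat, p = stats.friedmanchisquare(*[scores[:, m] for m in range(6)])
    fold_pvalues.append(p)

# Combine the five per-fold p-values with Stouffer's Z-transform method.
z, p_combined = stats.combine_pvalues(fold_pvalues, method="stouffer")
print(f"per-fold p-values: {np.round(fold_pvalues, 3)}")
print(f"combined p (Stouffer): {p_combined:.3f}")
```

With pairwise Nemenyi p-values the combination step would run once per model pair (15 pairs for 6 models), so some multiplicity correction on the combined p-values would still be needed.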
2. Treating All Test Folds as Independent Datasets:
- Treat the 5 test folds as independent datasets and conduct statistical tests on the average evaluation score per dataset.
- For each ML model, average the users' evaluation scores within each fold, giving one score per model per fold.
- Use the Friedman test and Nemenyi post-hoc test on these fold-level averages; no p-value combination is needed.
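A minimal sketch of approach 2, again with synthetic stand-in data (sizes and score ranges are placeholders). Each fold contributes one mean score per model, so the Friedman test runs on a 5×6 matrix with folds as blocks. One caveat I am aware of: with only 5 blocks the test has very little power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Mean NDCG@10 per (fold, model): 5 folds treated as "datasets", 6 models.
fold_means = np.array([
    rng.uniform(0.2, 0.8, size=(2000, 6)).mean(axis=0)  # average over users
    for _ in range(5)
])  # shape (5, 6)

# Friedman test with the 5 folds as blocks and the 6 models as treatments.
stat, p = stats.friedmanchisquare(*[fold_means[:, m] for m in range(6)])
print(f"Friedman statistic = {stat:.3f}, p = {p:.3f}")
```

If the omnibus test rejects, the Nemenyi post-hoc step would run on the same 5×6 matrix (e.g. via `scikit_posthocs.posthoc_nemenyi_friedman`).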
I have read various papers on this topic, but as I am new to this field, I am confused about which approach is best.
Could someone provide guidance for my nested cross-validation procedure? Any advice or references to relevant literature would be greatly appreciated; I am a beginner!
In short:
- Nested 5-fold CV resulting in evaluation scores for 5 test folds, where each test fold contains different users.
- Inner CV trains on partially overlapping training instances.
- 6 ML models
- All-pairwise multiple comparison of the ML models
- How should I handle the users' evaluation scores from the 5 test folds when testing statistical significance between the 6 ML algorithms?