I want to compare multiple classifiers on multiple data streams. For a stream of length $n$, I test a single classifier every $t/c$ time steps on a dedicated hold-out subset of my data and calculate the AUC.
How can I apply a Friedman test with a post-hoc Nemenyi test as described in "Statistical Comparisons of Classifiers over Multiple Data Sets" (Demšar, 2006)?
Let's say I have 3 streams and 5 classifiers, and I calculate 100 AUCs per classifier on each stream.
I suppose I have to average the AUCs of each classifier on each stream rather than treat each AUC as the result of a separate experiment. The reason I think the individual AUCs cannot be used is this passage from the paper:
> In our examples we have used AUCs measured and averaged over repetitions of training/testing episodes. For instance, each cell in Table 6 represents an average over five-fold cross validation. Could we also consider the variance, or even the results of individual folds? There are variations of the ANOVA and the Friedman test which can consider multiple observations per cell provided that the observations are independent (Zar, 1998). This is not the case here, since training data in multiple random samples overlaps. We are not aware of any statistical test that could take this into account.
In the case of my setup above, there is obviously overlap within a stream.
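To make the averaged variant concrete, here is a minimal sketch of what I have in mind. It assumes the AUCs are stored in an array of shape (streams, classifiers, evaluations). The AUC values are random placeholders, not real results, and the variable names are my own. The critical value `q_alpha = 2.728` is the one tabulated in the paper for 5 classifiers at $\alpha = 0.05$.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder data (assumption): 3 streams, 5 classifiers, 100 AUCs each.
rng = np.random.default_rng(0)
n_streams, n_classifiers, n_evals = 3, 5, 100
aucs = rng.uniform(0.5, 1.0, size=(n_streams, n_classifiers, n_evals))

# One averaged AUC per (stream, classifier) cell, as in the paper's tables.
mean_auc = aucs.mean(axis=2)  # shape (3, 5)

# Friedman test: one sample per classifier, one observation per stream.
stat, p_value = friedmanchisquare(*mean_auc.T)

# Rank classifiers within each stream (rank 1 = highest AUC), then average.
ranks = rankdata(-mean_auc, axis=1)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N)).
q_alpha = 2.728  # k = 5, alpha = 0.05
cd = q_alpha * np.sqrt(n_classifiers * (n_classifiers + 1) / (6 * n_streams))

# Two classifiers differ significantly if their average ranks differ by more than CD.
rank_diff = np.abs(avg_ranks[:, None] - avg_ranks[None, :])
significant = rank_diff > cd

print(f"Friedman statistic = {stat:.3f}, p = {p_value:.3f}")
print("average ranks:", avg_ranks)
print(f"critical difference = {cd:.3f}")
```

With only 3 streams, the critical difference comes out at roughly 3.52, while the largest possible difference between two average ranks is 4, so this version of the test would have very little power.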