Disclaimer: This question was inspired by this one, which is a good question but unfortunately has not attracted an answer that actually addresses the OP's question.
Statistical models often have parameters that need to be tuned in order to produce models with a good bias/variance tradeoff. This question concerns two supposedly different flavors of parameter selection I have been exposed to and have difficulty reconciling in my mind.
Approach Nr. 1: Nested cross-validation
Nested cross-validation is the first approach to parameter tuning that I got to know. It consists of the following steps (assuming, for simplicity, that our algorithm of interest has only one parameter; the scheme extends easily to more):
- Split your data into k training/test folds (regular cross-validation approach)
- For each training fold k:
  - Split the training fold further into m internal folds (this is what 'nested' in 'nested cross-validation' refers to)
  - For each parameter in a parameter grid:
    - For each internal fold, train one model on the internal training fold and evaluate it on the internal test fold
  - Keep the parameter with the lowest average internal test error
  - Using that optimal parameter, train a model on training fold k and evaluate it on test fold k
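The loop structure above can be made concrete with a minimal, self-contained Python sketch. Everything in it is invented for illustration: a toy 1-D k-nearest-neighbour classifier whose single tunable parameter is k, a synthetic dataset, and arbitrary fold counts.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    # Toy 1-D k-NN: majority label among the k training points closest to x
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

def error(train, test, k):
    # Misclassification rate on the test points
    return mean(knn_predict(train, x, k) != y for x, y in test)

def make_folds(data, n):
    # Split data into n roughly equal folds
    return [data[i::n] for i in range(n)]

def nested_cv(data, param_grid, k_outer=5, m_inner=3):
    outer = make_folds(data, k_outer)
    outer_errors = []
    for i in range(k_outer):
        test_fold = outer[i]
        train_fold = [p for j, f in enumerate(outer) if j != i for p in f]
        # Inner ('nested') loop: score each parameter by its average
        # error across the m internal folds of this training fold
        inner = make_folds(train_fold, m_inner)
        def inner_error(param):
            errs = []
            for a in range(m_inner):
                inner_test = inner[a]
                inner_train = [p for b, f in enumerate(inner) if b != a for p in f]
                errs.append(error(inner_train, inner_test, param))
            return mean(errs)
        best = min(param_grid, key=inner_error)
        # Train with the selected parameter on the full outer training
        # fold and evaluate on the untouched outer test fold
        outer_errors.append(error(train_fold, test_fold, best))
    return mean(outer_errors)

random.seed(0)
data = [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(60))]
print(nested_cv(data, param_grid=[1, 3, 5]))
```

Note that the parameter is re-selected inside every outer fold, so the outer test error estimates the performance of the whole tuning procedure, not of one fixed parameter value.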
I find it very clear how this parameter selection approach produces models that do everything in their power to avoid overfitting while at the same time harnessing the maximum amount of data, since both the model evaluation and the parameter selection are fully cross-validated.
Approach Nr. 2: The 'caret' approach
As the name implies, I was exposed to this approach when reading the documentation on parameter selection for the caret package. Parameter selection in the caret package apparently works as follows:
In this illustration (taken from the official caret documentation), the second loop ('resampling iteration') refers to one of many resampling strategies, one of which is 'simple' cross-validation (i.e. splitting all the data into k training/test folds, without nesting).
In less general pseudocode, this is equivalent to:
- For each parameter:
  - Split data into training/test folds (regular CV)
  - For each training fold:
    - Train a model on the training fold
    - Predict on the test fold
  - Return the average performance (measured using your performance metric of choice)
- For the parameter setting with the highest average performance, retrain a model on all training data.
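Purely for illustration (this is not caret's actual implementation), the pseudocode above corresponds to something like the following Python sketch, using a made-up dataset and a toy 1-D k-nearest-neighbour classifier whose single parameter is k:

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    # Toy 1-D k-NN: majority label among the k training points closest to x
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

def cv_error(data, k, n_folds=5):
    # Plain (non-nested) cross-validated error for ONE parameter value
    folds = [data[i::n_folds] for i in range(n_folds)]
    errs = []
    for i in range(n_folds):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        errs.append(mean(knn_predict(train, x, k) != y for x, y in test))
    return mean(errs)

random.seed(0)
data = [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(60))]

# Score every parameter with simple cross-validation ...
scores = {k: cv_error(data, k) for k in [1, 3, 5]}
best_k = min(scores, key=scores.get)
# ... then 'retrain on all training data' with the winning parameter.
# For k-NN the final model is just the full dataset plus best_k,
# since k-NN stores its training set rather than fitting weights.
final_model = (data, best_k)
print(best_k, scores[best_k])
```

The crucial difference from the nested scheme is that here the same cross-validation folds are used both to choose the parameter and to report its performance.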
When comparing approaches 1 and 2, there are three things that are unclear to me:
In the last line of the pseudocode for approach 2, it says to retrain the model on 'all training data' using the optimal parameters. To me, this implies splitting all the data into a training and a test (also called hold-out) set prior to applying everything explained in the pseudocode (this assumption is confirmed by the example following the pseudocode in the official caret documentation). This seems suboptimal to me, as you get to 'use' less data for training (probably resulting in an objectively worse model).
If instead we retrain a model on all training data used for parameter selection, then how do we make sure we account for overfitting?
On the other hand, none of the tutorials and overflow posts I've studied perform this holdout split. Instead, they assume that the classifications obtained from calling the train function in caret have been produced in a fashion that accounts for overfitting. At this point, I am uncertain how caret provides us with generalizable models when we use 'cv' or 'repeatedcv' in the trainControl function without a holdout split of the data to begin with.