75
$\begingroup$

To me, it seems that hold-out validation is useless. That is, splitting the original dataset into two parts (training and testing) and using the testing score as a generalization measure is somewhat useless.

K-fold cross-validation seems to give better approximations of generalization (as it trains and tests on every point). So, why would we use the standard hold-out validation? Or even talk about it?

$\endgroup$
5
  • 13
    $\begingroup$ Why do you think it's useless? You can read The Elements of Statistical Learning, chapter 7, for a formal analysis of its pros and cons. Statistically speaking, k-fold is better, but using a test set is not necessarily bad. Intuitively, you need to consider that a test set (when used correctly) is indeed a data set that has not been used at all during training. So it's definitely useful in some sense for evaluating a model. Also, k-fold is super expensive, so hold-out is sort of an "approximation" to what k-fold does (but for someone with low computational power). $\endgroup$ Commented Dec 8, 2015 at 19:38
  • 2
    $\begingroup$ Sure. From a theoretical perspective, K-fold is more precise but SLIGHTLY more computationally expensive. The question was: why not ALWAYS do K-fold cross validation? $\endgroup$
    – user46925
    Commented Dec 8, 2015 at 21:12
  • 3
    $\begingroup$ I see. I would argue that the reason is almost always computational. K-fold approximates the generalization error better, so from a statistical point of view K-fold is the method of choice, I believe. Hold-out is much simpler to implement AND doesn't require training as many models. In practice, training a model can be quite expensive. $\endgroup$ Commented Dec 9, 2015 at 19:11
  • 2
    $\begingroup$ Right - but I think the "too computationally expensive" argument is fairly frail. Almost all the time, we're aiming to develop the most accurate models. Yet there is this paradox where a lot of the experiments conducted in the literature only have a single hold-out validation set. $\endgroup$
    – user46925
    Commented Dec 9, 2015 at 19:55
  • 2
    $\begingroup$ Question - The Elements of Statistical Learning, section 7.10.1, titled "K-Fold Cross-Validation", seems to indicate that keeping test data entirely separate from training data (as in hold-out validation) is ideal, and k-fold validation is just a compromise as data is often scarce. I'm still quite new to statistics, could you point out how cross-validation is in fact more precise? $\endgroup$
    – numX
    Commented Aug 24, 2016 at 19:17

12 Answers

22
$\begingroup$

NOTE: This answer is old, incomplete, and thoroughly out of date. It was only debatably correct when it was posted in 2014, and I'm not really sure how it got so many upvotes or how it became the accepted answer. I recommend this answer instead, written by an expert in the field (and with significantly more upvotes). I am leaving my answer here for historical/archival purposes only.


My only guess is that you can implement hold-out validation with three hours of programming experience; k-fold takes a week in principle and six months in practice.

In principle it's simple, but writing code is tedious and time-consuming. As Linus Torvalds famously said, "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." Many of the people doing statistics are bad programmers, through no fault of their own. Doing k-fold cross-validation efficiently (and by that I mean, in a way that isn't horribly frustrating to debug and to use more than once) in R requires a vague understanding of data structures, but data structures are generally skipped in "intro to statistical programming" tutorials. It's like an older person using the Internet for the first time. It's really not hard, it just takes an extra half hour or so to figure out the first time, but it's brand new, and that makes it confusing, so it's easy to ignore.
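For concreteness, here is a minimal base-R sketch of k-fold CV (the data frame `dat` and its columns are made up; any model with a `predict` method would do in place of `lm`):

```r
set.seed(1)                                               # reproducible fold assignment
dat  <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))  # toy data
k    <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))   # random fold label per row

cv_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ ., data = dat[fold != i, ])              # train on the other k-1 folds
  pred <- predict(fit, newdata = dat[fold == i, ])        # predict the held-out fold
  mean((dat$y[fold == i] - pred)^2)                       # per-fold test MSE
})
mean(cv_mse)                                              # cross-validated MSE estimate
```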

You have questions like this: How to implement a hold-out validation in R. No offense intended, whatsoever, to the asker. But many people just are not code-literate. The fact that people are doing cross-validation at all is enough to make me happy.

It sounds silly and trivial, but this comes from personal experience, having been that guy and having worked with many people who were that guy.

$\endgroup$
11
  • 24
    $\begingroup$ Maybe as someone who has majored in CS I have a slightly skewed view on this, but if you can implement hold-out validation correctly (which already means splitting the dataset into 2 parts and using one for training and the other for testing), the only thing you need to change is the ratio of the split and put the whole thing into a loop. It just seems hard to believe that this would be a big problem. $\endgroup$
    – Voo
    Commented Jun 25, 2014 at 21:15
  • 3
    $\begingroup$ @Voo: in addition, being able to program is not enough here: you must understand the problem well enough to be able to judge which confounders you need to account for during your splitting procedure. See e.g. stats.stackexchange.com/questions/20010/…. I think I see this kind of problem more often than "pure" coding problems (although one never knows: someone who's barely able to code a plain splitting of the rows in the data matrix will usually also make the higher-level mistake of not splitting e.g. at patient level) $\endgroup$
    – cbeleites
    Commented Jun 26, 2014 at 12:21
  • $\begingroup$ Note also that you can do proper (e.g. patient/measurement day/...) hold-out splitting without any programming at all by separating the files the measurement instrument produces... $\endgroup$
    – cbeleites
    Commented Jun 26, 2014 at 12:23
  • 6
    $\begingroup$ To the up-voters: note that I asked a separate question that questions my logic. stats.stackexchange.com/q/108345/36229 $\endgroup$ Commented Jul 20, 2014 at 13:29
  • 3
    $\begingroup$ I don't think the answer to the difference between two cross-validation methods should ever be "human time to learn"; absurdly biased and not helpful $\endgroup$
    – rgalbo
    Commented Jul 26, 2017 at 13:39
61
$\begingroup$

Hold-out is often used synonymously with validation with an independent test set, although there are crucial differences between splitting the data randomly and designing a validation experiment for independent testing.

Independent test sets can be used to measure generalization performance that cannot be measured by resampling or hold-out validation, e.g. the performance for unknown future cases (= cases that are measured later, after the training is finished). This is important in order to know how long an existing model can be used for new data (think e.g. of instrument drift). More generally, this may be described as measuring the extrapolation performance in order to define the limits of applicability.

Another scenario where hold-out can actually be beneficial: it is very easy to ensure that training and test data are properly separated - much easier than for resampling validation. For example:

  1. decide splitting (e.g. do random assignment of cases)
  2. measure
  3. measurement and reference data of the training cases => modeling; neither measurements nor reference data of the test cases are handed to the person who models.
  4. final model + measurements of the held-out cases => prediction
  5. compare predictions with reference for held-out cases.

Depending on the level of separation you need, each step may be done by someone else. As a first level, not handing over any data (not even the measurements) of the test cases to the modeler allows you to be very certain that no test data leaks into the modeling process. At a second level, the final model and test case measurements could be handed over to yet another person, and so on.

In some fields/cases/applications, we consider this obvious independence sufficiently important to prescribe that an independent organization is needed for validation*, e.g. in clinical chemistry (we also do that e.g. for vehicle safety: the person who inspects your car's safety is not the same as your mechanic, and the two work for separate businesses).

(* I'm a chemometrician/analytical chemist. To me, there is not much of a conceptual difference between validating a wet-lab method and an in-silico method (aka predictive model). And the difference will become even smaller as machine learning advances into, e.g., medical diagnostics.)

Yes, you pay for that by the lower efficiency of the hold-out estimates compared to resampling validation. But I've seen many papers where I suspect that the resampling validation does not properly separate cases (in my field we have lots of clustered/hierarchical/grouped data).

I've learned my lesson on data leaks in resampling by retracting a manuscript a week after submission, when I found out that I had a leak in my splitting procedure (a typo in an index calculation) that had gone undetected by the permutation tests run alongside.

Sometimes hold-out can be more efficient than finding someone who is willing to put in the time to check the resampling code (e.g. for clustered data) in order to gain the same level of certainty about the results. However, IMHO it is usually not efficient to do this before you are at the stage where you need to measure e.g. future performance anyway (first point) - in other words, when you need to set up a validation experiment for the existing model anyway.

OTOH, in small sample size situations, hold-out is not an option: you need to hold out enough test cases so that the test results are precise enough to allow the needed conclusion (remember: 3 correct out of 3 test cases for classification means a binomial 95% confidence interval that extends well below 50:50 guessing!). Frank Harrell would point to the rule of thumb that at least ca. 100 (test) cases are needed to properly measure a proportion [such as the fraction of correctly predicted cases] with useful precision.
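To make the 3-out-of-3 example concrete, the exact (Clopper-Pearson) interval can be checked with one line of base R:

```r
binom.test(x = 3, n = 3)$conf.int   # exact 95% CI for 3 successes out of 3 trials
# approximately 0.29 to 1.00 -> the interval reaches far below the 50% guessing level
```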


Update: there are situations where proper splitting is particularly hard to achieve, and cross-validation becomes infeasible. Consider a problem with a number of confounders. Splitting is easy if these confounders are strictly nested (e.g. a study with a number of patients has several specimens of each patient and analyses a number of cells of each specimen): you split at the highest level of the sampling hierarchy (patient-wise). But you may have independent confounders which are not nested, e.g. day-to-day variation or variance caused by different experimenters running the test. You then need to make sure the split is independent for all confounders at the highest level (the nested confounders will automatically be independent). Taking care of this is very difficult if some confounders are only identified during the study, and designing and performing a validation experiment may be more efficient than dealing with splits that leave almost no data for either training or testing of the surrogate models.
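As an illustration of splitting at the highest level of the sampling hierarchy, here is a minimal base-R sketch that assigns cross-validation folds patient-wise rather than row-wise (the data frame and patient IDs are made up):

```r
set.seed(1)
# toy data: 5 measurements (rows) per patient, 30 patients
dat <- data.frame(patient = rep(1:30, each = 5), y = rnorm(150), x = rnorm(150))

k        <- 5
patients <- unique(dat$patient)
pat_fold <- sample(rep(seq_len(k), length.out = length(patients)))  # fold per *patient*
fold     <- pat_fold[match(dat$patient, patients)]                  # propagate to rows

# every row of a given patient now lands in the same fold,
# so no patient ever appears in both training and test data of a split
table(fold, dat$patient)[, 1:6]
```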

$\endgroup$
11
  • 9
    $\begingroup$ I wish I could give more than +1 for this very thorough answer. I particularly liked you mentioning your issue with a data leak as it effectively illustrates that it can be far from trivial to rule out such problems, even for experts. This is a good reality check! $\endgroup$ Commented Jun 26, 2014 at 11:17
  • $\begingroup$ Aren't you begging the question? Yes, splitting is hard, due to confounders, but it's hard regardless of whether you're doing a single hold-out validation or k-fold cross-validation, isn't it? (Thanks for an insightful answer regardless!) $\endgroup$ Commented Feb 1, 2016 at 5:51
  • 2
    $\begingroup$ @NilsvonBarth: I don't see how my arguments are circular: the OP asks "why [at all] use hold-out validation", and I give a bunch of practical reasons. The statistically most efficient use of a limited number of cases is not always the most important property of the study design. (Though in my experience it often is, due to extremely limited case numbers: I'm far more often advising for repeated/iterated k-fold CV instead of hold-out). For some confounders physical splitting is possible and easy - and a very efficient way to prevent sneak-previews. Who knows whether we'll find that doubly ... $\endgroup$
    – cbeleites
    Commented Feb 1, 2016 at 20:27
  • $\begingroup$ blinded statistical data analysis may be needed against too many false positive papers at some point? $\endgroup$
    – cbeleites
    Commented Feb 1, 2016 at 20:28
  • 3
    $\begingroup$ @NilsvonBarth: Careful with hold-out guaranteeing independence: it is easy to implement hold-out in such a way (by physical hold-out of cases, i.e. test specimen are put away and only measured after the model training is finished), but often the term hold-out is used for what is actually far more like a single random split of the data - and then all the possibilities of making mistakes in the splitting can be made with hold-out as well! $\endgroup$
    – cbeleites
    Commented Feb 4, 2016 at 16:08
23
$\begingroup$

Just wanted to add some simple guidelines that Andrew Ng mentioned in our CS 229 class at Stanford regarding cross-validation. These are the practices that he follows in his own work.

Let $m$ be the number of samples in your dataset:

  • If $m\le 20$ use Leave-one-out cross validation.

  • If $20 < m \le 100$ use k-fold cross validation with a relatively large $k \le m$ keeping in mind computational cost.

  • If $100 < m \le 1,000,000$ use regular k-fold cross validation $(k = 5)$. Or, if there is not enough computational power and $m > 10,000$, use hold-out cross validation.

  • If $m \ge 1,000,000$ use hold-out cross validation, but if computational power is available you can use k-fold cross validation $(k = 5)$ if you want to squeeze that extra performance out of your model.
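If it helps, those guidelines can be written down as a small helper function (a sketch only; `choose_validation` is a made-up name, and the thresholds are simply the ones quoted above):

```r
choose_validation <- function(m, limited_compute = FALSE) {
  if (m <= 20)  return("leave-one-out CV")
  if (m <= 100) return("k-fold CV with a relatively large k")
  if (m <= 1e6) {
    if (limited_compute && m > 1e4) return("hold-out validation")
    return("5-fold CV")
  }
  if (limited_compute) return("hold-out validation")
  "5-fold CV, if you want to squeeze out extra performance"
}

choose_validation(5e4)                          # "5-fold CV"
choose_validation(5e4, limited_compute = TRUE)  # "hold-out validation"
```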

$\endgroup$
3
  • 4
    $\begingroup$ This is a really useful answer. Does he talk about nested cross validation and provide any recommendations there? $\endgroup$
    – skeller88
    Commented Apr 17, 2020 at 23:01
  • $\begingroup$ Unfortunately it has been too long. I do not recall. $\endgroup$ Commented May 5, 2020 at 1:17
  • 1
    $\begingroup$ By hold-out cross validation, you mean a simple split, right? $\endgroup$
    – Rafs
    Commented Jul 19, 2022 at 16:23
9
$\begingroup$

It might be useful to clear up the terminology a little. If we let $k$ be some integer less than (or equal to) $n$ where $n$ is the sample size and we partition the sample into $k$ unique subsamples, then what you are calling Hold-out validation is really just 2-fold ($k$ = 2) cross-validation. Cross-validation is merely a tool for estimating the out-of-sample error rates (or generalizability) of a particular model. The need to estimate the out-of-sample error rate is a common one and has spawned an entire literature. See, for starters, chapter 7 of ESL.

So to answer the questions:

  1. Why talk about it? Pedagogically. It's worthwhile to think of Hold-out validation as a special - and only occasionally useful - case of an otherwise quite useful method with many, many variations.

  2. Why use it? If one is lucky enough to have a colossal dataset (in terms of observations, $n$), then splitting the data in half - training on one half and testing on the other - makes sense. This makes sense for computational reasons since all that is required is fitting once and predicting once (rather than $k$ times). And it makes sense from a "large-sample estimation" perspective since you have a ton of observations to fit your model to.

A rule-of-thumb I've learned is: when $n$ is large, $k$ can be small, but when $n$ is small, $k$ should be close to $n$.
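A minimal base-R sketch of the "fit once, predict once" split described in point 2 (toy data; a 50/50 split as in the half-and-half example above):

```r
set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))   # stand-in for a colossal dataset

train_idx <- sample(nrow(dat), size = nrow(dat) / 2)    # random half for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- lm(y ~ x, data = train)                         # fit once
pred <- predict(fit, newdata = test)                    # predict once
mean((test$y - pred)^2)                                 # hold-out estimate of the MSE
```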

$\endgroup$
1
  • 20
    $\begingroup$ I don't think that holdout is the same as 2 fold validation, because in 2 fold validation you will fit two models and then average out the errors across the two holdout sets. $\endgroup$
    – Alex
    Commented Aug 31, 2015 at 0:30
8
$\begingroup$

If your model selection and fitting procedure can't be coded up because it's subjective, or partly so (involving looking at graphs and the like), hold-out validation might be the best you can do. (I suppose you could perhaps use something like Mechanical Turk in each CV fold, though I've never heard of its being done.)

$\endgroup$
6
$\begingroup$

Short answer:

I would recommend always using CV with at least $k=5$ for:

  • complex models
  • final results that have to adhere to validity constraints

You might relax this for:

  • training on really large datasets
  • training simple models
  • prototyping when time is an issue

Some of you mentioned that programming this in R might be an issue. I recommend having a look at the "mlr" package. It wraps different packages in a unified interface, and also provides really advanced resampling and performance evaluation methods.

Have a look: http://mlr-org.github.io/mlr-tutorial/release/html/resample/ and: http://mlr-org.github.io/mlr-tutorial/release/html/performance/index.htm
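For example, a cross-validated and a hold-out performance estimate with mlr look roughly like this (a sketch based on the tutorial linked above; the function and measure names refer to the older mlr interface and may differ in current versions such as mlr3):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")  # define the prediction task
lrn  <- makeLearner("classif.rpart")                      # pick a learner (CART here)

cv5  <- makeResampleDesc("CV", iters = 5)                 # 5-fold cross-validation
hout <- makeResampleDesc("Holdout", split = 2/3)          # single 2/3 train, 1/3 test split

resample(lrn, task, cv5,  measures = acc)                 # CV estimate of accuracy
resample(lrn, task, hout, measures = acc)                 # hold-out estimate of accuracy
```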

Some more explanation - what CV really does is break the bias-variance tradeoff:

Now, the problem that both approaches try to solve is to estimate the generalization error, which is conditional on the data that was used to train a model.

Holdout has a problem with bias and variance:

By making the amount of data that we test on smaller, we introduce variance to our estimated generalization error, as the test data might not represent the underlying distribution very well anymore. This itself does not introduce a bias though, as in expectation the estimated performance will be correct.

Making the training set smaller however introduces a pessimistic bias, as again the underlying distribution is not well represented in the data and the model cannot fit the data as well. Making the training set very small introduces variance as well.

As the sizes of the training and test sets determine each other, this leaves us with a tradeoff: pessimistic bias vs. high variance.

$k$-fold Cross validation tackles this problem by keeping the training set large (a fraction of $\frac{k-1}{k}$ of the data is used for training in every iteration) and dealing with the variance of the test error by resampling. After all iterations, we have tested performance on every observation of the dataset with one learner. Obviously, this requires more computation time than simple holdout.

Cross-validating is especially important for more complex (high variance) learners. Those usually are more expensive computationally as well, which can make the whole process quite time intensive.

$\endgroup$
4
$\begingroup$

All these are useful comments. Just take one more into account. When you have enough data, using Hold-Out is a way to assess a specific model (a specific SVM model, a specific CART model, etc), whereas if you use other cross-validation procedures you are assessing methodologies (under your problem conditions) rather than models (SVM methodology, CART methodology, etc).

Hope this is helpful!

$\endgroup$
4
$\begingroup$

Simply put: time. With cross-validation you run the training routine k times (i.e. once for each hold-out set). If you have a lot of data, it might take many hours or even days to train the model on just one training set, so you multiply that by k when using cross-validation.

So although cross-validation is the best method, in certain circumstances it's not feasible, and the time it would take might have been better spent modeling the data in different ways, or trying out different loss functions in order to get a better model.

My personal preference is to take validation data from throughout the data set, so rather than take a single 10% chunk from the head or tail of the data, I take 2% from 5 points in the data set. That makes the validation data a bit more representative of the data as a whole.
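A minimal base-R sketch of that kind of spread-out selection (assuming `dat` is a hypothetical data frame whose rows are in their natural order; five blocks of 2% each, as described above):

```r
n          <- nrow(dat)
block_size <- round(0.02 * n)                                    # each block = 2% of the rows
starts     <- round(seq(1, n - block_size + 1, length.out = 5))  # 5 evenly spaced blocks
val_idx    <- unlist(lapply(starts, function(s) s:(s + block_size - 1)))

validation <- dat[val_idx, ]     # ~10% of the data, drawn from throughout the set
training   <- dat[-val_idx, ]
```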

$\endgroup$
1
  • 1
    $\begingroup$ Even though it's an old question and a new answer, I'm voting this up because it challenges the groundless assertion that "K-fold is more precise but SLIGHTLY more computationally expensive", which the other answers were ignoring or passing over too quickly. $\endgroup$ Commented Jun 24, 2018 at 3:12
3
$\begingroup$

Modeling with time series data is an exception for me. K-fold cannot work in some cases when you need to predict the future based on previous data. The test sets have to be future data, and you can never touch them in the training phase, e.g. when predicting sales or the stock market. Hold-out is useful in those cases.
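A minimal base-R sketch of such a time-ordered split (the data frame `dat` and its `date`, `y`, `x` columns are made up; the most recent 20% is never seen during training):

```r
dat <- dat[order(dat$date), ]              # make sure rows are in time order
cut <- floor(0.8 * nrow(dat))              # first 80% = past, last 20% = "future"

train <- dat[seq_len(cut), ]
test  <- dat[(cut + 1):nrow(dat), ]        # future data, untouched during training

fit <- lm(y ~ x, data = train)
mean((test$y - predict(fit, newdata = test))^2)   # performance on future data
```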

$\endgroup$
0
1
$\begingroup$

I'm aware this question is old, but I landed here from Google anyway, and the accepted answer isn't very satisfying, as no one needs to program CV themselves; this is handled by the corresponding libraries.

For a good answer, the terms must first be defined. My answer focuses on machine learning ("classical" as in regression, random forests, etc., not deep learning). The hold-out set or test set is the part of the labeled data set that is split off at the beginning of the model-building process. (And the best way to split, in my opinion, is by acquisition date of the data, with the newest data being the hold-out set, because that exactly mimics future use of the model.)

A crucial aspect to consider is that your model isn't just the algorithm and its parameters but the whole process you use to build it, from feature selection to parameter optimization. That is why the hold-out set gets split off at the start, so that under the above definition the model has never seen that data in any way.

k-fold cross-validation is used within your model (reminder: model = your whole pipeline), for example within parameter optimization or feature selection. You need CV here because otherwise you optimize your model for one specific data split instead of getting the more general optimization that CV provides. At the end of this pipeline you can also do another CV with the final model settings to get a rough estimate of the model's performance, but be aware that this will almost always be better than the truth, because the model has already seen the data during the model-building process. It still gives you a rough estimate and, in particular, a hint at the variance.

After you have your model, you apply it to this hold-out set, which, if done correctly, is 100% new to the model. This should give you a correct indication of your model's performance, and, as said above, it will almost always be worse than what you get with CV.
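A minimal base-R sketch of that pipeline - split off the hold-out set first, use CV only inside the training data (here to pick a polynomial degree, standing in for any parameter optimization), then score once on the untouched hold-out set; all data and names are made up:

```r
set.seed(1)
dat   <- data.frame(x = runif(300))
dat$y <- sin(2 * pi * dat$x) + rnorm(300, sd = 0.3)         # toy data

## 1) split off the hold-out set at the very start (here randomly; by date if available)
hold_idx <- sample(nrow(dat), size = round(0.25 * nrow(dat)))
holdout  <- dat[hold_idx, ]
train    <- dat[-hold_idx, ]

## 2) model building uses CV on `train` only (choosing the polynomial degree)
k    <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_mse <- sapply(1:6, function(deg) {
  mean(sapply(seq_len(k), function(i) {
    fit <- lm(y ~ poly(x, deg), data = train[fold != i, ])
    mean((train$y[fold == i] - predict(fit, train[fold == i, ]))^2)
  }))
})
best_deg <- which.min(cv_mse)

## 3) refit on all training data, then evaluate once on the hold-out set
final <- lm(y ~ poly(x, best_deg), data = train)
mean((holdout$y - predict(final, holdout))^2)               # honest performance estimate
```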

$\endgroup$
0
$\begingroup$

It should be noted that it's not always possible to apply cross-validation. Consider time-dependent datasets, where you want to use historical data to train a predictive model for future behaviour. In this case, you have to apply hold-out validation.
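If a single past/future split feels too coarse, a rolling-origin ("forward validation") scheme re-fits on an expanding window and still never looks into the future. A minimal base-R sketch (the time-ordered data frame `dat` and its `date`, `y`, `x` columns are made up):

```r
dat <- dat[order(dat$date), ]                               # time order is essential
n   <- nrow(dat)
origins <- floor(seq(0.5 * n, 0.9 * n, length.out = 5))     # 5 forecast origins

fwd_mse <- sapply(origins, function(o) {
  fit  <- lm(y ~ x, data = dat[1:o, ])                      # train on everything up to o
  test <- dat[(o + 1):min(o + 50, n), ]                     # test only on the next block
  mean((test$y - predict(fit, newdata = test))^2)
})
mean(fwd_mse)                                               # forward-validation estimate
```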

$\endgroup$
1
  • 2
    $\begingroup$ In this case, you should do forward validation. $\endgroup$
    – Neil G
    Commented Sep 8, 2018 at 16:54
0
$\begingroup$

Imagine you are predicting whether a given chemical mixture of two components will explode or not, based on the properties of the two components. A certain component A may appear in several observations: you can have it in a mixture of A+B, A+C, A+D, etc. Now, imagine that you use k-fold validation. When the model is predicting for the A+C mixture, maybe it was already trained on the observation "A+B"; therefore, it will be biased towards the output of that observation (because half of the variables of the two observations are the same: in one you have the properties of A and the properties of C, and in the other one you have the properties of A and the properties of B).

In the case described, hold-out validation would give you a far less biased result than k-fold cross-validation!
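If the hold-out split is done by component, so that test mixtures share no component with training mixtures, the leakage described above is avoided. A minimal base-R sketch of such a component-disjoint split (all data and names are made up):

```r
set.seed(1)
# toy data: each row is a mixture of two distinct components plus an outcome
comps <- LETTERS[1:10]
dat   <- expand.grid(comp1 = comps, comp2 = comps, stringsAsFactors = FALSE)
dat   <- dat[dat$comp1 < dat$comp2, ]                  # keep each unordered pair once
dat$explodes <- rbinom(nrow(dat), 1, 0.3)

# hold out whole components, not rows
test_comps <- sample(comps, 3)
test  <- dat[ dat$comp1 %in% test_comps &  dat$comp2 %in% test_comps, ]
train <- dat[!dat$comp1 %in% test_comps & !dat$comp2 %in% test_comps, ]
# mixtures pairing a training component with a held-out component are used in neither set
```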

$\endgroup$