119
$\begingroup$

I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model.

It seems to me that by binning the variable we lose information.

  • Is this just so we can model non-linear effects?
  • If we kept the variable continuous and it wasn't really a straight linear relationship would we need to come up with some kind of curve to best fit the data?
$\endgroup$
9
  • 15
    $\begingroup$ 1) No. You are right that binning loses information. It should be avoided if possible. 2) Generally, the curve function that is consistent with theory behind the data is preferred. $\endgroup$
    – O_Devinyak
    Commented Aug 31, 2013 at 6:27
  • 11
    $\begingroup$ I don't know about benefits, but there are a number of widely-recognized dangers $\endgroup$
    – Glen_b
    Commented Aug 31, 2013 at 8:28
  • 3
    $\begingroup$ A reluctant argument for it, on occasion: It can simplify clinical interpretation and the presentation of results - eg. blood pressure is often a quadratic predictor and a clinician can support the use of cutoffs for low, normal and high BP and may be interested in comparing these broad groups. $\endgroup$
    – user20650
    Commented Aug 31, 2013 at 16:30
  • 4
    $\begingroup$ @user20650: I'm not quite sure I understood you, but wouldn't it be better to fit the best model you can, & then use that model's predictions to say anything you want to say about broad groups? The 'high blood-pressure group' in my study won't necessarily have the same distribution of pressures as the general population, so their results won't generalize. $\endgroup$ Commented Sep 2, 2013 at 18:25
  • 8
    $\begingroup$ The simplified clinical interpretation is a mirage. Effects estimates from categorized continuous variables have no known interpretation. $\endgroup$ Commented Jun 30, 2015 at 23:04

8 Answers

98
+50
$\begingroup$

You're right on both counts. See Frank Harrell's page here for a long list of problems with binning continuous variables. If you use a few bins you throw away a lot of information in the predictors; if you use many you tend to fit wiggles in what should be a smooth, if not linear, relationship, & use up a lot of degrees of freedom. Generally better to use polynomials ($x + x^2 + \ldots$) or splines (piecewise polynomials that join smoothly) for the predictors. Binning's really only a good idea when you'd expect a discontinuity in the response at the cut-points—say the temperature something boils at, or the legal age for driving—& when the response is flat between them.

The value?—well, it's a quick & easy way to take curvature into account without having to think about it, & the model may well be good enough for what you're using it for. It tends to work all right when you've lots of data compared to the number of predictors, & each predictor is split into plenty of categories; in this case, within each predictor band the range of response is small & the average response is precisely determined.
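To make the trade-off concrete, here's a small simulation (my own illustrative sketch, with made-up data): a smooth quadratic response is fitted once with a cubic polynomial in the predictor and once with quintile-bin means, and the residual errors compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 10, n)
y = 0.1 * x**2 + rng.normal(0, 0.3, n)  # smooth, non-linear truth

# Polynomial predictor: x + x^2 + x^3
coefs = np.polyfit(x, y, deg=3)
mse_poly = np.mean((y - np.polyval(coefs, x)) ** 2)

# Quintile binning: each observation is predicted by its bin's mean response
edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(x, edges)                 # bin index 0..4 for each x
bin_means = np.array([y[bins == b].mean() for b in range(5)])
mse_bin = np.mean((y - bin_means[bins]) ** 2)

print(mse_poly, mse_bin)  # binning leaves substantially more residual error
```

Here the polynomial's error is close to the irreducible noise variance, while the five-level step function pays for the within-bin variation of the true curve.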

[Edit in response to comments:

Sometimes there are standard cut-offs used within a field for a continuous variable: e.g. in medicine blood pressure measurements may be categorized as low, medium or high. There may be many good reasons for using such cut-offs when you present or apply a model. In particular, decision rules are often based on less information than goes into a model, & may need to be simple to apply. But it doesn't follow that these cut-offs are appropriate for binning the predictors when you fit the model.

Suppose some response varies continuously with blood pressure. If you define a high blood pressure group as a predictor in your study, the effect you're estimating is the average response over the particular blood-pressures of the individuals in that group. It's not an estimate of the average response of people with high blood pressure in the general population, or of people in the high blood pressure group in another study, unless you take specific measures to make it so. If the distribution of blood pressure in the general population is known, as I imagine it is, you'll do better to calculate the average response of people with high blood pressure in the general population based on predictions from the model with blood pressure as a continuous variable. Crude binning makes your model only approximately generalizable.

In general, if you have questions about the behaviour of the response between cut-offs, fit the best model you can first, & then use it to answer them.]
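That difference can be simulated (a hypothetical sketch; the blood-pressure distributions and effect size are made up): fit the response on continuous blood pressure in a study sample recruited with unusually high pressures, then average the model's predictions over the high-BP members of a differently distributed reference population, and compare with the naive in-sample group mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical study sample, recruited with higher blood pressures than usual
bp_study = rng.normal(160, 10, 500)
response = 0.02 * bp_study + rng.normal(0, 0.5, 500)

# Model with blood pressure kept continuous (simple least squares)
slope, intercept = np.polyfit(bp_study, response, deg=1)

# Assumed known blood-pressure distribution in the general population
bp_pop = rng.normal(120, 15, 100_000)
high = bp_pop > 140  # conventional "high BP" cutoff

# Average predicted response of high-BP people in the *population* ...
pop_high_mean = np.mean(slope * bp_pop[high] + intercept)
# ... versus the naive in-sample "high BP group" average
study_high_mean = response[bp_study > 140].mean()

print(study_high_mean, pop_high_mean)  # the in-sample group mean is higher
```

The two numbers disagree because the high-BP people in the study have a different spread of pressures than the high-BP people in the population; only the model-based average generalizes.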

[With regard to presentation; I think this is a red herring:

(1) Ease of presentation doesn't justify bad modelling decisions. (And in the cases where binning is a good modelling decision, it doesn't need additional justification.) Surely this is self-evident. No-one ever recommends taking an important interaction out of a model because it's hard to present.

(2) Whatever kind of model you fit, you can still present its results in terms of categories if you think it will aid interpretation. Though ...

(3) You have to be careful to make sure it doesn't aid mis-interpretation, for the reasons given above.

(4) It's not in fact difficult to present non-linear responses. Personal opinion, clearly, & audiences differ; but I've never seen a graph of fitted response values versus predictor values puzzle someone just because it's curved. Interactions, logits, random effects, multicollinearity, ...—these are all much harder to explain.]

[An additional point brought up by @Roland is the exactness of the measurement of the predictors; he's suggesting, I think, that categorization may be appropriate when they're not especially precise. Common sense might suggest that you don't improve matters by re-stating them even less precisely, & common sense would be right: MacCallum et al (2002), "On the Practice of Dichotomization of Quantitative Variables", Psychological Methods, 7, 1, pp17–19.]

$\endgroup$
3
  • 7
    $\begingroup$ Excellent comments on a pervasive issue. It's important to propagandise for thoroughly quantitative thinking here. There is already too much emphasis on crossing thresholds, e.g. above some level disaster, below some level comfort. $\endgroup$
    – Nick Cox
    Commented Sep 4, 2013 at 10:54
  • 21
    $\begingroup$ I would challenge anyone to show a validation of any cutoffs used by physicians. $\endgroup$ Commented Sep 4, 2013 at 11:38
  • 1
    $\begingroup$ It's worth noting that this binning approach does have some benefits in other areas - it's particularly popular when combined with large neural nets for predicting multi-modal distributions such as vehicle orientation. See arxiv.org/abs/1612.00496 for example. $\endgroup$
    – N. McA.
    Commented May 27, 2018 at 10:43
16
$\begingroup$

A part of this answer that I've learned since asking is that binning and not binning seek to answer two slightly different questions: what is the incremental change in the data? and what is the difference between the lowest and the highest?

Not binning says "this is a quantification of the trend seen in the data" and binning says "I don't have enough information to say how much this changes by each increment, but I can say that the top is different from the bottom".

$\endgroup$
7
$\begingroup$

As previous posters have mentioned, it is generally best to avoid dichotomizing a continuous variable. However, in answer to your question, there are instances where dichotomizing a continuous variable does confer advantages.

For instance, a given variable may contain missing values for a significant proportion of the population, yet be known to be highly predictive, with the missing values themselves bearing predictive value. For example, in a credit scoring model, consider a variable, say average revolving-credit balance (which, granted, is not technically continuous, but in this case mirrors a normal distribution closely enough to be treated as such), which contains missing values for about 20% of the applicant pool in a given target market. Here the missing values represent a distinct class--those who don't have an open revolving-credit line; these customers will display entirely different behavior compared to, say, those with available revolving credit lines who regularly carry no balance. If these missing values were instead discarded, or imputed, it could restrict the model's predictive ability.

Another benefit of dichotomization: it can be used to mitigate the effects of significant outliers that skew coefficients but represent realistic cases that need to be handled. If the outliers don't differ greatly in outcome from values in the nearest percentiles, but skew the parameters enough to affect marginal accuracy, then it may be beneficial to group them with values displaying similar effects.
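As a sketch of how such missing-as-its-own-class binning might look (illustrative only; the balances and quartile cut-points are made up), each applicant gets a bin code, with "no open revolving line" as its own category and the extreme balance folded into the top bin:

```python
import numpy as np

# Hypothetical feature: average revolving-credit balance; np.nan means
# the applicant has no open revolving-credit line (~20% of the pool)
balance = np.array([1200.0, 0.0, np.nan, 5300.0, 250.0, np.nan, 99000.0, 800.0])

# Quartile bins from the observed values; "missing" kept as its own class (code 0)
observed = balance[~np.isnan(balance)]
edges = np.quantile(observed, [0.25, 0.5, 0.75])
codes = np.where(np.isnan(balance), 0, np.digitize(balance, edges) + 1)

print(codes)  # no-credit-line rows get code 0; the outlier shares the top bin
```

Note the 99000 outlier lands in the same top bin as 5300, which is the outlier-mitigation effect described above.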

Sometimes a distribution naturally lends itself to a set of classes, in which case dichotomization will actually give you a higher degree of accuracy than a continuous function.

Also, as previously mentioned, depending on the audience, the ease of presentation can outweigh the losses to accuracy. To use credit scoring again as an example, in practice the high degree of regulation does make a practical case for discretizing at times. While the higher degree of accuracy could help the lender cut losses, practitioners must also consider that models need to be easily understood by regulators (who may request thousands of pages of model documentation) and consumers who, if denied credit, are legally entitled to an explanation of why.

It all depends on the problem at hand and the data, but there are certainly cases where dichotomization has its merits.

$\endgroup$
2
  • $\begingroup$ Dichotomization is putting into two bins - do you mean discretization? $\endgroup$ Commented Oct 14, 2014 at 14:02
  • 3
    $\begingroup$ In both of your first two examples, discretization is trying to bluff its way into the party by latching on to a bona fide guest. Don't be fooled. (1) If you want to model not having an open revolving credit line as a distinct class just use a dummy variable to indicate that condition & assign any constant value for average revolving credit balance. (2) If you want to treat certain extreme predictor values identically, as "big" or "small", truncate them; no need to muck about with the rest of the values. The 3rd case is uncontested - feel free to add examples. $\endgroup$ Commented Oct 14, 2014 at 14:35
6
$\begingroup$

As a clinician I think the answer depends on what you want to do. If you want to make the best fit or make the best adjustment you can use continuous and squared variables.

If you want to describe and communicate complicated associations for a non-statistically oriented audience, the use of categorised variables is better, accepting that you may introduce some slight bias in the last decimal place. I prefer to use at least three categories to show nonlinear associations. The alternative is to produce graphs and predicted results at certain points; then you may need to produce a family of graphs for each continuous covariate that may be interesting. If you are afraid of introducing too much bias, I think you can test both models and see whether the difference is important or not. You need to be practical and realistic.

I think we may realise that in many clinical situations our calculations are not based on exact data; when I prescribe a medicine to an adult, for instance, I do not do so with exact mg per kg anyway (the analogy with the choice between surgery and medical treatment is just nonsense).

$\endgroup$
4
  • 1
    $\begingroup$ Why exactly is the analogy nonsense? Because categorizing continuous variables never produces significantly worse models? Or because using a significantly worse model never has any practical consequences? $\endgroup$ Commented Sep 4, 2013 at 11:22
  • 11
    $\begingroup$ That is simply not the case @Roland. Estimates obtained from cutoffs are only simple because people do not understand what the estimates estimate. That is because they do not estimate a scientific quantity, i.e., a quantity that has meaning outside the sample or experiment. For example the high:low odds ratio or mean difference will increase if you add patients with ultra-high or ultra-low values to the dataset. Also, the use of cutoffs implies that biology is discontinuous, which is not the case. $\endgroup$ Commented Sep 4, 2013 at 11:41
  • $\begingroup$ @Scortchi Changing from medical to surgical treatment because it is easier to explain (is it really?) would be like replacing age with height as explanatory variable. $\endgroup$
    – Roland
    Commented Sep 5, 2013 at 7:57
$\begingroup$ I agree about avoiding dichotomised variables. Clinical medicine is not a rocket science where the last decimal is important. In the models I work with, the results only change in the last decimal if I use categories of age rather than age as continuous and squared variables, but categorisation increases the understanding and communicability of the associations enormously. $\endgroup$
    – Roland
    Commented Sep 5, 2013 at 8:12
5
$\begingroup$

I'm a committed fan of Frank Harrell's advice that analysts should resist premature discretization of continuous data. And I have several answers on CV and SO that demonstrate how to visualize interactions between continuous variables, since I think that is an even more valuable line of investigation. However, I also have real-world experience in the medical world of the barriers to adhering to this advice. There are often attractive divisions that both clinicians and non-clinicians expect for "splits". The conventional "upper limit of normal" is one such "natural" split point. One is essentially first examining the statistical underpinning of a relation and then communicating the substance of the findings in terms that your audience expects and can easily comprehend. Despite my "allergy" to barplots, they are exceedingly common in scientific and medical discourse. So the audience is likely to have a ready-made cognitive pattern to process them and will be able to integrate the results into their knowledge base.

Furthermore, the graphical display of modeled interactions among non-linear forms of predictor variables requires presentations of contour plots or wireframe displays which most of the audience will have some difficulty in digesting. I have found the medical and general public more receptive to presentations that have discretized and segmented results. So I suppose the conclusion is that splitting is properly done after the statistical analysis is complete; and is done in the presentation phase.

$\endgroup$
4
$\begingroup$

Many times binning continuous variables comes with an uneasy feeling of causing damage due to information lost. However, not only can you bound the information loss, you can sometimes gain information and get other advantages.

If you use binning and get categorised variables, you might be able to apply learning algorithms that are not applicable to continuous variables. Your dataset might be better suited to one of these algorithms, so that is the first benefit.

The idea of estimating the loss due to binning is based on the paper "PAC learning with irrelevant attributes". Suppose our concept is binary, so we can split the samples into positives and negatives. For each pair of a negative and a positive sample, the difference in concept might be explained by a difference in one of the features (or otherwise it is not explainable by the given features). The set of feature differences is the set of possible explanations for the difference in concept, and hence the data to use to determine the concept. If we bin and still get the same set of explanations for the pairs, we didn't lose any information we needed (with respect to learning algorithms that work by such comparisons). If our categorisation is very strict, we will probably have a smaller set of possible explanations, but we will be able to measure accurately how much we lose and where. That will enable us to trade off the number of bins against the set of explanations.

So far we have seen that we might not lose by categorisation, but if we consider applying such a step we would like to benefit. Indeed, we can benefit from categorisation:

Many learning algorithms, when asked to classify a sample with values not seen in the training set, will treat those values as "unknown". Hence we get an "unknown" bin that includes all values not seen during training (or not seen often enough). For such algorithms, differences involving pairs of unknown values can't be used to improve classification. Compare your pairs after binning to the pairs with unknowns and see whether your binning is useful and you actually gained.

You can estimate how common unknown values will be by checking the value distribution of each feature. Features where values that appear only a few times make up a considerable part of the distribution are good candidates for binning. Note that in many scenarios you will have many features with unknown values, increasing the probability that a sample will contain an unknown value. Algorithms that use all or many of the features are prone to error in such situations.

A. Dhagat and L. Hellerstein, "PAC learning with irrelevant attributes", in Proceedings of the IEEE Symposium on Foundations of Computer Science, 1994. http://citeseer.ist.psu.edu/dhagat94pac.html

$\endgroup$
3
$\begingroup$

If a variable has an effect at a specific threshold, creating a new variable by binning it is a good thing to do. I always keep both variables, the original and the binned one, and check which is the better predictor.
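A sketch of that check (my own toy example, assuming the true effect really is a step at a known threshold): simulate a response that jumps at x = 6, then compare simple least-squares fits using the continuous variable versus the binned one.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 2000)
y = 2.0 * (x > 6) + rng.normal(0, 0.5, 2000)  # true effect is a step at x = 6

x_binned = (x > 6).astype(float)  # binned version of the predictor

def fit_mse(feature):
    """Residual mean squared error of a one-variable least-squares fit."""
    slope, intercept = np.polyfit(feature, y, deg=1)
    return np.mean((y - (slope * feature + intercept)) ** 2)

mse_continuous = fit_mse(x)
mse_binned = fit_mse(x_binned)

print(mse_continuous, mse_binned)  # here the binned predictor fits better
```

With a genuine discontinuity the binned variable wins; with a smooth relationship the comparison would go the other way, which is exactly what keeping both variables lets you discover.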

$\endgroup$
2
$\begingroup$

I just want to add something to the discussion: normally I would also tend not to bin the predictor variables, as I've learned that losing information is not much appreciated, and sometimes dangerous.

However, with a massive amount of data, fitting a model with continuous predictors can become frustratingly slow. Yet the model error with binning tends to be near the error of a model with continuous predictors, while accuracy stays sharp.

https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf p.4 ff. and p.7

And with the rise of histogram gradient-boosting classifiers and regressors, binning continuous predictors into categorical features may have some use, if the data are big enough. My own experiments showed me that this holds true for newer experimental versions of these GBDTs.
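What these histogram GBDTs do internally can be sketched in a few lines (an illustration of the idea only, not LightGBM's or scikit-learn's actual code): each feature is pre-binned into at most 255 roughly equal-count bins, so tree split finding scans a few hundred bin boundaries instead of every raw value.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(0.0, 1.0, 10_000)  # a skewed continuous feature

n_bins = 255
# Inner quantile edges give (at most) n_bins roughly equal-count bins
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
codes = np.digitize(x, edges).astype(np.uint8)  # one byte per value

# Split finding now only has to scan n_bins candidate thresholds
print(codes.min(), codes.max())
```

The bin codes fit in a single byte, which is where the memory and speed gains on large data come from.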

It is correct that when binning we do not know the exact rise per unit, as we would in a regression (as cjthompson pointed out previously). But with permutation importance we at least know what is essential to a model. Thus, even if you do not agree, you have to admit that on large-scale data a pre-scan of your data with these algorithms, even if you prefer continuous predictors, may be enlightening or something to consider.

Ke et al. 2017, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 31st Conference on Neural Information Processing Systems.

$\endgroup$
