37

Decision trees seem to be a very understandable machine learning method. Once created, a tree can be easily inspected by a human, which is a great advantage in some applications.

What are the practical weaknesses of Decision Trees?


4 Answers

40

Here are a couple I can think of:

  • They can be extremely sensitive to small perturbations in the data: a slight change can result in a drastically different tree (see the sketch after this list).
  • They can easily overfit. This can be mitigated by validation methods and pruning, but this is a grey area.
  • They can have problems with out-of-sample prediction (this is related to them being non-smooth).
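To make the first point concrete, here is a minimal sketch, assuming scikit-learn (none of the answers name a library): the same tree settings are fit to the full data and to the data with a small random fraction of rows dropped, and the printed structures can be compared.

```python
# Minimal sketch of the instability point (scikit-learn assumed):
# refitting after dropping ~2% of the rows can change the chosen splits.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target
rng = np.random.default_rng(0)

keep = rng.random(len(y)) > 0.02  # drop roughly 2% of the rows at random
tree_full = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_perturbed = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[keep], y[keep])

# Compare the printed structures: split variables and thresholds often differ.
print(export_text(tree_full, feature_names=list(data.feature_names)))
print(export_text(tree_perturbed, feature_names=list(data.feature_names)))
```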

Some of these are related to the problem of multicollinearity: when two variables both explain the same thing, a decision tree will greedily choose the best one, whereas many other methods will use them both. Ensemble methods such as random forests can mitigate this to a certain extent, but you lose the ease of understanding.
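A small sketch of the multicollinearity point (again assuming scikit-learn; the data are synthetic): with two nearly identical predictors, the single greedy tree leans almost entirely on one of them, while a random forest that considers one random candidate feature per split spreads the importance across both.

```python
# Sketch (scikit-learn assumed): two near-duplicate predictors carrying the same signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # near-duplicate of x1
X = np.column_stack([x1, x2])
y = (x1 > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, max_features=1, random_state=0).fit(X, y)

# The single tree puts ~all importance on one column; the forest splits it.
print("single tree importances:  ", tree.feature_importances_.round(2))
print("random forest importances:", forest.feature_importances_.round(2))
```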

However, the biggest problem, from my point of view at least, is the lack of a principled probabilistic framework. Many other methods have things like confidence intervals, posterior distributions, etc., which give us some idea of how good a model is. A decision tree is ultimately an ad hoc heuristic, which can still be very useful (they are excellent for finding the sources of bugs in data processing), but there is the danger of people treating the output as "the" correct model (from my experience, this happens a lot in marketing).
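One common (if still ad hoc) workaround, also raised in the comments below, is to bootstrap the tree and look at the spread of its predictions; a rough sketch assuming scikit-learn:

```python
# Bootstrap spread of tree predictions: a rough uncertainty measure,
# not a principled probabilistic model (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

preds = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))                 # bootstrap resample
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
    preds.append(tree.predict_proba(X[:5])[:, 1])              # class-1 probability, first 5 rows

preds = np.array(preds)
print("bootstrap mean:       ", preds.mean(axis=0).round(2))
print("bootstrap 2.5%/97.5%:\n", np.percentile(preds, [2.5, 97.5], axis=0).round(2))
```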

6
  • 2
    From an ML point of view, trees can be tested in the same way as any other classifier (with CV, for instance). Still, that rather shows that heavy overfitting has happened ;-) Also, RF escapes multicollinearity not because it is an ensemble, but because its trees are suboptimal.
    – user88
    Commented Aug 5, 2010 at 16:48
  • 2
    For a probabilistic framework for decision trees, see DTREE (url: datamining.monash.edu.au/software/dtree/index.shtml), which is based on the paper "Wallace C.S. & Patrick J.D., `Coding Decision Trees', Machine Learning, 11, 1993, pp. 7-22".
    – emakalic
    Commented Aug 6, 2010 at 1:17
  • 3
    Also, isn't it possible to get CIs (for the predictions) using bootstrapping?
    – Tal Galili
    Commented Sep 6, 2010 at 14:58
  • @Simon Byrne, I have a question regarding your comment "However the biggest problem, from my point of view at least, is the lack of a principled probabilistic framework". Forgive my ignorance, but could you please point me to some practical principled probabilistic frameworks (specifically in the context of classification)? I am highly interested in this limitation of decision trees. Commented Jul 5, 2011 at 21:27
  • 2
    @AmV, one example would be logistic regression: we can use the fact that each observation comes from a binomial to obtain confidence/credible intervals and check the assumptions of the model. Commented Jul 15, 2011 at 10:55
27

One disadvantage is that all terms are assumed to interact. That is, you can't have two explanatory variables that behave independently. Every variable in the tree is forced to interact with every variable further up the tree. This is extremely inefficient if there are variables that have no or weak interactions.
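A toy illustration of this point, assuming scikit-learn: the target below is purely additive in two predictors, which a linear model represents directly, while the tree has to approximate the additive structure with nested (i.e., interacting) splits and therefore uses the data less efficiently.

```python
# Additive target, y = x1 + x2 + noise: a linear model matches the structure
# directly; a tree must build it out of interacting splits (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

for name, model in [("linear regression", LinearRegression()),
                    ("decision tree", DecisionTreeRegressor(max_depth=4, random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.2f}")
```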

5
  • I wonder if this is a practical limitation, though--for a variable that only weakly influences classification, my intuition is that the tree won't likely split on that variable (i.e., it won't become a node), which in turn means it's invisible as far as Decision Tree classification goes.
    – doug
    Commented Aug 5, 2010 at 12:52
  • I am talking about weak interactions, not weak effects on classification. An interaction is a relationship between two of the predictor variables. Commented Aug 5, 2010 at 13:03
  • 3
    This may be inefficient, but the tree structure can handle it.
    – user88
    Commented Aug 5, 2010 at 13:12
  • That's why I said inefficient rather than biased or incorrect. If you have loads of data, it doesn't matter much. But if you fit a tree to a few hundred observations, then the assumed interactions can greatly reduce the predictive accuracy. Commented Aug 5, 2010 at 13:15
  • 2
    Agree; I just wanted to highlight it. Still, I think the reduction in predictive accuracy can be removed by proper training; in phylogenetics a similar problem (greediness) is reduced by Monte Carlo scanning of the possible tree space to find maximum-likelihood trees -- I don't know whether there is a similar approach in stats; probably no one has been bothered by this problem to such an extent.
    – user88
    Commented Aug 5, 2010 at 16:55
13

There are good answers here, but I am surprised that one thing has not been emphasized. CART does not make any distributional assumptions about the data, particularly the response variable. In contrast, OLS regression (for continuous response variables) and logistic regression (for certain categorical response variables), for example, do make strong assumptions: OLS regression assumes the response is conditionally normally distributed, and logistic regression assumes the response is binomial or multinomial.

CART's lack of such assumptions is a double-edged sword. When those assumptions are not warranted, this gives the approach a relative advantage. On the other hand, when those assumptions hold, more information can be extracted from the data by taking those facts into account. That is, standard regression methods can be more informative than CART when the assumptions are true.
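A quick sketch of this double-edged sword, assuming scikit-learn: the data below are generated exactly under the OLS assumptions (linear mean, Gaussian noise), and OLS extracts more from them than a tree of moderate depth.

```python
# When the OLS assumptions hold, the standard regression model is more
# informative/accurate than CART on the same data (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p = 400, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=1.0, size=n)    # linear mean, Gaussian noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("OLS", LinearRegression()),
                    ("CART", DecisionTreeRegressor(max_depth=5, random_state=0))]:
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {mse:.2f}")
```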

0
12

My answer is directed at CART (and related implementations such as C4.5/C5.0), though I don't think the limitations are restricted to it. My guess is that this is what the OP has in mind--it's usually what someone means when they say "Decision Tree."

Limitations of Decision Trees:


Low Performance

By 'performance' I don't mean resolution, but execution speed. The reason it is poor is that you need to 'redraw the tree' every time you wish to update your CART model--data classified by an already-trained tree, which you then want to add to the tree (i.e., use as training data), requires that you start over: training instances cannot be added incrementally, as they can be for most other supervised learning algorithms. Perhaps the best way to state this is that Decision Trees cannot be trained in online mode, only in batch mode. Obviously you won't notice this limitation if you never update your classifier, but then I would expect you to see a drop in resolution over time.

This is significant because a Multi-Layer Perceptron, for instance, can begin classifying data once it is trained, and that data can also be used to 'tune' the already-trained classifier; with Decision Trees, you need to retrain on the entire data set (the original training data plus any new instances).
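A minimal sketch of this batch-versus-online contrast, assuming scikit-learn (the data are random and purely illustrative): the tree exposes no incremental-update API, so new instances force a full refit, whereas the Multi-Layer Perceptron mentioned above can be updated in place via partial_fit.

```python
# Tree: batch-only retraining. MLP: incremental update (scikit-learn assumed).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)
X_new, y_new = rng.normal(size=(50, 4)), rng.integers(0, 2, size=50)

# Batch-only: adding instances means refitting the tree on old + new data.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Online-capable: an already-trained MLP can be updated with just the new batch.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)
mlp.fit(X_old, y_old)
mlp.partial_fit(X_new, y_new)
```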


Poor Resolution on Data With Complex Relationships Among the Variables

Decision Trees classify by step-wise assessment of a data point of unknown class, one node at a time, starting at the root node and ending with a terminal node. At each node only two outcomes are possible (go left or go right), so there are variable relationships that Decision Trees struggle to capture.
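A sketch of one such relationship, assuming scikit-learn: the true class boundary below is the diagonal x1 > x2, which axis-aligned threshold splits can only approximate with a staircase, so shallow trees do noticeably worse than the problem's simplicity would suggest.

```python
# Diagonal (oblique) boundary vs. axis-aligned splits (scikit-learn assumed).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)     # class depends on a diagonal line

for depth in (2, 4, 8):
    acc = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                          X, y, cv=5).mean()
    print(f"max_depth={depth}: CV accuracy = {acc:.3f}")
```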


Practically Limited to Classification

Decision Trees work best when they are trained to assign a data point to a class--preferably one of only a few possible classes. I don't believe I have ever had any success using a Decision Tree in regression mode (i.e., continuous output, such as price, or expected lifetime revenue). This is not a formal or inherent limitation but a practical one. Most of the time, Decision Trees are used for prediction of factors or discrete outcomes.


Poor Resolution With Continuous Expectation Variables

Again, in principle it's fine to have independent variables like "download time" or "number of days since the previous online purchase"--just change your splitting criterion to variance (it's usually Information Entropy or Gini Impurity for discrete variables)--but in my experience Decision Trees rarely work well in these instances. Exceptions are cases like "student's age", which looks continuous but in practice the range of values is quite small (particularly if the ages are reported as integers).
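For concreteness, here is a minimal sketch of the "variance as splitting criterion" remark, assuming scikit-learn (where that criterion is called squared_error and is used when the tree predicts a continuous quantity): the fitted function is piecewise constant, which is one source of the coarse resolution described above.

```python
# Variance-based splitting on a continuous quantity (scikit-learn assumed).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))                  # e.g. a continuous "download time"
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

# "squared_error" = within-node variance, instead of entropy/Gini.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0).fit(X, y)

# Nearby inputs often land in the same leaf, so the output is a step function.
print(reg.predict([[1.0], [1.1], [1.2]]))
```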

1
  • 1
    +1 for the good call on the performance angle, which usually doesn't get enough play. I've seen Decision Trees run into performance issues on several software platforms designed for large datasets (such as SQL Server), at least in comparison to other data mining methods. This is aside from the whole retraining issue you brought up. It seems to worsen in cases where overfitting occurs (although that can be said of many other mining algorithms). Commented May 31, 2016 at 10:50
