23
$\begingroup$

My input variables are on very different scales: some take decimal values while others are in the hundreds. Is it essential to center (subtract the mean) or scale (divide by the standard deviation) these input variables, in order to make the data dimensionless, when using a random forest?

$\endgroup$

2 Answers

42
$\begingroup$

No.

Random Forests are based on tree partitioning algorithms.

As such, there's no analogue to the coefficients one obtains in general regression strategies, which would depend on the units of the independent variables. Instead, one obtains a collection of partition rules, each basically a decision given a threshold, and this shouldn't change with scaling. In other words, the trees only see ranks in the features.

Basically, any strictly monotonic transformation of your data shouldn't change how the forest partitions the training data at all (in the most common implementations); only the numeric values of the thresholds change.
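
To make the rank argument concrete, here is a minimal sketch in plain Python. It is not a random forest: it searches for a single variance-minimising split on one feature, which is the basic step a regression tree repeats. The function `best_split` and the toy data are made up for illustration. The same search is run on the raw feature, its standardized version, and its logarithm.

```python
import math
import statistics


def sum_sq(ys, idx):
    """Sum of squared deviations of ys[idx] from their own mean."""
    mean = sum(ys[i] for i in idx) / len(idx)
    return sum((ys[i] - mean) ** 2 for i in idx)


def best_split(xs, ys):
    """Hypothetical helper: exhaustive search for the single threshold on xs
    that minimises the total within-group sum of squares of ys.
    Returns (threshold, set of indices sent to the left child)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        left, right = order[:k], order[k:]
        sse = sum_sq(ys, left) + sum_sq(ys, right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, frozenset(left))
    return best[1], best[2]


# Toy data (assumed): one feature mixing decimals and hundreds.
xs = [0.5, 120.0, 3.2, 450.0, 0.9, 87.0, 15.0, 300.0]
ys = [1.0, 5.1, 1.2, 9.0, 0.8, 4.9, 2.0, 8.7]

mu, sd = statistics.mean(xs), statistics.stdev(xs)
xs_std = [(x - mu) / sd for x in xs]   # centered and scaled
xs_log = [math.log(x) for x in xs]     # another increasing transform

thr_raw, part_raw = best_split(xs, ys)
thr_std, part_std = best_split(xs_std, ys)
thr_log, part_log = best_split(xs_log, ys)

print("thresholds:", thr_raw, thr_std, thr_log)
print("same partition:", part_raw == part_std == part_log)
```

The thresholds differ because they are expressed in different units, but the same observations end up on each side of the split. One caveat: because the threshold here is placed at the midpoint between two neighbouring values, a nonlinear transform such as the logarithm can move that cut point relative to unseen data, so predictions for new points falling between two training values are not guaranteed to be identical.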

Also, decision trees are usually robust to numerical instabilities that sometimes impair convergence and precision in other algorithms.

$\endgroup$
2
$\begingroup$

Overall I agree with Firebug, but there could be some value in standardizing your variables if you're interested in predictor importance scores. RF will tend to favour highly variable continuous predictors because there are more opportunities to partition the data. A better way to deal with this issue, however, is to use particular approaches (i.e., conditional forests with sampling without replacement) that are more robust to this bias. See Strobl, Boulesteix, Zeileis & Hothorn (2007), "Bias in random forest variable importance measures: Illustrations, sources and a solution", BMC Bioinformatics 8:25, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25

$\endgroup$
1
  • 4
    $\begingroup$ Welcome to the site. We are trying to build a permanent repository of high-quality statistical information in the form of questions & answers. Thus, we're wary of link-only answers, due to linkrot. Can you post a full citation & a summary of the information at the link, in case it goes dead? $\endgroup$ Commented Jul 3, 2019 at 20:58
