
I am working in Python on a random forest regression for predicting a target variable. I have trained and tested it on real data, obtaining satisfactory results. Now I would like to explore different possible scenarios to understand how the target variable would change as the other variables change. Can I test the RF model on synthetic data if I trained it on real data?

I have attempted to generate this simulated data by multiplying some variables of the real test dataset by factors I chose myself; for example, increasing variables A and C by 10%.

Is this approach of mixing real data for training and simulated data for testing acceptable?
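
To make this concrete, here is roughly what I am doing. This is a minimal sketch: the random data stands in for my real dataset, and the column names "A", "B", "C", the model settings, and the 10% factor are just illustrative.

```python
# Minimal sketch of the scenario approach; random data stands in for the
# real dataset, and the column names "A", "B", "C" are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["A", "B", "C"])
y = 2 * X["A"] + X["B"] - 0.5 * X["C"] + rng.normal(scale=0.1, size=500)

# Train and test on the (stand-in for) real data
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_train, y_train)

# "What-if" scenario: increase variables A and C by 10% in the test data
X_scenario = X_test.copy()
X_scenario[["A", "C"]] *= 1.10

print("Mean prediction, baseline:     ", rf.predict(X_test).mean())
print("Mean prediction, A and C +10%: ", rf.predict(X_scenario).mean())
```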

  • This sounds like a great idea: to get a sense of how the model would behave under various alternative scenarios, since future data might not look like past data. I think we could describe this as a "sensitivity analysis". Commented Jul 4 at 14:14
  • Thank you so much, I did not want to use an approach that did not statistically make sense! Commented Jul 8 at 9:23

1 Answer


This is generally good practice for understanding many machine learning models, and some complex statistical models. It's especially valuable as a way to evaluate how the model extrapolates. In the case of standard random forests, though, extrapolation is very crude and this approach probably won't tell you much (see Decision Trees and Regression - Can predicted values be outside range of training data?). However, if you are not extrapolating, it's still a good way to examine how predictors jointly affect the response; essentially, this is a way of recreating a partial dependence plot in multiple dimensions.
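
If you want the partial dependence view directly, scikit-learn can compute it for you. A minimal sketch, continuing from the fitted model `rf` and training frame `X_train` in the question's example (the feature names "A" and "C" are illustrative, not part of any real dataset):

```python
# Sketch of a two-way partial dependence plot with scikit-learn; `rf` is a
# fitted RandomForestRegressor and X_train a DataFrame containing columns
# "A" and "C" (illustrative names from the sketch in the question).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Passing a tuple of two features produces a 2-D partial dependence plot,
# i.e. the model's average prediction over a grid of (A, C) values.
PartialDependenceDisplay.from_estimator(rf, X_train, features=[("A", "C")])
plt.show()
```

Your manual scenario approach generalizes this to more than two variables at once, which is exactly where it becomes useful.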

