
My doubt about overfitting is fairly general, but in this particular case it concerns survival models. I am working on a case-cohort study, estimating hazard ratios (HRs) in a cohort where heart attacks are the cases (56 individuals) and the rest are healthy controls (192).

We perform a Cox regression to estimate the HRs of different covariates; I am especially interested in potential molecular biomarkers and their diagnostic efficiency. Based on our population and the descriptive statistics, we ruled out several covariates (related to cardiovascular disease), but we are still left with 9 or 10 variables that would be useful to include as predictors (including the microRNA). Looking at the distributions of these variables (age, sex, weight, total cholesterol, diabetes status, smoking status, ...), we could rule out more of them if that is advisable.

I came from here, where there is a similar discussion (with a different study design) in which Prof. Harrell suggests a method that I don't fully understand. It seems the rule of thumb of 10 cases per predictor variable is not advisable. So my question is about alternatives: methods to estimate when I am overfitting the model, in this case a Cox regression model (extra information on other models is always welcome, but is not the aim of this thread).

# the model

```r
library(survival)

## The argument names (subcoh, id, cohort.size, method = "LinYing")
## match survival::cch(), so that call is assumed here; HDL-c is
## renamed HDL_c to be a valid R variable name.
fit <- cch(
  Surv(timetoevent, heart_attack) ~ age + sex + HDL_c + diabetes + smoking +
    batch + weight + total_cholesterol + biomarker_of_interest,
  data = datab, subcoh = ~subdata, id = ~ids,
  cohort.size = 5404, method = "LinYing", robust = TRUE
)
```

  • Go with the paper by Richard Riley et al. – Commented Apr 25 at 12:32
  • I went through it, but it was hard to understand. – Commented Apr 29 at 7:29
  • In reference to this, is there any other publication worth mentioning? I'm going to go for it. – Commented May 2 at 7:13

1 Answer

With that many candidate predictors and only 56 events, you are very likely to overfit a survival model.

If you want to test for overfitting you might adapt the "optimism bootstrap" method that Prof. Harrell uses in the validate() function of his rms package. Repeat your modeling approach on multiple bootstrapped samples of the data, and evaluate how well the linear predictor from each of those models fits both the corresponding bootstrap sample and the full data set. The full data set then represents the underlying population, and the bootstrap samples represent the process of sampling from that population and then modeling. The average difference in performance thus provides an estimate of the "optimism" in the original model due to overfitting. There's a brief outline on this page.
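As a sketch of what that looks like in practice: if you set the case-cohort weighting aside and fit the same formula as an ordinary Cox model, rms gives you the optimism bootstrap directly via cph() plus validate(). The variable names below come from your model; everything else is illustrative, not a drop-in replacement for the cch() fit.

```r
library(rms)

## Fit the same formula as a plain Cox model with rms::cph().
## x = TRUE, y = TRUE stores the design matrix and response, which
## validate() needs for resampling. (This ignores the case-cohort
## design, so treat the result as a rough check, not a final estimate.)
fit <- cph(Surv(timetoevent, heart_attack) ~ age + sex + HDL_c + diabetes +
             smoking + batch + weight + total_cholesterol +
             biomarker_of_interest,
           data = datab, x = TRUE, y = TRUE)

## 200 bootstrap resamples: each index is reported as apparent
## performance, average performance over the resamples, the optimism
## (the difference), and the optimism-corrected value. A corrected
## c-index can be read off as Dxy/2 + 0.5.
validate(fit, method = "boot", B = 200)
```

A large gap between the apparent and corrected indices (for example, a calibration slope well below 1) is the signal that your modeling strategy overfits at this sample size.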

If you want to avoid overfitting, consider more aggressive data reduction to combine multiple predictors into a smaller number of predictors (without examining associations with outcome). Or use ridge regression to penalize the coefficients of predictors that aren't of primary importance. This paper illustrates that approach.
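For the ridge route, here is a minimal sketch with glmnet (alpha = 0 gives a pure ridge penalty for the Cox family). The penalty.factor argument lets you leave the biomarker of interest unpenalized, so that only the covariates that aren't of primary importance are shrunk. Again, this ignores the case-cohort weighting, and the names follow your model.

```r
library(glmnet)
library(survival)

## Model matrix of the candidate predictors (drop the intercept column);
## factors such as sex, diabetes, and smoking are expanded to dummies.
x <- model.matrix(~ age + sex + HDL_c + diabetes + smoking + batch +
                    weight + total_cholesterol + biomarker_of_interest,
                  data = datab)[, -1]
## Recent glmnet versions accept a Surv object as the Cox response.
y <- Surv(datab$timetoevent, datab$heart_attack)

## Ridge Cox regression: alpha = 0 shrinks all coefficients, except that
## penalty.factor = 0 exempts the biomarker of interest from the penalty.
pf <- rep(1, ncol(x))
pf[colnames(x) == "biomarker_of_interest"] <- 0

cv <- cv.glmnet(x, y, family = "cox", alpha = 0, penalty.factor = pf)
coef(cv, s = "lambda.min")
```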
