
I am having a strange problem. I have successfully run this code on my laptop, but when I try to run it on another machine, I first get the warning Distribution not specified, assuming bernoulli ..., which I expect, but then I get this error: Error in object$var.levels[[i]] : subscript out of bounds

library(gbm)
gbm.tmp <- gbm(subxy$presence ~ btyme + stsmi + styma + bathy,
                data=subxy,
                var.monotone=rep(0, length= 4), n.trees=2000, interaction.depth=3,
                n.minobsinnode=10, shrinkage=0.01, bag.fraction=0.5, train.fraction=1,
                verbose=F, cv.folds=10)

Can anybody help? The data structures are exactly the same, the code is the same, and the R version is the same. I am not even using a subscript here.

EDIT: traceback()

6: predict.gbm(model, newdata = my.data, n.trees = best.iter.cv)
5: predict(model, newdata = my.data, n.trees = best.iter.cv)
4: predict(model, newdata = my.data, n.trees = best.iter.cv)
3: gbmCrossValPredictions(cv.models, cv.folds, cv.group, best.iter.cv, 
       distribution, data[i.train, ], y)
2: gbmCrossVal(cv.folds, nTrain, n.cores, class.stratify.cv, data, 
       x, y, offset, distribution, w, var.monotone, n.trees, interaction.depth, 
       n.minobsinnode, shrinkage, bag.fraction, var.names, response.name, 
       group)
1: gbm(subxy$presence ~ btyme + stsmi + styma + bathy, data = subxy, var.monotone = rep(0, length = 4), n.trees = 2000, interaction.depth = 3, n.minobsinnode = 10, shrinkage = 0.01, bag.fraction = 0.5, train.fraction = 1, verbose = F, cv.folds = 10)

Could it have something to do with the fact that I moved the saved R workspace to another machine?

EDIT 2: OK, so I have updated the gbm package on the machine where the code was working, and now I get the same error. So at this point I am thinking that the older gbm package perhaps did not have this check in place, or that the newer version has some problem. I don't understand gbm well enough to say.

  • (1) It may not be the source of your problem, but your formula shouldn't use $; just use presence ~ .... (2) One thing to check is that both machines have R set up the same way; for instance, check stringsAsFactors.
    – joran
    Commented Sep 5, 2013 at 15:53
  • Where is this subxy data frame coming from? If it's your own data, please provide some sample data that reproduces the problem. A traceback() of where the error occurs would also be useful. Commented Sep 5, 2013 at 15:54
  • The default distribution for gbm is "bernoulli", so if you have an outcome with more than two levels, wouldn't you expect it to throw an error?
    – IRTFM
    Commented Sep 5, 2013 at 17:57
  • @joran I checked both, and they have no effect on the issue. Commented Sep 5, 2013 at 19:48

2 Answers


Just a hunch, since I can't see your data, but I believe that error occurs when you have factor levels in the test set that don't exist in the training set.

This can easily happen when you have a factor variable with a high number of levels, or when one level has a low number of instances.

Since you're using CV folds, it's possible that the holdout set in one of the loops contains levels that are foreign to the training data.

I'd suggest either:

A) using model.matrix() to one-hot encode your factor variables, or

B) setting different seeds until you get a CV split in which this error doesn't occur.
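If it helps, here is a minimal sketch of option A. The data frame and column names below are made up for illustration (they are not from the question's subxy data): model.matrix() expands a factor into one 0/1 indicator column per level, so the model only ever sees numeric columns and a holdout fold can no longer present an unseen level name at predict time.

```r
# Sketch of option A with made-up data: one-hot encode the factor x2.
d <- data.frame(y  = c(1, 0, 1, 0),
                x1 = c(0.5, -1.2, 0.3, 0.8),
                x2 = factor(c("a", "b", "c", "a")))

# "- 1" drops the intercept so every level of x2 gets its own indicator column
X <- model.matrix(~ x1 + x2 - 1, data = d)
colnames(X)  # "x1" "x2a" "x2b" "x2c"

# Recombine with the response; the model can then be fit on purely numeric columns
d_encoded <- data.frame(y = d$y, X)
```

Because the encoding is done on the full data before splitting, every fold sees the same fixed set of columns.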

EDIT: Yep, with that traceback, your 3rd CV holdout has a factor level in its test set that doesn't exist in the training data, so the predict function sees a foreign value and doesn't know what to do.

EDIT 2: Here's a quick example to show what I mean by "factor levels in the test set that aren't in the training set":

#Example data with low occurrences of a factor level:

set.seed(222)
data = data.frame(cbind( y = sample(0:1, 10, replace = TRUE), x1 = rnorm(10), x2 = as.factor(sample(0:10, 10, replace = TRUE))))
data$x2 = as.factor(data$x2)
data

      y         x1 x2
 [1,] 1 -0.2468959  2
 [2,] 0 -1.2155609  6
 [3,] 0  1.5614051  1
 [4,] 0  0.4273102  5
 [5,] 1 -1.2010235  5
 [6,] 1  1.0524585  8
 [7,] 0 -1.3050636  6
 [8,] 0 -0.6926076  4
 [9,] 1  0.6026489  3
[10,] 0 -0.1977531  7

#CV fold.  This splits a model to be trained on 80% of the data, then tests against the remaining 20%.  This is a simpler version of what happens when you call gbm's CV fold.

CV_train_rows = sample(1:10, 8, replace = FALSE) ; CV_test_rows = setdiff(1:10, CV_train_rows)
CV_train = data[CV_train_rows,] ; CV_test = data[CV_test_rows,]

#build a model on the training... 

CV_model = lm(y ~ ., data = CV_train)
summary(CV_model)
#note here: as the model has been built, it was only fed factor levels (3, 4, 5, 6, 7, 8) for variable x2

CV_test$x2
#in the test set, there are only levels 1 and 2.

#attempt to predict on the test set
predict(CV_model, CV_test)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
factor x2 has new levels 1, 2
  • Thanks for the answer; it's a bit over my head, and I am not sure I understand all of it. Why does the same function work on the other computer? I never get this error there. It's a bit strange. I don't want to modify the CV parameter. Commented Sep 5, 2013 at 20:04
  • Please see EDIT 2 in the answer and check whether that makes sense. Thank you. Commented Sep 5, 2013 at 20:24
  • So I can confirm that gbm works after deactivating the CV fold. Maybe it's a bug in the package? It was working with the previous version. Any cv.folds value higher than 1 gives this error, so it occurs any time CV is used. Commented Sep 6, 2013 at 10:25
  • Hi dylanjf, would you be able to share an example of using model.matrix to encode a factor variable, please?
    – Eugene Yan
    Commented Apr 4, 2015 at 3:02

I encountered the same problem and ended up solving it by changing one of the hidden functions in the gbm package, called predict.gbm. This function predicts on the testing set using the gbm object trained on the training set from the cross-validation split.

The problem is that the testing set passed to it should only have the columns corresponding to the features, so you should modify the function accordingly.
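As a rough illustration of that fix (the data frame and column names below are made up, and this is not gbm's actual internal code), the idea is to subset the data frame to just the feature columns before it is handed to predict:

```r
# Illustrative sketch, not gbm's real internals: drop non-feature columns
# (such as the response) before the data reaches predict().
df <- data.frame(presence = c(0, 1, 1),       # response column
                 btyme    = c(1.2, 0.4, 0.9), # feature
                 bathy    = c(10, 20, 30))    # feature

features  <- c("btyme", "bathy")              # the model's feature names
test_data <- df[, features, drop = FALSE]     # only feature columns survive

names(test_data)  # "btyme" "bathy"
```

With the response and any extra columns removed, the column indices inside the prediction routine line up with the model's stored variable levels.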

  • "The problem is that the testing set passed to it should only have the columns corresponding to the features, so you should modify the function." Thanks! This tripped me up for a long time this morning. Commented Jul 25, 2017 at 8:11
