I've started using missForest to potentially replace rfImpute. While testing with both synthetic and real data and the different parallelization strategies offered by missForest, I came across a strange error which only occurs when using parallelize='variables' AND only with some real data-sets.
Suppose I start with a data-set input.miss which has no missing data, then introduce some NAs into it and try imputation by missForest with all flavours of parallelize:
# load packages
library(parallel)
library(doParallel)
library(missForest)
# set-up parallel backend
ncl <- max(2,floor(detectCores()*0.85)) #number of cores
clst <- makePSOCKcluster(names = ncl)
registerDoParallel(cl = clst)
# create synthetic data (skip when working with real data)
syn_num <- as.data.frame(matrix(rnorm(20*7,20,10),ncol=7))
syn_cat <- as.data.frame(matrix(sample(LETTERS[1:15], size=20*7, replace = TRUE), ncol=7))
for (colname in colnames(syn_cat)) {
  syn_cat[,colname] <- as.factor(syn_cat[,colname]) # turn columns into factors
}
input.miss <- cbind.data.frame(syn_num, syn_cat)
# introduce NAs
input.miss <- prodNA(input.miss, noNA = 0.2)
# impute data
input.imputed <- missForest(input.miss, parallelize = 'no')[["ximp"]] # works with all data-sets
input.imputed <- missForest(input.miss, parallelize = 'forests')[["ximp"]] # works with all data-sets
input.imputed <- missForest(input.miss, parallelize = 'variables')[["ximp"]] # fails with some real data-sets
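
To see at a glance which strategy fails on a given data-set, the three calls above can be wrapped so that an error is recorded instead of aborting the script. This is only a minimal sketch: the helper name try_strategies is made up for illustration, and it takes the imputation call as a function so it does not depend on any particular package.

```r
# Hypothetical helper (not part of missForest): call `impute_fun` once per
# strategy and record either "ok" or the error message.
try_strategies <- function(impute_fun,
                           strategies = c("no", "forests", "variables")) {
  outcome <- vapply(strategies, function(s) {
    tryCatch({
      impute_fun(s)   # result is discarded, only success/failure is kept
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1), USE.NAMES = FALSE)
  data.frame(parallelize = strategies, outcome = outcome,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Usage with the objects from this post (needs missForest and the registered
# back-end from above):
# try_strategies(function(s) missForest(input.miss, parallelize = s))
# stopCluster(clst)  # release the workers when done
```

Running this on several row subsets of the real data would show whether the failure is tied to particular rows or columns.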
Note: the parallel back-end is set up with fewer cores than my data-set has variables.
With some real data-sets I get the following error only for parallelize = 'variables' (the other two flavours work without problems):
Error in `[<-.data.frame`(`*tmp*`, misi, res$varInd, value = c(1L, 1L, :
replacement has 20 rows, data has 6
I do not understand what the error message is trying to convey. The only clue I have is that the data.frame input.miss has 20 rows, and when I modify it to draw fewer or more rows from the original data, the row count of the data-set always shows up as the value in "replacement has ... rows" in the error message. It never happens with synthetically created dummy data.
Has anyone got an idea what might be going on?
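
For reference, the same kind of message can be provoked in base R without missForest. The sketch below is only one possible reading of the error and is not taken from the missForest source; the names toy, misi and full_length are made up. It assigns a vector with one value per row of the data (20) into only the rows selected by a logical index (6).

```r
# Toy data frame with 20 rows and a single factor column.
toy <- data.frame(x = factor(rep(c("a", "b"), 10)))

# Logical row index that selects 6 of the 20 rows.
misi <- rep(c(TRUE, FALSE), c(6, 14))

# A replacement with one value per row of `toy` (20 values).
full_length <- toy$x

# 20 values into 6 selected rows: the data.frame assignment refuses.
msg <- tryCatch({
  toy[misi, 1] <- full_length
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# 6 values into 6 selected rows: works.
toy[misi, 1] <- full_length[misi]
```

Read this way, "replacement has 20 rows" would be the length of the value being written and "data has 6" the number of rows selected by the index, which would match the observation that the first number follows the row count of the data-set.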
Below is some extra info on the real and synthetic data-sets that don't work / work, respectively:
real data (cannot be parallelized using variables):
> str(input.miss)
'data.frame': 20 obs. of 14 variables:
$ V1 : num 33.6 33.7 33.7 33.8 34.5 ...
$ V2 : Factor w/ 6 levels "X-C7-03",..: 1 1 1 NA 2 1 3 3 1 1 ...
$ V3 : Factor w/ 4 levels "YZ03-1_A","YZ03-1_B",..: 1 1 1 1 1 1 1 3 1 1 ...
$ V4 : int 10 13 17 25 12 17 5 NA 23 NA ...
$ V5 : Factor w/ 3 levels "(empty)","R-TS-XZS500_03_03_A_1",..: 1 NA 1 1 1 1 NA 1 NA 1 ...
$ V6 : Factor w/ 3 levels "(empty)","ONDEMAND",..: 1 1 NA 1 1 1 1 1 1 1 ...
$ V7 : int 4 3 17 2 NA 12 19 21 1 15 ...
$ V8 : int NA 8 24 3 18 NA 21 23 25 24 ...
$ V9 : int NA 23 2 5 23 15 10 8 22 19 ...
$ V10: Factor w/ 3 levels "(empty)","T-AV-PR640B-2F-S-MT-LAY0-TFGDU-04E",..: 1 2 1 2 2 NA 2 NA 2 NA ...
$ V11: int 17 NA 11 14 20 NA NA 20 9 1 ...
$ V12: Factor w/ 3 levels "(empty)","S-RF-PR640B-NT-S-03-LAY0-TFGDU-01V",..: 2 1 2 1 1 1 1 1 2 1 ...
$ V13: num 232 231 228 NA 230 ...
$ V14: Factor w/ 3 levels "E-F-HJ3-NT-S-01V",..: 1 NA 1 NA 1 NA 1 1 NA 1 ...
synthetic data (can be parallelized using variables):
> str(input.miss)
'data.frame': 20 obs. of 14 variables:
$ V1: num 18.6 34.2 20.9 11.5 13.6 ...
$ V2: num 8.75 NA 15.33 13.18 38.98 ...
$ V3: num 14 29.47 9.64 26.91 20.93 ...
$ V4: num 6.9 6.67 39.43 11.96 25.37 ...
$ V5: num 5.45 36.48 25.39 19.86 21.47 ...
$ V6: num 20.6 14.5 38.5 NA 12.1 ...
$ V7: num 21.96 -6.04 13.39 NA NA ...
$ V1: Factor w/ 11 levels "A","B","C","D",..: 1 11 6 NA 8 5 NA 1 11 5 ...
$ V2: Factor w/ 10 levels "B","C","E","F",..: 1 1 8 2 1 6 10 8 8 4 ...
$ V3: Factor w/ 10 levels "A","B","C","D",..: 10 4 NA 7 1 NA 3 5 10 10 ...
$ V4: Factor w/ 10 levels "B","D","G","H",..: NA NA 3 2 9 7 NA NA 10 10 ...
$ V5: Factor w/ 12 levels "A","B","C","D",..: 2 11 2 NA 9 6 NA 10 7 NA ...
$ V6: Factor w/ 12 levels "A","B","D","E",..: 12 6 NA 5 1 10 4 6 3 NA ...
$ V7: Factor w/ 10 levels "A","B","C","F",..: NA 9 4 2 1 NA 6 3 9 6 ...
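
The two str() outputs differ in several ways (integer columns and factors with few levels in the real data, repeated column names V1 to V7 in the synthetic data), so a per-column summary may help narrow down which property matters. The following is a rough sketch in base R; describe_columns is a hypothetical helper, not a missForest function, and it makes no claim about which difference triggers the error.

```r
# Hypothetical helper: one row per column with properties that differ
# between the real and the synthetic data-set.
describe_columns <- function(df) {
  data.frame(
    column    = names(df),
    class     = vapply(df, function(col) class(col)[1], character(1),
                       USE.NAMES = FALSE),
    n_missing = vapply(df, function(col) sum(is.na(col)), integer(1),
                       USE.NAMES = FALSE),
    # declared factor levels (NA for non-factor columns)
    n_levels  = vapply(df, function(col) {
      if (is.factor(col)) nlevels(col) else NA_integer_
    }, integer(1), USE.NAMES = FALSE),
    # levels that actually occur among the non-missing values
    n_observed_levels = vapply(df, function(col) {
      if (is.factor(col)) nlevels(droplevels(col)) else NA_integer_
    }, integer(1), USE.NAMES = FALSE),
    # TRUE where a column name occurs more than once
    duplicated_name = duplicated(names(df)) |
      duplicated(names(df), fromLast = TRUE),
    row.names = NULL, stringsAsFactors = FALSE
  )
}

# Usage: describe_columns(input.miss)
```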
missRanger instead of missForest does not show this problem, so in my case it is a valid alternative. The following did the trick for me:
input.imputed <- missRanger(input.miss, maxiter = 2, num.trees = 100, verbose = 0, splitrule = "extratrees")