I've started using missForest as a possible replacement for rfImpute. While testing it on both synthetic and real data with the different parallelization strategies it offers, I came across a strange error that occurs only with parallelize = 'variables' and only with some real data-sets.

Suppose I start with a data-set input.miss that has no missing data, introduce some NAs into it, and then try imputation with missForest using each option for parallelize:

# load packages
library(parallel)
library(doParallel)
library(missForest)

# set-up parallel backend
ncl <- max(2, floor(detectCores() * 0.85)) # number of workers: about 85% of the cores, at least 2
clst <- makePSOCKcluster(ncl)
registerDoParallel(cl = clst)

# create synthetic data (skip when working with real data)
syn_num <- as.data.frame(matrix(rnorm(20*7,20,10),ncol=7))
syn_cat <- as.data.frame(matrix(sample(LETTERS[1:15], size=20*7, replace = TRUE), ncol=7))
for (colname in colnames(syn_cat)) {
    syn_cat[,colname] <- as.factor(syn_cat[,colname]) # turn columns into factors
    }
input.miss <- cbind.data.frame(syn_num, syn_cat)

# introduce NAs
input.miss <- prodNA(input.miss, noNA = 0.2)

# impute data
input.imputed <- missForest(input.miss, parallelize = 'no')[["ximp"]] # works with all data-sets
input.imputed <- missForest(input.miss, parallelize = 'forests')[["ximp"]]  # works with all data-sets
input.imputed <- missForest(input.miss, parallelize = 'variables')[["ximp"]] # fails with some real data-sets

# release the worker processes when done
stopCluster(clst)

Note: the parallel back-end set-up has fewer cores than my data-set has variables.

With some real data-sets I get the following error, but only for parallelize = 'variables' (the other two options work without problems):

Error in `[<-.data.frame`(`*tmp*`, misi, res$varInd, value = c(1L, 1L,  : 
  replacement has 20 rows, data has 6

I do not understand what the error message is trying to convey. The only clue I have is that the data.frame input.miss has 20 rows: when I draw fewer or more rows from the original data, the number in "replacement has ... rows" always equals the number of rows of the data-set. The error never occurs with synthetically created dummy data.

Does anyone have an idea what might be going on?
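For reference, the same message can be produced in base R without missForest. The sketch below uses a hypothetical toy data frame (toy, misi and the values are made up for illustration) and assigns a replacement vector as long as the whole data frame into only the rows that are NA. It shows what the message means in general and is not a claim about what missForest does internally.

```r
# Hypothetical toy data: 20 rows, 6 of them missing in column x
toy <- data.frame(x = c(rep(NA, 6), 1:14))
misi <- is.na(toy$x)                      # logical index of the 6 missing rows

# Assign 20 replacement values into 6 selected rows of a single column
msg <- tryCatch({
  toy[misi, "x"] <- rep(1L, nrow(toy))    # replacement is too long on purpose
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

Here "replacement has 20 rows" is the length of the value being assigned and "data has 6" is the number of rows selected by the index, so in the error above the 6 would be the number of rows selected by misi.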

Below is some extra info on the synthetic and real data-sets that work / don't work, respectively:

real data (cannot be parallelized using variables):

> str(input.miss)
'data.frame':   20 obs. of  14 variables:
 $ V1 : num  33.6 33.7 33.7 33.8 34.5 ...
 $ V2 : Factor w/ 6 levels "X-C7-03",..: 1 1 1 NA 2 1 3 3 1 1 ...
 $ V3 : Factor w/ 4 levels "YZ03-1_A","YZ03-1_B",..: 1 1 1 1 1 1 1 3 1 1 ...
 $ V4 : int  10 13 17 25 12 17 5 NA 23 NA ...
 $ V5 : Factor w/ 3 levels "(empty)","R-TS-XZS500_03_03_A_1",..: 1 NA 1 1 1 1 NA 1 NA 1 ...
 $ V6 : Factor w/ 3 levels "(empty)","ONDEMAND",..: 1 1 NA 1 1 1 1 1 1 1 ...
 $ V7 : int  4 3 17 2 NA 12 19 21 1 15 ...
 $ V8 : int  NA 8 24 3 18 NA 21 23 25 24 ...
 $ V9 : int  NA 23 2 5 23 15 10 8 22 19 ...
 $ V10: Factor w/ 3 levels "(empty)","T-AV-PR640B-2F-S-MT-LAY0-TFGDU-04E",..: 1 2 1 2 2 NA 2 NA 2 NA ...
 $ V11: int  17 NA 11 14 20 NA NA 20 9 1 ...
 $ V12: Factor w/ 3 levels "(empty)","S-RF-PR640B-NT-S-03-LAY0-TFGDU-01V",..: 2 1 2 1 1 1 1 1 2 1 ...
 $ V13: num  232 231 228 NA 230 ...
 $ V14: Factor w/ 3 levels "E-F-HJ3-NT-S-01V",..: 1 NA 1 NA 1 NA 1 1 NA 1 ...

synthetic data (can be parallelized using variables):

> str(input.miss)
'data.frame':   20 obs. of  14 variables:
 $ V1: num  18.6 34.2 20.9 11.5 13.6 ...
 $ V2: num  8.75 NA 15.33 13.18 38.98 ...
 $ V3: num  14 29.47 9.64 26.91 20.93 ...
 $ V4: num  6.9 6.67 39.43 11.96 25.37 ...
 $ V5: num  5.45 36.48 25.39 19.86 21.47 ...
 $ V6: num  20.6 14.5 38.5 NA 12.1 ...
 $ V7: num  21.96 -6.04 13.39 NA NA ...
 $ V1: Factor w/ 11 levels "A","B","C","D",..: 1 11 6 NA 8 5 NA 1 11 5 ...
 $ V2: Factor w/ 10 levels "B","C","E","F",..: 1 1 8 2 1 6 10 8 8 4 ...
 $ V3: Factor w/ 10 levels "A","B","C","D",..: 10 4 NA 7 1 NA 3 5 10 10 ...
 $ V4: Factor w/ 10 levels "B","D","G","H",..: NA NA 3 2 9 7 NA NA 10 10 ...
 $ V5: Factor w/ 12 levels "A","B","C","D",..: 2 11 2 NA 9 6 NA 10 7 NA ...
 $ V6: Factor w/ 12 levels "A","B","D","E",..: 12 6 NA 5 1 10 4 6 3 NA ...
 $ V7: Factor w/ 10 levels "A","B","C","F",..: NA 9 4 2 1 NA 6 3 9 6 ...
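One check that may help tell the two data-sets apart: the value = c(1L, 1L, ... in the error looks like factor codes that are all 1, so it could be worth counting how many levels are actually observed (non-NA) in each factor column after the NAs are introduced. This is only a sketch. observed_levels is a hypothetical helper, the demo data frame is made up, and whether a single observed level is what trips parallelize = 'variables' is an assumption, not something confirmed here.

```r
# Hypothetical helper: number of levels actually present (ignoring NA)
# in each factor column of a data frame
observed_levels <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  vapply(df[is_fac], function(col) nlevels(droplevels(col)), integer(1))
}

# Made-up example: 'a' declares two levels but only one is observed
demo <- data.frame(
  a = factor(c("x", NA, "x", "x"), levels = c("x", "y")),
  b = factor(c("p", "q", NA, "p")),
  n = c(1, 2, NA, 4)
)

res <- observed_levels(demo)
print(res)
```

On the real data, observed_levels(input.miss) would list the factor columns, and any column reporting 1 would be a candidate to look at more closely.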
    Not really an answer to my question, but I found that missRanger does not show this problem, so it is a valid alternative in my case. This did the trick for me: input.imputed <- missRanger(input.miss, maxiter = 2, num.trees = 100, verbose = 0, splitrule = "extratrees")
    – MarkH
    Commented Jun 2 at 15:02
