I'm having a problem where R crashes when calling keras::unserialize_model() in a doParallel foreach loop.

I have to sanitize this code, so hopefully I don't munge anything. And I'm not an R developer; I'm trying to move some R code that someone else wrote into a production runtime environment.

If I run this code, the objects load and nothing crashes:

#unserialize models locally
my_model1 <- keras::unserialize_model(ser_model1)
my_model2 <- keras::unserialize_model(ser_model2)
my_model3 <- keras::unserialize_model(ser_model3)
my_model4 <- keras::unserialize_model(ser_model4)
my_model5 <- keras::unserialize_model(ser_model5)

and I can get to processing. But if I run this in a foreach() loop:

places <- list( of things to run )  # sanitized: the inputs to iterate over
r <- foreach(i=places, .export = c("ser_model1", "ser_model2", "ser_model3", "ser_model4", "ser_model5"),
                  .packages = c("dplyr","av","imager","jpeg","tensorflow","keras","stringr","reticulate","caTools","imagerExtra","raster","readr","gsignal","data.table")) %dopar% {
    #unserialize models locally
    my_model1 <- keras::unserialize_model(ser_model1)
    my_model2 <- keras::unserialize_model(ser_model2)
    my_model3 <- keras::unserialize_model(ser_model3)
    my_model4 <- keras::unserialize_model(ser_model4)
    my_model5 <- keras::unserialize_model(ser_model5)

    # lots of processing here
    # eventually some_results <- whatever_computation()
    
    return(some_results)
}

then the code crashes with a segfault on the keras::unserialize_model(ser_model1) call:

2024-04-06 21:32:07.768352: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-06 21:32:07.773084: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-06 21:32:07.834125: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-06 21:32:09.108273: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
1: conditionMessage_from_py_exception(c)
2: conditionMessage.python.builtin.BaseException(errorValue)
3: conditionMessage(errorValue)
4: sprintf("task %d failed - \"%s\"", errorIndex, conditionMessage(errorValue))
5: e$fun(obj, substitute(ex), parent.frame(), e$data)
6: Redacted foreach statement
7: calling_my_function_above()
8: perform_model(inputs)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)

Here is my session info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] dplyr_1.1.4  rjson_0.2.21 hash_2.2.6.3 DBI_1.2.2    odbc_1.4.2

loaded via a namespace (and not attached):
 [1] utf8_1.2.4       R6_2.5.1         tidyselect_1.2.1 bit_4.0.5
 [5] magrittr_2.0.3   glue_1.7.0       bspm_0.5.5.1     blob_1.2.4
 [9] tibble_3.2.1     pkgconfig_2.0.3  generics_0.1.3   bit64_4.0.5
[13] lifecycle_1.0.4  cli_3.6.2        fansi_1.0.6      vctrs_0.6.5
[17] hms_1.1.3        pillar_1.9.0     Rcpp_1.0.12
[21] rlang_1.1.3

As above, removing the foreach() seems to let the code progress. I've changed the number of threads from 8 to 2, and I've tried paring down the code as much as possible. The issue seems to be the call for my_model1: if I comment it out (and leave the other unserialize_model() calls), the code proceeds without causing a segfault.

Maybe the "Could not find TensorRT" warning is an issue, but since the other calls work without a problem, I have to believe it isn't. (Is it?)

How can I learn what's special about ser_model1 that causes the crash? Why is the "task failed" message shown in the call stack never actually printed? It seems like it would give some insight. And how can I debug a segfault in R when so many other libraries and dependencies are involved?
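One way to answer the "what's special about ser_model1" question without crashing your main session is to unserialize each model in a brand-new R process, so a segfault kills only that child. This is a sketch, not part of the original post: it assumes the `callr` package is installed, and reuses the `ser_model1`..`ser_model5` objects from above.

```r
# Probe each serialized model in a fresh R session; a crash in the child
# surfaces as a catchable error here instead of killing this session.
library(callr)

check_model <- function(ser_model) {
  tryCatch({
    callr::r(
      function(m) {
        model <- keras::unserialize_model(m)  # runs in a clean process
        "ok"
      },
      args = list(m = ser_model)
    )
  }, error = function(e) conditionMessage(e))
}

results <- lapply(
  list(ser_model1, ser_model2, ser_model3, ser_model4, ser_model5),
  check_model
)
print(results)  # any entry that is not "ok" points at the problem model
```

If all five come back "ok" in clean processes, that would suggest the crash is about process state (e.g. forking after TensorFlow is initialized) rather than about the model data itself.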

  • You should say that one of those packages is crashing. I'd guess it's foreach, but it's easy to be wrong about this, so I'll avoid all of them. Commented Apr 6 at 22:18

1 Answer


Forking a process that has initialized TensorFlow is not safe. TensorFlow maintains its own thread pool, and forking the main process leads to segfaults in the child. This is true regardless of whether you use the Python or the R interface. Additionally, if you're using a GPU with CUDA, the CUDA runtime is also not fork-safe, as far as I know.

Therefore, any parallelization approach in R that forks the current R process will not work. This includes foreach with a forking backend, mclapply, and similar.
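If that diagnosis is right, a backend that launches fresh worker processes instead of forking may avoid the crash. A minimal sketch, assuming the same `places` and `ser_model*` objects from the question (on Linux, `parallel::makeCluster()` creates PSOCK workers by default, whereas `registerDoParallel(cores = n)` forks):

```r
library(foreach)
library(doParallel)

# PSOCK workers are freshly spawned R sessions, not forked copies,
# so each one initializes TensorFlow on its own.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

r <- foreach(i = places,
             .export   = c("ser_model1", "ser_model2", "ser_model3",
                           "ser_model4", "ser_model5"),
             .packages = c("keras")) %dopar% {
  # First use of keras/TensorFlow happens inside the clean worker,
  # so there is no inherited TF state to corrupt.
  my_model1 <- keras::unserialize_model(ser_model1)
  # ... rest of the per-task processing ...
}

parallel::stopCluster(cl)
```

The trade-off is that each worker pays the full TensorFlow startup and model-unserialization cost, so it's usually worth unserializing once per worker rather than once per task if the task count is large.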

  • Thanks. Is this limitation documented anywhere?
    – MikeB
    Commented Apr 8 at 21:33
