
I want to fit multiple models to each participant in my data. I want to parallelize this by applying the fitting function in parallel to the data of multiple participants. For one of the models, the fitting will reliably kill workers for random participants. The crashes are uniformly distributed over time, roughly one in every 20 processes; I don't know why and no longer care. I want to just work around it by redoing the fitting for that participant, which - when done manually - works perfectly fine, because the issue is random, not systematic. However, I am struggling to understand how to do this with future/parallel/parallelly.

I am doing something like this:

future::plan(future::multisession, workers = 4)
future.apply::future_lapply(
  unique(data$participant),
  function(x) myfitfunc(dplyr::filter(data, participant == x), additional_args = "many")
)

This setup runs fine until at some point it tells me that the connection to a worker was lost and that no process is running at that PID, i.e. my fitting function crashed the process and killed the worker.

What I would like to happen at that point is that the dead worker is replaced by a living one, the participant's data is used to fit the model again (which will most likely not crash this time), and the whole process otherwise just continues uninterrupted.

I have tried to figure out how to do that with parallelly::isNodeAlive() and parallelly::cloneNode(), but it is not clear to me how to integrate them with future_lapply. Especially since, if I ran on 100 workers, one of them would die near-constantly and stop the whole process; simply restarting the entire run each time would mean everything gets re-initialized very frequently.

Should I wrap isNodeAlive() and cloneNode() in a tryCatch statement somewhere? But where? Inside myfitfunc? Around myfitfunc? Around future_lapply?
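One workaround I am considering (a sketch only, untested; myfitfunc and data are as above, launch() is a hypothetical helper I made up): skip future_lapply, launch one future per participant in waves of at most n_workers, catch the FutureError that value() raises when a worker has died, re-queue just that participant, and rebuild the worker pool between waves. The assumption here is that resetting the plan while no futures are unresolved safely replaces any dead nodes.

```r
library(future)

n_workers <- 4
plan(multisession, workers = n_workers)

# Hypothetical helper: fit one participant in its own future.
# seed = TRUE in case myfitfunc uses random numbers.
launch <- function(x) {
  future(myfitfunc(data[data$participant == x, ], additional_args = "many"),
         seed = TRUE)
}

queue   <- as.list(unique(data$participant))
results <- list()

while (length(queue) > 0) {
  # Take one "wave" of at most n_workers participants, so each
  # worker runs at most one future per wave.
  wave  <- queue[seq_len(min(n_workers, length(queue)))]
  queue <- queue[-seq_len(length(wave))]

  fs <- lapply(wave, launch)

  crashed <- FALSE
  for (i in seq_along(fs)) {
    # A crashed worker surfaces here as a FutureError.
    res <- tryCatch(value(fs[[i]]), FutureError = function(e) e)
    if (inherits(res, "FutureError")) {
      crashed <- TRUE
      queue   <- c(queue, wave[i])  # retry this participant later
    } else {
      results[[as.character(wave[[i]])]] <- res
    }
  }

  # All futures of this wave are now resolved or failed, so it
  # should be safe to rebuild the pool if any node died.
  if (crashed) {
    plan(sequential)
    plan(multisession, workers = n_workers)
  }
}
```

The obvious downside is that each wave waits for its slowest participant, but it avoids the all-workers-eventually-die failure mode, since a fresh pool is created after every crash.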

Update: I ran it again; after some time this happened:

Error in unserialize(node$con) : MultisessionFuture (future_lapply-2) failed to receive message results from cluster RichSOCKnode #2 (PID 14396 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive.

BUT: the process continues to run in the background. I can now even use the RStudio console again, which was not possible before this error. However, the rate at which logs are written has dropped from about 300 per minute to about 100 per minute, so it may be that about 2/3 of the workers have died by now and future_lapply just continues in the background with the still-alive workers.

Update 2: After running for a while longer, the logs stopped, and the process seems to have halted altogether. Probably by then all workers had succumbed to the fatal error that occasionally happens in my code.
