I have an AWS instance on which I would like to run a bunch of tasks, some of them memory- and CPU-intensive. Ideally, I would like to compute timing information for each task. If I run them serially, I get accurate timing information, but it's slow. If I run them in parallel, the whole thing is faster, but individual tasks are slower, as reported by both wall time and thread CPU time.
This slowdown increases as the number of threads increases, up to the number of CPUs.
Cursory examination with ghc-events-analyze and +RTS -s suggests that the source of the slowdown is (unsurprisingly) GC pauses. Playing with RTS options reveals that +RTS -qg -qb -qa -A256m (disabling parallel GC, disabling load-balancing GC, pinning OS threads to CPU cores, and increasing the GC allocation area) improves this, but does not completely eliminate it.
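One way to put a number on the GC share from inside the program is to read the runtime's own counters around a task. The following is only a sketch: `withGcDelta` is a hypothetical helper name, the workload is a stand-in, and it assumes the program is run with +RTS -T (and built so that RTS options are accepted, e.g. with -rtsopts). The counters in GHC.Stats are process-wide, so when several tasks overlap, each delta includes everything the other threads did in that interval.

```haskell
import Control.Exception (evaluate)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Hypothetical helper: run an action and print how much elapsed time the
-- runtime spent in GC versus the mutator while it ran. The counters are
-- process-wide, so with several tasks in flight the deltas cover all threads.
withGcDelta :: String -> IO a -> IO a
withGcDelta label act = do
  before <- getRTSStats
  result <- act
  after <- getRTSStats
  -- Convert a nanosecond counter delta to milliseconds.
  let ms f = fromIntegral (f after - f before) / 1e6 :: Double
  putStrLn $ label
    ++ ": gc " ++ show (ms gc_elapsed_ns) ++ " ms"
    ++ ", mutator " ++ show (ms mutator_elapsed_ns) ++ " ms"
    ++ ", collections " ++ show (gcs after - gcs before)
  return result

main :: IO ()
main = do
  -- getRTSStats throws unless stats collection is on, so check first.
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats are off; rerun with +RTS -T"
    else do
      -- Stand-in workload that allocates a fair amount (not the real task).
      _ <- withGcDelta "demo" $
        evaluate (length (show (product [1 .. 3000 :: Integer])))
      return ()
```

Comparing the printed GC share between -N1 and higher -N values would at least show whether the extra per-task time is accounted for by collections.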
I am running threads using forkIO, but the threads are independent and pure apart from printing progress information. I'm using parallel-io to manage the number of running threads, but when I briefly tried a more conventional approach of having a fixed pool of threads and a task queue, I still had this problem.
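For reference, the bounded-concurrency setup described above can be sketched with base alone. This is not the parallel-io API and not the actual code: `runBounded` and `work` are hypothetical names, a semaphore stands in for the pool, and exceptions thrown by a task are not handled (the result slot would simply never be filled).

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_, evaluate)
import GHC.Clock (getMonotonicTime)

-- | Run every task with at most 'limit' in flight. Returns each task's
-- wall-clock duration in seconds alongside its result, in input order.
runBounded :: Int -> [IO a] -> IO [(Double, a)]
runBounded limit tasks = do
  sem <- newQSem limit
  slots <- mapM (spawn sem) tasks
  mapM takeMVar slots
  where
    spawn sem task = do
      slot <- newEmptyMVar
      -- The semaphore is held only while the task itself runs.
      _ <- forkIO $ bracket_ (waitQSem sem) (signalQSem sem) $ do
        t0 <- getMonotonicTime
        r <- task
        t1 <- getMonotonicTime
        putMVar slot (t1 - t0, r)
      return slot

-- | Stand-in for one pure, CPU-bound task (hypothetical workload).
work :: Int -> IO Int
work n = evaluate (length (filter even (scanl (+) 0 [1 .. n])))

main :: IO ()
main = do
  caps <- getNumCapabilities
  timings <- runBounded caps (map work (replicate 8 2000000))
  mapM_ (\(dt, r) -> putStrLn (show r ++ " in " ++ show dt ++ " s")) timings
```

Built with -threaded and run with different +RTS -N values, a harness like this makes it easy to compare per-task durations while changing only the concurrency limit.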
Any suggestions for how to debug?
EDIT:
@jberryman asked for an example. Each of the tasks looks like the code below:
    computation params = do
      -- Fully evaluate the input before the clock starts.
      let !x = force params
      print $ "Starting computation on " ++ show params
      t1 <- getCPUTime
      -- The bang pattern forces the deep evaluation before t2 is read.
      !y <- fmap force $ do
        ...some work with x ...
      t2 <- getCPUTime
      print $ "Finished computation on " ++ show params
      return (t2 - t1, y)
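One thing worth checking with this style of measurement: if getCPUTime here is the one from System.CPUTime, it is documented as returning the CPU time used by the current program, not by the calling thread, so under -N the difference t2 - t1 would also include CPU time consumed by other tasks (and by GC) during the interval. A minimal sketch that records wall-clock time next to it, with `timeBoth` as a hypothetical helper name and a stand-in workload:

```haskell
import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)
import System.CPUTime (getCPUTime)

-- | Time an action two ways: wall-clock seconds and process CPU seconds.
-- 'getCPUTime' reports CPU time for the whole program (in picoseconds), so
-- with several threads running the CPU figure covers all of them.
timeBoth :: IO a -> IO (Double, Double, a)
timeBoth act = do
  w0 <- getMonotonicTime
  c0 <- getCPUTime
  r <- act
  c1 <- getCPUTime
  w1 <- getMonotonicTime
  let cpuSeconds = fromIntegral (c1 - c0) / 1e12
  return (w1 - w0, cpuSeconds, r)

main :: IO ()
main = do
  -- Stand-in workload, not the real computation.
  (wall, cpu, r) <- timeBoth (evaluate (sum [1 .. 5000000 :: Int]))
  putStrLn $ "result " ++ show r
    ++ ", wall " ++ show wall ++ " s"
    ++ ", cpu " ++ show cpu ++ " s"
```

If the CPU figure for a task grows roughly in step with the number of concurrently running tasks while the wall figure grows much less, that would point at the measurement rather than the task itself.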
… -threaded and are running with -N? An actual executable program that exhibits the issue is what I was hoping for.
… -N parameter is the only thing I'm changing. I can't provide the actual code. I'll see if I can build an MWE, but I'm not hopeful.