multiprocessing.pool ThreadPool.imap does not respect memory scarcity #101586

Open
shaundaley39 opened this issue Feb 5, 2023 · 0 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error

shaundaley39 commented Feb 5, 2023

Bug report

ThreadPool.imap (and imap_unordered) consumes large amounts of memory unnecessarily. This degrades the systems that use it and does nothing to improve imap's own throughput. When used to process images, it has consumed many GB of RAM and caused out-of-memory failures.

The problem is that ThreadPool.imap eagerly iterates over the input iterator and loads the inputs into an in-memory queue of waiting tasks. It does so without limit: it waits neither for tasks to be executed nor for their outputs to be read by whatever consumes the imap iterator.

A small example can illustrate the problem:

import time
from multiprocessing.pool import ThreadPool


def slow_function(x: int):
    time.sleep(0.05)
    print("processed inside imap: ", x)

def report_source(x: int):
    print(f"generated input {x}")
    return x

fast_source = map(report_source, range(16))

with ThreadPool(processes=2) as pool:
    list(pool.imap(slow_function, fast_source, chunksize=1))

When run, we see:

$ python reproduction.py 
generated input 0
generated input 1
generated input 2
generated input 3
generated input 4
generated input 5
generated input 6
generated input 7
generated input 8
generated input 9
generated input 10
generated input 11
generated input 12
generated input 13
generated input 14
generated input 15
processed inside imap:  0
processed inside imap:  1
processed inside imap:  2
processed inside imap:  3
processed inside imap:  4
processed inside imap:  5
processed inside imap:  6
processed inside imap:  7
processed inside imap:  8
processed inside imap:  9
processed inside imap:  10
processed inside imap:  11
processed inside imap:  12
processed inside imap:  13
processed inside imap:  14
processed inside imap:  15

If the function run inside imap is even a little slower than the source supplying inputs (and if we're bothering to make execution concurrent, this will often be the case!), the imap task queue keeps growing until it consumes all available system memory, unless the input runs out first.
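To put a number on this, here is a minimal sketch that counts how far the input source runs ahead of the consumer. The helper name `measure_lead` and its parameters are hypothetical, chosen only for this illustration; the only library call used is `ThreadPool.imap`.

```python
import time
from multiprocessing.pool import ThreadPool


def measure_lead(n_items=64, workers=2, delay=0.01):
    """Return the largest observed gap between inputs pulled and results consumed.

    Hypothetical helper for illustration; not part of multiprocessing.
    """
    pulled = 0

    def source():
        nonlocal pulled
        for i in range(n_items):
            pulled += 1  # incremented each time the pool pulls an input
            yield i

    def slow(x):
        time.sleep(delay)  # stand-in for real work
        return x

    consumed = 0
    max_lead = 0
    with ThreadPool(processes=workers) as pool:
        for _ in pool.imap(slow, source()):
            consumed += 1
            max_lead = max(max_lead, pulled - consumed)
    return max_lead


if __name__ == "__main__":
    print("largest lead of source over consumer:", measure_lead())
```

If the behaviour described in this report holds, the printed lead should be close to `n_items - 1` rather than close to the number of workers. The exact value depends on thread scheduling.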

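Within imap (and imap_unordered) there are two internal buffers that can grow to arbitrary length:

  • The pending-task queue: the task sequence placed on Pool._taskqueue is drained by a handler thread into the pool's input queue (a queue.SimpleQueue in ThreadPool), which will grow to arbitrary length if the input iterator is able to yield input items faster than they are processed by the imap function
  • IMapIterator._items (a collections.deque rather than a SimpleQueue) will grow to arbitrary length if the consumer of the pool.imap iterator is slower than the imap function processing inputs.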
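One possible user-level workaround, sketched here under assumptions and not something the standard library provides, is to throttle the input iterator with a semaphore so that only a fixed number of items can sit in the two buffers at once. The names `bounded_imap` and `max_in_flight` are hypothetical. The sketch relies on the pool pulling inputs from a background thread, so blocking inside the input generator stalls only that thread.

```python
import threading
from multiprocessing.pool import ThreadPool


def bounded_imap(pool, func, iterable, max_in_flight):
    """Like pool.imap(func, iterable), but with at most max_in_flight inputs
    pulled from the source and not yet handed back to the caller.

    Hypothetical helper for illustration; not part of multiprocessing.
    """
    slots = threading.Semaphore(max_in_flight)
    stop = threading.Event()

    def throttled():
        it = iter(iterable)
        while True:
            slots.acquire()  # blocks the pool's feeding thread when full
            if stop.is_set():
                return
            try:
                yield next(it)
            except StopIteration:
                return

    try:
        for result in pool.imap(func, throttled()):
            slots.release()  # the result has left the pool's buffers
            yield result
    finally:
        # Runs on exhaustion, error, or early close: wake the feeding
        # thread so it does not stay blocked on the semaphore.
        stop.set()
        slots.release()


if __name__ == "__main__":
    with ThreadPool(processes=2) as pool:
        for value in bounded_imap(pool, abs, range(-5, 5), max_in_flight=4):
            print(value)
```

The cost of this design is that a small `max_in_flight` can leave workers idle; a value a few times the number of workers is a reasonable starting point. If the caller stops iterating early, the generator should be closed (which CPython normally does when the last reference goes away) before the pool is shut down, so that the `finally` clause can unblock the feeding thread.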

It could be argued that an imap which respected system memory scarcity would be a "feature". But imap has only two advantages over map (that I'm aware of): it can begin mapping inputs to outputs before all of the input is available, and it can work where there is not enough memory to hold all the inputs simultaneously. For users who care about the second (more common?) objective when using imap, respecting memory scarcity is not a feature; failing to respect it is a bug. That's why I've filed this as a "Bug report" issue.

Your environment

This is environment-independent.

@shaundaley39 shaundaley39 added the type-bug An unexpected behavior, bug, or error label Feb 5, 2023
@arhadthedev arhadthedev added performance Performance or resource usage stdlib Python modules in the Lib dir topic-multiprocessing labels Feb 5, 2023