
Add fast path to deepcopy() for empty list/tuple/dict/set #121192

Open
lgeiger opened this issue Jun 30, 2024 · 9 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@lgeiger

lgeiger commented Jun 30, 2024

deepcopy() can be surprisingly slow when called with empty containers like lists, tuples, dicts, sets or frozensets.

Adding a fast path for this case, similar to #114266, would speed up these cases by roughly 4x - 28x, while having little impact on the default path and adding little complexity. With such a patch, the following benchmark script shows a significant speedup compared to main:

import pyperf
runner = pyperf.Runner()

setup = """
import copy

a = {"list": [1, 2, 3, 4], "t": (1, 2, 3), "str": "hello", "subdict": {"a": True}}

t = ()
fs = frozenset()
l = []
s = set()
d = {}
deep = [[], (), {}, set(), frozenset()]
"""

runner.timeit(name="deepcopy dict", stmt="b = copy.deepcopy(a)", setup=setup)
runner.timeit(name="deepcopy empty tuple", stmt="b = copy.deepcopy(t)", setup=setup)
runner.timeit(name="deepcopy empty frozenset", stmt="b = copy.deepcopy(fs)", setup=setup)
runner.timeit(name="deepcopy empty list", stmt="b = copy.deepcopy(l)", setup=setup)
runner.timeit(name="deepcopy empty set", stmt="b = copy.deepcopy(s)", setup=setup)
runner.timeit(name="deepcopy empty dict", stmt="b = copy.deepcopy(d)", setup=setup)
runner.timeit(name="deepcopy multiple empty containers", stmt="b = copy.deepcopy(deep)", setup=setup)
deepcopy dict: Mean +- std dev: [baseline] 1.86 us +- 0.06 us -> [optimize-empty-copy] 2.02 us +- 0.02 us: 1.09x slower
deepcopy empty tuple: Mean +- std dev: [baseline] 285 ns +- 2 ns -> [optimize-empty-copy] 48.4 ns +- 0.9 ns: 5.89x faster
deepcopy empty frozenset: Mean +- std dev: [baseline] 1.47 us +- 0.11 us -> [optimize-empty-copy] 49.9 ns +- 1.5 ns: 29.44x faster
deepcopy empty list: Mean +- std dev: [baseline] 323 ns +- 2 ns -> [optimize-empty-copy] 82.7 ns +- 2.5 ns: 3.91x faster
deepcopy empty set: Mean +- std dev: [baseline] 1.46 us +- 0.10 us -> [optimize-empty-copy] 85.4 ns +- 4.9 ns: 17.04x faster
deepcopy empty dict: Mean +- std dev: [baseline] 326 ns +- 4 ns -> [optimize-empty-copy] 83.3 ns +- 2.6 ns: 3.91x faster
deepcopy multiple empty containers: Mean +- std dev: [baseline] 4.13 us +- 0.04 us -> [optimize-empty-copy] 1.16 us +- 0.02 us: 3.56x faster

Geometric mean: 5.48x faster
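For illustration, here is a minimal sketch (not the actual patch, which lives inside copy.deepcopy itself) of the kind of fast path being proposed, wrapping the stdlib function from the outside:

```python
# Hypothetical sketch of a fast path for empty built-in containers.
# Empty immutables (tuple, frozenset) can be returned as-is, since sharing
# them is safe; empty mutables (list, dict, set) can be recreated directly
# without walking deepcopy's dispatch machinery.
import copy

_ORIG_DEEPCOPY = copy.deepcopy

def deepcopy_with_fast_path(x, memo=None):
    cls = type(x)  # exact type check: subclasses must take the slow path
    if cls in (list, dict, set) and not x:
        return cls()   # fresh empty mutable container
    if cls in (tuple, frozenset) and not x:
        return x       # empty immutables are safe to share
    return _ORIG_DEEPCOPY(x, memo)
```

The exact-type check matters: a list or dict subclass may carry extra state, so only the plain built-in types can safely skip the general machinery.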

This might conflict with @eendebakpt's efforts in #91610, or could be something that should also be added to the proposed C version.

For context, I noticed this when using pydantic models with mutable default values where pydantic would deep copy the default value upon class instantiation. E.g.:

class Foo(pydantic.BaseModel):
    bar: list[int] = []

To be fair, the proper fix in this case is to avoid the mutable default value in pydantic and switch to pydantic.Field(default_factory=list), similar to dataclasses, which is much faster. But there may be other scenarios where deepcopying empty containers is common.
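For comparison, this is the dataclasses analogue of that fix: field(default_factory=list) builds a fresh list per instance, so no shared default ever needs to be deep-copied.

```python
# dataclasses analogue of pydantic's Field(default_factory=list): each
# instance gets its own fresh list, so no deepcopy of a shared default
# value is needed (dataclasses actually reject a bare mutable default).
from dataclasses import dataclass, field

@dataclass
class Foo:
    bar: list[int] = field(default_factory=list)

a, b = Foo(), Foo()
a.bar.append(1)  # mutating one instance does not affect the other
```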

I'm happy to make a PR unless it conflicts with the efforts going on in #91610.

Linked PRs

@eendebakpt
Contributor

@lgeiger No worries about conflicts with my PR, I can rebase if needed. I will add benchmarks for empty containers as well (I never imagined they would actually matter).

More important is perhaps whether your patch slows down the other cases. Do you have numbers for that? And perhaps also a benchmark for the pydantic models.

@lgeiger
Author

lgeiger commented Jun 30, 2024

More important is perhaps whether your patch slows down the other cases. Do you have numbers for that?

I extended the above benchmarks with a case that doesn't include any empty iterables. It makes that case about 1.09x slower. I'm not sure how much noise to expect from such a microbenchmark though, since I'm currently running it on an M3 MacBook on battery power.

And perhaps also a benchmark for the pydantic models.

Unfortunately I can't add a proper benchmark for the pydantic case, since pydantic doesn't build against 3.14 yet. But here's a comparison of the example code on 3.12, which illustrates the problem:

import pyperf
runner = pyperf.Runner()

setup = """
from pydantic import BaseModel, Field

class Foo(BaseModel):
    bar: list[int] = []
    baz: dict[str, int] = {}

class FooField(BaseModel):
    bar: list[int] = Field(default_factory=list)
    baz: dict[str, int] = Field(default_factory=dict)
"""

runner.timeit(name="init pydantic defaults", stmt="Foo()", setup=setup)
runner.timeit(name="init pydantic default factory", stmt="FooField()", setup=setup)
init pydantic defaults: Mean +- std dev: 1.19 us +- 0.01 us
init pydantic default factory: Mean +- std dev: 351 ns +- 12 ns

Profiling only the Foo() calls yields the flamegraph below, whereas FooField() never needs to call deepcopy():
[Screenshot: flamegraph of Foo() instantiation, dominated by copy.deepcopy]

@lgeiger
Author

lgeiger commented Jun 30, 2024

I reduced the slowdown of the default case to 1.06x with a slightly simpler version of the patch, which performs even better for empty lists, sets, and dicts, but is a bit slower for tuples and frozensets (though still much faster than the baseline). Let me know if you'd like me to open a PR; it may be easier to discuss with concrete code.

+------------------------------------+----------+------------------------+--------------------------+
| Benchmark                          | baseline | optimize-empty-copy    | less-optimize-empty-copy |
+====================================+==========+========================+==========================+
| deepcopy dict                      | 1.86 us  | 2.01 us: 1.09x slower  | 1.97 us: 1.06x slower    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty tuple               | 285 ns   | 48.7 ns: 5.85x faster  | 55.0 ns: 5.18x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty frozenset           | 1.47 us  | 49.7 ns: 29.55x faster | 73.8 ns: 19.92x faster   |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty list                | 323 ns   | 81.8 ns: 3.95x faster  | 75.1 ns: 4.30x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty set                 | 1.46 us  | 84.7 ns: 17.18x faster | 77.6 ns: 18.75x faster   |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty dict                | 326 ns   | 84.8 ns: 3.84x faster  | 78.0 ns: 4.18x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy multiple empty containers | 4.13 us  | 1.17 us: 3.51x faster  | 1.32 us: 3.14x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| Geometric mean                     | (ref)    | 5.47x faster           | 5.20x faster             |
+------------------------------------+----------+------------------------+--------------------------+
@pochmann3
Contributor

Why check len(x) == 0 instead of not x?

@lgeiger
Author

lgeiger commented Jul 1, 2024

Why check len(x) == 0 instead of not x?

Good idea, changed in ba7298c. This makes it slightly faster as well. In my quick benchmark the PR now makes the default case only 1.03x slower (see updated numbers in the PR).

@hugovk hugovk added performance Performance or resource usage stdlib Python modules in the Lib dir labels Jul 1, 2024
@sobolevn
Member

sobolevn commented Jul 1, 2024

I think that a 4x to 20x speedup for this quite common case is a good enough reason to accept a 1.09x slowdown in the slow path.

I am +1 on this.

@corona10
Member

corona10 commented Jul 1, 2024

@sobolevn Out of curiosity, what is the typical case for deepcopy: non-empty containers or empty containers? I think the slowdown is not cheap if deepcopy of non-empty containers is more frequent in the real world.

@lgeiger
Author

lgeiger commented Jul 2, 2024

I think the slowdown is not cheap if deepcopy of non-empty containers is more frequent in the real world.

@corona10 I updated #121193 and I don't see any measurable performance regression in the common case anymore (for more info see the updated numbers in the PR description).
