JIT: improve AArch64 code generation #119726

diegorusso · 2024-05-29T13:25:56Z

Feature or enhancement

Proposal:

This is a follow-up to #115802, focused on AArch64 improvements to the code generated by the JIT.
This has been discussed with @brandtbucher during PyCon 2024.

There are a series of incremental improvements that we could implement when generating AArch64 code:

Remove the duplication of the trampoline section (movk) at the end of every micro-op's assembly code.

    // 0000000000000140:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 144: f2a00008      movk    x8, #0x0, lsl #16
    // 0000000000000144:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 148: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000148:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 14c: f2e00008      movk    x8, #0x0, lsl #48
    // 000000000000014c:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 150: d61f0100      br      x8
    // 154: 00 00 00 00
    // 158: d2800008      mov     x8, #0x0
    // 0000000000000158:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 15c: f2a00008      movk    x8, #0x0, lsl #16
    // 000000000000015c:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 160: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000160:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 164: f2e00008      movk    x8, #0x0, lsl #48
    // 0000000000000164:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 168: d61f0100      br      x8
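Each of the four relocations in the listing patches one 16-bit slice of the 64-bit target address into a `mov`/`movk` instruction. The following is a minimal Python sketch of that encoding, assuming the standard A64 MOVZ/MOVK layout (imm16 in bits 5-20, shift selector in bits 21-22). The function name `movk_trampoline` is illustrative and is not taken from the JIT sources.

```python
import struct

MOVZ_X8 = 0xD2800008  # mov  x8, #0x0          (as in the listing)
MOVK_X8 = 0xF2800008  # movk x8, #0x0, lsl #0  (shift selector lives in bits 21-22)
BR_X8 = 0xD61F0100    # br   x8


def movk_trampoline(address: int) -> bytes:
    """Encode mov/movk/movk/movk/br for a 64-bit absolute address (hypothetical helper)."""
    words = []
    for hw in range(4):  # hw selects the 16-bit slice: G0, G1, G2, G3
        imm16 = (address >> (16 * hw)) & 0xFFFF
        base = MOVZ_X8 if hw == 0 else MOVK_X8
        words.append(base | (hw << 21) | (imm16 << 5))
    words.append(BR_X8)
    # A64 instructions are always stored little-endian.
    return struct.pack("<5I", *words)
```

With `address=0` this reproduces the unpatched instruction words shown in the listing: 20 bytes of instructions and four patch sites per trampoline.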

Implement the trampoline with an LDR of a PC-relative literal (instead of movk). This saves 8 bytes of code size.
Move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.
Once we have a slab allocator from "JIT: improve memory allocation" (#119730), use one set of trampolines per slab rather than per trace.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed broadly at PyCon 2024 in person.

Linked PRs


brandtbucher · 2024-05-29T23:25:19Z

Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?

Implement trampoline with LDR of a PC relative literal (instead of movk). It saves

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.

I'd break this up into a couple of phases:

A PR to move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
A PR to emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses (this could waste some memory initially, but is a nice intermediate step).
Once we have a slab allocator from "JIT: improve memory allocation" (#119730), a PR to use one set of trampolines per slab rather than per trace.
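The second phase (one set of trampolines per trace) amounts to giving each distinct symbol a single slot, regardless of how many micro-ops call it. A rough Python sketch of that bookkeeping follows. The function name, the input shape, and the 16-byte slot size are assumptions made for illustration, not the JIT's actual data structures.

```python
def layout_trampolines(symbols_per_uop: list[list[str]], size: int = 16) -> dict[str, int]:
    """Assign one trampoline slot per distinct symbol across a whole trace.

    symbols_per_uop holds, for each micro-op in the trace, the external
    symbols it needs to reach through a trampoline (hypothetical input).
    Returns a mapping from symbol name to byte offset in the trampoline area.
    """
    offsets: dict[str, int] = {}
    for symbols in symbols_per_uop:
        for symbol in symbols:
            if symbol not in offsets:
                # First use of this symbol in the trace: reserve the next slot.
                offsets[symbol] = len(offsets) * size
    return offsets
```

For a trace whose micro-ops need `["PyObject_Free", "_Py_Dealloc"]` and then `["PyObject_Free"]`, this reserves two slots rather than three. The per-slab variant would keep the same mapping alive across traces that share a slab.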

Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just sort of forces our hand right now since it only lets us use short jumps). So this work should also benefit other platforms too, which is nice.
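For context on "short jumps": an A64 `b`/`bl` instruction encodes a signed 26-bit word offset, so it can reach roughly +/-128 MiB from the branch itself. A small Python sketch of the range check that decides whether a trampoline is needed (the function name is made up for this example):

```python
BRANCH_RANGE = 1 << 27  # signed 26-bit offset in 4-byte words: +/-128 MiB


def needs_trampoline(branch_address: int, target_address: int) -> bool:
    """Return True when a direct b/bl at branch_address cannot reach target_address."""
    offset = target_address - branch_address
    return not (-BRANCH_RANGE <= offset < BRANCH_RANGE)
```

A call that is in range can use a single `bl`; anything else has to go through a nearby trampoline that loads the full 64-bit address.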

diegorusso · 2024-05-30T13:43:50Z

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

I've updated the original comment to say that it saves 8 bytes. As for speed, I think we need to measure it, but I would expect it to be the same. The other saving is that we will do only one relocation instead of four.

The code will be something like this:

    ldr x8, [PC+8]
    br x8
    &_Py_Dealloc
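As a sketch of what those three items look like in memory, the Python snippet below packs the two instructions followed by the 8-byte target address. It assumes the standard A64 encodings for LDR (literal) and BR; `ldr_trampoline` is a hypothetical helper name, not the function used by the JIT.

```python
import struct

LDR_X8_PC8 = 0x58000048  # ldr x8, [pc, #8]  (LDR literal, imm19 = 2 words)
BR_X8 = 0xD61F0100       # br  x8


def ldr_trampoline(address: int) -> bytes:
    """Encode ldr/br followed by the 8-byte absolute target address (hypothetical helper)."""
    # "<" means little-endian with no padding: 4 + 4 + 8 = 16 bytes.
    return struct.pack("<IIQ", LDR_X8_PC8, BR_X8, address)
```

The result is 16 bytes with a single 64-bit patch site. That matches the 8 bytes saved if the movk version occupies 24 bytes including its padding word, as the offsets in the original listing suggest.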

So this work should also benefit other platforms too, which is nice.

Of course :)

When emitting AArch64 trampolines at the end of every data stencil, re-use existing ones for the same symbol. Fix the disassembly to reflect the "bl" instruction without the relocation.

Replace AArch64 trampolines with an LDR of a PC-relative literal. This saves 8 bytes of code size per trampoline and decreases the number of patch functions from 4 to 1 per stencil. It also reduces the size of the generated stencil header file by 17%.

Emit AArch64 trampolines in the data section (instead of the code) of the stencil. In many cases this allows the branch to the next micro-op at the end of the stencil to be replaced with a fall-through NOP.


diegorusso added the type-feature A feature request or enhancement label May 29, 2024

mdboom added the performance Performance or resource usage label May 29, 2024

brandtbucher added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.14 new features, bugs and security fixes labels May 29, 2024

brandtbucher assigned diegorusso May 29, 2024

AlexWaygood added the topic-JIT label May 30, 2024

bedevere-app bot mentioned this issue Jun 7, 2024

gh-119726: JIT: re-use trampolines on AArch64 #120250

Merged

brandtbucher pushed a commit that referenced this issue Jun 19, 2024

GH-119726: Deduplicate JIT trampolines for out-of-range jumps (GH-120250)

a0dce37

bedevere-app bot mentioned this issue Jun 25, 2024

gh-119726: replace AArch64 trampolines with LDR #121001

Merged

mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024

pythonGH-119726: Deduplicate JIT trampolines for out-of-range jumps (pythonGH-120250)

c4193b2

brandtbucher pushed a commit that referenced this issue Jul 1, 2024

GH-119726: Use LDR for AArch64 trampolines (GH-121001)

9662608

bedevere-app bot mentioned this issue Jul 2, 2024

gh-119726: emit AArch64 trampolines in the data section #121280

Merged

Akasurde pushed a commit to Akasurde/cpython that referenced this issue Jul 3, 2024

pythonGH-119726: Use LDR for AArch64 trampolines (pythonGH-121001)

26a4df5

brandtbucher pushed a commit that referenced this issue Jul 3, 2024

GH-119726: Emit AArch64 trampolines out-of-line (GH-121280)

84512c0

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024

pythonGH-119726: Deduplicate JIT trampolines for out-of-range jumps (pythonGH-120250)

6cc7eb9

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024

pythonGH-119726: Use LDR for AArch64 trampolines (pythonGH-121001)

73cb048

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024

pythonGH-119726: Emit AArch64 trampolines out-of-line (pythonGH-121280)

622eb1f
