JIT: improve AArch64 code generation #119726

Open
2 of 5 tasks
diegorusso opened this issue May 29, 2024 · 2 comments
Assignees
Labels
3.14 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage topic-JIT type-feature A feature request or enhancement

Comments

@diegorusso
Contributor

diegorusso commented May 29, 2024

Feature or enhancement

Proposal:

This is a follow-up to #115802, focused on AArch64 improvements to the code generated by the JIT.
This has been discussed with @brandtbucher during PyCon 2024.

There is a series of incremental improvements that we could implement when generating AArch64 code:

  • Remove the duplication of the trampoline section (mov/movk) at the end of every micro-op's assembly code.
    // 0000000000000140:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 144: f2a00008      movk    x8, #0x0, lsl #16
    // 0000000000000144:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 148: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000148:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 14c: f2e00008      movk    x8, #0x0, lsl #48
    // 000000000000014c:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 150: d61f0100      br      x8
    // 154: 00 00 00 00
    // 158: d2800008      mov     x8, #0x0
    // 0000000000000158:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 15c: f2a00008      movk    x8, #0x0, lsl #16
    // 000000000000015c:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 160: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000160:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 164: f2e00008      movk    x8, #0x0, lsl #48
    // 0000000000000164:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 168: d61f0100      br      x8
  • Implement the trampoline with an LDR of a PC-relative literal (instead of mov/movk). This saves 8 bytes of code size.
  • Move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.
  • Once we have a slab allocator from JIT: improve memory allocation #119730, use one set of trampolines per slab rather than per trace.
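To make the size difference between the two trampoline forms concrete, here is a minimal Python sketch that assembles both by hand. The instruction words for `mov`/`movk`/`br` are taken from the dump above; `0x58000048` is the A64 encoding of `ldr x8, <literal at PC+8>`. The function names, the `padded` helper, and the assumption that the 4 zero bytes at offset 0x154 in the dump are padding to an 8-byte boundary are mine, not part of the proposal.

```python
import struct

def movk_trampoline(address: int) -> bytes:
    """mov + 3x movk + br x8: five instructions, four patched immediates."""
    words = [
        0xD2800008 | (((address >> 0) & 0xFFFF) << 5),   # mov  x8, #imm16
        0xF2A00008 | (((address >> 16) & 0xFFFF) << 5),  # movk x8, #imm16, lsl #16
        0xF2C00008 | (((address >> 32) & 0xFFFF) << 5),  # movk x8, #imm16, lsl #32
        0xF2E00008 | (((address >> 48) & 0xFFFF) << 5),  # movk x8, #imm16, lsl #48
        0xD61F0100,                                      # br   x8
    ]
    return struct.pack("<5I", *words)

def ldr_trampoline(address: int) -> bytes:
    """ldr x8, [literal at PC+8]; br x8; then the 8-byte literal itself."""
    return struct.pack("<IIQ", 0x58000048, 0xD61F0100, address)

def padded(size: int, alignment: int = 8) -> int:
    """Round size up to the next multiple of alignment (assumed 8 here)."""
    return -(-size // alignment) * alignment

target = 0x0000_7F12_3456_7890  # made-up address, for illustration only
old = movk_trampoline(target)
new = ldr_trampoline(target)
saving = padded(len(old)) - padded(len(new))
```

Under these assumptions the mov/movk form occupies 20 bytes (24 once padded) and the LDR form 16, which matches the 8 bytes quoted above. The LDR form also stores the address in one contiguous 64-bit slot, so only one location needs patching.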

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed broadly at PyCon 2024 in person.

Linked PRs

@diegorusso diegorusso added the type-feature A feature request or enhancement label May 29, 2024
@mdboom mdboom added the performance Performance or resource usage label May 29, 2024
@brandtbucher brandtbucher added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.14 new features, bugs and security fixes labels May 29, 2024
@brandtbucher
Member

Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?

Implement trampoline with LDR of a PC relative literal (instead of movk). It saves

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.

I'd break this up into a couple of phases:

  • A PR to move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • A PR to emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses (this could waste some memory initially, but is a nice intermediate step).
  • Once we have a slab allocator from JIT: improve memory allocation #119730, a PR to use one set of trampolines per slab rather than per trace.
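The second phase above can be sketched roughly as follows: walk the trace once, give each distinct symbol a single trampoline slot, and compare the result with every micro-op carrying its own copies. This is only an illustration, not the JIT's actual data structures. The micro-op names `_OP_A`/`_OP_B`/`_OP_C`, the trace layout, and the 16-byte trampoline size (the LDR form) are assumptions.

```python
TRAMPOLINE_SIZE = 16  # assumed: ldr + br + 8-byte literal

def per_op_bytes(trace):
    """Bytes used when each micro-op keeps its own trampolines."""
    return sum(len(set(symbols)) for _, symbols in trace) * TRAMPOLINE_SIZE

def per_trace_layout(trace):
    """Assign one trampoline offset per distinct symbol in the whole trace."""
    offsets = {}
    for _, symbols in trace:
        for symbol in symbols:
            if symbol not in offsets:
                offsets[symbol] = len(offsets) * TRAMPOLINE_SIZE
    return offsets

# Hypothetical trace: (micro-op name, symbols it calls through a trampoline).
trace = [
    ("_OP_A", ["_Py_Dealloc"]),
    ("_OP_B", ["_Py_Dealloc", "PyObject_Free"]),
    ("_OP_C", ["PyObject_Free"]),
]

layout = per_trace_layout(trace)
shared_bytes = len(layout) * TRAMPOLINE_SIZE
```

The same dictionary could later be keyed per slab instead of per trace, which is the third phase.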

Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just sort of forces our hand right now since it only lets us use short jumps). So this work should also benefit other platforms too, which is nice.
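For context on why AArch64 forces short jumps: the `bl` instruction holds a signed 26-bit word offset, so it can only reach targets within ±128 MiB of the call site. A minimal sketch of the range check and encoding is below. The function name `encode_bl` and the error handling are illustrative and not taken from the JIT source.

```python
BL_OPCODE = 0x94000000
BL_RANGE = 1 << 27  # 26-bit signed immediate, scaled by 4 => +/-128 MiB

def encode_bl(pc: int, target: int) -> int:
    """Return the 32-bit `bl` word for a call from pc to target.

    Raises OverflowError when the target is out of range, which is the
    case where the call has to go through a nearby trampoline instead.
    """
    offset = target - pc
    if offset % 4:
        raise ValueError("A64 branch targets must be 4-byte aligned")
    if not -BL_RANGE <= offset < BL_RANGE:
        raise OverflowError("target is out of range for a direct bl")
    return BL_OPCODE | ((offset >> 2) & 0x03FFFFFF)

forward = encode_bl(0x1000, 0x1010)   # bl to pc + 16
backward = encode_bl(0x1010, 0x1000)  # bl to pc - 16
```

A far-away symbol such as a C API function typically fails this check, so the `bl` targets a trampoline placed close to the code, and the trampoline performs the long jump.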

@diegorusso
Contributor Author

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

I've updated the original comment to say that it saves 8 bytes. As for speed, we would need to measure it, but I expect it to be about the same. The other saving is that we do only one relocation instead of four.

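The code would look something like this:

    ldr x8, 1f              // load the 64-bit address stored in the literal at PC+8
    br  x8                  // jump to it
1:  .quad _Py_Dealloc       // the literal: one 8-byte address, patched by a single relocation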

So this work should also benefit other platforms too, which is nice.

Of course :)

diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 7, 2024
When emitting AArch64 trampolines at the end of every data stencil,
re-use existing ones for the same symbol.
Fix the disassembly to reflect the "bl" instruction without the
relocation.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 25, 2024
Replace AArch64 trampolines with LDR of a PC relative literal.
It saves 8 bytes in code size per trampoline and decreases the number
of patch functions from 4 to 1 per stencil.
It also decreases the size of the generated stencil header file by 17%.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jul 2, 2024
Emit AArch64 trampolines in the data section (instead of the code) of
the stencil. In many cases this allows the branch to the next micro-op
at the end of the stencil to be replaced with a fall-through NOP.
Akasurde pushed a commit to Akasurde/cpython that referenced this issue Jul 3, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
4 participants