
Processors have come a long way in their handling of unaligned data - from crashing at the very notion of it, through suffering severe penalties, all the way to having almost no impact.

I suppose it is still possible to do harm, if for some reason the use case happens to somehow abuse cache line boundaries, but all in all, it does appear that it is pretty much ok for uniformly distributed access.

So why would we want to not align?

We save a little memory by not padding; the reduced footprint and better data density make it a tad more cache-friendly, and the extra cache hits may even be enough to offset the line-boundary access costs. And then there is the almost negligible advantage of omitting padding from the compilation process.

What potential disadvantages could this have, excluding the performance impact, which seems close to within the margin of error?

Additionally, information on microarchitectures with significant market presence that struggle with, or outright do not support, unaligned access is appreciated.

1 Answer


This is really architecture-dependent. No processor will just crash, but some architectures will raise exceptions for unaligned accesses. The performance overhead from unaligned accesses (if allowed) depends entirely on the specific microarchitecture, with penalties ranging from “multiple times slower” to “little or no penalty”.

Many modern architectures such as ARMv8 support unaligned memory accesses only under certain conditions, and you will have to opt in to these drawbacks. Unaligned accesses are unattractive because they generally require multiple memory accesses at the microarchitecture level. This means that unaligned accesses are (a) often several times slower and (b) cannot be atomic. But the details depend on the specific processor. For example, the Arm Cortex-A78 optimization guide says that unaligned accesses are generally fast, except when one of the following applies:

  • Load operations that cross a cache-line (64-byte) boundary
  • Quad-word load operations that are not 4B aligned
  • Store operations that cross a 32B boundary

Furthermore, there are restrictions with regards to the store-to-load forwarding optimization.

The x86 architecture is generally more flexible, but here, too, misalignment can imply significant performance overhead, especially since it will also generally prevent some potential optimizations such as the use of SIMD instructions (though there are some instructions like MOVUPD specifically for unaligned data). According to Agner Fog's microarchitecture guide, there is little or no penalty for unaligned accesses per se on some modern microarchitectures like AMD Zen 1–3, whereas other microarchitectures like Intel Ice Lake can allegedly suffer slight delays for unaligned writes.

Independent of microarchitecture and atomicity concerns, there will necessarily be a performance overhead in multicore scenarios when false sharing happens, where a block of memory is loaded into one core's cache but also needed by another core. It is thus advisable to align data to cache line boundaries, generally 64 bytes.

It is completely fine to determine that a more compact memory layout is more important than fast and atomic accesses for your use case. This is a space–time tradeoff, though atomicity also raises correctness concerns in multithreaded scenarios. I would consider space-optimization with unaligned data storage when the following conditions hold:

  • I know the specific device that will run the software, so that I can optimize for the behaviour of a particular architecture + microarchitecture.
  • Lower-hanging fruit for optimization is not available. For example, it is not possible to convert pointers into array offsets, to reorder struct members for a more compact layout, or to switch from row- to column-oriented storage (i.e. Array of Structs (AoS) vs Struct of Arrays (SoA)).
  • The workload is memory- or cache-limited. Increasing data density by a bit will mean that the entire data for a meaningful sub-problem will fit into memory/cache entirely. The reduced swapping/cache misses will more than make up for any overhead from unaligned accesses.
  • Issues related to concurrent execution such as atomicity or false sharing can be ignored.
