Memory Alignment

Question

I want to make sure I understand the concept referred to by alignment:

Is it just a way of making sure that you never have a non-integer number of words? The Wikipedia page says that for an access to be aligned, the address has to be a multiple of the datum's size (which I interpret as a number of words; as in, a 4-word integer requires an address that refers to an nth word, where n is a multiple of 4), but I don't see how that is significant (unless it is needed by the CPU to do multi-word reads/writes). The page relates that to the architecture (probably disregarding the instruction set), which to me seems irrelevant, unless they're talking about 8-bit bytes (not words) and trying to say that multi-word accesses are better done with whole words.

I'm not certain if the goal is not to have to read parts of words (instead of wholes), or something else instruction sets enforce, in case of multi-word accesses. I would find the latter understandable, but can't imagine that the former would be too troublesome (unless that might cause some virtual-memory difficulties).

If I understood the Wikipedia page correctly (I probably haven't), then accessing a datum that is 3 words long is never an aligned access, but I don't see why that would be troublesome when reading sequentially with multiple instructions or even all words at once (again, unless a particular implementation enforces such rules for multi-word access).

The missing piece here could be that some instruction sets actually allow you to specify an 8-bit byte address, and because of their implementations aligned accesses are better.

Wikipedia paragraphs:

naturally aligned, which generally means that the data's memory address is a multiple of the data *size. For instance, in a 32-bit architecture, the data may be aligned if the data is stored in four consecutive *bytes and the first byte lies on a 4-byte boundary.

*(I interpret "size" as number of words, not 8-bit bytes; and "byte" as a word, not 8-bit byte)

A memory address a is said to be n-byte aligned when a is a multiple of n (where n is a power of 2). In this context, a byte is the smallest unit of memory access, i.e. each memory address specifies a different byte.
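The definition quoted above translates almost directly into code. A minimal sketch in C; the helper names `is_power_of_two` and `is_aligned` are made up for illustration and are not from the Wikipedia page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n is a power of two (1, 2, 4, 8, ...). */
static bool is_power_of_two(uintptr_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* True when byte address a is a multiple of n.
 * Because n is a power of two, "a is a multiple of n" is the same as
 * "the low log2(n) bits of a are all zero". */
static bool is_aligned(uintptr_t a, uintptr_t n)
{
    assert(is_power_of_two(n));
    return (a & (n - 1)) == 0;
}
```

With these, `is_aligned(0x1000, 4)` is true and `is_aligned(0x1001, 4)` is false. Note that `a` counts 8-bit bytes here, not words.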

It's probably a very simple and obvious thing, yet these things can be hard to convey sometimes. Maybe I should note that I know nothing about the abstract computing techniques used; most of the time I imagine how features work rather than actually reading about them.

Terribly sorry for the long and possibly naive question, I'm just tired of thinking (:

EDIT: It seems that some did not understand what I mean. When I use the term "word", I mean the width of the location that memory returns whole when given an address, independently of any manipulation that occurs, whether by the CPU or any other module. That term should have a defined answer unless there's no base width (for instance, if memory were just a mess of single-bit registers and it were completely up to logic to map addresses to collections of those bits; that's just to make my meaning clear). I thought "n-bit architecture" for the most part always meant an n-bit memory-CPU data bus, and maybe that the word width is also n, along with other things. I don't think I'm wrong about what "n-bit architecture" could mean, so alignment would make sense if, as I've mentioned above in the question, address 0 to the CPU doesn't mean word 0 but the first byte in that word; in other words, the CPU targets a width different from the word width. That could give rise to alignment requirements/efficiency, and from how Martin explained it below, that model immediately comes to mind. But, as with many things, you can't really say what the implementation is. I just wrote that to make sure I'm not off-track; alignment requirements/efficiency could be caused by different things, and that's just one model in my head.

Doc Brown · Accepted Answer · 2022-10-06 18:31:49Z

5

I don't see how that is significant (unless that is needed by the CPU to do multi-word reads/writes)

For certain RISC architectures, that is exactly the reason - their CPU instructions require data to be placed in an aligned way. For others like Intel x86 or x64 CPU architectures, non-aligned memory access works, but is slower than aligned access.

And that's all, no less, no more. Alignment is either important for making some CPU instructions work, or for making them work faster.
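For C code that has to run on both kinds of CPU, here is a minimal sketch of reading a 32-bit value from an address that may not be aligned. The helper name `load_u32_unaligned` is made up for illustration. It copies through `memcpy` because dereferencing a misaligned `uint32_t *` is undefined behavior in C, whatever the hardware would do with it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value, in host byte order, from any byte address.
 * memcpy has no alignment requirement, so this is well defined even
 * when p is not a multiple of 4. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Compilers typically turn the copy into a single load on targets where unaligned loads work, and into several narrower loads where they don't.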


Thanks for answering, Doc Brown, but could you tell me if the reason such requirements/differences exist is because of standard word widths? (I thought that on 32-bit architectures, for example, any memory address refers to a 32-bit word; but reading the answers, alignment would only be necessary/better if the width was different from the number of bytes to be accessed and additional logic exists to let you access, for example, 32 bits in one cycle.) Is the x64 architecture designed with an 8-bit word width in mind, with logic that allows 64-bit access in 1 cycle, for example? Sorry for bothering you.
– Hello
Commented Oct 9, 2022 at 14:45
@Hello: I think Martin Kochanski's answer does a pretty good job of explaining what happens under the hood. And concerning the x64 architecture - I don't have the numbers at hand, but I am almost sure it is still optimized for 32-bit alignment (in my experience, C or C++ programs which use 32-bit ints don't behave much differently performance-wise when run on the same CPU in x86 mode or x64 mode; x64 only brings better performance when one is using 64-bit numbers).
– Doc Brown
Commented Oct 9, 2022 at 16:29


Martin Kochanski · Accepted Answer · 2022-10-06 19:22:21Z

3

First, I am not sure what you mean by “words” and I am not entirely sure that you are sure what you mean by them. This may well be me being dim, but anyway I will try not to use the term.

Suppose you have a 32-bit architecture, meaning that the data path between the CPU and the memory is 32 bits wide. This means that the CPU is physically capable of reading 32 bits at a time starting at a byte address which ends with 0, 4, 8 or C if written in hexadecimal. That is what is meant by 32-bit aligned memory addresses.

It is impossible for the CPU to read 32 bits at an address ending in any other digit, such as 1.

In a properly academically respectable instruction set, an attempt to read 32 bits starting at xxx1 will raise an exception instead of reading anything. That exception, if not captured by your program, will be captured by the operating system which, if it is as uptight as the instruction set, will terminate your program.

In an instruction set corrupted by commerce, an attempt to read 32 bits starting at xxx1 will do the following instead:

1. Read 32 bits starting at xxx0.
2. Shift the result to the right by 8 bits.
3. Read 32 bits starting at xxx4.
4. Shift the result to the left by 24 bits. (To keep things simple I am assuming little-endian architecture.)
5. OR the values obtained in steps 2 and 4 and put the result into whatever register was specified in the instruction.

Leaving aside the extra circuitry for all those logical operations, you can see that a misaligned access means two reads instead of one.
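The read steps above can be sketched in C as a software model. This is only an illustration under stated assumptions: memory is modelled as an array of 32-bit words, byte numbering is little-endian as in the answer, and the names `read_aligned_word` and `read32_any` are hypothetical. Real hardware does this in circuitry, not in code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a memory that can only return whole 32-bit words at
 * 4-byte-aligned byte addresses. */
static uint32_t read_aligned_word(const uint32_t *mem, size_t byte_addr)
{
    assert(byte_addr % 4 == 0);
    return mem[byte_addr / 4];
}

/* Reads 32 bits starting at any byte address (little-endian numbering). */
static uint32_t read32_any(const uint32_t *mem, size_t byte_addr)
{
    size_t base = byte_addr & ~(size_t)3;              /* aligned address at or below */
    unsigned shift = (unsigned)(byte_addr - base) * 8; /* 0, 8, 16 or 24 bits */

    uint32_t lo = read_aligned_word(mem, base);
    if (shift == 0)
        return lo;                                     /* aligned: one read */

    /* misaligned: a second read, two shifts and an OR */
    uint32_t hi = read_aligned_word(mem, base + 4);
    return (lo >> shift) | (hi << (32 - shift));
}
```

For a read at xxx1 the shifts come out as 8 and 24, matching steps 2 and 4.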

For writing, the situation is worse. An aligned write is one 32-bit write. An unaligned write is:

1. Read the 32 bits at xxx0.
2. Alter the top three bytes to reflect the value being written.
3. Write the result back to xxx0.
4. Read the 32 bits at xxx4.
5. Alter the bottom byte to reflect the value being written.
6. Write the result back to xxx4.
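The write sequence can be modelled the same way. Again a sketch under assumptions, not how any particular CPU implements it: memory is an array of 32-bit words, byte numbering is little-endian, and `write32_any` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes 32 bits at any byte address into a word-addressed memory
 * (little-endian byte numbering). */
static void write32_any(uint32_t *mem, size_t byte_addr, uint32_t value)
{
    assert(mem != NULL);
    size_t index = byte_addr / 4;
    unsigned shift = (unsigned)(byte_addr % 4) * 8;   /* 0, 8, 16 or 24 bits */

    if (shift == 0) {
        mem[index] = value;                           /* aligned: a single write */
        return;
    }

    /* Mask of the low bytes of the first word that must be preserved. */
    uint32_t low_keep = (UINT32_C(1) << shift) - 1;

    /* Read-modify-write the first word: keep its low bytes, replace the rest. */
    mem[index] = (mem[index] & low_keep) | (value << shift);

    /* Read-modify-write the second word: replace its low bytes, keep the rest. */
    mem[index + 1] = (mem[index + 1] & ~low_keep) | (value >> (32 - shift));
}
```

Each misaligned call performs two reads and two writes, against one write for the aligned case.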


When I used to work on SPARC, one could optionally compile misaligned support. This would add a handler for the bus error you mentioned (since it was an academically respectable ISA) and perform the workaround in software instead of hardware. Performance was ... noticeably affected.
– Useless
Commented Oct 7, 2022 at 10:24
It's worth mentioning that modern CPUs happily do this dance, and can even do non-aligned reads/writes atomically, but it really depends on architecture and microarchitecture – old ARM architectures didn't support unaligned accesses, and modern ARM chips will have a performance penalty and/or will not support atomics on unaligned accesses. x86-derived architectures have a more programmer friendly memory model, but tend to be slower overall.
– amon
Commented Oct 7, 2022 at 10:26
5

This answer does a good job of explaining why unaligned access, in case it works, is slower than aligned access, but a remark like "In an instruction set corrupted by commerce" is - with all due respect - unprofessional bullshit. So +1 for the good explanation, -1 for the unnecessary rant, gives 0 in total from me.
– Doc Brown
Commented Oct 7, 2022 at 18:55
1

@Hello The complication is that the answer to your question is both. Most CPU architectures (with the exception of some very academically pure RISC architectures) will provide instructions to read various widths (1, 2, 4, 8, 16, 32 or 64 bytes), but the main memory bus will have a fixed transaction size. (In DDR4 it's 64 bits wide * 8 transfers in series, for 64 bytes per transaction.) The CPU, its caches and memory controller have to work to handle the partial accesses and reads/writes this difference causes.
– user1937198
Commented Oct 10, 2022 at 0:25
1

The vast majority of memory accesses on a modern x86 CPU will be partial accesses, because almost nothing except vector work wants to work with 512 bits at a time. So the L1 caches are built for high performance partial read/write.
– user1937198
Commented Oct 10, 2022 at 0:29


juhist · Accepted Answer · 2022-10-17 18:58:22Z

Is it just a way of making sure that you never have a non-integer number of words?

No, it isn't. For example, let's consider this struct:

#include <stdint.h>

/* 13 bytes: not a whole number of 2-, 4- or 8-byte words */
struct nonintegernumberofwords {
    uint8_t content[13];
};

I see a non-integer number of words there, assuming you define a word to be something larger than a byte.

Non-aligned access has several important problems.

Firstly, if all your accesses are aligned, the CPU knows it has to access only exactly one cache-line. Unaligned accesses may hit two cache-lines.

Worse, an unaligned access can hit not one page but two pages. It can therefore generate zero, one or two page faults, whereas an aligned access can only generate zero or one page faults.
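Whether a given access touches two cache lines or two pages is simple arithmetic. A minimal sketch; `crosses_boundary` is a hypothetical helper, and the 64-byte line and 4096-byte page sizes used in the note below are common values assumed for illustration, not something every CPU uses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the byte range [addr, addr + size) touches more than one
 * block of block_size bytes. block_size must be a power of two. */
static bool crosses_boundary(uintptr_t addr, size_t size, uintptr_t block_size)
{
    assert(size > 0);
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    uintptr_t first_block = addr / block_size;
    uintptr_t last_block = (addr + size - 1) / block_size;
    return first_block != last_block;
}
```

A 4-byte access at 0x103E crosses a 64-byte line boundary and one at 0x1FFE crosses a 4096-byte page boundary, while a naturally aligned access such as 4 bytes at 0x1FFC stays inside one block.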

The main benefit of unaligned access is that it allows you to pack data tighter. This is important in cases where memory usage has to be kept at a minimum. You may be able to find some way of packing data tighter than in your initial struct layout, but this could mean moving related members far away from each other, resulting in less predictable cache access patterns.
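To illustrate the trade-off, here is a small sketch of how member order changes the padding a compiler inserts to keep members aligned. The struct and member names are invented for the example, and exact sizes depend on the platform ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members in this order: the compiler usually pads after `tag` so that
 * `value` starts on its natural boundary, and pads again after `flag`. */
struct padded {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
};

/* Same members, widest first: typically needs less padding. */
struct reordered {
    uint32_t value;
    uint8_t  tag;
    uint8_t  flag;
};

/* Whatever the platform, a member's offset is a multiple of its alignment. */
static_assert(offsetof(struct padded, value) % _Alignof(uint32_t) == 0,
              "value is placed on its natural boundary");
```

On common ABIs where `uint32_t` is 4-byte aligned, `sizeof(struct padded)` is 12 and `sizeof(struct reordered)` is 8. Printing `sizeof` and `offsetof` on your own target is the reliable way to check.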

Then there's networking. Ethernet has a 14-byte header. So even though IP, TCP, UDP and ICMP are protocols where everything is aligned, the 14-byte Ethernet header breaks it. If you want the highest possible packet processing performance, you may want to access network interface card driver buffers directly, so you may not have control over where the packet is placed in memory. And even if you have control and put it at an unaligned offset of 2 bytes to fix alignment for IP, TCP, UDP and ICMP, then what if the Ethernet packet is an LLC packet without SNAP that breaks the alignment by a further 3 bytes, so your guess of being 2 bytes out of alignment was wrong?
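The offset arithmetic in that paragraph can be written down as a short sketch. The name `pad_for_payload` is hypothetical, and the 14-byte header and 3-byte LLC sizes are the ones stated above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ETH_HEADER_LEN = 14 };   /* Ethernet header size, as stated above */

/* Number of padding bytes to put in front of a frame so that whatever
 * follows a header of header_len bytes lands on an `align` boundary,
 * assuming the buffer itself starts on such a boundary. */
static size_t pad_for_payload(size_t header_len, size_t align)
{
    assert(align != 0);
    return (align - header_len % align) % align;
}
```

`pad_for_payload(ETH_HEADER_LEN, 4)` gives the 2-byte offset mentioned above. With 3 extra LLC bytes the header is 17 bytes and the required padding becomes 3, so a buffer padded by 2 leaves the IP header at offset 19, which is not a multiple of 4.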

My opinion is that it's extremely important for CPUs to allow reasonably fast access to non-aligned data. Whether that is a special unaligned access instruction that's nearly as fast as aligned access, or whether every access can be aligned or unaligned are two different ways to solve the problem. But a CPU that has slow unaligned access is a CPU that is doomed to fail.

Thanks juhist for the answer. As for the cache lines point, that still invokes what I mentioned in the Edit section of the question, is it safe to think about it that way?. I'm trying to know the relation between alignment requirements and physical sizes of cache/memory if any exists, since the more I read, the more I suspect that it's just a way to make instructions atomic and it's tightly connected to physical sizes of a single location. I can't see a reason for anything I've read unless it's because of the mapping of addresses to parts of locations. — Hello, Commented Oct 20, 2022 at 14:59
... If for example, address 1 is an independent memory word (single addressable location), then you'd be able to read it without reading any other address, and if all addresses are like that, alignment wouldn't matter, but if 0 1 2 3 are all in one location, then you'll only be able to read four addresses together if you start at a multiple of four (0, 4, 8, 16), since for example addresses 2 3 4 5 would be split over two locations. Is that part of the reason for alignment or is it always completely unrelated? — Hello, Commented Oct 20, 2022 at 14:59
IIRC the 68020 processor could have up to 11 (or was it 13) page faults in one instruction. — gnasher729, Commented Nov 1, 2022 at 23:01

Hello · Accepted Answer · 2022-11-01 21:15:18Z

So, after doing some reading I've found out that (in a way) it's what I've always suspected. It seems that when people talk about addresses on machines these addresses don't necessarily map to individual words (sometimes called bytes/blocks), but to 8-bit bytes. Adding that to however a given machine actually addresses memory, you might get alignment requirements; although that seems to only be a small part of the picture, since caching and other computing methods could impose the need to align as well. Maybe for caches it's the same problem as memory, but with cache instead.

In the case of unaligned multi-word access (e.g. reading bytes 2 3 4 5 on a 32-bit machine), some machines might be able to handle that automatically, at the expense of more cycles, but others don't; "some machines don't allow unaligned access" is what I've read, but I'm not sure if that's the case in every sense. Maybe it only means that reading/writing across word boundaries isn't supported. If they didn't support unaligned access in any sense, then why not have every address refer to a whole aligned unit from the start?

Anyhow, the fact that addresses "conventionally" map to 8-bit bytes is what caused the confusion. I hope this helps others.

Resources: overflow question, IBM article, and this page
