21

Cache-coherent SMP (symmetric, or shared-memory, multiprocessing) systems can provide various grades of memory-ordering guarantees; the stronger ones are more expensive to implement but make it easier to write correct code: https://en.wikipedia.org/wiki/Memory_ordering

Modern x86 CPUs provide Total Store Ordering (TSO), which is the strongest guarantee (that is used in actual SMP systems).

I'm interested in what the situation was in the early 90s. At that time, a typical x86 SMP system would have a pair of 486 CPUs, with the memory controller on a separate chip. Did those systems also provide TSO?

  • 9
    I think the question could benefit from a paragraph explaining what 'Total Store Ordering' is, for the general reader. Though I am familiar with memory models (had to be, having written a little Alpha code :-)), the term TSO is not a familiar one to me. It is not actually defined at the linked Wikipedia page; it appears only as the name of an operational mode for RISC-V and SPARC.
    – dave
    Commented Feb 1 at 11:22
  • 5
    @camelccc Intel CPUs support SMP for all x86. In fact they do all the way back to 8080 times. Multibus (note the name) was quite common with professional 8086 to 486 systems. Multibus allows various degrees of multiprocessing, from loose coupled dedicated systems over shared I/O and shared memory all the way to SMP. Implementation is based on Intel's DMA protocol introduced with the 8080, available and used since 1974.
    – Raffzahn
    Commented Feb 1 at 14:47
  • 3
    which is the strongest guarantee (that is used in actual SMP systems) - Historically there have been sequentially-consistent SMP systems, which is even stronger, even StoreLoad ordering doesn't happen (e.g. hardware with no store buffer or pipelining, each instruction fully finishes before starting the next.) Probably some of the machines in en.wikipedia.org/wiki/Symmetric_multiprocessing#History are like that. (386 was like this. But 386 SMP, unlike 486, predates the rules that modern x86 follows.) Commented Feb 1 at 21:03
  • 1
    @camelccc: lock add dword [esp], 0 is a full memory barrier with no other side effects. (And ironically is more efficient than mfence on modern x86, so it's what compilers use even on x86-64, with RSP of course. locked instructions may not be 100% ordered wrt. SSE4.1 NT loads from WC memory, but mfence is; I think this is why mfence is slower. See Does lock xchg have the same behavior as mfence? - on Skylake, a microcode update made mfence drain the ROB (reorder buffer) like lfence, as well as the store buffer.) Commented Feb 1 at 21:11
  • 5
    "Modern x86 CPUs provide Total Store Ordering (TSO), which is the strongest guarantee (that is used in actual SMP systems)." -- I am not sure, but I have been told / taught at computer architecture conferences that at least some IBM systems, mainframes probably, were actually sequentially consistent.
    – Krazy Glew
    Commented Feb 1 at 23:40

2 Answers

27

BRIEF: most i486 multi-processors were probably TSO. But system vendors could definitely build non-TSO systems, and were definitely considering doing so.

---+ DETAIL

---++ SMP Marketplace ~1991 (not just Intel)

During the Intel P6 design effort (~1991-1996) we had to consider

  • SMP systems like Sequent, who started out as a bunch of ex-Intel engineers building NS32032-based systems, then '386, '486, and '586 systems. Wikipedia doesn't mention Sequent shipping a P6-based system, but we surely talked to them, especially because they were in a neighboring suburb of Portland, Oregon.

  • NCR's 486 based SMP systems.

  • A scattering of other x86-based multiprocessor systems that we would have been reluctant to break compatibility with.

  • As well as companies who were moving to x86 from other processor families, whom we sought to encourage. It's fair to say that many companies were considering building x86 multi-processors at that time.

---++ Memory Models on Our Mind

P6 spent quite some time thinking seriously about implementing sequential consistency, because there were some older systems (Intel and non-Intel based) that had such strong memory ordering models. However, I clearly remember when 'NS' said something like "Why are we bothering with strong ordering? The i486 allows loads to bypass stores in its write buffer between CPU and L1 cache." Since we were running short on time, we switched to the weaker memory ordering model.

By the way, lest people think that Intel started off in the strong memory ordering / SC camp and then only reluctantly relaxed, let me say that one of my first assignments at Intel was to participate in the "Weak Ordering Task Force". This petered out after a while, even before we realized we could achieve higher performance using a stronger memory ordering model and speculation. Weak ordering did not raise its head again until Itanium. Nevertheless, I still think that Intel should have defined a "store-release" instruction, as that is required by so many weaker memory ordering models. Even if those models have slightly different definitions of store-release, supporting the instruction would have allowed us to implement one of the weaker models. I think that LOCK MOV to memory, i.e. the x86 LOCK prefix in front of a store instruction, would have been easy to do. IIRC the LOCK prefix on stores was ignored at this time. But I emphasize: weaker memory ordering models with store-release instructions were probably lower performance on the shared-bus multiprocessor systems that Intel was anticipating in the near future at the time the P6 effort started in 1991. I wanted store-release for both higher-end HPC systems and lower-end embedded systems that might have wanted to avoid the implementation complexity of large-scale speculative consistency maintenance.
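For readers who have not met the idea: a store-release orders all earlier memory operations before the releasing store, which is exactly what publishing data behind a flag needs. A minimal C11 sketch of that pattern (my illustration, not anything from the P6 documents; on today's TSO x86 the release store compiles to a plain MOV, while a weakly ordered machine needs a fence or a dedicated instruction such as AArch64's STLR):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;           /* ordinary data being published        */
static atomic_int ready = 0;  /* publication flag                     */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                      /* plain store */
    /* store-release: all earlier stores become visible to a
     * processor that observes the flag before it sees the flag */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    /* load-acquire pairs with the release store above */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    printf("payload = %d\n", payload);                 /* always 42   */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```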

Note that the i486 only allowed loads to bypass stores with non-overlapping addresses, in a 4-deep store buffer between processor and the L1 cache, whereas P6 provided store-to-load forwarding out of its much deeper store buffer. 16 entries? 32 entries? Not determined at that time, but we knew it was going to be much deeper. Not only is there a slight difference between bypass and forwarding store buffers with respect to the memory model, but the different depths were possibly a greater concern. We already knew that some code could detect the difference between i386, i486, and Pentium/P5 based on timing and estimation of the store buffer and instruction prefetch queue depths, and sometimes depended on particular timings and depths. However, we had already decided that it was a fool's errand to try to keep code running that depended excessively on instruction timing, particular microarchitecture buffer depths, etc.
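To make the bypass-versus-forwarding distinction concrete, a two-load sketch (mine, with the behavior as I describe it above):

```c
void example(int *p, int *q) {
    *p = 1;        /* store enters the store buffer                        */
    int a = *q;    /* non-overlapping load: both i486 and P6 let this pass
                      the buffered store (this is what breaks SC)          */
    int b = *p;    /* overlapping load: i486, bypass only, must stall until
                      the store drains to cache; P6 forwards the value 1
                      straight out of its store buffer without waiting     */
    (void)a; (void)b;
}
```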

Overall we felt more comfortable having P6 implement more conservative architecture policies than previous processors, rather than less conservative. E.g. for self-modifying code P6 let a write to the instruction stream affect the very next instruction, rather than expose the actual microarchitecture instruction prefetch queue depth, which was certainly deeper than that of previous processors. Code for previous processors "knew" how far ahead the prefetch queue reached (say, 4*16 bytes worth of instruction fetches) and sometimes depended on it. P5 had only just introduced the notion of instruction-stream serializing instructions. However, the memory ordering model was one place where legacy software that depended on an N<=4-deep store buffer would break on P6.

---++ System Vendors and Memory Ordering

At that time, ~1991-6, Intel had nowhere near as much control of the systems architecture as it does now. Loads bypassing the store buffer are the key element that makes you not strongly ordered, but whether you are PC or TSO is a function of the system architecture, which was not controlled by Intel.

Terminology: SC = Sequential Consistency. PC = Processor Consistency. TSO = so-called Total Store Order. See below for more terminology commentary.

Quick reminder: TSO Total Store Ordering is IMHO totally misnamed. It is not a total store ordering. It corresponds roughly to having a forwarding, or at least bypassing, store buffer between every processor and a "global ordering point". So there is not a single store ordering that every processor observes identically. Rather, there is a single store ordering for the entire system, but each processor may see its own stores "earlier" than they occur in the total store order seen by other processors. TSO is arguably the simplest memory model that supports Causal Consistency, but it is a bit stronger.
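The one reordering TSO does permit is StoreLoad, captured by the classic store-buffer litmus test. A hedged C11 sketch (my own illustration): under SC the outcome r1 == 0 && r2 == 0 is impossible, but under TSO, and on any real x86, it is allowed, because each processor's own store can still be sitting in its private store buffer when its subsequent load executes.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2;

static void *cpu0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);  /* store x     */
    r1 = atomic_load_explicit(&y, memory_order_relaxed); /* then load y */
    return NULL;
}

static void *cpu1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);  /* store y     */
    r2 = atomic_load_explicit(&x, memory_order_relaxed); /* then load x */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t t0, t1;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r1 == 0 && r2 == 0) /* forbidden under SC, allowed under TSO */
            printf("StoreLoad reordering observed, iteration %d\n", i);
    }
    return 0;
}
```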

The main other alternative system architecture, typically associated with PC "processor consistency", can be imagined as a tree of store buffers. E.g. processors P1 and P2 each have their own private store buffers, and then their traffic is merged into store buffer SB12. Processors P3 and P4 have their own private store buffers, and their traffic is merged into store buffer SB34. The global observation point may lie beyond SB12 and SB34. P1 can see not only its own stores earlier than in the so-called global order, but it can also see P2's stores earlier. And so on.
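Where the tree-of-store-buffers model visibly diverges from TSO is the IRIW (independent reads of independent writes) litmus test; a sketch in the same C11 conventions as above (again my illustration, with relaxed atomics standing in for raw hardware ordering):

```c
/* IRIW: two writers, two readers. Under TSO's single global store
 * order, if one reader sees x==1 then y==0, the other reader must
 * not see y==1 then x==0: everyone agrees on one order of the two
 * stores. In the tree-of-store-buffers model above, P1 can see P2's
 * store early via SB12 while P3/P4 still see the other store first,
 * so the readers may disagree. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;
static int a, b, c, d;

static void *w_x(void *p) { (void)p; atomic_store_explicit(&x, 1, memory_order_relaxed); return NULL; }
static void *w_y(void *p) { (void)p; atomic_store_explicit(&y, 1, memory_order_relaxed); return NULL; }

static void *reader_a(void *p) {
    (void)p;
    a = atomic_load_explicit(&x, memory_order_relaxed);
    b = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *reader_b(void *p) {
    (void)p;
    c = atomic_load_explicit(&y, memory_order_relaxed);
    d = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    pthread_create(&t[0], NULL, w_x, NULL);
    pthread_create(&t[1], NULL, w_y, NULL);
    pthread_create(&t[2], NULL, reader_a, NULL);
    pthread_create(&t[3], NULL, reader_b, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    if (a == 1 && b == 0 && c == 1 && d == 0) /* forbidden by TSO */
        printf("readers disagreed on the order of the two stores\n");
    return 0;
}
```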

(Actually, multipath systems like meshes should really be a major alternative system topology. But with writeback caching you can emulate TSO or PC. It is multipath systems like meshes with caches that create the most "interesting" memory model considerations. That, and trying to milk the highest performance out of a mesh, even if you have caches. But the guys who go that far usually aren't HW cache coherent anyway.)

Intel's 486 (and Pentium) implemented the private store buffers. But it was up to the system vendor to develop the system architecture.

During the development of P6 there was at least one system vendor that was planning to build a non-TSO system. It's been a long time, but my memory is that they had a smaller SMP, let's say 4 processors, and they interconnected two of these clusters with bidirectional store buffers. I think their 486-era implementation was writethrough caching only (i486 itself was writethrough only, but some system implementers built their own writeback caches). Of course we wanted and expected everybody to eventually go to writeback caches, but supporting older configurations always has some value, and at the time there was some serious debate about whether WT was occasionally higher performance.

IIRC this was the "smoking gun" or at least "last straw" that resulted in the official Intel P6 memory ordering model being processor consistency and not TSO. There were probably some other companies, but I remember this one in particular because it was a very clean example. SAD BUT TRUE: many years later I met the platform architect for that company, who had since joined Intel. He told me that the non-TSO version of that platform never shipped. If we had known this, Intel might very well have gone for TSO way back then.

... Except: there were various implementation bugs that made various systems fail not only TSO but also processor consistency. Sometimes the recommended workarounds essentially involved adding enough fence-type operations that the code was almost weak-ordering tolerant.

@PeterCordes links to a discussion of such errata, which caused messy #ifdef code in Linux: https://lkml.iu.edu/hypermail/linux/kernel/1803.2/02384.html

---++ Intel's not-really-official memory models

In the architectural documents I called the official Intel P6 memory ordering model something like "speculative processor consistency", where the "speculative" part was to emphasize that if you placed what RISC-V now calls non-idempotent MMIO devices in WB or WT speculatable memory types, you could not rely on MMIO side effects being done in any particular order. There were some awesome and terrifying systems that did things like map disk controller MMIO addresses to every address of a 4K page, and we just gave up on being compatible with that.

For many years Intel's internal memory ordering model target was stronger than we were willing to guarantee to the outside world. Not just fearing Intel implementation bugs, but also system vendor issues. Eventually, once the multiprocessor wild wild West settled down, more and more features of the internal memory ordering target model were disclosed in subsequent releases or updates of the memory ordering specification.

One of Intel's leading architects was vociferously against officially publishing any memory ordering model, whether processor consistency or the eventual TSO, for fear of such "minor" implementation bugs requiring a recall.

BTW, I believe that one of Intel's first multicore chips (multiple processors on the same chip) broke TSO by allowing forwarding of stores from one processor to the other before they were globally observed. Easily fixed by comparing processor IDs. Not quite sure where or how it happened, or even if.

---++ Memory Ordering and Implementations

IMHO full SC probably could have been implemented reasonably, using the speculation techniques that allow loads to pass other loads and stores. Particularly in systems that only contained memory types WB and UC.

(WB = Write-Back cacheable, WT=Write-Through cacheable, UC=Uncacheable)
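For a software-level feel of what full SC forbids: the store-buffer test above becomes impossible once both threads use sequentially consistent operations. A hedged C11 sketch (mine; on x86, compilers typically implement the seq_cst store as XCHG, or MOV plus MFENCE, draining the store buffer, whereas the speculative hardware approach mentioned above would instead let loads run ahead and roll back on a conflicting snoop):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

static atomic_int x, y;
static int r1, r2;

/* Store-buffer test again, but seq_cst: r1 == 0 && r2 == 0 is now
 * forbidden, i.e. the program behaves as if sequentially consistent. */
static void *cpu0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

static void *cpu1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, cpu0, NULL);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    assert(!(r1 == 0 && r2 == 0)); /* SC forbids this outcome */
    return 0;
}
```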

It is my understanding, based on some presentations and tutorials on memory ordering presented by IBM people at computer architecture conferences, that IBM mainframes do this, essentially by requiring a processor/cache to get ownership of any line before any access - whether writeback, writethrough, or uncached. Sounds good - but this amounts to building a cache directory or snoop filter to track such ownership even if you have no cache in your system.

Actually, it's not WB write-back per se that makes it easier to implement memory ordering. AFAICT it is any protocol that allows only a single writer to a memory location at any time, like the MESI and MOESI WB cache protocols. But not necessarily Write Update, which may or may not be write-back. Nor WT Write-Through, which did not require ownership; hence my surprise when IBM told me that they had to acquire ownership before doing a writethrough. Nor any protocols that try to dynamically decide whether to do write-back via read-for-ownership vs write-through, depending on which is faster for particular workloads. Nor write-back protocols with some optimizations that eliminate read-for-ownerships. And certainly not some of the GPU caches that have a dirty bit per byte, which allows simultaneous writes to the same or different bytes in a cache line without the "write back exposing stale data" issues that arise when you have a single dirty bit for an entire cache line.

Historically, certainly prior to P6 and probably more recently, low end x86 systems were "store and forget". As were many other not so low end non-x86 systems. A processor sent a writethrough store into an external store buffer, the store buffer said OK, and from the processor's point of view it was complete. No need to track "ownership" for WT or UC memory. With "store-and-forget" the external system must be involved in implementing the memory ordering model, and can easily break any supposed guarantees that a processor company tries to provide.

There are other ways of implementing stronger or strong-ish memory ordering models. E.g. it may be required that store-completion-acknowledgements be sent back to the processor only when the store has completely or finally been done. Then a processor can wait for store-completion before sending any order dependent request. Sounds good... except that my experience at that time was that system vendors often lied, sending store-completion earlier for improved performance. Early store completion is actually PC correct (not necessarily TSO) at almost any place in a system topology that you can guarantee all traffic goes through. But early store-completion is problematic when there is more than one path to a node.

I think this single path consideration is one reason why PCIe tends to have a tree structure, and modern x86 MP systems tend to be hierarchical. Mesh systems tend to be multipath. AMD's Opteron caches look like they should be multipath, but I learned that they were always configured with routing tables that are single path.

I clearly remember discussing "store-and-forget" versus "store-completion" with a former coworker who came from ARM. He conceded the possible performance advantages of store-and-forget (for WT and UC systems, which were ARM's original bread and butter), but silenced me by drawing K2,2 on my whiteboard: 2 processors connected to 2 different I/O devices, each by point-to-point links that might have store buffers. That's a really simple example of a multipath system. If you want to do store-and-forget there, you have to provide store-buffer draining.

Sorry to have digressed.

---++ BOTTOM LINE

Most 486 systems were probably TSO. At least to writeback cacheable memory. But Intel customers were definitely thinking about building non-TSO systems.

And in aggressive I/O systems prior to PCI it could be hard to find any sort of memory model. Certainly, you had to "flush the buffers" or the like using different techniques for different paths. (Which, BTW, is still true for some systems today. Even non-Intel SOC / NOCs.)

---++ Terminology Notes

When I say "SMP" I mean "shared memory multiprocessing", and nearly always some form of "HW cache coherent SMP". Not necessarily 100% symmetric in terms of access time; e.g. SMP with locally coupled faster memory regions is OK.

I should probably define, or at least handwave at, terms like SC, PC, PSO, and TSO.

SC (Sequential Consistency) is one of several very strongly ordered memory models. Some folks get religious about the distinctions between SC and strong ordering in general or particular.

TSO is fairly well defined. Note that for a while Intel systems were officially CC (Causally Consistent), but not necessarily TSO.

As I mentioned above, the original P6 memory ordering model was speculative processor consistency. Unfortunately, as far as I can tell, the more formal parts of this model were never made public.

I'm not going to discuss SUN SPARC PSO Partial Store Order.

I like the store buffer topology model as a way of thinking about the difference between TSO and "processor consistency". TSO=private store buffers per processor, global ordering. "Processor consistency" allows multilevel trees of store buffers. But there are differences between these physical models and the abstract memory ordering models.

I'm fairly sure that I should not be saying PC (Processor Consistency) for the "tree of store buffers" model. I think that's a special type of PC, but not the only type.

PC Processor Consistency is definitely not necessarily the same as "program order".

There are different weak ordering models. Some people get religious about the distinction between RC Release Consistency and WC Weak Consistency, which bind their ordering operations to slightly different instructions. Although some instruction sets have load-acquire and store-release, e.g. RISC-V and ARM (I think), their definitions of load-acquire and store-release are not necessarily the definitions that similarly named instructions had in the classic memory ordering models like RC and WC.

---+ End

  • Thanks for all these historical details! I edited to add definitions and wikipedia links for acronyms like PSO and PC closer to where you first used them. Hopefully my simplistic summaries like PC = release/acquire (like x86 all the time, or like AArch64 stlr / ldapr) don't conflict with what you meant. Commented Feb 3 at 2:00
  • 1
    Don't need @PeterCordes because we are the only people discussing ... // My only significant objection to your edits was saying "PC = release/acquire", without the qualification "like x86 all the time", so I deleted that. // Although this is a good example of how the load-acquire or store-release in models like WC or RC, or which we considered for P6, is not necessarily the same as the store-release in ARM or RISC-V. Some flavors of store-release were (or are) globally fencing, not the pairwise. I sometimes wish that Sarita had used different words for her pairwise memory models.
    – Krazy Glew
    Commented Feb 3 at 22:53
  • 1
    Yes, as I mentioned the multicore issue that I remembered, I couldn't figure out how it would have been a problem. Hence my waffling. I'm sure you are 100% correct, forwarding between hyper threads in a shared store buffer is much more likely to have been the problem.
    – Krazy Glew
    Commented Feb 3 at 22:55
  • 3
    Thanks for the clarification. BTW, in Stack Overflow markdown you can get actual headers of different sizes like ### header 3 / #### header 4 which you might want to use instead of ---+. Commented Feb 3 at 23:05
  • 1
    My bad, don't need exclusive access to location before reading. But may need to guarantee that nobody else has exclusive access and is writing. No WT cache I knew did anything special before writing - up until I heard about the IBM systems. Let alone UC. // Q: are there other systems like IBM with ownership for UC and WT? // Elsewhere have talked about sender and receiver implementations of memory ordering. Nice thing about special -acquire and -release instructions, whether loads, stores, or fences, is that they allow implementation choice.
    – Krazy Glew
    Commented Feb 3 at 23:08
18

The first standard for x86-based multi-processing systems was Intel’s MultiProcessor Specification, first published in 1994 – so there weren’t really any “typical” 486 SMP systems before then. (There were 486-based SMP systems, and Intel provided components such as the 82489DX APIC that were expected to be used when building such systems, but there was no standard for them to adhere to.) Version 1.4 of that specification explicitly requires the following:

  • Maintaining cache coherency—When one processor accesses data cached in another processor’s cache, it must not receive incorrect data. If it modifies data, all other processors that access that data also must not receive stale data. External caches must maintain coherency among themselves, and with the main memory, internal caches, and other bus master DMA devices.
  • Cache flushing—The processor can generate special flush and write-back bus cycles that must be used by external caches in a manner that maintains cache coherency. The actual responses are implementation-specific and may vary from design to design. A program can initiate hardware cache flushing by executing a WBINVD instruction. This instruction is only guaranteed to flush the caches of the local processor. See Appendix B for system-wide flushing mechanisms. Given that cache coherency is maintained by hardware, there is no need for software to issue cache flush instructions under normal circumstances.
  • Reliable communication—All processors must be able to communicate with each other in a way that eliminates interference when more than one processor accesses the same area in memory simultaneously. The processor uses the LOCK# signal for this purpose. External caches must ensure that all locked operations are visible to other processors.
  • Write ordering—In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed. External write buffers must maintain the write ordering of the processor.

Given that the 486 is a strictly in-order CPU, the cache coherency and memory locking requirements described above provide strong ordering guarantees across the whole system.
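That last "write ordering" bullet is exactly what the classic message-passing pattern relies on. A hedged C11 sketch (my illustration, not from the specification): if the external write buffers preserve the 486's program-order stores, the receiver can never observe the flag set while the data is still stale.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int data, flag;

/* Writer: data first, then flag, in program order. On a strictly
 * in-order 486 whose external write buffers maintain store order
 * (as MPS 1.4 requires), the stores become visible to other
 * processors in that same order. */
static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

/* Reader: once flag reads as 1, data must already be 42, provided
 * write ordering is maintained end to end. (Portable C11 code would
 * use release/acquire here; relaxed is used to mirror what the raw
 * hardware guarantee alone provides.) */
static void *reader(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```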

Pre-MPS symmetric multi-processor systems might not have implemented such strict guarantees, but I imagine most if not all of them did in practice. For example, the 1991 Digital 433MP System Overview explains:

Cache coherency guarantees that no processor loses track of whether its cached data is up-to-date or not. The cache coherency hardware in the applicationDEC system prevents a processor from changing data without informing other processors. Whenever a processor caches system memory it marks the memory as taken. If it changes or knows that it will have to change the data, it marks the data so that other processors are prevented from copying the data until the current operations on the cache are complete. The cache coherency scheme implemented in the applicationDEC system ensures that when one processor caches data, the other processors are not affected.

  • 1
    The NCR Voyager is another interesting x86 SMP implementation, dating back at least to the 486, and supported for a few years by Linux in the mid- to late noughts. Commented Feb 1 at 15:30
  • 2
    There were even some 386 SMP systems, with sequentially-consistent memory ordering (like TSO but without a store buffer or instruction pipelining, so no StoreLoad reordering either.) e.g. cs.hac.ac.il/staff/martin/Micro_Modern/slide08.pdf has a photo of a "Makbilan Parallel Computer" from 1989: sixteen 386 single-board computers with a shared address-space. Also os2museum.com/wp/386-cache-coherency - 386's bus was designed to support cache coherency for external caches, and for non-Intel chips with cache. Commented Feb 1 at 21:35
  • 2
    Note that modern x86's TSO memory model is what you get from coherent caches and in-order execution with a store buffer + store forwarding, like 486 and Pentium (with in-order commit from the store buffer, not looking past the oldest entry on cache miss stores.) PPro was the first x86 Intel CPU that had to intentionally limit what memory-ordering it could make visible, by speculatively loading early but rolling back (machine_clears.memory_ordering perf event) if the cache line was invalidated before the load was architecturally allowed to happen. Commented Feb 1 at 21:42
