BRIEF: most i486 multi-processors were probably TSO. But system vendors could definitely build non-TSO systems, and were definitely considering doing so.
---+ DETAIL
---++ SMP Marketplace ~1991 (not just Intel)
During the Intel P6 design effort (~1991-1996) we had to consider:
   * SMP systems like Sequent, who started out as a bunch of ex-Intel engineers building NS32032-based systems, then '386, '486, and '586 systems. Wikipedia doesn't mention Sequent shipping a P6-based system, but we surely talked to them, especially because they were in a neighboring suburb of Portland, Oregon.
   * NCR's 486-based SMP systems.
   * A scattering of other x86-based multiprocessor systems that we would have been reluctant to break compatibility with.
   * Companies who were moving to x86 from other processor families, whom we sought to encourage.
It's fair to say that many companies were considering building x86 multiprocessors at that time.
---++ Memory Models on Our Mind
P6 spent quite some time thinking seriously about implementing sequential consistency, because there were some older systems (Intel and non-Intel based) that had such strong memory ordering models. However, I clearly remember when 'NS' said something like "Why are we bothering with strong ordering? The i486 allows loads to bypass stores in its write buffer between CPU and L1 cache." Since we were running short on time, we switched to the weaker memory ordering model.
By the way, lest people think that Intel started off in the strong memory ordering / SC camp and then only reluctantly relaxed, let me say that one of my first assignments at Intel was to participate in the "Weak Ordering Task Force". This petered out after a while, even before we realized we could achieve higher performance using a stronger memory ordering model and speculation. Weak ordering did not raise its head again until Itanium.
Nevertheless, I still think that Intel should have defined a "store-release" instruction, as that is required for so many weaker memory ordering models. Even if those models have slightly different definitions of store-release, supporting the instruction would have allowed us to implement one of the weaker models. I think that LOCK MOV to memory, i.e. the x86 LOCK prefix in front of a store instruction, would have been easy to do. IIRC the LOCK prefix on stores was ignored at this time.
But I emphasize: weaker memory ordering models with store-release instructions were probably lower performance on the shared-bus multiprocessor systems that Intel was anticipating in the near future when the P6 effort started in 1991. I wanted store-release both for higher end HPC systems and for lower end embedded systems that might have wanted to avoid the implementation complexity of large-scale speculative consistency maintenance.
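To make the store-release discussion concrete, here is the classic message-passing idiom that a dedicated store-release instruction accelerates, sketched in modern C++ (which of course did not exist in 1991; the function names are mine, purely illustrative). On today's x86, the release store below compiles to a plain MOV, because TSO already provides release semantics, whereas a sequentially consistent store needs a full-barrier instruction such as XCHG or a LOCK-prefixed operation - roughly the niche a "LOCK MOV to memory" could have filled.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> flag{0};
int payload = 0;   // ordinary (non-atomic) data published via the flag

void producer() {
    payload = 42;                                  // plain store
    flag.store(1, std::memory_order_release);      // store-release: publishes payload
}

int consumer() {
    // load-acquire: spin until the producer has published
    while (flag.load(std::memory_order_acquire) == 0) { }
    return payload;                                // guaranteed to observe 42
}

// Run producer and consumer on two threads; deterministic because the
// consumer waits for the release store before reading the payload.
int run_demo() {
    int r = 0;
    std::thread t1(producer);
    std::thread t2([&] { r = consumer(); });
    t1.join();
    t2.join();
    return r;
}
```

The point of the pairing: the release store orders the payload write before the flag write, and the acquire load orders the flag read before the payload read, without requiring a full barrier on either side.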
Note that the i486 only allowed loads to bypass stores with non-overlapping addresses, in a 4-deep store buffer between processor and the L1 cache, whereas P6 provided store-to-load forwarding out of its much deeper store buffer. 16 entries? 32 entries? Not determined at that time, but we knew it was going to be much deeper. Not only is there a slight difference between bypassing and forwarding store buffers with respect to the memory model, but the different depths were possibly a greater concern. We already knew that some code could detect the difference between i386, i486, and Pentium/P5 based on timing and estimation of the store buffer and instruction prefetch queue depths, and sometimes depended on particular timings and depths. However, we had already decided that it was a fool's errand to try to keep code running that depended excessively on instruction timing, particular microarchitecture buffer depths, etc.
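The bypass-vs-forwarding distinction can be sketched with a toy single-CPU model (names and behavior are illustrative, not actual i486/P6 microarchitecture): a bypass buffer lets a load go around pending stores only when addresses don't overlap, while a forwarding buffer supplies the youngest matching store's data directly.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct Store { uint32_t addr; uint32_t data; };

struct StoreBuffer {
    std::deque<Store> entries;
    bool forwarding;   // false: i486-style bypass; true: P6-style forwarding

    void store(uint32_t addr, uint32_t data) { entries.push_back({addr, data}); }

    // Attempt a load around the pending stores. Returns the value, or
    // nullopt if the load must stall until the overlapping store drains.
    std::optional<uint32_t> load(uint32_t addr, uint32_t mem_value) {
        // Scan youngest-first for an overlapping store.
        for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
            if (it->addr == addr) {
                if (forwarding) return it->data;  // forward youngest matching store
                return std::nullopt;              // bypass-only: overlap forces a stall
            }
        }
        return mem_value;   // no overlap: load bypasses the buffer, reads memory
    }
};
```

A bypass-only buffer thus stalls exactly where a forwarding buffer returns buffered data, which is one of the (slight) memory-model-visible differences mentioned above.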
Overall we felt more comfortable having P6 implement more conservative architectural policies than previous processors, rather than less conservative. E.g. for self-modifying code, P6 let a write to the instruction stream affect the very next instruction, rather than exposing the actual microarchitecture instruction prefetch queue depth, which was certainly deeper than that of previous processors. Code for previous processors "knew" that a write to the instruction stream was guaranteed to take effect only if it wrote more than, say, 4*16 bytes' worth of instruction fetches ahead. P5 had only just introduced the notion of instruction-stream-serializing instructions. However, the memory ordering model was one place where legacy software that depended on an N<=4-deep store buffer would break on P6.
---++ System Vendors and Memory Ordering
At that time, ~1991-6, Intel had nowhere near as much control of the systems architecture as it does now. Load bypassing of the store buffer is the key element that keeps you from being strongly ordered, but whether you are PC or TSO is a function of the system architecture, which was not controlled by Intel.
Terminology: SC = Sequential Consistency. PC = Processor Consistency.
TSO = so-called Total Store Order. See below for more terminology commentary.
Quick reminder: TSO, Total Store Ordering, is IMHO totally misnamed. It is not a total store ordering. It corresponds roughly to having a forwarding, or at least bypassing, store buffer between every processor and a "global ordering point". So there is not a single store ordering for the entire system that is seen identically by all processors. Rather, there is a single store ordering for the entire system, but each processor may see its own stores "earlier" than they occur in the total store order seen by other processors. TSO is arguably the simplest memory model that supports Causal Consistency, though it is a bit stronger.
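The "private store buffer per processor in front of a global ordering point" picture can be made executable with a toy model (all names illustrative). A load first snoops the CPU's own store buffer, so it sees its own stores early, but it never sees another CPU's buffered stores - which permits the classic store-buffer litmus outcome where both CPUs read the other's variable as 0.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct TsoMachine {
    std::map<uint32_t, uint32_t> memory;   // the global ordering point
    std::vector<std::vector<std::pair<uint32_t, uint32_t>>> sb;  // per-CPU store buffers

    explicit TsoMachine(int ncpus) : sb(ncpus) {}

    void store(int cpu, uint32_t addr, uint32_t data) {
        sb[cpu].push_back({addr, data});
    }

    // Own buffer first (youngest match wins), then global memory.
    // Never another CPU's buffer: that is the TSO property.
    uint32_t load(int cpu, uint32_t addr) {
        for (auto it = sb[cpu].rbegin(); it != sb[cpu].rend(); ++it)
            if (it->first == addr) return it->second;
        auto m = memory.find(addr);
        return m == memory.end() ? 0 : m->second;
    }

    // Buffered stores reach the global order, in order.
    void drain(int cpu) {
        for (auto& s : sb[cpu]) memory[s.first] = s.second;
        sb[cpu].clear();
    }
};
```

Running the litmus test on this model: CPU0 does x=1 then reads y; CPU1 does y=1 then reads x; with both stores still buffered, both loads return 0, yet each CPU sees its own store - exactly the behavior that makes "Total Store Order" a misleading name.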
The main other alternative system architecture, typically associated with PC "processor consistency", can be imagined as a tree of store buffers. E.g. processors P1 and P2 each have their own private store buffers, and then their traffic is merged into store buffer SB12. Processors P3 and P4 have their own private store buffers, and their traffic is merged into store buffer SB34. The global observation point may lie beyond SB12 and SB34. P1 can see not only its own stores earlier than in the so-called global order, but it can also see P2's stores earlier. And so on.
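The tree-of-store-buffers picture can likewise be sketched as a toy model (illustrative names; a 2-processor cluster for brevity). The PC-like behavior shows up in where a load is allowed to hit: a processor inside the cluster can see a sibling's store sitting in the shared SB12 before it reaches the global observation point, which the TSO private-buffer picture forbids.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct PcCluster {
    std::map<uint32_t, uint32_t> memory;               // global observation point
    std::vector<std::pair<uint32_t, uint32_t>> sb12;   // shared second-level buffer
    std::vector<std::pair<uint32_t, uint32_t>> sb[2];  // private buffers for P1, P2

    void store(int cpu, uint32_t a, uint32_t d) { sb[cpu].push_back({a, d}); }

    // Private buffer drains into the shared SB12, not yet into memory.
    void drain_private(int cpu) {
        for (auto& s : sb[cpu]) sb12.push_back(s);
        sb[cpu].clear();
    }

    // P1/P2 loads see: own private buffer, then SB12, then global memory.
    // Hitting a *sibling's* store in SB12 is what makes this PC-like.
    uint32_t load_cluster(int cpu, uint32_t a) {
        for (auto it = sb[cpu].rbegin(); it != sb[cpu].rend(); ++it)
            if (it->first == a) return it->second;
        for (auto it = sb12.rbegin(); it != sb12.rend(); ++it)
            if (it->first == a) return it->second;
        auto m = memory.find(a);
        return m == memory.end() ? 0 : m->second;
    }

    // A processor outside this cluster (e.g. P3) observes only global memory.
    uint32_t load_outside(uint32_t a) {
        auto m = memory.find(a);
        return m == memory.end() ? 0 : m->second;
    }
};
```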
(Actually, multipath systems like meshes should really be a major alternative system topology. But with writeback caching you can emulate TSO or PC. It is multipath systems like meshes with caches that create the most "interesting" memory model considerations. That, and trying to milk the highest performance out of a mesh, even if you have caches. But the guys who go that far usually aren't HW cache coherent anyway.)
Intel's 486 (and Pentium) implemented the private store buffers. But it was up to the system vendor to develop the system architecture.
During the development of P6 there was at least one system vendor that was planning to build a non-TSO system. It's been a long time, but my memory is that they had a smaller SMP, let's say 4 processors, and they interconnected two of these clusters with bidirectional store buffers. I think their 486-era implementation was writethrough caching only (the i486 itself was writethrough only, but some system implementers built their own writeback caches). Of course we wanted and expected everybody to eventually go to writeback caches, but supporting older configurations always has some value, and at the time there was some serious debate about whether WT was occasionally higher performance.
IIRC this was the "smoking gun" or at least "last straw" that resulted in the official Intel P6 memory ordering model being processor consistency and not TSO. There were probably some other companies, but I remember this one in particular because it was a very clean example. SAD BUT TRUE: many years later I met the platform architect for that company, who had since joined Intel. He told me that the non-TSO version of that platform never shipped. If we had known this, Intel might very well have gone for TSO way back then.
... Except: there were various implementation bugs that made various systems fail not only TSO, but also processor consistency. Sometimes the recommended workarounds essentially involved adding enough fence-type operations that the code was almost weak-ordering tolerant.
@PeterCordes links to a discussion of such errata that caused messy #ifdef code in Linux: https://lkml.iu.edu/hypermail/linux/kernel/1803.2/02384.html
---++ Intel's not really official memory models
In the architectural documents I called the official Intel P6 memory ordering model something like "speculative processor consistency", where the speculative part was to emphasize that if you placed what RISC-V now calls non-idempotent MMIO devices in WB or WT speculatable memory types, you could not rely on MMIO side effects being done in any particular order. There were some awesome and terrifying systems that did things like map disk controller MMIO addresses to every address of a 4K page, and we just gave up on being compatible with that.
For many years Intel's internal memory ordering model target was stronger than we were willing to guarantee to the outside world. Not just fearing Intel implementation bugs, but also system vendor issues. Eventually, once the multiprocessor wild wild West settled down, more and more features of the internal memory ordering target model were disclosed in subsequent releases or updates of the memory ordering specification.
One of Intel's leading architects was vociferously against officially publishing any memory ordering model, whether processor consistency or the eventual TSO, for fear of such "minor" implementation bugs requiring a recall.
BTW, I believe that one of Intel's first multicore chips - multiple processors on the same chip - broke TSO by allowing forwarding of stores from one processor to the other, before globally observed. Easily fixed by comparing processor IDs. Not quite sure where or how it happened, or even if.
---++ Memory Ordering and Implementations
IMHO full SC probably could have been implemented reasonably, using the speculation techniques that allow loads to pass other loads and stores. Particularly in systems that only contained memory types WB and UC.
(WB = Write-Back cacheable, WT=Write-Through cacheable, UC=Uncacheable)
It is my understanding, based on some presentations and tutorials on memory ordering presented by IBM people at computer architecture conferences, that IBM mainframes do this, essentially by requiring a processor/cache to get ownership of any line before any access - whether writeback, writethrough, or uncached. Sounds good - but this amounts to building a cache directory or snoop filter to track such ownership even if you have no cache in your system.
Actually, it's not WB write-back per se that makes it easier to implement memory ordering. AFAICT it is any protocol that allows only a single writer to a memory location at any time, like the MESI and MOESI WB cache protocols. But not necessarily Write Update, which may or may not be write-back. Nor WT Write-Through, which did not require ownership. Hence my surprise when IBM told me that they had to acquire ownership before doing a writethrough. Nor any protocols that try to dynamically decide whether to do write-back via read-for-ownership vs. write-through, depending on which is faster for particular workloads. Nor write-back protocols with some optimizations that eliminate read-for-ownerships. And certainly not some of the GPU caches that have a dirty bit per byte, which allows simultaneous writes to the same or different bytes in a cache line without the "write-back exposing stale data" issues that arise when you have a single dirty bit for an entire cache line.
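The per-byte-dirty-bit trick can be sketched with a toy cache line (an 8-byte "line" for brevity; names illustrative). On writeback, only bytes this cache actually wrote are merged into memory, so two caches writing different bytes of the same line cannot clobber each other with stale copies - unlike a single dirty bit covering the whole line, where the second writeback would overwrite the first writer's byte with stale data.

```cpp
#include <array>
#include <cstdint>

constexpr int LINE = 8;   // toy line size

struct ByteDirtyLine {
    std::array<uint8_t, LINE> data{};   // this cache's (possibly stale) copy
    std::array<bool, LINE> dirty{};     // one dirty bit per byte

    void write(int off, uint8_t b) {
        data[off] = b;
        dirty[off] = true;
    }

    // Merge into memory: untouched bytes are left alone, even if this
    // cache's copy of them is stale.
    void writeback(std::array<uint8_t, LINE>& mem) {
        for (int i = 0; i < LINE; ++i) {
            if (dirty[i]) {
                mem[i] = data[i];
                dirty[i] = false;
            }
        }
    }
};
```

With a single dirty bit per line, `writeback` would have to copy all eight bytes, and whichever cache wrote back last would win the whole line - exposing its stale bytes.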
Historically, certainly prior to P6 and probably more recently, low end x86 systems were "store and forget". As were many other not so low end non-x86 systems. A processor sent a writethrough store into an external store buffer, the store buffer said OK, and from the processor's point of view it was complete. No need to track "ownership" for WT or UC memory. With "store-and-forget" the external system must be involved in implementing the memory ordering model, and can easily break any supposed guarantees that a processor company tries to provide.
There are other ways of implementing stronger or strong-ish memory ordering models. E.g. it may be required that store-completion-acknowledgements be sent back to the processor only when the store has completely or finally been done. Then a processor can wait for store-completion before sending any order dependent request. Sounds good... except that my experience at that time was that system vendors often lied, sending store-completion earlier for improved performance. Early store completion is actually PC correct (not necessarily TSO) at almost any place in a system topology that you can guarantee all traffic goes through. But early store-completion is problematic when there is more than one path to a node.
I think this single path consideration is one reason why PCIe tends to have a tree structure, and modern x86 MP systems tend to be hierarchical. Mesh systems tend to be multipath. AMD's Opteron caches look like they should be multipath, but I learned that they were always configured with routing tables that are single path.
I clearly remember discussing "store-and-forget" versus "store-completion" with a former coworker who came from ARM. He conceded the possible performance advantages of store-and-forget (for WT and UC systems, which were ARM's original bread and butter), but silenced me by drawing K2,2 on my whiteboard: 2 processors, connected to 2 different I/O devices, each by point-to-point links that might have store buffers. That's a really simple example of a multipath system. If you want to do store-and-forget there, you have to provide store-buffer draining.
Sorry to have digressed.
---++ BOTTOM LINE
Most 486 systems were probably TSO. At least to writeback cacheable memory. But Intel customers were definitely thinking about building non-TSO systems.
And in aggressive I/O systems prior to PCI it could be hard to find any sort of memory model. Certainly, you had to "flush the buffers" or the like using different techniques for different paths. (Which, BTW, is still true for some systems today. Even non-Intel SOC / NOCs.)
---++ Terminology Notes
When I say "SMP" I mean "shared memory multiprocessing", and nearly always some form of "HW cache coherent SMP". Not necessarily 100% symmetric in terms of access time. E.g. SMP with locally coupled faster memory regions is OK.
I have probably handwaved or obscured terms like SC, PC, PSO, and TSO.
SC (Sequential Consistency) is one of several very strongly ordered memory models. Some folks get religious about the distinctions between SC and strong ordering in general or particular.
TSO is fairly well defined. Note that for a while Intel systems were officially CC (Causally Consistent), but not necessarily TSO.
As I mentioned above, the original P6 memory ordering model was speculative processor consistency. Unfortunately, as far as I can tell, the more formal parts of this model were never made public.
I'm not going to discuss SUN SPARC PSO Partial Store Order.
I like the store buffer topology model as a way of thinking about the difference between TSO and "processor consistency". TSO=private store buffers per processor, global ordering. "Processor consistency" allows multilevel trees of store buffers. But there are differences between these physical models and the abstract memory ordering models.
I'm fairly sure that I should not be saying PC (Processor Consistency) for the "tree of store buffers" model. I think that's a special type of PC, but not the only type.
PC Processor Consistency is definitely not necessarily the same as "program order".
There are different weak ordering models. Some people get religious about the distinction between RC Release Consistency and WC Weak Consistency, which differ in part by binding ordering to slightly different operations. Although some instruction sets have load-acquire and store-release, e.g. RISC-V and ARM (I think), their definitions of load-acquire and store-release are not necessarily the definitions that similarly named instructions had in the classic memory ordering models like RC and WC.
---+ End
[1] lock add dword [esp], 0 is a full memory barrier with no other side effects. (And ironically it is more efficient than mfence on modern x86, so it's what compilers use even on x86-64, with RSP of course. lock-ed instructions may not be 100% ordered wrt. SSE4.1 NT loads from WC memory, but mfence is; I think this is why mfence is slower. See "Does lock xchg have the same behavior as mfence?" - on Skylake, a microcode update made mfence drain the ROB (reorder buffer) like lfence, as well as the store buffer.)