10

The PC architecture, from the original IBM PC onward, has always been designed around the idea that video memory will be on an expansion card. This was an unusual design decision; most 80s computers did not do it that way. It has advantages in terms of easier upgrades, but can have the disadvantage of making it slower for the CPU to write to video memory.

Of course nowadays, that doesn't matter because you're not supposed to use the CPU to write to video memory at all; that's the GPU's job. But I'm thinking about the 80s and early 90s, when the CPU still did all the work.

Now, there are always going to be wait states during active scan line, because the video chip is using some of the video memory bandwidth for generating the display; that's true regardless of where the video memory is. (Unless you have dual-port VRAM, which was expensive and rarely used.) I'm not talking about that here.

I'm talking about wait states incurred by the CPU writing to video memory, even outside active scan line, because it's on an expansion card at the other end of the expansion bus, as opposed to main memory on the motherboard.

When did video memory start incurring wait states for that reason? For example, I would expect CGA not to do so because the original IBM PC actually put main memory beyond 64K into expansion cards on the ISA bus. But what about EGA? VGA?

9
  • 1
    IBM certainly changed that (only 64KB on the mobo) really early in the product’s life.
    – RonJohn
    Commented Mar 10 at 14:57
  • 5
    "Unless you have dual-port VRAM, which was expensive and rarely used" - actually, the opposite is true, at least for CGA cards. Admittedly, IBM's original didn't use dual-ported memory, but quite a number of clones did.
    – tofro
    Commented Mar 10 at 16:58
  • 1
    @tofro: I wouldn't particularly expect a CGA clone to use a dual-ported VRAM; since each video fetch of an even address is always immediately followed by a fetch of the next odd address, page-mode addressing could allow eighty pairs of bytes to be fetched per line with memory timings only slightly tighter than would be needed to perform eighty single accesses (like what would be used in 40-column or graphics mode).
    – supercat
    Commented Mar 10 at 17:32
  • 5
    @supercat Well, even the Amstrad PC1512 had DP-VRAM.
    – tofro
    Commented Mar 10 at 17:45
  • 3
    @supercat There's no shared memory (that was pretty uncommon in the early days) but rather dedicated video memory. And Amstrad was quite well-known for not overbuilding their machines.
    – tofro
    Commented Mar 10 at 18:19

5 Answers

24

This question starts with a number of misconceptions about early video, memory and bus systems that need to be addressed in order to clarify what this question really appears to be about.

If you want the quick summary, you can skip to the end of this (rather long, sorry) post.

Expansion Cards for Video

The PC architecture...has always been designed around the idea that video memory will be on an expansion card. This was an unusual design decision...

Accessing video hardware and memory through an expansion bus was not an unusual design decision; in fact, that was how video had commonly been done before the advent of "all-in-one" units such as the Apple 1 and the 1977 trinity, on both microcomputer systems and the minicomputers that preceded them. (And it continued to be done even on systems with built-in video.) Some examples pre-dating the IBM PC include:

  1. The Knight TV and keyboard, dating from somewhere around 1972, was a video frame buffer and circuitry on a Unibus expansion card for a PDP-11.

  2. S-100 machines starting from the Altair, which almost invariably did not have built-in video. Video expansion cards for S-100 started appearing in 1976. These included the Cromemco Dazzler, which did not have onboard RAM but instead accessed system RAM over the bus via DMA. Other cards, such as the Processor Technology VDM-1, had onboard RAM. These are actually somewhat parallel to the IBM PC CGA and MDA cards, in that the Dazzler was low resolution but offered colour graphics, whereas the VDM-1 had higher resolution, but offered only monochrome text. You can see quite a few more S-100 video boards in this list.

  3. The JFF Electronics Color Graphics Interface PC Board, released in 1979. This added colour graphics to the TRS-80 Model I and connected to its expansion interface connector on the back. I can't find detailed technical information on this, but given that it used BASIC PEEK and POKE as well as OUT commands, it was almost certainly a memory-mapped display, with the memory either on the graphics board (accessed by the CPU through the expansion bus) or in system RAM (accessed by the graphics subsystem through the expansion bus). (There were several other graphics expansion cards for the TRS-80 Model I introduced from about 1979 onwards as well.)

  4. The 8086-based NEC N5200 (limited English info here), released just before the IBM PC in July 1981, had built-in text-only video circuitry, but graphics required an expansion board.

For the IBM PC, putting the video cards on the expansion bus was a perfectly sensible design decision since from the start they were making two quite different displays available: a high-resolution monochrome text-only display (MDA) and a slightly lower resolution colour graphics display (CGA). They could have gone the route of including one of those on the motherboard and making the other an expansion bus option, as the NEC N5200 did, but that would then impose extra cost on customers who wanted the display not built into the motherboard. And, as explained below, whether the display circuitry was on the motherboard or an expansion card made no difference.

Memory on an Expansion Bus

It is not the case that video memory (or any memory) on an expansion card will be slower to write than memory on the CPU board. In fact, on most microcomputers up to and including the IBM PC and PC/AT, using an expansion interface itself makes no difference to memory speed because the expansion bus is simply an extension of the system bus. Earlier systems, including all minicomputers and many microcomputers through the mid-1970s, generally had memory only on the expansion bus. (E.g., the PDP-11 had memory, as well as peripherals, on the Unibus, and most S-100 systems, from the Altair onward up until the single-board systems that started appearing in the late '70s, had memory on a card separate from the CPU card on the S-100 bus.)

Even single-board computers often had system RAM on an expansion card, which was indistinguishable from the RAM on the mainboard except for the particular location in the address space. The Apple II and II Plus supported only 48K of RAM on the main board; Apple offered the extremely popular Language Card to add another 16K of RAM, expanding it to 64K. The TRS-80 Model I likewise supported only 16K of RAM on the mainboard; additional RAM (up to 48K total) was in the Expansion Interface.

As with the above machines, the ISA bus used by the IBM PC was just a (buffered) extension of the system bus and it was normal to put system RAM on a card on the ISA bus (and, in fact, required if you wanted more than 256K of RAM on the first models). Multi-function expansion cards that included RAM were very popular; one example was the AST SixPakPlus which included up to 384 KB of RAM, RS-232 serial, a parallel printer port, and a battery-backed real-time clock.

The key point here is that for many microcomputers, even through the 1980s, an expansion bus was simply an extension of the system bus, and it made essentially no difference (for timing or otherwise) whether RAM or other devices were on a separate card attached to an expansion bus connector or directly on the motherboard. Electrically and logically, the CPU saw these as the same.

This did eventually change as PC clones grew faster yet still wanted to maintain backward compatibility with older, slower ISA cards. And this is indeed when wait states started being incurred "because [the card was] on an expansion card at the other end of the expansion bus," as you ask; but it had little additional effect on video memory, since that was already incurring wait states for different reasons, which we'll get into next.

Video RAM Conflicts

Now, there are always going to be wait states during active scan line, because the video chip is using some of the video memory bandwidth for generating the display...

This is incorrect.

While it's true that generating a video display from a frame buffer in RAM will use some of the transfer bandwidth available from the RAM, it is not the case that this always involves delaying CPU access or making the CPU wait, even with single-ported memory. That depends entirely on the system design and the speed of the RAM.

A fair number of systems (particularly 6800- and 6502-based ones) used RAM considerably faster than necessary for just the CPU, allowing them not to delay CPU memory accesses at all. This is easy on Motorola bus systems (6800/6502/6809/etc.) because the CPU accesses memory only on one phase of the clock cycle, known as ϕ0 or ϕ2. This leaves memory entirely free for access on the other phase, ϕ1, and, if the memory is fast enough, it can be read by the video subsystem during this time at no penalty to CPU access speed. The Apple II and the BBC Micro are among the classic examples of this technique¹, but it was very widely used elsewhere.

Unfortunately for Intel bus systems, such as the 8080, Z80, and 8088, things become a bit more complex. On a typical 1 MHz Motorola bus system the regular 500 ns ϕ2 CPU access cycle followed by a 500 ns ϕ1 free cycle allows regular, synchronous access to RAM when the CPU is not using it, and the typical 250 ns access time of mid- to late-'70s DRAM provides plenty of bandwidth to read or write data in each of these two half-cycles. Intel bus systems, by contrast, have a more asynchronous access pattern.
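
To make that arithmetic explicit, here is a minimal sketch (using the round figures above; real machines vary in detail):

```
# Back-of-the-envelope check of the interleaved-access argument above
# (illustrative figures only; real machines vary).
cpu_clock_hz   = 1_000_000           # a typical 1 MHz 6800/6502 system clock
cycle_ns       = 1e9 / cpu_clock_hz  # 1000 ns per full CPU cycle
phase_ns       = cycle_ns / 2        # 500 ns per clock phase (phi1 / phi2)
dram_access_ns = 250                 # typical mid- to late-'70s DRAM access time

# The CPU only touches memory during one phase, leaving the other phase
# as a free slot for the video circuitry, provided the RAM can complete
# an access within a single phase.
print(f"phase length {phase_ns:.0f} ns, DRAM access {dram_access_ns} ns")
print("video can use the idle phase with no CPU penalty:",
      dram_access_ns <= phase_ns)
```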

On Intel bus systems the CPU is typically run with a 2-4 MHz clock, where a varying number of clock periods, known as "T-states" (4-10 on the 8080, and up to 23 on some compatible CPUs), are required to execute a single instruction, and memory is accessed only during certain T-states of certain "M-cycles," which consist of 3-6 adjacent T-states. (The complexity is increased even further if the Z80 CPU's support for DRAM refresh is used, as that will access memory during some groups of T-states where the CPU instruction/data fetch unit does not access memory.)

Thus, it's typical on Intel bus systems to avoid entirely the difficult problem of figuring out on which cycles the CPU is accessing memory and simply request that the CPU pause during the times when an external system needs to access memory for any reason.² (This is referred to as "direct memory access" or DMA.)

This technique was used from quite early on because it allows for completely asynchronous access by the DMA system, whether it be for a video display or other purposes. For example, the NEC TK-80 trainer board used DMA to read eight memory addresses from RAM to provide the data for which segments to light in its eight 7-segment LED displays; the circuit that did the DMA was run by a simple 555 timer that had no synchronisation with the system clock (and, taking its timing from an R/C network rather than a crystal resonator, did not even have exact or perfectly steady timing).

And this, as explained in supercat's answer, is how the CGA card deals with memory access conflicts: it simply pauses the CPU while it's reading from video memory. Note that this has nothing to do with the expansion bus on most microcomputers through the IBM PC and PC/AT; the same issue would exist if the CGA video subsystem were built into the motherboard (and indeed does exist on PC clones with built-in CGA, such as the early Compaq portables).

Nor is it even required to deal with these memory access conflicts via wait states or any other means. One obviously doesn't want to skip memory writes to VRAM from the CPU, since the program would then be instructing the display to show things that would never appear, but it's perfectly reasonable to simply skip reads by the video subsystem if they would interfere with writes to (or reads from) VRAM by the CPU. This will simply introduce blanks (or, more usually, rubbish) on the screen when the video system is prevented from reading VRAM. And this was far from unknown; among the many systems that did this was the TRS-80 Model I (where the video memory and display subsystem were on the motherboard), which would display "static" on the screen when video subsystem reads were preempted by CPU reads or writes.

It's also worth mentioning that, from the introduction of microcomputers, wait states were often introduced simply because memory or another peripheral was slow. This had nothing to do with it being on an expansion bus or not; as with the video examples above, it was part of the nature of accessing the chips themselves, regardless of how connected to the CPU.

GPUs and Modern Systems

The CPU/GPU split is not new: in the microcomputer world it's existed since at least 1979 when the Texas Instruments TMS9918 Video Display Processor was released. The 9918 had its own separate video RAM (accessed via sending read and write commands to the chip) and, as well as having an extremely flexible character or glyph display system, also could do certain things on its own, such as display sprites.

This might not be considered a "true" GPU since it didn't actually ever write to its own video RAM (in the initial versions, anyway), but the following year's NEC μPD7220 High-Performance Graphics Display Controller (one of the best known graphics chips of the 1980s amongst those who were familiar with the market) did have commands to instruct it to do things like write lines and sectors of circles to the frame buffer, just as GPUs do today (except in a much more sophisticated way). Nor was this restricted to custom silicon: the 1981 Fujitsu FM-8 (and FM-7, FM77 and subsequent machines in the series) used a second 6809 CPU with its own address space and graphics code to draw characters and graphics on its 640×200 bitmap display. (The video subsystem itself had no character display functionality; the graphics CPU always had to write the individual dots of each character.)

Modern systems using a GPU still have the exact same issues with having to wait to write to memory, except perhaps worse. CPUs and GPUs communicate mostly through shared memory (either the system RAM or RAM on the GPU card that is usually called "VRAM"), with the CPU writing instructions and data for the GPU into system RAM or VRAM and the GPU reading those, updating the data where necessary, and then writing what is to be displayed to the frame buffer for the video subsystem to read. (The frame buffer is in VRAM, but is only a very small part of it. A 4K full-colour frame buffer is only about 32 MB, a small fraction of the gigabytes of VRAM on modern GPU cards. The majority of the data in VRAM is usually "textures," of which the GPU reads only a small part for every frame in order to produce modified data to be written to the frame buffer.)
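
(For what it's worth, the "about 32 MB" figure is easy to check; the sketch below assumes 3840×2160 at 32 bits per pixel.)

```
# Sanity check of the "about 32 MB" figure above
# (assumes 3840x2160 at 32 bits per pixel).
width, height, bytes_per_pixel = 3840, 2160, 4
frame_bytes = width * height * bytes_per_pixel
print(f"{frame_bytes:,} bytes = {frame_bytes / 2**20:.1f} MiB")  # ~31.6 MiB
```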

Effectively, this is introducing a separate computer between the "main" computer and the video subsystem. So, unlike with a simple frame buffer, what we now have to deal with is a very strict requirement that neither the CPU nor the GPU can fail a read or write to their shared memory. Thus when both want to read or write at the same time, one or the other must wait. Again, this is nothing to do with the bus, but inherent to the nature of shared memory.

Summary

  1. The IBM PC putting video memory or a video subsystem on an expansion card was not an unusual idea; it was very common long before the IBM PC came out.
  2. Putting the video memory/video subsystem on a card did not make it slower for the IBM PC to access video memory. That was inherent in the nature of the memory and video subsystem itself. Or more precisely, it was because IBM chose to prioritise a clean (i.e., never showing "static") display over the speed at which the CPU could write (or read) RAM used as a video frame buffer.
  3. The wait states for PC video memory in CGA and MDA cards (and for many, many things—memory and non-memory alike—before it) were not added by the bus, but by the device itself. On the PC and PC/AT (and the vast majority of other microcomputers up to that time) whether the device was on an expansion bus or on the motherboard made no difference to wait states.
  4. Multiple device access to VRAM still does matter today: even when using a GPU we still share memory between the GPU and the CPU and between the GPU and the video display subsystem that reads the frame buffer, and all the same issues apply if not using dual-ported RAM.
  5. From the question in the final two paragraphs in your post, it appears that what you are interested in has nothing to do with video or even shared RAM, but is merely about expansion buses. Introducing video and RAM into this simply obscures your question (or what I believe should be your question).

To answer that final question, PC expansion buses themselves started slowing access to devices on expansion cards (whether video RAM or anything else) somewhere between about 1984 and 1987.

  • The latter date is when IBM introduced the Micro Channel architecture (MCA), the first PC bus designed from the start to run at a different speed from the CPU.
  • However, PC clone vendors making systems that were faster than the IBM PC or PC/AT were seeing issues with expansion cards failing due to too-fast access several years before that. (The problems were generally not with VRAM, which was already slowing access on its own, but with other devices that just assumed the PC bus wouldn't be run faster than 4.77 MHz or the PC/AT bus wouldn't be run faster than 8 MHz.)
  • At some point the chipset vendors introduced the ability to run the ISA bus at a different rate from the CPU, and the first chipset to do so marks exactly when expansion buses started introducing wait states.

¹ The Apple II did introduce a very small delay (139 ns, or 1/7 of a CPU clock cycle) for each horizontal scan line. This was entirely unnecessary as far as memory speed is concerned, but was introduced in order to avoid alternating lines having a 180° phase difference in the colour carrier, which simplifies the colour hardware for reasons I won't get into here.

² It is possible to get "pseudo-synchronous" access to the memory to do DMA on 8080 (but probably not Z80) systems: the first M-cycle of every instruction consists of five T-cycles where the first three are used to fetch the instruction and the final two are used for instruction decode, leaving the memory free. It's possible, with a non-trivial amount of logic, to access the memory during T4 and/or T5; the Altair 88-S4K 4k Dynamic RAM board used this technique for dynamic RAM refresh. However, with this technique the memory bandwidth is limited to two T-states per instruction, with the worst case being a sequence of DAD instructions (10 T-cycles each) giving a worst-case memory bandwidth of 1 to 2 accesses every 10 T-clocks, perhaps 200,000 bytes per second (depending heavily on CPU and memory speed). That's a little over 3 KB per frame for a standard video system, which is enough to read data for a text screen (assuming that each character is read only once per frame—which has its own issues), but obviously not enough for, e.g., 320×200 monochrome graphics, which requires reading 8000 bytes of memory per frame.
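
(For the curious, here are the rough numbers behind that bandwidth estimate; they are illustrative only and assume a 2 MHz 8080 with a worst case of one free-slot access per DAD instruction.)

```
# Rough numbers behind the bandwidth estimate in footnote 2
# (illustrative; the result depends on CPU clock and instruction mix).
cpu_hz        = 2_000_000   # an 8080 at 2 MHz (assumed)
t_per_dad     = 10          # worst case: a stream of DAD instructions
slots_per_ins = 1           # worst case: one free T4/T5 access per instruction
fps           = 60          # one video frame per 1/60 s

accesses_per_sec = cpu_hz / t_per_dad * slots_per_ins   # 200,000
bytes_per_frame  = accesses_per_sec / fps               # ~3,333
print(f"{accesses_per_sec:,.0f} accesses/s, ~{bytes_per_frame:,.0f} bytes/frame")
print("enough for 320x200 mono (8000 bytes/frame)?", bytes_per_frame >= 8000)
```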

13
  • 1
    If the Apple II hadn't added the extra 1/2 chroma cycle on every line, even scan lines would have had the opposite color phase from odd ones, which would have been the way NTSC video was designed to work. Showing hires blue would have required a checkerboard rather than a vertical stripe pattern, and orange would have required the opposite checkerboard, but that would probably have improved visual quality for programs that were designed to work that way.
    – supercat
    Commented Mar 11 at 20:04
  • 1
    @supercat Good point. But you're talking about hi-res graphics only; consider what happens to lo-res graphics with an alternating phase, and how that would have to be fixed. (I have tweaked my answer based on this, though.)
    – cjs
    Commented Mar 12 at 2:28
  • 1
    "never showing "static" - CGA did this all the time didn't it? I have a memory of yellow text on a monochrome VDU 'shivvering' whenever I pressed return.
    – Neil
    Commented Mar 12 at 11:02
  • 1
    About the original IBM PC video adapters I'll add one observation: not only would it have made no difference performance-wise but, honestly, it would not have fit even if they had wanted to. The 5150 motherboard was quite busy already, and the CGA and MDA boards are also very large (both full-length) as they all rely heavily on off-the-shelf low-integration chips.
    Commented Mar 12 at 12:59
  • 1
    @cjs On original CGA the CPU has priority access to memory. The FF in U2, visible in the top left quadrant of sheet 4 of the schematics (p. D-28 in the 1981 TechRef), locks the CPU address multiplexers and disables the 6845 (s.1). Thus the character latch (s.2) will store whatever the CPU reads or writes during that cycle. As a result a random value is read, which in turn is suppressed, giving a white line in graphics mode or a white square in character mode - much like a cursor square. To avoid this the BIOS always waited for a retrace before outputting a character. The reason why it was so slow.
    – Raffzahn
    Commented Mar 12 at 13:25
15

TL;DR: When the CPU-bus became faster than the I/O-bus

Now, there are always going to be wait states during active scan line, because the video chip is using some of the video memory bandwidth for generating the display

Not really; there are many ways to make that transparent, starting with the VDU not needing the whole bandwidth during an active scan line.

Unless you have dual-port VRAM, which was expensive and rarely used.

Not needed. Already a 1-transaction (one byte or word) write buffer can level the playing field. This is possible as

  • reads are rare, so having them blocking doesn't hinder much;
  • the VDU may not need all bandwidth, so a single postponed write may be injected;
  • the CPU does not usually access video memory in bursts, but with single byte/word accesses.

The last is especially relevant, as an access loop may contain a dozen or more instructions with many memory cycles but only one of them being an I/O write cycle.
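
Here is a toy simulation of that idea (a sketch only; the 50% slot allocation for the VDU and the roughly one-write-per-dozen-cycles CPU rate are made-up assumptions for illustration, not figures for any real card):

```
# Toy model of the one-entry write buffer described above (the slot
# pattern and the CPU's write rate are made-up illustrative assumptions).
import random

def run(cycles=100_000, buffered=True, seed=1):
    rng = random.Random(seed)
    stalls, pending = 0, False           # pending: a posted write not yet retired
    for t in range(cycles):
        video_owns_slot = (t % 2 == 0)   # video fetch uses every other memory slot
        if not video_owns_slot and pending:
            pending = False              # retire the buffered write in a free slot
        if rng.random() < 1 / 12:        # the CPU issues an occasional VRAM write
            if buffered:
                if pending:
                    stalls += 1          # buffer still full, so the CPU must wait
                else:
                    pending = True       # write latched; CPU continues immediately
            elif video_owns_slot:
                stalls += 1              # no buffer: CPU waits for a free slot
    return stalls

print("stalled writes, no buffer:    ", run(buffered=False))
print("stalled writes, 1-deep buffer:", run(buffered=True))
```

With writes that sparse, the single buffer entry is almost always free again by the time the next write arrives, so nearly all of the stalls disappear.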

I'm talking about wait states incurred by the CPU writing to video memory, even outside active scan line, because it's on an expansion card at the other end of the expansion bus, as opposed to main memory on the motherboard.

It helps to remember that on the PC (and XT) the I/O-bus was clocked at the same rate as main memory. There was no general penalty for accessing. Same speed, same performance.

When did video memory start incurring wait states for that reason?

Generic wait states only became a thing when the CPUs were faster clocked than the I/O-bus. And that was a general slowdown, not just for video cards.

For example, I would expect CGA not to do so because the original IBM PC actually put main memory beyond 64K into expansion cards on the ISA bus. But what about EGA? VGA?

Neither of them, as it was about the different clock speed for either bus domain. On a 4.77 MHz PC or XT with a 4.77 MHz I/O-bus no wait states were added. Same on a 10 MHz 386 with the ISA bus clocked up to 10 MHz (*1). On the other side, a 9.54 MHz 8088 already has to wait if the ISA bus is clocked at 4.77 MHz.


*1- Back in the days of ISA graphic cards we invested many hours (and money) to select cards that could run as high as possible, so we could crank up ISA bus speed :))

2
  • 2
    According to the benchmark utility I used at the time, the fastest unaccelerated VGA card I ever used was a Trident in a '486 DX33 with the ISA bus clocked up to whatever the BIOS would let me.
    – Neil
    Commented Mar 11 at 0:48
  • @Neil Those were the days :))
    – Raffzahn
    Commented Mar 11 at 1:23
11

The original CGA always imposed wait states on display memory accesses. The card's memory subsystem always performs one access every four pixel clocks, and only half of those are available to the CPU regardless of whether the display is enabled or what mode it's in. Although the pixel clock and CPU clock happen to both be generated off of a 14.31818 MHz time source, the CPU clock and display-memory clock are asynchronous (the card has no way of knowing that the CPU clock is derived from that source). Any CPU read or write request must be delayed until a following ~1.79 MHz timing window, and not even necessarily the immediately following one, since a request that occurs near the end of a window will miss the next window and have to wait for the one after that. If memory serves, CGA display memory accesses have something around 4-6 wait states on a stock 4.77 MHz PC, and more wait states on faster machines.
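
A rough back-of-the-envelope version of that timing (a sketch only; the real card adds synchronisation and bus-handshake delays on top of this, which is where the 4-6 wait states come from):

```
# Rough arithmetic behind the timing described above (illustrative).
xtal_hz = 14_318_180        # 14.31818 MHz time base on the CGA card
cpu_hz  = 4_772_727         # 4.77 MHz 8088 (xtal / 3 on the IBM PC)

access_hz   = xtal_hz / 4   # one display-memory access per 4 pixel clocks
cpu_slot_hz = access_hz / 2 # only every other access slot is offered to the CPU

slot_ns      = 1e9 / cpu_slot_hz   # spacing between CPU-usable slots (~559 ns)
cpu_cycle_ns = 1e9 / cpu_hz

# Worst case: a request just misses a slot and must wait almost two slots.
worst_wait_ns = 2 * slot_ns
print(f"CPU slot every {slot_ns:.0f} ns (~{slot_ns / cpu_cycle_ns:.1f} CPU clocks)")
print(f"worst-case wait ~{worst_wait_ns:.0f} ns (~{worst_wait_ns / cpu_cycle_ns:.1f} CPU clocks)")
```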

Note that these wait states are not sufficient to allow clean updates in 80-column text mode. During the active portion of each 80-column line, memory fetches alternate between those where display hardware owns the bus and the CPU can't have it, and those where display hardware needs to fetch display data but will let the CPU have the bus anyway. If the CPU attempts to access display memory at those times, a transient visual glitch will appear wherever the beam happened to be at that moment.

11
  • 3
    The CGA card used the 14.31818 MHz clock from the motherboard. It did not have a crystal onboard. So they used the same clock from the same crystal and thus were synchronous with the 4.77 MHz CPU which used the crystal divided by 3.
    – Justme
    Commented Mar 10 at 19:11
  • 2
    @Justme: The CGA card was intended to be (and was) compatible with motherboards whose CPU clock was not derived from that 14.31818 MHz clock.
    – supercat
    Commented Mar 10 at 19:15
  • 2
    @Justme: Do you like the new wording?
    – supercat
    Commented Mar 10 at 19:16
  • 2
    I think so. On later machines the bus might run at CPU based rate, so the reads and writes can be asynchronous, and it will work as long as the bus has enough wait states at CPU rate so the bus cycles are not too fast for the peripheral.
    – Justme
    Commented Mar 10 at 19:45
  • 2
    I used my first (clone) PC with a TV as monitor, taking composite video from the CGA to an external RF modulator. I was only able to get B&W images, however, and this puzzled me for a while until I discovered the 14.31818MHz clock on the motherboard was slightly off. This clock is divided by 4 to generate the 3.579545 MHz color burst on the composite video. I added a trimmer cap on the motherboard crystal, tweaked the 14MHz oscillator, and finally got color.
    Commented Mar 11 at 16:39
5

Actually, if you carefully read the BIOS for the IBM PC and CGA video (I did), you will find that (at least in text mode) it delayed every access to the video memory until the display entered a horizontal retrace. This was because, if you accessed the bus while things were being displayed, a little fleck of white appeared on the screen.

If you, like me, type in the BIOS source, remove the delays, and rebuild, then suddenly video access gets a whole lot faster. Of course, you would get the little white snow all over the place. (My TSR also gave me a 100-line scrollback. :-) )
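
To put very rough numbers on that (a sketch using approximate standard CGA 80-column 6845 timings; "one write per horizontal retrace" and "~15 cycles per unthrottled write" are assumptions for illustration, not the exact BIOS behaviour):

```
# Ballpark figures for the retrace-gated approach described above.
pixel_hz  = 14_318_180
h_total   = 114 * 8                 # pixel clocks per scan line (80-column text)
line_rate = pixel_hz / h_total      # ~15.7 kHz horizontal rate

# If roughly one video-memory write fits into each horizontal blanking
# interval, the ceiling on the update rate is on the order of the line rate:
print(f"~{line_rate:,.0f} retrace-gated writes per second")

# Without the retrace wait, an 8088 spending ~15 cycles per video write
# (including CGA wait states; a rough assumption) could do far more:
cpu_hz, cycles_per_write = 4_772_727, 15
print(f"~{cpu_hz / cycles_per_write:,.0f} unthrottled writes per second")
```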

I suspect (but can't prove) that IBM could have added a gate or two to make the flecks black, and thus much less noticeable. Of course, IBM was doing everything they could to reduce the chip/gate count.

In short, it isn't necessarily the bus that is slowing access to video RAM.

4
  • I don't think that black snow would be necessarily less noticeable - it would just affect different parts of the display (text rather than the background; or everywhere in full-screen applications that use a non-black background). Mid-grey might be a better choice, since it's equally far from all extremes. Commented Mar 12 at 8:49
  • 1
    @TobySpeight I think that black snow would be less noticeable because of the mechanics of vision (though again I can't prove it). Grey would still show up on a black background.
    – David G.
    Commented Mar 12 at 13:39
  • Yes, that's a fair point. There's an argument for one side or the other (I'm not sure which) in that the reason that PAL signals have zero as white and maximum level as black is something to do with noise perception. We're more sensitive to differences in dark colours than in bright ones. So black is more noticeable against a dark background than white is against a bright background. But I'm no colour-perception expert, so my input must end there. Commented Mar 12 at 13:48
  • 2
    Removing retrace delays will make display updates faster, but still not as fast as writes to conventional memory. I see I forgot to mention the significance of 80-column mode in my answer, which I'll fix.
    – supercat
    Commented Mar 12 at 14:49
0

In many ways it's not that the PC architecture has slow access to video RAM. It's that, thanks to a lack of advanced features like sprites and backgrounds, it requires much more access to video RAM. Namely it has to go out and write to a framebuffer for anything that can't be handled as text.
