9

The ARM-1 was an early RISC CPU, designed in 1986. It is even more typical of early RISC design constraints than the year would suggest: Acorn didn't have the budget to pay for the latest process technology, so it faced constraints similar to those of RISC chips designed a few years earlier.

It's not very RISC purist; the designers seem to have been happy to make an instruction do more than one thing, whenever this seemed likely to improve speed or code density. For example, every instruction (not just branch) is predicated on condition codes; every instruction can optionally shift the output; there are instructions to save and restore multiple registers.

One thing it doesn't have is compare and branch in one instruction; you need to compare values first (setting a condition code), then branch on that condition code. This is admittedly similar to many early microprocessors, but to my way of thinking it misses an opportunity when your instruction words are always 32 bits; that's plenty of bits to specify a pair of registers to compare, and a branch displacement, in one instruction, which would save both cycles and code size.

So I'm wondering why they didn't do it that way. As above, it wasn't about philosophical purity.

One possibility that occurs to me: maybe with the then-available process technology, it wouldn't be possible to run a pair of values through the ALU, and also branch, in the same time; maybe the comparison would have added a cycle to the time taken for a conditional branch. In that case, squeezing them into one instruction, might have helped code density, but it would have done nothing for speed.

Is that the case? Or was there some other reason for the design as it was?

15
  • 5
    you still need to add a condition on which to branch as well ... so space might get tight. Also, compare-and-branch is a real CISCy instruction as it needs, as you already note, to wait for previous results. Moreover, the code-size gain is rather meagre, as conditionals are still needed for other operation results. It essentially duplicates instructions. Not RISCy in any way. Last but not least, combinations of non-conflicting operations in one instruction are very RISCy - they are of the VLIW kind. RISC is not about making instructions primitive, but streamlined.
    – Raffzahn
    Commented Dec 7, 2022 at 3:55
  • 1
    ARM may not have strictly followed the philosophy of RISC I, but it is rather tidy, in its own way. An explicit conditional jump instruction does not mesh with one of the design principles: any instruction can be conditional. This suggests that conditional jumps can just be regular jumps, requiring no special consideration (or hardware). There's an elegance to that.
    – RETRAC
    Commented Dec 7, 2022 at 6:35
  • 4
    @rwallace The designers committed to condition codes and conditional execution of any instruction. Why the ARM designers chose condition codes, versus branching on the contents of registers directly, would be worth a question of its own. But once you commit to that, the cheapest hardware implementation would leave out your proposed register comparison, which I think would require a unique instruction format, which complicates instruction decoding, etc. for effectively duplicate functionality. They were on a strict transistor budget.
    – RETRAC
    Commented Dec 7, 2022 at 7:05
  • 2
    In general, I feel like the guiding principle wasn't so much "minimalism" as "orthogonality". Each instruction did a few different things, but they were always the same things (every instruction conditional, every arithmetic operand run through the barrel shifter, etc) and it was up to your creativity how to combine them into the instructions you really wanted. Commented Dec 7, 2022 at 7:12
  • 2
    The architecture did eventually add CBZ and CBNZ, compare and branch if (non)zero, but only in Thumb mode where "every instruction is conditional" no longer applies. It exists today (I checked ARMv8-A), but I'm not sure when it was added. AArch64 has CBZ/CBNZ as well. Commented Dec 7, 2022 at 7:15

6 Answers

14

If I remember rightly (I worked at Acorn during the period the ARM was designed and built, though I'll admit my memory of casual conversations from that time is far from perfect), the argument was that the majority of conditionals controlled very short sequences (1 to 3 instructions?). After setting the condition code, those instructions could be executed conditionally rather than jumped over, which was effectively free for a single instruction, and which avoided unnecessarily breaking the pipeline for a couple of instructions. For longer sequences, the cost of an explicit branch instruction was an insignificant overhead.
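As a sketch of the idiom being described (register values and the clamp-to-zero operation here are illustrative, not from the original discussion), a short conditional sequence can be predicated instead of branched over:

```armasm
        CMP   R0, #0        ; set condition codes from R0
        MOVLT R0, #0        ; executed only if R0 < 0: clamp to zero
        ; no branch needed, so the pipeline is never broken

        ; the equivalent with an explicit branch pays for a taken
        ; branch (and a pipeline refill) whenever R0 >= 0:
        CMP   R0, #0
        BGE   skip
        MOV   R0, #0
skip:
```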

10

Yes, it would add a cycle.

How do you branch on ARM?

add pc, pc, #8

That's a bit of an over-simplification, but the point is that ARM was meant to be purely orthogonal, that is, the program counter is "just another register" and branches pass through the arithmetic-logic unit (ALU) just like any other instruction.

Since the compare operation also needs to pass through the ALU, and there's only the one ALU, our hypothetical compare-and-branch would need two execute cycles: one to perform the compare (just a subtraction, discarding the result) and one to compute the new PC (just adding an offset to the PC).
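In other words, the two-instruction sequence ARM actually uses maps onto the same two trips through the ALU that a fused instruction would need (register names and cycle annotations here are illustrative, ignoring the pipeline refill a taken branch also costs):

```armasm
        CMP   R0, R1        ; execute cycle 1: R0 - R1 through the ALU, result discarded
        BEQ   target        ; execute cycle 2 (if taken): PC + offset through the ALU
```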

ARM does support this sort of behaviour and has to do this very thing with other instructions.

"A number of ARM instructions cannot be implemented in a single cycle given the limited resources of the ARM1 (i.e., a single ALU and a single shifter). An instruction such as STR (store register) requires calculating the effective address before it can store the data. To solve this problem, the ARM1 effectively runs the same instruction through the execute stage two to three times - the first execute cycle is used to compute the address, while the second execute cycle stores the data."

https://en.wikichip.org/wiki/acorn/microarchitectures/arm1#Multi-Cycle_Instruction
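For example, a store with an offset addressing mode occupies the execute stage twice, much as a fused compare-and-branch would have to (registers and offset here are illustrative):

```armasm
        STR   R0, [R1, #4]  ; execute cycle 1: ALU computes R1 + 4 (the address)
                            ; execute cycle 2: R0 is written out to that address
```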

So it certainly could have been done but it would have cost an extra cycle anyway.

10

Remember that classic AArch32 code has conditional execution of most instructions, which more or less demands a condition code register in simple implementations. Once the ARM designers had included those things, compare-and-branch instructions would likely have seemed like unnecessary complexity.

They seem to have fully absorbed the RISC principle of simple, fast instructions, but had no ideological attachment to RISC purism. They had a lot of experience with the 6502, a simple 8-bit chip that can be surprisingly fast when used skilfully, and they tried to apply the lessons they'd learned to the new architecture. It was always intended for commercial application; they were in Cambridge, but this was not an academic project at all.

The ARM designers, primarily Sophie Wilson, were very used to writing tight 6502 code to fit into limited-size ROMs on BBC Micro family machines. They seem to have considered assembly language programming more important than did the designers of the seminal Stanford and Berkeley RISCs. Acorn used assembler extensively in the operating systems and BASIC interpreters of the early ARM systems. Those were mostly home/educational computers, very powerful for the era, but not compatible with much else. UNIX was available for some high-end models, but was expensive and little used.

ARM conditional execution is relatively easy for humans to use in short and medium-sized passages of assembler code. The conventional wisdom is that it proved much harder for compilers to use effectively, which is part of why it was omitted from AArch64. It was also left out of Thumb, simply because it consumes too much space in 16-bit instructions. Branch predictors were starting to appear when Thumb-1 was released in 1994, and definitely weren't used in the early ARMs.
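The classic showpiece for this style is Euclid's GCD, where the whole loop body is predicated and only the loop-closing branch remains (register assignments are illustrative):

```armasm
gcd     CMP   R0, R1        ; compare a (R0) and b (R1)
        SUBGT R0, R0, R1    ; if a > b:  a = a - b
        SUBLT R1, R1, R0    ; if a < b:  b = b - a
        BNE   gcd           ; repeat until a == b
                            ; the GCD is now in both R0 and R1
```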

More about supporting conditional execution (aka predicated execution) can be found in these lecture notes from CMU. Page 16 summarises its pros and cons, under present-day conditions:

Advantages

  • Eliminates misprediction of hard-to-predict branches, by eliminating the branches.
  • This is good if the (inevitable) useless instruction fetches cost less than a branch.
  • Allows instructions to be re-ordered more easily.

Disadvantages

  • Pointless for easily predicted branches.
  • Doesn't adapt to run-time behaviour like a good branch predictor.
  • Wastes time if the useless instruction fetches cost more than a branch.
  • Costs valuable space in instructions.

I live in Cambridge, UK, where early ARM development was done, and people here who learned ARM programming in the 1980s and 1990s tend to regard AArch64 as "Not really ARM." This is because it discards so much of the classic ARM design and is much more like the classic American RISCs. Classic ARM was a style of its own, which happened to succeed.

2
  • 2
    "ARM conditional execution is relatively easy for humans to use in short and medium-sized passages of assembler code. It proved much harder for compilers to use effectively" - Interesting. Where can I find out more about this? Commented Dec 7, 2022 at 22:42
  • @BruceAbbott: Added to answer. Commented Dec 8, 2022 at 14:26
5

The ARM1 CPU core was tightly coupled to the memory interface - it was designed to be as efficient as possible when paired with FPM DRAM, and no consideration was given to performance with a cache or with another DRAM technology.

Compared to previous DRAM designs, the key innovation in FPM DRAM is that you can omit the RAS cycle if the next access is in the same DRAM page (i.e. you'd be sending the same row address again). On a typical FPM DRAM, you can do four fast page accesses in the time a single normal access takes. ARM1 has the SEQ signal to tell MEMC that the next access is sequential with the previous one, so that MEMC can use fast page accesses instead of normal accesses whenever possible (and MEMC controls the clock, so it stalls the ARM1 if an access crosses a page boundary).

A branch inherently forces the next instruction to be fetched using a normal (non-page-mode) access; in addition, on ARM1, it forces a pipeline bubble. Given this, a branch is only worthwhile compared to conditional execution if it lets you skip at least 4 instructions - otherwise conditional execution is faster, even though it executes up to 3 instructions as NOPs where a branch would execute none.
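As a sketch of that trade-off (registers and cycle annotations illustrative), skipping three instructions by predication costs at worst three wasted sequential fetches, which is no worse than the non-sequential fetch plus pipeline bubble of a taken branch:

```armasm
        CMP   R0, #0
        ADDEQ R1, R1, #1    ; three predicated instructions: each still costs
        ADDEQ R2, R2, #1    ; one cheap sequential (fast-page) fetch even when
        ADDEQ R3, R3, #1    ; the condition fails and they act as NOPs

        ; versus the branch form:
        CMP   R0, #0
        BNE   over          ; taken branch: non-sequential fetch + pipeline bubble
        ADD   R1, R1, #1
        ADD   R2, R2, #1
        ADD   R3, R3, #1
over:
```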

Acorn had also decided to make the ARM1 strongly prefer position-independent code - which means that branches are PC-relative, and thus require an ALU operation to implement. Compares also need an ALU operation, and so a compare-and-branch would need to occupy the execute pipeline stage for two instructions worth of time, not just one.

Further, if you look at Ken Shirriff's post about the control PLA, you'll see that Acorn would have had to make the control PLA more complex to handle this - no instruction needs more than 4 rows in the PLA in the current setup, allowing 2 bits to track your position in the PLA sequence, but a compare-and-branch could need 5 rows (because it combines the two rows of "data processing, register, shift by register" with the three rows of branch).

On the gain side, you're saving 4 bytes of memory compared to two instructions. However, there's no time saving in the taken case (since the branch is already in the pipeline, ready to execute), and only a one-instruction saving in the not-taken case (since the branch no longer takes a pipeline slot, so you move straight on to the next instruction). To implement this instruction without making the control PLA sequencing more complex for all instructions, you would have to prohibit the shifted-register forms in the compare part (so a straight register-register or register-immediate compare, but not register-shifted-register comparisons), and the instruction doesn't save enough to make a more complex control PLA worth having just for this case.

The result is an instruction with very little value (it saves one instruction's space and execution time in the not-taken case) but a lot of cost, for a case that we can assume didn't come up that often in Acorn's simulations (branch not taken, and a branch target with enough instructions that you can't inline it via conditional execution).

Finally, it's worth recalling that Acorn mostly programmed in assembly or BBC BASIC in the ARM1/ARM2/ARM3 era. They made heavy use of conditional execution, and used tricks like executing a couple of instructions conditionally between the flag set and the branch, having multiple branches with different conditions after a single compare, or returning from a subroutine conditionally instead of branching, sometimes with multiple comparisons under conditional execution before the conditional return. None of these tricks work with a compare-and-branch, which further reduces the pressure to have one.
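A sketch of those idioms together (the character-class check and all labels here are hypothetical, just in the spirit of such code): several branches and a conditional return can hang off the flag-setting instructions, with no compare-and-branch able to express any of it:

```armasm
        CMP   R0, #10       ; one compare ...
        BEQ   newline       ; ... feeding several branches with different conditions
        BLT   control_char
        CMP   R0, #32       ; a second compare between branch points
        MOVLT PC, R14       ; conditional return (classic 26-bit ARM idiom)
        B     printable
```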

6
  • Given the difference in size between the address space and memory, I wonder if looping could have been facilitated by using some address bits to permute or mask others, so that e.g. an 8-word chunk of RAM would be mapped 256 times consecutively in the address space, followed by the next 8-word chunk 256 times? Then if one wanted to run a piece of code 1000 times, one could store it into an 8-word-aligned chunk of address space, followed by a "conventional" decrement and branch if non-zero, and then jump to an address 96 bytes past the start of the shadow of the first chunk (skipping 24 reps).
    – supercat
    Commented Dec 16, 2022 at 17:57
  • I suspect that wouldn't have paid its complexity weight - Acorn already found 26-bits limiting by the time of the A540 or R260 in 1990, and would have been even more constrained if they'd used bits to swizzle RAM like that. I know from talking to people involved that flags in PC was considered a mistake by the time of ARM3 - hence the change in the ARM610. Commented Dec 19, 2022 at 14:04
  • One wouldn't have to limit the placement of such chunks to the 26-bit range that's reachable with BL immediate instructions. While putting flags in PC meant giving up almost 94% of the potential address space, a feature such as described could be usefully implemented using a tiny fraction. If one wanted to accommodate four eight-word loops which programs would usually set up shortly before use (but which could all be set up if a program's mid-level loop contained up to four high-speed loops) and could run up to 256 times, that would require only 32,768 bytes of address space.
    – supercat
    Commented Dec 19, 2022 at 17:27
  • You now have me wondering if you could implement such a feature in MEMC instead of the ARM itself. MEMC already supported DRAM aliasing (you had physically mapped and logically mapped DRAM, and could have aliases in both regions), so it could have had an alias region in logically mapped where it mapped just 8 bytes repeatedly from one page. Commented Dec 19, 2022 at 17:30
  • The best place to build in the feature would depend upon how the rest of the system was designed. Among other things, if all fetched instructions were pushed into an 8-level shift register, then after one had executed eight instructions from one of these "loops", one could start pulling instructions from the shift register without doing anything with memory, so if one had a load- or store-heavy loop like *a++ = *b++ + *c++; the loads and stores wouldn't interfere with the fetches of later instructions within the loop.
    – supercat
    Commented Dec 19, 2022 at 17:50
5

It's not very RISC purist; the designers seem to have been happy to make an instruction do more than one thing,

Erm, RISC isn't about doing only one thing; RISC is about not doing complex stuff. RISC is not about simplifying the instruction set, but about simplifying the CPU, as a simple CPU can be sped up much better than a complex one.

It is not about doing away with powerful operations, but about stripping the instruction set of all the stuff that makes CPU construction complex - foremost, instructions that need long pipelines, such as those that wait for partial results of an operation, fetch operands, or store results (or both). The goal is an instruction set that can, for the most part, be executed with a simple 2-3 stage pipeline without needing logic to handle dependencies.

For example, every instruction (not just branch) is predicated on condition codes; every instruction can optionally shift the output;

None of that implies complex instructions; that's an application of the VLIW principle. VLIW can easily be seen as the pinnacle of RISC, as it even does away with most of the decoding stage in favour of direct control of all available function units.

there are instructions to save and restore multiple registers.

While this is an instruction 'looping' around the registers, it's also essential for function calls and task switching, so not burdening the transfer of register contents with an equal number of instruction fetches makes a huge difference. Such instructions are found in many RISC architectures. RISC is all about improving, not being stupid.

One thing it doesn't have is compare and branch in one instruction; you need to compare values first (setting a condition code), then branch on that condition code. This is admittedly similar to many early microprocessors, but to my way of thinking it misses an opportunity when your instruction words are always 32 bits; that's plenty of bits to specify a pair of registers to compare, and a branch displacement, in one instruction, which would save both cycles and code size.

This is a very CISCy instruction - and an unnecessary one:

  • It does not - and cannot - save execution time, as the second part (branching) depends on the first part, so it always needs two cycles (at least)
  • To get execution time (somewhat) down, it would require another pipeline stage, a branch stage after the operation stage - the worst thing one can do to a RISC design.
  • It would not replace any existing branch or compare, but essentially duplicate those instructions. Rather bloaty, isn't it?
  • It would need not only the opcode, the registers to compare and the target, but additionally
    • 4 bits for the condition to check for and
    • 4 bits for an optional link register
  • There is no room
    • CMP already fills all 32 bits with useful values
    • B/BL also uses all 32 bits.
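For reference, a sketch of the two encodings that would have to be combined (field layouts as in the classic 32-bit ARM instruction formats; bit widths shown in brackets):

```armasm
; data-processing CMP:  cond[31:28] 00 I opcode=1010 S=1 Rn[19:16] SBZ[15:12] operand2[11:0]
; branch B/BL:          cond[31:28] 101 L signed_offset[23:0]
; each already accounts for all 32 bits; there is no slack to merge them
```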

Compared with the existing instructions, any instruction of that type could only provide reduced functionality for either half, while at the same time not eliminating additional branches for complex conditions - all of that at increased execution time.

In contrast, ARM's use of conditional operation not only does away with very common cases of chained operations, but also transforms all basic program-flow instructions (Jump/Call/Return) into conditional versions. Very lean.

As above, it wasn't about philosophical purity.

I beg to differ; the original ARM was very much a pure concept - maybe even purer than many other RISC implementations.

One possibility that occurs to me: maybe with the then-available process technology, it wouldn't be possible to run a pair of values through the ALU, and also branch, in the same time; maybe the comparison would have added a cycle to the time taken for a conditional branch. In that case, squeezing them into one instruction, might have helped code density, but it would have done nothing for speed.

Is that the case? Or was there some other reason for the design as it was?

Compare-and-branch is a perfect example of CISC bells and whistles, created to save a tiny bit of code at high implementation cost - useful only in very limited cases, and when memory was as small as in the 1960s. Exactly the kind of instruction RISC set out to cut away. So why on earth would anyone want to include it?

23
  • 2
    @rwallace: The problem is that your cycle time from the output of the control store (or control logic), needs to completely go back to it to perform a conditional jump. a) Control output to select registers and ALU operation b) from output of registers through ALU c) from ALU back to control store d) control store needs to select its logic for the next cycle to either do a branch (add offset to program counter) or not. Pipelining this adds difficult dependencies. Trying to do it in one cycle increases the clock period too much.
    – chthon
    Commented Dec 8, 2022 at 16:55
  • 1
    @rwallace: I would think that an instruction which checked whether a condition specified by the four-bit destination-register field applied to the first source operand, and would either copy the second source operand to the PC or do nothing based on the results of that comparison, could be accomplished in the same amount of time as a branch. In fact, it might even be possible to have an instruction which computes source1-1 while it uses the MSB of the source operand to select between either storing the computed result to the specified destination, or storing source2 to the R15.
    – supercat
    Commented Dec 8, 2022 at 20:40
  • 1
    @rwallace: Such an instruction would be limited to looping 2**31 times, and would execute the branch N+1 times, but could probably have shaved a cycle off many looping constructs.
    – supercat
    Commented Dec 8, 2022 at 20:41
  • 1
    @rwallace Cycles: No, it's not just the additional cycle to jump; it's also the cycle to get the compare result. A compare instruction takes 1 cycle. A conditional jump instruction likewise takes a single cycle - plus one cycle to take the jump. And a compare-and-jump instruction would likewise take 2 or 3 cycles: one to calculate the result, one to decide whether to jump - which it can only do after calculating - and, if the jump is taken, a third cycle to take it. Combining both does not speed up execution.
    – Raffzahn
    Commented Dec 8, 2022 at 23:08
  • 2
    @rwallace MIPS also calls the other kind "Compare and Branch", but they are not; they are test-and-branch instructions, as they do a magnitude test - i.e. they test whether a register is zero or negative. This can be done rather fast with single-level gates, or even for free: testing for negative just reads the top bit, while testing for (non-)zero is a single-stage OR over all bits. It can even be done for free when performed as the register is loaded, with the result stored in an additional bit. Either way it can be done without stretching the clock or adding a cycle.
    – Raffzahn
    Commented Dec 8, 2022 at 23:30
4

The CDC 6400 series of supercomputers, developed by Seymour Cray, are considered some of the earliest RISC-like architectures, having been released in 1964. These used compare-and-branch instructions. (See page 40 of this book.)

The Intel 8051 was a microcontroller of a similar generation (originally in NMOS, circa 1980) that implemented a version of compare-and-branch, designed for loops that either increment or decrement and then test a register.

Although I’ve seen analyses of the ARM-1 chip, I don’t know what the specific trade-offs would have been. However, branching on a classic RISC architecture of the 1980s typically added a delay of one clock cycle anyway, and using compare-and-branch might have allowed the CPU to drop condition codes and predication, freeing up transistors to accelerate branches. At that time, RAM speeds were still keeping up with CPUs, so the penalty for a taken branch was small and there was no need for complex branch prediction in lieu of conditional instructions.

I’ve only once or twice found a direct answer by someone who was there for a “Why didn’t they ...?” question. In this case, my understanding is that the drawback of condition codes - that they create false dependencies between instructions - only started to matter for superscalar CPUs (of which the CDC 6x00 series was also one of the first). An in-order CPU with only one ALU, like the ARM-1, would probably have seen no advantage in compare-and-branch over condition codes. By the time a superscalar ARM CPU came out, its ISA would no longer be backward-compatible with the ARM-1 anyway.

