78
$\begingroup$

I know that bitwise operations are so fast on modern processors because they can operate on 32 or 64 bits in parallel, so bitwise operations take only one clock cycle. However, addition is a complex operation that consists of at least one and possibly up to a dozen bitwise operations, so I naturally thought it would be 3-4 times slower. I was surprised to see after a simple benchmark that addition is exactly as fast as any of the bitwise operations (XOR, OR, AND, etc.). Can anyone shed light on this?

$\endgroup$
5

9 Answers

112
$\begingroup$

Addition is fast because CPU designers have put in the circuitry needed to make it fast. It does take significantly more gates than bitwise operations, but it is frequent enough that CPU designers have judged it to be worth it. See https://en.wikipedia.org/wiki/Adder_(electronics).

Both can be made fast enough to execute within a single CPU cycle. They're not equally fast -- addition requires more gates and more latency than a bitwise operation -- but it's fast enough that a processor can do it in one clock cycle. There is a per-instruction latency overhead for the instruction decoding and control logic, and the latency for that is significantly larger than the latency to do a bitwise operation, so the difference between the two gets swamped by that overhead. AProgrammer's answer and Paul92's answer explain those effects well.
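
To get a feel for where the extra gates go, here is a tiny C model (purely illustrative, not how any real ALU is built): it computes a 64-bit sum one full adder at a time, so the carry has to ripple through all 64 bit positions, whereas a bitwise AND handles every bit position independently in a single layer of gates.

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of a ripple-carry adder: bit i needs the carry out of
     * bit i-1, so the gate depth grows linearly with the word width.
     * A bitwise AND, by contrast, is one gate layer applied to all 64 bits
     * in parallel. */
    static uint64_t ripple_add(uint64_t a, uint64_t b)
    {
        uint64_t sum = 0, carry = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            uint64_t s  = ai ^ bi ^ carry;                  /* sum bit   */
            carry       = (ai & bi) | (carry & (ai ^ bi));  /* carry out */
            sum        |= s << i;
        }
        return sum;
    }

    int main(void)
    {
        uint64_t a = 0xDEADBEEFCAFEF00DULL, b = 0x0123456789ABCDEFULL;
        printf("%d\n", ripple_add(a, b) == a + b);  /* prints 1 */
        return 0;
    }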

$\endgroup$
43
$\begingroup$

There are several aspects.

  • The relative cost of a bitwise operation and an addition. A naive adder has a gate depth that depends linearly on the width of the word. There are alternative approaches, more costly in terms of gates, that reduce the depth (IIRC the depth then depends logarithmically on the width of the word). Others have given references for such techniques; I'll just point out that the difference also matters less than it might seem from the cost of the operation alone, because of the need for control logic, which adds delays.

  • Then there is the fact that processors are usually clocked (I'm aware of some research and special-purpose non-clocked designs, but I'm not even sure any are available commercially). That means that whatever the speed of an operation, it will take an integer multiple of the clock cycle.

  • Finally there are the micro-architectural considerations: are you sure that you measure what you want? Nowadays, processors tend to be pipelined, superscalar, with out-of-order execution and whatever else. That means that they are able to execute several instructions at the same time, at various stages of completion. If you want to show by measurement that one operation takes more time than another, you have to take those aspects into consideration, as their goal is to hide those differences. You may very well get the same throughput for addition and bitwise operations when using independent data, but a measurement of the latency, or introducing dependencies between the operations, may show otherwise. You also have to be sure that the bottleneck of your measurement is in the execution, and not, for instance, in the memory accesses. (A rough latency-versus-throughput sketch follows this list.)

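To illustrate that last point, here is a rough sketch of such a measurement in C (assuming GCC or Clang on a POSIX system; the empty asm statements are only optimization barriers, and the iteration count is arbitrary). The dependent chain exposes latency; the four independent accumulators let an out-of-order CPU overlap work and so approximate throughput, doing four times as many operations in roughly the same time on a typical superscalar core. Swap the + in OP for &, | or ^ to compare addition against the bitwise operations under both conditions.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N 200000000ULL
    #define OP(x) ((x) + 1)   /* change '+' to '&', '|' or '^' to compare */

    static double elapsed(struct timespec t0, struct timespec t1)
    {
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec t0, t1;
        uint64_t a, b, c, d;

        /* Latency: every operation depends on the previous result. */
        a = 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < N; i++) {
            a = OP(a);
            __asm__ volatile("" : "+r"(a));   /* stop the compiler folding the loop */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("dependent chain   (N ops):  %.3f s\n", elapsed(t0, t1));

        /* Throughput: four independent chains the CPU can run in parallel. */
        a = 1; b = 2; c = 3; d = 4;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < N; i++) {
            a = OP(a); b = OP(b); c = OP(c); d = OP(d);
            __asm__ volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("independent chains (4N ops): %.3f s\n", elapsed(t0, t1));

        return (int)(a ^ b ^ c ^ d);   /* use the results so nothing is optimized away */
    }
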
$\endgroup$
6
  • 7
    $\begingroup$ +1. Yes, most processors are clocked, but a few clockless CPUs are commercially available. $\endgroup$
    – David Cary
    Commented May 24, 2017 at 14:10
  • 3
    $\begingroup$ Another possibility is that a processor might store a 64-bit register as one 16-bit piece and three 17-bit pieces, with the extra bit of each piece holding a deferred carry from below. An addition which is followed by a bitwise operation or a store might require 1-2 extra cycles to propagate the carry, but an addition which is followed by another addition would not. Further, in the "store" case, the extra propagation time may delay the performance of the store, but there would be no need for code to "wait" for it. $\endgroup$
    – supercat
    Commented May 24, 2017 at 15:59
  • 4
    $\begingroup$ @supercat The Pentium 4 did something like this, with a double-speed (relative to the rest of the processor) ALU that would have the low 16 or 32 bits ready for a subsequent operation a half-cycle before the upper half's bits. $\endgroup$ Commented May 24, 2017 at 21:38
  • 4
    $\begingroup$ are you sure that you measure what you want? In this case, the OP's conclusion from the measurements happens to be correct for the vast majority of CPUs. Addition is so common that superscalar CPUs have add units on all execution ports, and booleans are so cheap to implement (in transistor count) that they're also present on all ports. So add and booleans almost always have the same throughput (e.g. 4 per clock in Intel Haswell). $\endgroup$ Commented May 25, 2017 at 21:46
  • 5
    $\begingroup$ SIMD integer add is often lower throughput than SIMD boolean, though, even though they usually have the same latency. Intel CPUs from PentiumII through Broadwell can only run vector-int adds (e.g. paddw) at 2 per clock, but booleans (like pand) at 3 per clock. (Skylake puts a vector adder on all three vector execution ports.) $\endgroup$ Commented May 25, 2017 at 21:53
26
$\begingroup$

CPUs operate in cycles. At each cycle, something happens. Usually, an instruction takes several cycles to complete, but multiple instructions are executed at the same time, at different stages of completion.

For example, a simple processor might have 3 steps for each instruction: fetch, execute and store. At any time, 3 instructions are being processed: one is being fetched, one is being executed and one is storing its results. This is called a pipeline and, in this example, it has 3 stages. Modern processors have pipelines with over 15 stages. However, addition, like most arithmetic operations, is usually executed in one stage (I am speaking about the operation of adding 2 numbers by the ALU, not about the instruction itself - depending on the processor architecture, the instruction might require more cycles for fetching arguments from memory, performing conditionals, or storing results to memory).

The duration of a cycle is determined by the longest critical path. Basically, it's the longest amount of time necessary for some stage of the pipeline to complete. If you want to make the CPU faster, you need to optimize the critical path. If reducing the critical path per se is not possible, it can be split into 2 stages of the pipeline, and you are now able to clock your CPU at almost twice the frequency (assuming there is not another critical path that prevents you from doing this). But this comes with an overhead: you need to insert a register between the stages of the pipeline, which means that you don't really gain 2x speed (the register needs time to store the data), and you have complicated the whole design.

There are already quite efficient methods for performing addition (e.g. carry-lookahead adders), and addition is not on the critical path for the processor speed, thus it makes no sense to split it into multiple cycles.

Also, note that while it might seem complicated for you, in hardware things can be done in parallel very fast.
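
To give a feel for what that parallelism buys, here is a small C model of the carry-lookahead idea (in its parallel-prefix, Kogge-Stone-like form). It is only a software illustration, not how the circuit is actually wired, but it shows that all 64 carries can be derived from "generate" and "propagate" signals in 6 combining steps instead of a 64-step ripple.

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of a carry-lookahead (parallel-prefix) adder.
     * All 64 carries are computed from generate/propagate signals in
     * log2(64) = 6 combining steps rather than one long ripple. */
    static uint64_t lookahead_add(uint64_t a, uint64_t b)
    {
        uint64_t g = a & b;      /* bit i generates a carry            */
        uint64_t p = a ^ b;      /* bit i propagates an incoming carry */
        uint64_t span_p = p;

        for (int shift = 1; shift < 64; shift <<= 1) {
            g      = g | (span_p & (g << shift));   /* merge lower group's generate */
            span_p = span_p & (span_p << shift);
        }
        /* bit i of g is now the carry OUT of position i; shift to get carry IN */
        return p ^ (g << 1);
    }

    int main(void)
    {
        uint64_t a = 0x123456789ABCDEF0ULL, b = 0x0FEDCBA987654321ULL;
        printf("%d\n", lookahead_add(a, b) == a + b);   /* prints 1 */
        return 0;
    }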

$\endgroup$
1
  • 4
    $\begingroup$ The big overhead from longer pipelines is more cycles to recover from a branch misprediction! Spending transistors to buffer data between stages is minor these days. Even a simple pipelined CPU has to be fetching / decoding ahead of instructions that are actually executing. If the CPU discovers that the front-end was working on the wrong code because a branch went a different way than it predicted (or some other mis-speculation), it has to throw away that work and start from the correct instruction. Things only get worse with superscalar out-of-order CPUs that can have many insns in flight. $\endgroup$ Commented May 25, 2017 at 21:59
13
$\begingroup$

Processors are clocked, so even if some instructions can clearly be done faster than others, they may well take the same number of cycles.

You'll probably find that the circuitry required to transport data between registers and execution units is significantly more complicated than the adders.

Note that the simple MOV (register to register) instruction does even less computation than bitwise logic, yet both MOV and ADD usually take one cycle. If MOV could be made twice as fast, CPUs would be clocked twice as fast and the ADDs would be two cycles.

$\endgroup$
2
12
$\begingroup$

Addition is important enough not to be left waiting for a carry bit to ripple through a 64-bit accumulator: the term for the circuit that avoids this is a carry-lookahead adder, and such adders have basically been part of CPUs (and their ALUs) from 8-bit designs upwards. Indeed, modern processors tend not to need much more execution time even for a full multiplication: carry lookahead is actually a really old (and comparatively affordable) tool in a processor designer's toolbox.

$\endgroup$
2
  • 1
    $\begingroup$ Integer multiplication is definitely higher latency and lower throughput than ADD on x86. But it is amazingly fast considering how many adders it takes to build a fast multiplier: e.g. on Intel since Nehalem, and AMD since Ryzen, 8/16/32/64-bit scalar integer multiply is 3 cycle latency, with one per 1c throughput (one fully-pipelined execution unit). This sucks compared to ADD throughput of 3 or 4 per clock, but is amazing compared to 9 cycle IMUL latency in Intel Pentium P5. Things are similar for SIMD: vector-int multiply is higher latency and lower throughput than add, but still fast. $\endgroup$ Commented May 30, 2017 at 8:38
  • 1
    $\begingroup$ So yes, multiply used to be much more expensive relative to other instructions than it is now. Avoiding it at a cost of more than 2 instructions is usually not worth it, and sometimes not even a 2-instruction substitute is worth it (e.g. with a shift+add lea instruction). $\endgroup$ Commented May 30, 2017 at 8:41
10
$\begingroup$

I think you'd be hard pressed to find a processor that has addition taking more cycles than a bitwise operation, partly because most processors must carry out at least one addition per instruction cycle simply to increment the program counter. Mere bitwise operations aren't all that useful.

(Instruction cycle, not clock cycle - e.g. the 6502 takes a minimum of two clock cycles per instruction due to being non-pipelined and not having an instruction cache)

The real concept you may be missing is that of the critical path: within a chip, the longest operation that may be performed within one cycle dictates, at the hardware level, how fast the chip may be clocked.

The exception to this is the (rarely used and hardly commercialised) asynchronous logic, which really does execute at different speeds depending on logic propagation time, device temperature etc.

$\endgroup$
3
  • 1
    $\begingroup$ It's not user-controllable bitwise operations, but some instructions on the 8086 (eg. clearing the interrupt flag) took fewer cycles than an integer addition. More abstractly, a RISC system where all the instructions are one word in size could use a simple binary counter for the PC, which would be a much faster circuit than a general-purpose adder. $\endgroup$
    – Mark
    Commented May 25, 2017 at 1:57
  • 1
    $\begingroup$ Addition on the program counter tends to be very simple compared to an addition arithmetic instruction, because one of the operands is small (either an instruction size, or a relative jump offset which is also size-limited) $\endgroup$
    – Ben Voigt
    Commented May 25, 2017 at 14:28
  • 1
    $\begingroup$ 6502 was pipelined - it read the first byte of the next instruction during the last cycle of the previous one. Otherwise fetch/decode/execute would have been at least three cycles. $\endgroup$
    – gnasher729
    Commented May 26, 2017 at 19:14
9
$\begingroup$

At the gate level, you are correct that it takes more work to do addition, and thus it takes longer. However, that cost is sufficiently trivial that it doesn't matter.

Modern processors are clocked. Every instruction takes some whole number of these clock cycles. If the clock rate were pushed higher, to maximize the speed of the bitwise operations, you would have to spend at least 2 cycles on addition. Much of this time would be spent waiting around, because you didn't really need the full 2 cycles' worth of time; you only needed about 1.1 (or some number like that). Now your chip adds more slowly than everyone else's on the market.

Worse, the mere act of adding or doing bitwise operations is only one tiny part of what is going on during a cycle. You have to be able to fetch/decode instructions within a cycle. You have to be able to do cache operations within a cycle. Lots of other things are going on on the same timescale as the simple addition or bitwise operation.

The solution, of course, is to develop a massively deep pipeline, breaking these tasks up into tiny parts that fit into the tiny cycle time defined by a bitwise operation. The Pentium 4 famously showed the limits of thinking in these deep-pipeline terms. All sorts of issues arise. In particular, branching gets notoriously difficult, because you have to flush the pipeline once you finally have the data needed to figure out which branch to take.

$\endgroup$
8
$\begingroup$

Modern processors are clocked: Every operation takes some integral number of clock cycles. The designers of the processor determine the length of a clock cycle. There are two considerations there: One, the speed of the hardware, for example measured as the delay of a single NAND-gate. This depends on the technology used, and on tradeoffs like speed vs. power usage. It is independent of the processor design. Two, the designers decide that the length of a clock cycle equals n delays of a single NAND-gate, where n might be 10, or 30, or any other value.

This choice of n limits the complexity of the operations that can be processed in one cycle. There will be operations that can be done in 16 but not in 15 NAND delays. So choosing n = 16 means such an operation can be done in a cycle, while choosing n = 15 means it can't.

The designers will choose n so that many important operations can be done in one, or maybe two or three, cycles. n will be chosen to be locally optimal: if you replaced n with n-1, then most operations would be a bit faster, but some (those that really need the full n NAND delays) would be slower. If only a few operations would slow down, so that overall program execution is faster on average, then you would have picked n-1. You could also have picked n+1. That makes most operations a bit slower, but if there are many operations that can't be done within n delays but can be done within n+1 delays, then it would make the processor faster overall.
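
To make that concrete with invented numbers (purely illustrative, not measurements of any real process): suppose one NAND delay is 50 ps, a bitwise operation needs about 3 NAND delays, and a 64-bit addition with a fast adder needs about 14. Choosing n = 15 gives a 750 ps cycle (roughly 1.3 GHz) in which both finish in a single cycle. Choosing n = 5 would allow a 4 GHz clock, but addition would then need 3 cycles, and the overhead of spreading an operation over several cycles (discussed below) would eat into the apparent gain.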

Now to your question: add and subtract are such common operations that you want to be able to execute them in a single cycle. As a result, it doesn't matter that AND, OR etc. could execute faster: they still need that one cycle. Of course the unit "calculating" AND, OR etc. has a lot of time to twiddle its thumbs, but that can't be helped.

Note that it's not just a question of whether an operation can be done within n NAND delays or not: an addition, for example, can be made faster by being a bit clever, still faster by being very clever, still a bit faster by investing extraordinary amounts of hardware, and finally a processor can have a mixture of very fast, very expensive circuits and slower, cheaper ones, so there is the possibility of making one operation just about fast enough by spending more money on it.

Now you could make the clock speed so high / the cycle so short that only the simple bit operations execute in one cycle and everything else in two or more. That would most likely slow the processor down. For operations that take two cycles, there is usually overhead to move an incomplete instruction from one cycle to the next, so two cycles doesn't mean you have twice as much time for execution. So to do the addition in two cycles, you couldn't double the clock speed.

$\endgroup$
6
$\begingroup$

Let me correct a few things that were not mentioned quite that explicitly in your existing answers:

I know that bitwise operations are so fast on modern processors, because they can operate on 32 or 64 bits in parallel,

This is true. Labeling a CPU as "XX"-bit usually (not always) means that most of its common structures (register widths, addressable RAM etc.) are XX bits in size (often "+/- 1" or somesuch). But with regard to your question, you can safely assume that a 32-bit or 64-bit CPU will do any basic bit operation on 32 or 64 bits in constant time.

so bitwise operations take only one clock cycle.

This conclusion does not necessarily follow. CPUs with rich instruction sets especially (google CISC vs. RISC) can easily take more than one cycle even for simple commands. With interleaving, even simple commands might break down into fetch-exec-store, taking 3 clocks (as an example).

However, addition is a complex operation

No, integer addition is a simple operation; subtraction as well. It is very easy to implement adders in full hardware, and they do their stuff as instantaneously as basic bit operations.

that consists of at least one and possibly up to a dozen bitwise operations, so I naturally thought it would be 3-4 times slower.

It will take up 3-4 times as many transistors, but in comparison to the big picture that is negligible.

I was surprised to see after a simple benchmark that addition is exactly as fast as any of the bitwise operations (XOR, OR, AND, etc.). Can anyone shed light on this?

Yes: integer addition is a bitwise operation (with a few more bits than the others, but still). There is no need to do anything in stages, there is no need for complicated algorithms, clocks or anything else.

If you wish to add more bits than your CPU architecture supports, you will incur a penalty of having to do it in stages. But this is on another level of complexity (programming-language level, not assembly/machine-code level). This was a common problem in the past (and still is today on small embedded CPUs). For PCs etc., their 32 or 64 bits are wide enough for the most common data types that this is largely a moot point.
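
As a concrete illustration of what "doing it in stages" looks like, here is a minimal C sketch (the 128-bit struct and limb layout are just for this example): add the low halves, detect the carry via the unsigned wraparound, and feed it into the addition of the high halves. A compiler targeting a narrower CPU emits the same pattern for ordinary wide integer types, typically using an add-with-carry instruction instead of the explicit comparison.

    #include <stdint.h>
    #include <stdio.h>

    /* A 128-bit value stored as two 64-bit "limbs", least significant first. */
    typedef struct { uint64_t lo, hi; } u128;

    /* Multi-word addition: add the low limbs, detect the carry, feed it into
     * the high limbs. Wider integers just mean more limbs and more stages. */
    static u128 add128(u128 a, u128 b)
    {
        u128 r;
        r.lo = a.lo + b.lo;
        uint64_t carry = (r.lo < a.lo);   /* unsigned wraparound => carry out */
        r.hi = a.hi + b.hi + carry;
        return r;
    }

    int main(void)
    {
        u128 a = { UINT64_MAX, 0 };        /* 2^64 - 1 */
        u128 b = { 1, 0 };
        u128 s = add128(a, b);
        printf("hi=%llu lo=%llu\n",        /* prints hi=1 lo=0, i.e. 2^64 */
               (unsigned long long)s.hi, (unsigned long long)s.lo);
        return 0;
    }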

$\endgroup$
4
  • 1
    $\begingroup$ It's interesting to note that reducing the time cost of addition from O(N) to O(sqrt(N)) does not significantly increase the required number of transistors or routing complexity (each stage just needs to let one carry wire sneak in from below, and there need to be sqrt(N) extra merging stages). The time cost can be reduced to O(lg N) at a cost of O(lg N) transistors, but in many cases it may be helpful to process something like a 64-bit addition as e.g. eight 8-bit adds (using sqrt(N) forwarding) joined with three layers of merging logic, rather than as 64 1-bit adds with six layers of merging. $\endgroup$
    – supercat
    Commented May 24, 2017 at 16:03
  • 1
    $\begingroup$ Yeah, adders are fairly simple. What's really impressive is modern x86 CPUs with a fully pipelined 3-cycle latency 64-bit integer multiplier. (e.g. imul rax, rcx has 3c latency, and one per 1c throughput on Intel Sandybridge-family, and AMD Ryzen). Even 64-bit full-multiplication (producing the 128 bit result in rdx:rax) has the same latency and throughput, but is implemented as 2 uops (which run in parallel on different ports). (See agner.org/optimize for instruction tables and an excellent microarch guide). $\endgroup$ Commented May 25, 2017 at 22:32
  • 1
    $\begingroup$ [add-with-carry] is on another level of complexity (programming language level, not assembly/machine code level. It depends on the language. A C compiler targeting a 16-bit CPU has to emit add/adc for you when it compiles addition of two uint32_t values. This is still relevant today for int64_t on 32-bit targets. AVR is an 8-bit RISC microcontroller, so 32-bit integers require 4 instructions: godbolt.org/g/wre0fM $\endgroup$ Commented May 25, 2017 at 22:40
  • $\begingroup$ Yes, @PeterCordes, that is what I meant, I've clarified my sentence a little bit. $\endgroup$
    – AnoE
    Commented May 26, 2017 at 5:48
