Why does hardware division take so much longer than multiplication on a microcontroller? E.g., on a dsPIC, a division takes 19 cycles, while multiplication takes only one clock cycle.
I went through some tutorials, including the Wikipedia articles *Division algorithm* and *Multiplication algorithm*. Here is my reasoning.
A division algorithm, like the slow restoring division method described on Wikipedia, is a recursive algorithm: the (intermediate) results from step k are used as inputs to step k+1, which means these steps cannot be parallelized. Therefore, the division takes at least n cycles to complete, where n is the number of bits in the dividend. For a 16-bit dividend, this is at least 16 cycles.
A multiplication algorithm doesn't need to be recursive, which means that it is possible to parallelize it. However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers. How does multiplication work in hardware on a microcontroller?
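From what I've read (this is my assumption, not something from a datasheet), a single-cycle multiplier generates all partial products at once and then sums them. In software that looks like two sequential loops, but in hardware no partial product depends on another, so step 1 can be done entirely in parallel by AND gates:

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of a 16x16 combinational multiplier.
 * Each partial product depends only on the inputs a and b,
 * never on another partial product, so in hardware all 16
 * of them can be generated simultaneously; only the final
 * summation network (an adder tree in hardware) adds delay. */
static uint32_t parallel_mul(uint16_t a, uint16_t b)
{
    uint32_t pp[16];

    /* Step 1: generate partial products - all independent. */
    for (int i = 0; i < 16; i++)
        pp[i] = ((b >> i) & 1u) ? ((uint32_t)a << i) : 0;

    /* Step 2: sum them. */
    uint32_t result = 0;
    for (int i = 0; i < 16; i++)
        result += pp[i];

    return result;
}

int main(void)
{
    printf("%u\n", parallel_mul(50000, 37)); /* 1850000 */
    return 0;
}
```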
I've found the Dadda multiplier algorithm, which is supposed to take only one clock cycle to finish. What I don't get here is that Dadda's algorithm proceeds in three steps, where the results from step 1 are used in step 2, and so on. By my reasoning above, this should take at least three clock cycles to finish.
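Here is a toy 4x4 C model of what I understand the three Dadda stages to be (partial-product generation, carry-save reduction, final carry-propagate addition). The function name and the 4-bit width are mine, just to illustrate what I mean by "steps":

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of the three Dadda stages for a 4x4 multiply.
 * Stage 1: generate partial-product rows (parallel AND gates).
 * Stage 2: carry-save reduction - compress three rows at a time
 *          into a sum row and a carry row with bitwise
 *          full-adder logic, until only two rows remain.
 * Stage 3: one final carry-propagate addition. */
static uint8_t dadda4x4(uint8_t a, uint8_t b)
{
    /* Stage 1: four partial-product rows, all independent. */
    uint8_t rows[4];
    int n = 4;
    for (int i = 0; i < 4; i++)
        rows[i] = ((b >> i) & 1u) ? (uint8_t)((a & 0xF) << i) : 0;

    /* Stage 2: (3,2) carry-save compression until two rows remain. */
    while (n > 2) {
        uint8_t x = rows[0], y = rows[1], z = rows[2];
        uint8_t sum   = x ^ y ^ z;                    /* full-adder sum bits   */
        uint8_t carry = (uint8_t)(((x & y) | (x & z) | (y & z)) << 1);
        rows[0] = sum;
        rows[1] = carry;
        for (int i = 3; i < n; i++)  /* shift remaining rows down */
            rows[i - 1] = rows[i];
        n--;
    }

    /* Stage 3: the only carry-propagating addition. */
    return (uint8_t)(rows[0] + rows[1]);
}

int main(void)
{
    printf("%u\n", dadda4x4(13, 11)); /* 143 */
    return 0;
}
```

My confusion, restated with this model: stage 2 needs the output of stage 1, and stage 3 needs the output of stage 2, so why does the whole thing still fit in a single clock cycle?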