45
\$\begingroup\$

Why does hardware division take so much longer than multiplication on a microcontroller? E.g., on a dsPIC, a division takes 19 cycles, while multiplication takes only one clock cycle.

I went through some tutorials, including Division algorithm and Multiplication algorithm on Wikipedia. Here is my reasoning.

A division algorithm, like the slow division method with restoring described on Wikipedia, is iterative: the (intermediate) results from step k are used as inputs to step k+1, which means these steps cannot be parallelized. Therefore, it takes at least n cycles to complete the division, where n is the number of bits in the dividend. For 16-bit dividends, this is at least 16 cycles.
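The sequential dependence is easy to see in a software sketch of restoring division (an illustration of the algorithm's data flow, not the dsPIC's actual divider; the function name and values are made up for the example):

```python
def restoring_divide(dividend, divisor, n=16):
    """Restoring division: one quotient bit per iteration.
    Each iteration needs the remainder from the previous one,
    so the n iterations cannot run in parallel."""
    remainder = 0
    quotient = 0
    for i in range(n - 1, -1, -1):              # n dependent steps
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:                # trial subtraction
            remainder -= divisor                # keep it
            quotient |= 1 << i
        # else: "restore", i.e. never commit the subtraction
    return quotient, remainder

# e.g. restoring_divide(100, 7) == (14, 2)
```

Each loop iteration consumes the remainder produced by the previous one, which is exactly why a hardware implementation spends one cycle per quotient bit unless extra hardware unrolls the loop.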

A multiplication algorithm doesn't need to be iterative, which means it is possible to parallelize it. However, there are many different multiplication algorithms, and I don't have a clue which one might be used by microcontrollers. How does multiplication work in hardware on a microcontroller?

I've found the Dadda multiplier, which is supposed to take only one clock cycle to finish. What I don't get is that Dadda's algorithm proceeds in three steps, where the results from step 1 are used in step 2, etc. By that reasoning, it should take at least three clock cycles to finish.

\$\endgroup\$
15
  • 2
    \$\begingroup\$ The algorithm doesn't really define the number of clock cycles. Your specific CPU might have a hardware multiplier/divider working in one cycle or 20 cycles regardless of the internal implementation. \$\endgroup\$
    – Eugene Sh.
    Commented Jan 16, 2017 at 18:55
  • 1
    \$\begingroup\$ OP, can you provide a link that gives more information on the 19 vs 1 cycles you talk about? Something specific about your DSP. \$\endgroup\$ Commented Jan 16, 2017 at 18:57
  • 2
    \$\begingroup\$ Thanks for the answers. Here is the datasheet for my microcontroller: ww1.microchip.com/downloads/en/DeviceDoc/70005127c.pdf . See the instruction set overview, starting on page 292. It says that all DIV instructions take 18 cycles, while all MUL instructions take only 1 cycle. And this is not unique to this MCU; I've seen the same in many other MCUs. \$\endgroup\$ Commented Jan 16, 2017 at 19:00
  • 2
    \$\begingroup\$ @Curd, well, they're about the same, aren't they. Are for me. I don't think that illustrates it as well as you might imagine. \$\endgroup\$
    – TonyM
    Commented Jan 16, 2017 at 23:08
  • 3
    \$\begingroup\$ The other factor is economics and usage patterns. Most usages invoke multiply far more often than divide. Dedicating a large area of silicon to a faster hardware divide function that will be used relatively infrequently is poor economics. Better to make a smaller and cheaper chip, or to use the extra logic in a more productive manner. BTW when I started with minicomputers, divide wasn't always an instruction. On some machines it was a software library call, like square root. \$\endgroup\$
    – nigel222
    Commented Jan 17, 2017 at 10:46

6 Answers

37
\$\begingroup\$

A divider maps much less elegantly to typical hardware. Take Lattice ICE40 FPGAs as an example.

Let us compare two cases: this 8x8 bit to 16 bit multiplier:

module multiply (clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [15:0] result;
   always @(posedge clk)
     result <= a * b;
endmodule // multiply

and this divider that takes two 8-bit operands and produces an 8-bit result:

module divide(clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [7:0] result;
   always @(posedge clk)
     result <= a / b;
endmodule // divide

(Yes, I know, the clock doesn't do anything)

An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.

The synthesis statistics from Yosys are:

multiply

  • Number of wires: 155
  • Number of wire bits: 214
  • Number of public wires: 4
  • Number of public wire bits: 33
  • Number of memories: 0
  • Number of memory bits: 0
  • Number of processes: 0
  • Number of cells: 191
    • SB_CARRY 10
    • SB_DFF 16
    • SB_LUT4 165

divide

  • Number of wires: 145
  • Number of wire bits: 320
  • Number of public wires: 4
  • Number of public wire bits: 25
  • Number of memories: 0
  • Number of memory bits: 0
  • Number of processes: 0
  • Number of cells: 219
    • SB_CARRY 85
    • SB_DFF 8
    • SB_LUT4 126

It's worth noting that the sizes of the generated netlists for the full-width multiplier and the divider aren't that different. However, if you look at the pictures below, you'll notice the multiplier has a depth of maybe 15 logic levels, whereas the divider looks more like 50 or so; the critical path (i.e., the longest path a signal must traverse during operation) is what defines the speed!


You won't be able to read the details anyway; the images are there to give a visual impression. I think the difference in complexity is easy to spot. These are single-cycle multipliers/dividers!

Multiply

Multiply on an ICE40 (warning: ~100 Mpixel image)

Scaled image of the multiplier

Divide

Divide on an ICE40 (warning: ~100 Mpixel image)

Scaled image of the divider

\$\endgroup\$
12
  • 4
    \$\begingroup\$ no, you can implement them non-iteratively. But it will simply take quite some time until the valid result "ripples" through the logic. The implementations above are non-iterative. \$\endgroup\$ Commented Jan 16, 2017 at 22:04
  • 13
    \$\begingroup\$ I want a wall poster of the divider. \$\endgroup\$
    – Ian Howson
    Commented Jan 17, 2017 at 5:40
  • 8
    \$\begingroup\$ There's a PDF now in the multiply gist. It's 3378 × 3177 mm, so please discuss with your significant other before you put this on the bedroom ceiling. \$\endgroup\$ Commented Jan 17, 2017 at 9:51
  • 3
    \$\begingroup\$ Your 100 megapixel images are impressive, but are way overkill for the point you're trying to make, and they cause huge problems for anyone trying to view this page on a device with limited memory such as a phone or tablet. If you want to display the images inline, please find a way to produce a lower resolution preview. \$\endgroup\$
    – Dave Tweed
    Commented Jan 17, 2017 at 14:24
  • 4
    \$\begingroup\$ Yo, those graphviz charts are off the hook, yo! \$\endgroup\$ Commented Jan 17, 2017 at 23:05
11
\$\begingroup\$

We can have multiple layers of logic per clock cycle, but there is a limit. Exactly how many layers of logic we can have, and how complex those layers can be, depends on our clock speed and our semiconductor process.

However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers

As far as I can tell, most multiplication in computers uses a variant of binary long multiplication. Binary long multiplication involves:

  • Shifting one operand by various different amounts
  • Masking the shifted numbers based on the second operand
  • Adding the results of the masking together.
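The three steps can be sketched in software (a hedged Python illustration of the data flow; in real hardware the partial products are just wires and gates, not a loop):

```python
def long_multiply(a, b, n=8):
    """Binary long multiplication as shift, mask, add.
    In hardware all n shifted/masked partial products exist
    simultaneously as wires; only the addition costs logic depth."""
    partial_products = []
    for i in range(n):
        mask = -((b >> i) & 1)                    # all-ones if bit i of b is set
        partial_products.append((a << i) & mask)  # shift (free) + AND mask (1 layer)
    return sum(partial_products)                  # the adder tree does this part

# e.g. long_multiply(51, 82) == 4182
```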

So let's take a look at implementing this in hardware:

  • Shifting is just a matter of how we wire things up, so it comes for free.
  • Masking requires AND gates. That means one layer of logic, so from a timing point of view it's cheap.
  • Addition is relatively expensive due to the need for a carry chain. Fortunately, there is a trick we can use: for most of the addition stages, rather than adding two numbers to produce one, we can add three numbers to produce two.
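That add-three-to-produce-two trick is carry-save addition: a row of full adders compresses three addends into a sum word and a carry word in constant logic depth, with no carry chain. A sketch, with Python integers standing in for bit vectors:

```python
def carry_save_add(x, y, z):
    """3:2 compressor: reduce three addends to two in constant
    depth (one full-adder layer), with no carry propagation."""
    sum_word = x ^ y ^ z                              # per-bit XOR
    carry_word = ((x & y) | (x & z) | (y & z)) << 1   # per-bit majority, shifted
    return sum_word, carry_word   # sum_word + carry_word == x + y + z

s, c = carry_save_add(23, 42, 99)
assert s + c == 23 + 42 + 99   # only the final add needs a carry chain
```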

So let's ballpark how many logic stages we need for an 8×8 multiplier with a 16-bit result. For simplicity, let's assume we don't try to optimise for the fact that not all of the intermediate results have bits in all positions.

Let's assume that a full adder is implemented in two "gate stages".

  • 1 for masking to produce 8 intermediate results.
  • 2 to add groups of three numbers to reduce the 8 intermediate results to 6
  • 2 to add groups of three numbers to reduce the 6 intermediate results to 4
  • 2 to add a group of three numbers to reduce the 4 intermediate results to 3
  • 2 to add a group of three numbers to reduce the 3 intermediate results to 2
  • 32 to add up the final two results.

So about 41 logic stages in total, most of which are spent adding up the last two intermediate results.

This could be improved further by exploiting the fact that not all the intermediate results have all bits present (that is basically what the Dadda multiplier does), by using a carry-lookahead adder for the final step, by adding seven numbers to produce three instead of three to produce two (reducing the number of stages at the price of more and wider gates), etc.

Those are all minor details, though; the important point is that the number of stages needed to multiply two n-bit numbers and produce a 2n-bit result is roughly proportional to n.


On the other hand, if we look at division algorithms, we find they all have an iterative process where:

  1. What is done in one iteration depends heavily on the results of the previous iteration.
  2. The number of logic stages required to implement an iteration is roughly proportional to n (subtraction and comparison are very similar in complexity to addition).
  3. The number of iterations is also roughly proportional to n.

So the number of logic stages required to implement division is roughly proportional to n squared.

\$\endgroup\$
6
  • \$\begingroup\$ Thank you for your answer. I've read on Wiki that Dadda's algorithm is very efficient when it comes to required number of gates to implement this algorithm on hardware. Despite that, most hardware uses "binary long multiplication"? \$\endgroup\$ Commented Jan 16, 2017 at 22:03
  • 1
    \$\begingroup\$ It looks like Dadda's algorithm is an optimised version of binary long multiplication. \$\endgroup\$ Commented Jan 16, 2017 at 22:04
  • \$\begingroup\$ I burn 8 cycles to do a 1/x division. I then use that against an 8 cycle multiplication for a fixed cost of 16 cycles. \$\endgroup\$
    – b degnan
    Commented Jan 17, 2017 at 12:23
  • \$\begingroup\$ This nicely demonstrates that multiplication is not so much worse than addition after all. \$\endgroup\$ Commented Jan 17, 2017 at 19:11
  • 1
    \$\begingroup\$ An iteration requires a subtraction which can be done in O(lgN) stages using O(NlgN) hardware, or O(sqrt(N)) stages using O(N) hardware. The essential point, though, is that multiplication requires O(lgN) stages, while division requires O(NlgN) stages. Not O(N*N) but larger than multiplication by a factor of O(N) unless one starts by taking an approximate reciprocal so as to allow more work to be done per step. \$\endgroup\$
    – supercat
    Commented Jan 17, 2017 at 22:00
9
\$\begingroup\$

Slow division is inherently iterative, so it tends to take longer. There are somewhat faster slow-division algorithms than the simple ones, using lookup tables; the SRT algorithm, for example, produces two bits per cycle. An error in such a table was the cause of the infamous Pentium FDIV bug (ca. 1994). Then there are so-called fast division algorithms.

Of course, in principle, you could simply use a huge lookup table to calculate the product or quotient of two numbers, and thus get results in a single cycle, but that tends to rapidly get impractical as the number of bits per number increases.

\$\endgroup\$
4
  • \$\begingroup\$ But the bottom line is - division algorithms cannot be parallelized, unlike multiplication algorithms, and that is why they are so much slower? \$\endgroup\$ Commented Jan 16, 2017 at 19:02
  • 2
    \$\begingroup\$ @MarkoGulin "cannot" is a very strong assertion. It's certainly not straightforward. \$\endgroup\$ Commented Jan 16, 2017 at 19:14
  • 3
    \$\begingroup\$ I think you could weaken it from "division algorithms cannot be parallelized" to "the ways we have found to parallelize division are more straining on the hardware implementing the division than parallelized multiplication." Sphero gives an example of how to do single-cycle division using O(2^n) gates to multiply n-bit numbers... but that's just not practical. \$\endgroup\$
    – Cort Ammon
    Commented Jan 17, 2017 at 0:14
  • 2
    \$\begingroup\$ Long division can exploit parallelism to any desired degree by computing an approximate reciprocal which, when multiplied by the divisor, yields a result of the form 1000...xxxx. When working with a divisor in such a form with N leading zeroes, it's easy to compute N bits of the result with each step. \$\endgroup\$
    – supercat
    Commented Jan 17, 2017 at 2:02
6
\$\begingroup\$

Practical division algorithms are all based on numeric sequences that converge to the quotient.

  • There are additive methods, such as non-restoring or SRT division, which work by adding or subtracting 2^N to/from the quotient, and correspondingly adding or subtracting 2^N × divisor to/from the partial remainder, until it has converged to zero.

  • There are multiplicative methods, such as Newton-Raphson or Goldschmidt, which are root-finding methods in which division is calculated as the inverse of multiplication.

Additive methods give one or a few bits per cycle. Multiplicative methods double the number of correct bits each cycle, but need some initial approximation, often obtained from a table of constants.
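The quadratic convergence of the multiplicative methods can be seen in a Newton-Raphson reciprocal sketch (an illustrative floating-point version; real dividers work in fixed point, and the seed constants 48/17 and 32/17 are the classic linear approximation, not any particular chip's table):

```python
import math

def nr_reciprocal(b, iterations=5):
    """Reciprocal via Newton-Raphson: x <- x * (2 - b*x).
    Each iteration roughly doubles the number of correct bits,
    so ~5 iterations reach double precision from a ~4-bit seed."""
    m, e = math.frexp(b)              # b = m * 2**e, with 0.5 <= m < 1
    x = 48/17 - (32/17) * m           # linear seed, a few bits correct
    for _ in range(iterations):
        x = x * (2.0 - m * x)         # two multiplies per iteration
    return math.ldexp(x, -e)          # undo the scaling: 1/b

def nr_divide(a, b):
    """Divide by multiplying with the approximate reciprocal."""
    return a * nr_reciprocal(b)

# e.g. nr_divide(4182.0, 51.0) is 82.0 to within rounding error
```

Note that the loop body contains only multiplications and a subtraction, which is why a CPU with a fast multiplier can implement division this way with just a small seed table.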

The "slow" and "fast" labels are misleading, as the actual speed depends on the number of bits, on how much hardware is devoted to the function (and a fast multiplier is very large), and so on.

Division is slower than multiplication because there is no direct, parallel method for calculating it: either there is an iteration, or the hardware is replicated to implement the iteration as cascaded (or pipelined) blocks.

\$\endgroup\$
5
\$\begingroup\$

A division algorithm (in fact, any algorithm) can be made to complete in one clock cycle, if you are willing to pay for the extra transistors and a lower allowed clock rate.

Suppose you have a set of gates that implements one clock cycle of an existing multi-cycle division algorithm. To make the algorithm single-cycle, use multiple stages of that hardware in series, with the output of one stage feeding the input of the next.

Of course, the reason not to do it that way is that it uses a lot of transistors: a 16-bit division may use nearly 16× more. Also, having more stages of gates lowers the maximum allowed clock frequency, because there are more stages of propagation delay.
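The unrolling can be sketched by writing one stage as a pure function and chaining n copies (a Python illustration, assuming restoring division as the per-cycle stage; in hardware each call would be a separate block of gates, all evaluated within one long clock cycle):

```python
def divider_stage(remainder, quotient, dividend_bit, divisor):
    """One combinational stage: shift in a dividend bit, trial-subtract."""
    remainder = (remainder << 1) | dividend_bit
    if remainder >= divisor:
        return remainder - divisor, (quotient << 1) | 1
    return remainder, quotient << 1

def single_cycle_divide(dividend, divisor, n=16):
    """n cascaded stages, each feeding the next: the hardware cost
    grows ~n-fold, and the critical path is the sum of all stages."""
    remainder, quotient = 0, 0
    for i in range(n - 1, -1, -1):   # in hardware: unrolled, not looped
        remainder, quotient = divider_stage(
            remainder, quotient, (dividend >> i) & 1, divisor)
    return quotient, remainder
```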

\$\endgroup\$
1
  • \$\begingroup\$ To put it another way: you could slow down the rest of the CPU so everything else was as slow as division, but that would be insane vs. letting it take many cycles while having everything else fast. \$\endgroup\$ Commented Mar 10, 2022 at 11:24
1
\$\begingroup\$

Why does hardware division take so much longer than multiplication on a microcontroller?

This is not an electronics question. At best it is a computer question, better addressed to Stack Overflow.

See, for example, here: Is multiplication faster than float division?

In reality, it is a Real Life question: Why does division take so much longer than multiplication?

Which would you rather calculate on paper?

51 * 82

or

4182 / 51

Division takes longer than multiplication because it is harder to do.

\$\endgroup\$
