Why does division by 3 require a rightshift (and other oddities) on x86?

Question

I have the following C/C++ function:

unsigned div3(unsigned x) {
    return x / 3;
}

When compiled using clang 10 at -O3, this results in:

div3(unsigned int):
        mov     ecx, edi         # tmp = x
        mov     eax, 2863311531  # result = 3^-1
        imul    rax, rcx         # result *= tmp
        shr     rax, 33          # result >>= 33
        ret

What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2³² which is 2863311531.

There are some things that I don't understand though:

Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

Edit 1

For those who don't understand what I mean by 3^-1 mod 2³², I am talking about the multiplicative inverse here. For example:

// multiplying with inverse of 3:
15 * 2863311531      = 42949672965
42949672965 mod 2^32 = 5

// using fixed-point multiplication
15 * 2863311531      = 42949672965
42949672965 >> 33    = 5

// simply dividing by 3
15 / 3               = 5

So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.

Edit 2

I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3^-1 is equal to 3^-1, not zero. Only fixed point arithmetic has correct rounding.

Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.

unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}

(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

I think it's because ecx and edx are 32 bit registers, while rax and rcx are 64 bit registers. — Cosinus, Commented Aug 14, 2020 at 18:01
All of your questions can be answered by one simple statement. The compiler optimization designers decided that this sequence was the most efficient and optimized way to do the division. Typically these decisions are made on a set of criteria such as minimizing the number of instructions used, choosing instructions with the minimum time, minimizing the amount of moving of data around especially when accessing RAM, working to optimize use of the CPU instruction pipeline, and other speed considerations. — Richard Chambers, Commented Aug 14, 2020 at 18:03
@J.Schultke In order to find out, study the source code for the clang optimizer — klutt, Commented Aug 14, 2020 at 18:07
2863311531 is equal to 3^-1 << 33. Your multiplication is too large by factor2^33. Therefore you cannot just chop the upper 32 bits but you must chop the lower 33 bits which is done by the shift. — Gerhardh, Commented Aug 14, 2020 at 18:22
The division instruction is often the slowest thing a processor can do. Any substitution that can be done will often be a win for execution time. Instead of dividing by n, multiplying by 1/n is very often better if 1/n can be computed at compile time. I do it manually myself if I can't rely on the compiler figuring it out for me. — Mark Ransom, Commented Aug 15, 2020 at 3:32

Peter Cordes · Accepted Answer · 2020-08-15 06:49:37Z

Can't we multiply rax with edi directly?

We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.

(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).

Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.

Why do we multiply in 64-bit mode?

(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)

You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF) (Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)

Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)

GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/

# gcc8.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                 # 1 uop, stupid wasted instruction
        mov     edx, -1431655765         # 1 uop  (same 32-bit constant, just printed differently)
        mul     edx                      # 3 uops on Sandybridge-family
        mov     eax, edx                 # 1 uop
        shr     eax                      # 1 uop
        ret
                                  # total of 7 uops on SnB-family

Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:

# gcc9.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                # 1 uop
        mov     edi, 2863311531         # 1 uop
        imul    rax, rdi                # 1 uop
        shr     rax, 33                 # 1 uop
        ret
                      # total 4 uops, not counting ret

Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.

Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.

Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.

Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.

What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):

See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.

For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed

// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}

Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)

From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.

As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.

unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0);  // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}

Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.

div3_exact_only:
    imul  eax, edi, 0xAAAAAAAB        # 1 uop, 3c latency
    ret

However, it is something you can do yourself in source code, at least for known type widths like uint32_t:

uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}

This is a really good and detailed answer. The division by 10 is a non-issue though. You can multiply with 3435973837 and rightshift by 35 and that gives you enough precision for all 32-bit numbers. Even the Wikipedia article on division has this example. — Jan Schultke, Commented Aug 14, 2020 at 22:18
Both clang and GCC can generate these simplified divisions for any constant divisor, so you really don't need to implement this yourself. The multiplicative inverse method would never work in hindsight, as it could never turn a 1 into a zero. — Jan Schultke, Commented Aug 14, 2020 at 22:20
@J.Schultke: Yeah, of course there is a fixed-point multiplicative inverse for 10 (which compilers use), but your modular method (just relying on modulo 2^32 for free thanks to register widths) doesn't work. That was the point I was trying to make about why compilers don't just take the high half or low half of a multiply with a magic constant. You only need to do something special if you want an even faster approximation that doesn't work exactly for all inputs. — Peter Cordes, Commented Aug 14, 2020 at 22:21
@J.Schultke: Added an update to clarify what I meant. That manual-shift version doesn't save anything on a 64-bit machine where the compiler will probably use a 64-bit imul and shift anyway, but on a 32-bit machine it will get the compiler to just directly use the high half result of mul r32 without an shr by 1. — Peter Cordes, Commented Aug 14, 2020 at 23:16

Cosinus · Accepted Answer · 2020-08-14 19:21:55Z

11

What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.

Instead of 3^(-1) mod 3 you have to think more about 0.3333333 where the 0 before the . is located in the upper 32 bit and the the 3333 is located in the lower 32 bit. This fixed point operation works fine, but the result is obviously shifted to the upper part of rax, therefor the CPU must shift the result down again after the operation.

Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:

a <= a * b

There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 Bit in two registers. Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.

edited Aug 14, 2020 at 19:21

answered Aug 14, 2020 at 18:23

Cosinus

5945 silver badges11 bronze badges

Thanks. It didn't occur to me at first that clang might be using fixed point arithmetic here. I just assumed it would use modular arithmetic instead. The only question that remains open is why edi isn't being used for multiplication.
– Jan Schultke
Commented Aug 14, 2020 at 18:38
I looked it up and it is legal to multiply with edi in this case. edi is preserved during function calls, but in this case imul wouldn't modify edi anyways. Maybe it's because rdi could be filled with values beyond 32-bits, so multiplying with rdi wouldn't be safe.
– Jan Schultke
Commented Aug 14, 2020 at 18:58
@J.Schultke This seems plausible
– Cosinus
Commented Aug 14, 2020 at 19:08
2

The reason it doesn’t use rdi directly is that the upper bits of rdi aren’t necessarily 0 when it is received as a parameter. The best way to clear the upper 32 bits of a register is with a 32-bit move instruction, which clears them automatically. The compiler could have used mov edi, edi. But rcx is available, and it turns out that in some cases mov ecx, edi is literally free (handled by the register renamer), while mov edi, edi isn’t necessarily free.
– prl
Commented Aug 14, 2020 at 19:43
@prl: mov-elimination doesn't make it free; latency and back-end execution units aren't the only measure of instruction cost. Notably front-end bandwidth can easily be a bottleneck in high throughput code. And code-size is always relevant. Can x86's MOV really be "free"? Why can't I reproduce this at all? So yes, mov ecx, edi is at least as good on every CPU, but not free.
– Peter Cordes
Commented Aug 14, 2020 at 20:33

Add a comment |

rcgldr · Accepted Answer · 2022-02-23 21:16:05Z

If you look at my answer to the prior question:

Why does GCC use multiplication by a strange number in implementing integer division?

It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):

https://gmplib.org/~tege/divcnst-pldi94.pdf

Note that one extra bit of precision is needed for some divisors, such as 7, the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.

Take a look at the generated code if you change to

unsigned div7(unsigned x) {
    return x / 7;
}

So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.

To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.

mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L)        )/3 = 5726623061

Then a check is made to see if the number of required bits can be reduced:

while((L > 0) && ((mhi>>1) > (mlo>>1))){
    mhi = mhi>>1;
    mlo = mlo>>1;
    L   = L-1;
}
if(mhi >= 2^32){
    mhi = mhi-2^32
    L   = L-1;
    ; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530  (mhi>>1) > (mlo>>1)
... mhi    = mhi>>1 = 2863311531
... mlo    = mhi>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)

So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.

On an modern X86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so that while(...) above is changed to an if(...).

In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:

L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757

Let ecx = dividend, the simple approach could overflow on the add:

mov eax, 613566757             ; eax = mhi
mul ecx                        ; edx:eax = ecx*mhi
add edx, ecx                   ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3

To avoid the potential overflow, note that eax = eax*2 - eax:

(ecx*eax)           =   (ecx*eax)<<1)             -(ecx*eax)
(ecx*(eax+2^32))    =   (ecx*eax)<<1)+  (ecx*2^32)-(ecx*eax)
(ecx*(eax+2^32))>>3 =  ((ecx*eax)<<1)+  (ecx*2^32)-(ecx*eax)     )>>3
                    = (((ecx*eax)   )+(((ecx*2^32)-(ecx*eax))>>1))>>2

so the actual code, using u32() to mean upper 32 bits:

...                 visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx                        ; edx = u32( (ecx*eax) )
sub ecx, edx                   ; ecx = u32(            ((ecx*2^32)-(ecx*eax))        )
shr ecx, 1                     ; ecx = u32(           (((ecx*2^32)-(ecx*eax))>>1)    )
lea eax, DWORD PTR [edx+ecx]   ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1)    )
shr eax, 2                     ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)

If a remainder is wanted, then the following steps can be used:

mhi and L are generated based on divisor during compile time
...
quotient  = (x*mhi)>>(32+L)
product   = quotient*divisor
remainder = x - product

gnasher729 · Accepted Answer · 2020-08-15 10:49:49Z

x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.

There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.

Collectives™ on Stack Overflow

Why does division by 3 require a rightshift (and other oddities) on x86?

Edit 1

Edit 2

4 Answers 4

Not the answer you're looking for? Browse other questions tagged
c++
assembly
compilation
x86-64
integer-division
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Edit 1

Edit 2

4 Answers 4

Not the answer you're looking for? Browse other questions tagged c++assemblycompilationx86-64integer-division or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
c++
assembly
compilation
x86-64
integer-division
or ask your own question.