Efficiently implementing floored / euclidean integer division

Question

Floored division is when the result is always floored down (towards −∞), not towards 0:

division types

Is it possible to efficiently implement floored or euclidean integer division in C/C++?

(the obvious solution is to check the dividend's sign)

Vladimir Panteleev · Accepted Answer · 2010-11-05 23:50:32Z

I've written a test program to benchmark the ideas presented here:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>

#define N 10000000
#define M 100

int dividends[N], divisors[N], results[N];

__forceinline int floordiv_signcheck(int a, int b)
{
    return (a<0 ? a-(b-1) : a) / b;
}

__forceinline int floordiv_signcheck2(int a, int b)
{
    return (a - (a<0 ? b-1 : 0)) / b;
}

__forceinline int floordiv_signmultiply(int a, int b)
{
    return (a + (a>>(sizeof(a)*8-1))*(b-1)) / b;
}

__forceinline int floordiv_floatingpoint(int a, int b)
{
    // I imagine that the call to floor can be replaced to a cast
    // if you can get FPU rounding control to work (I couldn't).
    return floor((double)a / b);
}

void main()
{
    for (int i=0; i<N; i++)
    {
        dividends[i] = rand();
        do
            divisors[i] = rand();
        while (divisors[i]==0);
    }

    LARGE_INTEGER t0, t1;

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck    : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signcheck2(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signcheck2   : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_signmultiply(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("signmultiply : %9llu\n", t1.QuadPart-t0.QuadPart);

    QueryPerformanceCounter(&t0);
    for (int j=0; j<M; j++)
        for (int i=0; i<N; i++)
            results[i] = floordiv_floatingpoint(dividends[i], divisors[i]);
    QueryPerformanceCounter(&t1);
    printf("floatingpoint: %9llu\n", t1.QuadPart-t0.QuadPart);
}

Results:

signcheck    :  61458768
signcheck2   :  61284370
signmultiply :  61625076
floatingpoint: 287315364

So, according to my results, checking the sign is the fastest:

(a - (a<0 ? b-1 : 0)) / b

Your timing for all but the first include the printf for the previous answer. printf is kinda slow, don't know if it's slow enough to affect your results or not. — Mark Ransom, Commented Nov 5, 2010 at 22:30
You might also be affected by inlining decisions made by the compiler, it would be worth looking at the generated assembly. Try double instead of float as there might be conversions happening there as well. — Mark Ransom, Commented Nov 5, 2010 at 22:32
Thanks for the suggestions, I've updated the program. Using double caused floatingpoint to run a bit faster, but not by much. Otherwise, the result didn't change much. — Vladimir Panteleev, Commented Nov 5, 2010 at 23:51
Isn't this program only testing positive dividends? rand() is specified to only return positive integers. — Dolda2000, Commented Apr 21, 2016 at 3:48

Dolda2000 · Accepted Answer · 2019-02-25 14:51:42Z

I'm revisiting this question five years later, as this is relevant for me too. I did some performance measurements on two pure-C versions and two inline-assembly versions for x86-64, and the results may be interesting.

The tested variants of floored division are:

The implementation I've been using for some time now;
The slight variant on that presented above which only uses one division;
The previous one, but hand-implemented in inline-assembly; and
A CMOV version implemented in assembly.

The following is my benchmark program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#ifndef VARIANT
#define VARIANT 3
#endif

#if VARIANT == 0
#define floordiv(a, b) (((a) < 0)?((((a) + 1) / (b)) - 1):((a) / (b)))
#elif VARIANT == 1
#define floordiv(a, b) ((((a) < 0)?((a) - ((b) - 1)):(a)) / (b))
#elif VARIANT == 2
#define floordiv(a, b) ({                                   \
    int result;                                             \
    asm("test %%eax, %%eax; jns 1f; sub %1, %%eax;"         \
        "add $1, %%eax; 1: cltd; idivl %1;"                 \
        : "=a" (result)                                     \
        : "r" (b),                                          \
          "0" (a)                                           \
        : "rdx");                                           \
    result;})
#elif VARIANT == 3
#define floordiv(a, b) ({                                           \
    int result;                                                     \
    asm("mov %%eax, %%edx; sub %1, %%edx; add $1, %%edx;"           \
        "test %%eax, %%eax; cmovs %%edx, %%eax; cltd;"              \
        "idivl %1;"                                                 \
        : "=a" (result)                                             \
        : "r" (b),                                                  \
          "0" (a)                                                   \
        : "rdx");                                                   \
    result;})
#endif

double ntime(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return(tv.tv_sec + (((double)tv.tv_usec) / 1000000.0));
}

void timediv(int n, int *p, int *q, int *r)
{
    int i;

    for(i = 0; i < n; i++)
        r[i] = floordiv(p[i], q[i]);
}

int main(int argc, char **argv)
{
    int n, i, *q, *p, *r;
    double st;

    n = 10000000;
    p = malloc(sizeof(*p) * n);
    q = malloc(sizeof(*q) * n);
    r = malloc(sizeof(*r) * n);
    for(i = 0; i < n; i++) {
        p[i] = (rand() % 1000000) - 500000;
        q[i] = (rand() % 1000000) + 1;
    }

    st = ntime();
    for(i = 0; i < 100; i++)
        timediv(n, p, q, r);
    printf("%g\n", ntime() - st);
    return(0);
}

I compiled this with gcc -march=native -Ofast using GCC 4.9.2, and the results, on my Core i5-2400, were as follows. The results are fairly reproducible from run to run -- they always land in the same order, at least.

Variant 0: 7.21 seconds
Variant 1: 7.26 seconds
Variant 2: 6.73 seconds
Variant 3: 4.32 seconds

So the CMOV implementation blows the others out of the water, at least. What surprises me is that variant 2 out-does its pure-C version (variant 1) by a fairly wide margin. I'd have thought the compiler should be able to emit code at least as efficient as mine.

Here are some other platforms, for comparison:

AMD Athlon 64 X2 4200+, GCC 4.7.2:

Variant 0: 26.33 seconds
Variant 1: 25.38 seconds
Variant 2: 25.19 seconds
Variant 3: 22.39 seconds

Xeon E3-1271 v3, GCC 4.9.2:

Variant 0: 5.95 seconds
Variant 1: 5.62 seconds
Variant 2: 5.40 seconds
Variant 3: 3.44 seconds

As a final note, I should perhaps warn against taking the apparent performance advantage of the CMOV version too seriously, because in the real world, the branch in the other versions will probably not be as completely random as in this benchmark, and if the branch predictor can do a reasonable job, the branching versions may turn out to be better. However, the realities of that will depend quite a bit on the data that are being used in practice, and so is probably pointless to try and do any generic benchmark of.

Note that Variants 0 and 1 (didn't check the others) only give the numerically correct results for floor division when the divisor is positive. (And in that case, the results also coincide with Euclidean division.) If for example a,b = 5,-3, these formulas say the quotient is -1, but the result for floor division should be -2. (The Euclidean division for 5,-3 is -1, but these formulas give a different answer than both floor and Euclidean division when both operands are negative.) — dubiousjim, Commented Jul 27, 2017 at 12:01
Have you benchmarked versions where the divisor is a positive constant (but the dividend is computed at run-time)? For power-of-two dividends, the shift-right operator would almost certainly be as fast as any alternative, and likely more readable than anything which would be as efficient. For other dividends, optimal code for floored/Euclidian division should be faster than code for truncated, but I don't know any reliable way to convince compilers to generate good code. — supercat, Commented Nov 1, 2017 at 21:54
@supercat: For compile-time constant divisors, the compiler will be able to generate better code whether it's a power-of-two or not, using this trick. So yes, divisions with compile-time-constant divisors will certainly be faster than any algorithm that handles dynamic divisors. — Dolda2000, Commented Nov 1, 2017 at 21:57
@Dolda2000: My question was what one must do to generate efficient code for floored/Euclidian division in the positive-constant-divisor case? Using the signed / operator would typically result in the compiler generating extra code to create the non-uniformity near zero, which would then make it necessary for the programmer to write more extra code. I think the best approach is to convert to unsigned, add one offset, do a division, and then subtract a different offset, but that seems a bit icky. — supercat, Commented Nov 1, 2017 at 22:22
@supercat: An extensive paper on the subject is publicly available if you want to study the technique. It includes cases for both truncated and floored division. — Dolda2000, Commented Nov 1, 2017 at 22:25

hlovdal · Accepted Answer · 2013-03-05 18:00:15Z

2

It could be more efficient to come up with something branch free to correct the result based on the sign, as branches are expensive.

See page 20ff of Chapter 2 in Hacker's Delight on how to access the sign.

edited Mar 5, 2013 at 18:00

hlovdal

27.6k11 gold badges97 silver badges172 bronze badges

answered Nov 5, 2010 at 6:55

starblue

56.2k14 gold badges99 silver badges153 bronze badges

The only thing I can think of involves multiplication of the sign bit with the divisor.
– Vladimir Panteleev
Commented Nov 5, 2010 at 21:22
You can represent certain conditions by 0 or 1 and add these to the result.
– starblue
Commented Nov 6, 2010 at 7:08

Add a comment |

Vladimir Panteleev · Accepted Answer · 2012-01-14 05:38:18Z

1

Just a note: the x86 sar instruction performs floored division when it comes to powers of two.

answered Jan 14, 2012 at 5:38

Vladimir Panteleev

25k6 gold badges76 silver badges119 bronze badges

Add a comment |

High Performance Mark · Accepted Answer · 2010-11-04 23:48:03Z

0

Since IEEE-754 specifies round towards -inf as one of the required rounding modes I imagine that the answer to your question is yes. But perhaps you can explain whether you want to know how one would implement the procedure if one were writing the compiler, or to know how to use a particular compiler to perform the operation ?

answered Nov 4, 2010 at 23:48

High Performance Mark

78k7 gold badges108 silver badges166 bronze badges

Whoops, I meant integer division! I'll amend the question. I'm not sure what does writing compilers have to do with the question, as I've checked and C/C++ does truncated division.
– Vladimir Panteleev
Commented Nov 4, 2010 at 23:49
@CyberShadow: so cast the divisor and dividend to floating-point numbers of some kind, use floor and cast the result back to an integer. Or are your looking for some other kind of answer ?
– High Performance Mark
Commented Nov 5, 2010 at 0:04
2

I don't think that using floating point is going to be faster than checking the sign...
– Vladimir Panteleev
Commented Nov 5, 2010 at 0:14
hardwaresecrets.com/fullimage.php?image=6761 seems to show that double-precision division can be competitive with integer division (there remains the costs of conversions, and possibly that of switching rounding mode each time). However, this approach still doesn't work for the full range of 64-bit integers, and it does not work (exactly) if the target platform does not have (exactly) IEEE-754 floating-point computations (including functions to change rounding mode).
– Pascal Cuoq
Commented Nov 5, 2010 at 8:54
I tested it, and performance was horrible, but I'm probably Doing It Wrong. Have a look at the program I posted in my own answer - would messing with the rounding mode make things considerably faster?
– Vladimir Panteleev
Commented Nov 5, 2010 at 22:15

Add a comment |

Jason S · Accepted Answer · 2010-11-05 02:53:31Z

0

Is it possible to efficiently implement floored or euclidian integer division in C/C++?

Yes.

(the obvious solution is to check the dividend's sign)

I agree completely, and would find it hard to believe there exists an alternative that is significantly faster.

answered Nov 5, 2010 at 2:53

Jason S

188k170 gold badges628 silver badges987 bronze badges

Add a comment |

Collectives™ on Stack Overflow

Efficiently implementing floored / euclidean integer division

6 Answers 6

Not the answer you're looking for? Browse other questions tagged
c++
c
math
division
integer-division
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Not the answer you're looking for? Browse other questions tagged c++cmathdivisioninteger-division or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
c++
c
math
division
integer-division
or ask your own question.