
Consider this simple loop:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 959; i++)
    p += 1;
  return p;
}

If you compile it with gcc 7 (snapshot) or clang (trunk) with -march=core-avx2 -Ofast, you get something very similar to:

.LCPI0_0:
        .long   1148190720              # float 960
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret

In other words, it just sets the answer to 960 without looping.

However if you change the code to:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 960; i++)
    p += 1;
  return p;
}

The produced assembly actually performs the loop. For example, clang gives:

.LCPI0_0:
        .long   1065353216              # float 1
.LCPI0_1:
        .long   1086324736              # float 6
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        vxorps  ymm1, ymm1, ymm1
        mov     eax, 960
        vbroadcastss    ymm2, dword ptr [rip + .LCPI0_1]
        vxorps  ymm3, ymm3, ymm3
        vxorps  ymm4, ymm4, ymm4
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vaddps  ymm0, ymm0, ymm2
        vaddps  ymm1, ymm1, ymm2
        vaddps  ymm3, ymm3, ymm2
        vaddps  ymm4, ymm4, ymm2
        add     eax, -192
        jne     .LBB0_1
        vaddps  ymm0, ymm1, ymm0
        vaddps  ymm0, ymm3, ymm0
        vaddps  ymm0, ymm4, ymm0
        vextractf128    xmm1, ymm0, 1
        vaddps  ymm0, ymm0, ymm1
        vpermilpd       xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
        vaddps  ymm0, ymm0, ymm1
        vhaddps ymm0, ymm0, ymm0
        vzeroupper
        ret

Why is this, and why is it exactly the same for clang and gcc?


If you replace float with double, the limit for the same loop is 479. Again, this is the same for gcc and clang.

Update 1

It turns out that gcc 7 (snapshot) and clang (trunk) behave very differently. clang optimizes out the loop for all limits less than 960, as far as I can tell. gcc, on the other hand, is sensitive to the exact value and doesn't have an upper limit. For example, it does not optimize out the loop when the limit is 200 (as well as many other values), but it does when the limit is 202 or 20002 (as well as many other values).

  • What Sulthan probably means is that 1) the compiler unrolls the loop and 2) once it's unrolled, it sees that the sum operations can be grouped into one. If the loop isn't unrolled, the operations cannot be grouped. Commented Feb 10, 2017 at 12:39
  • Having an odd number of loop iterations makes unrolling more complicated; the last few iterations have to be handled specially. That might well be enough to bump the optimizer into a mode where it can no longer recognize the shortcut. Quite likely, it first has to add the code for the special case and would then have to remove it again. Using the optimizer between the ears is always best :) Commented Feb 10, 2017 at 13:02
  • @HansPassant It is also optimized for any number smaller than 959.
    – Simd
    Commented Feb 10, 2017 at 13:05
  • Wouldn't this usually be done with induction variable elimination, instead of unrolling an insane amount? Unrolling by a factor of 959 is crazy.
    – user555045
    Commented Feb 10, 2017 at 13:06
  • @eleanora I played with that Compiler Explorer and the following seems to hold (talking about the gcc snapshot only): if the loop count is a multiple of 4 and at least 72, then the loop is not unrolled (or rather, unrolled by a factor of 4); otherwise, the whole loop is replaced by a constant - even if the loop count is 2000000001. My suspicion: premature optimization (as in, a premature "hey, a multiple of 4, that's good for unrolling" that blocks further optimization, versus a more thorough "what's the deal with this loop anyway?"). Commented Feb 11, 2017 at 22:10

3 Answers


TL;DR

By default, the current GCC 7 snapshot behaves inconsistently, while earlier versions have a default limit due to PARAM_MAX_COMPLETELY_PEEL_TIMES, which is 16. It can be overridden from the command line.

The rationale for the limit is to prevent overly aggressive loop unrolling, which can be a double-edged sword.

GCC version <= 6.3.0

The relevant optimization option for GCC is -fpeel-loops, which is enabled indirectly along with -Ofast (emphasis mine):

Peels loops for which there is enough information that they do not roll much (from profile feedback or static analysis). It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations).

Enabled with -O3 and/or -fprofile-use.

More details can be obtained by adding -fdump-tree-cunroll:

$ head test.c.151t.cunroll 

;; Function f (f, funcdef_no=0, decl_uid=1919, cgraph_uid=0, symbol_order=0)

Not peeling: upper bound is known so can unroll completely

The message is from /gcc/tree-ssa-loop-ivcanon.c:

if (maxiter >= 0 && maxiter <= npeel)
    {
      if (dump_file)
        fprintf (dump_file, "Not peeling: upper bound is known so can "
         "unroll completely\n");
      return false;
    }

hence the try_peel_loop function returns false.

More verbose output can be obtained with -fdump-tree-cunroll-details:

Loop 1 iterates 959 times.
Loop 1 iterates at most 959 times.
Not unrolling loop 1 (--param max-completely-peeled-times limit reached).
Not peeling: upper bound is known so can unroll completely

It is possible to tweak the limits by playing with the max-completely-peeled-insns=n and max-completely-peel-times=n parameters:

max-completely-peeled-insns

The maximum number of insns of a completely peeled loop.

max-completely-peel-times

The maximum number of iterations of a loop to be suitable for complete peeling.

To learn more about insns, you can refer to GCC Internals Manual.

For instance, if you compile with the following options:

-march=core-avx2 -Ofast --param max-completely-peeled-insns=1000 --param max-completely-peel-times=1000

then the code turns into:

f:
        vmovss  xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1148207104

Clang

I am not sure what Clang actually does or how to tweak its limits, but as I observed, you can force it to evaluate the final value by marking the loop with an unroll pragma, after which it removes the loop completely:

#pragma unroll
for (int i = 0; i < 960; i++)
    p++;

results in:

.LCPI0_0:
        .long   1148207104              # float 961
f:                                      # @f
        vmovss  xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret
  • Thank you for this very nice answer. As others have pointed out, gcc seems to be sensitive to the exact limit size. For example, it fails to eliminate the loop for 912: godbolt.org/g/EQJHvT . What does -fdump-tree-cunroll-details say in that case?
    – Simd
    Commented Feb 10, 2017 at 21:41
  • In fact even 200 has this problem. This is all in a snapshot of gcc 7 that godbolt provides. godbolt.org/g/Vg3SVs This doesn't apply to clang at all.
    – Simd
    Commented Feb 10, 2017 at 21:56
  • You explain the mechanics of the peeling, but not what the relevance of 960 is, or why there is a limit at all.
    – M.M
    Commented Feb 10, 2017 at 23:06
  • @M.M: The peeling behavior is completely different between GCC 6.3.0 and the latest snapshot. In the case of the former, I strongly suspect that the hard-coded limit is enforced by the PARAM_MAX_COMPLETELY_PEEL_TIMES param, which is defined in /gcc/params.def:321 with the value 16. Commented Feb 10, 2017 at 23:16
  • You might want to mention why GCC deliberately limits itself in this way. Specifically, if you unroll your loops too aggressively, the binary gets bigger and you're less likely to fit into L1 cache. Cache misses are potentially quite expensive relative to saving a few conditional jumps, assuming good branch prediction (which you will have, for a typical loop).
    – Kevin
    Commented Feb 11, 2017 at 5:58

After reading Sulthan's comment, I guess that:

  1. The compiler fully unrolls the loop if the loop counter is constant (and not too high)

  2. Once it's unrolled, the compiler sees that the sum operations can be grouped into one.

If the loop isn't unrolled for some reason (here: unrolling would generate too many statements, on the order of 1000), the operations cannot be grouped.

The compiler could see that unrolling into ~1000 statements amounts to a single addition, but steps 1 and 2 described above are two separate optimizations, so it cannot take the "risk" of unrolling without knowing whether the operations can be grouped (for example, a function call could not be grouped).

Note: this is a corner case: who uses a loop to add the same thing over and over again? In that case, don't rely on the compiler's possible unrolling/optimization; directly write the proper operation in one instruction.

  • Can you focus on that "not too high" part? I mean, why is the risk not there in the case of 100? I guessed at something in my comment above; could that be the reason? Commented Feb 10, 2017 at 12:49
  • I think the compiler isn't aware of the floating-point inaccuracy it could trigger. I guess it's just an instruction-size limit: you have max-unrolled-insns alongside max-unrolled-times. Commented Feb 10, 2017 at 13:03
  • Ah, that was roughly my thought or guess... I wish for a clearer reasoning. Commented Feb 10, 2017 at 13:06
  • Interestingly, if you change the float to an int, gcc is able to strength-reduce the loop regardless of the iteration count, due to its induction variable optimizations (-fivopts). But those don't seem to work for floats. Commented Feb 10, 2017 at 17:29
  • @CortAmmon Right, and I recall reading some people who were surprised and upset that GCC uses MPFR to precisely calculate very large numbers, giving rather different results than the equivalent floating point operations, which would have accumulated error and precision loss. Goes to show that many people compute floating point the wrong way.
    – Zan Lynx
    Commented Feb 10, 2017 at 22:31

Very good question!

You seem to have hit a limit on the number of iterations or operations the compiler is willing to unroll when simplifying the code. As documented by Grzegorz Szpetkowski, there are compiler-specific ways to tweak these limits with pragmas or command-line options.

You can also play with Godbolt's Compiler Explorer to compare how different compilers and options affect the generated code: gcc 6.2 and icc 17 still eliminate the loop for 960, whereas clang 3.9 does not (with the default Godbolt configuration, it actually stops eliminating it at 73).

  • I have edited the question to make clear which versions of gcc and clang I was using. See godbolt.org/g/FfwWjL . I am using -Ofast, for example.
    – Simd
    Commented Feb 10, 2017 at 16:36
