Consider this simple loop:
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 959; i++)
        p += 1;
    return p;
}
If you compile this with gcc 7 (snapshot) or clang (trunk) using -march=core-avx2 -Ofast, you get something very similar to:
.LCPI0_0:
.long 1148190720 # float 960
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
ret
In other words it just sets the answer to 960 without looping.
However, if you raise the loop limit from 959 to 960:
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 960; i++)
        p += 1;
    return p;
}
The generated assembly actually performs the loop sum. For example, clang gives:
.LCPI0_0:
.long 1065353216 # float 1
.LCPI0_1:
.long 1086324736 # float 6
f: # @f
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
vxorps ymm1, ymm1, ymm1
mov eax, 960
vbroadcastss ymm2, dword ptr [rip + .LCPI0_1]
vxorps ymm3, ymm3, ymm3
vxorps ymm4, ymm4, ymm4
.LBB0_1: # =>This Inner Loop Header: Depth=1
vaddps ymm0, ymm0, ymm2
vaddps ymm1, ymm1, ymm2
vaddps ymm3, ymm3, ymm2
vaddps ymm4, ymm4, ymm2
add eax, -192
jne .LBB0_1
vaddps ymm0, ymm1, ymm0
vaddps ymm0, ymm3, ymm0
vaddps ymm0, ymm4, ymm0
vextractf128 xmm1, ymm0, 1
vaddps ymm0, ymm0, ymm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddps ymm0, ymm0, ymm1
vhaddps ymm0, ymm0, ymm0
vzeroupper
ret
Why does this happen, and why is it exactly the same for clang and gcc?
The limit for the same loop if you replace float with double is 479. This is again the same for gcc and clang.
Update 1
It turns out that gcc 7 (snapshot) and clang (trunk) behave very differently. As far as I can tell, clang optimizes out the loop for all limits less than 960. gcc, on the other hand, is sensitive to the exact value and doesn't have an upper limit. For example, it does not optimize out the loop when the limit is 200 (as well as for many other values), but it does when the limit is 202 or 20002 (as well as for many other values).