34

Let A be an array that contains an odd number of zeros and ones. If n is the size of A, then A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements 1.

So if n = 9, A would look like this:

0,0,0,0,0,1,1,1,1

The goal is to find the sum of 1s in the array and we do this by using this function:

s = 0;
void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1

    if(curIndex == ceil(n/2)) return;

    if(A[curIndex] == 1) return;

    test1(curIndex+1);
    test1(size-curIndex-1);

    s += A[curIndex+1] + A[size-curIndex-1];

}

This function is rather silly for the problem given, but it's a simulation of a different function that I want to look like this and is producing the same amount of branch mispredictions.

Here is the entire code of the experiment:

#include <iostream>
#include <fstream>

using namespace std;


int size;
int *A;
int half;
int s;

void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1

    if(curIndex == half) return;
    if(A[curIndex] == 1) return;

    test1(curIndex+1);
    test1(size - curIndex - 1);

    s += A[curIndex+1] + A[size-curIndex-1];

}


int main(int argc, char* argv[]){

    size = atoi(argv[1]);
    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];

    half = size/2;
    int i;
    for(i=0;i<=half;i++){
        A[i] = 0;
    }
    for(i=half+1;i<size;i++){
        A[i] = 1;
    }

    for(i=0;i<100;i++) {
        test1(0);
    }
    cout<<s<<endl;

    return 0;
}

Compile by typing g++ -O3 -std=c++11 file.cpp and run by typing ./executable size{odd integer}.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.

Running perf stat -B -e branches,branch-misses ./cachetests 111111 gives me the following:

   Performance counter stats for './cachetests 111111':

    32,639,932      branches                                                    
     1,404,836      branch-misses             #    4.30% of all branches        

   0.060349641 seconds time elapsed

if I remove the line

s += A[curIndex+1] + A[size-curIndex-1];

I get the following output from perf:

  Performance counter stats for './cachetests 111111':

    24,079,109      branches                                                    
        39,078      branch-misses             #    0.16% of all branches        

   0.027679521 seconds time elapsed

What does that line have to do with branch predictions when it's not even an if statement?

The way I see it, in the first ceil(n/2) - 1 calls of test1(), both if statements will be false. In the ceil(n/2)-th call, if(curIndex == ceil(n/2)) will be true. In the remaining n-ceil(n/2) calls, the first statement will be false, and the second statement will be true.

Why does Intel fail to predict such a simple behavior?

Now let's look at a second case. Suppose that A now has alternating zeros and ones. We will always start from 0. So if n = 9 A will look like this:

0,1,0,1,0,1,0,1,0

The function we are going to use is the following:

void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;

    test2(curIndex+1);
    test2(curIndex+2);

    s += A[curIndex+1] + A[curIndex+2];

}

And here is the entire code of the experiment:

#include <iostream>
#include <fstream>

using namespace std;


int size;
int *A;
int s;

void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;

    test2(curIndex+1);
    test2(curIndex+2);

    s += A[curIndex+1] + A[curIndex+2];

}

int main(int argc, char* argv[]){

    size = atoi(argv[1]);
    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];
    int i;
    for(i=0;i<size;i++){
        if(i%2==0){
            A[i] = false;
        }
        else{
            A[i] = true;
        }
    }

    for(i=0;i<100;i++) {
        test2(0);
    }
    cout<<s<<endl;

    return 0;
}

I run perf using the same commands as before:

    Performance counter stats for './cachetests2 111111':

    28,560,183      branches                                                    
        54,204      branch-misses             #    0.19% of all branches        

   0.037134196 seconds time elapsed

And removing that line again improved things a little bit:

   Performance counter stats for './cachetests2 111111':

    28,419,557      branches                                                    
        16,636      branch-misses             #    0.06% of all branches        

   0.009977772 seconds time elapsed

Now if we analyse the function, if(curIndex == size-1) will be false n-1 times, and if(A[curIndex] == 1) will alternate from true to false.

As I see it, both functions should be easy to predict, however this is not the case for the first function. At the same time I am not sure what is happening with that line and why it plays a role in improving branch behavior.

11
  • 4
    are you sure it's dong the right thing? I see that double recursion is going to go over the array twice in the end Commented Sep 15, 2016 at 14:52
  • 2
    What does the different assembler code look like? Commented Sep 15, 2016 at 15:05
  • 1
    in the first function, we increment curIndex if curIndex is not pointing to the last 0 and also is not pointing to a 1. If the array is indexed from 0, the second last 0 will be in position (floor(n/2) - 1) and the highest jump we will make is going to be towards n-(floor(n/2) - 1)-1 = n - floor(n/2) which should point to the element after the last 0. If we are in position 0, we will jump to (n-0-1) which will point to the last element in the array. As for the second function, we do the same, when we reach the last 0, the index will be equal to n-1 so we will stop.
    – jsguy
    Commented Sep 15, 2016 at 15:06
  • 3
    @jsguy It's a pity that no one has answered yet. I would recommend to add the performance tag, which is followed by many, and could therefore attract some who have missed this question. I've already proposed this edit myself, but it has been rejected. I don't want to submit it again, I'll leave it here as a suggestion to you. Your call. Commented Sep 16, 2016 at 23:12
  • 1
    Did you look at it with cachegrind? (valgrind.org/docs/manual/cg-manual.html) Commented Sep 19, 2016 at 11:59

5 Answers 5

28
+100

Here are my thoughts on this after staring at it for a while. First of all, the issue is easily reproducible with -O2, so it's better to use that as a reference, as it generates simple non-unrolled code that is easy to analyse. The problem with -O3 is essentially the same, it's just a bit less obvious.

So, for the first case (half-zeros with half-ones pattern) the compiler generates this code:

 0000000000400a80 <_Z5test1i>:
   400a80:       55                      push   %rbp
   400a81:       53                      push   %rbx
   400a82:       89 fb                   mov    %edi,%ebx
   400a84:       48 83 ec 08             sub    $0x8,%rsp
   400a88:       3b 3d 0e 07 20 00       cmp    0x20070e(%rip),%edi        #
   60119c <half>
   400a8e:       74 4f                   je     400adf <_Z5test1i+0x5f>
   400a90:       48 8b 15 09 07 20 00    mov    0x200709(%rip),%rdx        #
   6011a0 <A>
   400a97:       48 63 c7                movslq %edi,%rax
   400a9a:       48 8d 2c 85 00 00 00    lea    0x0(,%rax,4),%rbp
   400aa1:       00 
   400aa2:       83 3c 82 01             cmpl   $0x1,(%rdx,%rax,4)
   400aa6:       74 37                   je     400adf <_Z5test1i+0x5f>
   400aa8:       8d 7f 01                lea    0x1(%rdi),%edi
   400aab:       e8 d0 ff ff ff          callq  400a80 <_Z5test1i>
   400ab0:       89 df                   mov    %ebx,%edi
   400ab2:       f7 d7                   not    %edi
   400ab4:       03 3d ee 06 20 00       add    0x2006ee(%rip),%edi        #
   6011a8 <size>
   400aba:       e8 c1 ff ff ff          callq  400a80 <_Z5test1i>
   400abf:       8b 05 e3 06 20 00       mov    0x2006e3(%rip),%eax        #
   6011a8 <size>
   400ac5:       48 8b 15 d4 06 20 00    mov    0x2006d4(%rip),%rdx        #
   6011a0 <A>
   400acc:       29 d8                   sub    %ebx,%eax
   400ace:       48 63 c8                movslq %eax,%rcx
   400ad1:       8b 44 2a 04             mov    0x4(%rdx,%rbp,1),%eax
   400ad5:       03 44 8a fc             add    -0x4(%rdx,%rcx,4),%eax
   400ad9:       01 05 b9 06 20 00       add    %eax,0x2006b9(%rip)        #
   601198 <s>
   400adf:       48 83 c4 08             add    $0x8,%rsp
   400ae3:       5b                      pop    %rbx
   400ae4:       5d                      pop    %rbp
   400ae5:       c3                      retq   
   400ae6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   400aed:       00 00 00 

Very simple, kind of what you would expect -- two conditional branches, two calls. It gives us this (or similar) statistics on Core 2 Duo T6570, AMD Phenom II X4 925 and Core i7-4770:

$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45,216,754      branches                                                    
         5,588,484      branch-misses             #   12.36% of all branches        

       0.098535791 seconds time elapsed

If you're to make this change, moving assignment before recursive calls:

 --- file.cpp.orig  2016-09-22 22:59:20.744678438 +0300
 +++ file.cpp   2016-09-22 22:59:36.492583925 +0300
 @@ -15,10 +15,10 @@
      if(curIndex == half) return;
      if(A[curIndex] == 1) return;

 +    s += A[curIndex+1] + A[size-curIndex-1];
      test1(curIndex+1);
      test1(size - curIndex - 1);

 -    s += A[curIndex+1] + A[size-curIndex-1];

  }

The picture changes:

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         39,495,804      branches                                                    
             54,430      branch-misses             #    0.14% of all branches        

        0.039522259 seconds time elapsed

And yes, as was already noted it's directly related to tail recursion optimisation, because if you're to compile the patched code with -fno-optimize-sibling-calls you will get the same "bad" results. So let's look at what do we have in assembly with tail call optimization:

 0000000000400a80 <_Z5test1i>:
   400a80:       3b 3d 16 07 20 00       cmp    0x200716(%rip),%edi        #
   60119c <half>
   400a86:       53                      push   %rbx
   400a87:       89 fb                   mov    %edi,%ebx
   400a89:       74 5f                   je     400aea <_Z5test1i+0x6a>
   400a8b:       48 8b 05 0e 07 20 00    mov    0x20070e(%rip),%rax        #
   6011a0 <A>
   400a92:       48 63 d7                movslq %edi,%rdx
   400a95:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400a99:       74 4f                   je     400aea <_Z5test1i+0x6a>
   400a9b:       8b 0d 07 07 20 00       mov    0x200707(%rip),%ecx        #
   6011a8 <size>
   400aa1:       eb 15                   jmp    400ab8 <_Z5test1i+0x38>
   400aa3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   400aa8:       48 8b 05 f1 06 20 00    mov    0x2006f1(%rip),%rax        #
   6011a0 <A>
   400aaf:       48 63 d3                movslq %ebx,%rdx
   400ab2:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400ab6:       74 32                   je     400aea <_Z5test1i+0x6a>
   400ab8:       29 d9                   sub    %ebx,%ecx
   400aba:       8d 7b 01                lea    0x1(%rbx),%edi
   400abd:       8b 54 90 04             mov    0x4(%rax,%rdx,4),%edx
   400ac1:       48 63 c9                movslq %ecx,%rcx
   400ac4:       03 54 88 fc             add    -0x4(%rax,%rcx,4),%edx
   400ac8:       01 15 ca 06 20 00       add    %edx,0x2006ca(%rip)        #
   601198 <s>
   400ace:       e8 ad ff ff ff          callq  400a80 <_Z5test1i>
   400ad3:       8b 0d cf 06 20 00       mov    0x2006cf(%rip),%ecx        #
   6011a8 <size>
   400ad9:       89 c8                   mov    %ecx,%eax
   400adb:       29 d8                   sub    %ebx,%eax
   400add:       89 c3                   mov    %eax,%ebx
   400adf:       83 eb 01                sub    $0x1,%ebx
   400ae2:       39 1d b4 06 20 00       cmp    %ebx,0x2006b4(%rip)        #
   60119c <half>
   400ae8:       75 be                   jne    400aa8 <_Z5test1i+0x28>
   400aea:       5b                      pop    %rbx
   400aeb:       c3                      retq   
   400aec:       0f 1f 40 00             nopl   0x0(%rax)

It has four conditional branches with one call. So let's analyse the data we've got so far.

First of all, what is a branching instruction from the processor perspective? It's any of call, ret, j* (including direct jmp) and loop. call and jmp are a bit unintuitive, but they are crucial to count things correctly.

Overall, we expect this function to be called 11111100 times, one for each element, that's roughly 11M. In non-tail-call-optimized version we see about 45M branches, initialization in main() is just 111K, all the other things are minor, so the main contribution to this number comes from our function. Our function is call-ed, it evaluates the first je, which is true in all cases except one, then it evaluates the second je, which is true half of the times and then it either calls itself recursively (but we've already counted that the function is invoked 11M times) or returns (as it does after recursive calls. So that's 4 branching instructions per 11M calls, exactly the number we see. Out of these around 5.5M branches are missed, that suggests that these misses all come from one mispredicted instruction, either something that's evaluated 11M times and missed around 50% of the time or something that's evaluated half of the time and missed always.

What do we have in tail-call-optimized version? We have the function called around 5.5M times, but now each invocation incurs one call, two branches initially (first one is true in all cases except one and the second one is always false because of our data), then a jmp, then a call (but we've already counted that we have 5.5M calls), then a branch at 400ae8 and a branch at 400ab6 (always true because of our data), then return. So, on average that's four conditional branches, one unconditional jump, a call and one indirect branch (return from function), 5.5M times 7 gives us an overall count of around 39M branches, exactly as we see in the perf output.

What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches) and it has problems with two function calls. So it suggests that the problem is in returns from the function.

Unfortunately, we know very little about the details of how exactly branch predictors of our modern processors work. The best analysis that I could find is this and it suggests that the processors have a return stack buffer of around 16 entries. If we're to return to our data again with this finding at hand things start to clarify a bit.

When you have half-zeroes with half-ones pattern, you're recursing very deeply into test1(curIndex+1), but then you start returning back and calling test1(size-curIndex-1). That recursion is never deeper than one call, so the returns are predicted perfectly for it. But remember that we're now 55555 invocations deep and the processor only remembers last 16, so it's not surprising that it can't guess our returns starting from 55539-level deep, it's more surprising that it can do so with tail-call-optimized version.

Actually, the behaviour of tail-call-optimized version suggests that missing any other information about returns, the processor just assumes that the right one is the last one seen. It's also proven by the behaviour of non-tail-call-optimized version, because it goes 55555 calls deep into the test1(curIndex+1) and then upon return it always gets one level deep into test1(size-curIndex-1), so when we're up from 55555-deep to 55539-deep (or whatever your processor return buffer is) it calls into test1(size-curIndex-1), returns from that and it has absolutely no information about the next return, so it assumes that we're to return to the last seen address (which is the address to return to from test1(size-curIndex-1)) and it's obviously wrong. 55539 times wrong. With 100 cycles of the function, that's exactly the 5.5M branch prediction misses we see.

Now let's get to your alternating pattern and the code for that. This code is actually very different, if you're to analyse how it goes into the depth. Here you have your test2(curIndex+1) always return immediately and your test2(curIndex+2) to always go deeper. So the returns from test2(curIndex+1) are always predicted perfectly (they just don't go deep enough) and when we're to finish our recursion into test2(curIndex+2), it always returns to the same point, all 55555 times, so the processor has no problems with that.

This can further be proven by this little change to your original half-zeroes with half-ones code:

--- file.cpp.orig       2016-09-23 11:00:26.917977032 +0300
+++ file.cpp    2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
   if(curIndex == half) return;
   if(A[curIndex] == 1) return;

-  test1(curIndex+1);
   test1(size - curIndex - 1);
+  test1(curIndex+1);

   s += A[curIndex+1] + A[size-curIndex-1];

So now the code generated is still not tail-call optimized (assembly-wise it's very similar to the original), but you get something like this in the perf output:

$ perf stat -B -e branches,branch-misses ./a.out 111111 
5555500

 Performance counter stats for './a.out 111111':

        45 308 579      branches                                                    
            75 927      branch-misses             #    0,17% of all branches        

       0,026271402 seconds time elapsed

As expected, now our first call always returns immediately and the second call goes 55555-deep and then only returns to the same point.

Now with that solved let me show something up my sleeve. On one system, and that is Core i5-5200U the non-tail-call-optimized original half-zeroes with half-ones version shows this results:

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         45 331 670      branches                                                    
             16 349      branch-misses             #    0,04% of all branches        

        0,043351547 seconds time elapsed

So, apparently, Broadwell can handle this pattern easily, which returns us to the question of how much do we know about branch prediction logic of our modern processors.

2
  • I guess I got my answer wrong. Since I used an i5-6400, it happened the same as in your testcase with broadwell. GJ with that excellent answer. Commented Sep 30, 2016 at 17:19
  • As a side-note, I stumbled upon this document: agner.org/optimize/microarchitecture.pdf A must read IMHO. Commented Oct 24, 2016 at 21:22
5

Removing the line s += A[curIndex+1] + A[size-curIndex-1]; enables tail recursive optimization. This optimization can only happen then the recursive call is in the last line of the function.

https://en.wikipedia.org/wiki/Tail_call

0
4

Interestingly, in the first execution you have about 30% more branches than in the second execution (32M branches vs 24 Mbranches).

I have generated the assembly code for your application using gcc 4.8.5 and the same flags (plus -S) and there is a significant difference between the assemblies. The code with the conflicting statement is about 572 lines while the code without the same statement is only 409 lines. Focusing on the symbol _Z5test1i -- the decorated C++ name for test1), the routine is 367 lines long while the second case occupies only 202 lines. From all those lines, the first case contains 36 branches (plus 15 call instructions) and the second case contains 34 branches (plus 1 call instruction).

It is also interesting that compiling the application with -O1 does not expose this divergence between the two versions (although the branch mispredict is higher, approx 12%). Using -O2 shows a difference between the two versions (12% vs 3% of branch mispredicts).

I'm not a compiler expert to understand the control flows and logics used by the compiler but it looks like the compiler is able to achieve smarter optimizations (maybe including tail recursive optimizations as pointed out by user1850903 in his answer) when that portion of the code is not present.

4

the problem is this:

if(A[curIndex] == 1) return;

each call of the test function is alternating the result of this comparison, due to some optimizations, since the array is, for example 0,0,0,0,0,1,1,1,1

In other words:

  1. curIndex = 0 -> A[0] = 0
  2. test1(curIndex + 1) -> curIndex = 1 -> A[1] = 0

But then, the processor architecture MIGHT (a big might, cause it depends; for me that optimization is disabled - an i5-6400) have a feature called runahead (performed along branch prediction), which executes the remaining instructions in the pipeline before entering a branch; so it will execute test1(size - curIndex -1) before the offending if statement.

When removing the attribution, then it enters another optimization, as user1850903 said.

4

The following piece of code is tail-recursive: the last line of the function doesn't require a call, simply a branch to the point where the function begins using the first argument:

void f(int i) {
    if (i == size) break;
    s += a[i];
    f(i + 1);
}

However, if we break this and make it non-tail recursive:

void f(int i) {
    if (i == size) break;
    f(i + 1);
    s += a[i];
}

There are a number of reasons why the compiler can't deduce the latter to be tail-recursive, but in the example you've given,

test(A[N]);
test(A[M]);
s += a[N] + a[M];

the same rules apply. The compiler can't determine that this is tail recursive, but more so it can't do it because of the two calls (see before and after).

What you appear to be expecting the compiler to do with this is a function which performs a couple of simple conditional branches, two calls and some load/add/stores.

Instead, the compiler is unrolling this loop and generating code which has a lot of branch points. This is done partly because the compiler believes it will be more efficient this way (involving less branches) but partly because it decreases the runtime recursion depth.

int size;
int* A;
int half;
int s;

void test1(int curIndex){
  if(curIndex == half || A[curIndex] == 1) return;
  test1(curIndex+1);
  test1(size-curIndex-1);
  s += A[curIndex+1] + A[size-curIndex-1];
}

produces:

test1(int):
        movl    half(%rip), %edx
        cmpl    %edi, %edx
        je      .L36
        pushq   %r15
        pushq   %r14
        movslq  %edi, %rcx
        pushq   %r13
        pushq   %r12
        leaq    0(,%rcx,4), %r12
        pushq   %rbp
        pushq   %rbx
        subq    $24, %rsp
        movq    A(%rip), %rax
        cmpl    $1, (%rax,%rcx,4)
        je      .L1
        leal    1(%rdi), %r13d
        movl    %edi, %ebp
        cmpl    %r13d, %edx
        je      .L42
        cmpl    $1, 4(%rax,%r12)
        je      .L42
        leal    2(%rdi), %ebx
        cmpl    %ebx, %edx
        je      .L39
        cmpl    $1, 8(%rax,%r12)
        je      .L39
        leal    3(%rdi), %r14d
        cmpl    %r14d, %edx
        je      .L37
        cmpl    $1, 12(%rax,%r12)
        je      .L37
        leal    4(%rdi), %edi
        call    test1(int)
        movl    %r14d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %esi
        movl    16(%rax,%r12), %edx
        subl    %r14d, %esi
        movslq  %esi, %rsi
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L10:
        movl    %ecx, %edi
        subl    %ebx, %edi
        leal    -1(%rdi), %r14d
        cmpl    %edx, %r14d
        je      .L38
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L38
        call    test1(int)
        movl    %r14d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %edx
        movl    4(%rax,%r15), %esi
        movl    %ecx, %edi
        subl    %r14d, %edx
        subl    %ebx, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, s(%rip)
.L13:
        movslq  %edi, %rdi
        movl    12(%rax,%r12), %r8d
        addl    -4(%rax,%rdi,4), %r8d
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L7:
        movl    %ecx, %ebx
        subl    %r13d, %ebx
        leal    -1(%rbx), %r14d
        cmpl    %edx, %r14d
        je      .L41
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L41
        cmpl    %edx, %ebx
        je      .L18
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r8
        movq    %r8, (%rsp)
        je      .L18
        leal    1(%rbx), %edi
        call    test1(int)
        movl    %ebx, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movq    (%rsp), %r8
        movl    %ecx, %esi
        subl    %ebx, %esi
        movl    4(%rax,%r8), %edx
        movslq  %esi, %rsi
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L18:
        movl    %ecx, %edi
        subl    %r14d, %edi
        leal    -1(%rdi), %ebx
        cmpl    %edx, %ebx
        je      .L40
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r8
        je      .L40
        movq    %r8, (%rsp)
        call    test1(int)
        movl    %ebx, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movq    (%rsp), %r8
        movl    %ecx, %edx
        movl    %ecx, %edi
        subl    %ebx, %edx
        movl    4(%rax,%r8), %esi
        subl    %r14d, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, %r8d
        movl    %esi, s(%rip)
.L20:
        movslq  %edi, %rdi
        movl    4(%rax,%r15), %esi
        movl    %ecx, %ebx
        addl    -4(%rax,%rdi,4), %esi
        subl    %r13d, %ebx
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L16:
        movslq  %ebx, %rbx
        movl    8(%rax,%r12), %edi
        addl    -4(%rax,%rbx,4), %edi
        addl    %edi, %esi
        movl    %esi, s(%rip)
        jmp     .L4
.L45:
        movl    s(%rip), %edx
.L23:
        movslq  %ebx, %rbx
        movl    4(%rax,%r12), %ecx
        addl    -4(%rax,%rbx,4), %ecx
        addl    %ecx, %edx
        movl    %edx, s(%rip)
.L1:
        addq    $24, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
.L36:
        rep ret
.L42:
        movl    size(%rip), %ecx
.L4:
        movl    %ecx, %ebx
        subl    %ebp, %ebx
        leal    -1(%rbx), %r14d
        cmpl    %edx, %r14d
        je      .L45
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L45
        cmpl    %edx, %ebx
        je      .L25
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r13
        je      .L25
        leal    1(%rbx), %esi
        cmpl    %edx, %esi
        movl    %esi, (%rsp)
        je      .L26
        cmpl    $1, 8(%rax,%r15)
        je      .L26
        leal    2(%rbx), %edi
        call    test1(int)
        movl    (%rsp), %esi
        movl    %esi, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movl    (%rsp), %esi
        movq    A(%rip), %rax
        movl    %ecx, %edx
        subl    %esi, %edx
        movslq  %edx, %rsi
        movl    12(%rax,%r15), %edx
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L26:
        movl    %ecx, %edi
        subl    %ebx, %edi
        leal    -1(%rdi), %esi
        cmpl    %edx, %esi
        je      .L43
        movslq  %esi, %r8
        cmpl    $1, (%rax,%r8,4)
        leaq    0(,%r8,4), %r9
        je      .L43
        movq    %r9, 8(%rsp)
        movl    %esi, (%rsp)
        call    test1(int)
        movl    (%rsp), %esi
        movl    %esi, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movl    (%rsp), %esi
        movq    A(%rip), %rax
        movq    8(%rsp), %r9
        movl    %ecx, %edx
        movl    %ecx, %edi
        subl    %esi, %edx
        movl    4(%rax,%r9), %esi
        subl    %ebx, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, s(%rip)
.L28:
        movslq  %edi, %rdi
        movl    4(%rax,%r13), %r8d
        addl    -4(%rax,%rdi,4), %r8d
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L25:
        movl    %ecx, %r13d
        subl    %r14d, %r13d
        leal    -1(%r13), %ebx
        cmpl    %edx, %ebx
        je      .L44
        movslq  %ebx, %rdi
        cmpl    $1, (%rax,%rdi,4)
        leaq    0(,%rdi,4), %rsi
        movq    %rsi, (%rsp)
        je      .L44
        cmpl    %edx, %r13d
        je      .L33
        movslq  %r13d, %rdx
        cmpl    $1, (%rax,%rdx,4)
        leaq    0(,%rdx,4), %r8
        movq    %r8, 8(%rsp)
        je      .L33
        leal    1(%r13), %edi
        call    test1(int)
        movl    %r13d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rdi
        movq    8(%rsp), %r8
        movl    %ecx, %edx
        subl    %r13d, %edx
        movl    4(%rdi,%r8), %eax
        movslq  %edx, %rdx
        addl    -4(%rdi,%rdx,4), %eax
        addl    %eax, s(%rip)
.L33:
        subl    %ebx, %ecx
        leal    -1(%rcx), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %esi
        movl    %ecx, %r13d
        subl    %ebx, %esi
        movq    (%rsp), %rbx
        subl    %r14d, %r13d
        movslq  %esi, %rsi
        movl    4(%rax,%rbx), %edx
        addl    -4(%rax,%rsi,4), %edx
        movl    s(%rip), %esi
        addl    %edx, %esi
        movl    %esi, s(%rip)
.L31:
        movslq  %r13d, %r13
        movl    4(%rax,%r15), %edx
        subl    %ebp, %ecx
        addl    -4(%rax,%r13,4), %edx
        movl    %ecx, %ebx
        addl    %esi, %edx
        movl    %edx, s(%rip)
        jmp     .L23
.L44:
        movl    s(%rip), %esi
        jmp     .L31
.L39:
        movl    size(%rip), %ecx
        jmp     .L7
.L41:
        movl    s(%rip), %esi
        jmp     .L16
.L43:
        movl    s(%rip), %esi
        jmp     .L28
.L38:
        movl    s(%rip), %esi
        jmp     .L13
.L37:
        movl    size(%rip), %ecx
        jmp     .L10
.L40:
        movl    s(%rip), %r8d
        jmp     .L20
s:
half:
        .zero   4
A:
        .zero   8
size:
        .zero   4

For the alternating values case, assuming size == 7:

test1(curIndex = 0)
{
    if (curIndex == size - 1) return;  // false x1
    if (A[curIndex] == 1) return;  // false x1

    test1(curIndex + 1 => 1) {
        if (curIndex == size - 1) return;  // false x2
        if (A[curIndex] == 1) return;  // false x1 -mispred-> returns
    }

    test1(curIndex + 2 => 2) {
        if (curIndex == size - 1) return; // false x 3
        if (A[curIndex] == 1) return;  // false x2
        test1(curIndex + 1 => 3) {
            if (curIndex == size - 1) return;  // false x3
            if (A[curIndex] == 1) return;  // false x2 -mispred-> returns
        }
        test1(curIndex + 2 => 4) {
            if (curIndex == size - 1) return;  // false x4
            if (A[curIndex] == 1) return; // false x3
            test1(curIndex + 1 => 5) {
                if (curIndex == size - 1) return; // false x5
                if (A[curIndex] == 1) return; // false x3 -mispred-> returns
            }
            test1(curIndex + 2 => 6) {
                if (curIndex == size - 1) return; // false x5 -mispred-> returns
            }
            s += A[5] + A[6];
        }
        s += A[3] + A[4];
    }
    s += A[1] + A[2];
}

And lets imagine a case where

size = 11;
A[11] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0 };

test1(0)
  -> test1(1)
       -> test1(2)
            -> test1(3)  -> returns because 1
            -> test1(4)
                 -> test1(5)
                      -> test1(6)
                           -> test1(7) -- returns because 1
                           -> test1(8)
                                -> test1(9) -- returns because 1
                                -> test1(10) -- returns because size-1
                      -> test1(7) -- returns because 1
                 -> test1(6)
                   -> test1(7)
                   -> test1(8)
                        -> test1(9) -- 1
                        -> test1(10) -- size-1
       -> test1(3)  -> returns
  -> test1(2)
       ... as above

or

size = 5;
A[5] = { 0, 0, 0, 0, 1 };

test1(0)
  -> test1(1)
       -> test1(2)
            -> test1(3)
                 -> test1(4)  --  size
                 -> test1(5)  --  UB
            -> test1(4)
       -> test1(3)
            -> test1(4)  -- size
            -> test1(5)  -- UB
  -> test1(2)
       ..

The two cases you've singled out (alternating and half-pattern) are optimal extremes and the compiler has picked some intermediate case that it will try to handle best.

0

Not the answer you're looking for? Browse other questions tagged or ask your own question.