
I have a class with an enum member variable. One of the member functions bases its behavior on this enum, so as a "possible" optimization I split the two behaviors into two different functions and give the class a function-pointer member which is set at construction. I simulated this situation like this:

#include <cstdint>

uint64_t funcA();   // the real functions are not shown; declarations assumed
uint64_t funcB();

enum catMode {MODE_A, MODE_B};

struct cat
{
    cat(catMode mode) : stamp_(0), mode_(mode) {}

    void
    update()
    {
        stamp_ = (mode_ == MODE_A) ? funcA() : funcB();
    }

    uint64_t stamp_;
    catMode  mode_;
};

struct cat2
{
    cat2(catMode mode) : stamp_(0), mode_(mode)
    {
        if (mode_ == MODE_A)
            func_ = funcA;
        else
            func_ = funcB;
    }

    void
    update()
    {
        stamp_ = func_();
    }

    uint64_t stamp_;
    catMode  mode_;
    uint64_t (*func_)(void);
};

And then I create a cat object and an array of length 32. I traverse the array to bring it into cache, then I call the cat's update method 32 times and store each call's latency, measured with rdtsc, in the array...

Then I call a function which loops several hundred times using rand(), usleep(), and some arbitrary strcmp() calls... come back, and do the 32-sample measurement again.
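
Roughly, the measurement loop looks like the sketch below. This is a reconstruction, not my exact code: the rdtsc() helper, the trivial funcA/funcB stand-ins, and the loop structure are assumptions, and it relies on the cat/cat2 definitions above.

#include <cstdint>
#include <cstdio>

// GCC-style inline-asm rdtsc helper for x86/x86-64 (assumed; not the original timing code)
static inline uint64_t rdtsc()
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// trivial stand-ins for the real, unshown functions
uint64_t funcA() { return 1; }
uint64_t funcB() { return 2; }

int main()
{
    cat c(MODE_A);                 // or cat2, to time the function-pointer version

    uint64_t lat[32];
    for (int i = 0; i < 32; ++i)   // traverse once so the array is in cache
        lat[i] = 0;

    for (int i = 0; i < 32; ++i)
    {
        uint64_t t0 = rdtsc();
        c.update();
        lat[i] = rdtsc() - t0;     // store the latency of this call
    }

    for (int i = 0; i < 32; ++i)
        std::printf("%llu\n", (unsigned long long)lat[i]);
    return 0;
}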

The result is that the method with the branch seems to always be around 44 +/- 10 cycles, whereas the one with the function pointer tends to be around 130. I'm curious as to why this would be the case.

If anything, I would have expected similar performance. Also, templating is hardly an option because full specialization of the real cat class for that one function would be overkill.

  • @sixlettervariables - it's possibly more that the CPU has branch prediction, while a function pointer call involves clearing stack/registers. You wouldn't necessarily see this from the assembly. Commented Jul 25, 2012 at 14:38
  • Depending on funcA's and funcB's complexity, they could have been inlined. I agree with @sixlettervariables that checking the assembly is a good idea. It's easy to do, and if nothing else, at least you'll rule out some possible explanations. Commented Jul 25, 2012 at 14:49
  • You can avoid full specialization if you use a templated struct to wrap the specific function. Then you only need to specialize that struct, and the cat class itself needs to be written only once (see the sketch after these comments).
    – tokage
    Commented Jul 25, 2012 at 14:56
  • What processor is this? Also, can you provide a complete SSCCE so I can run this myself?
    – Mysticial
    Commented Jul 25, 2012 at 15:07
  • @tokage wait that is interesting, where can I read about this or how does this work? Commented Jul 25, 2012 at 22:09
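
A minimal sketch of the templated-struct idea tokage describes above. The names modeFunc and cat3 are invented for illustration, and it assumes the catMode enum and the funcA/funcB declarations from the question; only the small modeFunc wrapper is specialized per mode, while cat3 itself is written once and the call is resolved at compile time, so it can be inlined.

template <catMode M>
struct modeFunc;                       // primary template; only the specializations are defined

template <>
struct modeFunc<MODE_A>
{
    static uint64_t call() { return funcA(); }
};

template <>
struct modeFunc<MODE_B>
{
    static uint64_t call() { return funcB(); }
};

template <catMode M>
struct cat3                            // written once; no full specialization of the class
{
    cat3() : stamp_(0) {}

    void update() { stamp_ = modeFunc<M>::call(); }   // mode resolved at compile time

    uint64_t stamp_;
};

// usage:
//   cat3<MODE_A> a;
//   a.update();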

1 Answer


Without a complete SSCCE I can't approach this the same way that I usually do with such questions.
So the best I can do is speculate:

The core difference between your two cases is that you have a branch vs. a function pointer. The fact that you are seeing a difference at all strongly hints that funcA() and funcB() are very small functions.

Possibility #1:

In the branch version of the code, funcA() and funcB() are being inlined by the compiler. Not only does that skip the function call overhead, but if the functions are trivial enough, the branch itself could be completely optimized out as well.

Function pointers, on the other hand, cannot be inlined unless the compiler can resolve them at compile-time.
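
For illustration, here is a tiny example of that difference. The names branchy and through_pointer are made up, and the trivial bodies stand in for funcA() and funcB():

#include <cstdint>

static uint64_t funcA() { return 1; }
static uint64_t funcB() { return 2; }

uint64_t branchy(bool a)
{
    // bodies visible here: the compiler may inline both calls and even
    // fold the branch into a conditional move or plain arithmetic
    return a ? funcA() : funcB();
}

uint64_t through_pointer(uint64_t (*f)(void))
{
    // target unknown at compile time: stays an indirect call, no inlining
    return f();
}

Compiling something like this with g++ -O2 -S and comparing the assembly for the two functions is the quickest way to check whether this is what's happening, which is essentially what the commenters suggested.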

Possibility #2:

By comparing a branch against a function-pointer, you are putting the branch-predictor against the branch target predictor.

Branch target prediction is not the same as branch prediction. In the branch case, the processor needs to predict which way to branch. In the function pointer case, it needs to predict where to branch to.

It's very likely that your processor's branch predictor is much more accurate than its branch target predictor. But then again, this is all guesswork...

  • Thanks! I wish I had seen this before. What is a complete SSCCE? Yeah, funcA and funcB are relatively small. It is a 2-CPU Intel Xeon system with 6 cores per CPU. Commented Jul 25, 2012 at 22:06
  • SSCCE = Short Self Contained Compilable Example, so basically a small snippet that reproduces the problem that you're having.
    – Mysticial
    Commented Jul 25, 2012 at 23:05
  • What is a good way/resource to learn the strong points, latencies, and so on of your processor, for example a Xeon or Sandy Bridge? I think it would help make better (albeit non-portable) optimization decisions. Commented Jul 26, 2012 at 21:45
  • Agner Fog's blog has some very good resources. He's got a large table of latencies for the majority of current x86 processors. But be aware that performance is a lot more complicated than just adding up latencies. So you'll have to do a lot of testing and experimentation.
    – Mysticial
    Commented Jul 26, 2012 at 21:52
