14

Library code:

class Resource
{
public:
    typedef void (*func_sig)(int, char, double, void*);
//Registration
    void registerCallback(void* app_obj, func_sig func)
    {
        _app_obj = app_obj;
        _func = func;
    }

//Calling when the time comes
    void call_app_code()
    {
        _func(231, 'a', 432.4234, _app_obj);
    }
//Other useful methods
private:
    void* _app_obj;
    func_sig _func;
//Other members
};

Application Code:

class App
{
public:
    void callme(int, char, double);
//other functions, members;
};

void callHelper(int i, char c, double d, void* app_obj)
{
    static_cast<App*>(app_obj)->callme(i,c,d);
}

int main()
{
    App a;
    Resource r;
    r.registerCallback(&a, callHelper);
//Do something
}

The above is a minimal implementation of a callback mechanism. It is more verbose and doesn't support binding, placeholders, etc., the way std::function does. If I use std::function or boost::function for the above use case, will there be any performance drawbacks? This callback is going to be on the very, very critical path of a real-time application. I heard that boost::function uses virtual functions to do the actual dispatch. Will that be optimized out if there are no bindings/placeholders involved?

Update

For those interested in inspecting the assemblies in latest compilers: https://gcc.godbolt.org/z/-6mQvt

  • Why not try it and do some benchmarking? Commented Jan 13, 2013 at 18:11
  • How std::function implements the type erasure is implementation-dependent, I believe (and I think Microsoft's uses virtual functions), so the answer might even depend on which platform you are targeting. If I were you, I would run some benchmarks.
    – Andy Prowl
    Commented Jan 13, 2013 at 18:12
  • I agree that benchmarking would show it. I am wondering whether it is theoretically possible for std::function to specialize such cases and be as efficient as a plain function pointer.
    – balki
    Commented Jan 13, 2013 at 20:35
  • @balki: Like "SSO" for std::string, there is the possibility of SFO (small functor optimization) for std::function. This avoids the dynamic memory allocation and speeds up copying std::function objects. If you care about the invocation overhead you should not be using std::function or function pointers but try to use the functors directly; this enables inlining. Anyhow, test it. You might also want to check whether your C++ vendor does SFO for std::function.
    – sellibitze
    Commented Jan 14, 2013 at 9:37

3 Answers

11

I have wondered about this quite often myself, so I wrote a very minimal benchmark that estimates the cost of each function-pointer callback variant with a looped atomic counter.

Keep in mind, these are bare calls to functions that do only one thing: atomically incrementing their counter.

By checking the generated assembler output you may find that a bare C function-pointer loop compiles into 3 CPU instructions,

while a C++11 std::function call adds just 2 more CPU instructions, thus 5 in our example. As a conclusion: it absolutely doesn't matter which function-pointer technique you use; the overhead differences are very small in any case.

((Confusing, however, is that the assigned lambda expression seems to run faster than the others, even the C one.))

Compile the example with: clang++ -o tests/perftest-fncb tests/perftest-fncb.cpp -std=c++11 -pthread -lpthread -lrt -O3 -march=native -mtune=native

#include <functional>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef unsigned long long counter_t;

struct Counter {
    volatile counter_t bare;
    volatile counter_t cxx;
    volatile counter_t cxo1;
    volatile counter_t virt;
    volatile counter_t lambda;

    Counter() : bare(0), cxx(0), cxo1(0), virt(0), lambda(0) {}
} counter;

void bare(Counter* counter) { __sync_fetch_and_add(&counter->bare, 1); }
void cxx(Counter* counter) { __sync_fetch_and_add(&counter->cxx, 1); }

struct CXO1 {
    void cxo1(Counter* counter) { __sync_fetch_and_add(&counter->cxo1, 1); }
    virtual void virt(Counter* counter) { __sync_fetch_and_add(&counter->virt, 1); }
} cxo1;

void (*bare_cb)(Counter*) = nullptr;
std::function<void(Counter*)> cxx_cb;
std::function<void(Counter*)> cxo1_cb;
std::function<void(Counter*)> virt_cb;
std::function<void(Counter*)> lambda_cb;

void* bare_main(void* p) { while (true) { bare_cb(&counter); } }
void* cxx_main(void* p) { while (true) { cxx_cb(&counter); } }
void* cxo1_main(void* p) { while (true) { cxo1_cb(&counter); } }
void* virt_main(void* p) { while (true) { virt_cb(&counter); } }
void* lambda_main(void* p) { while (true) { lambda_cb(&counter); } }

int main()
{
    pthread_t bare_thread;
    pthread_t cxx_thread;
    pthread_t cxo1_thread;
    pthread_t virt_thread;
    pthread_t lambda_thread;

    bare_cb = &bare;
    cxx_cb = std::bind(&cxx, std::placeholders::_1);
    cxo1_cb = std::bind(&CXO1::cxo1, &cxo1, std::placeholders::_1);
    virt_cb = std::bind(&CXO1::virt, &cxo1, std::placeholders::_1);
    lambda_cb = [](Counter* counter) { __sync_fetch_and_add(&counter->lambda, 1); };

    pthread_create(&bare_thread, nullptr, &bare_main, nullptr);
    pthread_create(&cxx_thread, nullptr, &cxx_main, nullptr);
    pthread_create(&cxo1_thread, nullptr, &cxo1_main, nullptr);
    pthread_create(&virt_thread, nullptr, &virt_main, nullptr);
    pthread_create(&lambda_thread, nullptr, &lambda_main, nullptr);

    for (unsigned long long n = 1; true; ++n) {
        sleep(1);
        Counter c = counter;

        printf(
            "%15llu bare function pointer\n"
            "%15llu C++11 function object to bare function\n"
            "%15llu C++11 function object to object method\n"
            "%15llu C++11 function object to object method (virtual)\n"
            "%15llu C++11 function object to lambda expression %30llu-th second.\n\n",
            c.bare, c.cxx, c.cxo1, c.virt, c.lambda, n
        );
    }
}
  • Um, if you are using C++11, why in god's name are you using volatile? Commented Oct 17, 2013 at 11:37
  • Would the results be different if your member functions were const?
    – masaers
    Commented Dec 11, 2013 at 6:50
  • @TimSeguine I want the compiler not to cache the variables in registers, since they're used from the worker threads AND the main thread (which periodically reads them to print the stats). If I'd used std::atomic<> then the volatile keyword would have been unnecessary. Commented Feb 3, 2014 at 10:54
  • @masaers No, but it is indeed good practice to mark a method const when it doesn't modify the object; it makes the code cleaner and avoids future bugs (but that's outside this story :-) Commented Feb 3, 2014 at 10:56
  • ((Confusing however is that the assigned lambda expression seems to run faster than the others, even than the C-one.)): The problem is probably the concurrency. I don't exactly know what happens, but when you run the threads one-at-a-time I get a result like this: 1954073390 bare function pointer 1952530828 C++11 function object to bare function 1953096356 C++11 function object to object method 1953336344 C++11 function object to object method (virtual) 1951464452 C++11 function object to lambda expression 10-th second. All very closely together. Commented Nov 8, 2014 at 8:51
9

std::function performs type erasure on the function type, and there is more than one way to implement it, so to get an exact answer you should perhaps state which version of which compiler you are using.

boost::function is largely identical to std::function and comes with an FAQ entry on call overhead as well as a general section on performance. These give some hints about how a function object performs. Whether they apply in your case depends on your implementation, but the numbers shouldn't be significantly different.

  • Btw, the FAQ says: "The cost of boost::function can be reasonably consistently measured at around 20ns +/- 10 ns on a modern >2GHz platform versus directly inlining the code." That's not a great statement IMO: it gives no relative estimates and compares only to inlining, not to non-virtual function calls.
    – Andy Prowl
    Commented Jan 13, 2013 at 18:15
  • @AndyProwl Yes, but such statements are incredibly hard to make, and the benchmarks are really hard to write and usually also compiler-version dependent. It is better than no statement at all.
    – pmr
    Commented Jan 13, 2013 at 18:17
  • I believe the Boost people would be happy about some benchmark code submitted as patch, so people can measure the actual impact on their particular platform. Commented Jan 13, 2013 at 18:24
  • @balki If what is theoretically possible? To write a benchmark? Certainly, but it is tricky. It depends on what you care about: size, call speed, speed of copies/moves?
    – pmr
    Commented Jan 13, 2013 at 21:02
7

I ran a quick benchmark using Google Benchmark. These are the results:

Run on (4 X 2712 MHz CPU s)
----------------------------------------------------------
Benchmark                   Time           CPU Iterations
----------------------------------------------------------
RawFunctionPointer         11 ns         11 ns   56000000
StdBind                    12 ns         12 ns   64000000
StdFunction                11 ns         11 ns   56000000
Lambda                      9 ns          9 ns   64000000

It seems that the fastest solution is using a lambda (just like user christianparpart mentioned in this thread). The code I used for the benchmark can be found below.

#include <benchmark/benchmark.h>

#include <cstdlib>
#include <cstdio>
#include <functional>

static volatile int global_var = 0;

void my_int_func(int x)
{
    global_var = x + x + 3;
    benchmark::DoNotOptimize(global_var);
    benchmark::DoNotOptimize(x);
}

static void RawFunctionPointer(benchmark::State &state)
{
    void (*bar)(int) = &my_int_func;
    srand (time(nullptr));
    for (auto _ : state)
    {
        bar(rand());
        benchmark::DoNotOptimize(my_int_func);
        benchmark::DoNotOptimize(bar);
    }
}

static void StdFunction(benchmark::State &state)
{
    std::function<void(int)> bar = my_int_func;
    srand (time(nullptr));
    for (auto _ : state)
    {
        bar(rand());
        benchmark::DoNotOptimize(my_int_func);
        benchmark::DoNotOptimize(bar);
    }
}

static void StdBind(benchmark::State &state)
{
    auto bar = std::bind(my_int_func, std::placeholders::_1);
    srand (time(nullptr));
    for (auto _ : state)
    {
        bar(rand());
        benchmark::DoNotOptimize(my_int_func);
        benchmark::DoNotOptimize(bar);
    }
}

static void Lambda(benchmark::State &state)
{
    auto bar = [](int x) {
        global_var = x + x + 3;
        benchmark::DoNotOptimize(global_var);
        benchmark::DoNotOptimize(x);
    };
    srand (time(nullptr));
    for (auto _ : state)
    {
        bar(rand());
        benchmark::DoNotOptimize(my_int_func);
        benchmark::DoNotOptimize(bar);
    }
}


BENCHMARK(RawFunctionPointer);
BENCHMARK(StdBind);
BENCHMARK(StdFunction);
BENCHMARK(Lambda);

BENCHMARK_MAIN();
  • Glad to see this, but I would suggest removing rand from the loop in the test, since it's very slow and costs much of the run time. Commented Sep 8, 2020 at 6:16
  • It's used in every benchmark function as the argument to the function. Why do you consider rand() to disturb the benchmark results? Commented Sep 9, 2020 at 5:59
  • @KamilKuczaj it's a function that by no means guarantees constant time, and it varies by far more than the overhead being measured. Also, you're "DoNotOptimize"-ing a lot of things, most of which you really shouldn't if you want to benchmark invocation. Also, it's useless to the benchmark? Commented Jul 22, 2023 at 17:41
  • Replacing rand by a simple rng_state = (rng_state << 13) ^ 0xDEADCAFE; bar(rng_state); results in overall execution that is two to three times faster than with rand. So your benchmark is almost certainly dominated by the overhead of calling rand() in that context (which may very well be a question of efficient stack usage!). Commented Jul 22, 2023 at 17:53
  • quick-bench.com/q/EgY8KtLhv6MnWYkArIOCtfTFI8U Commented Jul 22, 2023 at 18:02
