I want to use Callgrind to profile my program, but it slows the program down too much. What I want to do is generate a call graph in KCachegrind where every node shows what percentage of its time the program spent in each function. Can you tell me which features I can safely disable for better performance so that this info is still generated?

Thanks a lot!

1 Answer

Quick Overview

Callgrind is essentially an instruction-level profiler, with optional cache (both instruction and data) and branch simulation, that works at function-level granularity in order to reproduce the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.

However, this fine-grained event simulation comes at a heavy cost in program runtime. You should know that even with all profiling turned off and no useful data being collected, Callgrind still slows the program down by about 2-4x at minimum. When actively collecting data, expect it to run 10-20x slower on average.

Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, speeding up large, uninteresting chunks of your program to only a 2-4x slowdown sounds reasonable, read on!

Available Hooks

Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:

  1. Instrumentation state - When disabled, no program actions are observed and thus no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum I mentioned above (see Nulgrind).

    But be warned, this should be used carefully! While it offers attractive benefits, this can have non-trivial effects on accuracy. From the documentation:

    However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrinds internal cache of instrumented code blocks, resulting in latency penalty at switching time.

  2. Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to streamline collected data to only the interesting parts of your call stack.

    However, as you might expect, this does not offer any noticeable speedup in execution time, because events are still triggered and only the counter updates are skipped. And of course, instrumentation needs to be switched on for collection to have any effect.

Commands

valgrind --tool=callgrind   
    --instr-atstart=<yes|no>     ;; default = yes 
    --collect-atstart=<yes|no>   ;; default = yes
    --toggle-collect=<function>  ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>

Instrumentation - If you turn this off at the start, you have to turn it back on again at the appropriate time. There are 2 alternate ways to do this:

  1. During program execution, use the following command from the shell at the appropriate time.

    callgrind_control -i <on|off>
    

    This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.

  2. Insert the following macros into your program code and recompile your binary.

    CALLGRIND_START_INSTRUMENTATION;
    CALLGRIND_STOP_INSTRUMENTATION;
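
    As a rough sketch of this second option, the macros can wrap a single region so that only that region pays the full instrumentation cost. The names `load_input`, `hot_loop` and `run_profiled` are hypothetical stand-ins, not anything from Callgrind. The `__has_include` fallback is my own assumption, added so the file still builds on machines without the Valgrind headers (it needs a compiler that supports `__has_include`, such as a recent GCC or Clang).

```c
#include <stddef.h>

/* Use the real client-request macros when the Valgrind headers are installed;
   otherwise fall back to no-ops so the program still builds (assumption). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION  ((void)0)
#endif

/* Hypothetical uninteresting setup: fills the buffer with 0..n-1. */
static void load_input(long *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        buf[i] = (long)i;
    }
}

/* Hypothetical hot region that we actually want to profile. */
static long hot_loop(const long *buf, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i] * 2;
    }
    return sum;
}

long run_profiled(long *buf, size_t n) {
    load_input(buf, n);               /* uninstrumented when launched with --instr-atstart=no */

    CALLGRIND_START_INSTRUMENTATION;  /* events are observed from here on */
    long result = hot_loop(buf, n);
    CALLGRIND_STOP_INSTRUMENTATION;   /* back to the cheap 'idle' state */

    return result;
}
```

    Outside of Valgrind the macros do nothing useful, so the same binary can be run natively. Keep the switches coarse, since every mode change resets the simulator state as quoted above.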
    

Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. 2 alternate ways to do this:

  1. Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.

    Tip: Wildcards are supported in the function name!

  2. Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.

    CALLGRIND_TOGGLE_COLLECT;
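
    To illustrate the finer-grained control, here is a minimal sketch that collects events for only one loop inside a function. `process` is a hypothetical function, and the no-op fallback is again an assumption so the file compiles without the Valgrind headers. It assumes the program is launched with --collect-atstart=no, so the first toggle switches collection on and the second switches it off.

```c
#include <stddef.h>

/* Real macro if the Valgrind headers are available, no-op otherwise (assumption). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_TOGGLE_COLLECT
#  define CALLGRIND_TOGGLE_COLLECT ((void)0)
#endif

/* Hypothetical function: validates the input, then computes a sum of squares.
   Returns -1 if any element is negative. */
long process(const int *data, size_t n) {
    /* Cheap validation pass: instrumented, but not collected. */
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0) {
            return -1;
        }
    }

    long checksum = 0;
    CALLGRIND_TOGGLE_COLLECT;         /* collection on */
    for (size_t i = 0; i < n; i++) {
        checksum += (long)data[i] * data[i];
    }
    CALLGRIND_TOGGLE_COLLECT;         /* collection off again */

    return checksum;
}
```

    If the whole of `process` were the interesting part, passing --toggle-collect=process at launch would do the same job at function granularity without recompiling. Note that the toggles must stay balanced, which is why the early return sits before the first one.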
    

Summary

To combine all the ideas above, a good approach would be:

#include <valgrind/callgrind.h>

// Uninteresting program chunk

CALLGRIND_START_INSTRUMENTATION;

// A few extra lines to allow cache warm-up

CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;

CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;

// Rest of the program

Recompile, and launch Callgrind with:

valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>

Note that there will be 2 Callgrind output files generated by this method - the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second log will report 0 events.

Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.
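
A minimal sketch of that multi-chunk variant is below. It uses CALLGRIND_DUMP_STATS_AT, which behaves like CALLGRIND_DUMP_STATS but takes a string so the dumps can be told apart. `phase_a`, `phase_b` and `run_phases` are hypothetical names, and the no-op fallback macros are an assumption so the file builds without the Valgrind headers.

```c
/* Real macros if the Valgrind headers are available, no-ops otherwise (assumption). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_DUMP_STATS_AT
#  define CALLGRIND_START_INSTRUMENTATION   ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION    ((void)0)
#  define CALLGRIND_TOGGLE_COLLECT          ((void)0)
#  define CALLGRIND_DUMP_STATS_AT(pos_str)  ((void)0)
#endif

/* Hypothetical first chunk: sum of 0..rounds-1. */
static long phase_a(int rounds) {
    long acc = 0;
    for (int i = 0; i < rounds; i++) {
        acc += i;
    }
    return acc;
}

/* Hypothetical second chunk: sum of squares of 0..rounds-1. */
static long phase_b(int rounds) {
    long acc = 0;
    for (int i = 0; i < rounds; i++) {
        acc += (long)i * i;
    }
    return acc;
}

long run_phases(int rounds) {
    long total = 0;

    CALLGRIND_START_INSTRUMENTATION;

    CALLGRIND_TOGGLE_COLLECT;              /* collect chunk 1 */
    total += phase_a(rounds);
    CALLGRIND_TOGGLE_COLLECT;
    CALLGRIND_DUMP_STATS_AT("phase_a");    /* dump, which also zeroes the counters */

    CALLGRIND_TOGGLE_COLLECT;              /* collect chunk 2 */
    total += phase_b(rounds);
    CALLGRIND_TOGGLE_COLLECT;
    CALLGRIND_DUMP_STATS_AT("phase_b");

    CALLGRIND_STOP_INSTRUMENTATION;
    return total;
}
```

Launched with --instr-atstart=no --collect-atstart=no, this should give one output file per chunk, plus the final one written at program exit.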
