
When evaluating code performance, counting CPU instructions is not the best metric, since the exact number of instructions executed depends on the compiler, the CPU model, the architecture, and so on.

Instead, we have a set of mathematical tools to describe the performance of an algorithm (the most popular being big-O notation), and in practice we simply report the running time to get a feel for how an implementation behaves (yes, big-O is not always that easy to apply).

I need to optimize a piece of code. This involves a lot of reading and staring at the code looking for places to improve, but also a great deal of experimentation: I need to check whether idea A actually improves performance, and by how much. I do this by measuring the time it takes to execute a piece of code.

But this is quite imperfect: I use a multitasking system, so I cannot exactly reproduce the running conditions of a previous experiment. My browser might decide to start 10 more threads that get in the way, or the CPU might throttle at a different point in time for various reasons.

So, usually, I not only have to wait for my program to run; I also tend to close my browser, make sure the IDE is not indexing or doing some other expensive operation, and wait a few minutes for the CPU to cool down before running the experiment. That is the only way I can get reliable timings.

The question is: is there a way to count the number of operations executed on my CPU for a specific process? In theory, this would be an ideal way of comparing two versions of my code, unless there is some issue I am missing.

  • I am puzzled: first you say that counting CPU instructions is probably not the most sensible metric, but then you ask for exactly that, a method for counting CPU instructions. So, which is it?
    – Doc Brown
    Commented Mar 2, 2020 at 14:17
  • ... but have a look here and here; there are tools for counting CPU cycles and instructions. Found those by simply googling "count cpu cycles".
    – Doc Brown
    Commented Mar 2, 2020 at 14:34
  • That's probably why profilers are the tool of choice. Commented Mar 2, 2020 at 15:29
  • Given that there are several tools which support CPU cycle counts, I don't see why you think this "doesn't seem to be a very common benchmark". And you did not answer my first question.
    – Doc Brown
    Commented Mar 2, 2020 at 15:39
  • Sometimes we do. Commented Mar 3, 2020 at 16:09

4 Answers

5

What you need is a Profiler.

Most IDEs include this functionality; when activated, it builds a special version of the code that can tell you quite precisely how much time is spent in each line. You can analyze this, and then you know exactly where the CPU effort is spent.

The strategy you describe is very inefficient; it's like closing your eyes and shooting into the forest, hoping to hit a deer.

  • I do use a profiler sometimes, and it's great for getting an overview of the most expensive pieces of code. But (and correct me if I'm wrong) a profiler is still affected by CPU throttling (or a lower CPU frequency in general).
    – Paul92
    Commented Mar 2, 2020 at 0:17
  • It would depend on the profiler, but that is rarely relevant: almost always, 90%+ of the time is spent in one place in the code, and that's all you need to know. Three significant digits in that percentage don't give you anything.
    – Aganju
    Commented Mar 2, 2020 at 0:20
  • Well, that's what I am actually interested in: whether a certain piece of code runs 1-2% faster after a change, not which part is the most expensive.
    – Paul92
    Commented Mar 2, 2020 at 0:28
  • @Paul92 If you think CPU throttling is an issue, you can change your system settings to disable that feature, so the CPU is always running at top speed. Commented Mar 2, 2020 at 0:55
  • @Paul92: then you are doing it wrong. Which part is the most expensive is pretty much the only thing you should be interested in, because that is where you can really make the code run faster. Looking at 1-2% differences is a useless waste of time unless you work on code that runs on thousands of servers and have already identified that code as the most expensive. Commented Mar 3, 2020 at 15:43
2

To give some examples of why this is not very feasible: modern processors work on quite a few instructions at a time, but only if those instructions are independent. The processor tries to predict any branches in the code, and there can be a huge difference between a correct and an incorrect guess. Loading from a memory address might cost almost nothing if the data is cached, around 200 cycles if it is fetched from main memory, or far more if the memory has been paged out to disk. Because of this, the time to execute one instruction can vary enormously, so simply counting the number of executed instructions tells you very little about actual performance.

Quite a bit about performance optimization has to do with efficiently using the processors we have, and this includes efficient memory access, branches that can be predicted and instructions that may be executed concurrently whenever possible. So the only feasible solution is to measure the actual time.

There are many different profilers that can help reveal potential bottlenecks. Intel VTune claims to provide data down to the instruction level, including pipeline stalls, but I have no personal experience with it.

High-resolution timestamps provide a high-frequency counter that can be used to measure performance. If you only want the time spent executing a particular thread, the only way I know of is to stop and restart the timer when the thread is switched out or in. I would expect most profilers to be able to do this for you. Here is some information about how dotTrace measures time.

  • I am aware of the pipelining and branch prediction issues, but I assume they are deterministic. Your post made me realize another issue, though: even if you count CPU cycles (which is what I actually meant by "instructions", not ASM instructions), you can't get total isolation, since other running processes might cause cache misses.
    – Paul92
    Commented Mar 3, 2020 at 13:32
  • Pipelining and branch prediction might be mostly deterministic for a given processor model, but certainly not across different models. What I guess you are after is "thread time", i.e. how much time the processor spends executing a particular thread. Updated the answer with some more info about this.
    – JonasH
    Commented Mar 3, 2020 at 15:23
  • As well as time, you can often actually count things like mispredicted branches. Commented Mar 3, 2020 at 16:10
1

Your only choice is to run the same piece of code many (i.e. thousands of) times and collect statistics.

I would recommend setting up a dedicated computer as a test rig for this. If that isn't possible, you will need to use your own machine; just stop waiting for the CPU to settle first.

Your code in a production environment is not going to get the benefit of a cold CPU or a light workload. It will be competing with other processes, and those processes will be using resources and will affect the algorithm's total runtime.

A well-tuned algorithm may take a slightly slower path (for the single-app experience) that is faster overall when the system is under load. The reason comes down to bandwidth utilisation, which increases latency (hence generally slower end-to-end timing than a low-latency approach on quiet hardware) but achieves higher throughput (hence completing more work with the allocated resources, and reducing overall runtime on busy hardware).

  • I agree that the system will not run in ideal conditions, but the point was not to benchmark the system; it was to get a reliable measure of the difference in performance between two versions of the code. That is also my point about CPU frequency scaling and multithreading: these aspects make the comparison harder to obtain, especially when we're talking about many small optimizations.
    – Paul92
    Commented Mar 2, 2020 at 9:35
  • @Paul92, if you care about small differences in speed (1-2%, as you indicate in another comment), then you must run each version of the code enough times that environmental effects can be filtered out of the measurements using statistics. Commented Mar 2, 2020 at 9:51
  • @BartvanIngenSchenau I agree, and I've been doing exactly that. But my question was asking more about why we don't use a metric like CPU cycles, which would provide a more reliable comparative measurement (at least from my current understanding, which might be flawed).
    – Paul92
    Commented Mar 2, 2020 at 14:47
  • @Paul92, on a modern CPU, CPU cycles are also deceptive. How many CPU cycles are used if a branch instruction needs 10 cycles to fetch the information needed for the comparison, and during 8 of those the branch predictor executes 2 other instructions on a branch that happens to be successfully predicted? Is that 10 or 18 cycles? Commented Mar 2, 2020 at 15:46
  • @Paul92 Consider a memory read (even just fetching the code itself): is it serviced by the L1, L2, or L3 cache? Is it a RAM access? Is the page swapped out to disk? How many instructions execute just looking up the page? Is it in the look-aside buffer, or do we have to perform an OS-level mapping operation instead? Are we counting micro-ops, or high-level machine instructions? Unless you have the increasingly rare case of a CPU-bound algorithm, instruction counting isn't even useful, as it can be considered instantaneous compared to the time taken by I/O.
    – Kain0_0
    Commented Mar 2, 2020 at 23:03
0

If you are on Windows with .NET and you want to do this yourself, have a look at the class System.Diagnostics.ProcessThread.

If you obtain this for a particular thread, you can inspect the properties TotalProcessorTime, UserProcessorTime and PrivilegedProcessorTime. These should tell you what you need to know.

Get a hold of them like this:

using System.Diagnostics;

ProcessThreadCollection threadCollection = Process.GetCurrentProcess().Threads;

foreach (ProcessThread processThread in threadCollection)
{
    // have your way with them, e.g. read processThread.TotalProcessorTime
}
