7

I've heard of some techniques to optimize code and make it faster:

  • On one side are clearly relevant optimizations: use of better algorithms, benchmarking, etc.

  • On the other side are techniques of more doubtful relevance. I don't know whether they are a myth, a reality, or just deprecated usage.

My question is about the latter ones. Some examples:

  • Big function (several thousand lines) to avoid function call overhead
  • No SRP to avoid Object overhead
  • Reuse variable as much as possible instead of having scoped variables
  • ++i/i++
  • And many other practices.

Often those techniques go against a readable, understandable and maintainable code base. So I'd like to know if they have a well-founded reason to exist.

A few things to note:

  • I'm only assuming there may be rules behind these practices, not that they are merely bad practices that took root in code and in developers' minds.
  • I'm currently working in C++98 with a correspondingly old compiler, and hopefully in C++14 with a more up-to-date compiler in the near future. I'd like to know if the answer depends on the environment.

My question: Are those practices myths or reality? In the context of an application with heavy performance criteria (like High Frequency Trading software), could those practices supersede "good development" practices?

  • This question is very interesting. I dared to edit it in a way to remove some very subjective wordings, since they could have encouraged opinion-based contributions, which are out of scope here. Please review and edit in case I misunderstood something.
    – Christophe
    Commented Jan 15, 2020 at 17:32
  • "And many some other practices." That makes the question too open ended to answer. Asking about a specific practice is fine; saying "there are some dubious practices that are too many to list; are there legitimate reasons for these practices that I haven't specified to exist?" is not answerable. Commented Jan 15, 2020 at 18:29
  • @Christophe Thanks for the edit, it's clearer.
    – JayZ
    Commented Jan 16, 2020 at 8:10

8 Answers

21

Performance optimization doesn't lend itself to these kinds of generalized rules, and I'm not sure that the rules you proposed were ever good ways to optimize.

Here's a better plan:

  1. Set specific performance requirements. (i.e. define "good enough.")

  2. Start with well-written code that takes advantage of routine performance optimizations such as sensible choices for collection implementations.

  3. Find the hot spots in your code, the places where the processor is spending the most time, using a profiler. This code will represent less than 10 percent of your overall code.

  4. Optimize only the hot spots for better performance.

  5. Repeat steps 3 and 4 until performance requirements are achieved.

That's it.
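For steps 1 and 5 in particular, a tiny timing harness is often all you need to check whether the budget is met. The following is only an illustrative sketch; meets_budget and handle_one_request are made-up names, not anything from a real library:

#include <chrono>
#include <cstdio>

// Illustrative only: time a piece of work and compare it against the agreed budget.
template <typename F>
bool meets_budget(F&& work, std::chrono::microseconds budget)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
    std::printf("elapsed %lld us, budget %lld us\n",
                static_cast<long long>(elapsed.count()),
                static_cast<long long>(budget.count()));
    return elapsed <= budget;
}

// Hypothetical usage:
//   meets_budget([] { handle_one_request(); }, std::chrono::microseconds(500));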

If performance is of paramount importance, you can violate "best practices" to achieve it. But most of the time, you won't have to.

  • I think step one is to "define good enough performance". Optimization problems can force you to do things in a less than clean way, so if you pretty much are already fast enough you don't have to keep squeezing blood from a stone. Commented Jan 16, 2020 at 19:08
  • Sometimes, (or, where I work, always) "performance requirements" are vague - something like "it should be as slow as possible, but not so slow that your customer rejects your product". In absence of real requirements, step 5 in the answer above will be subjective.
    – anatolyg
    Commented Jan 28, 2020 at 14:41
6

"Big function (several thousand lines) to avoid function call overhead"

In my experience, exactly the opposite. Compilers tend to have problems creating good code for huge functions. I've had examples where execution time was really important to me, and I got a considerable, measurable speedup by extracting small loops into separate functions - the code generated for the same time-critical loop was better when it was in a separate function. (What actually happened was that the simple loop had all its variables in registers, while the same loop inside a large function used stack memory for the same variables.)

But for "High Frequency Trading Software" the key is fast communication with other devices, and that requires totally different techniques.

PS. I just find it a bit disturbing seeing all these suggestions that are not very likely to speed up your code, but make it a lot less readable and maintainable. What about a simple trick that any C++ developer should know: Say you have a standard vector containing instances of class T. Now compare these two loops:

for (T t : vector) {}
for (T& t : vector) {}

The first loop causes a copy of every single item in the vector to be constructed and destructed; two potentially expensive operations. The second loop does no such thing at all. That single character "&" can make a loop 100 times faster under the right circumstances.
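To see the difference rather than take it on faith, here is a small self-contained demo; T here is just a made-up type that counts how often it gets copied:

#include <iostream>
#include <string>
#include <vector>

static int copies = 0;

struct T {
    std::string payload;
    explicit T(std::string p) : payload(std::move(p)) {}
    T(const T& other) : payload(other.payload) { ++copies; }   // count every copy
};

int main() {
    std::vector<T> v;
    for (int i = 0; i < 1000; ++i)
        v.push_back(T(std::string(100, 'x')));

    copies = 0;
    for (T t : v) { (void)t; }     // constructs (and destructs) a copy per element
    std::cout << "by value:     " << copies << " copies\n";

    copies = 0;
    for (T& t : v) { (void)t; }    // no copies at all
    std::cout << "by reference: " << copies << " copies\n";
}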

  • I've had similar experiences. My simplistic explanation of it is that it neglects the distinction between common and rare case execution of code.
    – user321630
    Commented Sep 28, 2020 at 4:03
5

Three quotes to summarize good practices regarding optimization:

  • Premature optimization is the root of all evil
    Donald Knuth
  • Don't diddle code, find better algorithms
    Kernighan & Plauger
  • Debugging is twice as hard as writing the code. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
    B. W. Kernighan

The rest is myth.

Some less important comments about your examples:

  • Not breaking down a thousand-line function to avoid a couple of function calls puts the performance focus on 0.1% of the instructions: 99.9% probability that the performance issues are elsewhere, IMHO.
  • Anyway, isn't it somewhat obsolete in a time of global optimizers and inlining? Does anyone remember the keyword register, which was the pinnacle of hand-crafted optimization 30 years ago?
  • No SRP to avoid object overhead could be valid, but only in case of a misunderstanding of SRP. SRP is, according to Uncle Bob -- the inventor of the concept -- about (human) reasons to change, not about multiplying classes to reduce the number of functions per class (which, taken to the extreme, would make OOP a variant of procedural programming). The benefits of real SRP far outweigh its drawbacks IMHO.
  • ++i;, i++; and i=i+1, taken in isolation, generate exactly the same code with a decent optimizing compiler, so ++i is no longer an optimization! It is still valid from the point of view of expressiveness, but that's a question of style: you like it or not. If not, maybe C++ is not the most suitable language for you ;-) (A short sketch after this list shows the one case where the distinction still matters: class types with a nontrivial postfix operator.)
  • Many other human micro-optimizations are nowadays outperformed by the optimizer, which can take advantage of deep knowledge of the target CPU and analyse several alternatives in depth. But finding better algorithms is still up to us. I hope that we will outperform AI in this area for a long time (although some days I wonder).
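Here is the sketch mentioned above. For built-in types ++i and i++ compile identically; the only place the distinction has any teeth is a class type whose postfix operator must copy the old state. Counter is a made-up type showing the canonical operator pair:

struct Counter {
    int value = 0;

    Counter& operator++() {           // prefix: increment in place, no copy
        ++value;
        return *this;
    }

    Counter operator++(int) {         // postfix: must copy the old state to return it
        Counter old = *this;
        ++value;
        return old;
    }
};

int main() {
    Counter c;
    ++c;    // no temporary involved
    c++;    // creates and discards a copy (unless the optimizer removes it)
    return (c.value == 2) ? 0 : 1;
}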
  • ++i and i++ should generally produce the exact same code with a decent optimising compiler. There are cases where that would need more than decent compilers though, up to mind-reading, if i is just complex enough. Commented Jan 15, 2020 at 19:15
  • Even the Donald Knuth quote has become somewhat of a myth. Commented Jan 15, 2020 at 20:13
  • Did you mean to imply that we should look for a ++C language if we don't like the C++ style?
    – Bergi
    Commented Jan 16, 2020 at 1:23
  • @Bergi you made my day :’-) Well, maybe there’s a c+1 language out there? More seriously, what I wanted to say is that pre- and post-increment belong to the C/C++ culture and idioms so much that they appear even in the name.
    – Christophe
    Commented Jan 16, 2020 at 6:38
  • @JimmyJames My statement should not be misunderstood as a generalization: I only meant to criticise people who claim that ++i should be favored over i=i+1 for performance reasons.
    – Christophe
    Commented Jan 16, 2020 at 16:27
2

Not specific to C++, but one of the most common misconceptions I see is the idea that things will be faster if you pull all your data together before moving to the next step of your algorithm. This manifests itself in a number of ways. For example:

  • Building gigantic data messages
  • Loading large data structures into memory
  • Running process A for every item in a set, then running process B for every item, etc.

These approaches are disastrous from a performance perspective, and they can be challenging to eliminate once in place. These designs often leak across abstraction boundaries, and it therefore often requires a major rewrite to move to a more efficient solution. For example, if you are generating a large document for another application to consume, you can't simply break it up into smaller pieces without modifying the downstream application. Or if your application is built to expect an array or another type with a known size, it's not always trivial to replace that with an iterator.

For this reason, I consider this to be something you need to think about early on, and an exception to the generally good advice to wait until you find you have a performance issue.
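A contrived C++ illustration of the difference (sum_batch and sum_streaming are made-up names): both compute the same thing, but the first holds the entire input in memory before any work starts, while the second processes each piece as it arrives and generalizes to messages, result sets, and pipelines.

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// "Pull everything together first": peak memory grows with the input size,
// and nothing downstream can start until the whole file has been read.
std::size_t sum_batch(const std::string& path) {
    std::ifstream in(path);
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); )
        lines.push_back(line);                       // whole file held in memory
    std::size_t total = 0;
    for (const std::string& l : lines)
        total += l.size();
    return total;
}

// Streaming: each line is processed as soon as it is read, so memory use is
// bounded by a single line and downstream work can begin immediately.
std::size_t sum_streaming(const std::string& path) {
    std::ifstream in(path);
    std::size_t total = 0;
    for (std::string line; std::getline(in, line); )
        total += line.size();
    return total;
}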

  • In some cases, things can actually be faster when you process your data in columns. In particular, this is the case when you're running on a SIMD architecture such as GPGPU, since column-oriented processing means the GPU does not need to decode the same instructions for each row, or with interpreted/VM-based languages that don't use JIT recompilation, where the language's regular looping construct can be much slower than calling functions that work on a whole list with compiled loops. Code in column-oriented processing can also sometimes be much more readable when there's little branching.
    – Lie Ryan
    Commented Jan 20, 2020 at 1:29
  • @LieRyan I'm not sure how that relates to my answer. Can you elaborate?
    – JimmyJames
    Commented Jan 20, 2020 at 14:45
2

I see lots of good answers, but not one that currently answers your literal question:

So I'd like to know if they have a well-founded reason to exist.

If I remember correctly, each of these techniques had its justification some 20 to 30 years ago, on slow hardware and with not-so-sophisticated compilers or interpreters, in cases where even a micro-optimization could bring a required performance gain.

Today, with modern compilers for languages like C++, Java or C#, I don't think there is a single case where the examples listed in the question will bring a speed benefit in optimized code, not even in hot spots. In some older but still-used language implementations like VBA, some of those techniques may still be valid - sometimes.

Even today there are a lot of small embedded devices which don't have huge CPU power and which are usually programmed in C. I am not sure about the optimization quality of the C compilers for those platforms, but I can imagine there may be cases where some of the optimization techniques you mentioned still matter.

But as others have pointed out, there is no justification to apply micro-optimizations in advance, "just in case". The most sensible approach is

  • to implement the code as cleanly as possible
  • to optimize afterwards, and only the relevant parts
  • to keep only the optimizations which really brought a proven (= measured!) benefit
  • To your point, there's a lot of advice I remember from around 20 years ago about Java that, if you followed it, could actually reduce performance and make it harder to optimize. For example, pooling objects is counter-productive in current GC models.
    – JimmyJames
    Commented Jan 16, 2020 at 20:11
  • Probably 30 years ago, on a reasonably large program that I worked on, someone spent about 8 hours adding "register" everywhere it was possible. It resulted in a measured 25% speedup. Well, the compiler was stupid. Today, I'd expect that in the best case it would make no difference, and in the worst case "register" might talk the compiler into making the wrong decision.
    – gnasher729
    Commented Sep 28, 2020 at 13:43
  • @gnasher729: At least when targeting the popular Cortex-M0, using -O0 and register may allow a programmer to prevent the optimizer from making wrong decisions. Given a loop like do { *p += something; p+=2; } while (p < e); if p, e, and something are register-qualified, all will be kept in registers. In cases where gcc's optimizer is able to replace something with a constant, however, it is prone to generate code that would reload that constant into a register on every pass through the loop.
    – supercat
    Commented Sep 28, 2020 at 19:17
  • @gnasher729: Interestingly, unless one jumps through hoops to prevent gcc from applying constant propagation, such as by storing the constant to a volatile and then reading it into another object before the start of the loop, gcc -O0 can produce code for a construct like the above that's more than 25% faster than what gcc would produce at any other setting, but only if one uses register.
    – supercat
    Commented Sep 28, 2020 at 19:20
2

In reality, lots of times your code is just fast enough, so you do nothing.

If it's not fast enough, lots of times you don't need to optimise; you just have to figure out what you did that was completely stupid and killed performance, and stop doing it. That has been my experience in recent years: when something was too slow, it was because of someone doing something stupid. (I suppose "optimising" involves doing something clever, so stopping doing stupid things doesn't count as optimising.) Picking the wrong data structure, like looking up keys in an array by sequential search instead of using a dictionary, falls into this category.

The next step is looking for work that you do repeatedly for no good reason. Like sorting an array repeatedly. Calculating the same dataset repeatedly. Reading the same data from a database multiple times, and so on. Avoiding that can give you massive savings.
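As a made-up sketch of the "reading the same data multiple times" case: cache the expensive result once instead of repeating the work per request. expensive_lookup here only stands in for something like a real database read.

#include <iostream>
#include <map>
#include <string>

// Stand-in for an expensive operation that keeps getting repeated,
// e.g. reading the same row from a database on every request.
std::string expensive_lookup(const std::string& key) {
    std::cout << "expensive lookup for " << key << "\n";
    return "value-for-" + key;
}

// Do the work once per key and remember the result.
const std::string& cached_lookup(const std::string& key) {
    static std::map<std::string, std::string> cache;
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, expensive_lookup(key)).first;
    return it->second;
}

int main() {
    cached_lookup("timeout");   // does the expensive work
    cached_lookup("timeout");   // served from the cache
}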

And if that is not enough and it looks like you need some real optimising: Find someone who knows what to do.

Micro optimisations like the ones suggested come at the very, very end of the list.

1

Make it work, then make it beautiful, then if you really, really have to, make it fast. 90 percent of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!

by Joe Armstrong

0

Big function (several thousand lines) to avoid function call overhead

This completely neglects the distinction between common-case and rare-case branches of code; people won't get far neglecting that difference. There are also factors to consider like the increased cost of more distant jumps and icache misses.

There is an unusual and not-so-commonly-cited case where inlining a somewhat large function can result in improved performance, but it has to do with constant propagation and dead code elimination more than calling overhead or its interference with register allocation. Take an example like this:

// Returns true if the ray intersects the triangle. If 'cull' is true, backfaces
// are ignored.
bool ray_tri_intersect(ray r, tri t, bool cull)
{
    ...
    if (cull)
        ...
    else
        ...
    ...
    if (cull)
        ...
    else
        ...
    ...
    return result;
}

In those cases, I've found that calling the function like this:

if (ray_tri_intersect(r, t, false)) {...}

... will fail to eliminate the branching overhead of the if statements within the function. That's a bit unfortunate, and there are speed-ups to be gained here if we just brute-force inline the function, but ideally we just need three versions of the function: one where cull has a known compile-time value of true, another where it is false, and another where the cull value can only be deduced at runtime (an l-value). Inlining every single call to the function is rather brutish and bloated in terms of code generation, but it may be the best practical option we have in C in response to a hotspot.
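For what it's worth, in C++ one way to get those three versions without brute-force inlining everything is to turn cull into a template parameter. The sketch below keeps the body elided just like the snippet above, and the ray/tri definitions are only stand-ins for the real geometry types:

// Stand-in geometry types; the real ones come from the intersection code.
struct ray { float origin[3]; float dir[3]; };
struct tri { float v[3][3]; };

// Each instantiation has its `if (Cull)` branches resolved at compile time,
// so the dead branches disappear without inlining the whole function.
template <bool Cull>
bool ray_tri_intersect(const ray& r, const tri& t)
{
    bool result = false;
    if (Cull) {
        // ... backface-culling path ...
    } else {
        // ... two-sided path ...
    }
    (void)r; (void)t;
    return result;
}

// Overload for call sites where the flag is only known at runtime.
inline bool ray_tri_intersect(const ray& r, const tri& t, bool cull)
{
    return cull ? ray_tri_intersect<true>(r, t)
                : ray_tri_intersect<false>(r, t);
}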

No SRP to avoid Object overhead

The only practical overhead of an object in a language like C++ is what the programmer introduces. Naturally there will be some if you introduce virtual functions, in the form of a vptr and virtual dispatch, but there is no overhead inherent to an object that you don't add yourself, as there can be in Java or C#.

That said, I have written many posts on here about trapping ourselves in performance-critical cases by designing our objects too granularly, like trying to represent a pixel of an image, or a particle of a particle system, as an object. That has nothing to do with compilers or optimizers; it's human design. If your system has a boatload of dependencies on a teeny Pixel object, modeling things at such a granular level leaves no breathing room to switch to, say, loopy SIMD algorithms without extremely invasive changes to the codebase. For very loopy, performance-critical work it does help to avoid dependencies on teeny objects that store very little data, but not because of some object overhead; it's the human overhead of trapping yourself in a design that leaves little room for bulky, meaty optimizations. You don't leave much room to optimize an Image operation if the majority of your code depends on working with a Pixel object, and you'll work yourself towards rewriting your entire engine if dependencies on teeny objects keep you from making broad, meaningful performance improvements without rewriting almost everything.
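A sketch of what "leaving breathing room" can mean in practice; Image and brighten are made-up names, and the point is only that the interface exposes bulk data rather than funneling every access through a per-Pixel object:

#include <cstddef>
#include <cstdint>
#include <vector>

// Bulk representation: the image owns contiguous pixel data.
struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;   // interleaved RGBA, 4 bytes per pixel
};

void brighten(Image& img, int amount)
{
    // One tight loop over contiguous memory: easy to vectorize, tile or
    // parallelize later if the profiler says this is a hotspot.
    for (std::size_t i = 0; i < img.pixels.size(); ++i) {
        const int v = img.pixels[i] + amount;
        img.pixels[i] = static_cast<std::uint8_t>(v > 255 ? 255 : (v < 0 ? 0 : v));
    }
}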

Reuse variable as much as possible instead of having scoped variables

Variables don't take any memory or resources. I keep repeating myself here like a broken record, but variables don't require memory; operations require memory. If you add x+y+z, the computer can only perform scalar additions (excluding SIMD) on two operands at a time, so it can only do x+y and must remember that sum in order to add z to it. That intermediate result is what takes memory, and this is widely misunderstood. Variables are just human-friendly handles to the memory that results from operations like these. Understanding this is key to understanding how our compilers and optimizers work, and to making sense of what our profilers report.
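A tiny example of that point (the function names are made up): with optimization enabled, any reasonable compiler generates identical code for both functions, because naming the intermediate result doesn't create anything the operation didn't already need.

// Same computation, written with and without a named intermediate.
int sum_a(int x, int y, int z) {
    return x + y + z;          // (x + y) still has to be materialized somewhere
}

int sum_b(int x, int y, int z) {
    const int xy = x + y;      // naming that intermediate adds nothing
    return xy + z;
}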

For people interested in this topic, I recommend starting with Static Single Assignment Form and the optimizations that compilers make resulting from it.

That said, there are some cases where reusing an object can net a performance improvement. For example, in C++, if you use a std::vector to store small, temporary results for each iteration of a loop with a million iterations, you can see substantial performance improvements by hoisting the vector out of the loop and clearing and reusing it across iterations. That's because defining a vector object is much more than a variable declaration: it involves constructing the vector, and subsequent push_backs and resizes will involve heap allocations and possibly a linear-time copy-construction pass (or something akin to a memcpy for trivially copy-constructible types).
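A sketch of that reuse pattern (the names are made up): one scratch vector hoisted out of the loop keeps its allocation alive across iterations instead of reallocating on every pass.

#include <cstddef>
#include <vector>

void process_all(const std::vector<std::vector<int> >& inputs)
{
    std::vector<int> scratch;                         // constructed once
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        scratch.clear();                              // keeps the capacity
        for (std::size_t j = 0; j < inputs[i].size(); ++j)
            if (inputs[i][j] > 0)
                scratch.push_back(inputs[i][j]);
        // ... use scratch as this iteration's temporary results ...
    }
}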

This overhead works towards zilch if you use a tiny_vector or small_vector implementation which uses a small buffer optimization in loops where you don't exceed the size of the small buffer. Unfortunately, the standard library doesn't offer SBOs for std::vector (although oddly enough they prioritized making std::string use one in C++11+). But forget the idea that variables have overhead. They have none. It's the operations involved with constructing the vector and inserting elements into it which have an overhead in this case.

Operations require space to store results. Variables require none. They are just programmer-friendly handles to the results.

++i/i++

A quick glance at the disassembly should show no difference here unless you're using complex objects, and by "complex" I mean something beyond a simple random-access iterator (maybe something which allocates memory per iteration that the compiler failed to optimize away). 99.9% of the time, there should be no difference for most people. That said, I have never understood stubborn C++ programmers who refuse prefix notation in favor of postfix even in places where it makes no difference to readability, when the prefix notation is guaranteed to be as fast as or faster than the postfix. Still, the stubborn ones are very rarely causing inefficiencies with their stubbornness (at least from a runtime standpoint -- maybe we could still save the compiler some extra work and reduce build times by a small amount by favoring ++i across the board, especially where UDTs are involved).

My question: Are those practices myths or reality? In the context of an application with heavy performance criteria (like High Frequency Trading software), could those practices supersede "good development" practices?

If you make a decent profiler your best friend, you'll see first-hand a lot of cases where the general rules of thumb are mostly right and where they're mostly misleading. I recommend that: make a profiler your best friend if you haven't already. A lot of things passed off as "general wisdom" are definitely misleading, and some definitely are not and remain good advice. But the key to thinking critically and telling the difference is measuring with a good tool that breaks things down for you in detail.
