34

Writing your own code in assembly is generally inadvisable now since, in most cases, gcc -O3 works magic. But in the '80s it was believed that compiled C code took 4(?) times as long to run, or more, than a well-organized assembly equivalent. When, and why, did coding in C become the received practice even where performance is the primary concern? Which compiler first achieved this, and on which architecture?

Are there any high-level language compilers (Ada/COBOL/Fortran/Pascal) outside the C family that generate optimized code outperforming average assembly programmers?

18
  • 15
    On Unix workstations, as most software was compiled with GCC, the CPUs started to be designed to run code compiled by GCC faster. Commented Sep 13, 2020 at 18:19
  • 8
    Never I tell you!, never, never ....... [ :-) ] Commented Sep 14, 2020 at 10:30
  • 13
    What's an "average" assembly-language programmer these days, though? Is the average getting better because only the motivated need to do it?
    – dave
    Commented Sep 14, 2020 at 15:40
  • 6
my personal experience (video games, so it was all asm, then later moved to C++) is that hand-written asm was always faster, but it stopped being worth it around the Pentium 3 era, except for some specialized code. The one thing that took compilers a very long time to get good at was generating the FPU code for math operations.
    – Thomas
    Commented Sep 14, 2020 at 20:30
  • 3
    @Alan B, it does not really matter what proportion of the routines is in C. What matters is the proportion of time your program sits in that single routine written in assembly. And then the speed-up that can be reached in a well-tuned assembly routine vs output of the optimizing compiler can easily translate into appreciable speed-up of the whole program.
    – introspec
    Commented Sep 16, 2020 at 13:40

15 Answers

55
+50

As a former professional assembly language programmer I would say that by the late 1980s a number of C compilers had become available whose output was as good as something a skilled assembly programmer could produce. I used various x86 C compilers around then; JPI C and WATCOM C in 1988, and MSVC 1.0 in 1994, produced output as good as anything I could write and even taught me the occasional trick.

5
  • 2
Watcom had a truly fantastic optimizer by the early-to-mid 1990s that I would have put up against almost any assembly language programmer, but I wasn't using their original tools from the late 1980s, so I don't know if it was always so awesome. Commented Sep 14, 2020 at 20:54
  • 4
I spent quite a bit of time looking at output from Borland Turbo C 2.0 (1988), and that compiler varied wildly between smart and dumb. Some tricks (x % 8 -> and ax,7; x = !x -> neg ax; sbb ax,ax; inc ax) were ever-present (see the sketch after these comments), but inefficiencies (repeatedly recalculating bx when doing a lot of struct accesses, making bad decisions about stack vs. si/di for frequently-used local vars...) cancelled out much of the benefit.
    – smitelli
    Commented Sep 15, 2020 at 13:42
  • 7
@smitelli: GCC is still like that. When targeting the Cortex-M0, even if code loads a constant into an automatic variable before a loop, and a register would be available to hold the value throughout the loop, gcc will sometimes apply constant propagation to replace the variable with the constant, reload that constant on every iteration of the loop, and even add an extra register move on top of that.
    – supercat
    Commented Sep 15, 2020 at 15:33
  • @smitelli: The choice of whether to use stack vs SI/DI would often depend upon whether variables were declared 'register'. Suitable use of that qualifier can make a huge difference when using Turbo C (and also, incidentally, when using gcc with the -O0 flag). On some occasions, code which makes good use of the register qualifier may be more efficient when compiled with -O0 than it would at any other optimization setting, because using -O0 will prevent gcc from making a counter-productive optimization.
    – supercat
    Commented Oct 21, 2021 at 16:51
  • 1
One thing that the optimising compiler won't do is self-modifying code, so sometimes it's still worth writing some inner loops in assembly if you think you can make use of it.
    – Sam
    Commented May 1, 2023 at 11:29
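
To make smitelli's two Turbo C tricks concrete, here they are as C with the reported 8086 output sketched in comments (an illustration under that comment's claims, not a captured compiler listing; the % trick assumes an unsigned operand):

    unsigned mod8(unsigned x) { return x % 8; }   /* and ax, 7                      */
    int logical_not(int x)    { return !x; }      /* neg ax      ; CF = (x != 0)    */
                                                  /* sbb ax, ax  ; ax = x ? -1 : 0  */
                                                  /* inc ax      ; ax = x ? 0 : 1   */
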
40

For a start, it is widely known that the original FORTRAN compilers for the IBM 700 series were specifically designed to generate object code of high enough quality to make assembly programming of numerical routines unnecessary. FORTRAN compilers have largely kept up that legacy ever since.

C compilers have historically varied in quality a great deal. It must be remembered that C was originally designed as a sort of "portable assembly language" with a reasonable correspondence to the instructions and addressing modes of the PDP-11. Suitably written C with even a simple compiler could be remarkably efficient. But object code produced by some early compilers, particularly for microcomputer platforms such as the PC, was unreasonably bad.
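
The classic illustration of that correspondence (my sketch, not part of the original answer) is the string-copy idiom: each *p++ maps directly onto a PDP-11 autoincrement operand, so the loop body compiles to roughly MOVB (R0)+,(R1)+ / BNE loop.

    void copy_string(char *dst, const char *src)
    {
        while ((*dst++ = *src++) != '\0')   /* two autoincrements, one test */
            ;
    }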

Today, even with the sophisticated compilers now available, it is still usually possible for a skilled assembly coder to write better code than a compiler produces. They may use instructions that the compiler does not know how to use, or understand the algorithms more deeply than can be expressed in C. At a minimum, they can start with the output of a compiler and improve from there.

The advantage of the compiler is that it generates code more quickly, i.e. with less developer effort, and the source code is easier to maintain. Today's sophisticated compilers help to reduce the performance deficit that traditionally went along with that. But sophisticated compilers are not new.

3
  • Comments are not for extended discussion; this conversation has been moved to chat.
    – Chenmunka
    Commented Sep 14, 2020 at 15:27
  • Sometimes the assembly programmer can take advantage of knowledge of the problem that the compiler can't know. For example (n * 103) >> 10 is equivalent to n / 10, if you know that n is an integer in the range 0 to 99. Compilers know lots of tricks like that, but they don't know that your input is restricted. Commented Sep 29, 2021 at 16:14
  • @MarkRansom: Modern compilers do understand range restrictions. See e.g. stackoverflow.com/questions/40447195/…
    – MSalters
    Commented Oct 22, 2021 at 9:58
29

Compilers started generating more efficient code than the average assembly programmer the moment architectures became so complex that the programmer could no longer cope with all their details: things like which instruction should go to the U pipe or the V pipe of a Pentium processor, etc.

Which one was the first? I'd say, for the x86 architecture, it was the Intel compiler. At least it was the first one (to the best of my knowledge) that was able to detect candidate loops for vectorization and use MMX, SSE and AVX instructions on them.

However, the Watcom C compiler had a reputation for generating very good machine code in the days before the Pentium and even the 486. Some of the optimization options I later saw in the Intel compiler were already available in Watcom.
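
As a sketch of the kind of loop such vectorizers look for (illustrative C, not actual ICC output): a countable loop with unit stride and no aliasing, which the compiler can rewrite to process several elements per iteration in MMX/SSE/AVX registers.

    /* restrict lets the compiler prove the arrays don't overlap,
       which is what makes the loop a vectorization candidate. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* SSE/AVX: 4-8 floats per step */
    }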

5
  • AVX didn't exist until after other compilers (like GCC) gained the ability to auto-vectorize. But yes, good point for MMX and SSE. Current ICC still has some auto-vectorization capabilities that current gcc and clang lack, e.g. loops like memchr where the trip count isn't known / calculable ahead of the first iteration. GCC / clang still can't vectorize search loops, but ICC can. Commented Sep 14, 2020 at 22:46
  • The specific features involved in the complexity were probably out-of-order processing and superscalar design.
    – Mark
    Commented Sep 15, 2020 at 1:34
  • I like that this answer doesn't treat the compiler in a vacuum. As compilers were getting smarter, CPUs were getting more complex to program for.
    – Cort Ammon
    Commented Sep 15, 2020 at 15:55
A good thing about this example is that it highlights the level of detail an assembly programmer needed to know for the Pentium. As well as the differences between the U and V pipes in terms of integer instructions, other questions would be: do you try to interleave simple and complex integer instructions to keep both pipes fed, and is the SIU quicker than the CIU for instructions they can both handle? Commented Sep 26, 2020 at 18:43
  • you mean i'm not supposed to read the entire 12,900 page reference manual for ARM v8-A in order to program my apple M2?
    – don bright
    Commented May 4, 2023 at 0:43
22

I came across an interesting comment a few days ago that Donald Knuth was deeply impressed when he discovered that 5 * 5 - 25 was optimised by an (ALGOL?) compiler to a register clear. That would have been in the late 1950s.
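
The same optimisation in modern dress (my sketch): the compiler folds the constant expression to zero at compile time, and the zero becomes a register clear.

    int f(void)
    {
        return 5 * 5 - 25;   /* folded to 0; on x86: xor eax, eax */
    }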

Frances Allen's catalogue of optimising transformations was published in 1972. I agree that a lot of 1980s-era PC compilers produced poor code, but they were also notable for (a) being cheap and (b) not assuming the availability of an arbitrarily large amount of memory, which would have allowed them to optimise an arbitrarily complex expression (let alone attempting to optimise across expressions).

I'd also note hearing a comment in the late 1980s that some of the most efficient compilers were for Modula-2, since the source language gave the compiler sufficient hints as to what was expected. The Topspeed compilers (written largely by George Barwood) were pretty good.

So I think a reasonable answer would be that in principle a compiler could approximate the efficiency of a human programmer in the early to mid 1970s, provided that the user paid enough for the compiler and provided that the host on which it ran had sufficient resources.

14
  • 7
The DEC BLISS compilers got to be pretty good; somewhere around 1981, I ran a test where I rewrote some of my own MACRO-32 in BLISS-32, and the code was only 1.7 times larger, which I thought was fair enough. I read an interview with Ritchie in which he said that if DEC had let them have the BLISS-11 compiler, they would not have needed to invent C. But DEC was pretty close-fisted with BLISS.
    – dave
    Commented Sep 12, 2020 at 23:08
  • 3
    As a counterpoint, I know that at least one Modula-2 compiler from the 1980s (for the ARM) was excruciatingly bad. On a CPU which had an integer multiply instruction and a rich register set, it would emit 20 instructions followed by a subroutine call to do an integer multiply. This was a big factor in the selection of "Arthur" (written directly in assembler) rather than its more sophisticated competitor for the Acorn Archimedes; Arthur could do things in 256KB that the other one couldn't do in 4MB.
    – Chromatix
    Commented Sep 13, 2020 at 7:35
  • 2
    @Chromatix And it cost how much and wanted what resources? Those were the days in which if you were serious you bought yourself a Logitech compiler hosted on a VAX... none of this PC 640K crap. Which leads to the extreme case of Stallman's observation on the Pastel compiler: it built (and presumably optimised) the entire parse tree in memory before generating code. In the middle is the interesting case of Tree Meta, where the syntax equations contained an explicit directive to specify the point at which the tree should be "unparsed" to machine code. Commented Sep 13, 2020 at 9:06
  • @MarkMorganLloyd Cost was a real factor here - a 4MB machine was simply unaffordable at that time (circa 1986) though it could be built, and the way it swapped incessantly under even a light load meant users would not be getting what they paid for. "Arthur" - short for "A RISC operating system by Thursday" - did more in practice with far fewer resources, and allowed useful, fast programs to be written in interpreted BASIC. And you can run an updated version of it on a Raspberry Pi.
    – Chromatix
    Commented Sep 13, 2020 at 10:25
  • 2
    @Chromatix Exactly. But I think my reply still stands: provided that you had the resources, you could have an efficient compiler in the mid-70s. I'd suggest that even with the improved techniques available today, if you built a machine with 64K you'd be hard pushed to improve on the efficacy of something like CP/M Turbo Pascal. Commented Sep 13, 2020 at 10:53
13

The primary advantage that C would have over an assembly-language programmer is the ability to adjust generated code to deal with changes to various constants. When using a quality compiler, if one writes "unsigned foo = bar/8;" a compiler can generate a shift instruction, but if the constant would later need to be 5, a compiler can switch to using other means of performing the computation, such as a combination of a multiply and shift. If, by contrast, the code had been written in optimal assembly language, changing the divisor would require more substantial changes to the code.
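
A minimal sketch of that point (illustrative; the exact instruction choices vary by compiler and target). Changing one character of source changes the whole code-generation strategy, at no cost to the programmer:

    #include <stdint.h>

    uint32_t div8(uint32_t bar) { return bar / 8; }   /* a shift: shr eax, 3       */
    uint32_t div5(uint32_t bar) { return bar / 5; }   /* a reciprocal multiply:
                                                         mul by 0xCCCCCCCD, then
                                                         take the high half >> 2   */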

Otherwise, while the makers of some free compilers may like to claim that their compilers can do as well as or better than assembly language, and while they may find some "clever" optimizations that occasionally allow them to do so, they routinely generate code which is significantly worse than would be expected from any competent assembly-language programmer. For example, when targeting the popular Cortex-M0 microcontroller, gcc will process the use of a constant within a loop by generating code that reloads the constant every time through the loop. Even if the constant is loaded into a register-qualified object before the loop, gcc will still defer the load until the value is used, and re-execute the load on every iteration of the loop.

14
  • 2
@Holger: I should perhaps have been a bit more general about the kinds of changes that C code readily accommodates, but the basic point is that hand-written machine code which is designed for a particular precise usage case can often substantially outperform compiled code. Writing assembly code to be more broadly useful often requires sacrificing performance. If one were writing an assembly-language loop which needs to perform some task 120 times today, but may need to do it 126 times tomorrow, one might unroll it 8x but have to include logic that could add an extra 1-7 repetitions...
    – supercat
    Commented Sep 14, 2020 at 14:54
  • 2
    ...if needed (determined at runtime) while a compiler could automatically switch between unrolling strategies to yield optimal results for any constant number of repetitions. As noted, however, while the free compilers employ some rather complex and clever optimizations, they are also prone to miss some low hanging fruit. There's IMHO something philosophically wrong with a compiler that simultaneously makes complex and unsound inferences based upon pointer equality, but misses as much low-hanging fruit as clang and gcc do.
    – supercat
    Commented Sep 14, 2020 at 14:59
  • 2
    Recognizing, for example, that the sequence "subtract; compare with zero; jump if not equal" may be replaced with "subtract; jump if not equal" is not an obscure optimization; nor is recognizing that there is no need to sign-extend a register whose upper bits are never going to be used. If gcc had a mode which would apply that sort of optimization, identify what variables or constants would be eligible for a "register" qualifier, and automatically apply that qualifier to the ones that are used the most, that would reap more than half of the execution time savings that would be available...
    – supercat
    Commented Sep 14, 2020 at 15:03
  • 2
    ...for many programs, without sacrificing compatibility with programs written for low-level compilers that will process constructs "in a documented fashion characteristic of the environment" even when they are outside the Standard's jurisdiction.
    – supercat
    Commented Sep 14, 2020 at 15:05
  • 4
    If the described behavior is generally reproducible, then it's a major defect in the GCC ARM code-generator and you should really file it as a bug with the GCC folks. Commented Sep 14, 2020 at 20:56
12

There's another factor going on here that I have noticed when examining compiler output vs. what I would have written (admittedly, I haven't done enough assembly to be a real expert at it):

Given what the compiler knows, I have been impressed at how efficiently the code was generated. However, in every case I have examined, I could have done better, because I knew things about the problem that the compiler didn't.

3
Which I think reinforces what I said above about languages that give sufficient hints to the compiler to let it get the optimisation right, and opens the question of how a language can best solicit advice without insisting that the programmer use predefined functions (written in a particular style and heavy with hints) or litter his code with pragmata. Commented Sep 14, 2020 at 6:32
  • 2
A key insight here might be that a good C programmer can write code which is optimizer-friendly. There's very little in real-world problems that can be expressed only in assembly. One hint in this regard is restrict in C99 (see the sketch after these comments): it was added to make code optimizer-friendly, but in the intervening 20 years no similar extensions have been added.
    – MSalters
    Commented Sep 14, 2020 at 9:14
  • @MSalters: Unfortunately, some compilers make it annoyingly difficult to write code that is optimizer friendly. For example, clang and gcc process restrict in such a way that within a statement like if (p==q) doSomething(p); the value passed to doSomething won't necessarily be recognized as being based upon p. Although the Standard's definition of "based upon" is sufficiently ambiguous that such behavior might be conforming, nothing in the Standard would indicate any intention to forbid comparisons between restrict-qualified pointers and anything else.
    – supercat
    Commented Sep 14, 2020 at 19:03
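
To make the restrict point concrete, a standard C99-style sketch: the qualifier promises the compiler that the two pointers never alias, so stores through out cannot change in[i], and values can stay in registers across the loop.

    void scale(float *restrict out, const float *restrict in, float k, int n)
    {
        /* without restrict, the compiler must assume out[i] might
           overwrite part of in[] and reload in[i] defensively */
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;
    }
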
7

Never. That's my short and provocative answer. The code generated by the compiler was chosen by a programmer, and the optimisations it applies can also be applied by hand; given unlimited time and resources, a programmer will always be able to generate better code than the compiler. The question is whether it is worthwhile to try to overcome the limitations of the compiler or not. There is a limit a compiler cannot break that a human can: the compiler has to conform to certain constraints (the ABI, undefined behaviour, calling conventions, register usage, etc.) that a human can decide to violate.

18
  • 5
    The OP wanted a comparison for the "average" programmer though. The advantage for a compiler is that the compiler knows all the tricks a truly expert programmer does, and the "average" programmer may not.
    – Graham
    Commented Sep 14, 2020 at 13:09
  • 6
You're wrong. Specify any amount of time: you spend it on assembler code, I spend the same time writing C code, and my code will run faster. Because writing assembler code takes so much longer, I'll have better algorithms.
    – gnasher729
    Commented Sep 14, 2020 at 14:14
  • 4
    This seems a fair answer. The original question claims, without presenting evidence, that some unspecified C compilers (on unspecified ISAs) can generate faster code than some unspecified "average" assembly-level programmer.
    – dave
    Commented Sep 14, 2020 at 16:00
  • 5
    While I agree this is a fair answer, I submit it's wrong. On x86 at least, I believe the question's claim is very true. The average assembly-language programmer does horribly inefficient things on modern x86, mistakes that a compiler would never make. You need a real wizard of an assembly-language programmer who is also an expert on the x86 microarchitecture in order to beat a good optimizing compiler. Those wizards exist, but they certainly aren't "average". This is all beyond the simple "time it takes to write" metric that gnasher and others are talking about. @another-dave Commented Sep 14, 2020 at 20:59
  • 6
    @CodyGray - focus on x86 is a modern disease; in retrocomputing SE we should take a wider view.
    – dave
    Commented Sep 15, 2020 at 0:47
7

My eureka moment was in the late 80's (88 or 89) when a Senior developer on our team decided that a hand-coded assembly language routine he needed to change must be rewritten in C first. At that time we were using the WATCOM C compiler. The immediate result was that the compiled C version was 20% smaller. I no longer recall what the speed difference was.

That day I sent an email to WATCOM's top developer on the C compiler reporting the result, and claimed that I'd never write another routine in assembly language. I still haven't, although with the rise of Arduino and tiny microcontrollers, I would no longer rule it out.

7
  • That's pretty kewl.
    – Tomachi
    Commented Sep 15, 2020 at 11:41
I did a fairly big hand-coded assembler project on an ATtiny88, and a C compiler for sure couldn't have made it fit, but it had a stack leak which took me a day to find, so I strongly advise against it.
    – Janka
    Commented Sep 15, 2020 at 22:26
  • 2
Seriously, how is this even an argument? The assembly code may have been optimized for speed, so the fact that the optimizing compiler produced shorter code is frankly meaningless without a side-by-side comparison of both speed and size of both routines.
    – introspec
    Commented Sep 16, 2020 at 13:44
  • @introspec one surprising finding from compilers that could be optimized for speed or size via a switch was that the programs optimized for size tended to be faster as well. Commented Sep 30, 2021 at 19:08
  • @MarkRansom, as an experienced assembly programmer, I believe it to be almost never true. You can almost always get a bit of extra speed by sacrificing some memory. So what you are observing, is not an indication of efficiency of small programs, but conversely, an indication of the compiler inability to optimize for speed all that well.
    – introspec
    Commented Sep 30, 2021 at 22:03
7

It's really a cost/benefit problem. Hand-optimized assembly could still be faster, as you're optimizing for a specific code path rather than a more general one. That being said, each iteration of a compiler makes better decisions and generates tighter code with less room for further optimization. At some point, the extra few instructions that could be saved are not worth the time/cost of hand optimization. There was a time, I believe the early 90's, when we used partial assembly: some routines were hand-optimized assembly for critical performance, but most were written in higher-level languages. Eventually, those hand-optimized assembly routines were re-coded in higher-level languages as chips became faster and the need for performance gains was reduced.

As recently as a few years ago I dusted off my wizard's cap and robes and hand-coded a tiny inline ASM routine to perform a simple transformation...more because I could shave a few tics off a routine that was being called in a tight loop and could manage the registers myself. The end result outperformed a similarly coded C routine by roughly a factor of two (although we are talking tics). It is possible that a future version of the compiler will generate tighter code and/or new processor technology will further reduce any noticeable gains.
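
In the same spirit, a hypothetical sketch of such a tiny routine (not the answer's actual code): a byte-swap, first as portable C, then hand-picked via GCC extended inline asm so the author controls the registers. Modern compilers recognise the C pattern and emit the single instruction themselves, which is exactly how such hand-tuned gains evaporate over time.

    #include <stdint.h>

    /* Portable version; modern GCC/Clang compile this to one bswap on x86. */
    static inline uint32_t swap32_c(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0x0000FF00u)
             | ((v << 8) & 0x00FF0000u) | (v << 24);
    }

    #if defined(__GNUC__) && defined(__x86_64__)
    /* Hand-coded version: one instruction, register managed by us. */
    static inline uint32_t swap32_asm(uint32_t v)
    {
        __asm__("bswap %0" : "+r"(v));
        return v;
    }
    #endif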

2
  • In many cases, figuring out the most efficient sequence of machine code instructions to accomplish a task which is simple and repetitive, but doesn't match any of a compiler's built-in constructs, will often be easier than trying to find a piece of C code which can achieve the same output even if the latter is possible. Ironically, the hard part is often figuring out how to make the build system integrate the machine code into the rest of the project. I wish there were a standard syntax to express the concept "put this structure into code memory and treat it as a function" on platforms...
    – supercat
    Commented Sep 26, 2020 at 17:09
  • ...where that would make sense, since an assembler that could produce output in that format would make it possible to use assembly code in a manner that would allow rebuilding with any tool set in cases where one didn't need to change the machine code, and would allow one to use e.g. a browser-based cross-assembler in cases where one did need to change it.
    – supercat
    Commented Sep 26, 2020 at 17:11
6

I would say that the big change in compiler technology came around 1988, with the invention of static single assignment (SSA) form.

Before this, optimizers could often be beaten by humans because the optimizer struggled to prove that a given suboptimal code form had the semantics needed for an optimization to apply. SSA changed this, so that lots of powerful optimizations (such as value numbering) became cheap to apply.
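
A sketch of what SSA buys (illustrative, not any compiler's actual IR): every name is assigned exactly once, and a phi function merges values at join points, so "is this value unchanged on this path?" becomes a trivial lookup rather than a dataflow proof.

    int f(int a, int b)
    {
        int x = a + b;    /* SSA: x1 = a + b                     */
        if (a > 0)
            x = x + 1;    /* SSA: x2 = x1 + 1                    */
        return x * 2;     /* SSA: x3 = phi(x2, x1); return x3*2  */
    }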

1
  • 1
    SSA works well for objects of automatic duration whose address isn't taken. It works less well for other kinds of objects, whose relationships may not be statically resolvable. Applying it to other forms of objects allows compilers to determine that a particular code form has the semantics necessary to apply a certain transform in many cases where proving that would be impossible--sometimes because the code doesn't actually have the semantics in question.
    – supercat
    Commented Apr 28, 2023 at 17:21
3

And yet another point on this topic... the "four times" may have been referring to specific, common platforms.

Older languages generally had global scope and limited subroutine functionality. For instance, FORTRAN had user-controlled scoping, and in many cases there was no local data in a routine. Programs also generally used subroutines as defined functions, as opposed to a way of organizing code (well...).

In contrast, Algol-derived languages use the block as the primary code-organization concept, and programs are generally a collection of subroutine calls. Because the blocks have local scope, every one of these calls generally results in the creation (and destruction) of an activation record. As a result, there is significant overhead in the call dynamics of C (et al.) that older languages didn't have.
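
To show what that overhead looks like, here is a sketch of typical 1980s 8086-style code generation for a block-scoped call (assumed codegen for illustration, not any specific compiler's output):

    int sum3(int a, int b, int c)   /* push bp                        */
    {                               /* mov  bp, sp                    */
        int t = a + b;              /* sub  sp, 2   ; space for t     */
        return t + c;               /* ...body, then:                 */
    }                               /* mov  sp, bp ; pop bp ; ret     */
    /* A non-recursive language can give t a fixed static address
       and skip the frame setup and teardown entirely. */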

This led to the widespread use of intermediate systems programming languages, like BLISS mentioned above. On micros, these languages generally combined block-like layout with non-recursive call semantics that didn't require activation records. For instance, Action! on the Atari was generally considered to be about half the speed of hand-coded assembler, whereas C programs were much slower.

While larger platforms like the PDPs and VAXen had larger stacks and user-space controls that aided block-oriented languages, along with the room needed for more optimizations, I suspect at least some of that "four times" was a result of this same effect. Assembler on the same platform could tightly control the calls and unwind with ease, things that did not come to compilers until later.

You can also see this in the performance of systems that were designed to support block-oriented languages; things like CRISP, the original RISC II and various stack-oriented machines generally offered performance from C that was highly competitive with assembler - that was their entire raison d'être.

4
Well, IIRC there was a discussion of a certain transformation of graphics, comparing amateur code, optimized amateur code, C code by a pro, and asm by a pro. Each step saved about 10 times in execution time. But I forget whether it was posted here in RC, or anywhere else discussing some kind of 8-bit C.
    – Schezuk
    Commented Mar 2, 2021 at 18:19
When targeting something like the Z80 or the original 8088/8086, there can be a huge performance difference between a loop that keeps all non-array-ish objects in registers and code that can't, and assembly language can exploit byte-split registers in ways that compilers wouldn't be able to. The advantage of hand-written assembly language diminishes once one is no longer working with tight loops whose working set can be kept, or largely kept, in registers. After that, what matters is often whether code written in a high-level language can exploit things like chunking optimizations.
    – supercat
    Commented Mar 2, 2021 at 23:26
  • If one is using a compiled language or dialect which makes it possible to e.g. process suitably aligned pairs of 16-bit values using 32-bit or 64-bit types, that may allow a roughly 2x or 4x speedup versus what would be possible otherwise. If one is using a dialect that does not allow that, that will represent an extra 2x or 4x advantage for using assembly language versus that particular compiled dialect.
    – supercat
    Commented Mar 2, 2021 at 23:29
  • Hmm, as far as I recall, BLISS had recursive calls, nested routines, and uplevel references. It might be that the compilers were better at optimizing the cases that didn't use any of these. On VAX, the activation record was basically constructed by the CALLS instruction anyway, with maybe a subtract in the callee to allocate stack storage (the compiler tended to allocate the max required stack at routine entry).
    – dave
    Commented Mar 3, 2021 at 3:02
3

The choice between compiled language and assembly programming isn't always binary. In 1988 I was writing a sort-merge system on MS-DOS for specialised variable-length records. This was intended for CD-ROM publishing, so we had machines that exceeded the 32MB volume limit for MS-DOS 3.x, by using larger disc sectors. I was using MS C 5.0, I think.

The comparison routine for two records was obviously performance-critical. Rather than writing it in assembly, I spent a couple of days writing versions in C, reading the assembler listing from the compiler, and benchmarking promising-looking versions. I got a factor of two speed-up from an initial naïve design, with rather less work than writing assembler.
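
A hypothetical sketch of the kind of routine involved (not the original code): small rewrites of a loop like this, e.g. hoisting the length computation or walking pointers instead of indexing, are exactly the sort of changes that reading the compiler's assembler listing rewards.

    #include <stddef.h>

    /* Compare two variable-length records byte by byte;
       a shorter record sorts before a longer one with the same prefix. */
    int rec_cmp(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
    {
        size_t n = (alen < blen) ? alen : blen;   /* hoisted once  */
        for (; n--; a++, b++)                     /* pointer walk  */
            if (*a != *b)
                return (*a < *b) ? -1 : 1;
        return (alen == blen) ? 0 : (alen < blen ? -1 : 1);
    }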

1
  • 1
I think the advantage of writing in C would be the ability to have code function, even if sub-optimally, on other platforms to which it might be migrated. If code needs to move to a different hardware platform after 10 years because the old one becomes unmaintainable, improvements in machine speed would likely mean that even sub-optimal performance on the new machine would be better than optimized performance on the old one. What's annoying are situations where it's much easier to figure out what the optimal sequence of instructions should be than to coax devtools into generating it.
    – supercat
    Commented Apr 29, 2023 at 17:40
1

I guess the difference between "an average programmer" and a compiler is that the compiler has "mechanical sympathy" with the hardware it's compiling for. I also feel the need to quote Donald Knuth / Hoare / Dijkstra, depending on who you ask: "premature optimisation is the root of all evil".
In today's world of cloud computing, it all gets fuzzy: virtual machines, containers and runtime virtual machines (e.g. the Java Virtual Machine) can all coexist. Compiler micro-optimisations can therefore be meaningless in the grander scheme of things: code optimised for a container might be irrelevant on the VM / physical hardware it runs on.

Of course, if we're talking about bare-metal control, then it matters. These scenarios are quite niche, though. When running code on microcontrollers, optimising power by optimising CPU cycles is worthwhile: every CPU cycle costs microamps of battery life, which can be critical for some applications.

Processors have branch-prediction caches, and L1 and L2 caches to speed up memory access, as well as disk/SSD-backed virtual memory. Processors can also pipeline instructions and effectively run some parts of code in parallel when there are groups of unrelated instructions unaffected by the order in which they are executed. Intel did this with their Hyper-Threading technology, and there were probably others before them, but I'm not certain who without some proper research.

The JVM has the HotSpot compiler, which converts frequently interpreted portions of bytecode into native machine code to avoid repeatedly parsing/translating the same bytecode. Compilers have optimisations too, like inlining code to save on some kind of machine-code call instruction (which normally involves additional cycles for saving the return address at the very least). From a heuristics perspective, it's the empirical data that drives on-the-fly optimisation, so it's going to catch things you never even thought about.

Much of this depends on the processor, the language, the operating system, the compiler and the data that needs to be processed.

1
On the flip side, programmers often know much more about corner-case behavioral requirements than compilers can. If the optimal machine code that would meet requirements has a corner-case behavior which can't be expressed in C, and if a significant but not unlimited variety of behaviors would be acceptable in the corner case, it may be impossible to write source code from which any compiler could be expected to generate optimal machine code, unless the compiler extends the language to support such corner cases.
    – supercat
    Commented Oct 10, 2020 at 21:55
1

DISCLAIMER: I'm no "expert", but I have been around the block quite a few times.

This should have occurred to me earlier, but being otherwise distracted, I never fully articulated my thoughts.

I'll give you some background:

  1. Optimisations are very likely heuristic. If > 50% of specific sequences of code do x rather than y, then optimise for that scenario. Rinse and repeat. This was something that systems programmers on mainframes did for a living to eke out some more CPU cycles.
  2. We're not just talking about C here. It's done in many languages...
  3. Processors have (adaptive) branch prediction logic/caches (see How does the branch predictor know if it is not correct?)
  4. VMs have optimising compilers which alter the underlying bytecode at runtime to speed things up a bit, e.g. by switching whether a branch needs to be made (possibly costing slightly more CPU cycles to fetch more memory if the branch target is far away; see 3)
  5. It's possible that some higher-level languages could be leveraging functionality from C or C++ compilers... meaning the same "boilerplate" code is shared across many languages.
  6. No doubt there are many more than I'd care to know.
  7. The moral of the story: you either have to be lucky, or fully understand what's going on, in order to crack performance optimisation.
4
  • 1
    Is there a reason you wrote a second answer, instead of extending the existing?
    – Raffzahn
    Commented Mar 2, 2021 at 1:38
Not particularly, except that there are many aspects to consider here. Optimisation is a very broad-ranging subject, and is subjective in itself. I simply wanted to add another perspective without causing further confusion. Happy to merge them together if you think this would help?
    – KRK Owner
    Commented Mar 4, 2021 at 0:49
  • 1
    None of this addresses the question - when did the crossover occur? (I'm not convinced it's a good question, as it's too subjective). Commented Sep 29, 2021 at 7:49
  • 1
    @TobySpeight - thanks for being on-point. I also agree that it's highly subjective to precisely pinpoint where a crossover occurred. That's probably because engineers have always been creative in optimising things based upon the world of constraints they live within.... hardware, software, knowledge, experience, contacts...
    – KRK Owner
    Commented Feb 24, 2023 at 0:59
0

Quoting lilburne:

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.

I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.

I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.
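
A sketch of the kind of recast described (hypothetical bitmap layout, not the exercise's actual code): on a monochrome bitmap whose row stride is a power of two, the address arithmetic y*stride + x/8 and the bit index x%8 all reduce to shifts and masks, which is where the minutes-to-seconds speedup came from.

    #include <stdint.h>

    #define STRIDE_SHIFT 5   /* assumed: 32 bytes per row, 256 pixels wide */

    static inline int get_pixel(const uint8_t *bitmap, unsigned x, unsigned y)
    {
        const uint8_t *row = bitmap + (y << STRIDE_SHIFT);   /* y * 32    */
        return (row[x >> 3] >> (7 - (x & 7))) & 1;           /* /8 and %8 */
    }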

2
  • 4
    Amusing, but just anecdotal, and it doesn’t even give so much as a specific year when that happened. Commented Sep 29, 2021 at 7:18
  • It was on an Atari ST, according to comments.
    – wizzwizz4
    Commented Apr 30, 2023 at 19:00

