Why does the order of the loops affect performance when iterating over a 2D array?

Question

Below are two programs that are almost identical except that I switched the i and j variables around. They both run in different amounts of time. Could someone explain why this happens?

Version 1

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int i, j;
  static int x[4000][4000];
  for (i = 0; i < 4000; i++) {
    for (j = 0; j < 4000; j++) {
      x[j][i] = i + j;  /* inner loop varies the first index */
    }
  }
  return 0;
}

Version 2

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int i, j;
  static int x[4000][4000];
  for (j = 0; j < 4000; j++) {
    for (i = 0; i < 4000; i++) {
      x[j][i] = i + j;  /* inner loop varies the last index */
    }
  }
  return 0;
}

@naught101 The benchmarks will show a performance difference of anywhere between 3 to 10 times. This is basic C/C++, I'm completely stumped as to how this got so many votes... — TC1, Commented Mar 30, 2012 at 9:12
@TC1: I don't think it's that basic; maybe intermediate. But it should be no surprise that the "basic" stuff tends to be useful to more people, hence the many upvotes. Moreover, this is a question that's hard to google, even if it is "basic". — LarsH, Commented Mar 30, 2012 at 18:50

Community · Accepted Answer · 2016-05-07 15:28:52Z

As others have said, the issue is the store to the memory location in the array: x[j][i]. Here's a bit of insight into why:

You have a 2-dimensional array, but memory in the computer is inherently 1-dimensional. So while you imagine your array like this:

0,0 | 0,1 | 0,2 | 0,3
----+-----+-----+----
1,0 | 1,1 | 1,2 | 1,3
----+-----+-----+----
2,0 | 2,1 | 2,2 | 2,3

Your computer stores it in memory as a single line:

0,0 | 0,1 | 0,2 | 0,3 | 1,0 | 1,1 | 1,2 | 1,3 | 2,0 | 2,1 | 2,2 | 2,3
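A minimal sketch of that flattening, assuming the 3x4 array from the diagram above. The names ROWS, COLS, grid, row_major_offset and actual_offset are made up for illustration. Element [r][c] sits r * COLS + c elements from the start, and the second function checks that against the real addresses:

```c
#include <stddef.h>

#define ROWS 3
#define COLS 4

static int grid[ROWS][COLS];   /* hypothetical array matching the diagram */

/* Position of grid[r][c] in the flat, one-dimensional layout. */
size_t row_major_offset(size_t r, size_t c) {
    return r * COLS + c;
}

/* The same position, measured from the actual addresses of the elements. */
size_t actual_offset(size_t r, size_t c) {
    const char *base = (const char *)&grid[0][0];
    const char *elem = (const char *)&grid[r][c];
    return (size_t)(elem - base) / sizeof(int);
}
```

For every valid r and c the two functions should agree, which is just another way of saying that C stores the rows back to back.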

In the 2nd example, the 2nd index is the one that changes fastest (it is in the inner loop), i.e.:

x[0][0] 
        x[0][1]
                x[0][2]
                        x[0][3]
                                x[1][0] etc...

Meaning that you're hitting them all in order. Now look at the 1st version. You're doing:

x[0][0]
                                x[1][0]
                                                                x[2][0]
        x[0][1]
                                        x[1][1] etc...

Because of the way C laid out the 2-d array in memory, you're asking it to jump all over the place. But now for the kicker: Why does this matter? All memory accesses are the same, right?

No: because of caches. Data from your memory gets brought over to the CPU in little chunks (called 'cache lines'), typically 64 bytes. If you have 4-byte integers, that means you're getting 16 consecutive integers in a neat little bundle. It's actually fairly slow to fetch these chunks of memory; your CPU can do a lot of work in the time it takes for a single cache line to load.

Now look back at the order of accesses: The second example is (1) grabbing a chunk of 16 ints, (2) modifying all of them, (3) repeat 4000*4000/16 times. That's nice and fast, and the CPU always has something to work on.

The first example is (1) grab a chunk of 16 ints, (2) modify only one of them, (3) repeat 4000*4000 times. That's going to require 16 times the number of "fetches" from memory. Your CPU will actually have to spend time sitting around waiting for that memory to show up, and while it's sitting around you're wasting valuable time.
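The fetch counts above can be reproduced with a deliberately crude model. The sketch below assumes a cache that holds exactly one line at a time, which real CPUs do not (they hold many lines and also prefetch), so treat it as an illustration of the counting argument only. The function name and parameters are invented for this example.

```c
#include <stddef.h>

/*
 * Toy model: a cache that holds exactly one line of `ints_per_line` ints.
 * Walks a rows x cols row-major array and counts how often the access
 * moves to a different line, i.e. how many "fetches" are needed.
 * column_order != 0 mimics version 1 (first index varies in the inner loop);
 * column_order == 0 mimics version 2 (last index varies in the inner loop).
 */
unsigned long count_line_fetches(size_t rows, size_t cols,
                                 size_t ints_per_line, int column_order) {
    unsigned long fetches = 0;
    size_t last_line = (size_t)-1;          /* no line loaded yet */
    size_t outer_n = column_order ? cols : rows;
    size_t inner_n = column_order ? rows : cols;

    for (size_t outer = 0; outer < outer_n; outer++) {
        for (size_t inner = 0; inner < inner_n; inner++) {
            size_t r = column_order ? inner : outer;
            size_t c = column_order ? outer : inner;
            size_t line = (r * cols + c) / ints_per_line;
            if (line != last_line) {        /* touched a new line: fetch it */
                fetches++;
                last_line = line;
            }
        }
    }
    return fetches;
}
```

With rows = cols = 4000 and 16 ints per line, this model counts 1,000,000 fetches for the version 2 order and 16,000,000 for the version 1 order, which is the factor of 16 described above.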

Important Note:

Now that you have the answer, here's an interesting note: there's no inherent reason that your second example has to be the fast one. For instance, in Fortran, the first example would be fast and the second one slow. That's because instead of expanding things out into conceptual "rows" like C does, Fortran expands into "columns", i.e.:

0,0 | 1,0 | 2,0 | 0,1 | 1,1 | 2,1 | 0,2 | 1,2 | 2,2 | 0,3 | 1,3 | 2,3

The layout of C is called 'row-major' and Fortran's is called 'column-major'. As you can see, it's very important to know whether your programming language is row-major or column-major! Here's a link for more info: http://en.wikipedia.org/wiki/Row-major_order
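To make the difference concrete, here is a small sketch of how a column-major layout would map [r][c] to a flat position. It is written in C purely for illustration (C itself does not lay arrays out this way), and col_major_index is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Flat position of element [r][c] in a column-major layout with `rows` rows:
 * the elements of one column are adjacent, so moving down a column is a
 * step of 1 and moving along a row is a step of `rows`.
 */
size_t col_major_index(size_t r, size_t c, size_t rows) {
    return c * rows + r;
}
```

For the 3x4 example, col_major_index(1, 0, 3) is 1 and col_major_index(0, 1, 3) is 3, matching the column-major line shown above.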

You have the "first" and "second" versions around the wrong way; the first example varies the first index in the inner loop, and will be the slower executing example. — caf, Commented Mar 30, 2012 at 5:39
Great answer. If Mark wants to read more about such nitty gritty, I'd recommend a book like Write Great Code. — wkl, Commented Mar 30, 2012 at 13:59
Bonus points for pointing out that C changed the row order from Fortran. For scientific computing L2 cache size is everything because if all your arrays fit into L2 then computation can be completed without going to main memory. — Michael Shopsin, Commented Mar 30, 2012 at 15:26
@birryree: The freely-available What Every Programmer Should Know About Memory is also a good read. — caf, Commented Mar 30, 2012 at 22:38
Great answer but I actually imagine array as 0,0 1,0 2,0.. Why would you say 0,0 1,0 2,0 ? — Koray Tugay, Commented Oct 14, 2013 at 19:07

Oliver Charlesworth · Accepted Answer · 2012-03-30 02:20:03Z

Nothing to do with assembly. This is due to cache misses.

C multidimensional arrays are stored with the last dimension as the fastest. So the first version will miss the cache on every iteration, whereas the second version won't. So the second version should be substantially faster.

See also: http://en.wikipedia.org/wiki/Loop_interchange.
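A rough sketch of loop interchange applied to the question's code, assuming a smaller array (N = 1024, chosen arbitrarily) so it runs quickly. The function names and the clock()-based timer are illustrative choices, not part of the original programs. Both fill orders write exactly the same values, which is why swapping the loops is safe here.

```c
#include <time.h>

#define N 1024   /* assumed size for the sketch; the question uses 4000 */

static int a[N][N];
static int b[N][N];

/* Version 1 order: inner loop varies the first index (strided writes). */
void fill_column_order(int x[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[j][i] = i + j;
}

/* Version 2 order, after interchange: inner loop varies the last index. */
void fill_row_order(int x[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[j][i] = i + j;
}

/* CPU seconds spent in one call of `fill` (coarse; based on clock()). */
double seconds_for(void (*fill)(int x[N][N]), int x[N][N]) {
    clock_t start = clock();
    fill(x);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Returns 1 if both arrays hold the same values, 0 otherwise. */
int same_contents(int p[N][N], int q[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            if (p[j][i] != q[j][i]) return 0;
    return 1;
}
```

Calling seconds_for(fill_column_order, a) and seconds_for(fill_row_order, b) from main gives a rough comparison. The actual ratio depends on the machine, the array size and the compiler flags, and an optimizing compiler may perform the interchange itself.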

Oleksi · Accepted Answer · 2012-03-30 02:21:45Z

Version 2 will run much faster because it uses your computer's cache better than version 1. If you think about it, arrays are just contiguous areas of memory. When you request an element in an array, your OS will probably bring a memory page containing that element into cache. However, since the next few elements are also on that page (because they are contiguous), the next access will already be in cache! This is what version 2 is doing to get its speedup.

Version 1, on the other hand, is accessing elements column-wise, not row-wise. This sort of access is not contiguous at the memory level, so the program cannot take advantage of the OS caching as much.

With these array sizes, probably the cache manager in the CPU rather than in the OS is responsible here. — krlmlr, Commented Mar 30, 2012 at 8:59

Variable Length Coder · Accepted Answer · 2012-03-30 02:22:38Z

The reason is cache-local data access. In the second program you're scanning linearly through memory which benefits from caching and prefetching. Your first program's memory usage pattern is far more spread out and therefore has worse cache behavior.

Yun · Accepted Answer · 2021-10-13 18:40:28Z

Besides the other excellent answers on cache hits, there is also a possible optimization difference. Your second loop is likely to be optimized by the compiler into something equivalent to:

for (j=0; j<4000; j++) {
  int *p = x[j];
  for (i=0; i<4000; i++) {
    *p++ = i+j;
  }
}

This is less likely for the first loop, because it would need to increment the pointer "p" by 4000 each time.

EDIT: p++ and even *p++ = .. can be compiled to a single CPU instruction on most CPUs. *p = ..; p += 4000 cannot, so there is less benefit in optimising it. It's also more difficult, because the compiler needs to know and use the size of the inner array. And it does not occur that often in the inner loop in normal code (it occurs only for multidimensional arrays, where the last index is kept constant in the loop, and the second to last one is stepped), so optimisation is less of a priority.
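For comparison, here is a sketch of what the strided pointer form of the first loop could look like if written by hand. It assumes a small flat array (N = 8, picked only to keep the example tiny) and the name fill_with_stride is made up. Whether a given compiler produces anything like this is not something the sketch can show.

```c
#define N 8   /* assumed size for the sketch; the question uses 4000 */

/* Column-order fill using an explicit pointer that steps by N ints. */
void fill_with_stride(int flat[N * N]) {
    for (int i = 0; i < N; i++) {
        int *p = flat + i;              /* first element of column i */
        for (int j = 0; ; j++) {
            *p = i + j;                 /* same value as x[j][i] = i + j */
            if (j == N - 1) break;      /* stop before stepping past the array */
            p += N;                     /* jump one whole row ahead */
        }
    }
}
```

The early break keeps p from being advanced beyond the end of the array after the last row, which plain p += N in the loop header would do.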

edited Oct 13, 2021 at 18:40 by Yun
answered Mar 30, 2012 at 11:28 by fishinear

I don't get what 'because it would need to jump the pointer "p" with 4000 each time' means. – Veedrac, Commented Mar 6, 2016 at 20:57
@Veedrac The pointer would need to be incremented by 4000 inside the inner loop: p += 4000 instead of p++ – fishinear, Commented Mar 7, 2016 at 8:46
Why would the compiler find that a problem? i is already incremented by a non-unit value, given it's a pointer increment. – Veedrac, Commented Mar 7, 2016 at 11:16
I've added more explanation – fishinear, Commented Mar 7, 2016 at 14:55
Try typing int *f(int *p) { *p++ = 10; return p; } int *g(int *p) { *p = 10; p += 4000; return p; } into gcc.godbolt.org. The two seem to compile basically the same. – Veedrac, Commented Mar 7, 2016 at 17:13

Nicolas Modrzyk · Accepted Answer · 2012-03-30 02:29:24Z

This line is the culprit:

x[j][i]=i+j;

The second version accesses contiguous memory and will therefore be substantially faster.

I tried with

x[50000][50000];

and the time of execution is 13s for version1 versus 0.6s for version2.

Sebastian Mach · Accepted Answer · 2012-03-30 15:20:15Z

(I try to give a generic answer.)

Because i[y][x] is a shorthand for *(i + y*array_width + x) in C (try out the classy int P[3]; 0[P] = 0xBEEF;).
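The subscript trick mentioned above can be checked with a short sketch. It assumes a small 3x4 array, and the names HEIGHT, WIDTH and subscripts_agree are invented for the example. In C, E1[E2] is defined as *(E1 + E2), so the operands can be swapped without changing the result.

```c
#define HEIGHT 3
#define WIDTH 4

/* Reads a[y][x] three equivalent ways; returns 1 if they all agree. */
int subscripts_agree(int a[HEIGHT][WIDTH], int y, int x) {
    int via_index   = a[y][x];
    int via_pointer = *(*(a + y) + x);   /* what a[y][x] means by definition */
    int via_swapped = x[y[a]];           /* operands flipped, same element */
    return via_index == via_pointer && via_pointer == via_swapped;
}
```

The swapped form is only a curiosity, but it shows that indexing is plain pointer arithmetic, which is why the memory distance between consecutive accesses depends on which index the inner loop changes.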

As you iterate over y, you iterate over chunks of size array_width * sizeof(array_element). If you have that in your inner loop, then you will have array_width * array_height iterations over those chunks.

By flipping the order, you will have only array_height chunk-iterations, and between any chunk-iteration, you will have array_width iterations of only sizeof(array_element).

While on really old x86 CPUs this did not matter much, today's x86 CPUs do a lot of prefetching and caching of data. You probably produce many cache misses in your slower iteration order.
