78
$\begingroup$

This question is an extension of two discussions that came up recently in the replies to "C++ vs Fortran for HPC". And it is a bit more of a challenge than a question...

One of the most often-heard arguments in favor of Fortran is that the compilers are just better. Since most C and Fortran compilers share the same back end, the code generated for semantically equivalent programs in both languages should be identical. One could argue, however, that one language is easier for the compiler to optimize than the other.

So I decided to try a simple test: I got a copy of daxpy.f and daxpy.c and compiled them with gfortran/gcc.

Now daxpy.c is just an f2c translation of daxpy.f (automatically generated code, ugly as heck), so I took that code and cleaned it up a bit (meet daxpy_c), which basically meant re-writing the innermost loop as

for ( i = 0 ; i < n ; i++ )
    dy[i] += da * dx[i];

Finally, I re-wrote it (enter daxpy_cvec) using gcc's vector syntax:

#define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
vector(2,double) va = { da , da }, *vx, *vy;

vx = (void *)dx; vy = (void *)dy;
for ( i = 0 ; i < (n/2 & ~1) ; i += 2 ) {
    vy[i] += va * vx[i];
    vy[i+1] += va * vx[i+1];
    }
for ( i = n & ~3 ; i < n ; i++ )
    dy[i] += da * dx[i];

Note that I use vectors of length 2 (that's all SSE2 allows) and that I process two vectors at a time. This is because on many architectures, we may have more multiplication units than we have vector elements.
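
For completeness, the whole kernel as a self-contained function looks roughly like the sketch below (the actual daxpy_cvec is in the files linked further down and may differ in details):

#define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type

/* Sketch of the vectorized kernel: two SSE2 vectors of two doubles each are
   processed per iteration, and a scalar loop handles the last n mod 4
   elements. */
void daxpy_cvec(int n, double da, double *dx, double *dy) {
    vector(2, double) va = { da, da }, *vx, *vy;
    int i;
    vx = (void *)dx;
    vy = (void *)dy;
    for (i = 0; i < (n/2 & ~1); i += 2) {
        vy[i]   += va * vx[i];
        vy[i+1] += va * vx[i+1];
    }
    for (i = n & ~3; i < n; i++)
        dy[i] += da * dx[i];
}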

All codes were compiled using gfortran/gcc version 4.5 with the flags "-O3 -Wall -msse2 -march=native -ffast-math -fomit-frame-pointer -malign-double -fstrict-aliasing". On my laptop (Intel Core i5 CPU, M560, 2.67GHz) I got the following output:

pedro@laika:~/work/fvsc$ ./test 1000000 10000
timing 1000000 runs with a vector of length 10000.
daxpy_f took 8156.7 ms.
daxpy_f2c took 10568.1 ms.
daxpy_c took 7912.8 ms.
daxpy_cvec took 5670.8 ms.

So the original Fortran code takes a bit more than 8.1 seconds, the automatic translation thereof takes 10.6 seconds, the naive C implementation does it in 7.9, and the explicitly vectorized code does it in 5.7, clearly the fastest.

That's Fortran being slightly slower than the naive C implementation and roughly 45% slower than the vectorized C implementation.

So here's the question: I'm a native C programmer, and so I'm quite confident that I did a good job on that code, but the Fortran code was last touched in 1993 and might therefore be a bit out of date. Since I don't feel as comfortable coding in Fortran as others here may, can anyone do a better job, i.e. make the Fortran version more competitive with either of the two C versions?

Also, can anybody try this test with icc/ifort? The vector syntax probably won't work, but I would be curious to see how the naive C version behaves there. Same goes for anybody with xlc/xlf lying around.

I've uploaded the sources and a Makefile here. To get accurate timings, set CPU_TPS in test.c to the number of Hz on your CPU. If you find any improvements to any of the versions, please do post them here!

Update:

I've added stali's test code to the files online and supplemented it with a C version. I modified the programs to do 1'000'000 loops on vectors of length 10'000 to be consistent with the previous test (and because my machine couldn't allocate vectors of length 1'000'000'000, as in stali's original code). Since the numbers are now a bit smaller, I used the option -par-threshold:50 to make the compiler more likely to parallelize. The icc/ifort version used is 12.1.2 20111128, and the results are as follows:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./icctest_c
3.27user 0.00system 0:03.27elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./icctest_f
3.29user 0.00system 0:03.29elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./icctest_c
4.89user 0.00system 0:02.60elapsed 188%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./icctest_f
4.91user 0.00system 0:02.60elapsed 188%CPU

In summary, the results are, for all practical purposes, identical for both the C and Fortran versions, and both codes parallelize automagically. Note that the fast times compared to the previous test are due to the use of single-precision floating point arithmetic!
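
For reference, the C version (icctest.c in the files online) boils down to a plain single-precision axpy loop plus a bit of I/O so the compiler can't optimize everything away; a trimmed-down sketch, with illustrative names rather than the verbatim file:

#include <stdio.h>
#include <stdlib.h>

/* Plain single-precision axpy loop; whether it gets auto-parallelized
   (e.g. with icc -fast -parallel) is up to the compiler, not the language. */
void saxpy(int n, float a, const float *x, float *y) {
    int i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    int i, runs = 1000000, n = 10000;
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    if (!x || !y)
        return 1;
    for (i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    for (i = 0; i < runs; i++)
        saxpy(n, 5.0f, x, y);
    printf("y[0] = %g\n", y[0]);  /* keep the result live */
    free(x);
    free(y);
    return 0;
}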

Update:

Although I don't really like where the burden of proof is going here, I've re-coded stali's matrix multiplication example in C and added it to the files on the web. Here are the results of the triple loop for one and two CPUs:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 2500
 triple do time   3.46421700000000     
3.63user 0.06system 0:03.70elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_c 2500
triple do time 3.431997791385768
3.58user 0.10system 0:03.69elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 2500
 triple do time   5.09631900000000     
5.26user 0.06system 0:02.81elapsed 189%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_c 2500
triple do time 2.298916975280899
4.78user 0.08system 0:02.62elapsed 184%CPU

Note that cpu_time in Fortran measures CPU time and not wall-clock time, so I wrapped the calls in time to compare them for 2 CPUs. There is no real difference between the results, except that the C version does a bit better on two cores.
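
For reference, the triple loop in the C version is just the textbook product over flat arrays, roughly as in the sketch below (the actual mm_test_c is in the files online; names here are illustrative):

/* Naive triple-loop product C = A*B for n-by-n matrices stored as flat
   row-major arrays. */
void matmul_triple(int n, const double *a, const double *b, double *c) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double s = 0.0;
            for (k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];
            c[i*n + j] = s;
        }
}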

Now for the matmul command, of course only in Fortran as this intrinsic is not available in C:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 2500
 matmul    time   23.6494780000000     
23.80user 0.08system 0:23.91elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 2500
 matmul    time   26.6176640000000     
26.75user 0.10system 0:13.62elapsed 197%CPU

Wow. That's absolutely terrible. Can anyone either find out what I'm doing wrong, or explain why this intrinsic is still somehow a good thing?

I didn't add the dgemm calls to the benchmark as they are library calls to the same function in the Intel MKL.

For future tests, can anyone suggest an example known to be slower in C than in Fortran?

Update

To verify stali's claim that the matmul intrinsic is "an order of magnitude" faster than the explicit matrix product on smaller matrices, I modified his own code to multiply matrices of size 100x100 using both methods, 10'000 times each. The results, on one and two CPUs, are as follows:

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=1 time ./mm_test_f 10000 100
 matmul    time   3.61222500000000     
 triple do time   3.54022200000000     
7.15user 0.00system 0:07.16elapsed 99%CPU

pedro@laika:~/work/fvsc$ OMP_NUM_THREADS=2 time ./mm_test_f 10000 100
 matmul    time   4.54428400000000     
 triple do time   4.31626900000000     
8.86user 0.00system 0:04.60elapsed 192%CPU

Update

Grisu is correct in pointing out that, without optimizations, gcc converts operations on complex numbers to library function calls while gfortran inlines them in a few instructions.

The C compiler will generate the same compact code if the option -fcx-limited-range is set, i.e. if the compiler is instructed to ignore potential over/underflows in the intermediate values. This behaviour is effectively the default in gfortran and may lead to incorrect results. Forcing -fno-cx-limited-range in gfortran didn't change anything.

So this is actually an argument against using gfortran for numerical calculations: operations on complex values may over/underflow even though the correct results are within the floating-point range. This behaviour is permitted by the Fortran standard. In gcc, and in C99 in general, the default is to do things strictly (read: IEEE-754 compliant) unless specified otherwise.
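
A minimal test case to see the difference (a sketch along the lines of the complex.c added to the files online; the operands are read at run time so the product cannot be constant-folded):

#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

/* Compile once with "gcc -O2 -S" and once with "gcc -O2 -fcx-limited-range -S"
   (or -ffast-math) and compare the assembly: without the flag the generated
   code still contains a fallback call to the libgcc helper __muldc3 for the
   inf/nan cases, with it the multiplication is fully inlined, which is what
   gfortran does by default. */
int main(int argc, char *argv[]) {
    double complex a, b, c;
    if (argc < 5)
        return 1;
    a = atof(argv[1]) + atof(argv[2]) * I;
    b = atof(argv[3]) + atof(argv[4]) * I;
    c = a * b;
    printf("%g + %g*I\n", creal(c), cimag(c));
    return 0;
}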

Reminder: Please keep in mind that the main question was whether Fortran compilers produce better code than C compilers. This is not the place for discussions of the general merits of one language over another. What I would really be interested in is whether anybody can find a way of coaxing gfortran to produce a daxpy as efficient as the explicitly vectorized C version, as this exemplifies the problem of having to rely exclusively on the compiler for SIMD optimization, or a case in which a Fortran compiler outdoes its C counterpart.

$\endgroup$
24
  • $\begingroup$ One timing issue is that if your processor does frequency stepping/turbo mode, these results could be all over the map. $\endgroup$
    – Bill Barth
    Commented Feb 4, 2012 at 21:37
  • 1
    $\begingroup$ Your daxpy_c.c is currently updating x with a multiple of x and not touching y at all. You may want to fix that to make it fair... $\endgroup$ Commented Feb 4, 2012 at 22:28
  • 1
    $\begingroup$ @JackPoulson: Good catch, fixed and updated the results. $\endgroup$
    – Pedro
    Commented Feb 4, 2012 at 22:45
  • 2
    $\begingroup$ Also, I am fairly certain that the difference is completely due to the manual unrolling in the Fortran version confusing the compiler. When I replace it with the same simple loop that you put into your C version, the performance between the two is almost identical. Without the change, the Fortran version was slower with the Intel compilers. $\endgroup$ Commented Feb 4, 2012 at 22:45
  • 1
    $\begingroup$ @permeakra: Actually, the C99 standard specifies the restrict keyword which tells the compiler exactly that: to assume that an array does not overlap with any other data structure. $\endgroup$
    – Pedro
    Commented Dec 11, 2012 at 9:45

5 Answers

38
$\begingroup$

The difference in your timings seems to be due to the manual unrolling of the unit-stride Fortran daxpy. The following timings are on a 2.67 GHz Xeon X5650, using the command

./test 1000000 10000

Intel 11.1 compilers

Fortran with manual unrolling: 8.7 sec
Fortran w/o manual unrolling: 5.8 sec
C w/o manual unrolling: 5.8 sec

GNU 4.1.2 compilers

Fortran with manual unrolling: 8.3 sec
Fortran w/o manual unrolling: 13.5 sec
C w/o manual unrolling: 13.6 sec
C with vector attributes: 5.8 sec

GNU 4.4.5 compilers

Fortran with manual unrolling: 8.1 sec
Fortran w/o manual unrolling: 7.4 sec
C w/o manual unrolling: 8.5 sec
C with vector attributes: 5.8 sec

Conclusions

  • Manual unrolling helped the GNU 4.1.2 Fortran compiler on this architecture, but hurt the newer version (4.4.5) and the Intel Fortran compiler.
  • The GNU 4.4.5 C compiler is much more competitive with Fortran than version 4.1.2 was.
  • Vector intrinsics allow GCC's performance to match that of the Intel compilers.

Time to test more complicated routines like dgemv and dgemm?
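
For readers who haven't looked at the reference code: the unit-stride branch of daxpy.f unrolls the loop by hand four times; in C it corresponds roughly to the sketch below.

/* Rough C rendering of the unrolled unit-stride branch of the reference
   daxpy.f: a short clean-up loop for the first n mod 4 elements, followed by
   a loop unrolled by hand four times. */
void daxpy_unrolled(int n, double da, const double *dx, double *dy) {
    int i, m = n % 4;
    for (i = 0; i < m; i++)
        dy[i] += da * dx[i];
    for (i = m; i < n; i += 4) {
        dy[i]   += da * dx[i];
        dy[i+1] += da * dx[i+1];
        dy[i+2] += da * dx[i+2];
        dy[i+3] += da * dx[i+3];
    }
}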

$\endgroup$
10
  • $\begingroup$ Thanks for the results! What version of gcc were you using and can you be a bit more specific regarding the CPU? $\endgroup$
    – Pedro
    Commented Feb 4, 2012 at 23:25
  • 2
    $\begingroup$ Your compiler is older than your CPU... Can you try with gcc-4.5? $\endgroup$
    – Pedro
    Commented Feb 4, 2012 at 23:39
  • 1
    $\begingroup$ I just tried it. The vectorized version with GCC 4.4.5 exactly matches the Intel 11.1 results. $\endgroup$ Commented Feb 5, 2012 at 0:05
  • 1
    $\begingroup$ I just installed gcc/gfortran version 4.4.5 and I can't reproduce the differences w/o unrolling. In fact, in the assembler generated for both cases, the innermost loop is identical except for the register names used, which are interchangeable. Can you re-run your tests just to be sure? $\endgroup$
    – Pedro
    Commented Feb 5, 2012 at 11:27
  • 7
    $\begingroup$ Can we say this kind of settles the age old debate "we keep using fortran because it's more performant", so that we can finally throw it in the dumpster ? $\endgroup$ Commented Feb 17, 2012 at 18:57
18
$\begingroup$

I'm coming late to this party, so it's hard for me to follow all the back-and-forth above. The question is big, and I think if you are interested it could be broken up into smaller pieces. One thing I got interested in was simply the performance of your daxpy variants, and whether Fortran is slower than C on this very simple code.

Running both on my laptop (MacBook Pro, Intel Core i7, 2.66 GHz), the relative performance of your hand-vectorized C version and the non-hand-vectorized Fortran version depends on the compiler used (with your own options):

Compiler     Fortran time     C time
GCC 4.6.1    5408.5 ms        5424.0 ms
GCC 4.5.3    7889.2 ms        5532.3 ms
GCC 4.4.6    7735.2 ms        5468.7 ms

So it just seems that GCC became better at vectorizing the loop in the 4.6 branch than it was before.


On the overall debate, I think one can pretty much write fast and optimized code in both C and Fortran, almost as in assembly language. I will point out one thing, however: just as assembler is more tedious to write than C but gives you finer control over what is executed by the CPU, C is more low-level than Fortran. Thus, it gives you more control over details, which can help with optimization where the standard Fortran syntax (or its vendor extensions) may lack functionality. One case is the explicit use of vector types; another is the possibility of specifying the alignment of variables by hand, something Fortran is incapable of.
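
To make that last point concrete, here is a small, purely illustrative example of requesting a specific alignment in C, both for a static array and for a heap allocation; the names and the 16-byte SSE2 alignment are arbitrary choices:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Static array aligned to 16 bytes so SSE2 loads/stores can be aligned. */
double buf[1024] __attribute__((aligned(16)));

/* Heap allocation with a guaranteed 16-byte alignment via posix_memalign. */
double *alloc_aligned(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return p;
}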

$\endgroup$
1
  • $\begingroup$ welcome to scicomp! I agree that compiler versions are as important as language in this case. Did you mean 'of' instead of 'off' in your last sentence? $\endgroup$ Commented Feb 28, 2012 at 13:10
9
$\begingroup$

The way I would write AXPY in Fortran is slightly different. It is the exact translation of the math.

m_blas.f90

 module blas

   interface axpy
     module procedure saxpy,daxpy
   end interface

 contains

   subroutine daxpy(x,y,a)
     implicit none
     real(8) :: x(:),y(:),a
     y=a*x+y
   end subroutine daxpy

   subroutine saxpy(x,y,a)
     implicit none
     real(4) :: x(:),y(:),a
     y=a*x+y
   end subroutine saxpy

 end module blas

Now let's call the above routine in a program.

test.f90

 program main

   use blas
   implicit none

   real(4), allocatable :: x(:),y(:)
   real(4) :: a
   integer :: n

   n=1000000000
   allocate(x(n),y(n))
   x=1.0
   y=2.0
   a=5.0
   call axpy(x,y,a)
   deallocate(x,y)

 end program main

Now let's compile and run it ...

login1$ ifort -fast -parallel m_blas.f90 test.f90
ipo: remark #11000: performing multi-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_iforttdqZSA.o

login1$ export OMP_NUM_THREADS=1
login1$ time ./a.out 
real    0m4.697s
user    0m1.972s
sys     0m2.548s

login1$ export OMP_NUM_THREADS=2
login1$ time ./a.out 
real    0m2.657s
user    0m2.060s
sys     0m2.744s

Notice that I am not using any loops or any explicit OpenMP directives. Would this be possible in C (that is, no use of loops and auto-parallelization)? I don't use C so I don't know.

$\endgroup$
5
  • $\begingroup$ Automatic parallelisation is a feature of the Intel compilers (both Fortran and C), and not of the language. Hence the equivalent in C should also parallelize. Just out of curiosity, how does it perform for a more moderate n=10000? $\endgroup$
    – Pedro
    Commented Feb 4, 2012 at 22:54
  • 4
$\begingroup$ That was the whole point. Autopar is easier in Fortran due to the fact that Fortran (unlike C) supports whole-array operations like matmul, transpose, etc. So code optimization is easier for Fortran compilers. GFortran (which you have used) doesn't have the developer resources to optimize the Fortran compiler, as their focus is currently on implementing the Fortran 2003 standard rather than on optimization. $\endgroup$
    – stali
    Commented Feb 4, 2012 at 22:59
  • $\begingroup$ Uhmm... The Intel C/C++ compiler icc also does automatic parallelization. I've added a file icctest.c to the other sources. Can you compile it with the same options as you used above, run it, and report the timings? I had to add a printf-statement to my code to avoid gcc optimizing everything out. This is just a quick hack and I hope it is bug-free! $\endgroup$
    – Pedro
    Commented Feb 4, 2012 at 23:17
$\begingroup$ I've downloaded the latest icc/ifort compilers and done the tests myself. The question has been updated to include these new results, i.e. that Intel's autovectorization works in both Fortran and C. $\endgroup$
    – Pedro
    Commented Feb 5, 2012 at 14:56
  • 1
$\begingroup$ Thanks. Yes I noticed that there is little difference, perhaps because the loops are simple and the operations are Level 1 BLAS. But as I said before, due to Fortran's ability to perform whole-array operations and the use of keywords such as PURE/ELEMENTAL, there is more room for compiler optimization. How the compilers use this information and what they really do with it is a different thing. You can also try matmul if you want bpaste.net/show/23035 $\endgroup$
    – stali
    Commented Feb 5, 2012 at 15:29
6
$\begingroup$

I think it is not only interesting how a compiler optimizes code for modern hardware; the code generation itself can differ a lot, especially between GNU C and GNU Fortran.

So let's consider another example to show the differences between them.

When using complex numbers, the GNU C compiler produces large overhead for nearly every basic arithmetic operation on a complex number, while the Fortran compiler gives much better code. Let's have a look at the following small example in Fortran:

COMPLEX*16 A,B,C
C=A*B

gives (gfortran -g -o complex.f.o -c complex.f95; objdump -d -S complex.f.o):

C=A*B
  52:   dd 45 e0                fldl   -0x20(%ebp)
  55:   dd 45 e8                fldl   -0x18(%ebp)
  58:   dd 45 d0                fldl   -0x30(%ebp)
  5b:   dd 45 d8                fldl   -0x28(%ebp)
  5e:   d9 c3                   fld    %st(3)
  60:   d8 ca                   fmul   %st(2),%st
  62:   d9 c3                   fld    %st(3)
  64:   d8 ca                   fmul   %st(2),%st
  66:   d9 ca                   fxch   %st(2)
  68:   de cd                   fmulp  %st,%st(5)
  6a:   d9 ca                   fxch   %st(2)
  6c:   de cb                   fmulp  %st,%st(3)
  6e:   de e9                   fsubrp %st,%st(1)
  70:   d9 c9                   fxch   %st(1)
  72:   de c2                   faddp  %st,%st(2)
  74:   dd 5d c0                fstpl  -0x40(%ebp)
  77:   dd 5d c8                fstpl  -0x38(%ebp)

That is 39 bytes of machine code. When we consider the same in C

 double complex a,b,c; 
 c=a*b; 

and take a look at the output (obtained in the same way as above), we get:

  41:   8d 45 b8                lea    -0x48(%ebp),%eax
  44:   dd 5c 24 1c             fstpl  0x1c(%esp)
  48:   dd 5c 24 14             fstpl  0x14(%esp)
  4c:   dd 5c 24 0c             fstpl  0xc(%esp)
  50:   dd 5c 24 04             fstpl  0x4(%esp)
  54:   89 04 24                mov    %eax,(%esp)
  57:   e8 fc ff ff ff          call   58 <main+0x58>
  5c:   83 ec 04                sub    $0x4,%esp
  5f:   dd 45 b8                fldl   -0x48(%ebp)
  62:   dd 5d c8                fstpl  -0x38(%ebp)
  65:   dd 45 c0                fldl   -0x40(%ebp)
  68:   dd 5d d0                fstpl  -0x30(%ebp)

That is 39 bytes of machine code too, but the function that the call at offset 57 refers to does the actual work and performs the desired operation. So we need 27 bytes of machine code just to set up and issue the call for the multiplication. The function behind the call is __muldc3, provided by libgcc_s.so, and has a footprint of 1375 bytes of machine code. This slows down the code dramatically and gives an interesting picture when using a profiler.
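
For reference, the helper takes the real and imaginary parts of the two operands and returns the full product; its prototype is roughly:

/* libgcc helper for complex double multiplication (it lives in libgcc);
   prototype shown for reference only. */
double _Complex __muldc3(double a, double b, double c, double d);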

If we implement the BLAS example above as zaxpy and perform the same test, the Fortran compiler should give better results than the C compiler.

(I used GCC 4.4.3 for this experiment, but I noticed this behavior in other GCC releases too.)

So, in my opinion, when we think about which compiler is better we should not only consider parallelization and vectorization; we also have to look at how basic operations are translated to assembler code. If this translation produces bad code, the optimizer can only work with that bad code as input.

$\endgroup$
5
  • 2
    $\begingroup$ I just cooked-up an example along the lines of your code, complex.c and added it to the code online. I had to add all the input/output to make sure nothing is optimized out. I only get a call to __muldc3 if I don't use -ffast-math. With -O2 -ffast-math I get 9 lines of inlined assembler. Can you confirm this? $\endgroup$
    – Pedro
    Commented Feb 6, 2012 at 18:21
  • $\begingroup$ I've found a more specific cause for the difference in the generated assembler and have added this to my question above. $\endgroup$
    – Pedro
    Commented Feb 6, 2012 at 19:01
  • $\begingroup$ Using -O2 leads the compiler to compute everything it can in advance, which is why such constructs are sometimes lost. The -ffast-math option should not be used in scientific computing when you want to rely on the outputs. $\endgroup$ Commented Feb 7, 2012 at 8:25
  • 2
    $\begingroup$ Well, by that argument (no -ffast-math) you shouldn't use Fortran for your complex-valued computations. As I describe in the update to my question, -ffast-math or, more generally, -fcx-limited-range, forces gcc to use the same non-IEEE, restricted range computations as are standard in Fortran. So if you want the full range of complex values and correct Infs and NaNs, you shouldn't use Fortran... $\endgroup$
    – Pedro
    Commented Feb 7, 2012 at 9:04
  • 2
    $\begingroup$ @Pedro: If you want GCC to behave like GFortran wrt. complex multiplication and division, you should use the -fcx-fortran-rules. $\endgroup$
    – janneb
    Commented Mar 2, 2012 at 13:50
5
$\begingroup$

Folks,

I found this discussion very interesting, but I was surprised to see that re-ordering the loops in the matmul example changed the picture. I don't have an Intel compiler available on my current machine, so I am using gfortran, but rewriting the loops in mm_test.f90 as

call cpu_time(start)
do r=1,runs
   mat_c=0.0d0
   do j=1,n
      do k=1,n
         do i=1,n
            mat_c(i,j)=mat_c(i,j)+mat_a(i,k)*mat_b(k,j)
         end do
      end do
   end do
end do
call cpu_time(finish)

changed the results entirely on my machine.

The previous version timing results were:

#time ./mm_test_f 10000 100
 matmul    time   6.3620000000000001     
 triple do time   21.420999999999999     

whereas the triple loops re-arranged as above yielded:

#time ./mm_test_f 10000 100
 matmul    time   6.3929999999999998     
 triple do time   3.9190000000000005    

This is gcc/gfortran 4.7.2 20121109 on an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz.

Compiler flags used were those from the Makefile I got here...
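
For comparison, the analogous re-ordering in C goes the other way: with row-major storage the rightmost index is the contiguous one, so the cache-friendly version of the naive product keeps j innermost. A sketch with illustrative names:

/* i-k-j ordering of the naive product for row-major flat arrays: the inner
   loop then walks b and c with stride 1, the mirror image of the Fortran
   re-ordering above. Assumes c has been zeroed beforehand (as mat_c=0.0d0
   above). */
void matmul_ikj(int n, const double *a, const double *b, double *c) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            double aik = a[i*n + k];
            for (j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
}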

$\endgroup$
2
  • 6
    $\begingroup$ That's not surprising, since the matrix storage in memory favors one order, i.e., if rows are stored contiguously, it's better to loop over rows innermost, since then you can load each row once into fast local memory compared to repeatedly loading (a slice of) it to access a single element. See stackoverflow.com/questions/7395556. $\endgroup$ Commented Jul 16, 2013 at 9:27
  • $\begingroup$ I guess I was surprised that the intrinsic matmul wouldn't be coded to do things this way. It's substantially faster with the triple do loops ordered the second way. This does seem to be specific to this compiler set, as earlier gfortran versions I could get to were more "flat" in their timing: it didn't matter which way you did the multiplication, it took nearly the same time. $\endgroup$
    – Schatzi
    Commented Jul 24, 2013 at 3:42
