1

This double loop is 50 times slower in Chez Scheme than in C++ (compiled with --optimize-level 3 and -O3, respectively)

(import
  (rnrs)
  (rnrs r5rs))


(let* ((n (* 1024 16))
       (a (make-vector n))
       (acc 0))

  (do ((i 0 (+ i 1)))
    ((= i n) #f)
    (vector-set! a i (cons (cos i) (sin i))))

  (do ((i 0 (+ i 1)))
    ((= i n) #f)
    (do ((j 0 (+ j 1)))
      ((= j n) #f)
      (let ((ai (vector-ref a i))
            (aj (vector-ref a j)))
        (set! acc (+ acc (+ (* (car ai) (cdr aj))
                            (* (cdr ai) (car aj))))))))

  (write acc)
  (newline))

(exit)

vs

#include <iostream>
#include <cmath>
#include <vector>
#include <algorithm>

typedef std::pair<double, double> pr;

typedef std::vector<pr> vec;

double loop(const vec& a)
{
    double acc = 0;
    const int n = a.size();

    for(int i = 0; i < n; ++i)
        for(int j = 0; j < n; ++j)
        {
            const pr& ai = a[i];
            const pr& aj = a[j];
            acc += ai .first * aj.second + 
                   ai.second * aj .first;
        }

    return acc;
}

int main()
{
    const int n = 1024 * 16;

    vec v(n);

    for(int i = 0; i < n; ++i)
        v[i] = pr(std::cos(i), std::sin(i));

    std::cout << loop(v) << std::endl;
}

I realize that there is more memory indirection in Scheme than in C++ here, but still the performance difference is surprising...

Is there a simple way to speed up the Scheme version? (Without changing the memory layout to something totally unidiomatic)

1 Answer 1

2

So while these programs do look the same they are not the same. You are using fixnum arithmetic in the C version while the Scheme version uses the standard numeric tower. To make a C version more like Scheme try using a bignum library for your calculations.

As a test I replaced the arithmetic with (rnrs arithmetic flonums) and (rnrs arithmetic fixnums) and it halved the execution time in DrRacket. I expect the same to happen in any implementation.

Now my initial tests showed that the C code executed about 25 times faster and not 50 as expected and by changing to floating point arithmetic I'm down to C being about 15 times faster.

I think I can make it even faster by using unsafe procedures, since Scheme does check the type of each argument at runtime it does operations before each procedure which doesn't happen in the C version. As a test I changed it to use unsafe procedures in my implementation and now it's only 10 times slower.

Hope it helps also in Chez :)

EDIT

Here is my modified source which improves the speed 2 times:

#!r6rs
(import
 (rnrs)
 ;; import the * and + that only work on floats (which are faster, but they still check their arguments)
 (only (rnrs arithmetic flonums) fl+ fl*))


(let* ((n (* 1024 16))
       (a (make-vector n))
       (acc 0.0)) ; We want float, lets tell Scheme about that!

  ;; using inexact f instead of integer i
  ;; makes every result of cos and sin inexact
  (do ((i 0 (+ i 1))
       (f 0.0 (+ f 1)))
    ((= i n) #f)
    (vector-set! a i (cons (cos f) (sin f))))

  (do ((i 0 (+ i 1)))
    ((= i n) #f)
    (do ((j 0 (+ j 1)))
      ((= j n) #f)
      (let ((ai (vector-ref a i))
            (aj (vector-ref a j)))
        ;; use float versions of + and *
        ;; since this is where most of the time is used
        (set! acc (fl+ acc
                       (fl+ (fl* (car ai) (cdr aj))
                            (fl* (cdr ai) (car aj))))))))

  (write acc)
  (newline))

And the implementation specific (lock in) just to tell that the type checking done in runtime does have an impact this code runs 30% faster than the previous optimization:

#lang racket

;; this imports import the * and + for floats as unsafe-fl* etc. 
(require racket/unsafe/ops)

(let* ((n (* 1024 16))
       (a (make-vector n))
       (acc 0.0)) ; We want float, lets tell Scheme about that!

  (do ((i 0 (+ i 1))
       (f 0.0 (+ f 1)))
    ((= i n) #f)
    ;; using inexact f instead of integer i
    ;; makes every result of cos and sin inexact
    (vector-set! a i (cons (cos f) (sin f))))

  (do ((i 0 (+ i 1)))
    ((= i n) #f)
    (do ((j 0 (+ j 1)))
      ((= j n) #f)
      ;; We guarantee argument is a vector
      ;; and nothing wrong will happen using unsafe accessors
      (let ((ai (unsafe-vector-ref a i))
            (aj (unsafe-vector-ref a j)))
        ;; use unsafe float versions of + and *
        ;; since this is where most of the time is used
        ;; also use unsafe car/cdr as we guarantee the argument is
        ;; a pair.
        (set! acc (unsafe-fl+ acc
                              (unsafe-fl+ (unsafe-fl* (unsafe-car ai) (unsafe-cdr aj))
                                          (unsafe-fl* (unsafe-cdr ai) (unsafe-car aj))))))))

  (write acc)
  (newline))

I have made an effort to keep the style of the original code. It's not very idiomatic Scheme. Eg. I would't have used set! at all, but it does not affect the speed.

5
  • Are you using -O3 with g++?
    – MWB
    Commented Sep 27, 2018 at 21:52
  • @MaxB Yes. C ran in 0.25s and compiled Scheme in 2.5s while your original code ran in about 6.5s (I took the average of 20 calls using time)
    – Sylwester
    Commented Sep 27, 2018 at 23:30
  • Could you paste your Scheme code into the answer? I'm assuming you changed all of the arithmetic function calls to their non-generic versions?
    – MWB
    Commented Sep 28, 2018 at 2:01
  • @MaxB Of course. The last is not Scheme though, but it does not represent the most speed improvement either.
    – Sylwester
    Commented Sep 28, 2018 at 16:05
  • The most direct analogy of C++ std::vector is not scheme vector (which is heterogenous) but scheme bytevector. The most exact comparison for speed would compare std::vector with the fixed-sized integer R6RS bytevector operations. That will almost certainly be faster, because it avoids the pointer indirection present in scheme vector.
    – Chris Vine
    Commented Sep 29, 2018 at 9:02

Not the answer you're looking for? Browse other questions tagged or ask your own question.