Scala parallel collections

Question

I am very naively trying to use Scala .par, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that?

Note: the question is not to make this faster, but to understand why this naive use of .par doesn't yield an immediate speed-up.

Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20s. The second one I killed after 3 minutes. Not even close. If I let it run longer I get into a Java heap space exception.

def pi_random(N: Long): Double = {
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_parallel(N: Long): Double = {
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

Side-note: change the name of the variable to n. Initial capitals are understood to be class or type names in Scala. — Bob Dalgleish, Commented Apr 12, 2018 at 18:41

Joe K · Accepted Answer · 2018-04-12 16:56:57Z

Hard to know for sure without doing some actual profiling, but I have two theories:

First, you may be losing some benefits of the Range class, specifically near-zero memory usage. When you do (0L until N * N), you create a Range object, which is lazy. It does not actually create any object holding every single number in the range. Neither does map, I think. And sum calculates and adds numbers one at a time, so also allocates barely any memory.

I'm not sure the same is all true about ParRange. Seems like it would have to allocate some amount per split, and after map is called, perhaps it might have to store some amount of intermediate results in memory as "neighboring" splits wait for the other to complete. Especially the heap space exception makes me think something like this is the case. So you'll lose a lot of time to GC and such.

Second, probably the calls to rng.nextDouble are by far the most expensive part of that inner function. But I believe both java and scala Random classes are essentially single-threaded. They synchronize and block internally. So you won't gain that much from parallelism anyway, and in fact lose some to overhead.

From the docs: "Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs." — Bob Dalgleish, Commented Apr 12, 2018 at 18:45
@JoeK If by the last paragraph you meant that the Random is the bottleneck and parallelization won't help, this is not the case. Parallelizing it with coarser granularity easily gives speedup of an order of magnitude. In general, experiments that involve sampling of independent random variables parallelize quite nicely. — Andrey Tyukin, Commented Apr 13, 2018 at 3:44

Andrey Tyukin · Accepted Answer · 2018-05-27 23:58:02Z

There is not enough work per task, the task granularity is too fine-grained.

Creating each task requires some overhead:

Some object representing the task must be created
It must be ensured that only one thread executes one task at a time
In the case that some threads become idle, some job-stealing procedure must be invoked.

For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers and performs some basic arithmetic and an if-branch. The overhead of creating a task is not comparable to the work that each task is doing.

The tasks must be much larger, so that each thread has enough work to do. Furthermore, it's probably faster if you make each RNG thread local, so that the threads can do their job in parallel, without permanently locking the default random number generator.

Here is an example:

import scala.util.Random

def pi_random(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_parallel(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .par
    .map { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}


def pi_random_properly(n: Long): Double = {
  val count = (0L until n).par.map { _ =>
    val rng = ThreadLocalRandom.current
    var sum = 0
    var idx = 0
    while (idx < n) {
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y <= 1.0) sum += 1
      idx += 1
    }
    sum
  }.sum
  4 * count.toDouble / (n * n)
}

Here is a little demo and timings:

def measureTime[U](repeats: Long)(block: => U): Double = {
  val start = System.currentTimeMillis

  var iteration = 0
  while (iteration < repeats) {
    iteration += 1
    block
  }

  val end = System.currentTimeMillis
  (end - start).toDouble / repeats
}

// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))

// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential:  " + measureTime(Reps)(pi_random(N)))
println("Naive:       " + measureTime(Reps)(pi_random_parallel(N)))
println("My proposal: " + measureTime(Reps)(pi_random_properly(N)))

Output:

3.141333
3.143418
3.14142
Sequential: 621.7
Naive:      3032.6
My version: 44.7

Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).

I couldn't test it with N = 10000, because the naively parallelized version crashed everything with an "GC overhead exceeded"-error, which also illustrates that the overhead for creating the tiny tasks is too large.

In my implementation, I've additionaly unrolled the inner while: you need only one counter in one register, no need to create a huge collection by mapping on the range.

Edit: Replaced everything by ThreadLocalRandom, it now shouldn't matter whether your compiler versions supports SAM or not, so it should work with earlier versions of 2.11 too.

I couldn't run your code. It gives: Error:(41, 53) type mismatch; found : () => scala.util.Random required: java.util.function.Supplier[_ <: scala.util.Random] val rngs = ThreadLocal.withInitial[Random] { () => new Random }` — Frank, Commented Apr 13, 2018 at 2:06
@Frank I've added a version with explicit new java.util.function.Supplier, should work for 2.11 too. — Andrey Tyukin, Commented Apr 13, 2018 at 2:17
-2 now, that's getting interesting? If someone spotted a bug: constructive criticism still welcome. — Andrey Tyukin, Commented May 28, 2018 at 3:46

Collectives™ on Stack Overflow

Scala parallel collections

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
scala
concurrency
scala-collections
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Not the answer you're looking for? Browse other questions tagged scalaconcurrencyscala-collections or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
scala
concurrency
scala-collections
or ask your own question.