
This golf requires a factorial calculation be split up among multiple threads or processes.

Some languages make this easier to coordinate than others, so the challenge is language-agnostic. Ungolfed example code is provided below, but you should develop your own algorithm.

The goal of the contest is to see who can come up with the shortest (in bytes, not seconds) multicore factorial algorithm for calculating N!, as measured by votes when the contest closes. There should be a genuine multicore advantage, so we'll require that it work for N ~ 10,000. Voters should vote down if the author fails to provide a valid explanation of how the work is spread among processors/cores, and vote up based on golf conciseness.

Out of curiosity, please post some performance numbers. There may be a performance-versus-golf-score tradeoff at some point; go with golf so long as it meets the requirements. I'd be curious to know when this occurs.

You may use normally available single-core big-integer libraries. For instance, Perl is usually installed with bigint. However, note that simply calling a system-provided factorial function won't normally split the work across multiple cores.

You must accept the input N from STDIN or ARGV and output the value of N! to STDOUT. You may optionally use a second input parameter to provide the number of processors/cores to the program, so it doesn't do what you'll see below :-) Or you may design explicitly for 2, 4, or however many cores you have available.

I will post my own oddball Perl example below, previously submitted on Stack Overflow under Factorial Algorithms in Different Languages. It is not golfed. Numerous other examples were submitted there, many of them golfed but many not. Because of the share-alike licensing, feel free to use the code in any of the examples at the link above as a starting point.

Performance in my example is lackluster for a number of reasons: it uses too many processes and does too much string/bigint conversion. As I said, it is an intentionally oddball example. It calculates 5000! in under 10 seconds on a 4-core machine here. However, a more obvious two-line for loop can do 5000! on just one of the four cores in 3.6 s.
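That obvious single-core loop looks something like this (a sketch of the idea, not the exact script I timed):

#!/usr/bin/perl
use bigint;                  # arbitrary-precision integers, as in the example below
my $f = 1;
$f *= $_ for 2 .. $ARGV[0];  # plain sequential product on a single core
print "$f\n";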

You'll definitely have to do better than my forked example below:

#!/usr/bin/perl -w                                                              
use strict;
use bigint;
die "usage: f.perl N (outputs N!)" unless ($ARGV[0] > 1);
print STDOUT &main::rangeProduct(1,$ARGV[0])."\n";
sub main::rangeProduct {
    my($l, $h) = @_;
    return $l    if ($l==$h);
    return $l*$h if ($l==($h-1));
    # arghhh - multiplying more than 2 numbers at a time is too much work       
    # find the midpoint and split the work up :-)                               
    my $m = int(($h+$l)/2);
    my $pid = open(my $KID, "-|");
    if ($pid) { # parent
        my $X = &main::rangeProduct($l,$m);
        my $Y = <$KID>;
        chomp($Y);
        close($KID);
        die "kid failed" unless defined $Y;
        return $X*$Y;
    } else { # kid
        print STDOUT &main::rangeProduct($m+1,$h)."\n";
        exit(0);
    }
}

My interest in this is simply (1) alleviating boredom; and (2) learning something new. This isn't a homework or research problem for me.

Good luck!

Comments:

  • You can't count shortest code by votes, and the golfing and multi-threaded requirement seem to go poorly together. – Commented Mar 15, 2011 at 12:39
  • My ancient single core notebook can do 10000! in less than 0.2 sec in Python. – gnibbler, Mar 15, 2011 at 15:12
  • Multi-threading a CPU-bound process will almost always slow it down. All you are doing is adding overhead with little or no performance gain. Multi-threading is for I/O-wait. – mellamokb, Mar 15, 2011 at 19:15
  • @mellamokb: I beg to differ for multi-core systems. – Joey, Mar 15, 2011 at 21:44
  • @Joey: Ah. Missed that small detail :s Agreed – mellamokb, Mar 15, 2011 at 23:36

9 Answers


Mathematica

A parallel-capable function:

 f[n_, g_] := g[Product[N@i, {i, 1, n, 2}] Product[N@i, {i, 2, n, 2}]]

where g is Identity or Parallelize, depending on the type of processing required.
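For instance, an illustrative parallel evaluation for N = 10000:

 f[10000, Parallelize]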

For the timing test, we'll slightly modify the function so that it returns the elapsed wall-clock time.

f[n_, g_] := First@AbsoluteTiming[g[Product[N@i,{i,1,n,2}] Product[N@i,{i,2,n,2}]]]

And we test both modes, from 10^5 up to 9*10^5 (only two kernels here):

ListLinePlot[{Table[f[i, Identity],    {i, 100000, 900000, 100000}], 
              Table[f[i, Parallelize], {i, 100000, 900000, 100000}]}]   

Result: [plot comparing the Identity and Parallelize timings]

Comments:

  • Are you missing a ] in the first code line? It looks unbalanced. – Commented Mar 20, 2011 at 14:51
  • @Peter Thanks, the last "]" didn't make its way through the copy buffer. Corrected. – Commented Mar 20, 2011 at 15:05
  • This appears to be the shortest. It also looks like the fastest, unless I am misreading something. I no longer subscribe to Mathematica so can't verify. Thanks for participating. – Paul, Mar 31, 2011 at 5:48

Haskell: 209 200 198 177 chars

176 167 source + 33 10 compiler flag

This solution is pretty silly. It applies product in parallel to a value of type [[Integer]], where the inner lists are at most two items long. Once the outer list is down to at most 2 lists, we flatten it and take the product directly. And yes, the type checker needs something annotated with Integer, otherwise it won't compile.

import Control.Parallel.Strategies
s(x:y:z)=[[x,y::Integer]]++s z;s x=[x]
p=product
f n=p$concat$(until((<3).length)$s.parMap rseq p)$s[1..n]
main=interact$show.f.read

(Feel free to read the middle part of f between concat and s as "until I heart the length")

Things seemed like they were going to be pretty good, since parMap from Control.Parallel.Strategies makes it pretty easy to farm this out to multiple threads. However, it looks like GHC 7 requires a whopping 33 chars in command-line options and environment variables to actually enable the threaded runtime to use multiple cores (which I included in the total). Unless I am missing something, which is definitely the case. (Update: the threaded GHC runtime appears to use N-1 threads, where N is the number of cores, so there is no need to fiddle with runtime options.)

To compile:

ghc -threaded prog.hs
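If you do want to pin the thread count yourself, it can be passed to the runtime system; an illustrative invocation (depending on the GHC version, the binary may also need to be built with -rtsopts):

echo 50000 | ./prog +RTS -N2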

Runtime, however, was pretty good considering the ridiculous number of parallel evaluations being sparked and that I didn't compile with -O2. For 50000! on a dual-core MacBook, I get:

SPARKS: 50020 (29020 converted, 1925 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    0.20s  (  0.19s elapsed)
GC    time    0.12s  (  0.07s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    0.31s  (  0.27s elapsed)

Total times for a few different values; the first column is the golfed parallel version, the second the naive sequential version:

          Parallel   Sequential
 10000!      0.03s        0.04s
 50000!      0.27s        0.78s
100000!      0.74s        3.08s
500000!      7.04s       86.51s

For reference, the naive sequential version is this (which was compiled with -O2):

factorial :: Integer -> Integer
factorial n = product [1..n]
main = interact $ show.factorial.read

Comments:

  • IMO, you don't have to count the args for the compiler and interpreter. – FUZxxl, Mar 19, 2011 at 16:10
  • @FUZxxl: Normally I would agree, but this problem specifically requested that it run in multiple threads or processes, and those flags are required to make that happen (with GHC 7.0.2, from the latest Haskell Platform, at least). – user1011, Mar 20, 2011 at 1:11

Ruby - 111 + 56 = 167 characters

This is a two file script, the main file (fact.rb):

c,n=*$*.map(&:to_i)
p=(0...c).map{|k|IO.popen("ruby f2.rb #{k} #{c} #{n}")}
p p.map{|l|l.read.to_i}.inject(:*)

the extra file (f2.rb):

c,h,n=*$*.map(&:to_i)
p (c*n/h+1..(c+1)*n/h).inject(:*)

The script simply takes the number of processes and the number to calculate as arguments, splits the work into ranges that each process can compute independently, and then multiplies the results together at the end.
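An illustrative run, computing 10000! with 4 worker processes (argument order as in the script above: process count first, then N):

ruby fact.rb 4 10000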

This really shows how much slower Rubinius is than YARV:

Rubinius:

time ruby fact.rb 5 5000 #=> 61.84s

Ruby1.9.2:

time ruby fact.rb 5 50000 #=> 3.09s

(Note the extra 0)

Comments:

  • inject can take a symbol as argument, so you can save a character by using inject(:+). Here's the example from the docs: (5..10).reduce(:+). – Commented Mar 17, 2011 at 7:36
  • @Michael: Thanks :). Also just noticed that I had an 8 where there should have been a * if anyone had problems running this. – Nemo157, Mar 18, 2011 at 11:08

Java, 523 519 434 430 429 chars

import java.math.*;public class G extends Thread{BigInteger o,i,r=BigInteger.ONE,h;G g;G(BigInteger O,int
I,int n){o=O;i=new BigInteger(""+I);if(n>1)g=new G(O.subtract(r),I,n-1);h=n==I?i:r;start();}public void
run(){while(o.signum()>0){r=r.multiply(o);o=o.subtract(i);}try{g.join();r=r.multiply(g.r);}catch(Exception
e){}if(h==i)System.out.println(r);}public static void main(String[] args){new G(new BigInteger(args[0]),4,4);}}

The two 4s in the final line are the number of threads to use.

Testing 50000! with the following framework (an ungolfed version of the original, with a handful fewer bad practices, though it still has plenty) gives these times on my 4-core Linux machine:

7685ms
2338ms
1361ms
1093ms
7724ms

Note that I repeated the test with one thread for fairness, because the JIT might have warmed up.

import java.math.*;

public class ForkingFactorials extends Thread { // Bad practice!
    private BigInteger off, inc;
    private volatile BigInteger res;

    private ForkingFactorials(int off, int inc) {
        this.off = new BigInteger(Integer.toString(off));
        this.inc = new BigInteger(Integer.toString(inc));
    }

    public void run() {
        BigInteger p = new BigInteger("1");
        while (off.signum() > 0) {
            p = p.multiply(off);
            off = off.subtract(inc);
        }
        res = p;
    }

    public static void main(String[] args) throws Exception {
        int n = Integer.parseInt(args[0]);
        System.out.println(f(n, 1));
        System.out.println(f(n, 2));
        System.out.println(f(n, 3));
        System.out.println(f(n, 4));
        System.out.println(f(n, 1));
    }

    private static BigInteger f(int n, int numThreads) throws Exception {
        long now = System.currentTimeMillis();
        ForkingFactorials[] th = new ForkingFactorials[numThreads];
        for (int i = 0; i < n && i < numThreads; i++) {
            th[i] = new ForkingFactorials(n-i, numThreads);
            th[i].start();
        }
        BigInteger f = new BigInteger("1");
        for (int i = 0; i < n && i < numThreads; i++) {
            th[i].join();
            f = f.multiply(th[i].res);
        }
        long t = System.currentTimeMillis() - now;
        System.err.println("Took " + t + "ms");
        return f;
    }
}

Java with bigints is not the right language for golfing (look at what I have to do just to construct the wretched things, because the constructor which takes a long is private), but hey.

It should be completely obvious from the ungolfed code how it breaks up the work: each thread multiplies an equivalence class modulo the number of threads. The key point is that each thread does roughly the same amount of work.
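For example, with 4 threads and N = 10, the four threads compute

  • 10 · 6 · 2 = 120
  • 9 · 5 · 1 = 45
  • 8 · 4 = 32
  • 7 · 3 = 21

and combining the partial products gives 120 · 45 · 32 · 21 = 3628800 = 10!.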


CSharp - 206 215 chars

using System;using System.Numerics;using System.Threading.Tasks;class a{static void Main(){var n=int.Parse(Console.ReadLine());var r=new BigInteger(1);Parallel.For(1,n+1,i=>{lock(this)r*=i;});Console.WriteLine(r);}}

Splits up the calculation with the C# Parallel.For() functionality.
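N is read from standard input, so an illustrative invocation (with fact.exe as a hypothetical name for the compiled binary) is:

echo 10000 | fact.exe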

Edit: forgot the lock.

Execution times:

n = 10,000, time: 59ms.
n = 20,000, time: 50ms.
n = 30,000, time: 38ms.
n = 40,000, time: 100ms.
n = 50,000, time: 139ms.
n = 60,000, time: 164ms.
n = 70,000, time: 222ms.
n = 80,000, time: 266ms.
n = 90,000, time: 401ms.
n = 100,000, time: 424ms.
n = 110,000, time: 501ms.
n = 120,000, time: 583ms.
n = 130,000, time: 659ms.
n = 140,000, time: 832ms.
n = 150,000, time: 1143ms.
n = 160,000, time: 804ms.
n = 170,000, time: 653ms.
n = 180,000, time: 1031ms.
n = 190,000, time: 1034ms.
n = 200,000, time: 1765ms.
n = 210,000, time: 1059ms.
n = 220,000, time: 1214ms.
n = 230,000, time: 1362ms.
n = 240,000, time: 2737ms.
n = 250,000, time: 1761ms.
n = 260,000, time: 1823ms.
n = 270,000, time: 3357ms.
n = 280,000, time: 2110ms.

Perl, 140

Takes N from standard input.

use bigint;$m=<>;open A,'>',
undef;$i=$p=fork&&1;$n=++$i;
{$i+=2;$n*=$i,redo if$i<=$m}
if($p){wait;seek A,0,0;$_=<A
>;print$n*$_}else{print A$n}

Features:

  • computation split: evens on one side and odds on the other (anything more complex than that would need a lot of characters to balance the computation load appropriately); see the worked example after this list.
  • IPC using a shared anonymous file.
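
For N = 9, for instance, the parent ends up multiplying 2·4·6·8 = 384 while the child multiplies 1·3·5·7·9 = 945; the parent then reads the child's result back through the shared file and prints 384·945 = 362880 = 9!.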

Benchmark:

  • 10000! is printed in 2.3s forked, 3.4s unforked
  • 100000! is printed in 5'08.8 forked, 7'07.9 unforked

Scala (345 266 244 232 214 characters)

Using Actors:

object F extends App{import actors.Actor._;var(t,c,r)=(args(1).toInt,self,1:BigInt);val n=args(0).toInt min t;for(i<-0 to n-1)actor{c!(i*t/n+1 to(i+1)*t/n).product};for(i<-1 to n)receive{case m:Int=>r*=m};print(r)}

Edit - removed references to System.currentTimeMillis(), factored out a(1).toInt, changed from List.range to x to y

Edit 2 - changed the while loop to a for, changed the left fold to a list function that does the same thing, relying on implicit type conversions so the 6 char BigInt type appears only once, changed println to print

Edit 3 - found out how to do multiple declarations in Scala

Edit 4 - various optimizations I've learned since I first did this

Ungolfed version:

import actors.Actor._
object ForkingFactorials extends App
{
    var (target,caller,result)=(args(1).toInt,self,1:BigInt)
    val numthreads=args(0).toInt min target
    for(i<-0 to numthreads-1)
        actor
        {
            caller ! (i*target/numthreads+1 to(i+1)*target/numthreads).product
        }
    for(i<-1 to numthreads)
        receive
        {
            case m:Int=>result*=m
        }
    print(result)
}

Scala-2.9.0 170

object P extends App{
def d(n:Int,c:Int)=(for(i<-1 to c)yield(i to n by c)).par
println((BigInt(1)/: d(args(0).toInt,args(1).toInt).map(x=>(BigInt(1)/: x)(_*_)))(_*_))}

ungolfed:

object ParallelFactorials extends App
{
  def distribute (n: Int, cores: Int) = {
    val factorgroup = for (i <- 1 to cores) 
      yield (i to n by cores)
    factorgroup.par
  }
  
  val parallellist = distribute (args(0).toInt, args(1).toInt)

  println ((BigInt (1) /: parallellist.map (x => (BigInt(1) /: x) (_ * _)))(_ * _))

}

The factorial of 10 is computed on 4 cores by generating 4 Lists:

  • 1 5 9
  • 2 6 10
  • 3 7
  • 4 8

which are multiplied in parallel. There would have been a simpler approach to distribute the numbers:

 (1 to n).sliding(n/cores, n/cores)
  • 1 2 3
  • 4 5 6
  • 7 8 9
  • 10

But the distribution wouldn't be as good: the smaller numbers would all end up in the same list and the highest in another one, so the last list takes the longest to compute (for high Ns, the last thread wouldn't be almost empty, but would contain at least (N/cores)-cores elements).

Scala in version 2.9 contains parallel Collections, which handle the parallel invocation themselves.
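An illustrative compile-and-run sequence (assuming the golfed source is saved as fact.scala; the first argument is N, the second the number of cores, matching args(0) and args(1) above):

 scalac fact.scala
 scala P 10000 4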


Erlang - 295 characters.

First thing I've ever written in Erlang so I wouldn't be surprised if someone could easily halve this:

-module(f).
-export([m/2,f/4]).
m(N,C)->g(N,C,C,[]).
r([],B)->B;
r(A,B)->receive{F,V}->r(lists:delete(F,A),V*B)end.
s(H,L)->spawn(f,f,[self(),H,L,1]).
g(N,1,H,A)->r([s(N div H,1)|A],1);
g(N,C,H,A)->g(N,C-1,H,[s(N*C div H,N*(C-1) div H)|A]).
f(P,H,H,A)->P!{self(),A};
f(P,H,L,A)->f(P,H-1,L,A*H).

Uses the same threading model as my earlier Ruby entry: splits the range up into sub-ranges, multiplies each sub-range out in a separate thread, then multiplies the results back together in the main thread.

I wasn't able to figure out how to get escript working, so just save it as f.erl, open up erl, and run:

c(f).
f:m(NUM_TO_CALC, NUM_OF_PROCESSES).

with appropriate substitutions.
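For example, computing 10000! across 4 processes would be:

f:m(10000, 4).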

Got around 8s for 50000 in 2 processes and 10s for 1 process, on my MacBook Air (dual-core).

Note: Just noticed it freezes if you attempt more processes than the number to factorialize.
