0

Below code is implementation of running some code for every user in a list. The code is just comparing each user and concatenating their attributes :

  case class UserObj(id: String, nCoordinate : String)

  val userList = List(UserObj("a1" , "1234"),UserObj("a2" , "1234"), UserObj("a3" , "1234"))
val map1 = new java.util.concurrent.ConcurrentHashMap[String, Double]

    userList.par.map(xUser => {
        userList.par.map(yUser => {
        if (!xUser.id.isEmpty() && !yUser.id.isEmpty()) {
          println("Total is "+xUser.id+yUser.id+","+xUser.nCoordinate+yUser.nCoordinate)
          map1.put(xUser.id + "," + yUser.id , getJaccardDistance(xUser.nCoordinate, yUser.nCoordinate))
        }

      })
        println("")
  })                                              //> Total is a1a1,12341234
                                                  //| Total is a3a1,12341234
                                                  //| Total is a2a1,12341234
                                                  //| Total is a3a2,12341234
                                                  //| Total is a1a2,12341234
                                                  //| Total is a3a3,12341234
                                                  //| Total is a2a2,12341234
                                                  //| 
                                                  //| Total is a1a3,12341234
                                                  //| Total is a2a3,12341234
                                                  //| 
                                                  //| 
                                                  //| res0: scala.collection.parallel.immutable.ParSeq[Unit] = ParVector((), (), (
                                                  //| ))

  def getJaccardDistance(str1: String, str2: String) = {

    val zipped = str1.zip(str2)
    val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2

    val p = zipped.count(_ == ('1', '1')).toFloat * 2
    val q = zipped.count(_ == ('1', '0')).toFloat * 2
    val r = zipped.count(_ == ('0', '1')).toFloat * 2
    val s = zipped.count(_ == ('0', '0')).toFloat * 2

    (q + r) / (p + q + r)

  }

This was previously an imperative solution :

     for (xUser <- userList) {

         for (yUser <- userList) {
        if (!xUser.id.isEmpty() && !yUser.id.isEmpty()) {
          println("Total is "+xUser.id+yUser.id+","+xUser.nCoordinate+yUser.nCoordinate)
        }

      }
        println("")
   } 

But I want to make use of Scala's parallel collections and I think using map the recommended method of achieving this. As imperative code above could result in multiple threads running same code. Note : the above code being executed : println("Total is "+xUser.id+yUser.id+","+xUser.nCoordinate+yUser.nCoordinate) is just a simpler version of the algorithm being actually run.

The functional solution posted at beginning of question behaves as expected but once the list contains more thatn 3000 elements it almost crawls to a halt. Why is this occurring ? Is my implementation correct ?

1 Answer 1

1

Unless you provide your actual algorithm we can only guess. I tried it with 3000 elements and it worked fine, although slower than a simple map.

Why is it slower? Because println is synchronized. Take a look at java.io.PrintStream:

public void println(String x) {
    synchronized (this) {
        print(x);
        newLine();
    }
}

So obviously parallelizing a println doesn't make too much sense. Either you share your algorithm so we can see what's happening under the cover or dive deep into the code to make sure nothing's being synchronized (e.g. if you're println-ing somewhere then consider using asynchronous logging instead).

The code I'm using to test is:

case class UserObj(id: String, nCoordinate : String)

val userList = (1 to 3000).map(i => UserObj("a"+i.toString, "1234"))

var timings = new mutable.StringBuilder
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  timings.append("Elapsed time: " + (t1 - t0) + "ns\n")
  result
}


time {
  userList.map(xUser => {
    userList.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty) {
        println("Total is " + xUser.id + yUser.id + "," + xUser.nCoordinate + yUser.nCoordinate)
      }
    })
  })
}

time {
  userList.par.map(xUser => {
    userList.par.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty) {
        println("Total is " + xUser.id + yUser.id + "," + xUser.nCoordinate + yUser.nCoordinate)
      }
    })
  })
}

println(timings.toString())

And returned the following timings:

Elapsed time: 29066452631ns
Elapsed time: 37031631461ns
4
  • @blue-sky Is the println for the Total included in the loop? Commented Apr 8, 2014 at 21:10
  • yes it is, but it does'nt need to be. I just included it to make it simpler to read, seems this has had opposite effect
    – blue-sky
    Commented Apr 8, 2014 at 21:12
  • Does the issue still persist if you exclude it? Commented Apr 8, 2014 at 21:14
  • its seems populating a map within a map function is grinding performance to a halt as removing 'map1.put(xUser.id + "," + yUser.id , getJaccardDistance(xUser.nCoordinate, yUser.nCoordinate)) ' increases performance sustantially
    – blue-sky
    Commented Apr 8, 2014 at 21:21

Not the answer you're looking for? Browse other questions tagged or ask your own question.