7

Today I found an article about Java 8's Fork/Join framework and its use in the implementation of parallel streams. While I do understand the article, I'm not entirely sure what to make of it.

Basically, what it says is that F/J in conjunction with streams is next to useless, and especially so in the context of JEE applications. Quite a few specific arguments are listed, such as:

  • it needs a massive volume of easily separable data (aggregate),
  • creates copious threads without regard for others,
  • has a high potential for stack overflows,
  • has a high potential for massive memory usage,
  • has a very, very narrow performance window,
  • is only designed for one request at a time.

Moreover, it gives these arguments against F/J's recursive decomposition approach:

Recursive decomposition has an even narrower performance window. In addition to the above dynamic decomposition, recursive decomposition optimized for dyadic recursive division only works well:

  • on balanced tree structures (Directed Acyclic Graphs)
  • where there are no cyclic dependencies
  • where the computation duration is neither too short nor too long
  • where there is no blocking.
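
For readers unfamiliar with the term, "dyadic recursive division" is the usual F/J pattern of halving a range until it is small enough to process directly. The following is only a minimal sketch of that pattern, not code from the article; the class name `DyadicSum` and the `THRESHOLD` cutoff are made up for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class DyadicSum extends RecursiveTask<Long> {
    // Hypothetical cutoff: ranges at or below this size are summed sequentially.
    private static final int THRESHOLD = 1_000;

    private final long[] data;
    private final int lo;
    private final int hi;

    DyadicSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Halve the range: this is only cheap because the array is random-access.
        int mid = (lo + hi) >>> 1;
        DyadicSum left = new DyadicSum(data, lo, mid);
        DyadicSum right = new DyadicSum(data, mid, hi);
        left.fork();                      // run the left half asynchronously
        long rightSum = right.compute();  // compute the right half in this thread
        return left.join() + rightSum;
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new DyadicSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sum(data));
    }
}
```

The sketch works well here because the two halves are the same size and nothing blocks, which are exactly the conditions the list above names.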

Since this is the only source I could find that complains about F/J, I'm not sure whether it can be taken seriously. Are the points cited above, or other similar ones, a valid concern?

More specifically, does Oracle have an official position regarding the limitations of the F/J Framework as applied to the parallelization of streams processing? If so, does it have plans to do something about them?

24
  • Since I wrote the articles (there is a part I and a pdf now) perhaps you should ask me the question directly. There is an email address with the article. In any case, Streams work beautifully. Parallel support relies on a dyadic recursive division method that is usually unsuitable for general-purpose usage. The alternative is paraquental processing which significantly slows down the process and cannot scale. That's about all I can say in this comment.
    – edharned
    Commented Jul 10, 2014 at 22:26
  • @edharned Why do you refrain from writing an answer, which would be of interest to the general public? I've read the PDF some time ago and the problem is that it isn't very accessible. If you could write an answer which would be short enough and didn't use terms unfamiliar to the general reader, it would be a great contribution. Commented Jul 11, 2014 at 5:19
  • 1
    @edharned Because C# is nowhere better in this regard. Actually almost all C# + Parallel demos run slower (much slower!) instead of faster on my machine. But this may be related to mono, poor mono =/. So please: If you continue posting this stuff please give us a little bit more detail about what's wrong without being the anti-research-guy. Be objective and friendly, do not pursue other people's work. And make your point clear without dancing around it. Thanks.
    – Kr0e
    Commented Jul 11, 2014 at 8:33
  • 1
    It is noteworthy that three of the five voters to close this question come from the core Oracle Java team---even though that same Oracle team does have an official position on this subject, by definition making the question not "primarily opinion-based". Commented Jul 11, 2014 at 9:03
  • 1
    @edharned 1. So you want a JDK-provided thread container, after all? Can you please pinpoint exactly how ForkJoinPool's common pool misses the mark? 2. If you mark the thread as unusable, leaving it to die on its own, then clearly you must start a new thread to keep on going. Or do you propose to go on crippled, with N-1 threads? 3. Could you explain in plain language what you mean by "marking its management structure as expunged?" Commented Jul 14, 2014 at 5:03

1 Answer

10

This is the gist of the problems with the application of F/J in the implementation of the Streams API:

  1. F/J is good for the parallelization of in-memory, random-access structures: it wants to be able to divide the full problem top-down, by recursively halving it into two subproblems of equal size;
  2. the stream paradigm is primarily about the processing of lazily materialized, sequential data sources, which can only be divided into a sequence of chunks, and the number of chunks is usually not known in advance.
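
The difference between the two kinds of source can be observed directly on their spliterators. Below is a rough sketch, assuming a hypothetical helper class `SplitShapes`: it splits an `ArrayList` once and an iterator-backed spliterator of unknown size once, then counts what ended up on each side. How large the split-off prefix of the iterator-backed source is depends on the JDK implementation, so the sketch makes no claim about the exact number.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SplitShapes {
    /** Counts the elements a spliterator still holds by draining it (0 for null). */
    static long drain(Spliterator<Integer> s) {
        if (s == null) {
            return 0;
        }
        long[] n = {0};
        s.forEachRemaining(x -> n[0]++);
        return n[0];
    }

    static List<Integer> numbers(int n) {
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    /** Splits a random-access source once; returns {prefix count, remaining count}. */
    static long[] splitRandomAccess(int n) {
        Spliterator<Integer> rest = numbers(n).spliterator();
        Spliterator<Integer> prefix = rest.trySplit();
        return new long[] {drain(prefix), drain(rest)};
    }

    /** Splits a sequential source of unknown size once; same return shape. */
    static long[] splitSequential(int n) {
        Iterator<Integer> it = numbers(n).iterator();
        Spliterator<Integer> rest =
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
        Spliterator<Integer> prefix = rest.trySplit();
        return new long[] {drain(prefix), drain(rest)};
    }

    public static void main(String[] args) {
        long[] a = splitRandomAccess(100_000);
        long[] b = splitSequential(100_000);
        System.out.println("random access: " + a[0] + " / " + a[1]);
        System.out.println("sequential:    " + b[0] + " / " + b[1]);
    }
}
```

The list splits into two equal halves, whereas the iterator-backed source can only hand over a chunk read from its front, leaving an unbalanced remainder of unknown size.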

While F/J can be bent somewhat to support sequential chunking, this is perceived by it as "anomalous" and "lopsided", eventually giving rise to insurmountable issues when combined with the unpredictable I/O latency in reading those chunks¹.

The Streams API excels at the parallelization of in-memory structures and is usually helpful with the processing of lazy, I/O-backed streams, but it fails when you try to combine these two features in a single use case.

If you have a loop in your code which introduces a CPU-bound bottleneck, it is fairly likely that this loop is iterating over the contents of some file, network request, or rows of an SQL result set. None of these targets for parallelization get support from the Streams API.

The official position is that this use case is not supported because the Streams API has a different, equally legitimate focus. In the department of lazy parallel streams, this focus amounts to stream sources which are calculated from data existing within working memory, with the additional constraint that these sources must be unordered—that each member can be calculated independently, without the need to first calculate any other. An example of such a stream is a range of integers, but a stream of random numbers from an LCG is already outside of the area being focused on by the API, because these random numbers can only be generated sequentially.
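
To make the contrast concrete, here is a small sketch, assuming a hypothetical class `SourceKinds` and arbitrary illustrative LCG constants. It sums squares over an integer range in parallel and then checks whether each kind of source reports the `SIZED` characteristic, which is what lets the framework plan balanced splits up front.

```java
import java.util.Spliterator;
import java.util.stream.IntStream;
import java.util.stream.LongStream;

public class SourceKinds {
    // Arbitrary illustrative LCG constants; any multiplier/increment pair would do here.
    private static final long A = 6364136223846793005L;
    private static final long C = 1442695040888963407L;

    /** One LCG step: each value depends on the previous one. */
    static long lcgNext(long x) {
        return x * A + C;
    }

    /** Every element of the range can be computed independently of the others. */
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    /** A range knows its size, so it can be halved exactly. */
    static boolean rangeIsSized() {
        return IntStream.range(0, 1_000)
                .spliterator()
                .hasCharacteristics(Spliterator.SIZED);
    }

    /** An iterated LCG sequence has no known size and must be generated in order. */
    static boolean lcgIsSized() {
        return LongStream.iterate(1L, SourceKinds::lcgNext)
                .spliterator()
                .hasCharacteristics(Spliterator.SIZED);
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000));
        System.out.println("range SIZED: " + rangeIsSized());
        System.out.println("LCG SIZED:   " + lcgIsSized());
    }
}
```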


¹ Keep in mind that this is the official statement. I have personally not yet hit this issue, having instead successfully parallelized the processing of my I/O sources.

2
  • Actually, SplittableRandom has great parallel random number generation -- and a good Stream implementation.
    – Brian Goetz
    Commented Jul 10, 2014 at 21:56
  • That's because there is much more to SplittableRandom than an LCG. Commented Jul 10, 2014 at 21:59
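
For reference, a minimal sketch of the `SplittableRandom` usage mentioned in the comments above; the class name `ParallelRandom`, the seed, and the bounds are arbitrary choices for illustration.

```java
import java.util.SplittableRandom;

public class ParallelRandom {
    /** Generates n values in [0, 100) in parallel and counts those inside the bounds. */
    static long countInRange(long seed, long n) {
        return new SplittableRandom(seed)
                .ints(n, 0, 100)   // sized stream of bounded pseudorandom ints
                .parallel()
                .filter(v -> v >= 0 && v < 100)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countInRange(42L, 10_000L));
    }
}
```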
