3
\$\begingroup\$

I would like to know how representative the following benchmark is? Can we infer from it that findFirst is much slower than findAny for non-parallel streams? Or is this particular use case too specific?

In addition, I would like to know whether the benchmark has been implemented correctly from a technical point of view.

import java.util.Random;

public class BMTest {
    public static void main(String[] args) {
        int findFirstFaster = 0;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            int n = 1000 * (i + 1);
            long t1 = getTimeFindFirst(n);
            long t2 = getTimeFindAny(n);
            int faster = t1 < t2 ? 1 : 0;
            findFirstFaster += faster;
            sb.append(String.format("| %02d | %05d | %010d | %010d | %f | %b |%n", i + 1, n, t1, t2, (double) t1 / (t1 + t2), faster == 1));
        }
        System.out.println(sb);
        System.out.println("findFirstFaster = " + findFirstFaster);
    }

    public static long getTimeFindFirst(int n) {
        Random rand = new Random();
        long sum = 0;

        long time = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sum += rand.ints(10_000).filter(x -> x % 7 == 0).findFirst().orElseThrow();
        }
        time = System.nanoTime() - time;
        System.out.println("sum = " + sum);
        return time;
    }

    public static long getTimeFindAny(int n) {
        Random rand = new Random();
        long sum = 0;

        long time = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sum += rand.ints(10_000).filter(x -> x % 7 == 0).findAny().orElseThrow();
        }
        time = System.nanoTime() - time;
        System.out.println("sum = " + sum);
        return time;
    }
}

Example output:

01 01000 0010918901 0002353900 0,822652 false
02 02000 0001526700 0001113200 0,578317 false
03 03000 0001618900 0001715000 0,485587 true
04 04000 0002009100 0001740900 0,535760 false
05 05000 0002041899 0001900000 0,517999 false
06 06000 0002077800 0002103601 0,496915 true
07 07000 0002630900 0002274900 0,536284 false
08 08000 0002108900 0002100399 0,501010 false
09 09000 0002266100 0002248400 0,501960 false
10 10000 0004679400 0002199700 0,680234 false
\$\endgroup\$
1
  • 1
    \$\begingroup\$ No. Use JMH with warmups. You aren't comparing the methods acting on same input, and you shouldn't bother timing rand.ints(10_000). Pass same set of ints to both tests (calc upfront or use same random seed). \$\endgroup\$
    – DuncG
    Commented Jul 4 at 14:42

1 Answer 1

6
\$\begingroup\$

whether the benchmark has been implemented correctly

I'm afraid, no.

  1. JVM does quite a bit of optimizations at runtime, compiling certain frequently called methods into highly optimized native code. But for it to happen JVM needs to profile the code running it in the interpreted mode, only then JIT compiler kicks in *. This process is called JVM warmup.

When writing benchmarks, it's important to take JVM warmup into account because performance is constantly changing during this phase.

  1. These tests are nondeterministic since a newly created instance of Random obtained via no-args constructor serves as a stream source in each method (i.e. in each case stream elements will be different). If instead you were providing a predefined seed while constructing Random, then it'll guaranteed that the predicate of the filter() will be evaluated to true on the same element in both streams.

Don't reinvent the wheel, use trustworthy tools specifically designed for this purpose, such as Java Microbenchmark Harness.

* That's a very simplified description. Modern JVMs are using tiered compilation, which employs two JIT compilers C1 and C2 to optimize hot code faster. And sometimes a compiled piece of code might even be deoptimized and start running in the interpreted mode again because a speculative assumption on which optimization was based proved to be incorrect.

Can we infer from it that findFirst() is much slower than findAny() for non-parallel streams?

This might sound boring, but findFirst() and findAny() operations represent different use cases: 1) you need exactly the first element, 2) you're OK with any element.

It'll be a mistake to assume that if a stream is sequential, then you're free to employ findAny() when in fact findFirst() is needed. This premature optimization is not likely to generate any tangible performance gain, but might lead to incorrect results if you or your colleagues will switch the stream to run in parallel.

Premature optimization is evil

Code style

Long chains of method calls are hard to follow and comprehend. Consider breaking them into multiple lines to improve readability:

sum += rand.ints(10_000)
    .filter(x -> x % 7 == 0)
    .findFirst()
    .orElseThrow();
\$\endgroup\$
7
  • 2
    \$\begingroup\$ @TobiasGrothe If there is no requirement for the first matching element to be returned, I'd use findAny() even if it did not provide a performance gain, because it documents the lack of that requirement. This might help with understanding the code. But a comment about why this is not a requirement will always be better (unless it's obvious from the use case). \$\endgroup\$ Commented Jul 3 at 9:47
  • 1
    \$\begingroup\$ @TobiasGrothe the nature of the Stream source has a far bigger impact on the performance than anything else. That’s especially true for a source as expensive as a random number generator. Given that the stream from Random is unordered to begin with, there’s obviously no difference between findFirst() and findAny(). There simply is no “first” and the documentation of findFirst() states: “If the stream has no encounter order, then any element may be returned. \$\endgroup\$
    – Holger
    Commented Jul 3 at 10:54
  • 1
    \$\begingroup\$ @TobiasGrothe findAny() method is a better choice when there might be several suitable results in the stream, and there is no preference for which element to return. This is exactly what you described. On the other hand, findFirst() is used when it is necessary to leverage the ordering of the stream source to find the earliest element that meets the criteria. Example: elements in the source are chronologically ordered by timestamp, and we need the earliest event in a specific area dedicated to a particular topic, findFirst() is the appropriate tool for this job. \$\endgroup\$ Commented Jul 4 at 14:31
  • 1
    \$\begingroup\$ @TobiasGrothe Holger brought up an interesting point. Some stream sources do not have an encounter order (HashSet, Random). As a result, the streams they produce are considered to be unordered. Such stream does not care about ordering related constraints while executing findFirst(), takeWhile(), dropWhile(), limit(), skip(), etc. You can verify that streams produced by an instance of Random are unordered by checking the characteristics of the stream's spliterator. \$\endgroup\$ Commented Jul 4 at 14:36
  • 1
    \$\begingroup\$ @TobiasGrothe rand.ints(10_000).spliterator().characteristics() & Spliterator.ORDERED will be evaluated to 0 because bit flag ORDERED is not present. \$\endgroup\$ Commented Jul 4 at 14:37

Not the answer you're looking for? Browse other questions tagged or ask your own question.