
Factorial Calculation or the Power of the Stream API

The other day an article appeared, "Two ways to quickly calculate factorial", which discusses the idea of speeding up the computation of a factorial by grouping the multiplied numbers into a tree according to the "divide and conquer" principle. Looking at it, I immediately realized that this is where Java's parallel streams would show themselves in all their glory: after all, they split a task into subtasks in exactly this way, using spliterators. It turns out that the fast implementation is also beautiful:

```java
public static BigInteger streamedParallel(int n) {
    if (n < 2) return BigInteger.valueOf(1);
    return IntStream.rangeClosed(2, n).parallel()
                    .mapToObj(BigInteger::valueOf)
                    .reduce(BigInteger::multiply).get();
}
```
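For reference, the tree grouping that the original article describes can also be written out explicitly as a sequential divide-and-conquer product (a minimal sketch of the idea; the class and method names are mine, not from the article):

```java
import java.math.BigInteger;

public class TreeFactorial {
    // Multiplies the range [lo, hi] by splitting it in half, so that
    // operands of similar bit length end up multiplied together.
    static BigInteger product(int lo, int hi) {
        if (lo > hi) return BigInteger.ONE;       // empty range
        if (lo == hi) return BigInteger.valueOf(lo);
        int mid = (lo + hi) >>> 1;
        return product(lo, mid).multiply(product(mid + 1, hi));
    }

    public static BigInteger factorial(int n) {
        return product(2, n); // for n < 2 the range is empty and the result is 1
    }

    public static void main(String[] args) {
        System.out.println(factorial(10)); // 3628800
    }
}
```

A parallel stream over `rangeClosed(2, n)` builds essentially the same tree, except that the subtrees are also computed on different threads.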


Unfortunately, that article gives no details about how performance was measured. How many runs were made? Was there a warm-up? Was the measurement error estimated? Did the JIT compiler skip some of the computations because their results were never used? And if they were used (for example, the resulting number was written to a file), was that excluded from the timing? In this regard I want to thank Aleksey Shipilev, who, with his JMH library as well as numerous talks and articles, has instilled a culture of benchmarking in the Java community.

My benchmark code is as follows:
```java
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
import java.math.BigInteger;

@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(2)
public class Factorial {
    @Param({"10", "100", "1000", "10000", "50000"})
    private int n;

    public static BigInteger naive(int n) {
        BigInteger r = BigInteger.valueOf(1);
        for (int i = 2; i <= n; ++i)
            r = r.multiply(BigInteger.valueOf(i));
        return r;
    }

    public static BigInteger streamed(int n) {
        if (n < 2) return BigInteger.valueOf(1);
        return IntStream.rangeClosed(2, n)
                        .mapToObj(BigInteger::valueOf)
                        .reduce(BigInteger::multiply).get();
    }

    public static BigInteger streamedParallel(int n) {
        if (n < 2) return BigInteger.valueOf(1);
        return IntStream.rangeClosed(2, n).parallel()
                        .mapToObj(BigInteger::valueOf)
                        .reduce(BigInteger::multiply).get();
    }

    @Benchmark
    public void testNaive(Blackhole bh) {
        bh.consume(naive(n));
    }

    @Benchmark
    public void testStreamed(Blackhole bh) {
        bh.consume(streamed(n));
    }

    @Benchmark
    public void testStreamedParallel(Blackhole bh) {
        bh.consume(streamedParallel(n));
    }
}
```

I compared three implementations: naive, on a sequential stream, and on a parallel stream. It is reasonable to test the algorithm for different values of n (10, 100, 1000, 10000 and 50000) to see how the results change with scale. Five warm-up iterations and ten measured iterations are performed, and all of this is repeated twice (with a restart of the JVM), so 20 measurements are taken for each test. Note the Blackhole: it is needed so that the JIT compiler does not eliminate the computation entirely after deciding that its result is never used.

I tested it on a laptop with a Core i7-4702MQ processor (8 hardware threads). Here are the results:

```
Benchmark                         (n)  Mode  Cnt       Score       Error  Units
Factorial.testNaive                10  avgt   20       0.298 ±     0.005  us/op
Factorial.testNaive               100  avgt   20       5.113 ±     0.195  us/op
Factorial.testNaive              1000  avgt   20     278.601 ±     9.586  us/op
Factorial.testNaive             10000  avgt   20   32232.618 ±   889.512  us/op
Factorial.testNaive             50000  avgt   20  972369.158 ± 29287.950  us/op
Factorial.testStreamed             10  avgt   20       0.339 ±     0.021  us/op
Factorial.testStreamed            100  avgt   20       5.432 ±     0.246  us/op
Factorial.testStreamed           1000  avgt   20     268.366 ±     4.859  us/op
Factorial.testStreamed          10000  avgt   20   30429.526 ±   435.611  us/op
Factorial.testStreamed          50000  avgt   20  896719.207 ±  7995.599  us/op
Factorial.testStreamedParallel     10  avgt   20       6.470 ±     0.515  us/op
Factorial.testStreamedParallel    100  avgt   20      11.280 ±     1.094  us/op
Factorial.testStreamedParallel   1000  avgt   20      74.174 ±     2.647  us/op
Factorial.testStreamedParallel  10000  avgt   20    2826.643 ±    52.831  us/op
Factorial.testStreamedParallel  50000  avgt   20   49196.070 ±   464.634  us/op
```

The naive test generally confirms the theoretical assumption that the algorithm is quadratic. Dividing the time by n²:

```
n = 10:    0.002980
n = 100:   0.000511
n = 1000:  0.000279
n = 10000: 0.000322
n = 50000: 0.000389
```

The increase at large values is probably associated with growing memory consumption and heavier garbage collection.
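To double-check this arithmetic, the ratios can be recomputed directly from the scores in the results table (a small sketch; the class name is mine, and the values are copied from the testNaive rows above):

```java
public class QuadraticCheck {
    public static void main(String[] args) {
        int[] ns = {10, 100, 1000, 10000, 50000};
        // Average times in us/op from the testNaive benchmark rows
        double[] scores = {0.298, 5.113, 278.601, 32232.618, 972369.158};
        for (int i = 0; i < ns.length; i++) {
            // time / n^2 should be roughly constant for a quadratic algorithm
            System.out.printf("n = %d: %.6f%n", ns[i], scores[i] / ((double) ns[i] * ns[i]));
        }
    }
}
```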

The sequential stream is, as expected, somewhat slower for small n (by 13% for n = 10 and by 6% for n = 100): the Stream API plumbing does cost something. Surprisingly, though, for large n the streamed version runs 4-8% faster, even though it is algorithmically identical. Yet another confirmation that it is dangerous to reason about performance purely theoretically, without taking measurements: JIT compiler optimizations, processor caches, branch prediction and other factors are very hard to account for on paper.

The parallel stream is, as expected, significantly slower for very short operations. However, a speedup already appears at n = 1000, when the full computation takes about 270 µs in sequential mode and only about 75 µs in parallel. This fits well with the Stream Parallel Guidance (thanks to apangin for the link), which says that parallelizing makes sense for tasks that run longer than 100 µs. For large values of n the parallel stream really shines: we get a speedup of about 18 times. This is broadly consistent with the speedup reported in the original article multiplied by the number of threads (1.7 / 0.8 * 8 ≈ 17).

ForkJoinPool has very little overhead, so even a millisecond-scale task gains from parallelization. Divide-and-conquer algorithms map naturally onto parallel streams, so a parallel stream can turn out to be dramatically faster than a sequential one. Many people criticize the Stream API, but the future belongs to parallelism.
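For comparison, the same divide-and-conquer tree can be expressed directly on top of ForkJoinPool, which is roughly what a parallel stream does under the hood (a minimal sketch, not from the article; the class name, the threshold value and its placement are my own choices):

```java
import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinFactorial {
    // Recursively splits [lo, hi]; ranges shorter than THRESHOLD
    // are multiplied sequentially to keep task overhead down.
    static class ProductTask extends RecursiveTask<BigInteger> {
        static final int THRESHOLD = 1000; // tuning parameter, chosen arbitrarily here
        final int lo, hi;
        ProductTask(int lo, int hi) { this.lo = lo; this.hi = hi; }

        @Override
        protected BigInteger compute() {
            if (hi - lo < THRESHOLD) {
                BigInteger r = BigInteger.ONE;
                for (int i = lo; i <= hi; i++) r = r.multiply(BigInteger.valueOf(i));
                return r;
            }
            int mid = (lo + hi) >>> 1;
            ProductTask left = new ProductTask(lo, mid);
            left.fork();                       // compute the left half asynchronously
            BigInteger right = new ProductTask(mid + 1, hi).compute();
            return left.join().multiply(right);
        }
    }

    public static BigInteger factorial(int n) {
        if (n < 2) return BigInteger.ONE;
        return ForkJoinPool.commonPool().invoke(new ProductTask(2, n));
    }
}
```

The parallel stream version achieves the same effect in one line, which is exactly the appeal: the splitting, the threshold logic and the joining are handled by the spliterator and the common pool.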

Source: https://habr.com/ru/post/255813/

