At the JBreak conference I deliberately skipped the sponsors' puzzles. Well, except for the hellish one from Excelsior, of course: those folks outdid everyone. And then someone brought me a puzzle from SKB Kontur: look, they said, have a laugh. I did laugh: the first task looked so naive and under-specified that I did not even feel like going to the booth to argue about it with the company's employees. I had almost forgotten about it, but then the authors' analysis of this task appeared on Habr, and not without some depth: they even wrote about modCount. So, did I laugh in vain?
As I recall, the task read like this:
Sort the methods by their execution speed and enter the method numbers into the appropriate cells. If some methods give only slightly different results, write them in the same cell. For each task, explain why some methods are faster than others. Assume the code runs on a 64-bit HotSpot VM (JRE 1.8.0_161).
void forEach(List<Integer> values, PrintStream ps) { // 1
    values.forEach(ps::println);
}

void forEach(List<Integer> values, PrintStream ps) { // 2
    values.stream().forEach(ps::println);
}

void forEach(List<Integer> values, PrintStream ps) { // 3
    values.parallelStream().forEach(ps::println);
}
According to the authors, the correct answer is "it depends": 1 or 2, depending on the alignment of the stars. Yet the answer form did not allow for such an option, and there was too little room to explain it. In general, I believe it is wrong to pose a task that has no unambiguous correct answer. I have nothing against problems with two definitely correct answers, 1 and 2 (for example, when both are always and demonstrably equally fast). But here the answer is "either one or the other, depending on the circumstances". That is an ugly, ill-conceived task. It seems the authors originally had a single right answer in mind, but then (perhaps after talking to conference attendees) realized themselves that they were wrong.
However, the biggest problem is that, depending on the circumstances, the third option can also be the fastest.
Let me remind you of the authors' argument against the parallelStream option: the consumer (PrintStream::println) is synchronized, so consumption is linearized: a happens-before relationship is established between the consumption of any two elements. Accordingly, we gain nothing from parallelization; we only pay the overhead of spawning threads and synchronizing. Nevertheless, we should not forget that a stream consists not only of a data consumer but also of a source, and the source may well not be linearized. If you pick a source and a consumer such that the source turns out to be substantially slower, parallelism will give a speedup arbitrarily close to the number of processors.
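To make this concrete, here is a minimal sketch of the idea (the class name, loop constants, and list size are my own illustration, not part of the task): a list whose get() burns CPU, consumed through a synchronized method analogous to PrintStream::println. On a multi-core machine with a well-splitting spliterator, the parallel run may take a fraction of the sequential time even though consumption itself is serialized; the exact ratio depends on the JDK version and core count.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.RandomAccess;

public class SlowSourceDemo {
    // A list whose element access is deliberately expensive:
    // each get() runs a long arithmetic loop before returning a value.
    static List<Integer> slowList(int size) {
        class SlowList extends AbstractList<Integer> implements RandomAccess {
            @Override public Integer get(int index) {
                long x = index;
                for (int i = 0; i < 200_000; i++) {
                    x = x * 6364136223846793005L + 1442695040888963407L;
                }
                return (int) (x & 0xFF); // always in 0..255
            }
            @Override public int size() { return size; }
        }
        return new SlowList();
    }

    // A synchronized consumer, like PrintStream::println:
    // consumption of any two elements is linearized.
    static long consumedSum;
    static synchronized void consume(int value) { consumedSum += value; }

    public static void main(String[] args) {
        List<Integer> list = slowList(1000);

        consumedSum = 0;
        long t0 = System.nanoTime();
        list.stream().forEach(SlowSourceDemo::consume);
        long seq = System.nanoTime() - t0;
        long seqSum = consumedSum;

        consumedSum = 0;
        t0 = System.nanoTime();
        list.parallelStream().forEach(SlowSourceDemo::consume);
        long par = System.nanoTime() - t0;

        // Both variants consume exactly the same elements, in some order:
        System.out.println(seqSum == consumedSum); // prints true
        System.out.println("sequential: " + seq / 1_000_000
                + " ms, parallel: " + par / 1_000_000 + " ms");
    }
}
```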
By the problem statement, the source is some implementation of the List interface. The task imposes no further restrictions, so we are free to supply any implementation that honors the contract. We can even limit ourselves to "reasonably sane" implementations that look natural and add no artificial delays.
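To illustrate what a contract-compliant but lazily computed List can look like, here is a JDK-only sketch in the spirit of Guava's Lists.transform (the class and method names are mine, not from the task or from Guava): a read-only view that computes each element on demand.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.RandomAccess;
import java.util.function.IntFunction;

public class LazyList {
    // A read-only List view that computes each element on access,
    // similar in spirit to the view Guava's Lists.transform returns.
    static <T> List<T> generate(int size, IntFunction<T> fn) {
        class View extends AbstractList<T> implements RandomAccess {
            @Override public T get(int index) {
                if (index < 0 || index >= size)
                    throw new IndexOutOfBoundsException(String.valueOf(index));
                return fn.apply(index); // computed lazily on every access
            }
            @Override public int size() { return size; }
        }
        return new View();
    }

    public static void main(String[] args) {
        List<Integer> squares = generate(5, i -> i * i);
        System.out.println(squares); // prints [0, 1, 4, 9, 16]
    }
}
```

Such a view fully satisfies the List contract, yet its per-element cost is whatever the function costs, which is exactly the degree of freedom exploited below.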
Let's speed up the consumer first. PrintStream is not a final class, of course, and we could simply extend it, overriding the println method and dropping the synchronization altogether. However, I consider that cheating: redefining existing behavior should not count as a valid solution in such tasks. Still, remember that PrintStream is just a wrapper around an OutputStream, and here we are entitled to pass in any implementation of that abstract class; for example, one that does nothing:
private static PrintStream getStream() {
    return new PrintStream(new OutputStream() {
        public void write(int b) {}
        public void write(byte[] b, int off, int len) {}
    });
}
Strictly speaking, this is not even necessary; it merely amplifies the effect. Even against the standard console output we could build a source that is noticeably slower.
Now for the source itself. Lists that produce their elements slowly are entirely possible and occur often enough in practice. Such a list can be created, say, via Lists.transform from the Guava library with a suitable function. It is also easy to write one by hand: it suffices to extend the standard AbstractList, defining two methods, get() and size(). Since our list supports random access, we mark it with the RandomAccess marker interface.
What should our list do? Let the element at index i be, say, the low byte of the SHA-512 hash of the string representation of that index in UTF-16 encoding. Why not? Perhaps we need exactly such a pseudo-random yet stable thing:
static List<Integer> getList(int size) {
    class MyList extends AbstractList<Integer> implements RandomAccess {
        @Override
        public Integer get(int index) {
            // thanks, Java 9, for this method!
            Objects.checkIndex(index, size);
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-512");
                md.update(String.valueOf(index).getBytes(StandardCharsets.UTF_16));
                return (int) md.digest()[0];
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public int size() {
            return size;
        }
    }
    return new MyList();
}
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(5)
@State(Scope.Benchmark)
public class ListBenchmark {
    @Param({"100", "10000", "1000000"})
    private int size;

    @Benchmark
    public void testForEach() {
        forEach1(getList(size), getStream());
    }

    @Benchmark
    public void testStreamForEach() {
        forEach2(getList(size), getStream());
    }

    @Benchmark
    public void testParallelStreamForEach() {
        forEach3(getList(size), getStream());
    }

    void forEach1(List<Integer> values, PrintStream ps) { // 1
        values.forEach(ps::println);
    }

    void forEach2(List<Integer> values, PrintStream ps) { // 2
        values.stream().forEach(ps::println);
    }

    void forEach3(List<Integer> values, PrintStream ps) { // 3
        values.parallelStream().forEach(ps::println);
    }

    static PrintStream getStream() {
        return new PrintStream(new OutputStream() {
            public void write(int b) {}
            public void write(byte[] b, int off, int len) {}
        });
    }

    static List<Integer> getList(int size) {
        class MyList extends AbstractList<Integer> implements RandomAccess {
            @Override
            public Integer get(int index) {
                // thanks, Java 9, for this method!
                Objects.checkIndex(index, size);
                try {
                    MessageDigest md = MessageDigest.getInstance("SHA-512");
                    md.update(String.valueOf(index).getBytes(StandardCharsets.UTF_16));
                    return (int) md.digest()[0];
                } catch (NoSuchAlgorithmException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public int size() {
                return size;
            }
        }
        return new MyList();
    }
}
Now we run the benchmark on a four-core machine and get these results:
# JMH version: 1.20
# VM version: JDK 9.0.4, VM 9.0.4+11
# VM invoker: C:\Program Files\Java\jdk-9.0.4\bin\java.exe
...
Benchmark                  (size)  Mode  Cnt        Score        Error  Units
testForEach                   100  avgt   50      317,811 ±     48,847  us/op
testForEach                 10000  avgt   50    11944,452 ±    308,630  us/op
testForEach               1000000  avgt   50  1650511,656 ± 489838,860  us/op
testParallelStreamForEach     100  avgt   50       89,836 ±      2,507  us/op
testParallelStreamForEach   10000  avgt   50     9629,607 ±    935,511  us/op
testParallelStreamForEach 1000000  avgt   50   890954,996 ±  64176,572  us/op
testStreamForEach             100  avgt   50      171,991 ±     48,877  us/op
testStreamForEach           10000  avgt   50    22063,496 ±   5965,278  us/op
testStreamForEach         1000000  avgt   50  1646935,273 ± 485037,889  us/op
The ForEach and StreamForEach results look a bit strange: for a list of 100 items the stream wins, while for 10,000 items it is the other way around. The desired result, however, is plain: the parallel stream is faster everywhere.
Where do such fluctuations come from? They are easy to catch if you avoid the mistake the SKB Kontur developers made: never run a benchmark with a single fork, especially if you suspect the JIT compiler significantly affects the result. The JIT compiler is thoroughly non-deterministic: the result of compilation depends on the collected profile, but the profile is gathered asynchronously, so by the time compilation starts the counters may differ, and the generated code will differ too. Increasing the number of iterations does not help, because once a stable state is reached, the hot code is no longer recompiled. Look, for example, at the output for ForEach / size = 100:
# Run progress: 0,00% complete, ETA 00:05:37
# Fork: 1 of 5
# Warmup Iteration   1: 585,124 us/op
# Warmup Iteration   2: 390,473 us/op
# Warmup Iteration   3: 388,228 us/op
# Warmup Iteration   4: 387,198 us/op
# Warmup Iteration   5: 384,001 us/op
Iteration   1: 386,163 us/op
Iteration   2: 376,344 us/op
Iteration   3: 374,520 us/op
Iteration   4: 364,211 us/op
Iteration   5: 364,389 us/op
Iteration   6: 363,269 us/op
Iteration   7: 364,266 us/op
Iteration   8: 364,123 us/op
Iteration   9: 363,860 us/op
Iteration  10: 364,035 us/op

# Run progress: 2,22% complete, ETA 00:05:48
# Fork: 2 of 5
# Warmup Iteration   1: 426,376 us/op
# Warmup Iteration   2: 386,556 us/op
# Warmup Iteration   3: 382,808 us/op
# Warmup Iteration   4: 378,838 us/op
# Warmup Iteration   5: 377,987 us/op
Iteration   1: 377,438 us/op
Iteration   2: 374,149 us/op
Iteration   3: 370,854 us/op
Iteration   4: 361,177 us/op
Iteration   5: 362,603 us/op
Iteration   6: 361,055 us/op
Iteration   7: 361,445 us/op
Iteration   8: 361,564 us/op
Iteration   9: 364,283 us/op
Iteration  10: 359,619 us/op

# Run progress: 4,44% complete, ETA 00:05:39
# Fork: 3 of 5
# Warmup Iteration   1: 582,933 us/op
# Warmup Iteration   2: 144,197 us/op
# Warmup Iteration   3: 140,762 us/op
# Warmup Iteration   4: 135,908 us/op
# Warmup Iteration   5: 132,585 us/op
Iteration   1: 132,768 us/op
Iteration   2: 132,491 us/op
Iteration   3: 129,447 us/op
Iteration   4: 119,316 us/op
Iteration   5: 119,427 us/op
Iteration   6: 119,018 us/op
Iteration   7: 119,305 us/op
Iteration   8: 119,361 us/op
Iteration   9: 119,130 us/op
Iteration  10: 119,105 us/op

# Run progress: 6,67% complete, ETA 00:05:31
# Fork: 4 of 5
# Warmup Iteration   1: 569,460 us/op
# Warmup Iteration   2: 389,247 us/op
# Warmup Iteration   3: 385,768 us/op
# Warmup Iteration   4: 381,309 us/op
# Warmup Iteration   5: 379,797 us/op
Iteration   1: 381,618 us/op
Iteration   2: 372,816 us/op
Iteration   3: 371,384 us/op
Iteration   4: 361,205 us/op
Iteration   5: 361,428 us/op
Iteration   6: 361,174 us/op
Iteration   7: 360,579 us/op
Iteration   8: 360,488 us/op
Iteration   9: 359,859 us/op
Iteration  10: 361,365 us/op

# Run progress: 8,89% complete, ETA 00:05:23
# Fork: 5 of 5
# Warmup Iteration   1: 544,560 us/op
# Warmup Iteration   2: 390,766 us/op
# Warmup Iteration   3: 388,537 us/op
# Warmup Iteration   4: 383,953 us/op
# Warmup Iteration   5: 382,270 us/op
Iteration   1: 384,325 us/op
Iteration   2: 377,098 us/op
Iteration   3: 371,531 us/op
Iteration   4: 362,499 us/op
Iteration   5: 362,045 us/op
Iteration   6: 363,345 us/op
Iteration   7: 361,930 us/op
Iteration   8: 362,357 us/op
Iteration   9: 362,452 us/op
Iteration  10: 362,302 us/op
The output falls into two distinct modes: one around 360 microseconds and another around 120, and within a fork the result is far more stable than between forks. Are you starting to sense the meaninglessness of the task? What question about performance differences can you even ask when simply restarting the same code makes it run three times slower? The JIT plays such tricks all the time. It would be unfortunate to draw conclusions from an experiment where you simply got lucky and the JIT made a favorable decision, one it makes in one case out of ten. As a rule, I never use fewer than three forks, and if I notice a suspicious difference in speed, I increase their number to study the phenomenon in more detail.
Still, having examined the output, I can say with confidence that even the slowest parallel fork is no worse than the fastest sequential one. Which means the third option can win, even though the authors consider it "the obviously wrong answer".
An attentive reader will object: hold on, why Java 9? The task stipulated Java 1.8.0_161. The remark is fair: I am indeed relying on a feature that appeared only in Java 9: List.spliterator was optimized for RandomAccess lists. In Java 8 the default spliterator implementation for a list is dumber, and it parallelizes much worse (and serves you right for using outdated Java versions!). So with this code on Java 8 we would hardly see any speedup. But that does not invalidate our reasoning; we just have to write a bit more code. Writing a full-fledged, well-behaved spliterator takes a while, but in some cases (including this one) we can cut corners and simply reimplement the stream() and parallelStream() methods:
class MyList extends AbstractList<Integer> implements RandomAccess {
    @Override
    public Integer get(int index) {
        // no checkIndex in Java 8 :-(
        if (index < 0 || index >= size)
            throw new IndexOutOfBoundsException();
        ...
        ...
    }

    @Override
    public int size() {
        return size;
    }

    @Override
    public Stream<Integer> stream() {
        return IntStream.range(0, size).mapToObj(this::get);
    }

    @Override
    public Stream<Integer> parallelStream() {
        return stream().parallel();
    }
}
I do not consider this cheating. For example, the stream for the list returned by Collections.nCopies is implemented in a similar way. If I write my own list intended to serve as a source for a parallel stream, it is only natural that I take care to make it parallelize well.
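As an illustration of the technique (my own sketch of an nCopies-like list, not the actual JDK source), the same IntStream-based trick gives a well-splitting stream even on Java 8:

```java
import java.util.AbstractList;
import java.util.List;
import java.util.RandomAccess;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// A sketch of an nCopies-like immutable list that supplies its own
// well-splitting stream() and parallelStream().
class NCopiesLike<E> extends AbstractList<E> implements RandomAccess {
    private final int n;
    private final E element;

    NCopiesLike(int n, E element) {
        this.n = n;
        this.element = element;
    }

    @Override public E get(int index) {
        if (index < 0 || index >= n)
            throw new IndexOutOfBoundsException(String.valueOf(index));
        return element;
    }

    @Override public int size() { return n; }

    // IntStream.range splits cleanly in half, so the derived
    // object stream parallelizes well regardless of the default
    // List spliterator quality.
    @Override public Stream<E> stream() {
        return IntStream.range(0, n).mapToObj(i -> element);
    }

    @Override public Stream<E> parallelStream() {
        return stream().parallel();
    }

    public static void main(String[] args) {
        List<String> list = new NCopiesLike<>(4, "x");
        System.out.println(list);                          // prints [x, x, x, x]
        System.out.println(list.parallelStream().count()); // prints 4
    }
}
```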
The results are exactly as expected:
# JMH version: 1.20
# VM version: JDK 1.8.0_161, VM 25.161-b12
# VM invoker: C:\Program Files\Java\jre1.8.0_161\bin\java.exe
Benchmark                  (size)  Mode  Cnt        Score       Error  Units
testForEach                   100  avgt   50      134,974 ±     2,900  us/op
testForEach                 10000  avgt   50    13075,674 ±   332,211  us/op
testForEach               1000000  avgt   50  1315215,358 ±  9702,119  us/op
testParallelStreamForEach     100  avgt   50      113,029 ±     1,934  us/op
testParallelStreamForEach   10000  avgt   50     9711,281 ±   113,490  us/op
testParallelStreamForEach 1000000  avgt   50  1002479,362 ± 11129,949  us/op
testStreamForEach             100  avgt   50      134,033 ±     2,806  us/op
testStreamForEach           10000  avgt   50    13247,368 ±   273,218  us/op
testStreamForEach         1000000  avgt   50  1320319,189 ± 16277,564  us/op
And here the parallel stream wins too. Incidentally, the results on Java 8 are much more stable; apparently something broke in the JIT compiler in Java 9. Interestingly, Java 10 is stable again and faster than Java 8.
The conclusion is this: performance is too delicate a matter to build a reliable puzzle on. At most, you can phrase the task not as "which is faster" but as "explain the effect", as apangin did, for instance, in his old quiz (see the second question). With the Stream API in particular nothing is simple: there are too many free variables and subtle effects to state anything with certainty. It is better to devise puzzles about correctness and semantics, where everything is much more rigorous.
I look forward to the analysis of the remaining SKB Kontur tasks!
Source: https://habr.com/ru/post/350808/