
Java Stream API: what's good and what's not



Is the Java 8 Stream API really so "energetic"? Can it turn the processing of complex operations on collections into simple and clear code? Where does parallel processing pay off, and when is it better to stop? These are some of the many questions our readers run into. Let's try to sort out the pitfalls of the Stream API with Tagir Valeev, aka @lany. Many readers already know our interlocutor from his articles, his research in the Java field, and his expressive conference talks. So, without further delay, let's begin the discussion.

- Tagir, you have an excellent record on StackOverflow (a gold badge in the "java-stream" tag). Judging by the questions and answers there, do you think the use of the Java 8 Stream API and the complexity of the constructs people build with it have grown?

- It's true, at one time I spent a lot of time on StackOverflow, constantly tracking Stream API questions. Now I only look in periodically, because, in my opinion, most of the interesting questions already have answers. Of course, you can feel that people have been trying out the Stream API; it would be strange if that were not so. The first questions on the topic appeared before the Java 8 release, when people experimented with early builds. The heyday came at the end of 2014 and in 2015.
Many interesting questions are connected not only with what can be done with the Stream API, but also with what cannot normally be done without third-party libraries. By constantly asking and discussing, users sought to push the boundaries of the Stream API. Some of those questions became sources of ideas for my StreamEx library, which extends the functionality of the Java 8 Stream API.
- You mentioned StreamEx. Tell us what prompted you to create it. What were your goals?

- The motives were purely practical. When we switched to Java 8 at work, the initial euphoria over its beauty and convenience quickly gave way to a series of stumbles: I wanted to do certain things with the Stream API that seemed like they should work, but in practice did not. I had to write longer code or deviate from the specification. I started adding auxiliary classes and methods to our work projects to solve these problems, but it looked ugly. Then it occurred to me to wrap the standard streams in my own classes offering a number of additional operations, and working with them became much more pleasant. I separated these classes into a standalone open source project and began developing it.

- In your opinion, what kinds of computations and operations, and what kinds of data, are a really good fit for the Stream API, and what is not so suitable?

- The Stream API loves immutable data. If you want to modify existing data structures rather than create new ones, you need something else. Look towards the new standard methods (for example, List.replaceAll).
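To illustrate the difference (an editorial sketch, not from the interview): List.replaceAll mutates the list in place, while a stream pipeline leaves the source untouched and produces a new collection.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ReplaceAllDemo {
        public static void main(String[] args) {
            List<String> names = Arrays.asList("anna", "bob");
            // List.replaceAll mutates the existing list in place.
            names.replaceAll(String::toUpperCase);
            System.out.println(names); // [ANNA, BOB]

            // The Stream API, by contrast, leaves the source alone
            // and builds a brand-new list.
            List<String> lower = names.stream()
                    .map(String::toLowerCase)
                    .collect(Collectors.toList());
            System.out.println(lower); // [anna, bob]
        }
    }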

The Stream API likes independent data. If you need several elements of the input set at once to produce a single result, things get very clumsy without third-party libraries. But libraries like StreamEx often solve this problem.
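As an illustration of the "independent data" point (our own sketch, not from the interview): StreamEx offers operations over adjacent elements, such as pairMap, while with the plain JDK a common workaround is to stream over indices instead.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.IntStream;

    public class PairwiseDemo {
        public static void main(String[] args) {
            List<Integer> values = Arrays.asList(1, 4, 9, 16, 25);
            // Differences between neighbouring elements: the indices stand in
            // for the "pair of adjacent elements" a plain Stream cannot express.
            int[] diffs = IntStream.range(1, values.size())
                    .map(i -> values.get(i) - values.get(i - 1))
                    .toArray();
            System.out.println(Arrays.toString(diffs)); // [3, 5, 7, 9]
        }
    }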

The Stream API likes to solve one problem per pass. If you want to solve several different problems in a single traversal of the data, get ready to write your own collectors. And it's not guaranteed to work out at all.
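A minimal sketch of a hand-written collector (our own illustration): it computes both the minimum and the maximum in a single traversal, with a combiner that makes it usable in parallel as well.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collector;

    public class MinMaxCollector {
        // Mutable accumulator for the two answers we want from one pass.
        static final class MinMax {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
        }

        static final Collector<Integer, MinMax, MinMax> MIN_MAX = Collector.of(
                MinMax::new,
                (acc, x) -> { acc.min = Math.min(acc.min, x); acc.max = Math.max(acc.max, x); },
                (a, b) -> { a.min = Math.min(a.min, b.min); a.max = Math.max(a.max, b.max); return a; });

        public static void main(String[] args) {
            List<Integer> data = Arrays.asList(5, 3, 8, 1, 9);
            MinMax mm = data.stream().collect(MIN_MAX);
            System.out.println(mm.min + ".." + mm.max); // 1..9
        }
    }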

The Stream API does not like checked exceptions. It is not very convenient to throw them from Stream API operations. Again, there are libraries that try to alleviate this (say, jOOλ), but I would recommend abandoning checked exceptions altogether.
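A small illustrative sketch (the file names are hypothetical): Files.readAllLines throws the checked IOException, so it cannot be passed to map directly; the usual workaround is to wrap it into an unchecked exception.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class CheckedExceptionDemo {
        public static void main(String[] args) {
            Stream<Path> paths = Stream.of(Paths.get("a.txt"), Paths.get("b.txt"));
            List<List<String>> contents = paths
                    .map(p -> {
                        try {
                            return Files.readAllLines(p); // throws checked IOException
                        } catch (IOException e) {
                            throw new UncheckedIOException(e); // rethrow unchecked
                        }
                    })
                    .collect(Collectors.toList());
            System.out.println(contents.size());
        }
    }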

The standard Stream API lacks some operations that are badly needed. For example, takeWhile, which will only appear in Java 9. You may find that you want something quite reasonable and uncomplicated, and it simply cannot be done. Again, it's worth noting that libraries like jOOλ and StreamEx solve most of these problems.
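For example, in Java 9 syntax (a trivial sketch):

    import java.util.stream.Stream;

    public class TakeWhileDemo {
        public static void main(String[] args) {
            // Java 9: keep elements only until the predicate first fails.
            Stream.of(1, 2, 3, 10, 4, 5)
                    .takeWhile(x -> x < 5)
                    .forEach(System.out::println); // 1 2 3
        }
    }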



- Do you think it makes sense to always use parallelStream? What problems can arise when "switching" from stream to parallelStream?

- By no means should parallelStream be used at all times. It should be used rarely, and you should have a good reason for it.

First, most tasks solved with the Stream API are too fast compared to the overhead of distributing the work across the ForkJoinPool and synchronizing it. The well-known article by Doug Lea, "When to use parallel streams", gives a rule of thumb: on modern machines it usually makes sense to parallelize tasks whose execution time exceeds 100 microseconds. My tests show that sometimes even a 20-microsecond task speeds up from parallelization, but that depends on many factors.

Secondly, even if your task runs for a long time, it is not a given that parallelism will speed it up. It depends on the quality of the source, on the intermediate operations (for example, limit on an ordered stream can take a long time), and on the terminal operations (for example, forEachOrdered can sometimes negate the benefits of parallelism). The best intermediate operations are stateless ones (filter, map, flatMap and peek), and the best terminal operations are the reduce/collect family, which are associative, that is, they can efficiently split the task into subtasks and then combine the results. And the combining procedure is sometimes suboptimal (for example, for complex groupingBy chains).
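A small example of the associativity requirement (our own illustration): string concatenation is associative and the empty string is a true identity, so such a reduce splits cleanly into subtasks whose partial results recombine to the same answer.

    import java.util.Arrays;
    import java.util.List;

    public class AssociativeReduce {
        public static void main(String[] args) {
            List<String> words = Arrays.asList("a", "b", "c", "d");
            // ("a" + "b") + ("c" + "d") equals "a" + ("b" + "c" + "d"),
            // so subtasks can be combined in any grouping.
            String joined = words.parallelStream()
                    .reduce("", String::concat);
            System.out.println(joined); // abcd
        }
    }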

Third, many people use the Stream API incorrectly, violating the specification. For example, by passing stateful lambdas into operations like filter and map. Or by violating the identity and associativity requirements in reduce. Not to mention how many incorrect collectors get written. This is often excusable for sequential streams but completely unacceptable for parallel ones. Of course, that is no reason to write incorrect code, but the fact is obvious: using parallel streams is harder; it is not just a matter of sprinkling parallel() somewhere.
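An illustration of the stateful-lambda mistake (deliberately broken code, our own sketch):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class StatefulLambdaDemo {
        public static void main(String[] args) {
            List<Integer> sink = new ArrayList<>();
            // BROKEN: the lambda mutates shared state. Sequentially it may
            // appear to work, but a parallel stream races on the
            // unsynchronized list and can lose elements or throw.
            IntStream.range(0, 10_000).parallel()
                    .map(x -> { sink.add(x); return x; }) // stateful - violates the spec
                    .sum();
            System.out.println(sink.size()); // often != 10000

            // Correct: let the stream collect the results itself.
            List<Integer> ok = IntStream.range(0, 10_000).boxed()
                    .collect(Collectors.toList());
            System.out.println(ok.size()); // always 10000
        }
    }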

And finally, even if your stream runs for a long time, its operations parallelize easily and you do everything correctly, you should ask whether you actually have idle processor cores that you are ready to hand over to parallel streams. If you have a web service that is constantly loaded with requests, it may well be more sensible to handle each request in its own thread. Only if you have many cores, or the system is not fully loaded, can you think about parallel streams. Also, it may be worth setting java.util.concurrent.ForkJoinPool.common.parallelism to limit parallel streams.

For example, if you have 16 cores and usually 12 are busy, try setting the parallelism level to 4 to occupy the remaining cores with streams. There is no universal advice, of course: you always have to measure.
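A sketch of that setting (the property must be applied before the common pool is first used, so in practice it is usually passed on the command line):

    public class CommonPoolSize {
        public static void main(String[] args) {
            // Equivalent command-line form:
            //   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=4 App
            System.setProperty(
                    "java.util.concurrent.ForkJoinPool.common.parallelism", "4");
            System.out.println(java.util.concurrent.ForkJoinPool.commonPool()
                    .getParallelism()); // 4
        }
    }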

- Continuing the conversation about parallelization: is it fair to say that performance is affected by the amount and structure of the data and by the number of processor cores? Which data sources (for example, LinkedList) are better not processed in parallel?

- LinkedList is not even the worst source. At least it knows its size, which allows the Stream API to split tasks more successfully. The worst sources for parallelism are those that are essentially sequential (like LinkedList) and do not report their size. Usually this is whatever is created via Spliterators.spliteratorUnknownSize(), or via AbstractSpliterator without specifying a size. Examples from the JDK include Stream.iterate(), Files.list(), Files.walk(), BufferedReader.lines(), Pattern.splitAsStream(), and so on. I talked about this in my "Strangeness of the Stream API" report at JPoint this year. The implementation there is very poor, which leads, for example, to such a source not parallelizing at all if it contains 1024 elements or fewer. And even beyond that it parallelizes pretty badly: for more or less decent parallelism it needs tens of thousands of elements. In StreamEx the implementation is better. For example, StreamEx.ofLines(reader) (an analogue of BufferedReader.lines()) parallelizes well even for small files. If you have a bad source and want to parallelize it, it is often better to first collect it sequentially into a list (for example, Stream.iterate(...).collect(toList()).parallelStream()...).
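A sketch of that last recipe (the numbers are arbitrary): materialize a poorly-splitting source into a list, then parallelize the sized, well-splitting list.

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class MaterializeThenParallel {
        public static void main(String[] args) {
            // Stream.iterate() is inherently sequential and size-unknown,
            // which parallelizes poorly. Collecting into a list first gives
            // the framework a sized source that splits well.
            List<Long> seed = Stream.iterate(1L, x -> x * 3 % 1_000_003)
                    .limit(100_000)
                    .collect(Collectors.toList());
            long count = seed.parallelStream()
                    .filter(x -> x % 2 == 0)
                    .count();
            System.out.println(count);
        }
    }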

Most standard JDK data structures are good sources. But be wary of structures and wrappers from third-party libraries that remain compatible with Java 7. They cannot override the spliterator() method (since there are no spliterators in Java 7), so by default they use the Collection.spliterator() or List.spliterator() implementation, which, of course, parallelizes badly: it knows nothing about your data structure and simply wraps an iterator. In Java 9 this will improve for random-access lists.

- Speaking of intermediate operations: in your opinion, is there a threshold on their number in a stream pipeline, and how is it determined? Are there any limitations, explicit or implicit?

- I would not say there are severe restrictions. Most intermediate Stream API operations increase the stack depth by one call. On the one hand, this can reduce the efficiency of inlining (by default, MaxInlineLevel = 9 in the HotSpot JVM). But usually this only matters in isolated benchmarks; in real-world applications the calls are too polymorphic anyway to get the full benefit of inlining. Long stack traces can clog logs, slow down the creation of exception objects, or simply frighten newcomers, but in general they are harmless. It is no problem to have dozens of intermediate operations if you really need them. Of course, you should not grow the stream uncontrollably, for example in a loop like while (something) stream = stream.map(x -> ...). The next terminal operation then risks failing with a StackOverflowError.
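A minimal demonstration of that failure mode (the loop count is arbitrary; the exact limit depends on the thread stack size):

    import java.util.stream.Stream;

    public class DeepPipeline {
        public static void main(String[] args) {
            Stream<Integer> stream = Stream.of(1);
            // Each map() adds one stack frame at evaluation time; grow the
            // pipeline far enough and the terminal operation blows the stack.
            for (int i = 0; i < 100_000; i++) {
                stream = stream.map(x -> x + 1);
            }
            stream.forEach(System.out::println); // throws StackOverflowError
        }
    }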

- Can ordering the data during processing (an intermediate sorted() operation), or using an ordered data source, and then applying map, filter and reduce operations lead to improved performance?

- No, it is unlikely. Only the distinct() operation takes advantage of sorted input: it switches to an algorithm that compares each element with the previous one, whereas without sortedness it has to maintain a HashSet. However, for this the source must report that it is sorted. All sorted sources in the JDK (BitSet, TreeSet, IntStream.range) already contain unique elements, so distinct() is useless for them. Theoretically, the filter operation might gain something from better branch prediction in the processor if the predicate is true for the first half of the data set and false for the second. But if the data is already sorted by the predicate, it is more efficient not to use the Stream API at all and to find the boundary with a binary search. And sorting itself is slow if the input is poorly ordered. Therefore, say, sorted().distinct() on random data will be slower than plain distinct(), even though the distinct() step itself speeds up.
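To illustrate the binary-search alternative (our own sketch): if the data is already sorted by the predicate's key, the boundary can be found in O(log n) instead of scanning every element.

    import java.util.Arrays;

    public class SortedBoundary {
        public static void main(String[] args) {
            int[] data = {1, 3, 5, 7, 9, 11}; // already sorted
            int threshold = 7;
            // Instead of stream().filter(x -> x < 7).count(), locate the
            // boundary directly.
            int pos = Arrays.binarySearch(data, threshold);
            int countBelow = pos >= 0 ? pos : -pos - 1; // insertion point
            System.out.println(countBelow); // 3
        }
    }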

- We should touch on the important topic of debugging code. Do you use the peek() method to inspect intermediate results? Perhaps you have some testing secrets of your own? Please share them with our readers.

- Somehow I never use peek() for debugging. If a stream is complex enough that something incomprehensible is happening inside it, you can break it into several streams (with an intermediate list) and examine that list. In general, you can get used to stepping through a stream in the ordinary step-by-step debugger in the IDE. At first it's scary, but then you get used to it.
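A sketch of that technique (our own example): split one long pipeline into two around an intermediate list that can be printed or inspected at a breakpoint.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class DebugBySplitting {
        public static void main(String[] args) {
            // Materialize the suspicious half of the pipeline into a list...
            List<String> intermediate = Arrays.asList("alpha", "beta", "gamma").stream()
                    .filter(s -> s.length() > 4)
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
            System.out.println(intermediate); // ...set a breakpoint here and look

            // ...then continue with the rest of the pipeline.
            long count = intermediate.stream().distinct().count();
            System.out.println(count);
        }
    }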

When I develop new spliterators and collectors, I use helper methods in tests that subject them to extensive testing, checking various invariants and running them under different conditions. Say, I not only verify that a parallel and a sequential stream give the same result, but I can also insert an artificial spliterator into the parallel stream that spawns empty fragments when creating parallel tasks. They must not affect the result, and they help uncover non-trivial bugs. Or, when testing spliterators, I randomly split them into subtasks, execute those in random order (but in a single thread), and compare the result with the sequential one. This is a stable, reproducible test which, although single-threaded, catches most of the errors in parallelized spliterators. In general, a good test system that comprehensively checks every brick of the code and produces a sane report on failure usually replaces debugging entirely.
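In the spirit of what was just described, a hypothetical helper (the name assertSameResult is ours, not from StreamEx) that runs the same pipeline sequentially and in parallel and insists the results match:

    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import java.util.stream.Stream;

    public class StreamInvariantCheck {
        // Run the pipeline over a sequential and a parallel stream of the
        // same input; any difference signals a spec violation somewhere.
        static <T, R> void assertSameResult(List<T> input,
                                            Function<Stream<T>, R> pipeline) {
            R sequential = pipeline.apply(input.stream());
            R parallel = pipeline.apply(input.parallelStream());
            if (!sequential.equals(parallel)) {
                throw new AssertionError(sequential + " != " + parallel);
            }
        }

        public static void main(String[] args) {
            List<Integer> data = IntStream.range(0, 10_000).boxed()
                    .collect(Collectors.toList());
            assertSameResult(data, s -> s.filter(x -> x % 7 == 0)
                    .map(x -> x * 2)
                    .collect(Collectors.toList()));
            System.out.println("ok");
        }
    }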

- What development of the Stream API do you see in the future?

- A difficult question; I don't know how to predict the future. Right now a lot rests on the fact that there are four specializations of the Stream API (Stream, IntStream, LongStream, DoubleStream), so a lot of code has to be duplicated four times, which few people enjoy. Everyone is looking forward to the specialization of generics, which will probably be finished in Java 10. Then it will get easier.

There are also problems with extending the Stream API. As you know, Stream is an interface, not some final class. On the one hand, this lets third-party developers extend the Stream API. On the other hand, adding new methods to it is no longer so easy: you must not break all the classes that have already implemented this interface in Java 8. Every new method has to come with a default implementation expressed in terms of the existing methods, which is not always possible or easy. So explosive growth of functionality should hardly be expected.

The most important additions coming in Java 9 are the takeWhile and dropWhile methods. There will be nice little things as well: Stream.ofNullable, Optional.stream, iterate with three arguments, and several new collectors such as flatMapping and filtering. But on the whole, much will still be missing. On the other hand, more JDK methods that create streams will appear: new APIs are now designed with streams in mind, and the old ones are catching up.
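A few of those Java 9 additions in action (a simple sketch):

    import java.util.stream.Stream;

    public class Java9StreamAdditions {
        public static void main(String[] args) {
            // Stream.ofNullable: empty stream for null, one element otherwise.
            System.out.println(Stream.ofNullable(null).count()); // 0

            // Three-argument iterate: a stream-style for loop.
            Stream.iterate(1, x -> x < 100, x -> x * 2)
                    .forEach(System.out::println); // 1 2 4 ... 64

            // dropWhile: the complement of takeWhile.
            Stream.of(1, 2, 3, 10, 4)
                    .dropWhile(x -> x < 3)
                    .forEach(System.out::println); // 3 10 4
        }
    }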

- Many people remember your 2015 talk "What do we measure?". Are you planning to present a new topic at Joker this year? What will it be about?

- I decided to prepare a new talk, which, without much creativity, I called "Quirks of the Stream API". In a sense it will be a continuation of the "Strangeness of the Stream API" report from JPoint: I will talk about unexpected performance effects and slippery spots in the Stream API, focusing on what will be fixed in Java 9.

- Thank you very much for the interesting and detailed answers. We look forward to your new talk.



You can immerse yourself in the world of the Stream API and other Java hardcore at the Joker 2016 conference. There you will also find questions to the speakers, discussions around the talks, and endless networking.

Source: https://habr.com/ru/post/307938/

