
[Translation] When to use parallel streams

Original source
Authors: Doug Lea, with Brian Goetz, Paul Sandoz, Aleksey Shipilev, Heinz Kabutz, Joe Bowbeer, ...

The java.util.stream framework provides data-driven operations on collections and other data sources. Most stream methods apply the same operation to each element. Using a collection's parallelStream() method, data-driven can become data-parallel when several cores are available. But when is it worth doing so?
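For reference, a minimal sketch of the two forms being compared; the collection contents and the max operation are purely illustrative:

    import java.util.Arrays;
    import java.util.List;

    public class MaxDemo {
        public static void main(String[] args) {
            List<Integer> s = Arrays.asList(3, 1, 4, 1, 5, 9, 2, 6);

            // data-driven: one thread visits every element
            int seqMax = s.stream().mapToInt(Integer::intValue).max().getAsInt();

            // data-parallel: the same operation, split across available cores
            int parMax = s.parallelStream().mapToInt(Integer::intValue).max().getAsInt();

            System.out.println(seqMax + " / " + parMax);
        }
    }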


Consider using S.parallelStream().operation(F) instead of S.stream().operation(F) when the operations are independent of one another and are either computationally expensive, or applied to a large number of elements of efficiently splittable data structures, or both. More precisely:



The stream framework will not (and cannot) enforce any of the above. If the computations are not independent, running them in parallel makes no sense, or may even be harmful and lead to errors. Other criteria, derived from the above engineering constraints and tradeoffs, include:



Obtaining exact measurements of these effects can be difficult (although, with some effort, it is doable with tools such as JMH). But the cumulative effect is fairly easy to notice; to experience it yourself, run an experiment. For example, on a 32-core test machine, for small functions such as max() or sum() over an ArrayList, the break-even point is around 10,000 elements. Above that, speedups of up to 20x are observed. The running time for collections with fewer than 10,000 elements is not much less than for 10,000, and is therefore slower than sequential processing. The worst results occur with fewer than about 100 elements: the threads involved end up doing nothing useful, because the computation finishes before they even start. On the other hand, when per-element operations are time-consuming and the collection is efficiently and completely splittable, such as an ArrayList, the benefit is immediately visible.
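A rough sketch of how such a measurement could be set up with JMH; the class name, parameter values, and list contents are illustrative, and the actual break-even point depends on the machine:

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.LongStream;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    public class ParallelSumBench {
        @Param({"100", "10000", "1000000"})   // around and beyond the ~10,000 break-even point
        int size;

        List<Long> data;

        @Setup
        public void setup() {
            data = LongStream.range(0, size).boxed().collect(Collectors.toList());
        }

        @Benchmark
        public long sequentialSum() {
            return data.stream().mapToLong(Long::longValue).sum();
        }

        @Benchmark
        public long parallelSum() {
            return data.parallelStream().mapToLong(Long::longValue).sum();
        }
    }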


To put it another way, using parallel() for an unwarrantedly small amount of computation can cost around 100 microseconds, while using it otherwise should save at least that much time (or perhaps hours for very large tasks). The specific costs and benefits will vary over time, across platforms, and with context. For example, launching small parallel computations inside a sequential loop amplifies both the ups and the downs (performance microbenchmarks in which this happens may not reflect real-world situations).


Questions and answers



The framework might try, but too often its decision would be wrong. The search for fully automatic multi-core parallelism has not produced a universal solution over the past thirty years, so the framework takes a more robust approach and only asks the user to make a yes-or-no choice. This choice rests on engineering problems that are routinely encountered even in sequential programming and are unlikely ever to disappear completely. For example, you may encounter a 100-fold slowdown when finding the maximum value of a collection containing a single element, compared with just using that value directly (without a collection). Sometimes the JVM can optimize such cases away for you, but this rarely happens in sequential cases and never in parallel ones. On the other hand, one can expect that, as they evolve, tools will help users make better decisions.
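As an illustration of the kind of overhead being compared (the value is illustrative):

    long x = 42L;

    // Direct use: no collection, no stream machinery at all.
    long direct = x;

    // The same result via a one-element collection and a stream; per element,
    // this does far more work, which is the overhead described above.
    long viaStream = java.util.Collections.singletonList(x).stream()
            .mapToLong(Long::longValue)
            .max()
            .getAsLong();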



This, too, resembles problems often encountered in sequential programming. For example, the Collection method S.contains(x) is usually fast when S is a HashSet, slow when it is a LinkedList, and somewhere in between for other collections. Usually, the best way out for the author of a component that uses a collection is to encapsulate it and publish only the specific operations on it; users are then shielded from the choice. The same applies to parallel operations. For example, a component with an internal collection of prices could define a method that checks whether the size reaches a threshold, which makes sense as long as the per-element computations are not too expensive. Example:


    // prices is assumed here to be a List<Long>; MIN_PAR is the threshold above
    // which the parallel path is taken.
    public long getMaxPrice() {
        return priceStream().mapToLong(Long::longValue).max().orElse(0L);
    }

    private Stream<Long> priceStream() {
        return (prices.size() < MIN_PAR) ? prices.stream() : prices.parallelStream();
    }

This idea can be extended to other considerations about when and how to use parallelism.
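For instance, a hypothetical variant of the priceStream() helper above (not from the original text; it reuses the same assumed names and needs java.util.concurrent.ForkJoinPool) could also consult the available parallelism before choosing the parallel path:

    // Hypothetical variant: take the parallel path only when the data is large
    // enough AND the common ForkJoinPool actually has more than one worker.
    private Stream<Long> priceStream() {
        boolean worthParallel = prices.size() >= MIN_PAR
                && ForkJoinPool.getCommonPoolParallelism() > 1;
        return worthParallel ? prices.parallelStream() : prices.stream();
    }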



At one extreme are functions that do not meet the independence criteria, including I/O operations that are intrinsically sequential, accesses to locked or synchronized resources, and cases where a failure in one parallel subtask performing I/O affects the others. Parallelizing these makes little sense. At the other extreme are computations that only occasionally perform I/O or block briefly on synchronization (for example, most logging, and the use of concurrent collections such as ConcurrentHashMap); these are harmless. Everything in between requires further analysis. If each subtask may block for a significant amount of time waiting for I/O or access to a resource, CPU resources sit idle and can be used neither by the program nor by the JVM. Everybody loses. In these cases parallel stream processing is not always the right choice, but there are good alternatives, for example asynchronous I/O and the CompletableFuture approach.
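A minimal sketch of the CompletableFuture alternative, assuming a hypothetical blocking fetchPrice() call and a dedicated executor for it (class and method names are illustrative):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;

    public class PriceFetcher {
        // A dedicated pool for blocking calls, so the common ForkJoinPool used
        // by parallel streams is not tied up waiting on I/O.
        private final ExecutorService ioPool = Executors.newFixedThreadPool(16);

        public List<Long> fetchAll(List<String> urls) {
            List<CompletableFuture<Long>> futures = urls.stream()
                    .map(u -> CompletableFuture.supplyAsync(() -> fetchPrice(u), ioPool))
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join)   // wait for each result
                    .collect(Collectors.toList());
        }

        // Hypothetical blocking I/O call; body elided.
        private long fetchPrice(String url) {
            return 0L;
        }
    }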



Currently, the I/O-based Stream sources in the JDK (for example, BufferedReader.lines()) are mainly geared towards sequential use, processing elements one by one as they arrive. Support for high-performance bulk processing of buffered I/O is possible, but at the moment it requires writing custom Stream generators, Spliterators, and Collectors. Support for some common cases may be added in future JDK releases.
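Typical sequential use of such an I/O-based stream source looks like the following sketch (the file name and filter are illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LogScan {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(Paths.get("access.log"))) {
                long errors = in.lines()                      // elements arrive one by one
                        .filter(line -> line.contains("ERROR"))
                        .count();
                System.out.println(errors);
            }
        }
    }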



Machines usually have a fixed number of cores and cannot magically create new ones when parallel operations are performed. However, as long as the criteria for choosing parallel mode clearly hold, there is little reason to doubt the choice. Your parallel tasks will compete with others for the CPU, so you will see less speedup; in most cases this is still more effective than the alternatives. The underlying mechanism is designed so that, if no cores are available, you will notice only a slight slowdown compared with the sequential version, unless the system is so overloaded that it spends all its time context-switching instead of doing real work, or is configured on the assumption that all processing is sequential. If you have such a system, perhaps the administrator has already disabled the use of multiple threads/cores in the JVM settings; and if you are the administrator of such a system, it makes sense to do so.



Yes, at least to some extent. But bear in mind that the stream framework respects the constraints of sources and methods when choosing how to do this. In general, the fewer the constraints, the greater the potential parallelism. On the other hand, there is no guarantee that the framework will identify and exploit every available opportunity for parallelism. In some cases, if you have the time and the expertise, a hand-crafted solution can make much better use of the available parallelism.
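As one sketch of what relaxing constraints can look like (the collection and its contents are illustrative): dropping the encounter-order constraint gives the framework more freedom when parallelizing operations such as distinct().

    Set<String> uniqueNames = names.parallelStream()
            .unordered()                 // encounter order is not needed here
            .distinct()
            .collect(Collectors.toSet());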



If you follow these guidelines, the gain is usually enough to be worthwhile. Predictability is not a strength of modern hardware and systems, so there is no universal answer. Cache locality, GC characteristics, JIT compilation, memory access contention, data layout, OS scheduling policies, and the presence of a hypervisor are some of the factors with a significant influence. They affect sequential performance as well, but with parallelism the effects are often amplified: an issue that causes a 10 percent difference in sequential execution can lead to a 10-fold difference in parallel processing.


The stream framework includes some features that help improve the chances of a speedup. For example, using primitive specializations such as IntStream usually has a greater effect in parallel mode than in sequential mode, because it not only reduces resource (and memory) consumption but also improves cache locality. Using ConcurrentHashMap instead of HashMap for a parallel collect operation reduces the internal overhead. More tips and recommendations will surely emerge as people gain experience with the framework.
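The two hints above, in sketch form (data sizes and the words collection are illustrative):

    // 1) Primitive specialization: no boxing, better cache locality.
    long total = IntStream.range(0, 1_000_000).parallel().asLongStream().sum();

    // 2) A concurrent collector backed by ConcurrentHashMap reduces the merge
    //    overhead of a parallel collect().
    ConcurrentMap<Integer, List<String>> byLength = words.parallelStream()
            .collect(Collectors.groupingByConcurrent(String::length));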



We do not want to tell you what to do. Giving programmers new ways to get things wrong can seem scary; mistakes in code, architecture, and estimates will certainly happen. Decades ago, some predicted that bringing parallelism into applications would lead to great trouble. But it never came true.



Source: https://habr.com/ru/post/420805/

