In this article, I’ll talk about how the load on the data bus affects the
scalability of applications. By scalability we will understand not only the ability of a multi-threaded application to reduce its execution time as the number of threads grows. We will also include the ability of a single-threaded application running simultaneously in several copies to finish in the same time as a single copy. Strictly speaking, the latter case is better described by a property called
throughput, since it relates to the “server” mode of launching applications, i.e. the mode in which a single-threaded application is launched on the server every time a new client connects to it. The main task in developing such applications is to reduce their dependence on shared resources, one of which may be the data bus.
Below is a picture that shows the position of the memory bus in the system. On the left is the diagram for the “antediluvian” Core 2 architecture, on the right for the less ancient Nehalem. All subsequent Intel architectures have a layout similar to Nehalem (with the exception of the Intel MIC).

So, why do we need to know the state of the bus at runtime? Sometimes it happens that a program seems to be written according to all the canons of parallel programming: the share of single-threaded code is insignificant, the threads are loaded evenly, there is almost no synchronization, and so on, but something still prevents it from scaling linearly as the number of threads grows. In such cases, experts analyze the performance of the application at the architectural level. At this level you can find problems specific to a particular processor model or system configuration. One such problem is the data bus load.
Let's see how bus load affects scalability. To do this, we write a simple program that will read and write elements of a one-dimensional array in a loop.
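A minimal sketch of such a test program, assuming OpenMP for threading and a one-byte element type; N, STEP and REPEAT are illustrative names and values, not the author's original code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      (512L * 1024 * 1024)  /* array size in bytes, illustrative   */
#define REPEAT 100                   /* passes over the array, illustrative */

/* STEP controls cache-line utilization: with 1-byte elements,
   STEP = 1 touches every byte of a 64-byte line (100%),
   STEP = 64 touches one byte per line (~1.56%). */
#ifndef STEP
#define STEP 1
#endif

int main(void)
{
    char *a = calloc(N, 1);
    if (!a) return 1;

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPEAT; r++) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i += STEP)
            a[i] += 1;               /* read and write one element per step */
    }
    printf("time: %.2f s\n", omp_get_wtime() - t0);

    free(a);
    return 0;
}
```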

We will run this program with a different number of threads and different values of the STEP parameter. The STEP parameter determines cache-line utilization. Recall that the processor exchanges data with memory in 64-byte portions called cache lines. Even if we need to read only one byte, the processor will still fetch the whole 64 bytes from memory. This behaviour relies on the principle of
spatial locality, the first of the two principles underlying the cache. The processor essentially assumes that if we have read some value from an array, then in the next step we will most likely read the next value from the same array. Therefore it is important to place data as close to each other as possible in order to reduce the load on the bus. Thus, with STEP = 1 the cache-line utilization is 100%, with STEP = 4 it is 25%, with STEP = 8 it is 12.5%, and with STEP = 64 it is 1.56%. In effect, the last value means fetching a new cache line on every iteration of the inner loop.
Another note: the test program was compiled with the Intel compiler using the –no-vec option to get scalar code instead of vectorized code. This was done to obtain “clean” data and make the theory easier to follow.

This graph shows the execution time of our application depending on the tested parameters. We see that as cache-line utilization deteriorates (the STEP parameter grows), scalability, i.e. the ratio of the execution time on a smaller number of threads to the execution time on a larger number of threads, also becomes worse.
Now let's see how the load on the data bus varies depending on the tested parameters. We will measure the load with VTune Amplifier using the Bandwidth analysis.

We see that, together with the deterioration of scalability, the load on the bus increases. The explanation is simple: the threads demand more and more cache lines, and because of the limited bus bandwidth they have to idle longer and longer while waiting for data. This is what degrades scalability. It is also important to note that at some point the load stops changing significantly and gradually approaches a certain value, which is called the
peak load. In our case, the peak load is 19 GB/s.
Now let's consider the principle of
temporal locality. This is the second principle underlying the cache, and it says the following: if we have accessed some element in memory, then most likely we will access that element again after some time. To demonstrate this principle, take the worst case, where cache-line utilization is 1.56%. For this case we apply loop blocking, without violating data integrity and preserving the semantics of the program.
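A minimal sketch of the blocked variant of the same kernel, reusing a, N, STEP and REPEAT from the sketch above; BLOCK is an illustrative value chosen so that one block fits into the L2 cache (again an assumption, not the author's listing):

```c
#define BLOCK (128 * 1024)   /* illustrative: a block that fits into L2 */

/* Instead of sweeping the whole array REPEAT times, each block is
   processed REPEAT times while it still sits in the cache, and only
   then do we move on to the next block. */
#pragma omp parallel for schedule(static)
for (long base = 0; base < N; base += BLOCK) {
    for (int r = 0; r < REPEAT; r++)
        for (long i = base; i < base + BLOCK && i < N; i += STEP)
            a[i] += 1;       /* the data now comes from L2, not over the bus */
}
```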

This optimization allows us to process the data that currently resides in the cache the required number of times. Once a block has been processed, we move on to the next portion of data, pump it through the bus, and repeat the summation. It is important that the block size matches the size of the second-level cache.

This approach not only reduced the application execution time but also, which is most important for us, significantly improved scalability. The optimization reduced the dependence of the threads on the shared resource, i.e. the data bus, and switched them to the second-level cache, which is a private resource of each core and is also faster. We also see that the bus load has become negligible.
So, in order to improve the scalability of our application, we must use one of the principles of locality. And if we also want to significantly reduce the execution time, then we must use both principles.
Someone will say that we have not considered the case of non-temporal store instructions (movnti) intended to take load off the bus; I will only note that we will discuss this in the next article.
Now let's answer the main question of this article: “How do we understand that an application's weak scalability is caused by high data bus load?”
To answer this question, we must do the following with VTune Amplifier:
- Measure the peak data bus load in our system.
- Find out how the bus load changes as the number of threads in our application increases.
- If we see that, as the number of threads grows, the bus load quickly reaches the peak value measured in step 1, then this is the cause of our troubles (poor scalability). At the same time, we must make sure that other possible reasons (for example, false sharing) have already been checked and ruled out.
To determine the peak load, we take the test program from the first example with the parameter STEP = 64.
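As a sketch only (the compile line is an assumption about Intel compiler options of that era, not the author's exact command), the test program from the first sketch can be reused with STEP fixed at 64 at compile time:

```c
/* Reuse the test program from the first sketch, e.g.:
 *   icc -O2 -no-vec -openmp -DSTEP=64 bus_test.c -o bus_peak
 * The kernel then fetches a new cache line on every iteration:        */
for (int r = 0; r < REPEAT; r++) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i += STEP)   /* STEP == 64 */
        a[i] += 1;
}
```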

Just in case, I recommend compiling this program without interprocedural analysis; compiling it simply with the –O2 option is enough. Keep in mind that the size of the arrays should not exceed the size of RAM, otherwise operating system paging may distort the measurement. The number of threads should be no less than the number of cores, and, if Hyper-Threading is enabled, no more than the number of hardware threads. The number of repetitions (REPEAT) can be anything, as long as the test runs for a substantial time and VTune reports the same bandwidth from run to run.
And now let's consider an example from real life. Take the 470.lbm application from the SPEC CPU2006 suite. This is one of the implementations of a well-known method for solving hydrodynamics problems, the Lattice Boltzmann Method. This version is written in such a way that the load is shifted from the processor to the memory bus. Let's run the application on a two-socket Nehalem-based server and look at its scalability.

We see that scaling already degrades at four threads, while the bus load grows significantly and reaches the peak value at eight threads. I have already checked the other possible reasons for poor scalability, and they were not confirmed, so I conclude that this bus load is the main cause.
Now take a look at the hot loop of this application.

We see that the principle of spatial locality is violated in it: the modified elements of the srcGrid array are written to what is effectively 19 different arrays (writing to the dstGrid array with large offsets looks to the processor exactly like writing to different arrays). The main problem of this application is the non-contiguous stores with a stride of 20 elements. Such a complicated store pattern is due to the specific data structure: during execution one cube is transformed into another, and each element of this cube is a structure of 20 double elements. That is, we are in fact dealing with an array of structures, even though it is not declared explicitly.
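To illustrate the layout only (this is not the real 470.lbm source; the grid size, the neighbour offsets and the update formula are simplified placeholders), an array-of-structures view of this data looks roughly like this:

```c
#include <stdlib.h>

#define NX 100
#define NY 100
#define NZ 130
#define NCELLS ((long)NX * NY * NZ)      /* size of the cube, illustrative  */

typedef struct { double q[20]; } Cell;   /* 20 doubles = 160 bytes per cell */

/* Illustrative only: each of the 20 components of a cell is written into a
   different neighbouring cell of dstGrid, so every component forms its own
   write stream with a stride of 20 doubles, which the processor sees as
   roughly 20 separate, strided arrays. */
void stream_collide(const Cell *srcGrid, Cell *dstGrid, const long off[20])
{
    for (long i = 0; i < NCELLS; i++)
        for (int k = 0; k < 20; k++)
            dstGrid[i + off[k]].q[k] = 0.95 * srcGrid[i].q[k];
}

int main(void)
{
    static const long off[20] = {0};  /* the real code uses non-zero offsets */
    Cell *src = calloc(NCELLS, sizeof(Cell));
    Cell *dst = calloc(NCELLS, sizeof(Cell));
    if (src && dst) stream_collide(src, dst, off);
    free(src); free(dst);
    return 0;
}
```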

In order to make the stores linear, we need to apply a classical optimization called transformation of an array of structures into a structure of arrays (AoS to SoA). How to apply this optimization is described in the article “Optimization Study for Multicores” by Muneeb Anwar Khan. After applying the optimization and additionally blocking the stores (to help the hardware prefetcher), we obtain the following loop:

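For illustration only, a simplified sketch of the same idea, reusing NCELLS from the sketch above; the names, the block size and the update formula are assumptions, not the actual transformed 470.lbm loop:

```c
#define BLOCK 1024                   /* illustrative block size            */

typedef double Comp[NCELLS];         /* one contiguous array per component */

/* After the AoS -> SoA transformation each of the 20 components lives in
   its own array, so the stores become linear; walking the cells in blocks
   keeps the number of simultaneously active store streams small, which
   helps the hardware prefetcher. */
void stream_collide_soa(const Comp *src, Comp *dst)   /* src[20], dst[20] */
{
    for (long base = 0; base < NCELLS; base += BLOCK) {
        long end = base + BLOCK < NCELLS ? base + BLOCK : NCELLS;
        for (int k = 0; k < 20; k++)            /* one component at a time */
            for (long i = base; i < end; i++)   /* unit-stride writes      */
                dst[k][i] = 0.95 * src[k][i];
    }
}
```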
And the result:

We see that scalability has improved thanks to the reduced bus load, although it has to be admitted that the single-thread execution time has increased slightly. This is because the data transformation forced us to add another 19 arrays for srcGrid, which increased the load on the hardware prefetcher. An interesting result is obtained when running the single-threaded version of this application in eight copies, i.e. in “server” mode. (The application was compiled without parallelization options.)

The simultaneous execution of eight single-threaded copies on eight cores takes 252 seconds, which is less than eight consecutive runs of the multi-threaded version, which take 8 * 37 = 296 seconds. This suggests that the multi-threaded version has some algorithmic problems related to parallelization. But that's another story.