
The future of hardware accelerator programming

Many of the newest supercomputers are built around hardware accelerators, including the two fastest systems on the November 2013 TOP500 list. Accelerators are also common in ordinary PCs and are even appearing in portable devices, which further fuels interest in accelerator programming.

Such widespread use of accelerators is the result of their high performance, energy efficiency, and low cost. For example, comparing the Xeon E5-2687W and the GTX 680, both released in March 2012, the GTX 680 is four times cheaper, offers eight times the single-precision performance and four times the memory bandwidth, and delivers over 30 times more performance per dollar and six times more performance per watt. Judging by such comparisons, accelerators should be used always and everywhere. Why is this not happening?

There are two main difficulties. First, accelerators can execute only certain classes of programs efficiently, in particular programs with sufficient parallelism, data reuse, regular control flow, and regular memory access patterns. Second, writing efficient programs for accelerators is harder than for ordinary CPUs because of architectural differences such as very wide parallelism, an exposed memory hierarchy (without hardware caches), rigid execution ordering, and the coalescing of memory accesses. To hide these aspects to varying degrees and thus make accelerator programming easier, several programming languages and extensions have been proposed.

Initial attempts to use GPUs, currently the best-known type of accelerator, to speed up non-graphics applications were cumbersome: computations had to be expressed as shader code, which supported only limited control flow and no integer operations. These restrictions were gradually lifted, which broadened the adoption of computing on graphics chips and allowed specialists from non-graphics fields to program them. The most important step in this direction was the release of the CUDA programming language. It extends C/C++ with additional qualifiers and keywords, function libraries, and a mechanism for launching the pieces of code, called kernels, that run on the GPU.
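As a rough illustration of these extensions, here is a minimal, hypothetical CUDA sketch (the kernel and its launch sizes are our own example, not from the article): the "__global__" qualifier marks a kernel, built-in variables such as "threadIdx" identify the thread, and the triple-angle-bracket syntax launches the kernel on the GPU.

```cuda
// Minimal CUDA sketch: __global__ marks GPU code, <<<blocks, threads>>> launches it.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));           // allocate GPU memory
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); // launch the kernel on the GPU
    cudaDeviceSynchronize();                          // wait for the GPU to finish
    cudaFree(d_data);
    return 0;
}
```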
The rapid adoption of CUDA, combined with the fact that it is a proprietary product and that writing high-quality CUDA code is hard, led to the creation of other approaches to accelerator programming, including OpenCL, C++ AMP, and OpenACC. OpenCL is the non-proprietary counterpart of CUDA and is backed by many large companies. It is not limited to NVIDIA chips but also supports AMD GPUs, multi-core CPUs, MIC devices (Intel Xeon Phi), DSPs, and FPGAs, which makes it portable. However, like CUDA, it is very low-level: the programmer has to manage data movement explicitly, decide where variables live in the memory hierarchy, and express the parallelism in the code by hand. C++ Accelerated Massive Parallelism (C++ AMP) works at a middle level. It allows parallel algorithms to be described in C++ itself and hides the low-level code from the programmer; the "parallel_for_each" construct encapsulates the parallel code. C++ AMP is tied to Windows, does not yet support the CPU, and suffers from a large startup overhead, which makes it impractical for accelerating short-running code.
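For contrast with the low-level style above, a rough C++ AMP sketch (our own toy example, assuming the standard concurrency namespace from amp.h) shows how "parallel_for_each" hides the accelerator details:

```cpp
// C++ AMP sketch (Windows/MSVC): parallel_for_each encapsulates the accelerator
// code, and array_view handles the data movement implicitly.
#include <amp.h>
#include <vector>
using namespace concurrency;

void double_all(std::vector<float>& v) {
    array_view<float, 1> av(static_cast<int>(v.size()), v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
        av[idx] *= 2.0f;    // runs on the accelerator
    });
    av.synchronize();       // copy the results back to the host vector
}
```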

OpenACC, by contrast, is a very high-level approach to accelerator programming. It lets programmers annotate their code with directives that tell the compiler which parts should be accelerated, for example by offloading them to a GPU. The idea is similar to how OpenMP is used to parallelize CPU programs, and in fact efforts are under way to merge the two approaches. OpenACC is still maturing and is currently supported by only a few compilers.
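A small, hypothetical OpenACC example (ours, not from the article) shows the directive style; the pragma asks the compiler to offload the loop and the associated data transfers to an accelerator:

```c
// OpenACC sketch: the directive tells the compiler to parallelize the loop on
// an accelerator and to move the named arrays to and from device memory.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```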

To understand how the field of hardware accelerator programming may develop in the future, it is worth studying how a similar process unfolded in the past with other hardware accelerators. For example, early high-end PCs had an additional chip, the math coprocessor, that performed floating-point calculations. It was later integrated onto the same chip as the CPU and is now an integral part of it; only its registers and arithmetic logic units (ALUs) remain distinct. The later SIMD extensions (MMX, SSE, AltiVec, and AVX) were never shipped as separate chips and are likewise fully integrated into the processor core. Just like floating-point operations, SIMD instructions execute on their own ALUs and use their own registers.

Surprisingly, from the programmer's point of view these two kinds of instructions differ considerably. Floating-point types and operations were standardized long ago (IEEE 754) and are used everywhere today. They are available in high-level programming languages through ordinary arithmetic operators and built-in floating-point data types: 32 bits for single precision and 64 bits for double precision. In contrast, there are no such standards for SIMD instructions, and their very existence is largely hidden from the programmer. Vectorizing computations to use these instructions has been delegated to the compiler; developers who want to use them explicitly must resort to compiler-specific, non-portable intrinsics and macros.
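As an illustration of that explicit, non-portable style, here is a small sketch using x86 AVX intrinsics (our example; SSE or AltiVec would need different names and vector widths):

```c
// Explicit SIMD via compiler intrinsics (AVX, x86-only): the vector width of
// eight floats and the intrinsic names are tied to one instruction set.
#include <immintrin.h>

void add_arrays(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {                      // 8 floats per 256-bit register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                                // scalar remainder loop
        c[i] = a[i] + b[i];
}
```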

Since the performance of GPU and MIC accelerators stems from their SIMD nature, we think their evolution will follow the path of the earlier SIMD accelerators. Another similarity with SIMD, and a key feature that made CUDA successful, is that CUDA hides the SIMD nature of graphics processors and lets the programmer think in terms of threads operating on scalar data rather than warps operating on vectors. Therefore, accelerators will undoubtedly also move onto the same chip as the processor, but we believe their code will not be woven into ordinary CPU code to the same degree, and the GPU hardware data types will not be directly exposed to programmers.

Some accelerators have already been combined on a chip with traditional processors: AMD's APUs (used in the Xbox One), Intel processors with integrated HD Graphics, and NVIDIA's Tegra SoCs. However, accelerators will probably remain separate cores, because it is difficult to merge them with a traditional processor core to the extent achieved with the math coprocessor and the SIMD extensions, that is, to reduce them to a set of registers and separate ALUs inside the CPU. After all, accelerators are so fast, parallel, and energy efficient precisely because of architectural decisions that differ from the CPU's, such as decoupled caches, a completely different pipeline implementation, GDDR5 memory, and an order of magnitude more registers and hardware threads. Consequently, the complexity of running code on accelerators will remain. And since even processor cores on the same chip usually share only the lower levels of the memory hierarchy, the speed of data exchange between the CPU and accelerators will probably improve but will still remain a bottleneck.

The need to explicitly manage data movement between devices is a significant source of errors and a heavy burden on programmers. For small algorithms it often happens that more code is needed to organize the data transfers than to express the computation itself. Removing this burden is one of the main advantages of high-level programming approaches such as C++ AMP and OpenACC, and even the low-level implementations are attacking this problem: for example, well-performing unified memory access is one of the major improvements in the latest versions of CUDA, OpenCL, and NVIDIA GPU hardware. Still, achieving good performance usually requires the programmer's help, even in very high-level solutions such as OpenACC; in particular, allocating memory in the right places and transferring data often has to be done by hand.
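The following CUDA-style sketch (our own hypothetical example) contrasts the two situations: with explicit transfers the bookkeeping outweighs the one-line computation, while unified memory ("cudaMallocManaged" in recent CUDA versions) removes most of the boilerplate but not the need to think about where the data lives:

```cuda
__global__ void increment(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;                                    // the entire computation
}

void explicit_version(float *host, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));                        // allocate device memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    increment<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);                                              // more transfer code than math
}

void unified_version(int n) {
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));                // visible to both CPU and GPU
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                                    // wait before the CPU touches data
    cudaFree(data);
}
```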

Unfortunately, all the simplifications offered by such approaches can only be a partial solution. Given that future processors will resemble today's (small) supercomputers, they are likely to have more cores than a shared memory can serve. Instead, we think each chip will carry clusters of cores, with each cluster having its own memory, possibly stacked above the cores in the third dimension. The clusters will be connected by an on-chip network using an MPI-like protocol. This is not far from reality: Intel has just announced that networking functionality will be added to future Xeon chips, a step in this direction. Consequently, it is likely that future chips will become increasingly heterogeneous, combining cores optimized for latency and for throughput, network adapters, compression and encoding units, FPGAs, and so on.

This raises the crucial question of how to program such devices. We believe the answer is surprisingly similar to how the question is answered today for multi-core CPUs, SIMD extensions, and existing hardware accelerators. Programming happens at three levels, which we call libraries, automation tools, and do-it-yourself. Libraries are the simplest approach, based on simply calling functions from a library that someone has already optimized for the accelerator; many modern math libraries belong to this class. If most of a program's computation happens inside such library functions, this approach is entirely justified: it lets a few specialists write one good library that accelerates the multitude of applications using it.
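A sketch of the library approach, assuming NVIDIA's cuBLAS and matrices that already reside in GPU memory (the wrapper function is our own example): a single call performs a matrix multiplication that experts have already tuned for the accelerator.

```cpp
// Library approach, sketched with cuBLAS: one call computes C = A * B
// (square, column-major matrices) using code tuned by library experts.
#include <cublas_v2.h>

void gpu_sgemm(const float *dA, const float *dB, float *dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);
}
```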

C++ AMP and OpenACC use a different approach, automation tools, in which the hard work is shifted to the compiler. Its success depends on the quality and sophistication of the available tools and, as noted above, often still requires programmer intervention. Nevertheless, most programmers can quickly achieve good results with this approach, and it is not limited to predefined library functions. This is similar to how a few groups of specialists implement the internals of SQL, which lets ordinary developers keep using ready-made, optimized code.

Finally, the do-it-yourself approach is used with CUDA and OpenCL. It gives the programmer full control over access to almost all accelerator resources. Implemented well, the resulting code outperforms either of the previous two approaches, but this comes at the cost of considerable effort to learn the approach, a large amount of additional code, and more room for errors. Improvements in development and debugging environments can mitigate these difficulties, but only to an extent, so this approach is useful primarily for experts, that is, for those who build the libraries and tools used in the previous two approaches.

Because libraries are so easy to use, programmers will use them wherever possible, but that is only an option when the corresponding library functions exist. In popular domains such libraries usually do exist, for example BLAS for matrix operations, but in more peripheral domains, or where the computations are unstructured, it is difficult to provide accelerator libraries. In the absence of suitable libraries, programmers choose automation tools, provided of course that the tools are mature enough: computations that are not available as libraries, are not extremely performance-critical, and are supported by the compiler will most likely be implemented with automation tools. In the remaining cases the do-it-yourself method is used. Since OpenCL incorporates the successful ideas introduced in CUDA, is not proprietary, and supports diverse hardware, we think that it, or a derivative of it, will come to dominate this area, much as MPI became the de facto standard for programming distributed-memory systems.

Taking into account the hardware features and the evolution outlined above, we expect future processor chips to contain many clusters, each with its own memory. Each cluster will consist of many cores, and not all cores will be functionally identical. Each multithreaded core will contain a set of compute units (i.e., functional units, or ALUs), and each compute unit will execute SIMD instructions. Even if future chips do not include all of this at once, they will share one key property: a hierarchy of levels of parallelism. To create efficient and portable programs for such systems, we propose what we call the "copious-parallelism" technique. It generalizes the way a programmer adapts MPI programs to different numbers of compute nodes, or the way OpenMP code implicitly adapts to different numbers of cores or threads.

The main idea of copious parallelism, and the reason for the name, is to provide ample, parameterized parallelism at every level. The parameterization makes it possible to reduce the degree of parallelism in the program at any level to match the degree of hardware parallelism at that level, as sketched below. For example, a shared-memory system does not need the top level of parallelism, so it should be set to a single "cluster"; similarly, on a core whose compute units cannot execute SIMD instructions, the parameter that determines the SIMD width should be set to one. With this technique one can exploit the capabilities of multi-core CPUs, GPUs, MICs, and other devices, as well as likely future hardware architectures. Writing programs this way is undoubtedly harder, but copious parallelism makes it possible to extract high performance from a wide class of devices with a single code base.
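The following toy sketch (our illustration, not the authors' code) makes the idea concrete: the computation is written with an explicit, tunable degree of parallelism at each level, expressed here as plain loops so that any level can be collapsed to one; a real implementation would map the levels onto devices, threads or work-groups, and vector instructions.

```cpp
#include <vector>

// Degrees of parallelism at each level of the hierarchy; each one can be
// lowered (down to 1) to match the hardware at hand.
struct ParallelismConfig {
    int clusters;            // top level: memory domains or devices
    int cores_per_cluster;   // middle level: cores or work-groups
    int simd_width;          // bottom level: SIMD lanes per core
};

// Process all elements of 'data'; the loop nest mirrors the hardware
// hierarchy. On a shared-memory machine set clusters = 1; on a core
// without SIMD support set simd_width = 1.
void process(std::vector<float>& data, const ParallelismConfig& cfg) {
    const int n = static_cast<int>(data.size());
    const int lanes = cfg.clusters * cfg.cores_per_cluster * cfg.simd_width;
    for (int c = 0; c < cfg.clusters; ++c)
        for (int t = 0; t < cfg.cores_per_cluster; ++t)
            for (int s = 0; s < cfg.simd_width; ++s) {
                int id = (c * cfg.cores_per_cluster + t) * cfg.simd_width + s;
                for (int i = id; i < n; i += lanes)
                    data[i] *= 2.0f;   // the actual work for this lane
            }
}
```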

We tested this approach on the problem of direct n-body simulation. We wrote a single copious-parallelism implementation of the algorithm in OpenCL and measured it on four very different hardware architectures: an NVIDIA GeForce Titan GPU, an AMD Radeon 7970 GPU, an Intel Xeon E5-2690 CPU, and an Intel Xeon Phi 5110P MIC. Taking into account that 54% of all floating-point operations were not FMA (fused multiply-add) operations, copious parallelism achieved 75% of theoretical peak performance on the NVIDIA Titan, 95% on the Radeon, 80.5% on the CPU, and 80% on the MIC. This is only a single example, but its results are very encouraging. In fact, we believe that for some time copious parallelism will be the only approach for writing portable high-performance programs for existing and future accelerator-based systems.
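For readers unfamiliar with the algorithm, a hypothetical OpenCL C kernel for the direct n-body computation (our sketch, not the implementation measured above) looks roughly like this; each work-item accumulates the acceleration acting on one body, and the inner loop is dominated by multiply-add operations:

```c
// Hypothetical OpenCL C kernel for direct n-body: one work-item per body.
__kernel void nbody_acc(__global const float4 *pos,   // xyz = position, w = mass
                        __global float4 *acc,
                        const int n,
                        const float eps2)              // softening term
{
    int i = get_global_id(0);
    float4 pi = pos[i];
    float4 ai = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 d = pos[j] - pi;
        float r2 = d.x * d.x + d.y * d.y + d.z * d.z + eps2;
        float invr = rsqrt(r2);
        float s = pos[j].w * invr * invr * invr;       // m_j / r^3
        ai += (float4)(d.x * s, d.y * s, d.z * s, 0.0f);
    }
    acc[i] = ai;
}
```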

January 9, 2014
Kamil Rocki and Martin Burtscher

Source: https://habr.com/ru/post/209606/

