
An example of optimizing computations on CUDA

Introduction


I describe the results of applying optimization methods to CUDA computations when modeling plasma. The calculations use Java bindings to CUDA (JCUDA) [1] on a GT 630 (Kepler). The simulation is set up as a Cauchy problem: parameter values are set at the initial moment of time, then the time is incremented and all equations are recalculated, and so on, many times over. Calculations are performed in double precision. The correctness of the results is verified against CPU calculations without JCUDA.

The model of parametric instability of Langmuir waves in a plasma consists of equations for amplitude, phase, ion density, and electric field strength (400 equations of each type), equations of motion for 20,000 ions, and two pump equations. No function libraries (cuBLAS, etc.) are used; the equation code for GPU computation is hand-written in the .cu file.

1. Applying the main CUDA optimization methods


1.1. It is important to minimize data transfer between GPU memory and RAM, even if that means running code on the GPU that shows no speedup over the CPU [2]. For this reason the pump parameters are computed on the GPU without parallelism, since they are described by only two equations: computing them on the CPU would be faster, but the data exchange would cost more time.
1.2. The initial data for the calculation are loaded into GPU memory once, before the simulation starts. After that there is no exchange between RAM and the GPU, except at the moments in time whose results need to be saved.

1.3. All equations except the pump are computed in parallel (one thread is created per equation). The grid and block dimensions are chosen so that the calculation speed is maximal. The block dimension (the number of threads per block) should be neither too small nor too large; in my experience it should be approximately equal to the number of cores in the GPU (at least for GPUs with up to 500 cores).

1.4. All data loaded into the GPU are stored in global memory, although the GPU also has faster memory types: shared and constant. However, attempts to use them yielded an effect of only about 1%.

1.5. In some places in the code the same value is computed many times. For example, if the product a * b appears repeatedly, a variable c = a * b is created and then used instead. Several optimizations of this kind were made, but their effect is about 1%.

2. Optimizing the use of trigonometric functions


When computing on the GPU, 85% of the time is spent evaluating trigonometric functions, so optimizing their use is highly relevant for this task.

2.1. CUDA provides the functions sinpi and cospi; however, only 4 function calls in the model have a suitable form, and the effect of using them is about 2%.

2.2. There is also a sincos function that computes sine and cosine simultaneously. The effect of using it is about 50%. A significant drawback, however, is that each call needs memory allocated to store the sine and cosine values, which complicates its use.

2.3. An attempt was made to precompute the sines and cosines at each moment in time (i.e., to build a table of values) and then use the precomputed values in the equations. About a quarter of the calls share the same argument, but this optimization method yields an effect of only up to 5%.

2.4. In CUDA, every math function has several implementations differing in accuracy: double functions (sin, cos), float functions (sinf, cosf), and reduced-precision intrinsics (__sinf, __cosf). Using the float sine and cosine speeds up the computation of the equations by 60%, and the reduced-precision versions by 70%. All other calculations are still performed in double precision, and the accuracy of the results is preserved.

Findings


Before optimizing the use of trigonometric functions, the simulation time on the GPU was 20 minutes. The simulation time before applying items 1.1, 1.2, and 1.3 was not measured, since those items were implemented from the start.

After optimization the simulation time is 7 minutes: 90% of it is GPU computation (CUDA), 10% is additional CPU work related to saving the results (Java), and data exchange between GPU memory and RAM takes 0.01%.

The successful methods (1.1, 1.2, 1.3, 2.4) minimized data exchange between GPU memory and RAM, moved the main computations to the GPU and parallelized them, used optimal parallelization parameters (grid and block size), and used reduced-precision trigonometric functions.

The methods abandoned due to low efficiency or code complication (1.4, 1.5, 2.1, 2.2, 2.3) involved precomputing values in order to reduce the number of calculations, and attempts to use the fast GPU memory types.

The results of applying the various optimization methods can be explained by the specifics of the modeled task.

In this study the simulation was run several hundred times, sometimes with a large number of particles (doubling the number of particles quadruples the simulation time), so a small time saving per run adds up to a large saving overall. A more powerful video card was not available for the simulations. The comparison is not entirely fair, but the speedup with JCUDA over Java running on a single 2.2 GHz CPU core is 35 to 100 times, depending on the chosen simulation parameters. These developments are also used in other tasks.

References


1. Marco Hutter. JCUDA. jcuda.org
2. NVIDIA Corporation. CUDA C Best Practices Guide. docs.nvidia.com/cuda/cuda-c-best-practices-guide

Works included in the same study as this article (added 02/05/2014)


1. Optimization is described in more detail in the article it-visnyk.kpi.ua/wp-content/uploads/2014/01/Issue-58.pdf (pages 125-130). The numbers differ slightly, but the essence is the same.
Priymak A.V. Optimization of calculations on CUDA when modeling the instability of Langmuir waves in a plasma // Visnyk NTUU "KPI". Informatics, Control and Computer Engineering. Kyiv: Vek+, 2013. No. 58. P. 125-131.
2. Description of the simulated task: vant.kipt.kharkov.ua/TABFRAME.html (select 2013, No. 4 (86)).
Belkin E.V., Kirichok A.V., Kuklin V.M., Pryjmak A.V., Zagorodny A.G. Dynamics of ions during the development of the parametric instability of Langmuir waves // Problems of Atomic Science and Technology. Series: Plasma Electronics and New Acceleration Methods. 2013, No. 8, pp. 260-266.
3. About using JCUDA: csconf.donntu.edu.ua/arxiv (materials have not yet been posted).
Priymak A.V. Using JCUDA technology to simulate ion dynamics during the development of parametric instability of Langmuir waves // Informatics and Computer Technologies: Proceedings of the IX International Scientific and Technical Conference of Students, Graduate Students and Young Scientists. November 5-6, 2013, Donetsk, DonNTU. 2013. P. 200-204.
4. About the modeling software: conferences.neasmo.org.ua/node/2924
Priymak A.V. Software development for hybrid models of modulation instability of Langmuir waves in a plasma // Proceedings of the XVII International Scientific and Practical Internet Conference "Problems and Prospects for the Development of Science at the Beginning of the Third Millennium in the CIS Countries": Collected Scientific Works. November 29-30, 2013, Pereyaslav-Khmelnytsky, Pereyaslav-Khmelnytsky State Pedagogical University named after H. Skovoroda. P. 155-159.

Source: https://habr.com/ru/post/211194/

