
We continue our acquaintance with Intel Xeon Phi: "native" code

In the previous article, a first acquaintance with the Intel Xeon Phi coprocessor was made using offload: the main code runs on the host, and individual blocks are offloaded to the coprocessor. In this article we will look at building and using "native" code in order to find out what it offers and what pitfalls it brings. At the end of the post there will be the promised four sentences about Fortran and sample programs.

This article is neither an advertisement nor an anti-advertisement for any software or hardware product; it only describes the personal experience of the author.
As last time, we will consider the problem of interacting bodies (the n-body problem). We will take the CPU solution of the problem from the previous article and then, where necessary, modify the code to run on Intel Xeon Phi (hereinafter referred to as MIC).
Parallel code using OpenMP
/*---------------------------------------------------------*/
/* N-Body simulation benchmark                              */
/* written by MSOzhgibesov                                  */
/* 04 July 2015                                             */
/*---------------------------------------------------------*/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <unistd.h>   // gethostname
#include <omp.h>

#define HOSTLEN 50

int numProc;

// Initial conditions
void initCoord(float *rA, float *vA, float *fA,
               float initDist, int nBod, int nI);
// Forces acting on each body
void forces(float *rA, float *fA, int nBod);
// Calculate velocities and update coordinates
void integration(float *rA, float *vA, float *fA, int nBod);

int main(int argc, const char * argv[])
{
   int const nI = 32;           // Number of bodies in X, Y and Z directions
   int const nBod = nI*nI*nI;   // Total Number of bodies
   int const maxIter = 20;      // Total number of iterations (time steps)
   float const initDist = 1.0;  // Initial distance between the bodies
   float *rA;                   // Coordinates
   float *vA;                   // Velocities
   float *fA;                   // Forces
   int iter;
   double startTime0, endTime0;
   char host[HOSTLEN];

   rA = (float*)malloc(3*nBod*sizeof(float));
   fA = (float*)malloc(3*nBod*sizeof(float));
   vA = (float*)malloc(3*nBod*sizeof(float));

   gethostname(host, HOSTLEN);
   printf("Host name: %s\n", host);
   numProc = omp_get_num_procs();
   printf("Available number of processors: %d\n", numProc);

   // Setup initial conditions
   initCoord(rA, vA, fA, initDist, nBod, nI);

   startTime0 = omp_get_wtime();
   // Main loop
   for ( iter = 0; iter < maxIter; iter++ )
   {
      forces(rA, fA, nBod);
      integration(rA, vA, fA, nBod);
   }
   endTime0 = omp_get_wtime();

   printf("\nTotal time = %10.4f [sec]\n", endTime0 - startTime0);

   free(rA);
   free(vA);
   free(fA);
   return 0;
}

// Initial conditions
void initCoord(float *rA, float *vA, float *fA,
               float initDist, int nBod, int nI)
{
   int i, j, k;
   float Xi, Yi, Zi;
   float *rAx = &rA[     0];  //----
   float *rAy = &rA[  nBod];  // Pointers on X, Y, Z components of coordinates
   float *rAz = &rA[2*nBod];  //----
   int ii = 0;

   memset(fA, 0, 3*nBod*sizeof(float));
   memset(vA, 0, 3*nBod*sizeof(float));

   for (i = 0; i < nI; i++)
   {
      Xi = i*initDist;
      for (j = 0; j < nI; j++)
      {
         Yi = j*initDist;
         for (k = 0; k < nI; k++)
         {
            Zi = k*initDist;
            rAx[ii] = Xi;
            rAy[ii] = Yi;
            rAz[ii] = Zi;
            ii++;
         }
      }
   }
}

// Forces acting on each body
void forces(float *rA, float *fA, int nBod)
{
   int i, j;
   float Xi, Yi, Zi;
   float Xij, Yij, Zij;       // X[j] - X[i] and so on
   float Rij2;                // Xij^2+Yij^2+Zij^2
   float invRij2, invRij6;    // 1/rij^2; 1/rij^6
   float *rAx = &rA[     0];  //----
   float *rAy = &rA[  nBod];  // Pointers on X, Y, Z components of coordinates
   float *rAz = &rA[2*nBod];  //----
   float *fAx = &fA[     0];  //----
   float *fAy = &fA[  nBod];  // Pointers on X, Y, Z components of forces
   float *fAz = &fA[2*nBod];  //----
   float magForce;            // Force magnitude
   float const EPS = 1.E-10;  // Small value to prevent 0/0 if i==j

#pragma omp parallel for num_threads(numProc) private(Xi, Yi, Zi, \
        Xij, Yij, Zij, magForce, invRij2, invRij6, j, i)
   for (i = 0; i < nBod; i++)
   {
      Xi = rAx[i];
      Yi = rAy[i];
      Zi = rAz[i];
      fAx[i] = 0.0;
      fAy[i] = 0.0;
      fAz[i] = 0.0;

      for (j = 0; j < nBod; j++)
      {
         Xij = rAx[j] - Xi;
         Yij = rAy[j] - Yi;
         Zij = rAz[j] - Zi;

         Rij2 = Xij*Xij + Yij*Yij + Zij*Zij;

         invRij2 = Rij2/((Rij2 + EPS)*(Rij2 + EPS));
         invRij6 = invRij2*invRij2*invRij2;

         magForce = 6.f*invRij2*(2.f*invRij6 - 1.f)*invRij6;

         fAx[i]+= Xij*magForce;
         fAy[i]+= Yij*magForce;
         fAz[i]+= Zij*magForce;
      }
   }
}

// Integration of coordinates and velocities
void integration(float *rA, float *vA, float *fA, int nBod)
{
   int i;
   float const dt = 0.01;       // Time step
   float const mass = 1.0;      // Mass of a body
   float const mdthalf = dt*0.5/mass;
   float *rAx = &rA[     0];
   float *rAy = &rA[  nBod];
   float *rAz = &rA[2*nBod];
   float *vAx = &vA[     0];
   float *vAy = &vA[  nBod];
   float *vAz = &vA[2*nBod];
   float *fAx = &fA[     0];
   float *fAy = &fA[  nBod];
   float *fAz = &fA[2*nBod];

#pragma omp parallel for num_threads(numProc) private(i)
   for (i = 0; i < nBod; i++)
   {
      rAx[i]+= (vAx[i] + fAx[i]*mdthalf)*dt;
      rAy[i]+= (vAy[i] + fAy[i]*mdthalf)*dt;
      rAz[i]+= (vAz[i] + fAz[i]*mdthalf)*dt;

      vAx[i]+= fAx[i]*dt;
      vAy[i]+= fAy[i]*dt;
      vAz[i]+= fAz[i]*dt;
   }
}


The code can be run on the coprocessor in two ways: in offload mode, when the main program runs on the host and offloads individual blocks to the coprocessor, or natively, when the whole program is built for the MIC and executed directly on it.

Last time we looked at offload mode; this time we will try to build and run "native" code on the MIC.
This approach lets you run an existing program on the coprocessor with minimal changes. However, a few points should be kept in mind: the program must be compiled specifically for the MIC architecture, and every library it depends on must also be available in a MIC build on the coprocessor.

The resulting MIC executable is copied to the coprocessor via scp (Intel Xeon Phi runs its own Linux-based micro-OS) and launched there.

Create / add user on MIC


  1. As the user we want to add (let it be micuser), generate ssh keys:
     $ ssh-keygen 

     Remember the path where they were saved: /home/micuser/.ssh/
  2. As root, create the new user on the MIC:
     $ micctrl --useradd=micuser --uid=500 --gid=500 --sshkeys=/home/micuser/.ssh/ 

     where uid and gid are the user and group IDs.

If you do not specify the directory with the ssh keys, logging in as that user will not work: the coprocessor will ask for a password that we do not know. A detailed description of the Xeon Phi administration process can be found in the official documentation. An alternative way to create a user on the MIC is to log in to the coprocessor as root (by default only root has ssh access to the MIC) and create the user with useradd. I did not try the second method: I prefer to follow the official manual rather than deal with possible glitches.
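Purely for reference, the untested alternative might look something like the sketch below; I have not verified it, and the exact commands available in the MIC micro-OS (and their flags) are an assumption on my part:

 $ ssh mic0          # by default only root can log in to the coprocessor
 # useradd micuser   # create the user directly in the micro-OS (mentioned above, not verified)
 # passwd micuser    # set a password so the new user can log in (assumption)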

Moving over to the MIC


To verify the claim that a CPU program can be run on the MIC with minimal changes, we will take the CPU version of the program shown at the very beginning. Compile it for the MIC, copy it over and run it:
 $ icc nbody_CPU.c -mmic -openmp -O3 -o nbdMIC.run
 $ scp nbdMIC.run mic0:
 $ ssh mic0
 $ ./nbdMIC.run
 ./nbdMIC.run: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

This is not funny at all: it looked like we had copied everything we needed! In fact, almost nothing: the point is that Xeon Phi is a separate device with its own file system, and by default it knows nothing about the host's libraries. The solution is simple: the required libraries have to be copied to the MIC in the same way as the executable itself. We go back to the host and copy (note that we copy not just any library, but the one built for MIC):
 $ scp /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic/libiomp5.so mic0:/tmp/
 $ ssh mic0
 $ echo $LD_LIBRARY_PATH
 $ export LD_LIBRARY_PATH=/tmp
 $ ./nbdMIC.run
 Host name: mic0.local
 Available number of processors: 240
 Total time =     1.0823 [sec]

Here we see two interesting things:
  1. The number of available threads is 240 (the Intel Xeon Phi 5110P has 60 physical cores), not 236 as in the offload case;
  2. The "native" code runs ~1.3x faster than the offload version (1.08 seconds vs. 1.44 seconds).

Results of the offload version of the program


In the offload case, one core is reserved for the offload daemon that handles interaction with the host, while the "native" code runs with all available resources.
The speed-up comes from the almost complete absence of data exchange between the host and the MIC (apart from printing), as well as from the extra compute core (not much, but still).
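If you want to double-check the hardware thread count that the micro-OS itself reports, you can query /proc/cpuinfo on the card; this is just a sanity check, assuming the usual procfs layout and a grep that supports -c:

 $ ssh mic0 "grep -c processor /proc/cpuinfo"   # should report 240 logical threads on a 60-core 5110P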
As noted above, the coprocessor is a separate device, so the copied libraries and the executable itself have to be stored somewhere, yet the coprocessor has no SSD/HDD of its own (at least the 5110P does not). Where, then, does everything go? The answer is simple: into RAM. Thus every file you copy reduces the amount of RAM available to the running program. And what if the program's output is a file of a couple of gigabytes? For such cases you can mount a host directory on the MIC.
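For example, a host directory shared over NFS might be set up roughly as in the sketch below. This is only an illustration: the directory name /micshare is made up, the addresses (172.31.1.1 for mic0 and 172.31.1.254 for the host) are the MPSS defaults as far as I remember, and the NFS server on the host must already be running:

 # On the host: export the directory to the coprocessor
 $ cat /etc/exports
 /micshare 172.31.1.1(rw,no_root_squash)
 $ exportfs -a
 # On the coprocessor: mount the exported directory
 $ ssh mic0
 $ mkdir -p /micshare
 $ mount -t nfs 172.31.1.254:/micshare /micshare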
Tracking down and copying all the required libraries by hand is also a tedious task; fortunately, there is the micnativeloadex utility, which can determine all the dependencies of a compiled program. A description of this utility, as well as how to mount a directory, can be found in the official documentation.
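A minimal sketch of how I understand this utility is used; the -l flag (list the dependencies instead of running the binary) and the SINK_LD_LIBRARY_PATH variable that tells it where to look for MIC libraries are assumptions on my part, so check micnativeloadex --help on your system:

 $ export SINK_LD_LIBRARY_PATH=/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/mic
 $ micnativeloadex ./nbdMIC.run -l   # print the MIC shared-library dependencies of the binary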

Four sentences about Fortran


The previous article described a first acquaintance with the Intel Xeon Phi coprocessor, carried out exclusively in C. The possibility of using Fortran was mentioned, but without any description of exactly how to do it, and as a result a request came in to correct the situation. The basic idea is the same as in the C case: everything stays unchanged, only the syntax of the directives differs. Therefore, only the source codes of the Fortran programs are given below.
Fortran program for CPU
!---------------------------------------------------------!
!   N-Body simulation benchmark                           !
!   written by MSOzhgibesov                               !
!   14 July 2015                                          !
!---------------------------------------------------------!
program nbody_CPU
use omp_lib
implicit none

integer, parameter:: nI      = 32     ! Number of bodies in X, Y and Z directions
integer, parameter:: nBod    = nI**3  ! Total Number of bodies
integer, parameter:: maxIter = 20     ! Total number of iterations (time steps)
integer:: numProc                     ! Number of available processors
integer:: iter
character(len=50):: host

real(4), parameter:: initDist = 1.0   ! Initial distance between the bodies
real(4), allocatable:: rA(:)          ! Coordinates
real(4), allocatable:: vA(:)          ! Velocities
real(4), allocatable:: fA(:)          ! Forces
real(8):: startTime0, endTime0

common/ourCommonData/numProc

allocate(rA(3*nBod), vA(3*nBod), fA(3*nBod))

call hostnm(host)
write(*,'(A11,A50)')"Host name: ", host

numProc = omp_get_num_procs()
write(*,'(A32,I4)')"Available number of processors: ", numProc

! Setup initial conditions
call initCoord(rA, vA, fA, initDist, nBod, nI)

! Main loop
startTime0 = omp_get_wtime()
do iter = 1, maxIter
   call forces(rA, fA, nBod)
   call integration(rA, vA, fA, nBod)
enddo
endTime0 = omp_get_wtime()

write(*,'(A13,F10.4,A6)') "Total time = ", endTime0 - startTime0, " [sec]"

deallocate(rA, vA, fA)
end program

! Initial conditions
subroutine initCoord(rA, vA, fA, initDist, nBod, nI)
implicit none
integer:: i, j, k, ii
integer:: nI, nBod
real(4):: initDist
real(4):: Xi, Yi, Zi
real(4):: rA(*), fA(*), vA(*)

fA(1:3*nBod) = 0.E0
vA(1:3*nBod) = 0.E0

ii = 1
do i = 1, nI
   Xi = (i - 1)*initDist
   do j = 1, nI
      Yi = (j - 1)*initDist
      do k = 1, nI
         Zi = (k - 1)*initDist
         rA(ii       ) = Xi
         rA(ii+  nBod) = Yi
         rA(ii+2*nBod) = Zi
         ii = ii + 1
      enddo
   enddo
enddo

end subroutine initCoord

! Forces acting on each body
subroutine forces(rA, fA, nBod)
use omp_lib
implicit none
integer:: i, j
integer:: nBod
integer:: numProc
real(4):: Xi, Yi, Zi
real(4):: Xij, Yij, Zij            ! X[j] - X[i] and so on
real(4):: Rij2                     ! Xij^2+Yij^2+Zij^2
real(4):: invRij2, invRij6         ! 1/rij^2; 1/rij^6
real(4):: rA(*), fA(*)
real(4):: magForce                 ! Force magnitude
real(4):: fAix, fAiy, fAiz
real(4), parameter:: EPS = 1.E-10  ! Small value to prevent 0/0 if i==j
common/ourCommonData/numProc

!$OMP PARALLEL NUM_THREADS(numProc) &
!$OMP PRIVATE(Xi, Yi, Zi, Xij, Yij, Zij, magForce, invRij2, invRij6, i, j)&
!$OMP PRIVATE(fAix, fAiy, fAiz)
!$OMP DO
do i = 1, nBod
   Xi = rA(i       )
   Yi = rA(i+  nBod)
   Zi = rA(i+2*nBod)
   fAix = 0.E0
   fAiy = 0.E0
   fAiz = 0.E0

   do j = 1, nBod
      Xij = rA(j       ) - Xi
      Yij = rA(j+  nBod) - Yi
      Zij = rA(j+2*nBod) - Zi

      Rij2 = Xij*Xij + Yij*Yij + Zij*Zij

      invRij2 = Rij2/((Rij2 + EPS)**2)
      invRij6 = invRij2*invRij2*invRij2

      magForce = 6.0*invRij2*(2.0*invRij6 - 1.0)*invRij6

      fAix = fAix + Xij*magForce
      fAiy = fAiy + Yij*magForce
      fAiz = fAiz + Zij*magForce
   enddo
   fA(i       ) = fAix
   fA(i+  nBod) = fAiy
   fA(i+2*nBod) = fAiz
enddo
!$OMP END PARALLEL

end subroutine forces

! Integration of coordinates and velocities
subroutine integration(rA, vA, fA, nBod)
use omp_lib
implicit none
integer:: i
integer:: nBod
integer:: numProc
real(4), parameter:: dt = 0.01           ! Time step
real(4), parameter:: mass = 1.0          ! Mass of a body
real(4), parameter:: mdthalf = dt*0.5/mass
real(4):: rA(*), vA(*), fA(*)
common/ourCommonData/numProc

!$OMP PARALLEL NUM_THREADS(numProc) PRIVATE(i)
!$OMP DO
do i = 1, 3*nBod
   rA(i) = rA(i) + (vA(i) + fA(i)*mdthalf)*dt
   vA(i) = vA(i) + fA(i)*dt
enddo
!$OMP END PARALLEL

end subroutine integration


Fortran program with offload to Xeon Phi
!---------------------------------------------------------!
!   N-Body simulation benchmark                           !
!   written by MSOzhgibesov                               !
!   14 July 2015                                          !
!---------------------------------------------------------!
program nbody_XeonPhi
use omp_lib
implicit none

integer, parameter:: nI      = 32     ! Number of bodies in X, Y and Z directions
integer, parameter:: nBod    = nI**3  ! Total Number of bodies
integer, parameter:: maxIter = 20     ! Total number of iterations (time steps)
integer:: numProc
integer:: iter
character(len=50):: host

real(4), parameter:: initDist = 1.0   ! Initial distance between the bodies
real(4), allocatable:: rA(:)          ! Coordinates
real(4), allocatable:: vA(:)          ! Velocities
real(4), allocatable:: fA(:)          ! Forces
real(8):: startTime0, endTime0

common/ourCommonData/numProc

allocate(rA(3*nBod), vA(3*nBod), fA(3*nBod))

! Mark variable numProc as needing to be allocated
! on both the host and device
!DIR$ ATTRIBUTES OFFLOAD:mic::numProc, hostnm
!DIR$ OFFLOAD BEGIN TARGET(mic) OUT(host, numProc)
call hostnm(host)
numProc = omp_get_num_procs()
!DIR$ END OFFLOAD

write(*,'(A11,A50)')"Host name: ", host
write(*,'(A32,I4)')"Available number of processors: ", numProc

! Setup initial conditions
call initCoord(rA, vA, fA, initDist, nBod, nI)

! Mark routines integration and forces as needing both
! host and coprocessor versions
!DIR$ ATTRIBUTES OFFLOAD:mic::integration, forces
! Main loop
startTime0 = omp_get_wtime()
!DIR$ OFFLOAD BEGIN TARGET(mic) INOUT(rA,fA,vA:length(3*nBod))
do iter = 1, maxIter
   call forces(rA, fA, nBod)
   call integration(rA, vA, fA, nBod)
enddo
!DIR$ END OFFLOAD
endTime0 = omp_get_wtime()

write(*,'(A13,F10.4,A6)') "Total time = ", endTime0 - startTime0, " [sec]"

deallocate(rA, vA, fA)
end program nbody_XeonPhi

! Initial conditions
subroutine initCoord(rA, vA, fA, initDist, nBod, nI)
implicit none
integer:: i, j, k, ii
integer:: nI, nBod
real(4):: initDist
real(4):: Xi, Yi, Zi
real(4):: rA(*), fA(*), vA(*)

fA(1:3*nBod) = 0.E0
vA(1:3*nBod) = 0.E0

ii = 1
do i = 1, nI
   Xi = (i - 1)*initDist
   do j = 1, nI
      Yi = (j - 1)*initDist
      do k = 1, nI
         Zi = (k - 1)*initDist
         rA(ii       ) = Xi
         rA(ii+  nBod) = Yi
         rA(ii+2*nBod) = Zi
         ii = ii + 1
      enddo
   enddo
enddo

end subroutine initCoord

! Forces acting on each body
!DIR$ ATTRIBUTES OFFLOAD:mic:: forces
subroutine forces(rA, fA, nBod)
implicit none
integer:: i, j
integer:: nBod
integer:: numProc
real(4):: Xi, Yi, Zi
real(4):: Xij, Yij, Zij            ! X[j] - X[i] and so on
real(4):: Rij2                     ! Xij^2+Yij^2+Zij^2
real(4):: invRij2, invRij6         ! 1/rij^2; 1/rij^6
real(4):: rA(*), fA(*)
real(4):: magForce                 ! Force magnitude
real(4):: fAix, fAiy, fAiz
real(4), parameter:: EPS = 1.E-10  ! Small value to prevent 0/0 if i==j
common/ourCommonData/numProc

!$OMP PARALLEL NUM_THREADS(numProc) &
!$OMP PRIVATE(Xi, Yi, Zi, Xij, Yij, Zij, magForce, invRij2, invRij6, i, j)&
!$OMP PRIVATE(fAix, fAiy, fAiz)
!$OMP DO
do i = 1, nBod
   Xi = rA(i       )
   Yi = rA(i+  nBod)
   Zi = rA(i+2*nBod)
   fAix = 0.E0
   fAiy = 0.E0
   fAiz = 0.E0

   do j = 1, nBod
      Xij = rA(j       ) - Xi
      Yij = rA(j+  nBod) - Yi
      Zij = rA(j+2*nBod) - Zi

      Rij2 = Xij*Xij + Yij*Yij + Zij*Zij

      invRij2 = Rij2/((Rij2 + EPS)**2)
      invRij6 = invRij2*invRij2*invRij2

      magForce = 6.0*invRij2*(2.0*invRij6 - 1.0)*invRij6

      fAix = fAix + Xij*magForce
      fAiy = fAiy + Yij*magForce
      fAiz = fAiz + Zij*magForce
   enddo
   fA(i       ) = fAix
   fA(i+  nBod) = fAiy
   fA(i+2*nBod) = fAiz
enddo
!$OMP END PARALLEL

end subroutine forces

!DIR$ ATTRIBUTES OFFLOAD:mic::integration
subroutine integration(rA, vA, fA, nBod)
implicit none
integer:: i
integer:: nBod
integer:: numProc
real(4), parameter:: dt = 0.01           ! Time step
real(4), parameter:: mass = 1.0          ! Mass of a body
real(4), parameter:: mdthalf = dt*0.5/mass
real(4):: rA(*), vA(*), fA(*)
common/ourCommonData/numProc

!$OMP PARALLEL NUM_THREADS(numProc) PRIVATE(i)
!$OMP DO
do i = 1, 3*nBod
   rA(i) = rA(i) + (vA(i) + fA(i)*mdthalf)*dt
   vA(i) = vA(i) + fA(i)*dt
enddo
!$OMP END PARALLEL

end subroutine integration



Instead of conclusion


Working with "native" code is in some ways even simpler than with offload: you can use the program you already have for the CPU, and moreover the "native" program ran even faster than the offload version. At the same time, keep in mind that if the program depends on third-party libraries, they will have to be recompiled for the MIC or alternatives will have to be found. Also note that any files copied to the coprocessor are stored in RAM, of which there is not much.
In one of the comments to the previous article, the question of comparing the performance of Xeon Phi and CUDA GPUs was raised; on the one hand everything depends on the task, but on the other hand such a comparison is interesting. In the next article we will see which is faster, and also try to combine the efforts of both devices.

Source: https://habr.com/ru/post/263121/

