📜 ⬆️ ⬇️

FPGA Accelerator Programming Example



Not so long ago, we talked about the new Selectel service - cloud high-performance computing on FPGA accelerators . In a new article on this topic, we consider an example of programming FPGA to build a Mandelbrot set, a well-known mathematical algorithm for visualizing fractal images. The article uses material from the Euler Project website.



Instead of the preface


First, a few terms. A computer system with an FPGA accelerator - as a rule, it is a PCIe adapter with an FPGA chip comprising x64 server. The accelerator takes on a separate resource-intensive task in which parallel computations can be used and performs it many orders of magnitude faster than the x64 processor, unloading it and increasing the performance of the entire computing system. For example, a calculation cycle with 100 thousand repetitions can be performed on an FPGA in just one pass instead of sequentially performing 100 thousand times on a classic x64 processor. Logic elements, hardware resources, communication links of the FPGA chip are programmed by the user directly under the task itself, which allows to realize the task as implementation of the algorithm in silicon - Algorithm in Silicon and thereby achieve high performance, and with very modest power consumption.
')
Today, the threshold for entering FPGA technology is quite affordable even for startups - a server with an FPGA accelerator and all necessary software (SDK) can be rented in the Selectel cloud for reasonable money (the so-called “cloud FPGA”), and the support of the Open CL standard in FPGA leads to that a programmer who knows how to work with the C language is able to prepare and run a program on the FPGA.

Looking ahead: try FPGA at work


The programming example described below for building the Mandelbrot set is already implemented on a test server in the Selectel Lab , where anyone can rate it (registration will be required).


The project is provided in code and prepared for compilation. Selectel offers remote access to the server with the Intel Arria 10 FPGA accelerator. On the server side, SDK and BSP tools are deployed for developing, debugging and compiling OpenCL code, Visual Studio for preparing host applications (control applications for the server’s central processor).
Note that the example itself does not have any practical significance; it is chosen for reasons of a visual demonstration of acceleration methods using the principles of parallelism. In this example, the reader becomes familiar with the route of designing an application in a heterogeneous computing system with FPGA, - later this route can be used to develop your own applications with parallel computing.
UPDATE : In the spring of 2018, Intel introduced the high-performance hybrid Xeon Gold 6138P processor with an integrated Arria 10 FPGA chip. It is expected that by the end of 2018 serial processors of this type will be available to customers through partners Intel. We at Selectel are looking forward to this chip, and we hope that we will be the first in Russia to provide our customers with the opportunity to test this unique novelty.

About the OpenCL standard for FPGA programming


The OpenCL standard was developed by the Khronos Group - the world's leading chip and software makers such as Intel, AMD, Apple, ARM, Nvidia, Sony Computer Entertainment, and others. It is intended for writing applications that use parallel computing on various types of processors, including FPGAs. The OpenCL standard includes the C programming language based on the C99 version of the language (the latest C99 version is ISO / IEC 9899: 1999 / Cor 3: 2007 from 2007-11-15) and the application programming environment.

The popularity of using OpenCL for programming high-performance computing is based on the fact that it is an open standard and does not require a license to use it. Moreover, OpenCL does not limit the range of supported devices to any particular brand, allowing the use of hardware from different manufacturers on one software platform.

In addition about OpenCL: Introduction to OpenCL on Habr .


A bit of history - the FPGA design route, which existed before the OpenCL standard, was extremely specific and time-consuming, while even exceeding the design complexity of custom-made microchips (ASIC, application-specific integrated circuit, “special-purpose integrated circuit”). It required a rigorous understanding of the hardware structure of the FPGA, the configuration of which had to be carried out in a low-level hardware description language (HDL). Possession of this design and verification route has been and remains an art that, due to its extreme complexity, is available to a limited circle of developers.

The emergence of the OpenCL support toolkit for FPGA from Intel has partially resolved the problem of the availability of FPGA programming for software developers. The programmer independently selects that part of his algorithm that is suitable for processing by the method of parallel computing and describes it in C, then the OpenCL compiler for FPGA from Intel creates a binary configuration file for running this fragment of the algorithm on the accelerator.
Using the familiar Visual Studio environment or the standard gcc compiler, a host application is prepared (an .exe type application running on the x64 main processor), with all the necessary support libraries included in the SDK. When the host application starts, the FPGA firmware is loaded, the data is loaded into the chip's core, and processing starts according to the designed algorithm.

The FPGA (FPGA) chip is a user-reproducible, massively parallel hardware structure with millions of logic elements, thousands of DSP signal blocks and tens of megabytes of cache memory for onboard calculations without accessing the server’s main memory modules. Fast I / O interfaces (10GE, 40GE, 100GE, PCIe Gen 3, etc.) allow you to effectively communicate with the main processor of the server.


The OpenCL standard is a medium for running heterogeneous software. The environment consists of two separate parts:

  1. Host software is an application running on the main CPU of the server, written in C / C ++ and using the OpenCL API feature set. The server of the host organizes the whole process of calculations, the supply of the initial and the output data, and the interaction of all the server systems with the FPGA accelerator.
  2. Accelerator software is a program written in the OpenCL C language (C language with a number of restrictions), compiled for execution on an FPGA chip.

A typical server for parallel computing is an x64-based computer (for running host applications), which has a hardware FPGA accelerator, most often connected via a PCI-Express bus. By the way, such a system is presented in the Selectel Lab.

The sequence of programming and compiling code for an FPGA accelerator consists of two stages. The host application code is compiled with a standard compiler (Visual C ++, GCC) to obtain an executable file in the server’s operating system (for example, * .exe). The source code of the FPGA accelerator (kernel, kernel) is prepared by the AOC compiler as part of the SDK, to obtain a binary file (* .aocx). This file is just designed for programming accelerator.

Architecture of compilation environment for OpenCL program
Fig. Architecture of the compilation environment of the program on OpenCL

Consider some sample code for calculating a large vector in two variants.
( PS Do not shoot the pianist - hereinafter the code from the Euler Project site is used ):

void inc (float *a, float c, int N) { for (int i = 0; i<N; i++) a[i] = a[i] + c; } void main() { ... inc(a,c,N); ... } 

 _kernel void inc (_global float *a, float c) { int i = get_global_id(0); a[i] = a[i] + c; } void main() { ... clEnqueueNDRangeKernel(...,&N,...) ... } 

The code at the beginning is an example of how a single-threaded C implementation can look like using the method of sequential calculation of scalar elements.

The second version of the code is a possible implementation of the algorithm on OpenCL as a function calculated on an FPGA accelerator. There is no loop, and the calculation takes place in one iteration of the loop. The calculation of the vector array occurs as the execution of N copies of this function. Each copy has its own index, which is substituted into the iterator in a loop, and the number of repetitions is set from the host when the code is executed. The iterator action is provided by the get_global_id () function, which works with an index within 0 ≤ index <N.

Get to the point: building a fractal


The Mandelbrot set is an array of “c” points on the complex plane, for which the recurrence relation Zn + 1 = Zn² + c for Z0 = 0 defines a bounded sequence.

We define Zn = Zn + IYn, and also with = p + iq.
For each point, the following sequence is calculated:

Xn + 1 = Xn² + Yn² + p
Yn + 1 = 2XnYn + q


The calculation of the point belonging to the set at each iteration is performed as an equation
Xn² + Yn² <4.

To display the Mandelbrot set on the screen, we define the rule:

  1. If the inequality is satisfied at any iterations, then the point enters the set and will be shown in black.
  2. If the inequality is not satisfied, starting with a certain value of iterations n = N, then the color is determined by the number of iterations N.

The calculation process on the host will be as follows:


Let's go to the code:

 inline unsigned int mandel_pixel( double x0, double y0, unsigned int maxIterations ) { // variables for the calculation double x = 0.0; double y = 0.0; double xSqr = 0.0; double ySqr = 0.0; unsigned int iterations = 0; // perform up to the maximum number of iterations to solve // the current point in the image while ( xSqr + ySqr < 4.0 &&iterations < maxIterations ) { // perform the current iteration xSqr = x*x; ySqr = y*y; y = 2*x*y + y0; x = xSqr - ySqr + x0; // increment iteration count iterations++; } // return the iteration count return iterations; } 

 int softwareCalculateFrame( double aStartX, double aStartY, double aScale, unsigned int* aFrameBuffer ) { // temporary pointer and index variables unsigned int * fb_ptr = aFrameBuffer; unsigned int j, k, pixel; // window position variables double x = aStartX; double y = aStartY; double cur_x, cur_y; double cur_step_size = aScale; // for each pixel in the y dimension window for ( j = 0, cur_y = y; j < theHeight; j++, cur_y -= cur_step_size ) { // for each pixel in the x dimension of the window for ( cur_x = x, k = 0; k< theWidth; k++, cur_x += cur_step_size ) { // set the value of the pixel in the window pixel = mandel_pixel(cur_x, cur_y, theSoftColorTableSize); if ( pixel == theSoftColorTableSize ) *fb_ptr++ = 0x0; else *fb_ptr++ = theSoftColorTable[pixel]; } } return 0; } 

Each pixel is calculated independently of the other, and therefore it is possible to parallelize this process. When the algorithm is implemented, a SIMD instruction is created for the FPGA accelerator to calculate the number for each pixel of iterations (determining the color code on the palette). The implementation of two nested loops on the image buffer is framed via OpenCL by running the operation (theWidth * theHeight).

The kernel instances in the listing below are called the work-item, and the set of all instances is called the index space. The features of the hardware function include the following:

Source: https://habr.com/ru/post/418403/


All Articles