FPGA accelerators move to the cloud

The arrival on the market of FPGA accelerators that can be reprogrammed as many times as needed, and in a high-level language such as C, was a real breakthrough in high-performance computing. An equally important breakthrough is the ability to use FPGA technology without buying these very expensive adapters (prices in Russia start at 250 thousand rubles), simply by renting a dedicated server with an accelerator in a provider's cloud.

Introduction, or FPGA chips in 3 paragraphs

An FPGA (field-programmable gate array) is an integrated circuit (IC) that can be reconfigured to solve almost any complex computational problem. Industry needs specialized chips (ASICs, application-specific integrated circuits) for everything from spacecraft control to financial modelling. Before the advent of FPGAs, however, the strength and at the same time the weakness of specialized ICs was the fixed functionality built into the chip, along with high design complexity and the high cost of launching production. If the functionality later had to change even slightly, or errors crept in at the design stage, an essentially new IC had to be created.

FPGA accelerator with Intel Altera Arria 10 chip for PCI Express

FPGA accelerator with Intel Altera Arria 10 chip and 10GE ports

The arrival of FPGA accelerators, which can be reprogrammed as many times as needed and in a high-level C-like language, was a real breakthrough in high-performance computing. It shortened both development time and time to market. It also opened up entirely new opportunities for hardware developers, including those who design specialized integrated circuits such as ASICs.

FPGAs have already passed through two stages in terms of how accessible the technology is, and today they are entering a third. The first FPGAs appeared in 1985, but programming them required a low-level language comparable to assembler (a hardware description language, HDL). In the second stage, which began around 2013 thanks to the efforts of Altera, it became possible to program them in a high-level C-like language. This dramatically widened the applicability of FPGAs, but the high cost of the chips still limited the circle of customers who could afford the technology.

Traditionally, the FPGA design and verification flow is extremely time-consuming and requires a high degree of specialization; in complexity it approaches ASIC design. This limits the use of FPGAs by developers, especially in computational applications, where the people involved (the programmer, the mathematician, the algorithm designer) want to focus on their task rather than on its hardware implementation. To address this, in 2013 Altera introduced support in its FPGAs for OpenCL, the open standard for heterogeneous computing platforms. This opened the hardware to developers of computational applications who are unfamiliar with FPGA hardware, HDLs, and the design and verification flow. One problem remained: expensive hardware and design tools.

Finally, from around 2016 we can speak of a third stage, marked by the availability to a wide range of customers of ready-to-use servers (physical and virtual) with FPGA accelerators in the clouds of the largest providers: Amazon Web Services (AWS), Alibaba Cloud and Huawei Cloud. In Russia, dedicated servers with FPGA accelerators first became available in 2017, in the Selectel data center.

Why might you need FPGA accelerators? On the one hand, data streams keep growing; on the other, it is increasingly difficult to add computing power without increasing the size and power consumption of the system. As a rule, an application combines control tasks with resource-intensive data processing. It makes sense to leave the control tasks on the CPU and send the processing tasks to a specialized resource. The ability to reconfigure the device on the fly for a given task is also very useful. Synthesizing a computing resource on the FPGA for a specific task should bring gains in both performance and power consumption. In addition, an FPGA has fast internal memory and a well-developed (and reconfigurable) communication fabric that can implement almost any known I/O protocol, as well as hardware blocks such as cache memory, DSP blocks and memory controllers. In other words, it is a full system on a chip that can synthesize a specialized computing core for each task.

Basic differences between FPGAs, CPUs and GPUs

What types of accelerators are available today? The options are multi-core CPUs (such as Xeon), GPUs and FPGAs; we consider each below.

Each type of processor, whether general-purpose (CPU), graphics (GPU) or FPGA, has its own advantages; otherwise it simply would not be made. CPUs deliver good performance with the highest degree of versatility: about 99% of all existing programs are written to run on a CPU. GPUs have more cores and a vector architecture, with high-bandwidth memory access and I/O. FPGAs offer the highest performance per watt thanks to the nature of the hardware, but require very careful and time-consuming programming.

These differences in a little more detail:

General-purpose CPUs are essentially the workhorses of the IT industry. They can be used for a wide variety of tasks, but because of their architecture they are not as effective for parallel computing. In recent years this problem has been partly solved by placing multiple cores on the processor die. Even so, the most powerful CPUs still have only a few dozen cores.
Graphics processors (GPUs) were for many years used only to display information on screen. Only relatively recently have they been applied to high-performance computing, including cryptocurrency mining. Treating graphics as vector workloads shaped the GPU architecture so that it became well suited to parallel computing. As a result, a modern GPU can push vectorized data through its pipelines quickly, whereas on a CPU the same data would pass through many other logic blocks with a corresponding loss of performance. Modern GPUs contain several thousand cores per chip.
FPGAs, unlike CPUs and GPUs, can be reprogrammed to match the computational problem being solved on them. In effect, a specialized processor is synthesized for a specific task. Other important differences are lower power consumption per unit of computing power and an architecture that executes many vector operations simultaneously, the so-called massively parallel fine-grained architecture. The number of cores in an FPGA chip can reach one million or more.

An FPGA accelerator is, as a rule, a hardware board in one of several form factors (VPX, COM Express, PCIe, etc.) that, in addition to one or more FPGA chips, carries SRAM and DRAM, including ultra-fast HBM (high-bandwidth memory), and high-speed I/O interfaces such as the popular 10/40/100 GE and PCI Express. FPGA accelerators are also available in the SOM form factor (system on a module, a single-board computer) for embedded systems, which is popular in video analytics and industrial applications.

FPGA Accelerator in SOM Form Factor

Each FPGA chip contains an array of up to 5 million logic elements (lookup tables and flip-flops) that can be reprogrammed for different functions. In addition, there are fixed hardware resources: cache memory, signal processors, digital processing units and interface units.

Why can an FPGA outperform an ASIC? The answer is simple: a more advanced manufacturing process. FPGAs are produced on 20 nm and even 14 nm processes, while ASIC dies are made on older processes of around 60 nm. As a result, the same die area holds several times more logic cells in an FPGA than in an ASIC, which yields a performance gain.
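The area argument above can be checked with a rough calculation. The sketch below assumes an idealized rule in which the number of elements per unit area grows with the inverse square of the process node; real processes only approximate this, and FPGA routing overhead makes the practical gain smaller. The function name is illustrative.

```python
def ideal_density_gain(old_node_nm: float, new_node_nm: float) -> float:
    """Idealized gain in elements per unit area when the node shrinks.

    Assumption: density scales as 1 / (node size)^2. This is an upper-bound
    style estimate, not a property of any specific chip.
    """
    return (old_node_nm / new_node_nm) ** 2


# Node sizes are the ones quoted in the text: 60 nm versus 20 nm and 14 nm.
for fpga_node_nm in (20.0, 14.0):
    gain = ideal_density_gain(60.0, fpga_node_nm)
    print(f"60 nm -> {fpga_node_nm:.0f} nm: up to {gain:.1f}x elements per unit area")
```

Under this assumption the gain is a single-digit to low double-digit factor, which is consistent with the "several times more" wording in the text.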

FPGA Applications

From its invention to the present day, one of the basic applications of the FPGA has been the prototyping of chips, along with use in small- and medium-volume products for which manufacturing an ASIC is not economically feasible.

At the beginning of 2018, according to the Russian company Almaz-SP, FPGA accelerator applications broke down as follows:

50% - special applications in military electronics,
20% - telecommunications (GSM base station equipment, etc.),
10% - processing video streams (video studios, video analytics),
10% - industrial use,
10% - prototyping, etc. (including scientific calculations).

However, despite the predominantly military use in the past, civilian use of FPGA accelerators is now growing much faster. In 2015 Intel acquired Altera, one of the largest FPGA manufacturers, and Altera's designs are now produced under the Intel brand. A new line of FPGA chips, Intel Cyclone 10, followed soon after. The Cyclone 10 GX models deliver very high performance (up to 134 GFLOPS) and have advanced I/O capabilities: they connect to other devices through a 10GE network port or a PCI Express x4 bus. These chips are designed for machine vision, surveillance, video broadcasting and robotics. The lower-end Cyclone 10 LP model is intended as a computing core for engineering systems such as sensor control and motor controllers.

In addition to the Cyclone line, Intel's product range includes other FPGA series inherited from Altera: MAX, Arria and Stratix. The last two are the most powerful FPGA chips on the market; in 2018 they are expected to be updated to Arria 10 and Stratix 10. Stratix 10 will be built on the HyperFlex architecture and deliver 10 teraflops, roughly 75 times (almost 2 orders of magnitude) more than Cyclone 10.
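As a quick check of how far apart the two performance figures quoted in the text are, the ratio can be computed directly. This is a minimal sketch that uses only the 134 GFLOPS and 10 teraflops numbers from the article; the constant names are illustrative.

```python
import math

# Peak figures as quoted in the article (not measured here).
CYCLONE_10_GX_GFLOPS = 134.0
STRATIX_10_GFLOPS = 10_000.0  # 10 teraflops expressed in gigaflops

ratio = STRATIX_10_GFLOPS / CYCLONE_10_GX_GFLOPS
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio comes out at roughly 75, which is a little under two orders of magnitude.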

The Cyclone, MAX, Arria and Stratix series partially overlap in performance, but Intel positions each series separately: Arria as signal processors for instrumentation, Stratix for high-performance computing in data centers and telecommunications. The application areas of the Cyclone series, the only one updated in 2017, were covered above, but one more is worth mentioning: the Internet of Things (IoT).

More than 50% of FPGA accelerator applications are in military and industrial electronics, but the share of civilian tasks and scientific computing is growing rapidly.

The concept of an image in FPGA technology

The Intel FPGA chip series listed above are popular today, but to use them in servers you need to purchase FPGA accelerator cards and program the chip logic for a specific application. Accelerator boards are available from Intel partners in the FPGA Design Solutions Network. In Russia, one such partner is Almaz-SP LLC (which also takes part in the Euler Project); it supplies both original Intel adapters and boards of its own design with latest-generation FPGA chips.

Demonstration of a server with an FPGA accelerator at the SelectelTechDay # 2 conference, in the center - Anton Visto, representative of Almaz-SP LLC

Demo zone of hardware innovations on SelectelTechDay # 2. First left - FPGA server from Almaz-SP

If you want to abstract away from the hardware design flow and focus on the computational problem, you can use OpenCL and the Intel FPGA SDK for OpenCL. This requires a board support package (BSP), which hides the complexity of building a system on a chip (memory controllers, PCIe, interfaces, clock domains, timing constraints, partial reconfiguration, etc.). Such a package is provided if OpenCL BSP support is declared for the card. With it you get a software development environment that includes a platform model, a function to be accelerated (the kernel), a runtime support library, a memory model and special extensions for increasing throughput. From there you move on to writing code, profiling and optimization.
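The split between a host program and a function handed over for acceleration can be illustrated without any FPGA hardware. The sketch below is a plain-Python simulation of the idea, not the OpenCL API: `run_nd_range` is a hypothetical stand-in for the runtime, and the `gid` argument plays the role that `get_global_id(0)` plays inside a real OpenCL C kernel.

```python
from typing import Callable, List


def vector_add_kernel(gid: int, a: List[float], b: List[float], out: List[float]) -> None:
    """Work done by one work-item: add a single pair of elements."""
    out[gid] = a[gid] + b[gid]


def run_nd_range(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical stand-in for the runtime: call the kernel once per work-item.

    A real device would execute the work-items in parallel; this simulation
    simply loops over them in order.
    """
    for gid in range(global_size):
        kernel(gid, *args)


# Host side: prepare buffers, launch the kernel, read back the result.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
run_nd_range(vector_add_kernel, len(a), a, b, out)
print(out)
```

The kernel only ever sees its own index and the shared buffers. That restriction is what lets a compiler map it onto parallel hardware.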

Using the SDK and BSP produces a single configuration file (bitstream) that configures the FPGA, turning it into a complete system on a chip for a specific computing task. The result of this programming is firmware that solves a specific applied problem (for example, solving a matrix of equations or converting video formats). This firmware is called an FPGA image. Quite often the term “IP core” is used instead of “image”.

An FPGA image is the controlling firmware for an FPGA chip, designed and debugged to perform a specialized computational task.

Why FPGA technology is hard for customers to access

Despite the attractive concept of “the highest performance for a specific computational problem”, two objective factors hold back the widespread use of FPGAs: the high cost of an adapter with an FPGA chip, and a shortage of developers with practical experience in programming and debugging FPGA kernels.

In addition to the accelerator, you also need a license for the Intel FPGA SDK for OpenCL; without it you can run already compiled kernels but cannot compile them. The requirements for the developer's computer are also high: the recommendation is 18–48 GB of RAM. On a machine with an 8-core CPU and 32 GB of memory, compiling a kernel that computes the Mandelbrot set takes about 2 hours. If processor utilization exceeds 90%, compilation can take a day or more, and with less than 16 GB of memory it may not be feasible at all.
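For readers unfamiliar with the Mandelbrot example, the computation itself is small; it is the hardware compilation that is slow. The sketch below shows the standard escape-time iteration in plain Python. It is not the kernel shipped with any SDK: the function names, the image size and the iteration limit are all illustrative. In an OpenCL version, each pixel would typically be handled by its own work-item.

```python
def escape_iterations(c: complex, max_iter: int) -> int:
    """Count iterations of z -> z*z + c until |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def render(width: int, height: int, max_iter: int = 50) -> list:
    """Evaluate every pixel of the region [-2, 1] x [-1, 1] independently."""
    rows = []
    for py in range(height):
        row = []
        for px in range(width):
            c = complex(-2.0 + 3.0 * px / (width - 1),
                        -1.0 + 2.0 * py / (height - 1))
            row.append(escape_iterations(c, max_iter))
        rows.append(row)
    return rows


image = render(31, 11)
for row in image:
    # '#' marks points that never escaped within the iteration limit.
    print("".join("#" if n == 50 else "." for n in row))
```

Every pixel is computed independently of the others, which is why this problem is a common first demonstration for parallel accelerators.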

As a result, potential customers are actively interested in the technology but are in no hurry to buy FPGA accelerators. They are mainly concerned that the cost of the accelerators will weigh heavily on their IT budget and that their in-house team will not be able to program and debug FPGA images properly.

FPGA Cloud Computing

FPGA cloud services appeared as a response to the high cost of accelerator boards. Clients are offered physical and/or virtual servers for rent with FPGA accelerators already installed. As a rule, this is a joint product of a manufacturer (for example, Intel) and a data center acting as the IaaS provider.

An FPGA server with an accelerator from Almaz-SP can be tested for free in the Selectel data center

One way to make the technology available for mass use is to rent FPGA-based computing power. At Selectel, the service provides access to a server with a Euler-line accelerator, made by the Euler Project and based on the Intel Arria 10 FPGA. The server comes with the SDK and BSP needed to develop, debug and compile OpenCL kernels, plus development tools for writing host applications (Visual Studio). As an introductory demonstration we offer the Mandelbrot set example mentioned earlier: the project is provided as source code and is configured for compilation.

The Euler Project runs a training course, open to everyone, on OpenCL programming for FPGAs. The course is designed specifically for a Russian audience: engineers, researchers and students of technical universities. It incorporates material from official Intel training and lets you study the technology step by step, from building the simplest application to applying specific optimization methods that are sometimes essential for good performance.

In this form FPGA technology becomes more attractive to customers: they no longer need to buy the hardware, and capital expenditure is replaced by operating expenditure. Accordingly, the range of companies that can afford FPGA-accelerated computing for their projects expands significantly.

The cloud model of using servers with FPGA accelerators opens the technology to many new customers who want to try out how it works on their own projects and computing tasks.

FPGA image store concept

Creating an efficient FPGA image for a specific application is a time-consuming task. Even a well-coordinated team can need up to a couple of months to program an image, and less experienced clients will spend much longer or may fail altogether.

The concept of an image store therefore suggests itself, by analogy with existing app stores for platforms such as macOS, Windows or Android. Developers could submit working images they have created for various tasks, and clients could purchase them and load them onto their servers with FPGA accelerators if the images match the computational tasks in their projects.

In 2018 Selectel began work on such a store of FPGA images for use on servers leased from Selectel. This would significantly shorten the development cycle of new projects for clients, while the developers (the teams that authored the images) would earn income from work already done and be protected from unauthorized distribution of their images.

Useful link:

Free testing of a dedicated server with an FPGA adapter at Selectel Labs

Source: https://habr.com/ru/post/352174/
