How we built a fast event stream processing device on an FPGA

The device is called CEPappliance: CEP stands for Complex Event Processing, and appliance is simply the English word for a device.

We started it back in 2010 as a hobby, working on it after our day jobs, through long evenings that gradually turned into short nights, and on weekends. Over five years of this work we built three prototypes in search of solutions with minimal latency and a simple programming model for the data processing logic.

In 2015, we realized that we had built something decent: it could process data streams with a guaranteed delay of 2-3 microseconds. We began looking for ways to turn this work into a commercial product and, eventually, to stop working for someone else and devote all our time to our own product. At the end of 2015, we found our first client, quit our jobs and set out on our own.
Today we can say with confidence that the device is a success. We have not yet implemented everything we planned, and there is still a lot of work ahead to add new functionality and, occasionally, to fix bugs. But the device has been in commercial operation for a year.

While working for an employer, we got to know the technical side and the needs of trading financial instruments on exchanges well, and those needs were our primary guide: automated trading (HFT, algo trading), pre-trade risk control, direct market access (DMA) and so on.

But we managed to make CEPappliance a fairly universal device, applicable wherever a lot of data has to be pumped through not just quickly but with guaranteed low latency. With built-in support for standard network protocols and minimal added delay, the device can be used in telecommunications to detect security breaches in networks and to manage network load. It can be used in telematics when a decision has to be made within a few microseconds in reaction to signals arriving from sensors. The data processing logic in all these cases can be complex. To describe (program) it, we use some techniques from Complex Event Processing (CEP).

CEPappliance was conceived and built to solve problems that, in simplified form, can be stated as follows. With a total delay of less than 3 microseconds:

receive input data (a signal) via the network interface in Ethernet, TCP/IP, UDP, FIX, FAST, TWIME (FIX SBE) and other formats;
parse it and extract the user data;
analyze the user data;
generate the output (reaction) and send it via the network interface.
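
To make the four steps concrete, here is a minimal software sketch in Python. It only illustrates the flow: on the device these steps run in FPGA hardware, and the rule, the function names and the sample message below are assumptions made for this example, not part of CEPappliance. The input is a FIX-style tag=value message whose fields are separated by the SOH byte.

```python
from typing import Optional

SOH = "\x01"  # standard FIX field delimiter

def parse_fix(raw: str) -> dict:
    """Step 2: split a tag=value message into a {tag: value} mapping."""
    fields = {}
    for part in raw.strip(SOH).split(SOH):
        tag, _, value = part.partition("=")
        fields[int(tag)] = value
    return fields

def analyze(fields: dict, price_limit: float) -> bool:
    """Step 3: hypothetical rule, react to new orders priced above a limit."""
    is_new_order = fields.get(35) == "D"   # tag 35 = MsgType, "D" = New Order Single
    price = float(fields.get(44, "0"))     # tag 44 = Price
    return is_new_order and price > price_limit

def react(fields: dict) -> str:
    """Step 4: build a hypothetical output message (not a real FIX reply)."""
    return f"ALERT symbol={fields.get(55)} price={fields.get(44)}"  # tag 55 = Symbol

def process(raw: str, price_limit: float = 100.0) -> Optional[str]:
    """Steps 2-4 for one message that step 1 has already received."""
    fields = parse_fix(raw)
    return react(fields) if analyze(fields, price_limit) else None

# Sample message with made-up values: MsgType, Symbol, Price, OrderQty.
message = SOH.join(["35=D", "55=SBER", "44=101.5", "38=10"]) + SOH
print(process(message))  # prints "ALERT symbol=SBER price=101.5"
```

In software each of these steps costs CPU time and memory traffic; the point of the device is that the same chain is laid out in hardware on one chip.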

CEPappliance differs from software solutions running on CPU-based architectures in that the core of the device is a field-programmable gate array (FPGA), on which every step of the tasks described above is implemented.

CPU-based architectures are evolving. Hybrid variants are appearing (see Fig. 1, Fig. 2 and Fig. 3) in which the time to deliver data from the network interfaces to the processor (and back) is reduced by moving the processing of network and application protocols from the central processor to the network cards. Even so, delivering the data still takes 1-3 microseconds one way, which is a tangible contribution to the delay between the arrival of a signal and the response to it ¹.
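
A quick back-of-the-envelope calculation shows why this matters. The sketch below takes the one-way delivery figure of 1-3 microseconds from the text and compares the resulting round trip with the total budget of 3 microseconds stated earlier. The variable names are ours, and the processing time on the CPU itself is deliberately left out.

```python
# One-way delivery time between network interface and CPU, in microseconds (from the text).
ONE_WAY_US = (1.0, 3.0)
# Total latency budget the device targets, in microseconds (from the text).
BUDGET_US = 3.0

def round_trip(one_way_us: float) -> float:
    """Data travels NIC -> CPU and back CPU -> NIC, so the delivery cost is paid twice."""
    return 2 * one_way_us

for one_way in ONE_WAY_US:
    spent = round_trip(one_way)
    left = BUDGET_US - spent
    print(f"one way {one_way} us -> delivery {spent} us, left for processing {left} us")
```

Even in the best case, delivery alone consumes two thirds of the budget; in the worst case it is twice the whole budget.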

On the FPGA, we placed the components for parsing, extracting and analyzing input data and for generating output data on a single chip, figuratively speaking without the “intermediaries” (see Fig. 4) that are unavoidable in CPU-based solutions.

Fig. 1. Logic diagram of a traditional solution with a central processor

Fig. 2. Logic diagram of a hybrid solution with a central processor and a TCP Offload Engine on the network card

Fig. 3. Logic diagram of a hybrid solution with a central processor, a TCP Offload Engine and application layer protocols implemented in the network card

Fig. 4. Logic diagram of CEPappliance

In CEPappliance, components for parsing, extracting, analyzing input data and generating output data are located on an FPGA chip and directly interact with each other.

To do this, we had to reinvent the wheel. As a reminder, we started working on CEPappliance back in 2010 as a hobby, and we did everything ourselves, the way we believed it should be done. As a result, among other things, we implemented Ethernet, TCP/IP, UDP, FIX, FAST and TWIME from scratch.

We managed to build these components so that input data is parsed as fast as it arrives (at wire speed). The components implement the relevant standards, which are “carved in stone” and rarely change. For the standard protocols, we provided a configuration mechanism. For example, the modules for FIX, FAST, TWIME and other protocols are configured with user-defined parameters and with templates or schemas that describe the structure of messages.
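
The idea of configuring a protocol module with a schema that describes message structure can be illustrated in software. The sketch below is a hypothetical Python example: the schema format, the field names and the message layout are invented for illustration and are neither the CEPappliance configuration format nor a real TWIME template. It decodes a fixed-layout binary message using nothing but a field list.

```python
import struct

# Hypothetical schema: (field name, struct format code) pairs in wire order.
ORDER_SCHEMA = [
    ("order_id", "Q"),   # unsigned 64-bit
    ("price", "q"),      # signed 64-bit, price in minimal price steps
    ("quantity", "I"),   # unsigned 32-bit
    ("side", "B"),       # unsigned 8-bit: 1 = buy, 2 = sell
]

def compile_schema(schema):
    """Turn the field list into a reusable little-endian struct.Struct plus field names."""
    layout = struct.Struct("<" + "".join(code for _, code in schema))
    names = [name for name, _ in schema]
    return layout, names

def decode(payload: bytes, schema) -> dict:
    """Decode one message according to the schema; no per-message code is written."""
    layout, names = compile_schema(schema)
    return dict(zip(names, layout.unpack(payload[:layout.size])))

# Build a sample message with the same schema, then decode it back.
layout, _ = compile_schema(ORDER_SCHEMA)
sample = layout.pack(42, 101500, 10, 1)
print(decode(sample, ORDER_SCHEMA))
```

Supporting a new message type then means adding a new field list rather than changing the decoder.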

At the same time, we assumed that the user-defined data processing algorithms can change. For example, trading strategies, or the checks a broker performs to minimize risk (pre-trade risk checks), follow changes in the market situation, upgrades to the exchange's micro-architecture and the requirements of regulators.
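
As an illustration of the kind of user logic that changes over time, here is a minimal sketch of a pre-trade risk check in Python. The limits, the field names and the violation labels are assumptions chosen for the example. Real checks depend on the broker and the regulator, and on the device they would be written in the CEPappliance language rather than in Python.

```python
from dataclasses import dataclass

@dataclass
class Order:
    symbol: str
    price: float
    quantity: int

@dataclass
class Limits:
    max_quantity: int      # largest allowed order size
    max_notional: float    # largest allowed price * quantity
    price_band: float      # allowed relative deviation from the reference price

def pre_trade_check(order: Order, limits: Limits, reference_price: float) -> list:
    """Return the names of violated limits; an empty list means the order may pass."""
    violations = []
    if order.quantity > limits.max_quantity:
        violations.append("quantity")
    if order.price * order.quantity > limits.max_notional:
        violations.append("notional")
    if abs(order.price - reference_price) > limits.price_band * reference_price:
        violations.append("price_band")
    return violations

limits = Limits(max_quantity=1000, max_notional=50_000.0, price_band=0.05)
print(pre_trade_check(Order("ABC", 100.0, 100), limits, reference_price=101.0))
print(pre_trade_check(Order("ABC", 120.0, 2000), limits, reference_price=101.0))
```

When requirements change, either the limit values change or a check is added or removed, which is why this part of the logic has to stay programmable by the user.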

Developing algorithms for an FPGA directly in hardware description languages (VHDL, Verilog, etc.) takes significantly more time for coding, debugging and testing than development in high-level languages [2]. It also requires special skills that programmers working in high-level languages usually do not have. And if you plan to use an FPGA to speed up your algorithms, you will have to hand a detailed description of the algorithm to the FPGA developer who will implement it. Sometimes this is highly undesirable, because handing over the description exposes the algorithm's owner to the risk of losing a competitive advantage.

Our device lets users describe the data processing algorithm themselves. For this we have developed:

a high-level algorithmic language;
a processor with an original architecture;
an optimizing compiler that translates programs from the high-level language into processor code and can automatically parallelize program execution across several processors running at the same time.
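
One common way for a compiler to find work for several processors is to group operations by their data dependencies: operations in the same group do not depend on each other and can run at the same time. The Python sketch below shows this general technique on a made-up program. It is not a description of how the CEPappliance compiler works, and the operation names are hypothetical.

```python
def parallel_levels(deps: dict) -> list:
    """Group operations into levels; each level depends only on earlier levels.

    deps maps an operation name to the set of operations whose results it needs.
    """
    remaining = {op: set(needs) for op, needs in deps.items()}
    levels = []
    while remaining:
        # Operations with no unsatisfied dependencies can all run in parallel.
        ready = sorted(op for op, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("cyclic dependency between operations")
        levels.append(ready)
        for op in ready:
            del remaining[op]
        for needs in remaining.values():
            needs.difference_update(ready)
    return levels

# Hypothetical program: each operation lists the operations it depends on.
program = {
    "parse_price": set(),
    "parse_quantity": set(),
    "notional": {"parse_price", "parse_quantity"},
    "check_quantity": {"parse_quantity"},
    "decision": {"notional", "check_quantity"},
}
for step, ops in enumerate(parallel_levels(program), start=1):
    print(step, ops)
```

Here five operations fit into three steps, so with two processors the program finishes in three time slots instead of five.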

Having our own programming language, processor and compiler also lets us implement functions directly in FPGA hardware and make them available to the user's programs. Such a function can be a part of the algorithm or the entire algorithm; that depends on whether such an implementation makes sense and on the user's wishes and capabilities. In some cases this approach can significantly speed up program execution on CEPappliance.

Having given users the ability to program CEPappliance themselves, we obviously had to provide tools for debugging those programs. Without such tools, it would be difficult to take full advantage of CEPappliance. So we developed a device emulator that is 100% compatible with the device itself. Once a program has been debugged on the emulator, you can change the configuration (in most cases this means changing the IP address) and immediately run the program on the device.

Besides debugging, the emulator lets you estimate how long the program will take to execute on the device itself. Using these delay estimates, you can optimize the program.

For automated testing of user programs written for CEPappliance, we have a dedicated tool, Test Bench, which reads test scripts in tabular form and executes them. The same set of tests can be run against both the device and its emulator.
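
The idea of table-driven tests that run against interchangeable targets can be sketched as follows. This is a hypothetical Python illustration: the table columns, the `run_table` function and the stand-in target are our assumptions and do not reflect the actual Test Bench script format.

```python
import csv
import io
from typing import Callable

# Hypothetical test table: one row per case with an input message and the expected output.
TEST_TABLE = """name,input,expected
small order,qty=10,ACCEPT
large order,qty=5000,REJECT
"""

def fake_target(message: str) -> str:
    """Stand-in for a device or emulator connection: rejects quantities above 1000."""
    quantity = int(message.split("=")[1])
    return "ACCEPT" if quantity <= 1000 else "REJECT"

def run_table(table: str, target: Callable[[str], str]) -> list:
    """Run every row against the target and report (case name, passed) pairs."""
    results = []
    for row in csv.DictReader(io.StringIO(table)):
        results.append((row["name"], target(row["input"]) == row["expected"]))
    return results

# The same table can be run against any other callable with the same signature.
print(run_table(TEST_TABLE, fake_target))
```

Because the runner only sees a callable, swapping the emulator for the device does not require touching the test table.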

To sum up where we are: our boards are installed in the data center of the Moscow Stock Exchange and are trading successfully. We cannot talk about the trading results, since those are not ours to share, but the client is very pleased (and this text has been approved by the client).

Ahead of us lie a lot of work on developing the device, a search for customers in areas beyond exchange trading, and many new ideas!

¹ How this delay arises in TCP/IP data exchange is described in [1]. How it can be reduced by implementing a hybrid architecture with an FPGA is described here.

References

1. S. Larsen and P. Sarangam, “Architectural Breakdown of the End-to-End Latency in a TCP/IP Network,” International Journal of Parallel Programming, Springer, 2009.
2. D. F. Bacon, R. Rabbah, and S. Shukla, “FPGA Programming for the Masses,” ACM Queue, Vol. 11 (2), February 2013.

Source: https://habr.com/ru/post/334574/
