
Using FPGAs and a DSL to speed up high-frequency trading





Researchers from Heidelberg University in Germany have published a paper on using a domain-specific language (DSL) and FPGAs to build a high-speed trading system. Below is a summary of the main ideas of this work.



In modern financial markets, a significant share of all operations is performed by automated trading systems: as early as 2010, more than half of all transactions on American exchanges were executed by algorithms. Some of these algorithms are profitable only when their actions are performed faster than those of other traders: if the system gets there first, it earns; if a competing algorithm beats it, the opportunity is gone. This kind of trading is called high-frequency trading (HFT).


Special protocols are used to transfer financial information such as stock quotes and current buy and sell orders (we have written a series of articles about them). The international FIX and FAST protocols are often used for this purpose. Accordingly, to improve the performance of an HFT system, the developer needs to optimize the decoding of FAST market data as well as the lower-level protocols.



To solve this problem, special accelerators are often used that offload Ethernet, IP and UDP processing from software to hardware, which yields a serious performance gain. This article discusses the process of building such an accelerator engine using a domain-specific language (DSL).



Trading Infrastructure



As a rule, three main entities are involved in electronic trading: the exchange itself, the handlers that process its data streams, and the market participants, that is, the traders. The exchange receives buy and sell orders through so-called gateway servers. The exchange's matching engine then tries to match buyers with sellers. If a trade is executed, the participants are notified. While an order is still waiting in the queue, it can be canceled.



Feed handlers receive information about current prices and the queue of buy and sell orders. This information is published by the exchange and must be delivered to interested participants as quickly as possible; the FAST protocol is often used for transport. The trading infrastructure described above can be pictured as the following diagram:







Protocol stack



Modern electronic trading systems use protocols at several levels.



TCP/IP, UDP and Ethernet


These are the basic communication protocols used by exchanges and other market participants. Non-critical information, such as price data, is transmitted over UDP for speed; critical information about trade orders goes over TCP/IP. Accordingly, any trading system must handle these protocols efficiently. In the implementation described here, UDP processing is moved directly into hardware.



FAST


The FAST protocol was developed to transfer large volumes of financial data without increasing latency. FAST messages consist of various fields and operators. The protocol runs over UDP; to reduce latency, several messages are packed into a single UDP frame. FAST messages carry no length information and define no framing of their own. Instead, each message is described by a template that must be known in advance; only then can the data stream be decoded.







Multiple FAST messages in a single UDP frame



Each message starts with a Presence Map (PMAP) and a Template Identifier (TID). The PMAP is a bit mask indicating which optional fields, sequences, or groups are present in the current stream. The TID identifies the template required to decode the message. Templates are described in XML, as in the example below:



    <template name="This_is_a_template" id="1">
        <uInt32 name="first_field"/>
        <group name="some_group" presence="optional">
            <sint64 name="second_field"/>
        </group>
        <string name="third_field" presence="optional"/>
    </template>


Here the TID is 1, and the template consists of two fields and one group. A field can be a string, a signed or unsigned 32- or 64-bit integer, or a decimal.
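As an illustration, such a template can be parsed into a flat list of field descriptors with a few lines of Python (a sketch only; in the paper the templates are compiled into microcode, and all names here come from the example template above):

```python
import xml.etree.ElementTree as ET

TEMPLATE_XML = """
<template name="This_is_a_template" id="1">
    <uInt32 name="first_field"/>
    <group name="some_group" presence="optional">
        <sint64 name="second_field"/>
    </group>
    <string name="third_field" presence="optional"/>
</template>
"""

def parse_template(xml_text):
    """Turn a FAST template definition into (TID, list of field descriptors)."""
    root = ET.fromstring(xml_text)
    fields = []

    def walk(node):
        for child in node:
            optional = child.get("presence") == "optional"
            fields.append((child.tag, child.get("name"), optional))
            walk(child)  # descend into groups

    walk(root)
    return int(root.get("id")), fields

tid, fields = parse_template(TEMPLATE_XML)
print(tid)     # 1
print(fields)  # uInt32, group, sint64, string descriptors
```

A real decoder would also track per-field operators and PMAP positions; this sketch only recovers the TID and the field layout.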



Data compression is achieved by removing the leading zeros within each data word. Stop bits are used to decode the compressed data: the first seven bits of each byte encode data, and the eighth bit serves as a stop bit separating fields. To decompress, the stop bit must be removed, and the remaining 7 bits must be shifted and concatenated whenever a field spans more than one byte.



Suppose, for example, that the following three bytes arrive (the leading bit of each byte is the stop bit):

    10000001  00000011  10000001

These three bytes represent two fields, as delimited by the stop bits. To find the real value of the first field, assuming it is an integer, it suffices to replace its stop bit with zero. For the first byte, 10000001, the result in binary and hexadecimal is:

    00000001 = 0x01

The second field spans two bytes. To find its real value, the seven data bits of each byte are concatenated and two zero bits are prepended. For example, for the bytes 00000011 and 10000001, the binary and hexadecimal result is:

    00000001 10000001 = 0x0181
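This stop-bit decompression can be sketched in software as follows (illustrative Python, not the paper's hardware implementation):

```python
def decode_stopbit_fields(stream: bytes):
    """Split a FAST stop-bit-encoded byte stream into integer fields.

    The high bit of each byte is the stop bit; the low seven bits carry
    data. The 7-bit groups of each field are concatenated MSB-first.
    """
    fields = []
    value = 0
    for byte in stream:
        value = (value << 7) | (byte & 0x7F)  # strip the stop bit, append 7 data bits
        if byte & 0x80:                       # stop bit set: the field is complete
            fields.append(value)
            value = 0
    return fields

# A three-byte stream holding a one-byte field and a two-byte field.
stream = bytes([0b10000001, 0b00000011, 0b10000001])
print([hex(f) for f in decode_stopbit_fields(stream)])  # ['0x1', '0x181']
```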

Decision making



One of the most important aspects of an algorithmic trading system is making decisions about financial transactions. Deciding whether to buy or sell a stock may require large amounts of data and the resources to process them. The system can take into account many parameters and variables, which are weighed using mathematical and statistical tools.



This article will not examine the decision-making process in detail; suffice it to say that, because of its complexity, it is never moved into hardware and remains the domain of software.



Baseline implementation



FPGAs are often used to build modern electronic trading systems; Ethernet, IP and UDP processing and FAST decoding are offloaded onto this hardware. The parallelism of an FPGA provides a serious speedup compared with purely software solutions. The architecture of the system described in the paper is as follows:







IP, UDP and Ethernet decoding in hardware


To reduce market data processing latency, the IP, UDP and Ethernet stack is handled entirely in hardware. A Xilinx Virtex-6 FPGA is used, whose high-speed serial transceivers can act as a PHY for direct access to a 10 Gbit Ethernet network switch.



The Ethernet MAC and the protocols of the other layers are handled by internal modules. All elements are implemented at the Register Transfer Level (RTL) using state machines. The hardware runs at 165 MHz and provides a 64-bit data bus, which yields a processing bandwidth of 1.25 GB/s, matching an incoming stream of 10 Gbit/s.



FAST decoding in hardware


To speed the trading system up even further, FAST decoding is also moved into hardware. However, this is more complicated than handling plain UDP, which is almost identical across applications. FAST, by contrast, is a much more variable and flexible protocol: each feed handler can define its own templates and presence maps, which makes a fixed state-machine implementation infeasible.



Therefore, a microcode engine was implemented in hardware that can decode any FAST variant. To add support for a new protocol variant, only new microcode instructions need to be written; no modifications to the hardware itself are required.



The microcode is generated from a specification written in a domain-specific language (DSL). Using a microcode engine instead of a general-purpose processor delivers higher speed: thanks to the limited instruction set, processing is faster by several orders of magnitude.



The figure below shows the implementation of the FAST decoder. FIFO buffers sit between the stages. Each of the three stages runs at 200 MHz and uses a 64-bit data path. FAST processing consumes 5630 LUTs (lookup tables).







The paper also briefly describes the other modules of the system.





The binary code for the microcode engine is generated by an assembler using the DSL developed specifically for this project; the paper shows an example assembler listing for the template above.







FAST protocol analysis



To build a system that works effectively in all market conditions, even during periods of high volatility and trading activity, a thorough analysis of FAST data is required. First, it must be determined how fast updates can arrive in a real trading system. To this end, a theoretical model was developed that describes the upper bound on the update rate and accounts for the various data compression mechanisms.



As a next step, a more realistic model was developed, based on figures obtained from real trading.



The formula below describes the data rate delivered by the market data feed:

    Dfast = Dfeed × compression × (1 − overhead)

Here compression is the compression factor of the FAST protocol, and overhead is the framing overhead of the Ethernet, IP and UDP protocols. The maximum data processing rate that the trading accelerator can sustain can be described as:

    Dmax = (cf × fieldsize) / CPI

where cf is the clock frequency, CPI means the number of cycles per instruction, and fieldsize represents the number of bits per field / instruction.



To calculate the upper limit of Dfast, one can take the theoretical maximum of 10 Gbit/s for Dfeed and a compression factor of 8 (64 bits can be compressed down to a single byte). To obtain a realistic estimate of the overhead, several gigabytes of market data were analyzed. UDP and Ethernet overhead depends on packet size and many other factors, but the researchers determined the median overhead for Ethernet, IP and UDP framing to be 10%:

    Dfast = 10 Gbit/s × 8 × (1 − 0.1) = 72 Gbit/s

In the same way, one can calculate the sustained data rate Dmax that the trading accelerator can "digest". The field size is 64 bits and cf is 200 MHz. As for the CPI parameter, analysis shows that the current implementation takes 2.66 cycles to process a field, which yields a maximum processing rate of 4.8 Gbit/s for the FAST processor:

    Dmax = (200 MHz × 64 bit) / 2.66 ≈ 4.8 Gbit/s

Since Dmax < Dfast, at moments of peak market activity the system simply cannot keep up with the increased data volume. The only parameter that can realistically be improved is CPI, and this can be done using three techniques.
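The arithmetic behind the two estimates above can be checked with a few lines (the values are taken straight from the text):

```python
# Upper bound on the decompressed feed rate, in Gbit/s.
d_feed = 10.0        # 10 Gbit/s line rate
compression = 8      # a 64-bit value can shrink to a single byte
overhead = 0.10      # median Ethernet/IP/UDP framing overhead
d_fast = d_feed * compression * (1 - overhead)

# Sustained rate the FAST processor can absorb, in Gbit/s.
cf = 0.2             # 200 MHz clock, expressed in GHz
fieldsize = 64       # bits per field/instruction
cpi = 2.66           # cycles per instruction in the baseline implementation
d_max = cf * fieldsize / cpi

print(round(d_fast, 1), round(d_max, 1))  # 72.0 4.8
```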



"Custom" instruction segments



FAST messages are decoded according to rules determined by the TID and the PMAP. For each incoming message, the microcode engine looks up the TID to find the instruction code to be executed for that template. Within the instruction flow, the branches to be taken are indicated by the PMAP. An obvious solution that significantly reduces the number of branches is to use separate microcode segments not just for each template, but for each {TID, PMAP} pair, since it is this pair that determines all branches of a message.



The disadvantage of this method is increased overhead: the instruction memory footprint grows considerably. Nevertheless, analysis of large volumes of FAST data leads to an interesting observation: just five {TID, PMAP} pairs account for 92% of all messages. It therefore makes sense to implement microcode segments specifically for these pairs. This approach significantly reduces the CPI: in tests, it dropped from 2.66 to 2.35, an improvement of 13%.
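The idea can be sketched as a dispatch table keyed by the {TID, PMAP} pair: frequent pairs get straight-line decode segments, and everything else falls back to the generic branching path. This is an illustrative Python sketch with made-up TID and PMAP values, not the paper's microcode:

```python
def decode_generic(tid, pmap, payload):
    """Generic path: inspects each PMAP bit and branches per optional field."""
    return ("generic", tid, pmap)

# Straight-line decode segments for the most frequent {TID, PMAP} pairs
# (in the hardware these are dedicated microcode segments).
SPECIALIZED = {
    (1, 0b101): lambda payload: ("fast-path", 1, 0b101),
    (2, 0b001): lambda payload: ("fast-path", 2, 0b001),
}

def decode_message(tid, pmap, payload):
    """Dispatch on the {TID, PMAP} pair, falling back to the generic path."""
    handler = SPECIALIZED.get((tid, pmap))
    if handler is not None:
        return handler(payload)  # no PMAP branching inside the segment
    return decode_generic(tid, pmap, payload)

print(decode_message(1, 0b101, b""))  # handled by a specialized segment
print(decode_message(3, 0b111, b""))  # falls back to the generic decoder
```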







Precomputation of branches



Each branch instruction requires three clock cycles to execute and inserts two NOP instructions into the pipeline. This implementation has two types of branch instructions: the first uses the TID to look up the segment's instruction pointer, and the second handles branching inside a microcode segment. The latency of the template segment lookup can be eliminated by inspecting incoming fields ahead of time and forwarding this information to memory. Since UDP frames contain many templates, TID lookups happen frequently, which leaves room for optimization.



Variable-length field processing



Another source of instruction branches is the processing of variable-length fields. Because of the stop bits, values in the FAST compression scheme can be anywhere from 8 to 72 bits long (64 data bits plus 8 stop bits). Since the data path is only 64 bits wide, processing stretches over one or two extra cycles, which again means additional NOP cycles.



The situation can be remedied by widening the data path to 72 bits and decompressing every field as an integer. When the field is a number, it can then be processed directly; in other cases, the extra 8 bits are simply ignored. This approach avoids instruction branches; the variable-length field implementation reduced the CPI from 2.1 to 1.7.
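The effect of the wider path can be illustrated in software: with a 9-byte (72-bit) read window, any stop-bit-delimited field is consumed in a single pass (an illustrative Python sketch, not the hardware design):

```python
def read_field_72(stream: bytes, offset: int):
    """Read one stop-bit-delimited field starting at `offset`.

    At most 9 bytes are consumed: 64 data bits plus 8 stop bits is the
    worst case, so a 72-bit window always covers a whole field.
    Returns (value, offset of the next field).
    """
    value = 0
    for i in range(9):
        byte = stream[offset + i]
        value = (value << 7) | (byte & 0x7F)  # append 7 data bits
        if byte & 0x80:                       # stop bit: the field is complete
            return value, offset + i + 1
    raise ValueError("field longer than 72 bits")

data = bytes([0b00000011, 0b10000001, 0b10000010])  # a two-byte field, then a one-byte field
v1, off = read_field_72(data, 0)
v2, off = read_field_72(data, off)
print(hex(v1), hex(v2))  # 0x181 0x2
```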



Evaluation



To assess the effect of the various optimization techniques, the results were compared with the baseline implementation. The figure below shows how applying the optimizations increases the volume of data processed:







The FPGA implementation was also compared with software running on conventional hardware (Intel Xeon 5472 and IBM Power6). Below is a performance comparison for these cases:







Message processing latency also differs dramatically: decoding a 21-byte FAST message took 261 ns on the Intel Xeon, 476 ns on the IBM Power6, and only 40 ns in the FPGA implementation.

Source: https://habr.com/ru/post/276857/


