
Experience using a GPU for financial modeling

In this article, I briefly describe my experience optimizing a single brute-force search task, from a single-threaded algorithm, to a multi-core version, to an OpenCL implementation.




Preamble


One of the projects (C#) I work on is intended for financial analysis and modeling. It lets the user describe, in a small domain-specific language, the input parameters and the sequence of algorithms for processing financial information (quotes and other data on financial instruments). One of the tasks solved with the application is optimizing the parameters of a trading system (TS).
A TS has a set of parameters that determine its behavior during trading. Some of the parameters change within the trading day, i.e. the TS adapts to the current market. The task was to determine the optimal TS parameters by retrospective analysis of quotes (backtesting) over a user-specified period of time.

Algorithm


Unfortunately, the only way to find the optimum was an exhaustive search (a code sketch of the steps follows this list).
  1. A set of combinations of TS parameters satisfying the user-defined constraints is compiled. There are also constraints between the parameters themselves, which allow a significant portion of the search space to be pruned. Each parameter is stepped within its bounds: the smaller the step, the higher the accuracy, but the more combinations there are to consider. For convenience, such a set could be represented as an n-dimensional sparse matrix, but that gives no benefit to this algorithm, so the set was a simple linked list of parameter structures.
  2. For each combination of parameters, the operation of the TS is emulated.
    The TS passes through the entire set of quotes and, whenever its conditions are met, records the execution of buy/sell transactions; this yields the overall financial result of the TS. After the emulation is complete, the financial result is saved.
  3. After all combinations have been processed, the best result is found among them. The emulation is then run once more with the best combination of parameters, this time displaying all transactions and results to the user.
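
A minimal sketch of these three steps in C# (all names here — TsParameters, Quote, Emulate — are hypothetical placeholders; the article does not show the original code):

using System.Collections.Generic;

class BruteForceSearch
{
    // Step 1 produced a linked list of parameter combinations that
    // satisfy the user-defined constraints; it is consumed here.
    public static TsParameters FindBest(LinkedList<TsParameters> combinations,
                                        IReadOnlyList<Quote> quotes)
    {
        TsParameters best = null;
        double bestResult = double.MinValue;

        // Step 2: emulate the TS for every combination and save its result.
        foreach (var parameters in combinations)
        {
            double result = Emulate(parameters, quotes);

            // Step 3: keep track of the best combination seen so far.
            if (result > bestResult)
            {
                bestResult = result;
                best = parameters;
            }
        }
        return best; // re-run Emulate on this one with full output for the user
    }

    // Walks the entire quote history once, executing buy/sell transactions
    // when the TS conditions are met, and returns the overall financial result.
    static double Emulate(TsParameters p, IReadOnlyList<Quote> quotes)
    {
        double profit = 0;
        foreach (var quote in quotes)
        {
            // ... open/close positions according to the TS rules ...
        }
        return profit;
    }
}

class TsParameters { /* the TS parameter fields */ }
class Quote { /* time, price, volume, ... */ }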

First working version


After implementing the algorithm and getting the first results useful to the end user, it turned out that the search was very slow.
The iteration ran sequentially in one thread, and the load on the 4 processor cores was rather low. The obvious idea was to parallelize the processing.

Parallel processing on the CPU


Distribution across threads was implemented with Parallel.ForEach over the set of TS parameters.

Parallel.ForEach(counters, counter => Process(points, counter)); 

But measurements showed that the gain was negligible, due to the overhead of dispatching work to threads: the processing time of one parameter set differed little from the time required to hand it off to a thread.
So I divided the entire set of parameters into roughly equal blocks, with the number of blocks equal to the number of available processor cores.

 Parallel.For(0, processorCount, i => Process(points, block[i])); 

With this implementation the cores were loaded evenly at 99-100% and the running time dropped almost 4-fold, since this is a near-ideal task for parallel processing.
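
A sketch of that partitioning, reusing the hypothetical counters / points / Process names from the snippets above and the TsParameters placeholder type from the earlier sketch; round-robin assignment keeps the blocks equal to within one element, so all cores finish at roughly the same time:

using System.Collections.Generic;

static class BlockSplit
{
    public static List<TsParameters>[] SplitIntoBlocks(
        IReadOnlyList<TsParameters> counters, int processorCount)
    {
        var blocks = new List<TsParameters>[processorCount];
        for (int i = 0; i < processorCount; i++)
            blocks[i] = new List<TsParameters>();

        // Deal the combinations out round-robin so every core gets
        // almost exactly the same amount of work.
        for (int i = 0; i < counters.Count; i++)
            blocks[i % processorCount].Add(counters[i]);

        return blocks;
    }
}

// Usage, matching the call above:
//   var block = BlockSplit.SplitIntoBlocks(counters, Environment.ProcessorCount);
//   Parallel.For(0, block.Length, i => Process(points, block[i]));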

The user was very pleased with this speedup.
But a couple of days later he posed another task, which increased the search volume a hundredfold. Solving the new problem on stock data covering 2 weeks of trading took almost a day. In principle this suited the user, but it seemed inadequate to me.

The algorithm could not be simplified: it was already reduced to the strictly necessary set of operations, and the profiler showed no obvious problems.
Applying a numerical optimization method also failed, because the objective function (the trading result) has many local maxima, and stepping through the parameters with a wide step could simply miss the global maximum.
I just needed more computing power.

Transfer to GPU


I had only a fairly general idea of GPU computing. Of course, I had read articles about how great it is, but I had no idea about the limitations of the technology.
In the end, I decided the topic was worth digging into. Even if the algorithm could not be ported, I would still learn something useful.

I downloaded the Stanford CUDA course from iTunes and started watching the lectures. They are good in that they give a general idea of the architecture. The first 5 lectures were basically enough for me to understand what I needed to do. But I decided to implement it not on CUDA but on OpenCL: it so happened that at home I have AMD video cards, and at work NVIDIA ones.

After watching the lectures, and armed with a dictionary for translating CUDA terms into OpenCL ones, I wrote the first implementation in four hours. To work with OpenCL from C#, I used OpenCL.Net.

The converted quote data and the TS parameter sets were transferred to the video card. Each combination of parameters was processed by a separate thread: at startup each thread read its own TS parameters and then walked through the entire quote history. After all combinations were processed, the results were read back from the video card.
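
Roughly, the kernel looked like this (a sketch, not the author's actual code: the parameter fields and trading logic are placeholders, but the structure — one work-item per combination, each looping over the full quote history — matches the description above). The OpenCL C source is kept as a string on the C# side:

static class Kernels
{
    public const string Backtest = @"
typedef struct {
    float entry_threshold;   // placeholder TS parameters
    float exit_threshold;
    int   use_stop_loss;     // int rather than bool -- see pitfall 1 below
} ts_params;

__kernel void backtest(__global const ts_params* combinations,
                       __global const float*     quotes,
                       const int                 quote_count,
                       __global float*           results)
{
    int gid = get_global_id(0);        // which combination this work-item owns
    ts_params p = combinations[gid];   // read my parameter set once

    float profit = 0.0f;
    for (int i = 0; i < quote_count; i++)   // walk the entire quote history
    {
        float price = quotes[i];
        // ... emulate the TS: open/close positions when conditions hold ...
    }
    results[gid] = profit;             // read back by the host afterwards
}";
}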

And then the bumps and bruises began:
  1. The algorithm computed something, but not the right thing: the results did not match the reference.
    After long debugging it turned out I had wrongly assumed that bool occupies 1 byte. According to the specification, on the device it is expanded to int and occupies 4 bytes. To eliminate the ambiguity, I simply changed the type of the transferred data to int.
  2. The algorithm began to produce "almost the same" results, consistently slightly better than the CPU version's. I went over the cl code once more and found no error. I had to recheck the CPU version, and yes, the error was there: during optimization I had been discarding options as redundant that in fact sometimes were not. After removing that "optimization" from the CPU version, the results matched.
  3. Until then I had done all the experiments on my home computer. When testing at work, it turned out that the application died after several launches due to lack of memory, while at home everything worked fine. It turned out I really was not freeing the memory, but AMD's OpenCL implementation tolerated this blunder, whereas NVIDIA's implementation does not forgive mistakes (a cleanup sketch follows this list).
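
The fix for pitfall 3 was simply to release every OpenCL object once a batch finishes. A sketch, assuming OpenCL.Net's Cl.Release* wrappers around the corresponding clRelease* API calls (buffer creation and kernel launch are elided):

using OpenCL.Net; // Cl.* wrappers and the IMem type assumed from this package

static class GpuBatch
{
    public static void Run()
    {
        IMem quotesBuffer = null, paramsBuffer = null, resultsBuffer = null;
        try
        {
            // ... create the three buffers, enqueue the kernel,
            //     read the results back ...
        }
        finally
        {
            // NVIDIA's runtime ran out of memory after several launches when
            // these were not freed; AMD's implementation tolerated the leak.
            if (resultsBuffer != null) Cl.ReleaseMemObject(resultsBuffer);
            if (paramsBuffer != null) Cl.ReleaseMemObject(paramsBuffer);
            if (quotesBuffer != null) Cl.ReleaseMemObject(quotesBuffer);
        }
    }
}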

In principle, there were several more fine-tuning iterations after that, during which I also got acquainted with Kernel Analyzer, but on the whole the solution was ready.

Results


The user was very pleased when he received the new version. Even on his weak Radeon 5450 (80 stream processors) the calculations sped up tenfold.
He immediately began pestering me about how powerful a card could be put into his computer (I could not answer adequately, given the difficulty of remotely diagnosing his power supply and motherboard), how many video cards could be crammed into one case (I was tempted to send him links to geek porn with photos of Bitcoin farms), and whether the computation could be parallelized across several machines with video cards at once.

In closing, here are the times to solve the same task (analysis of stock data for 2 days):
1) AMD Phenom II X6 1090T (6 cores, 100% load): 4 hours 48 minutes (288 minutes)
2) Radeon 5850 (1440 stream processors, roughly 50% load): 4 minutes
i.e. the processing time decreased by a factor of 72.

Thank you for your attention; I will be glad to answer questions. Please send reports of typos, missing commas, and demands to replace hyphens with dashes by private message.

Source: https://habr.com/ru/post/131983/

