Working with Big Data on the GPU: accelerating databases dozens of times over
For several years now, the data centers of many companies have been using GPU-accelerated computing. Our company is currently studying this area, since this type of computing is becoming increasingly popular. GPU acceleration can (and should) be used to speed up resource-intensive applications in fields such as deep learning, analytics, and engineering design. The approach is used in the data centers of large companies, in research laboratories, and in enterprises.
Thanks to this acceleration, many GPU-backed services now power neural networks or process data from smart cars. The advantage of the approach is that the resource-intensive part of an application, which demands a lot of computing power, runs on the GPU, while everything else executes on the CPU. In the past few years, combined solutions have begun to appear on which high-speed databases are built. Such solutions are used, for example, to visualize very large data sets. We are considering launching a similar service, so we have studied the question in some detail and selected several solutions we would like to use. Not all of them are open source, but their capabilities are encouraging.
Map-D
The name of this project stands for "massively parallel database". With it, you can visualize huge amounts of data: for example, track the spread of a flu outbreak almost in real time.
With Map-D you can also build a map of reports about various disasters. All of this takes just a few milliseconds, provided a prepared set of raw data is available.
Map-D is a platform that appeared several years ago and has been improved ever since. It lets you analyze large data sets using the parallel computing power of the GPU, which speeds up the analysis by a factor of 70 to 1,000, depending on the type of data.
The database is held in the memory of a large number of GPUs, spanning individual clusters. This allows billions of data points to be processed with near-instant graphical output of the analysis results. According to the project's creators, Map-D can execute SQL queries against such a database at great speed.
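As a sketch of what querying such a system looks like in practice, here is a hypothetical example using the third-party pymapd client library. The table name, column names, and connection parameters are illustrative assumptions, not taken from the article; when no server or client library is available, the code simply shows the SQL it would run.

```python
# Sketch: an aggregate SQL query of the kind a GPU database executes
# in milliseconds. Table/column names and credentials are assumptions.

def build_heatmap_query(table: str, limit: int = 1000) -> str:
    """Compose a grouping query suitable for heat-map visualization."""
    return (
        f"SELECT goog_x, goog_y, COUNT(*) AS n "
        f"FROM {table} GROUP BY goog_x, goog_y ORDER BY n DESC LIMIT {limit}"
    )

sql = build_heatmap_query("tweets")
print(sql)

try:
    from pymapd import connect  # third-party client; may not be installed
    con = connect(user="mapd", password="HyperInteractive",
                  host="localhost", dbname="mapd")
    for row in con.execute(sql):
        print(row)
except Exception as exc:
    # Without a running server (or the client library) we only show the SQL.
    print(f"no live server, query shown only ({exc.__class__.__name__})")
```

The point of the sketch is the shape of the workload: a GROUP BY over billions of rows is exactly the kind of data-parallel aggregate that a GPU database distributes across its cores.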
Kinetica
Just this month, another interesting project called Kinetica was updated. Its working principle is about the same as in the previous case: the solution is aimed at cases where you need to work with huge data sets that have to be visualized. Data is processed on the fly, quickly producing a clear picture of the results.
Previously, the company was called GPUdb, which already hinted at its database focus. The solution is positioned as a tool for corporations, which is why it supports standard enterprise features such as SQL-92 queries, clustering, and one-click installation.
According to the developers, using a GPU means 4,000+ cores working on a single device, versus 8-32 cores with a CPU. Kinetica includes a native visualization engine, plus plug-ins from third-party companies. All this, the platform's creators say, delivers a hundred-fold performance increase compared with the CPU.
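The core-count claim is really a claim about data parallelism: an aggregate such as a sum can be split into independent partial results, one per core, and then combined. A toy CPU-side sketch of that reduction pattern (not Kinetica code; threads here stand in for the thousands of GPU cores):

```python
# Toy illustration of the data-parallel reduction a GPU performs with
# thousands of cores: split the data, reduce chunks independently, combine.
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Each worker reduces its own slice of the data."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    """Split data into chunks, sum them concurrently, combine the results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

data = list(range(1_000_000))
print(parallel_sum(data))  # same result as sum(data)
```

A GPU applies the same split-reduce-combine pattern, but with thousands of lightweight cores instead of a handful of threads, which is where the claimed speedups come from.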
BlazingDB
This is a rather specific project, positioned as a solution for companies working with PostgreSQL, MySQL, or Amazon Redshift. The developers promise multi-fold performance gains for all of these products.
The key difference from other solutions is its support for both on-premises and third-party cloud instances. If your data is already in Amazon or Azure, you can add BlazingDB and measure the performance gain.
Previously, the service was free of charge, and only since June have the developers begun trying to monetize it. Working with BlazingDB requires the Nvidia CUDA driver for Linux. Unfortunately, the only platform currently supported is Ubuntu 14.04.
Blazegraph
Not all databases are SQL systems. Some are optimized for specific data-handling tasks: for example, for working with graphs, where the relationships between individual objects are analyzed and the results are visualized.
Such work demands resources, and GPU-oriented computing is exactly what is needed. One platform tailored to such tasks is Blazegraph. Note that it is an open-source solution written in Java, with two methods of GPU-based acceleration.
The developers say that Blazegraph provides a 200-300-fold performance increase over CPU solutions. The first method is simply to run the computations on the GPU.
The second is to rewrite your resource-intensive application in DASL, a Blazegraph language designed to enable parallel operations on the GPU. "Using Spark with CUDA and GPUs speeds up the execution of many applications 1,000 times compared to running the same applications on the CPU," the developers say.
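Blazegraph stores RDF graphs and is queried over HTTP with SPARQL, the standard graph query language. Here is a minimal sketch using only the Python standard library; the endpoint URL is a typical default installation path, an assumption rather than something stated in the article. Without a running server, the code only composes and prints the request.

```python
# Sketch: querying a graph database (Blazegraph) with SPARQL over HTTP.
# The endpoint URL is a typical default, assumed for illustration.
from urllib import request, parse

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

def build_sparql_request(query: str) -> request.Request:
    """Compose (but do not send) a SPARQL-over-HTTP POST request."""
    body = parse.urlencode({"query": query}).encode("utf-8")
    return request.Request(
        ENDPOINT, data=body,
        headers={"Accept": "application/sparql-results+json"},
    )

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
req = build_sparql_request(query)
print(req.full_url, req.get_method())

try:
    with request.urlopen(req, timeout=2) as resp:
        print(resp.read()[:200])
except OSError:
    print("no Blazegraph server running, request shown only")
```

Pattern matching over triples like `?s ?p ?o` is exactly the relationship analysis the section describes, and it is the part Blazegraph can accelerate on the GPU.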
PGStrom
PostgreSQL is a popular open-source database with many strengths: it is scalable, supports NoSQL/JSON workloads, and offers a number of other features.
The interesting thing is that PostgreSQL does not support GPU-based acceleration out of the box. To get it, you need a third-party project called PG-Strom. When a query arrives, PG-Strom determines whether it can be executed on the GPU. If so, a GPU-optimized version of the query is created, and the query is redirected to and executed on the GPU.
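A practical way to check whether a given query was offloaded is to inspect its EXPLAIN plan: with PG-Strom loaded, offloaded steps appear as GPU-specific custom plan nodes. A hedged sketch, assuming the third-party psycopg2 driver and a local PostgreSQL with PG-Strom installed; the DSN and table name are illustrative, and without a server the code only prints the statement it would run.

```python
# Sketch: checking whether PG-Strom offloaded a query to the GPU.
# The DSN and the "measurements" table are illustrative assumptions.

TABLE = "measurements"
SQL = f"SELECT device_id, AVG(value) FROM {TABLE} GROUP BY device_id"
EXPLAIN = f"EXPLAIN {SQL}"

def offloaded_to_gpu(plan_lines) -> bool:
    """PG-Strom marks offloaded steps with GPU-specific plan nodes."""
    return any("Gpu" in line for line in plan_lines)

try:
    import psycopg2  # third-party driver; may not be installed
    with psycopg2.connect("dbname=postgres user=postgres") as con:
        with con.cursor() as cur:
            cur.execute(EXPLAIN)
            plan = [row[0] for row in cur.fetchall()]
    print("\n".join(plan))
    print("GPU offload:", offloaded_to_gpu(plan))
except Exception:
    print("no live server, showing the EXPLAIN statement only:")
    print(EXPLAIN)
```

If the offload check returns false, the query ran on the CPU as ordinary PostgreSQL would run it, which matches the behavior described above: PG-Strom only takes over the queries it decides it can accelerate.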
Setting up PG-Strom takes some work: it requires the Nvidia CUDA toolkit and must be compiled from source. But once PG-Strom is integrated into PostgreSQL, it works without your having to rewrite anything for the GPU.