I often have to deal with tasks that demand very high database performance when processing large amounts of data. Today I will describe a very simple but effective technique that can help you out when the database no longer keeps up with the volume of data that accumulates and must be processed. The technique does not depend on the particular database, but out of habit I am publishing it in the PostgreSQL blog, and the example will use PostgreSQL. Let's go straight to the example.
Spherical horse
Suppose we are talking about the simplest possible billing system (clearly the technique is applicable not only to billing, but billing makes the illustration easy to follow). The table in which data on subscribers' calls accumulates will, in our case, have the following format:
CREATE TABLE billing.calls
(
    call_id       BIGINT,
    call_time     TIMESTAMP,
    subscriber_id INTEGER,
    duration      INTERVAL
);
That is all there is to it. Call detail records pour into this table at an insane rate, and they must be rated within a reasonable time.
For billing, we have a database function with the following signature:
FUNCTION calculate(IN subscriber_id INTEGER, IN duration INTERVAL, OUT status_code TEXT) RETURNS text
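The body of calculate is outside the scope of this article, but to make the example concrete, here is a minimal hypothetical sketch; the billing.accounts table and its rate_per_minute column are assumptions of mine and not part of the original schema:

-- Hypothetical sketch only: the real rating logic lives in your billing system.
CREATE OR REPLACE FUNCTION calculate(
    IN  subscriber_id INTEGER,
    IN  duration      INTERVAL,
    OUT status_code   TEXT)
AS $$
BEGIN
    -- Assumed table billing.accounts(subscriber_id, balance, rate_per_minute):
    -- debit the balance proportionally to the call duration.
    UPDATE billing.accounts AS a
       SET balance = balance - (EXTRACT(EPOCH FROM calculate.duration) / 60) * a.rate_per_minute
     WHERE a.subscriber_id = calculate.subscriber_id;

    status_code := 'OK';
END;
$$ LANGUAGE plpgsql;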
We perform the rating by running the following query once every 5 minutes:
SELECT calculate(subscriber_id, duration) FROM billing.calls;
And at some point we realize that this query simply no longer finishes within 5 minutes. Meanwhile more data accumulates, then more again, and we end up sitting and waiting for the night, when the stream weakens a little and the backlog finally gets worked through. Not a pleasant prospect. And I must say we were not waiting alone: together with us sat the remaining 3 (for example) cores of our server, while a single core strained over the query. PostgreSQL, unfortunately, cannot parallelize a query by itself, but in our case that is not even necessary. A very simple and obvious trick gives much better results. Create an index on the expression "the remainder of dividing subscriber_id by 4":
CREATE INDEX calls_subscriber_id_mod_idx ON billing.calls USING btree ((subscriber_id % 4));
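As a quick sanity check, you can ask the planner whether it picks up the new index for the per-thread predicate (the exact plan will vary with data volume and statistics, so treat this only as a sketch):

EXPLAIN
SELECT call_id, subscriber_id, duration
  FROM billing.calls
 WHERE subscriber_id % 4 = 0;
-- Look for calls_subscriber_id_mod_idx in the output, e.g. as an
-- Index Scan or Bitmap Index Scan with Index Cond: ((subscriber_id % 4) = 0).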
And now we run the rating in four threads (for example, as four separate jobs):
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = @mod;
where @mod is 0, 1, 2, or 3 (each thread gets its own value).
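In other words, the four jobs end up running queries like these, with the literal remainder substituted for @mod:

-- job 1
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 0;
-- job 2
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 1;
-- job 3
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 2;
-- job 4
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 3;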
As a result
This technique also removes the locking problems that could arise if two different threads picked up calls from the same subscriber: since the split is by subscriber_id, all of a given subscriber's calls land in the same thread. In addition, these parallel jobs will run faster than if we had relied on the database's own query parallelization (if we were running not PostgreSQL but Oracle, for example).
The method is applicable to any database that supports expression (function-based) indexes (Oracle, PostgreSQL). In the case of MSSQL, you can create a computed column and an index on it, as sketched below. MySQL does not support functional indexes, but as a workaround you can add a new column with an index on it and keep it up to date with a trigger.
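For illustration, a rough sketch of the MSSQL variant (the column name subscriber_id_mod is made up for the example):

-- MSSQL (T-SQL) sketch: a computed column plus an ordinary index on it.
ALTER TABLE billing.calls ADD subscriber_id_mod AS (subscriber_id % 4);
CREATE INDEX calls_subscriber_id_mod_idx ON billing.calls (subscriber_id_mod);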