Imagine that we have an electronic payment system with a table of transactions in its database, and we want to calculate, say, the average transaction size. That's easy; here's the query, it just takes a long time to run:
> SELECT avg(amount) FROM transfer;
65.125965782378
executed in 3850 seconds
Now imagine that this figure has to be as fresh as possible, records are added to the table every second, and millions of them accumulate over a month. Or the requirements may differ, but the essence is the same: aggregating the same data over and over is very expensive. Regular databases do not offer such an optimization out of the box. What can we do?

Anyone who rides a bike may have wondered how a cycling computer manages to show the average speed continuously, updated every second, without storing every single speed reading. These days a bike computer might well hold a microSD card, but they worked the same way years ago, when no such memory existed.
It's simple: it stores the distance traveled and the elapsed time. Every second it only needs to increment the time counter and add the distance covered to the odometer. Just two values of 4 or 8 bytes each, just two additions. To get the average speed, divide one by the other.
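Here is a minimal sketch of the same idea in Python (the class and method names are mine, not taken from any real bike computer):

class AverageSpeed:
    """Keeps only two counters instead of every speed reading."""

    def __init__(self):
        self.metres = 0.0   # total distance traveled
        self.seconds = 0    # total elapsed time

    def tick(self, metres_this_second):
        # called once per second with the distance covered in that second
        self.metres += metres_this_second
        self.seconds += 1

    def average(self):
        return self.metres / self.seconds if self.seconds else 0.0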
Other transformations
In the case of the payment system, we periodically need not only to add values to the sum or the average but also to remove them. This is done with the same formula, only in reverse, provided we know which value d₁ we are removing; for the average of n values:

avg' = (avg · n − d₁) / (n − 1)

If we do not know d₁, nothing can be done.
If we change a value that has already been taken into account, we do both operations: remove the old value from the average and add the new one.
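In code, the aggregate is just a pair (count, total), and every operation on it is O(1); a sketch:

# The aggregate is a pair (count, total); each operation below is O(1).
def add(agg, value):
    count, total = agg
    return count + 1, total + value

def remove(agg, value):
    # works only if we know exactly which value is being removed
    count, total = agg
    return count - 1, total - value

def change(agg, old, new):
    # a change is a removal of the old value plus an addition of the new one
    return add(remove(agg, old), new)

def average(agg):
    count, total = agg
    return total / count if count else 0.0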

Those comfortable with mathematics can easily derive similar decompositions for statistics such as the variance, the product, and even the product of every value with every other (that is, the covariance). A concrete application of this knowledge is described below.
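As a quick illustration of the decomposition itself, the variance can be maintained from just three counters: the count, the sum and the sum of squares. A minimal sketch:

# Variance from three running counters:
#   n -- number of values, s -- sum of values, r -- sum of squares.
# Var = r/n - (s/n)^2, so adding or removing a value only touches n, s and r.
def update(n, s, r, value, sign=1):
    # sign = 1 adds the value, sign = -1 removes it
    return n + sign, s + sign * value, r + sign * value * value

def variance(n, s, r):
    mean = s / n
    return r / n - mean * mean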
Questionnaires
Users fill out a questionnaire of several questions, rating each on a scale from 1 to 100 (from strong agreement to complete disagreement). We needed to calculate the difference between two users, that is, roughly the standard deviation between their answers to the corresponding questions:

Σ_q (V_qi − V_qj)²

Here i and j are two different users, and V_qi is user i's answer to question q. We also need to calculate the difference between user X and a group, and between him and all other users. (I deliberately departed from the terminology of mathematical statistics by omitting the division by n in the variance; this does not qualitatively change the essence or the complexity of the calculations.)
Without optimization, the query to the database turned out like this:
SELECT
    avg(power(my_answer.value - his_answer.value, 2))
FROM
    answer AS my_answer
    INNER JOIN answer AS his_answer ON my_answer.question_id = his_answer.question_id
WHERE
    my_answer.user_id = X AND
    his_answer.user_id = Y
To compare with several users, you just need to change the condition:
...
WHERE
his_answer.user_id IN (Z) AND
...
or even remove it altogether if we compare the answers against everyone. Obviously, this means scanning the entire table. It can, of course, be cached: 50 questions and 10,000 profiles is 500,000 rows, just a few megabytes of memory. But since users fill out questionnaires often and look at the results often, we don't want to repeat the same computations needlessly; they need to be optimized.
The part that needs to be optimized is the dispersion-like sum in the first line below; the second line is the same expression with the sums over users isolated:

Σ_q Σ_u (X_q − V_qu)²
  = Σ_q ( n·X_q² − 2·X_q·Σ_u V_qu + Σ_u V_qu² )

Here X_q is the answer of the user in question (the one the indicator is computed for) to question q, and V_qu is the answer of user u to the same question q. We aggregate over both users and questions. There are few questions (m ≈ 50) and many questionnaires (n ≈ 10,000). The sum over q still spans the whole expression, so the data has to be broken down by question and aggregated over the questions every time, but that is only 50 rows. The sums over the 10,000 users u, however, are now isolated and can be computed once and stored. If we denote the sum of the answers to question q as S_q and the sum of their squares as R_q, we get a very compact formula:

Σ_q ( n·X_q² − 2·X_q·S_q + R_q )

S_q and R_q can be calculated in advance and saved in a separate table.
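To make the compact formula concrete, here is a minimal sketch of how it could be evaluated in application code once S_q and R_q are stored; the function and the dictionary layout are mine, not from the original system:

def difference_to_everyone(my_answers, summary):
    # my_answers: {question_id: X_q} -- the answers of the user in question
    # summary:    {question_id: (S_q, R_q, n)} -- precomputed per-question sums
    total = 0
    for q, x in my_answers.items():
        s, r, n = summary[q]
        total += n * x * x - 2 * x * s + r
    return total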
Measurements
To get a feel for how much faster the system can work, I wrote a script that generates random answers from abstract users to abstract questions. The script fills only one table, the answers; we do not need any other tables.
import sqlite3, random, itertools

USERS = range(100000)
QUESTIONS = range(50)

conn = sqlite3.connect('aggregation.sqlite3')
cur = conn.cursor()

cur.execute('drop table if exists answer')
cur.execute('drop table if exists answer_summary')
cur.execute('create table answer (user_id integer, question_id integer, value integer)')
for u, q in itertools.product(USERS, QUESTIONS):
    cur.execute('insert into answer values (?, ?, ?)', (u, q, random.randint(1, 100)))
# we build the indexes after inserting the data,
# so that the B-tree is not rebuilt on every insertion
cur.execute('create index answer_user on answer (user_id)')
cur.execute('create index answer_question on answer (question_id)')
conn.commit()

cur.execute("""
    create table answer_summary as
    select question_id, sum(value) value, sum(value * value) value_square, count(*) num
    from answer
    group by question_id
""")
# the index needs a name different from the one on the answer table
cur.execute('create unique index answer_summary_question on answer_summary (question_id)')
conn.commit()
50 answers from each of 100,000 users makes 5 million records. On my weak laptop (dual-core 1.5 GHz, 2 GB of memory), building the tables took about half an hour, and the file, depending on the number of records, came to tens or hundreds of megabytes.
I ran several queries with sums and sums of squares; my subjective impressions are given below. Importantly, Linux caches the database file entirely in memory, so the only thing slowing the work down was the computation itself: index lookups and addition.
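The optimized "one to all" statistic can be computed with a query of roughly this shape against the answer_summary table built above, reusing the cursor from the script (a sketch; X is an arbitrary user id, and the unoptimized variant is the self-join shown earlier):

X = 42  # any user id
cur.execute("""
    select sum(s.num * a.value * a.value
               - 2 * a.value * s.value
               + s.value_square)
    from answer a
    join answer_summary s on s.question_id = a.question_id
    where a.user_id = ?
""", (X,))
print(cur.fetchone()[0])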
And here are the results when computing the statistics of differences between answers:
|                      | no optimization | with optimization |
| one to all           | seconds         | instantly         |
| everyone to everyone | minutes         | instantly         |
Side effects
If our program talks to the database and we need the sum of a large number of rows (thousands or more), it makes sense to let the database do the summing rather than send all the data to the program and add it up there; that only increases the overhead.
However, if our data is already summarized in advance, as with the questionnaires in the example, then we have only 50 + 50 rows. Such data can simply be selected as-is and the calculation done in the program, where the code will be much more concise.
The same goes for updating the sums: there is no need to write complex UPDATE queries with table joins; you can select the data, add to it, and write it back with INSERT OR REPLACE.
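A sketch in the spirit of the SQLite script above, reusing its connection and cursor (it relies on the unique index on answer_summary.question_id created earlier; the question id and the new value are made up):

question_id, new_value = 7, 55  # hypothetical new answer

cur.execute('select value, value_square, num from answer_summary where question_id = ?',
            (question_id,))
s, r, n = cur.fetchone()

cur.execute('insert or replace into answer_summary (question_id, value, value_square, num) '
            'values (?, ?, ?, ?)',
            (question_id, s + new_value, r + new_value ** 2, n + 1))
conn.commit()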
Where the approach does not work
Let's go back to the bike computer example. We have the average speed in kilometers per hour. We can convert it linearly to meters per second, and if we had converted each reading from km/h to m/s first and only then aggregated, the result would be the same.
But if we keep only the average and want to calculate the average air resistance (proportional to the square of the speed), we fail: the average of the squares cannot be derived from the average of the values. For example, speeds of 3 and 5 average to 4, but their squares average to 17, not 16. We would have to go back to the original observations.
Not so difficult
In fact, if you don't have OLAP but just an ORM, there is no need to write sheets of SQL that you will then have to maintain and, worst of all, that someone else will have to understand. Such optimizations can be implemented as related models. Here is how the models could be organized in Django:
from django.contrib.auth.models import User
from django.db.models import (Model, CharField, IntegerField, ForeignKey,
                              CASCADE, F)
from django.db.models.signals import post_save, pre_save, pre_delete

class Question(Model):
    text = CharField(max_length=255)

class Answer(Model):
    user = ForeignKey(User, on_delete=CASCADE)
    question = ForeignKey(Question, on_delete=CASCADE)
    value = IntegerField()

class AnswerAggregate(Model):
    question = ForeignKey(Question, related_name='aggregates', on_delete=CASCADE)
    summ = IntegerField(default=0)
    summ_of_squares = IntegerField(default=0)
    answers_count = IntegerField(default=0)

def add_to_aggregates(sender, instance, **kwargs):
    instance.question.aggregates.update(
        summ=F('summ') + instance.value,
        summ_of_squares=F('summ_of_squares') + instance.value ** 2,
        answers_count=F('answers_count') + 1,
    )

def remove_from_aggregates(sender, instance, **kwargs):
    instance.question.aggregates.update(
        summ=F('summ') - instance.value,
        summ_of_squares=F('summ_of_squares') - instance.value ** 2,
        answers_count=F('answers_count') - 1,
    )

def remove_old_value(sender, instance, **kwargs):
    # before an existing answer is saved with a new value,
    # subtract its old value from the aggregates
    if instance.pk:
        old = Answer.objects.get(pk=instance.pk)
        remove_from_aggregates(sender, old)

post_save.connect(add_to_aggregates, sender=Answer)
pre_save.connect(remove_old_value, sender=Answer)
pre_delete.connect(remove_from_aggregates, sender=Answer)
When we create an answer, we add it to the running sums. To delete it, we subtract its values. To change it, we subtract the old value and add the new one.
As a fun homework exercise, you can write a query, through the ORM or in raw SQL, that computes the statistic from the formula above.
The price of the improvements
How worthwhile it is to optimize such calculations is a separate question. In the questionnaire example, reports began to slow down at a couple of thousand users, and at ten thousand each such query took a couple of seconds. That amount of data is quite achievable even for a small project, let alone a commercial enterprise. Conventional databases today do not perform such optimizations automatically. Optimizations of this kind are built into OLAP databases, but for a small business they are expensive and cumbersome. So small optimizations like these can be a good solution.
The price of such an optimization is:
1. Deriving the formulas and understanding them, which requires a good grasp of mathematical statistics.
2. Writing a trigger or an ORM procedure and debugging it.
3. Documenting the system thoroughly so that the employee who comes after you does not quietly go back to the usual aggregate functions.
Summary
First, as the measurements show, the main bottleneck in aggregates (SUM, AVG) is not reading from disk at all, but the summation itself.
Second, even complex aggregate functions can be decomposed into simpler aggregates. I showed how the squared difference can be broken down into components, isolating the sum of the values and the sum of their squares. These sums can be calculated in advance and kept ready.
The resource consumption of a report then decreases in proportion to the number of observations, O(n). On systems storing gigabytes of data this can be a significant speedup, bringing reports down to fractions of a second.
Third, you can add new data and edit or delete old data without recomputing the aggregate from scratch. The resource consumption of such updates also decreases by a factor on the order of n.
Fourth, when working with a small number of rows, that is, only with the aggregates, the volume of data shrinks enough that it can be pulled from the database and processed in program code, getting rid of cumbersome SELECT and UPDATE queries.
For those who read to the end: the listing at the very beginning of the article is made up.