
Why database denormalization is necessary, and when to use it



In our blog on Habr, we not only talk about the development of our product, the Hydra billing system for telecom operators, but also publish materials about working with infrastructure and using various technologies.

Recently, we wrote about the use of Clojure and MongoDB, and today we will talk about the pros and cons of denormalizing databases. Database developer and financial analyst Emil Drkušić wrote on Vertabelo's blog about why, how, and when to use this approach. We present the main points of that article below.

What is denormalization?


This term usually refers to a strategy applied to an already normalized database in order to increase its performance. The idea is to place redundant data where it can bring the most benefit. To do this, you can add extra fields to existing tables, add new tables, or even create new instances of existing tables. The goal is to reduce the execution time of certain queries by simplifying data access or by creating tables that hold precomputed report results built from the source data.

A prerequisite for denormalization is having a normalized database in the first place. It is important to distinguish between a database that was never normalized at all and a normalized database that was later denormalized. The second case is fine; the first indicates design errors or a lack of knowledge on the part of the specialists who built it.

Consider the normalized model for the simplest CRM system:



Let's go over the tables in this model:


In this example, the database is greatly simplified for clarity. It is easy to see that it is perfectly normalized: there is no redundancy, and everything should run like clockwork. Performance problems do not appear until the database encounters a large amount of data.

When denormalization is useful


Before denormalizing what has already been normalized, you naturally need a clear understanding of why it is necessary. Make sure the benefits of applying the technique outweigh the possible negative effects. Here are a few situations in which you should definitely consider denormalization.

  1. Preserving historical data. Data changes over time, but it may be necessary to preserve the values that were current at the moment a record was created. For example, a client's first and last name, as well as information about their place of residence and occupation, may change. A task must keep the field values that were valid at the moment it was created; otherwise past data cannot be reconstructed correctly. You could solve the problem by adding a table with a history of changes, but then a SELECT query returning the task together with the client's name at that time becomes more complicated, so an extra table is not necessarily the best way out.
  2. Increasing query performance. Some queries may need several tables to reach frequently requested data. An example is a situation where up to 10 tables have to be joined just to obtain the client's name and the names of the goods sold to them, and some of those tables may in turn contain large amounts of data. In this scenario, it would be wise to add a client_id field directly to the product_sold table (see the sketch after this list).
  3. Speeding up report generation. Businesses often need to export certain statistics. Building reports on "live" data can take a long time, and the performance of the whole system may drop while it happens. For example, suppose you want to track customer sales over a certain period, either for a given group of customers or for all of them at once. A query that solves this task against the production database would have to churn through it in its entirety before such a report could be produced. It is not hard to imagine how much slower everything becomes if such reports are needed daily.
  4. Precalculating frequently requested values. It is often worth keeping the most frequently requested values precalculated and ready, rather than recomputing them from scratch in real time on every request.

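As an illustration of item 2, here is a minimal sketch of the kind of query that denormalization shortens. The table and column names (quantity, name, and so on) are assumptions made for the example and may differ from the actual model:

```sql
-- Normalized model: the client name is reachable only through a chain of joins.
SELECT c.client_name, p.name AS product_name, ps.quantity
FROM product_sold ps
JOIN task    t ON t.id = ps.task_id
JOIN client  c ON c.id = t.client_id
JOIN product p ON p.id = ps.product_id;

-- Denormalized model: with client_id copied into product_sold,
-- one join (through the potentially large task table) disappears.
SELECT c.client_name, p.name AS product_name, ps.quantity
FROM product_sold ps
JOIN client  c ON c.id = ps.client_id
JOIN product p ON p.id = ps.product_id;
```

The trade-off is that client_id in product_sold now has to be kept consistent with the task it belongs to.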
The conclusion suggests itself: you should not turn to denormalization if the application has no performance problems. But if you can feel that the system has slowed down or will soon slow down, it is time to consider this technique. However, before resorting to it, it is worth exhausting the other options for improving performance: query optimization and proper indexing.

Not so smooth


The obvious goal of denormalization is to increase performance. But everything has its price. In this case, the price consists of the following items:


Denormalization by example


Some of the denormalization rules described above have been applied to the model shown here. New blocks are marked in blue, changed ones in pink.



What has changed and why?



The only new element in the product table is the units_in_stock column. In the normalized model we could calculate this value as: units ordered - units sold - (units offered) - units written off. The calculation would have to be repeated every time a customer asks about a product, which is fairly time-consuming. Instead, the value can be computed in advance so that it is ready by the time a request comes in from the buyer. On the other hand, the units_in_stock attribute has to be updated after every insert, update, or delete in the products_on_order, writeoff, product_offered, and product_sold tables.
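As a rough sketch of what keeping units_in_stock current might look like, here is a recalculation statement for a single product. The quantity columns and the :product_id parameter are assumptions; in practice this logic would typically live in triggers or in the application code that touches the source tables:

```sql
-- Recompute the precalculated stock level for one product from the source data.
UPDATE product p
SET units_in_stock =
      COALESCE((SELECT SUM(o.quantity) FROM products_on_order o
                WHERE o.product_id = p.id), 0)   -- units ordered
    - COALESCE((SELECT SUM(s.quantity) FROM product_sold s
                WHERE s.product_id = p.id), 0)   -- units sold
    - COALESCE((SELECT SUM(w.quantity) FROM writeoff w
                WHERE w.product_id = p.id), 0)   -- units written off
    -- product_offered could be subtracted here as well,
    -- if offered goods are treated as reserved stock.
WHERE p.id = :product_id;
```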



Two new attributes have been added to the task table: client_name and user_first_last_name. Both store the values as they were at the moment the task was created, which is necessary because each of them can change over time. The foreign keys linking the task to the original client and user records are kept as well. There are other values that may need to be stored in the same way, for example the client's address or information about taxes included in the price, such as VAT.
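A minimal sketch of how those values might be frozen at creation time is shown below; the column names and the "user" table name are assumptions for illustration:

```sql
-- Copy the client and user names into the task at the moment it is created,
-- alongside the regular foreign keys.
INSERT INTO task (client_id, user_id, client_name, user_first_last_name, created_at)
VALUES (
    :client_id,
    :user_id,
    (SELECT client_name FROM client WHERE id = :client_id),
    (SELECT first_name || ' ' || last_name FROM "user" WHERE id = :user_id),
    CURRENT_TIMESTAMP
);
```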



The denormalized product_offered table received two new attributes: price_per_unit and price. The first stores the price that was current at the moment the goods were offered; the normalized model would only show the price as it is now, so as soon as it changes, the "price history" would be rewritten. This change does not just speed up the database, it also improves functionality. The price column holds the precalculated value units_sold * price_per_unit, so there is no need to do the multiplication every time someone wants to look at the list of offered goods. This is a small price to pay for the gain.

Changes to the product_sold table were made for the same reasons, the only difference being that in this case we are dealing with items actually sold.
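Schematically, inserting an offer line might then look like the sketch below; the current-price column in product and the :units parameter are assumptions, and the same pattern would apply to product_sold:

```sql
-- Freeze the unit price valid at the moment of the offer and
-- precalculate the line total.
INSERT INTO product_offered (task_id, product_id, units_sold, price_per_unit, price)
SELECT :task_id,
       p.id,
       :units,
       p.price,            -- current catalogue price, captured here
       :units * p.price    -- precalculated line total
FROM product p
WHERE p.id = :product_id;
```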



The statistics_per_year table (annual statistics) is a completely new element of the example model. It is essentially a denormalized table, since all of its data can be calculated from other tables. It stores information about the current tasks, successfully completed tasks, meetings, and calls for each client, as well as the total amount charged per year. After any insert, update, or delete in the task, meeting, call, and product_sold tables, this data has to be recalculated for the affected client and the corresponding year. Since changes most likely concern only the current year, reports for previous years can remain untouched. The values in this table are calculated in advance, so we save time and resources whenever the results are needed.
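A hedged sketch of such a recalculation for one client and one year follows; the column names in statistics_per_year (tasks_completed, calls, total_amount) and the date columns are assumptions:

```sql
-- Refresh the precalculated yearly row after a change in task, meeting,
-- call, or product_sold.
UPDATE statistics_per_year s
SET tasks_completed = (SELECT COUNT(*)
                       FROM task t
                       WHERE t.client_id = s.client_id
                         AND t.is_completed = TRUE
                         AND EXTRACT(YEAR FROM t.created_at) = s.year),
    calls           = (SELECT COUNT(*)
                       FROM call c
                       WHERE c.client_id = s.client_id
                         AND EXTRACT(YEAR FROM c.call_start) = s.year),
    total_amount    = COALESCE((SELECT SUM(ps.price)
                                FROM product_sold ps
                                JOIN task t ON t.id = ps.task_id
                                WHERE t.client_id = s.client_id
                                  AND EXTRACT(YEAR FROM t.created_at) = s.year), 0)
WHERE s.client_id = :client_id
  AND s.year = :year;
```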

Denormalization is a powerful approach, though it should not be resorted to every time a performance problem appears. In some cases, however, it may be the best, or even the only, solution.

However, before making a final decision to use denormalization, make sure it is really necessary. Analyze the current performance of the system; denormalization is often applied after the system has already gone into production. There is nothing to be afraid of here, but all changes should be carefully monitored and documented; then problems and data anomalies should not arise.

Our experience


At Latera, we do a lot of work on optimizing the performance of our Hydra billing system, which is not surprising given the size of our customers and the specifics of the telecom industry.

One of the examples in the article involves creating a table with intermediate results to speed up reports. The hardest part of this approach is, of course, keeping such a table up to date. Sometimes the task can be delegated to the DBMS, for example by using materialized views. But when the business logic for obtaining intermediate results is even slightly more complicated, the freshness of the denormalized data has to be maintained manually.
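For the simple cases, a materialized view can carry the whole burden. A minimal PostgreSQL-flavored sketch, with illustrative table and column names rather than our actual schema:

```sql
-- Yearly sales per client, stored physically and refreshed on demand
-- instead of being recalculated inside every report query.
CREATE MATERIALIZED VIEW client_sales_per_year AS
SELECT t.client_id,
       EXTRACT(YEAR FROM t.created_at) AS year,
       SUM(ps.price)                   AS total_sales
FROM product_sold ps
JOIN task t ON t.id = ps.task_id
GROUP BY t.client_id, EXTRACT(YEAR FROM t.created_at);

-- Run periodically or after bulk changes:
REFRESH MATERIALIZED VIEW client_sales_per_year;
```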

Hydra has an elaborate system of privileges for its users, the billing operators. Rights can be granted in several ways: specific actions can be allowed for a specific user, roles with different sets of rights can be prepared in advance, and a whole department can be given special privileges. Just imagine how slow access to any entity in the system would be if, every time, this whole chain had to be traversed to verify that "yes, this employee is allowed to sign contracts with legal entities" or "no, this operator does not have enough privileges to work with subscribers of the neighboring branch". Instead, we separately store a ready, aggregated list of effective rights for each user and update it whenever a change is made to the system that can affect this list. Employees move from one department to another far less often than they open the next subscriber in the billing interface, which means we have to recalculate the full set of their rights just as rarely.
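Schematically (with completely made-up table names that are not the actual Hydra schema), rebuilding such a flattened rights list for one user might look like this:

```sql
-- Rebuild the aggregated list of effective rights for a user whenever
-- his personal grants, roles, or department privileges change.
DELETE FROM user_effective_right WHERE user_id = :user_id;

INSERT INTO user_effective_right (user_id, right_code)
SELECT DISTINCT :user_id, src.right_code
FROM (
    SELECT ur.right_code                      -- rights granted directly
    FROM user_right ur
    WHERE ur.user_id = :user_id
    UNION
    SELECT rr.right_code                      -- rights coming from roles
    FROM user_role uro
    JOIN role_right rr ON rr.role_id = uro.role_id
    WHERE uro.user_id = :user_id
    UNION
    SELECT dr.right_code                      -- rights coming from the department
    FROM user_department ud
    JOIN department_right dr ON dr.department_id = ud.department_id
    WHERE ud.user_id = :user_id
) src;
```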

Of course, denormalizing the storage is only one of the measures we take. Some data should be cached directly in the application, but if intermediate results live, on average, much longer than user sessions, it makes sense to seriously consider denormalization to speed up reads.



Source: https://habr.com/ru/post/281262/

