
Why database denormalization is necessary, and when to use it



In our blog on Habr, we not only talk about the development of our product, the Hydra billing system for telecom operators, but also publish materials about working with infrastructure and using various technologies.

Recently, we wrote about the use of Clojure and MongoDB, and today we will talk about the pros and cons of denormalizing databases. Database developer and financial analyst Emil Drkušić wrote on Vertabelo's blog about why, how, and when to use this approach. We present the main points of that article below.

What is denormalization?


This term usually refers to a strategy applied to an already normalized database in order to increase its performance. The idea is to place redundant data where it can bring the most benefit. To do this, you can add extra fields to existing tables, add new tables, or even create new instances of existing tables. The goal is to reduce the execution time of certain queries by simplifying data access or by creating tables that hold precomputed report results built from the source data.

A prerequisite for denormalization is having a normalized database in the first place. It is important to distinguish between a database that was never normalized at all and a normalized database that was later denormalized. The second case is fine; the first indicates design errors or a lack of knowledge on the part of the specialists who built it.

Consider the normalized model for the simplest CRM system:



Let's go over the tables in this model:


In this example, the database is greatly simplified for clarity. It is easy to see that it is perfectly normalized: there is no redundancy, and everything should run like clockwork. Performance problems do not appear until the database encounters a large amount of data.

When denormalization is useful


Before denormalizing what has already been normalized, you naturally need a clear understanding of why it is necessary. Make sure the benefits of applying the technique outweigh the possible negative effects. Here are a few situations in which you should definitely consider denormalization.

  1. Preserving historical data. Data changes over time, but it may be necessary to preserve the values that were current at the moment a record was created. For example, a client's first and last name, as well as information about their place of residence and occupation, may change. A task must keep the field values that were valid at the moment it was created; otherwise past data cannot be reconstructed correctly. You could solve the problem by adding a table with a history of changes, but then a SELECT query returning the task together with the client's name at that time becomes more complicated, so an extra table is not necessarily the best way out.
  2. Increasing query performance. Some queries may need several tables to reach frequently requested data. An example is a situation where up to 10 tables have to be joined just to obtain the client's name and the names of the goods sold to them, and some of those tables may in turn contain large amounts of data. In this scenario, it would be wise to add a client_id field directly to the product_sold table (see the sketch after this list).
  3. Speeding up report generation. Businesses often need to export certain statistics. Building reports on "live" data can take a long time, and the performance of the whole system may drop while it happens. For example, suppose you want to track customer sales over a certain period, either for a given group of customers or for all of them at once. A query that solves this task against the production database would have to churn through it in its entirety before such a report could be produced. It is not hard to imagine how much slower everything becomes if such reports are needed daily.
  4. Precalculating frequently requested values. It is often worth keeping the most frequently requested values precalculated and ready, rather than recomputing them from scratch in real time on every request.

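As an illustration of item 2, here is a minimal sketch of the kind of query that denormalization shortens. The table and column names (quantity, name, and so on) are assumptions made for the example and may differ from the actual model:

```sql
-- Normalized model: the client name is reachable only through a chain of joins.
SELECT c.client_name, p.name AS product_name, ps.quantity
FROM product_sold ps
JOIN task    t ON t.id = ps.task_id
JOIN client  c ON c.id = t.client_id
JOIN product p ON p.id = ps.product_id;

-- Denormalized model: with client_id copied into product_sold,
-- one join (through the potentially large task table) disappears.
SELECT c.client_name, p.name AS product_name, ps.quantity
FROM product_sold ps
JOIN client  c ON c.id = ps.client_id
JOIN product p ON p.id = ps.product_id;
```

The trade-off is that client_id in product_sold now has to be kept consistent with the task it belongs to.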
The conclusion suggests itself: you should not turn to denormalization if the application has no performance problems. But if you can feel that the system has slowed down or will soon slow down, it is time to consider this technique. However, before resorting to it, it is worth exhausting the other options for improving performance: query optimization and proper indexing.

Not so smooth


The obvious goal of denormalization is to increase performance. But everything has its price. In this case, the price consists of the following items:


Denormalization by example


Some of the denormalization rules described above have been applied to the model shown here. New blocks are marked in blue, changed ones in pink.



What has changed and why?



The only new element in the product table is the units_in_stock column. In the normalized model we could calculate this value as: units ordered - units sold - (units offered) - units written off. The calculation would have to be repeated every time a customer asks about a product, which is fairly time-consuming. Instead, the value can be computed in advance so that it is ready by the time a request comes in from the buyer. On the other hand, the units_in_stock attribute has to be updated after every insert, update, or delete in the products_on_order, writeoff, product_offered, and product_sold tables.
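As a rough sketch of what keeping units_in_stock current might look like, here is a recalculation statement for a single product. The quantity columns and the :product_id parameter are assumptions; in practice this logic would typically live in triggers or in the application code that touches the source tables:

```sql
-- Recompute the precalculated stock level for one product from the source data.
UPDATE product p
SET units_in_stock =
      COALESCE((SELECT SUM(o.quantity) FROM products_on_order o
                WHERE o.product_id = p.id), 0)   -- units ordered
    - COALESCE((SELECT SUM(s.quantity) FROM product_sold s
                WHERE s.product_id = p.id), 0)   -- units sold
    - COALESCE((SELECT SUM(w.quantity) FROM writeoff w
                WHERE w.product_id = p.id), 0)   -- units written off
    -- product_offered could be subtracted here as well,
    -- if offered goods are treated as reserved stock.
WHERE p.id = :product_id;
```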



Two new attributes have been added to the task table: client_name and user_first_last_name. Both store the values as they were at the moment the task was created, which is necessary because each of them can change over time. The foreign keys linking the task to the original client and user records are kept as well. There are other values that may need to be stored in the same way, for example the client's address or information about taxes included in the price, such as VAT.
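A minimal sketch of how those values might be frozen at creation time is shown below; the column names and the "user" table name are assumptions for illustration:

```sql
-- Copy the client and user names into the task at the moment it is created,
-- alongside the regular foreign keys.
INSERT INTO task (client_id, user_id, client_name, user_first_last_name, created_at)
VALUES (
    :client_id,
    :user_id,
    (SELECT client_name FROM client WHERE id = :client_id),
    (SELECT first_name || ' ' || last_name FROM "user" WHERE id = :user_id),
    CURRENT_TIMESTAMP
);
```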



The denormalized product_offered table received two new attributes: price_per_unit and price. The first stores the price that was current at the moment the goods were offered; the normalized model would only show the price as it is now, so as soon as it changes, the "price history" would be rewritten. This change does not just speed up the database, it also improves functionality. The price column holds the precalculated value units_sold * price_per_unit, so there is no need to do the multiplication every time someone wants to look at the list of offered goods. This is a small price to pay for the gain.

Changes to the product_sold table were made for the same reasons, the only difference being that in this case we are dealing with items actually sold.
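Schematically, inserting an offer line might then look like the sketch below; the current-price column in product and the :units parameter are assumptions, and the same pattern would apply to product_sold:

```sql
-- Freeze the unit price valid at the moment of the offer and
-- precalculate the line total.
INSERT INTO product_offered (task_id, product_id, units_sold, price_per_unit, price)
SELECT :task_id,
       p.id,
       :units,
       p.price,            -- current catalogue price, captured here
       :units * p.price    -- precalculated line total
FROM product p
WHERE p.id = :product_id;
```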



The statistics_per_year table (annual statistics) is a completely new element of the example model. It is essentially a denormalized table, since all of its data can be calculated from other tables. It stores information about the current tasks, successfully completed tasks, meetings, and calls for each client, as well as the total amount charged per year. After any insert, update, or delete in the task, meeting, call, and product_sold tables, this data has to be recalculated for the affected client and the corresponding year. Since changes most likely concern only the current year, reports for previous years can remain untouched. The values in this table are calculated in advance, so we save time and resources whenever the results are needed.
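A hedged sketch of such a recalculation for one client and one year follows; the column names in statistics_per_year (tasks_completed, calls, total_amount) and the date columns are assumptions:

```sql
-- Refresh the precalculated yearly row after a change in task, meeting,
-- call, or product_sold.
UPDATE statistics_per_year s
SET tasks_completed = (SELECT COUNT(*)
                       FROM task t
                       WHERE t.client_id = s.client_id
                         AND t.is_completed = TRUE
                         AND EXTRACT(YEAR FROM t.created_at) = s.year),
    calls           = (SELECT COUNT(*)
                       FROM call c
                       WHERE c.client_id = s.client_id
                         AND EXTRACT(YEAR FROM c.call_start) = s.year),
    total_amount    = COALESCE((SELECT SUM(ps.price)
                                FROM product_sold ps
                                JOIN task t ON t.id = ps.task_id
                                WHERE t.client_id = s.client_id
                                  AND EXTRACT(YEAR FROM t.created_at) = s.year), 0)
WHERE s.client_id = :client_id
  AND s.year = :year;
```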

Denormalization is a powerful approach, though it should not be resorted to every time a performance problem appears. In some cases, however, it may be the best, or even the only, solution.

However, before making a final decision to use denormalization, make sure it is really necessary. Analyze the current performance of the system; denormalization is often applied after the system has already gone into production. There is nothing to be afraid of here, but all changes should be carefully monitored and documented; then problems and data anomalies should not arise.

Our experience


At Latera, we do a lot of work on optimizing the performance of our Hydra billing system, which is not surprising given the size of our customers and the specifics of the telecom industry.

One of the examples in the article involves creating a table with intermediate results to speed up reports. The hardest part of this approach is, of course, keeping such a table up to date. Sometimes the task can be delegated to the DBMS, for example by using materialized views. But when the business logic for obtaining intermediate results is even slightly more complicated, the freshness of the denormalized data has to be maintained manually.
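For the simple cases, a materialized view can carry the whole burden. A minimal PostgreSQL-flavored sketch, with illustrative table and column names rather than our actual schema:

```sql
-- Yearly sales per client, stored physically and refreshed on demand
-- instead of being recalculated inside every report query.
CREATE MATERIALIZED VIEW client_sales_per_year AS
SELECT t.client_id,
       EXTRACT(YEAR FROM t.created_at) AS year,
       SUM(ps.price)                   AS total_sales
FROM product_sold ps
JOIN task t ON t.id = ps.task_id
GROUP BY t.client_id, EXTRACT(YEAR FROM t.created_at);

-- Run periodically or after bulk changes:
REFRESH MATERIALIZED VIEW client_sales_per_year;
```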

Hydra has an elaborate system of privileges for its users, the billing operators. Rights can be granted in several ways: specific actions can be allowed for a specific user, roles with different sets of rights can be prepared in advance, and a whole department can be given special privileges. Just imagine how slow access to any entity in the system would be if, every time, this whole chain had to be traversed to verify that "yes, this employee is allowed to sign contracts with legal entities" or "no, this operator does not have enough privileges to work with subscribers of the neighboring branch". Instead, we separately store a ready, aggregated list of effective rights for each user and update it whenever a change is made to the system that can affect this list. Employees move from one department to another far less often than they open the next subscriber in the billing interface, which means we have to recalculate the full set of their rights just as rarely.
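Schematically (with completely made-up table names that are not the actual Hydra schema), rebuilding such a flattened rights list for one user might look like this:

```sql
-- Rebuild the aggregated list of effective rights for a user whenever
-- his personal grants, roles, or department privileges change.
DELETE FROM user_effective_right WHERE user_id = :user_id;

INSERT INTO user_effective_right (user_id, right_code)
SELECT DISTINCT :user_id, src.right_code
FROM (
    SELECT ur.right_code                      -- rights granted directly
    FROM user_right ur
    WHERE ur.user_id = :user_id
    UNION
    SELECT rr.right_code                      -- rights coming from roles
    FROM user_role uro
    JOIN role_right rr ON rr.role_id = uro.role_id
    WHERE uro.user_id = :user_id
    UNION
    SELECT dr.right_code                      -- rights coming from the department
    FROM user_department ud
    JOIN department_right dr ON dr.department_id = ud.department_id
    WHERE ud.user_id = :user_id
) src;
```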

Of course, denormalizing the storage is only one of the measures we take. Some data should be cached directly in the application, but if intermediate results live, on average, much longer than user sessions, it makes sense to seriously consider denormalization to speed up reads.



Source: https://habr.com/ru/post/281262/

