
Sparse columns in MS SQL Server: real-world experience

In the system of one of our fairly large clients, whose electronic document management system is used by more than 10,000 concurrent users every day, we applied so-called sparse columns.

This article is an attempt to gather the prerequisites for, and the results of, applying this functionality (and a few other DBMS settings) in one place.

Problems and background


A couple of words about the system before diving in: development of the product began in the 2000s, and it is still actively evolving. The product has a client-server architecture with several application servers.

Microsoft SQL Server DBMS is used as server side.
Given that the system is no longer "new", the database structure contains mechanisms/options/objects whose use today looks unjustified and outdated. They are gradually being phased out, but in places they are still in use.

Premise # 1


During the next performance audit, together with the customer, we noticed the rapid growth of one of the tables (let's call it table X). The volume of table X exceeded 350 GB (for reference, the entire database is about 2 TB). The actual data of the table and its indexes were distributed as follows:


That is, the situation is quite unusual: the indexes on the table are roughly twice as large as the data itself. This means significant overhead, which in turn negatively affects:
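A breakdown of data size versus index size can be obtained without third-party tools; a minimal sketch over sys.dm_db_partition_stats (heap/clustered index pages count as "data", everything else as nonclustered index overhead):

```sql
-- Compare the size of the data itself (heap or clustered index,
-- index_id 0 or 1) with the size of all nonclustered indexes, per table.
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1)
                THEN ps.used_page_count ELSE 0 END) * 8 / 1024 AS data_MB,
       SUM(CASE WHEN ps.index_id > 1
                THEN ps.used_page_count ELSE 0 END) * 8 / 1024 AS nonclustered_index_MB
FROM sys.dm_db_partition_stats AS ps
     INNER JOIN sys.objects AS o
         ON o.object_id = ps.object_id
WHERE o.is_ms_shipped = 0
GROUP BY ps.object_id
ORDER BY data_MB DESC;
```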


Premise number 2


Broadly speaking, DBMS operation can be described as follows: before processing, all data is loaded from disk into the buffer pool (cache). This reduces the number of disk operations and speeds up processing of the most frequently used data. The mechanism is described in more detail, for example, in the linked article. The efficiency of buffer pool usage can be monitored indirectly with the Page Life Expectancy counter, the lifetime of a page in the buffer pool.
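For reference, the current counter value can be read directly from the DBMS; a minimal sketch (the exact object_name prefix depends on the instance name, hence the LIKE):

```sql
-- Page Life Expectancy, reported per buffer node.
-- object_name is padded with trailing spaces, so we match with LIKE.
SELECT [object_name],
       counter_name,
       cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Manager%';
```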

Over an interval of several months we observed a negative trend: the page lifetime in the buffer pool kept decreasing.

Inaction could lead to:


As a result, it was decided to conduct an analysis of the causes.

Since the system was not static during those months and was constantly being modified, we decided to start the analysis with the contents of the buffer pool. For this we used the dynamic management view sys.dm_os_buffer_descriptors.

Request example:

SELECT indexes.name AS index_name,
       objects.name AS object_name,
       objects.type_desc AS object_type_description,
       COUNT(*) AS buffer_cache_pages,
       COUNT(*) * 8 / 1024 AS buffer_cache_used_MB
FROM sys.dm_os_buffer_descriptors
     INNER JOIN sys.allocation_units
         ON allocation_units.allocation_unit_id = dm_os_buffer_descriptors.allocation_unit_id
     INNER JOIN sys.partitions
         ON ((allocation_units.container_id = partitions.hobt_id AND type IN (1, 3))
             OR (allocation_units.container_id = partitions.partition_id AND type IN (2)))
     INNER JOIN sys.objects
         ON partitions.object_id = objects.object_id
     INNER JOIN sys.indexes
         ON objects.object_id = indexes.object_id
            AND partitions.index_id = indexes.index_id
WHERE allocation_units.type IN (1, 2, 3)
  AND objects.is_ms_shipped = 0
  AND dm_os_buffer_descriptors.database_id = DB_ID()
GROUP BY indexes.name, objects.name, objects.type_desc
ORDER BY COUNT(*) DESC;


Out of a buffer pool of about 185 GB, some 80-90 GB was occupied by data of the clustered index of our problem table X. The rest of the buffer pool was distributed fairly evenly among the other indexes. It followed that the maximum effect could be obtained by somehow optimizing the data of table X (specifically, its clustered index).

Premise number 3


Practice shows that accumulating large volumes of data in a single table will sooner or later affect the performance of at least some of the operations on that table, if not all of them. The situation worsens nonlinearly as the number of columns in the table grows.

In addition, when we analyzed how table X is actually populated, we saw the following picture: in almost every row only a certain subset of columns was filled (this is how system flexibility and adaptation to specific business requirements are achieved). In essence, this again leads to low efficiency of data storage and processing: some cells store no information, yet space is still reserved for them (for example, adding an empty column of type int increases the storage cost of the table by at least [4 bytes * number of rows in the table]).

Solutions / Corrections


Taking into account all the inputs presented above, we identified four directions for further work:


Filtered Indexes


Here is what the official documentation tells us:
“A filtered index is an optimized nonclustered index, particularly suitable for queries that select from a well-defined subset of data... A well-designed filtered index improves query performance and also reduces index maintenance and storage costs compared with full-table indexes.”

To put it a bit more simply, we are talking about the ability to build an index over only part of the table's data; for example, we could create an index in table X for a specific business case.
However, to apply such an index we would have needed the new version of the software, in which the database structure had been changed. In particular, in the new version the SET options of the client software's database connections were switched to ON (filtered indexes require specific SET options, such as ANSI_NULLS and QUOTED_IDENTIFIER, to be ON):
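For illustration, such an index might look like this (table and column names are hypothetical, not taken from the real system):

```sql
-- Filtered indexes require these SET options to be ON at creation time.
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;

-- Hypothetical business case: only unprocessed documents are queried,
-- so only rows matching the WHERE predicate are indexed.
CREATE NONCLUSTERED INDEX IX_X_Unprocessed
ON dbo.X (DocumentDate, DocumentType)
WHERE IsProcessed = 0;
```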


In our case, however, the update was planned only six months later, and we could not wait that long. Moreover, using filtered indexes was not intended anyway, since it would have rendered, for example, the forced parameterization option ineffective (the optimizer does not match a filtered index against a parameterized predicate).

Data compression


Since the client runs DBMS version 2012, two types of data compression are available in this version:


Looking at SQL Server 2016, there are some changes, but in our case they were irrelevant as well (migrating to the next SQL version on powerful hardware is quite expensive from a financial point of view). So we took a closer look at the first two options.

According to the documentation, page-level compression is more CPU-intensive than row-level compression, so the page-level option was ruled out immediately.

Next came an attempt to use row compression, but in the documentation we also stumbled upon a note that even it consumes additional resources. Since the CPU was a very scarce resource in our case, this option also had to be abandoned.
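Had we gone down this route, the potential gain can be estimated up front without actually compressing anything, using the standard procedure (schema and table names here are placeholders):

```sql
-- Estimate the space ROW compression would save for the clustered index
-- (index_id = 1) of a table; nothing is modified by this call.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'X',
     @index_id         = 1,
     @partition_number = NULL,
     @data_compression = 'ROW';
```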

Sparse columns


Sparse columns are ordinary columns with storage optimized for NULL values. Sparse columns reduce the space needed to store NULLs, but retrieving non-NULL values becomes more expensive.
For a net gain, each candidate column must contain a certain percentage of NULL values. The threshold depends on the column's data type, for example:

Data type | Bytes, nonsparse | Bytes, sparse | NULL percentage
float     | 8                | 12            | 52%
datetime  | 8                | 12            | 52%
varchar   | 8                | 12            | 52%
int       | 4                | 8             | 64%

Moreover, not every column can be converted to sparse. The list of limitations and incompatibilities is given in the official documentation.

That is, to assess the feasibility of conversion to sparse, broadly speaking we had to analyze:


A query helped with this; its text is given below. On large volumes the query takes a long time, so it is recommended to specify the concrete table to analyze.

CREATE TABLE #temp (
    ColumnName      varchar(50),
    ColumnID        int,
    TableName       varchar(50),
    TableId         int,
    TypeName        varchar(50),
    IsSparse        bit,
    IsNullable      bit,
    NumberOfRow     bigint,
    NumberOfRowNULL bigint,
    Ratio           int
);

SET NOCOUNT ON;

-- Collect nullable, non-sparse columns of the target table whose types and
-- properties are compatible with sparse (no computed/identity/filestream etc.).
INSERT INTO #temp
SELECT DISTINCT
    sys.columns.name                   AS ColumnName,
    sys.columns.column_id              AS ColumnID,
    OBJECT_NAME(sys.columns.object_id) AS TableName,
    sys.columns.object_id              AS TableID,
    CASE systypes.name WHEN 'sysname' THEN 'nvarchar' ELSE systypes.name END AS TypeName,
    sys.columns.is_sparse,
    sys.columns.is_nullable,
    0, 0, 0
FROM sys.columns (NOLOCK)
     INNER JOIN systypes (NOLOCK)
         ON systypes.xtype = sys.columns.system_type_id
WHERE sys.columns.object_id = OBJECT_ID('my_table')  -- change the table name
  AND systypes.name NOT IN ('geography', 'geometry', 'image', 'ntext', 'text', 'timestamp')
  AND sys.columns.is_sparse = 0
  AND sys.columns.is_nullable = 1
  AND sys.columns.is_rowguidcol = 0
  AND sys.columns.is_identity = 0
  AND sys.columns.is_computed = 0
  AND sys.columns.is_filestream = 0
  AND sys.columns.default_object_id = 0
  AND sys.columns.rule_object_id = 0
  AND sys.columns.system_type_id = sys.columns.user_type_id;

-- Exclude columns that participate in the clustered index or primary key.
DELETE tps
FROM #temp tps
WHERE EXISTS (
    SELECT 'Exists'
    FROM sys.indexes i
         INNER JOIN sys.index_columns ic
             ON ic.object_id = i.object_id
                AND ic.index_id = i.index_id
    WHERE i.object_id = tps.TableId
      AND ic.column_id = tps.ColumnID
      AND (i.type = 1 OR i.is_primary_key = 1));

-- Exclude tables that already use data compression.
DELETE tps
FROM #temp tps
     INNER JOIN sys.partitions p
         ON p.object_id = tps.TableId
WHERE p.data_compression <> 0;

-- For every remaining column, count total rows and NULL rows.
DECLARE @TableName  nvarchar(1000),
        @ColumnName nvarchar(1000),
        @vQuery     nvarchar(1000),
        @result1    bigint,
        @result2    bigint;

DECLARE tables_cursor CURSOR FAST_FORWARD FOR
    SELECT TableName, ColumnName FROM #temp;

OPEN tables_cursor;
FETCH NEXT FROM tables_cursor INTO @TableName, @ColumnName;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Total number of rows in the table
    SET @vQuery = N'SELECT @result1 = COUNT_BIG(*) FROM [' + @TableName + N']';
    EXEC sp_executesql @vQuery, N'@result1 bigint OUTPUT', @result1 = @result1 OUTPUT;

    -- Number of rows where the column is NULL
    SET @vQuery = N'SELECT @result2 = COUNT_BIG(*) FROM [' + @TableName
                + N'] WHERE [' + @ColumnName + N'] IS NULL';
    EXEC sp_executesql @vQuery, N'@result2 bigint OUTPUT', @result2 = @result2 OUTPUT;

    UPDATE #temp
    SET NumberOfRow = @result1,
        NumberOfRowNULL = @result2,
        Ratio = @result2 * 100 / NULLIF(@result1, 0)
    WHERE ColumnName = @ColumnName AND TableName = @TableName;

    FETCH NEXT FROM tables_cursor INTO @TableName, @ColumnName;
END;
CLOSE tables_cursor;
DEALLOCATE tables_cursor;

--DELETE FROM #temp WHERE Ratio > 10
SELECT * FROM #temp;
--DROP TABLE #temp

Next, from the resulting list we need to pick the columns that best satisfy the conditions (maximum share of NULL values) and convert them to sparse. The change itself is best performed with the database in single_user mode to avoid long locks. After converting a column to sparse, the indexes must be rebuilt; only then does the reduction in table size become visible.
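The conversion itself is a one-liner per column; a sketch with hypothetical database, table and column names, following the steps just described:

```sql
-- Run in a maintenance window: switch to single_user to avoid long locks.
ALTER DATABASE MyDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Convert the column to sparse storage.
ALTER TABLE dbo.X ALTER COLUMN OptionalAttribute ADD SPARSE;

-- The space is reclaimed only after the indexes are rebuilt.
ALTER INDEX ALL ON dbo.X REBUILD;

ALTER DATABASE MyDb SET MULTI_USER;
```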

Let me emphasize that using this mechanism does not increase the CPU load on the DBMS server (verified both in production and in load testing).

Perhaps this is one of those rare situations that can be briefly depicted with the following pictures:



Splitting a table into smaller ones


In this case, the work could also be divided into blocks:


Both items are quite expensive, so splitting the table came last in the list of possible optimizations, and it could be postponed indefinitely.

Incidentally, implementing this item has not been necessary so far. Whether that is good or bad for the evolution of the product as a whole, I think time will tell...

Sparse effect


Well, instead of conclusions, here are the positive effects of using sparse columns, in numbers:

  1. the clustered index of table X shrank by roughly 2 times (re-creating the nonclustered indexes whose key fields were converted to sparse gave a similar reduction in size);

  2. as a consequence of item 1, buffer pool usage became more efficient, since table X now occupies less of the buffer pool;


  3. as a consequence of items 1-2, the page lifetime in the buffer pool increased (blue line) and, as a result, the load on the disks decreased (yellow line);


  4. the duration of some operations that read large amounts of data decreased, because the probability of finding the data in the buffer pool increased.

Summarizing the results in a table:

Indicator | Improvement | Note
Clustered index volume of table X, GB | 2 times | On disk (HDD)
Clustered index size of table X in the buffer pool, GB | 2 times | In memory (RAM)
Page Life Expectancy, sec | 2 times | Lifetime of a page in the buffer pool
Disk Transfers/sec, IOPS | 1.6 times | Number of disk operations; the load on the storage system decreased

Sparse restrictions


However, in addition to the advantages, there were also restrictions:

  1. sparse columns need to be revisited periodically: over time the distribution of NULL and non-NULL values in a sparse column may change, making sparse storage unjustified;
  2. the number of columns that can be converted to sparse is limited; if the row size limit is exceeded, error 576 may occur when updating rows.
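The first restriction can be covered by a periodic check of the NULL share in the converted columns; a sketch with placeholder table and column names:

```sql
-- Share of NULLs in a sparse column: COUNT(col) counts only non-NULL values,
-- so the difference from COUNT(*) gives the NULL row count.
SELECT COUNT_BIG(*)                                 AS total_rows,
       COUNT_BIG(*) - COUNT_BIG(OptionalAttribute)  AS null_rows,
       (COUNT_BIG(*) - COUNT_BIG(OptionalAttribute)) * 100
           / NULLIF(COUNT_BIG(*), 0)                AS null_percent
FROM dbo.X;
```

If null_percent drops below the threshold for the column's data type (see the table above), keeping the column sparse is no longer justified.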

A detailed list of restrictions can be found here.

If you have read this far: before applying any of the settings described above, remember that the author of this article bears no responsibility for any possible loss of or damage to your data. Do not forget to make backups.

Perhaps someone has already applied the settings described here on high-load systems; please share what results you got.

Source: https://habr.com/ru/post/342838/

