SQL Server: Performance when inserting data into a table with a clustered index and without

Hi, Habr! I present to you the translation of the article SQL Server Insert Performance for Clustered Indexes vs. Heap tables

Question

I have read many different documentation on modern SQL Server methods, stating that each table should have a clustered index and not be a bunch of non-clustered indexes. Most sources point out the administrative benefits of using clustered indexes. But is there any impact on performance and other positive or negative sides?

Decision

To check the performance of each option, we will create two identical tables with one difference: in one there will be a primary key created on the basis of a clustered index, and in the other a primary key will be created on the basis of a nonclustered index and the data will remain as a heap. Below is a script to create test tables and fill them with data.

Script 1

-- Create table and indexes CREATE TABLE testtable ([col1] [int] NOT NULL PRIMARY KEY CLUSTERED, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL, [col5] uniqueidentifier); -- Load sample data into table DECLARE @val INT SELECT @val=1 WHILE @val < 5000000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO -- Create sample table and indexes CREATE TABLE testtable2 ([col1] [int] NOT NULL PRIMARY KEY NONCLUSTERED, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL, [col5] uniqueidentifier); INSERT INTO testtable2 SELECT * FROM testtable;

The first thing I noticed at once was that the heap used more space, since table and index are separate structures. With a clustered index, data is stored in the leaves of the index and therefore uses less space. Below is a table showing the amount of space used by each object at the end of the data load.

Table	Index	Used (KB)	Reserved (KB)	Number of lines
testtable	PK__testtabl__357D0D3E3D086A66	257952	257992	4999999
testtable2	HEAP	256992	257032	4999999
testtable2	PK__testtabl__357D0D3F2CBA35D8	89432	89608	4999999

After the primary data has been loaded, we proceed to the second stage of loading and look at the performance using the SQL Profiler, and also check the used space. Below is a script that loads an additional 100,000 records into each table.
')

Script 2

 -- insert when all pages are full DECLARE @val INT SELECT @val=5000000 WHILE @val < 5100000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO DECLARE @val INT SELECT @val=5000000 WHILE @val < 5100000 BEGIN INSERT INTO testtable2 (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO

To begin with, we will look at the used space, which, as expected (after all, the pages were filled to the end and nothing was deleted), increased proportionally for each object. Both tables increased in size about the same.

Table	Index	Used (KB)	Reserved (KB)	Number of lines
testtable	PK__testtabl__357D0D3E3D086A66	263128	263176	5099999
testtable2	HEAP	262392	262472	5099999
testtable2	PK__testtabl__357D0D3F2CBA35D8	91216	91272	5099999

But the SQL Profiler shows us more interesting things. Since when inserting data into a heap, two objects need to be updated - a nonclustered index and the table itself, this also requires additional resources from the CPU, plus more read and write operations are performed at this moment than when inserting data into a table with a clustered index, therefore It will take a little longer.

Index type	CPU (ms)	Read operations	Write operations	Duration (ms)
Clustered	3500	304919	654	11288
A pile	3890	406083	904	11438

Now we will randomly delete some of the data and insert another 100,000 entries to check how the presence of white space on some pages will affect performance. The script that performs the removal and insertion is shown below.

Script 3

 -- remove 1000000 random records from each table DELETE FROM testtable WHERE col1 in (SELECT TOP 1000000 col1 FROM testtable ORDER BY newid()); DELETE FROM testtable2 WHERE col1 not in (SELECT col1 FROM testtable); GO -- insert when there is free space in pages DECLARE @val INT SELECT @val=5100000 WHILE @val < 5200000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO DECLARE @val INT SELECT @val=5100000 WHILE @val < 5200000 BEGIN INSERT INTO testtable2 (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO

This time, SQL Profiler shows an increase in the performance gap between the two tables in favor of a clustered index.

Index type	CPU (ms)	Read operations	Write operations	Duration (ms)
Clustered	3562	304859	653	10334
A pile	4973	422142	7053	13042

Most likely, this difference arose from the fact that when inserting records into a heap, the DBMS will look for empty space on each page to place data in it. The reason for this is the fact that the data in the heap is not sorted, which means that the inserted records can be placed anywhere. When it comes to a table with a clustered index, when increasing the value of the primary key, the insertion is always done at the end of the table. Also pay attention to the space used by the tables. In the case of a heap, when reusing space, the amount of space occupied by the table should not increase, because we inserted as much as we deleted. Below is information about the tables, recorded after the last script, in which it is clear that the space occupied by the heap has not changed and this confirms the earlier suggestion.

Table name	Index name	Used (KB)	Reserved (KB)	Number of lines
testtable	PK__testtabl__357D0D3E3D086A66	268304	268360	4199999
testtable2	HEAP	262392	262472	4199999
testtable2	PK__testtabl__357D0D3F2CBA35D8	93008	93064	4199999

Conclusion

In this study, we have seen that using a clustered index, as compared to a heap, gives the best results for all the categories measured: CPU, I / O, and execution time. The only side effect is that using a clustered index requires a bit more space. This space can be reduced by rebuilding the index. In any case, before making any changes to the indexes on the production system, you need to check everything on your equipment to make sure that the performance has improved.

Source: https://habr.com/ru/post/340216/

All Articles

SQL Server: Performance when inserting data into a table with a clustered index and without

Question

Decision

Conclusion

More articles: