📜 ⬆️ ⬇️

SQL Server: Performance when inserting data into a table with a clustered index and without

Hi, Habr! I present to you the translation of the article SQL Server Insert Performance for Clustered Indexes vs. Heap tables

Question


I have read many different documentation on modern SQL Server methods, stating that each table should have a clustered index and not be a bunch of non-clustered indexes. Most sources point out the administrative benefits of using clustered indexes. But is there any impact on performance and other positive or negative sides?

Decision


To check the performance of each option, we will create two identical tables with one difference: in one there will be a primary key created on the basis of a clustered index, and in the other a primary key will be created on the basis of a nonclustered index and the data will remain as a heap. Below is a script to create test tables and fill them with data.

Script 1
-- Create table and indexes CREATE TABLE testtable ([col1] [int] NOT NULL PRIMARY KEY CLUSTERED, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL, [col5] uniqueidentifier); -- Load sample data into table DECLARE @val INT SELECT @val=1 WHILE @val < 5000000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO -- Create sample table and indexes CREATE TABLE testtable2 ([col1] [int] NOT NULL PRIMARY KEY NONCLUSTERED, [col2] [int] NULL, [col3] [int] NULL, [col4] [varchar](50) NULL, [col5] uniqueidentifier); INSERT INTO testtable2 SELECT * FROM testtable; 


The first thing I noticed at once was that the heap used more space, since table and index are separate structures. With a clustered index, data is stored in the leaves of the index and therefore uses less space. Below is a table showing the amount of space used by each object at the end of the data load.
TableIndexUsed (KB)Reserved (KB)Number of lines
testtablePK__testtabl__357D0D3E3D086A662579522579924999999
testtable2HEAP2569922570324999999
testtable2PK__testtabl__357D0D3F2CBA35D889432896084999999

After the primary data has been loaded, we proceed to the second stage of loading and look at the performance using the SQL Profiler, and also check the used space. Below is a script that loads an additional 100,000 records into each table.
')
Script 2
 -- insert when all pages are full DECLARE @val INT SELECT @val=5000000 WHILE @val < 5100000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO DECLARE @val INT SELECT @val=5000000 WHILE @val < 5100000 BEGIN INSERT INTO testtable2 (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO 


To begin with, we will look at the used space, which, as expected (after all, the pages were filled to the end and nothing was deleted), increased proportionally for each object. Both tables increased in size about the same.
TableIndexUsed (KB)Reserved (KB)Number of lines
testtablePK__testtabl__357D0D3E3D086A662631282631765099999
testtable2HEAP2623922624725099999
testtable2PK__testtabl__357D0D3F2CBA35D891216912725099999

But the SQL Profiler shows us more interesting things. Since when inserting data into a heap, two objects need to be updated - a nonclustered index and the table itself, this also requires additional resources from the CPU, plus more read and write operations are performed at this moment than when inserting data into a table with a clustered index, therefore It will take a little longer.
Index typeCPU (ms)Read operationsWrite operationsDuration (ms)
Clustered350030491965411288
A pile389040608390411438

Now we will randomly delete some of the data and insert another 100,000 entries to check how the presence of white space on some pages will affect performance. The script that performs the removal and insertion is shown below.

Script 3
 -- remove 1000000 random records from each table DELETE FROM testtable WHERE col1 in (SELECT TOP 1000000 col1 FROM testtable ORDER BY newid()); DELETE FROM testtable2 WHERE col1 not in (SELECT col1 FROM testtable); GO -- insert when there is free space in pages DECLARE @val INT SELECT @val=5100000 WHILE @val < 5200000 BEGIN INSERT INTO testtable (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO DECLARE @val INT SELECT @val=5100000 WHILE @val < 5200000 BEGIN INSERT INTO testtable2 (col1, col2, col3, col4, col5) VALUES (@val,round(rand()*100000,0), round(rand()*100000,0),'TEST' + cast(@val AS VARCHAR), newid()) SELECT @val=@val+1 END GO 

This time, SQL Profiler shows an increase in the performance gap between the two tables in favor of a clustered index.
Index typeCPU (ms)Read operationsWrite operationsDuration (ms)
Clustered356230485965310334
A pile4973422142705313042


Most likely, this difference arose from the fact that when inserting records into a heap, the DBMS will look for empty space on each page to place data in it. The reason for this is the fact that the data in the heap is not sorted, which means that the inserted records can be placed anywhere. When it comes to a table with a clustered index, when increasing the value of the primary key, the insertion is always done at the end of the table. Also pay attention to the space used by the tables. In the case of a heap, when reusing space, the amount of space occupied by the table should not increase, because we inserted as much as we deleted. Below is information about the tables, recorded after the last script, in which it is clear that the space occupied by the heap has not changed and this confirms the earlier suggestion.
Table nameIndex nameUsed (KB)Reserved (KB)Number of lines
testtablePK__testtabl__357D0D3E3D086A662683042683604199999
testtable2HEAP2623922624724199999
testtable2PK__testtabl__357D0D3F2CBA35D893008930644199999

Conclusion


In this study, we have seen that using a clustered index, as compared to a heap, gives the best results for all the categories measured: CPU, I / O, and execution time. The only side effect is that using a clustered index requires a bit more space. This space can be reduced by rebuilding the index. In any case, before making any changes to the indexes on the production system, you need to check everything on your equipment to make sure that the performance has improved.

Source: https://habr.com/ru/post/340216/


All Articles