
Delayed Durability, or a story about how autotest execution was sped up from 11 to 2.5 minutes



Relatively recently I started helping out on a new product, a free web service for tracking working time.

The technology stack was deliberately kept small, and SQL Server 2014 Express was used as the data store. One of the first tasks I was given was to look into speeding up the autotests.

The project had existed for quite a while before I joined and had already accumulated a fair number of tests (at that point I counted 1,300 autotests). On a build machine with an SSD the tests took 4-5 minutes, and on a regular HDD more than 11-12 minutes. We could have equipped the whole team with SSDs, but that would not have addressed the root of the problem, especially since the functionality was about to be expanded and there would be even more tests.
All the tests were grouped, and before each group ran, the database was cleared of old data. Previously this was done by re-creating the database, but in practice that approach turned out to be very slow. It was much faster to simply clear all the tables and reset the IDENTITY values to zero, so that subsequent inserts would produce the correct test data. The script using this approach became my starting point:

EXEC sys.sp_msforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL'

DELETE FROM [dbo].[Project]
DBCC CHECKIDENT('[dbo].[Project]', RESEED, 0)
DBCC CHECKIDENT('[dbo].[Project]', RESEED)

DELETE FROM [dbo].[RecentWorkTask]
...

EXEC sys.sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'

The first thing that caught my eye was the disabling and re-enabling of all constraints on the database. This was meant to speed up the DELETE operations, but in practice it turned out to be a resource-intensive operation for the server, so I decided to abandon this approach.

The idea also immediately came up of using dynamic SQL to generate the cleanup script: if a table is referenced by foreign keys, delete its data with DELETE as before; otherwise the data can be removed with minimal logging using TRUNCATE.

As a result, the cleanup query took the following form:

DECLARE @SQL NVARCHAR(MAX) = ''

SELECT @SQL = (
    SELECT
        CASE WHEN p.[object_id] IS NOT NULL
            THEN
                CASE WHEN f.referenced_object_id IS NULL
                    THEN N'TRUNCATE TABLE '
                    ELSE N'DELETE FROM '
                END + obj_name
            ELSE ''
        END + CHAR(13) + CHAR(10) +
        CASE WHEN has_identity > 0
            THEN N'DBCC CHECKIDENT(''' + obj_name + N''', RESEED, 0) WITH NO_INFOMSGS'
            ELSE ''
        END + CHAR(13) + CHAR(10)
    FROM (
        SELECT obj_name = QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
             , o.[object_id]
             , has_identity = IdentityProperty(o.[object_id], 'LastValue')
        FROM sys.objects o
        JOIN sys.schemas s ON o.[schema_id] = s.[schema_id]
        WHERE o.is_ms_shipped = 0
            AND o.[type] = 'U'
    ) t
    LEFT JOIN (
        SELECT DISTINCT [object_id]
        FROM sys.partitions
        WHERE [rows] > 0
            AND index_id IN (0, 1)
    ) p ON t.[object_id] = p.[object_id]
    LEFT JOIN (
        SELECT DISTINCT referenced_object_id
        FROM sys.foreign_keys
    ) f ON f.referenced_object_id = t.[object_id]
    FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)')

PRINT @SQL
--EXEC sys.sp_executesql @SQL

Initially the autotests ran for 11 minutes on my machine:



After I threw out the ALTER statements and rewrote the query, all the tests ran about 40 seconds faster:



Of course, I could have quietly marked the task as Resolved, but the main problem remained:



While the tests were running, the disk was very heavily loaded. I decided to look at what the server was waiting on. To do that, I first cleared sys.dm_os_wait_stats:

 DBCC SQLPERF("sys.dm_os_wait_stats", CLEAR) 

I re-ran the autotests and then executed the following query:

SELECT TOP(20)
      wait_type
    , wait_time = wait_time_ms / 1000.
    , wait_resource = (wait_time_ms - signal_wait_time_ms) / 1000.
    , wait_signal = signal_wait_time_ms / 1000.
    , waiting_tasks_count
    , percentage = 100.0 * wait_time_ms / SUM(wait_time_ms) OVER ()
    , avg_wait = wait_time_ms / 1000. / waiting_tasks_count
    , avg_wait_resource = (wait_time_ms - signal_wait_time_ms) / 1000. / [waiting_tasks_count]
    , avg_wait_signal = signal_wait_time_ms / 1000.0 / waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE [waiting_tasks_count] > 0
    AND max_wait_time_ms > 0
    AND [wait_type] NOT IN (
        N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
        N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT',
        N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE',
        N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE', N'DBMIRROR_WORKER_QUEUE',
        N'DBMIRRORING_CMD', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE',
        N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',
        N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK',
        N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE',
        N'ONDEMAND_TASK_QUEUE', N'PWAIT_ALL_COMPONENTS_INITIALIZED',
        N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',
        N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
        N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK',
        N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP',
        N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED',
        N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK',
        N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP',
        N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR',
        N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_HOST_WAIT',
        N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE',
        N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT'
    )
ORDER BY [wait_time_ms] DESC

The biggest waits were on WRITELOG:

wait_type              wait_time  waiting_tasks_count  percentage
---------------------  ---------  -------------------  ----------
WRITELOG               546.798    60261                96.07
PAGEIOLATCH_EX         13.151     96                   2.31
PAGELATCH_EX           5.768      46097                1.01
PAGEIOLATCH_UP         1.243      86                   0.21
IO_COMPLETION          1.158      89                   0.20
MEMORY_ALLOCATION_EXT  0.480      683353               0.08
LCK_M_SCH_S            0.200      34                   0.03
ASYNC_NETWORK_IO       0.115      688                  0.02
LCK_M_S                0.082      10                   0.01
PAGEIOLATCH_SH         0.052      1                    0.00
PAGELATCH_UP           0.037      6                    0.00
SOS_SCHEDULER_YIELD    0.030      3598                 0.00

We open Paul Randal's "encyclopedia of wait types", find WRITELOG there, and also check MSDN:

The log management subsystem is waiting for the log to be flushed to disk. Typically this means the disk subsystem cannot keep up with the volume of log being written, but on heavily loaded systems it can also be caused by internal log flush limits. That may mean you should split the load across several databases, or even make your transactions a little longer to reduce the number of log flushes. To verify that the cause is in the I/O subsystem, use the DMV sys.dm_io_virtual_file_stats to examine the I/O latency for the log file and check whether it matches the average WRITELOG wait time. If WRITELOG waits are longer, you have internal contention on log writes and need to shard the workload. If not, investigate why you are generating so much transaction log.
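Following that advice, a rough way to look at log-file write latency is to query sys.dm_io_virtual_file_stats and compare the result with the average WRITELOG wait. This is only a sketch using standard DMV columns; what counts as "slow" depends on your hardware.

-- Sketch: average write latency per log file (ms per write).
-- Compare this with the average WRITELOG wait from sys.dm_os_wait_stats.
SELECT DB_NAME(vfs.database_id) AS database_name
     , mf.name AS [file]
     , vfs.num_of_writes
     , avg_write_ms = vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0)
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf ON mf.database_id = vfs.database_id
                        AND mf.[file_id] = vfs.[file_id]
WHERE mf.type_desc = 'LOG'
ORDER BY avg_write_ms DESC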

And what is there to investigate here? Every autotest writes something to the database. One possible fix for the WRITELOG waits would be to insert data not row by row but in larger chunks. But SQL Server 2014 introduced a new database-level option, Delayed Durability, that is, the ability not to flush data to disk immediately when a transaction commits.
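Just to illustrate the "larger chunks" alternative: wrapping the row-by-row inserts in a single transaction means the session waits on the log synchronously only once, at COMMIT, instead of once per row (intermediate Log Buffer flushes still happen, just not on the session's critical path). This is only a sketch; dbo.tbl here refers to the demo table created in the script further below.

-- Sketch: one synchronous log flush at COMMIT instead of ~5000 of them.
-- dbo.tbl is the demo table from the example later in this article.
BEGIN TRANSACTION

DECLARE @i INT = 1
WHILE @i < 5000 BEGIN
    INSERT INTO dbo.tbl (b, c)
    VALUES (@i, 'text')
    SELECT @i += 1
END

COMMIT TRANSACTION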

How does SQL Server modify data? Suppose we insert a new row. SQL Server calls the Storage Engine, which in turn goes to the Buffer Manager (the component that works with buffers in memory and with the disk) and says it wants to change data.

The Buffer Manager then goes to the Buffer Pool (an in-memory cache of all our data, stored page by page, 8KB per page) and modifies the required pages in memory. If those pages are not there yet, it loads them from disk. At the moment when a page has been changed in memory, SQL Server cannot yet report that the query is complete. Otherwise one of the ACID principles, namely Durability, would be violated: when a modification finishes, it must be guaranteed that the data will survive on disk.

After modifying the page in memory, the Storage Engine calls the Log Manager, which writes the change to the log. But it does not do so immediately: the records go through the Log Buffer, which is 60KB in size and is used to optimize performance when working with the log. Data is flushed from the buffer to the log file when:
- the Log Buffer becomes full;
- a transaction is committed (with full durability);
- sys.sp_flush_log is executed explicitly.

Once the data has been saved to the log, the modification is confirmed as committed and SQL Server notifies the client.

Notice that in this flow the data has not yet reached the data file. SQL Server writes to data files asynchronously. There are two mechanisms for this:

1) Lazy Writer, which runs periodically and checks whether SQL Server has enough memory. If not, pages are evicted from memory: the ones that have been modified are first flushed to disk and then dropped from memory.

2) Checkpoint, which scans for dirty pages roughly once a minute, writes them to disk and leaves them in memory (a small sketch of this is shown below).
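Here is a rough sketch of how you can watch this happen: sys.dm_os_buffer_descriptors shows which buffer pool pages are dirty, and a manual CHECKPOINT flushes them. Run it in a test database you do not mind touching.

-- Count dirty (modified but not yet written to the data file) pages
-- of the current database, force a checkpoint, then count again.
SELECT dirty_pages = COUNT(*)
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID()
    AND is_modified = 1

CHECKPOINT

SELECT dirty_pages = COUNT(*)
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID()
    AND is_modified = 1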

Now suppose the system runs lots of small transactions that, say, modify data row by row. After every commit the contents of the Log Buffer go to the transaction log. Remember that all log writes happen synchronously, and other transactions have to wait their turn.

Let me show this with an example:

USE [master]
GO
SET NOCOUNT ON

IF DB_ID('TT') IS NOT NULL BEGIN
    ALTER DATABASE TT SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    DROP DATABASE TT
END
GO
CREATE DATABASE TT
GO
ALTER DATABASE TT MODIFY FILE (NAME = N'TT', SIZE = 25MB, FILEGROWTH = 5MB)
GO
ALTER DATABASE TT MODIFY FILE (NAME = N'TT_log', SIZE = 25MB, FILEGROWTH = 5MB)
GO

USE TT
GO

CREATE TABLE dbo.tbl (
      a INT IDENTITY PRIMARY KEY
    , b INT
    , c CHAR(2000)
)
GO

IF OBJECT_ID('tempdb.dbo.#temp') IS NOT NULL
    DROP TABLE #temp
GO
SELECT t.[file_id], t.num_of_writes, t.num_of_bytes_written
INTO #temp
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) t

DECLARE @WaitTime BIGINT
      , @WaitTasks BIGINT
      , @StartTime DATETIME = GETDATE()
      , @LogRecord BIGINT = (
            SELECT COUNT_BIG(*)
            FROM sys.fn_dblog(NULL, NULL)
        )

SELECT @WaitTime = wait_time_ms
     , @WaitTasks = waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE [wait_type] = N'WRITELOG'

DECLARE @i INT = 1
WHILE @i < 5000 BEGIN
    INSERT INTO dbo.tbl (b, c)
    VALUES (@i, 'text')
    SELECT @i += 1
END

SELECT elapsed_seconds = DATEDIFF(MILLISECOND, @StartTime, GETDATE()) * 1. / 1000
     , wait_time = (wait_time_ms - @WaitTime) / 1000.
     , waiting_tasks_count = waiting_tasks_count - @WaitTasks
     , log_record = (
            SELECT COUNT_BIG(*) - @LogRecord
            FROM sys.fn_dblog(NULL, NULL)
       )
FROM sys.dm_os_wait_stats
WHERE [wait_type] = N'WRITELOG'

SELECT [file] = FILE_NAME(o.[file_id])
     , num_of_writes = t.num_of_writes - o.num_of_writes
     , num_of_mb_written = (t.num_of_bytes_written - o.num_of_bytes_written) * 1. / 1024 / 1024
FROM #temp o
CROSS APPLY sys.dm_io_virtual_file_stats(DB_ID(), NULL) t
WHERE o.[file_id] = t.[file_id]

Inserting 5,000 rows took about 42.5 seconds, of which 42 seconds were spent waiting on log writes.

elapsed_seconds  wait_time  waiting_tasks_count  log_record
---------------  ---------  -------------------  ----------
42.54            42.13      5003                 18748

Physically, SQL Server went to the log file about 5,000 times and wrote roughly 20MB in total:

file    num_of_writes  num_of_mb_written
------  -------------  -----------------
TT      79             8.72
TT_log  5008           19.65

This is exactly the situation Delayed Durability is meant for. When it is enabled, the log is written to disk only when the Log Buffer is full. You can enable Delayed Durability for the entire database:

ALTER DATABASE TT SET DELAYED_DURABILITY = FORCED
GO

or for individual transactions:

ALTER DATABASE TT SET DELAYED_DURABILITY = ALLOWED
GO

BEGIN TRANSACTION t

...

COMMIT TRANSACTION t WITH (DELAYED_DURABILITY = ON)

I enabled it at the database level and re-ran the script.

The waits disappeared, and the script ran on my machine in 170ms:

elapsed_seconds  wait_time  waiting_tasks_count  log_record
---------------  ---------  -------------------  ----------
0.17             0.00       0                    31958

This is because writes to the log became far less frequent:

file    num_of_writes  num_of_mb_written
------  -------------  -----------------
TT      46             9.15
TT_log  275            12.92

Of course, there is a fly in the ointment: the client is notified that the changes are committed before the data has physically reached the log file. In the event of a failure, we can lose as much committed data as was sitting in the buffer.
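If some moments do require durability, SQL Server provides sys.sp_flush_log, which flushes to disk every delayed durable transaction committed so far. This was not needed for the test scenario here; the call is shown only as a reference.

-- Flush all previously committed delayed durable transactions to the log file.
-- After this call returns, those transactions can no longer be lost in a crash.
EXEC sys.sp_flush_log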

In my case the test data did not need to survive a crash. The test database used by the autotests was switched to DELAYED_DURABILITY = FORCED, and the next run of all the tests finished in 2.5 minutes.
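If you later want to double-check which mode a database is running in, the current setting is exposed in sys.databases (assuming SQL Server 2014 or later):

-- DISABLED / ALLOWED / FORCED for every database on the instance
SELECT name, delayed_durability_desc
FROM sys.databases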



And the waits associated with writing to the log now had a minimal impact on performance:

wait_type            wait_time  waiting_tasks_count  percentage
-------------------  ---------  -------------------  ----------
PAGEIOLATCH_EX       16.031     61                   43.27
WRITELOG             15.454     787                  41.72
PAGEIOLATCH_UP       2.210      36                   5.96
PAGEIOLATCH_SH       1.472      2                    3.97
LCK_M_SCH_M          0.756      9                    2.04
ASYNC_NETWORK_IO     0.464      735                  1.25
PAGELATCH_UP         0.314      8                    0.84
SOS_SCHEDULER_YIELD  0.154      2759                 0.41
PAGELATCH_EX         0.154      44785                0.41
LCK_M_SCH_S          0.021      7                    0.05
PAGELATCH_SH         0.011      378                  0.02

Let's sum up a few points about Delayed Durability:
- Delayed Durability defers log flushes until the Log Buffer fills up, so lots of small transactions stop queueing on WRITELOG.
- It can be enabled for the whole database (FORCED) or allowed per transaction (ALLOWED plus COMMIT ... WITH (DELAYED_DURABILITY = ON)).
- The price is durability: on a failure you can lose the transactions that were still sitting in the buffer.
- It is a good fit for data you can afford to lose, such as a test database; in my case it cut the autotest run from 11 minutes down to 2.5.

UPDATE 2016-07-12: the data deletion script has been made more functional:
DECLARE @SQL NVARCHAR(MAX)
      , @FK_TurnOff NVARCHAR(MAX)
      , @FK_TurnOn NVARCHAR(MAX)

SELECT @SQL = (
    SELECT
        CHAR(13) + CHAR(10) +
        IIF(p.[rows] > 0,
            IIF(t2.referenced_object_id IS NULL, N'TRUNCATE TABLE ', N'DELETE FROM ') + obj_name,
            ''
        ) + CHAR(13) + CHAR(10) +
        IIF(IdentityProperty(t.[object_id], 'LastValue') > 0,
            N'DBCC CHECKIDENT(''' + obj_name + N''', RESEED, 0) WITH NO_INFOMSGS',
            ''
        )
    FROM (
        SELECT obj_name = QUOTENAME(s.name) + '.' + QUOTENAME(o.name), o.[object_id]
        FROM sys.objects o
        JOIN sys.schemas s ON o.[schema_id] = s.[schema_id]
        WHERE o.is_ms_shipped = 0
            AND o.[type] = 'U'
            AND o.name != N'__MigrationHistory'
    ) t
    JOIN sys.partitions p ON t.[object_id] = p.[object_id] AND p.index_id IN (0, 1)
    LEFT JOIN (
        SELECT DISTINCT f.referenced_object_id
        FROM sys.foreign_keys f
    ) t2 ON t2.referenced_object_id = t.[object_id]
    FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)')

SELECT @FK_TurnOff = CAST(x.query('off/text()') AS NVARCHAR(MAX))
     , @FK_TurnOn = CAST(x.query('on/text()') AS NVARCHAR(MAX))
FROM (
    SELECT [off] = CHAR(10) + 'ALTER TABLE ' + obj + ' NOCHECK CONSTRAINT ' + fk
         , [on] = CHAR(10) + 'ALTER TABLE ' + obj + ' CHECK CONSTRAINT ' + fk
    FROM (
        SELECT fk = QUOTENAME(f.name)
             , obj = QUOTENAME(SCHEMA_NAME(f.[schema_id])) + '.' + QUOTENAME(OBJECT_NAME(f.parent_object_id))
        FROM sys.foreign_keys f
        WHERE f.delete_referential_action = 0
            AND EXISTS(
                    SELECT *
                    FROM sys.partitions p
                    WHERE p.[object_id] = f.parent_object_id
                        AND p.[rows] > 0
                        AND p.index_id IN (0, 1)
                )
    ) t
    FOR XML PATH(''), TYPE
) t(x)

IF @SQL LIKE '%[a-z]%' BEGIN
    SET @SQL = ISNULL(@FK_TurnOff, '') + @SQL + ISNULL(@FK_TurnOn, '')
    PRINT @SQL
    --EXEC sys.sp_executesql @SQL
END


If you want to share this article with an English-speaking audience:
Speed Up Autotests From 11 to 2.5 Minutes

Source: https://habr.com/ru/post/303156/

