SQL Server wait statistics or please tell me where it hurts

How many times have you experienced performance problems with SQL Server and decided exactly where to look?

One of the most rarely used SQL Server performance troubleshooting methodologies is called “Waits and Queues” (also known as “Waiting Statistics”). The basic premise of the methodology is that SQL Server constantly keeps track of which threads have to wait. You can ask SQL Server for this information in order to reduce the list of possible causes of performance problems. “Expectations” is what SQL Server is tracking. "Queues" are resources that streams can access. The system usually captures a huge amount of expectations, and they all mean waiting for access to various resources. For example, waiting for PAGEIOLATCH_EX means that the stream is waiting to read a page of data from disk to the buffer pool. Waiting for LCK_M_X means that the thread is waiting to be able to impose an exclusive lock on something.

The great news is that SQL Server knows exactly what the performance issues are, and all you need to do is ask him ... and then correctly interpret what he will say, which can be a little more difficult.

The following information is for people who are worried about each expectation and understand what causes it. Expectations always arise. This is how the SQL Server scheduling system works.
')
The thread uses the processor and has the status of “running” (RUNNING) until it is faced with the need to wait for access to the resource. In this case, it is placed in an unordered list of threads in the "suspended" state (SUSPENDED). At the same time, the next thread in the queue of threads waiting for access to the processor, organized according to the FIFO principle (the first to arrive - first dropped), and having the status of “ready for execution” (RUNNABLE) gets access to the processor and becomes “running”. If a thread in the "paused" state receives a notification that its resource is available, it becomes "ready for execution" and placed at the end of the queue of threads ready for execution. The flow continues its cyclical movement along the chain “is being executed” - “suspended” - “is ready for execution” until the task is completed. You can see the processes and their states using the Dynamic Management View (DMV) sys.dm_exec_requests.

SQL Server tracks the time elapsed between the exit of the thread from the “running” state and its return to this state, defining it as “wait time” and time spent in the “ready to run” state, defining it as “time signal wait time ”, i.e. how long it takes for the thread after receiving a signal about the availability of resources in order to gain access to the processor. We need to understand how much time the thread spends in the “suspended” state, called the “resource wait time”, subtracting the signal wait time from the total wait time.

An excellent source of information that I recommend reading about this is the new (2014) waiting statistics document "Tuning SQL Server Performance Using Expectancy Statistics: A Guide for a Newbie" (English), which I advise you to read. There is also a much older document ( "Adjusting performance using expectations and queues" (English) with a lot of useful information, but quite obsolete at the moment. The best manual on various types of expectations (and classes of short-term locks) - my comprehensive library of expectations (English) and short-term blocking (English).

You can query SQL Server for accumulated wait statistics using DMV sys.dm_os_wait_stats. Many people prefer to wrap a DMV call into some kind of summary code. Below is the latest version of my script as of 2016, which works with all versions and includes wait types for SQL Server 2016 (see the version of the script for use in Azure here ):

WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', -- Maybe uncomment these four if you have mirroring issues N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE', N'DBMIRROR_WORKER_QUEUE', N'DBMIRRORING_CMD', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', -- Maybe uncomment these six if you have AG issues N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT') AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S], CAST ('https://www.sqlskills.com/help/waits/' + MAX ([W1].[wait_type]) as XML) AS [Help/Info URL] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95; -- percentage threshold GO

The result of the query will show the expectations, grouped by percentage of all expectations in the system, in descending order. The expectations that are (potentially) worth paying attention to are at the top of the list and represent a large part of the expectations that SQL Server spends time. You see a large list of expectations that are removed from consideration - as I said earlier, they always arise and those listed above, we can, as a rule, ignore.

You can also reset the statistics accumulated by the server using this code:

 DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR); GO

And, of course, you can come up with storing results every few hours or every day and do some time analysis to find out the direction of the changes, or automatically track problems in case they start to appear.
You can also use Performance Dashboard to display the results graphically in SQL Server 2005 and Data Collector in SQL Server 2008. In SQL Server 2000, you can use DBCC SQLPERF (N'waitstats') .

After you get the results, you will begin to think how to interpret them and where to look. The document I referred to earlier has a wealth of information on most types of expectations (except those added in SQL Server 2008).

Now I would like to provide the results of a study that I published some time ago. I asked people to run the code presented above and report the results to me. I received a huge amount of results - from 1823 servers - thanks!

Below is a graphical representation of the results:

I am not at all surprised by the top 4 results, as I have seen them again and again on the systems of my clients.

In continuation of my article, I am going to list the most popular types of expectations provided by the survey respondents, in descending order, and comment in a few words what exactly they can mean if they are basic to your system. The format of the list shows the number of systems from the respondents in which the specified type of waiting is the main one.

505: CXPACKET
Means parallelism, but not necessarily a problem. The coordinator thread in a parallel query always accumulates these expectations. If parallel threads are not busy or one of the threads is blocked, then the waiting threads also accumulate waiting for CXPACKET, which leads to a more rapid accumulation of statistics on this type - this is the problem. One thread can have more work than the others, and for this reason the entire request is blocked until the long thread finishes its work. If this type of wait is combined with large wait numbers PAGEIOLATCH_XX, then it can be a scan of large tables due to incorrect nonclustered indexes or due to a poor query execution plan. If this is not the reason, you can try using the MAXDOP option with values of 4, 2, or 1 for problematic queries or for the entire server instance (set on the server with the “max degree of parallelism” parameter). If your system is based on a NUMA scheme, try setting MAXDOP to a value equal to the number of processors in a single NUMA node in order to determine if this is the problem. You also need to determine the effect of installing MAXDOP on mixed load systems. To be honest, I would play with the “cost threshold for parallelism” parameter (raising it to 25 for a start) before lowering the MAXDOP value for the entire instance. And don't forget about the Resource Governor in the Enterprise version of SQL Server 2008, which allows you to set the number of processors for a specific group of server connections.
304: PAGEIOLATCH_XX
This is where SQL Server waits for a page of data from disk to memory. This type of wait may indicate a problem in the I / O system (which is the first response to this type of wait), but why should the I / O system serve such a number of reads? It is possible that pressure is exerted by the buffer pool / memory (not enough memory for a typical load), a sudden change in execution plans, leading to large parallel scans instead of searching, swelling of the plan cache, or some other reasons. You should not assume that the main problem in the system I / O.
275: ASYNC_NETWORK_IO
Here, SQL Server waits for the client to finish receiving data. The reason may be that the client requested too much data, or simply gets it sooo slow due to bad code - I almost never saw the problem in the network. Clients often read one row at a time — the so-called RBAR or “row by agonizing row” (Row-By-Agonizing-Row) —but instead of caching data on the client and notifying SQL Server of the end of reading immediately.
112: WRITELOG
The log management subsystem waits for the log to disk. As a rule, it means that the input / input system cannot ensure timely recording of the entire log volume, but on high-loaded systems this may be caused by general log recording restrictions, which may mean that you should divide the load between several databases, or even make your transactions a little longer to reduce the number of log records per disk. To verify that the cause is in the I / O system, use DMV sys.dm_io_virtual_file_stats to examine the I / O latency for the log file and see if it is the same as the WRITELOG delay time. If the WRITELOG lasts longer, you have received internal contention for writing to disk and must share the workload. If not, find out why you are creating such a large transaction log. Here (English) and here (English) you can draw some ideas.
(translator's example: the following query allows you to get input / output delay statistics for each file of each database on the server in a simple and convenient way:
Hidden text
```
 -- : .   > 20  USE master GO SELECT cast(db_name(a.database_id) AS VARCHAR) AS Database_Name , b.physical_name --, a.io_stall , a.size_on_disk_bytes , a.io_stall_read_ms / a.num_of_reads '.   ' , a.io_stall_write_ms / a.num_of_writes '.   ' --, * FROM sys.dm_io_virtual_file_stats(NULL, NULL) a INNER JOIN sys.master_files b ON a.database_id = b.database_id AND a.file_id = b.file_id where num_of_writes > 0 and num_of_reads > 0 ORDER BY Database_Name , a.io_stall DESC 
```
109: BROKER_RECEIVE_WAITFOR
Here the service broker is waiting for new messages. I would recommend adding this wait to the exclude list and re-run the query with wait statistics.
086: MSQL_XP
Here, SQL Server is waiting for the execution of extended stored procedures. This may indicate problems in the code of your extended stored procedures.
074: OLEDB
As the name suggests, this is waiting for interaction using OLEDB — for example, with an associated server. However, OLEDB is also used in DMV and the DBCC CHECKDB command, so do not think that the problem is necessarily in connected servers — this could be an external monitoring system that over-uses DMV calls. If this is indeed a connected server, then analyze the expectations on the connected server and determine what the problem is with the performance on it.
054: BACKUPIO
Shows when you backup directly to tape, which is sooo slow. I would prefer to filter this wait. (comment of the translator: I met this type of expectations when writing a backup to a disk, but the small database was very long backed up, not having time to complete a technological break and causing problems with users' productivity. If this is your case, the input system may be I / O used for backups, you need to consider increasing its performance or review the maintenance plan (do not perform full backups during short technological interruptions, replacing them with differential ones)
041: LCK_M_XX
Here, the thread simply waits for access to impose a lock on the object and means problems with locks. This can be caused by unwanted escalation of locks or bad code, but it can also be caused by the fact that I / O operations take too long and hold locks longer than usual. Look at the resources associated with locks using DMV sys.dm_os_waiting_tasks. Do not assume that the main problem is in locks.
032: ONDEMAND_TASK_QUEUE
This is normal and is part of a background task system (such as a pending reset, cleaning in the background). I would add this wait to the exclude list and re-execute the query with the wait statistics.
031: BACKUPBUFFER
Shows when you backup directly to tape, which is sooo slow. I would prefer to filter this wait.
027: IO_COMPLETION
SQL Server is waiting for I / O completion and this type of wait can be an indicator of a problem with the I / O system.
024: SOS_SCHEDULER_YIELD
Most often this is a code that does not fall into other types of expectations, but sometimes it can be a concurrency in a cyclic lock.
022: DBMIRROR_EVENTS_QUEUE
022: DBMIRRORING_CMD
These two types indicate that the database mirroring control system is sitting and waiting for what to do. I would add these expectations to the exclusion list and re-execute the query with the wait statistics.
018: PAGELATCH_XX
This is a competition for access to copies of pages in memory. The most well-known cases are the PFS, SGAM, and GAM competition arising in the tempdb database under certain types of loads (English). In order to find out which pages are competing for, you need to use DMV sys.dm_os_waiting_tasks in order to find out which pages are blocking. On problems with the tempdb base, Robert Davis (his blog , twitter ) wrote a good article showing how to solve them (Eng.) Another common reason I saw was a frequently updated index with competing index inserts using a serial key (IDENTITY).
016: LATCH_XX
This is a competition for any non-page structures in SQL Server — so this is not related to I / O and data in general. The reason for this type of delay can be quite difficult to understand and you need to use DMV sys.dm_os_latch_stats.
013: PREEMPTIVE_OS_PIPEOPS
Here, SQL Server switches to proactive scheduling mode in order to query Windows for something. This type of wait was added in the 2008 version and has not yet been documented. The easiest way to find out what it means is to remove the initial PREEMPTIVE_OS_ and search for what's left in MSDN - this will be the name of the Windows API.
013: THREADPOOL
This type says that there are not enough worker threads in the system to satisfy the request. Usually the reason is a large number of highly parallelized requests trying to be executed. (note of the translator: this may also be the intentionally curtailed value of the max worker threads server parameter)
009: BROKER_TRANSMITTER
Here Service Broker is waiting for new messages to be sent. I would recommend adding this wait to the exclude list and re-run the query with wait statistics.
006: SQLTRACE_WAIT_ENTRIES
Part of the listener (trace) SQL Server'a. I would recommend adding this wait to the exclude list and re-run the query with wait statistics.
005: DBMIRROR_DBM_MUTEX
This is one of the undocumented types and there is competition for sending the buffer, which is divided between the mirroring sessions (database mirroring). It may mean that you have too many mirroring sessions.
005: RESOURCE_SEMAPHORE
Here the request is waiting for the memory to be executed (the memory used to process the request operators, such as sorting). This may be a lack of memory at competitive load.
003: PREEMPTIVE_OS_AUTHENTICATIONOPS
003: PREEMPTIVE_OS_GENERICOPS
Here, SQL Server switches to proactive scheduling mode in order to query Windows for something. This type of wait was added in the 2008 version and has not yet been documented. The easiest way to find out what it means is to remove the initial PREEMPTIVE_OS_ and search for what's left in MSDN - this will be the name of the Windows API.
003: SLEEP_BPOOL_FLUSH
This wait can often be seen and it means that the control point limits itself in order to avoid overloading the I / O system. I would recommend adding this wait to the exclude list and re-run the query with wait statistics.
002: MSQL_DQ
Here, SQL Server waits until a distributed query is executed. This may mean problems with distributed queries, or it may just be the norm.
002: RESOURCE_SEMAPHORE_QUERY_COMPILE
When there are too many competing recompilations of queries in the system, SQL Server limits their execution. I don’t remember the level of the restriction, but this wait can mean excessive recompilation or perhaps too frequent use of one-time plans.
001: DAC_INIT
I have never seen this before and BOL says that the reason for the initialization of the administrative connection. I can not imagine how this can be the priority of waiting on someone else's system ...
001: MSSEARCH
This type is normal for full-text operations. If this is a preferential wait, it may mean that your system spends the most time on performing full-text queries. You may consider adding this type of wait to the exclusion list.
001: PREEMPTIVE_OS_FILEOPS
001: PREEMPTIVE_OS_LIBRARYOPS
001: PREEMPTIVE_OS_LOOKUPACCOUNTSID
001: PREEMPTIVE_OS_QUERYREGISTRY
Here, SQL Server switches to proactive scheduling mode in order to query Windows for something. This type of wait was added in the 2008 version and has not yet been documented. The easiest way to find out what it means is to remove the initial PREEMPTIVE_OS_ and search for what's left in MSDN - this will be the name of the Windows API.
001: SQLTRACE_LOCK
Part of the listener (trace) SQL Server'a. I would recommend adding this wait to the exclude list and re-run the query with wait statistics.

Hope it was interesting! Let me know if you are interested in something specifically or that you have read this article and enjoyed it!

Source: https://habr.com/ru/post/216309/

All Articles

SQL Server wait statistics or please tell me where it hurts

More articles: