The operations team decided that our experience with Microsoft SQL Server under heavy load could no longer be kept to ourselves, and so this article was born. In it, I will cover the practical nuances of working with this DBMS.
Microsoft SQL Server has long had its place in the Yandex.Money product family, where it successfully solves the problem of gathering scattered information about every operation from a multitude of individual services. Without pulling this data together, it would be impossible to trace a payment, collect statistics, or investigate a problem.
All of the above will be useful to administrators of large databases, that is, to those who care about fast and accurate analytics on Microsoft SQL Server 2012-2014.
If you find this information useful and interesting, be sure to let us know in the comments so the author doesn't slack off.
My name is Slava, and I head the Microsoft SQL Server administration group at Yandex.Money.
I have been working with this DBMS since version 6.5; on MS SQL 7-2000 I built a monitoring system for 20-30 servers, and after the release of MS SQL 2005 and the move from DMO to SMO, the monitoring was completely rewritten.
I have been on the Yandex.Money team since the end of 2012.
To set the stage, a few words about the architecture of our storage system. Each component of the Yandex.Money payment system knows only its own piece of information about a payment, so we need to regularly collect information from all components, aggregate it, and identify the relationships.
From this data, cubes with hundreds of dimensions are built daily, reports and registers are generated, and reconciliations are run. The "combed" data is then handed to users and financial systems for further work. Microsoft SQL Server, with its powerful Integration Services and the ETL (Extract - Transform - Load) process, was a great fit for these tasks.
We also make active use of Analysis Services to build OLAP cubes. They serve data in multidimensional form, sliced any way you like, which is convenient for analysts, financiers, product managers, and management.
The most common reason for slow queries is outdated statistics (information about the distribution of data in table columns) and, as a consequence, suboptimal execution plans.
When any query is executed, the DBMS builds an execution plan: an ordered list of the operations to perform. To choose a specific operation, a dedicated component, the query optimizer, relies on important input data: statistics describing the distribution of values in the table columns involved.
This cardinality estimate lets the query optimizer build a more or less optimal execution plan, which is why outdated statistics can ruin everything.
An incorrect estimate can lead to excessive disk I/O, for example when a shortage of granted memory causes spills to TempDB. The DBMS may also choose a serial execution plan instead of a parallel one; these are just some of the possible consequences.
There are already articles on Habr about automating statistics updates, so I will not dwell on that point in detail. But updating statistics can be slow and resource-intensive, and the result is unstable because the data in the table changes faster than the statistics can be refreshed. In such cases, I simply enable a global trace flag that replaces the fixed 20% change threshold for triggering a statistics update with a dynamically decreasing percentage of table changes:
DBCC TRACEON (2371,-1)
With this flag enabled, statistics are updated more frequently as the number of rows grows.
It helps a lot, but on tables with a billion rows or more the process can still be extremely slow, even with a sound design.
In this case, enabling asynchronous statistics updates helps:
ALTER DATABASE dbName SET AUTO_UPDATE_STATISTICS_ASYNC ON
While statistics are being refreshed, SQL Server keeps running queries against the old statistics instead of making them wait for the update to finish. In our tests, query performance without this option was very unstable; try the experiment yourself.
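Before reaching for these flags, it is worth checking how stale the statistics actually are. A minimal sketch via sys.dm_db_stats_properties (the table name dbo.Payments is hypothetical):

-- When each statistics object on a table was last updated,
-- and how many rows have changed since then
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.Payments')
ORDER BY sp.modification_counter DESC;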
But even with good plans and up-to-date statistics, queries may suffer unexplained delays. In my experience, the most frequent reason (after bad plans) is memory-hungry queries sitting in a queue for execution because of a formal shortage of the resources they requested.
When five or more memory-hungry queries run at once, and the default per-query limit is 25% "in one hand", the first four get all the memory between them. Every other query can only join the queue with the RESOURCE_SEMAPHORE wait type, waiting for memory to be freed.
Roughly half of the server's available RAM is set aside for executing queries, and on systems with more than 128 GB of RAM, 25% of that is a lot.
By my observations, a query simply never uses 20 to 80% of the memory granted to it. This is plainly visible if you run the following command while queries are executing:
select * from sys.dm_exec_query_memory_grants
The command shows the queries currently being processed in the database, along with any resource waits they encountered along the way.
(Screenshot: the result of running this query on one of our test databases.)
Here a query requested and received 9 GB but used only 10% of it. Also note the ideal_memory_kb field: it holds queries' unfulfilled dreams.
The Resource Governor is an effective way to deal with irrational memory consumption during query execution: it lets you set limits on CPU and memory usage.
I also recommend the stored procedure sp_WhoIsActive, which collects server state information from dynamic management views (DMVs). A very simple and powerful tool.
I use the sp_WhoIsActive procedure with this set of parameters:
sp_whoisactive @not_filter = 'ReportServer', @not_filter_type = 'database', @get_plans = 1, @find_block_leaders = 1, @get_task_info = 2, @get_additional_info = 1 --, @get_locks = 1
Such queries can be identified simply by watching active queries at moments of excessive load. You can also create a job that runs select * from sys.dm_exec_query_memory_grants every 5-10 minutes and saves the results to a table. Analyzing the accumulated values, you can find problem queries by the gap between granted_memory_kb and max_used_memory_kb.
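A minimal sketch of that analysis, assuming a hypothetical logging table dbo.grants_log populated from the DMV: the offenders are the queries that were granted far more memory than they ever touched.

-- Queries whose memory grant was much larger than what they actually used
SELECT session_id,
       granted_memory_kb / 1024  AS granted_mb,
       max_used_memory_kb / 1024 AS max_used_mb,
       granted_memory_kb - max_used_memory_kb AS wasted_kb
FROM dbo.grants_log
WHERE granted_memory_kb > 2 * max_used_memory_kb  -- used less than half the grant
ORDER BY wasted_kb DESC;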
Next, identify the connections that take more memory than they need, and use the Resource Governor to redirect them into a group where memory is granted with a margin of 10-20% instead of 80%.
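A minimal Resource Governor sketch, assuming hypothetical names (greedy_pool, greedy_group, the login etl_reports) and a classifier that routes by login; the classifier function must live in master:

-- Pool and workload group with a tighter per-query memory grant cap
CREATE RESOURCE POOL greedy_pool;
CREATE WORKLOAD GROUP greedy_group
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 15)
    USING greedy_pool;
GO
-- Classifier: send the offending login into the restricted group
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = N'etl_reports'  -- hypothetical login
        RETURN N'greedy_group';
    RETURN N'default';
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;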
The value for the REQUEST_MAX_MEMORY_GRANT_PERCENT parameter (specified as a percentage of the available memory) can be derived using the following script:
SELECT res.name,
       sem.target_memory_kb / 1024    AS target_memory_MB,
       sem.available_memory_kb / 1024 AS available_memory_MB,
       sem.granted_memory_kb / 1024   AS granted_memory_MB,
       sem.used_memory_kb / 1024      AS used_memory_MB,
       sem.grantee_count,
       sem.waiter_count
FROM sys.dm_exec_query_resource_semaphores sem
JOIN sys.resource_governor_workload_groups res ON sem.pool_id = res.pool_id
WHERE sem.resource_semaphore_id = 0
The result shows how much query memory the server has overall and how much goes to each group you add or configure.
After all of the above, memory allocation between queries becomes noticeably more efficient, with far fewer situations where queries just sit waiting for a memory grant and do nothing.
Only the lazy have not written about the pros and cons of AlwaysOn, the high-availability environment for databases. But there is far less information about its practical use with terabyte-scale databases. When we introduced availability groups at Yandex.Money back in 2013, there was no information at all about real production deployments. Our main database was already about 4 TB at the time, so we had to figure out a lot on our own.
The high-availability cluster consisted of two nodes with the following specs: 192 GB of RAM, two shelves of SAS disks per server plus a separate shelf of SATA drives for backups, and a network of four 1-gigabit interfaces combined into a team.
Over four years the database grew to 20 TB, the servers could barely cope with the new load, and index maintenance caused additional problems, discussed below. During cube processing, the database generated so much network traffic that it interfered with other components on the network until the cluster was moved to a separate, more powerful switch.
At peak times it got to the point where the nodes lost sight of each other and the database on the second node went offline (hello, cluster quorum). Adding extra network cards sorted that out, and the new machines simply got 10-gigabit cards from the start.
Architectural problems aside, our overall impression of the AlwaysOn technology is positive:
Traffic between nodes is compressed by 80-90%.
Read-Only (RO) replicas synchronize quickly and return to service even after a long outage.
To clone a database into a test environment, you can use log shipping (transaction log delivery) in manual mode; see the sketch below.
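A minimal manual log-shipping sketch, with placeholder paths and the database name dbName:

-- On the production node: back up the transaction log
BACKUP LOG dbName TO DISK = N'\\backup\dbName_log.trn';

-- On the test server: restore the base backup without recovery,
-- then apply log backups until the clone is fresh enough
RESTORE DATABASE dbName FROM DISK = N'\\backup\dbName_full.bak' WITH NORECOVERY;
RESTORE LOG dbName FROM DISK = N'\\backup\dbName_log.trn' WITH NORECOVERY;

-- Bring the clone online
RESTORE DATABASE dbName WITH RECOVERY;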
To minimize the number of cases where the cluster service unreasonably decides to take the whole cluster down in a two-node configuration, we came up with forcibly zeroing the vote of the secondary node:
Import-Module FailoverClusters
# strip the secondary node of its quorum vote
$node = "Srv1"
(Get-ClusterNode $node).NodeWeight = 0
With these settings, working on the secondary nodes becomes much less nerve-racking.
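The vote distribution can be verified from the SQL Server side as well:

-- Quorum votes per cluster member as seen by SQL Server
SELECT member_name, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;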
When operating large databases with load distributed across AlwaysOn groups, the primary and secondary nodes can fall out of sync. If a long SELECT runs on the RO node against a table that is being updated on the primary at the same time, redo of the log sequence number (LSN) chain is blocked. The desynchronization affects all tables and lasts as long as the query runs; during that time, the data on the replica goes stale.
There are usually no obvious signs of desynchronization, but a new process launched under the sa account appears on the Read-Only nodes. Cascading locks are also common; they are quickly detected with the sp_whoisactive procedure described above.
The cause is a SELECT running at the READ COMMITTED isolation level (that is, the default).
As a fix, you can use the WITH (NOLOCK) hint in the query itself, or switch the isolation level for the database or session to SNAPSHOT to isolate long queries from the tables they read.
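Both variants in a minimal sketch (the table and database names are hypothetical):

-- Variant 1: per-query hint, reads without taking shared locks
SELECT COUNT(*) FROM dbo.Payments WITH (NOLOCK);

-- Variant 2: row versioning at the database level
ALTER DATABASE dbName SET ALLOW_SNAPSHOT_ISOLATION ON;
-- then, in the session running the long query:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;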
To learn about a desync before the users do, we run a monitoring script: every 5 minutes it checks the synchronization status and sends an e-mail if there is a problem.
declare @Delay int
set @Delay = 2

select * into #tmpag_availability_groups
from master.sys.availability_groups

select group_id, replica_id, replica_server_name, availability_mode
into #tmpdbr_availability_replicas
from master.sys.availability_replicas

select replica_id, group_database_id, database_name, is_database_joined, is_failover_ready
into #tmpdbr_database_replica_cluster_states
from master.sys.dm_hadr_database_replica_cluster_states

select * into #tmpdbr_database_replica_states
from master.sys.dm_hadr_database_replica_states

select replica_id, role, is_local
into #tmpdbr_availability_replica_states
from master.sys.dm_hadr_availability_replica_states

select ars.role, drs.database_id, drs.replica_id, drs.last_commit_time
into #tmpdbr_database_replica_states_primary_LCT
from #tmpdbr_database_replica_states as drs
left join #tmpdbr_availability_replica_states ars on drs.replica_id = ars.replica_id
where ars.role = 1

SELECT
--'<tr><td align="center">' + AR.replica_server_name + '- ' + AG.name + ' ' + dbcs.database_name + '</td>' AS SRV_AG_DB,
AR.replica_server_name + '- ' + AG.name + ' ' + dbcs.database_name AS SRV_AG_DB,
--CASE dbcs.is_failover_ready WHEN 1 THEN 0 ELSE ISNULL(DATEDIFF(ss, dbr.last_commit_time, dbrp.last_commit_time), 0) END AS [EstimatedDataLoss],
--'<td align="center">' + Cast(ISNULL(dbr.last_redone_time, 0) AS varchar(50)) + '</td>' As LastRedoneTime ,
Cast(ISNULL(dbr.last_redone_time, 0) AS varchar(50)) As LastRedoneTime ,
--'<td align="center">' + Cast((CASE dbcs.is_failover_ready WHEN 1 THEN 0 ELSE ISNULL(DATEDIFF(ss, dbr.last_commit_time, dbrp.last_commit_time), 0) END) as nvarchar(10)) + '</td>' AS [EstimatedDataLoss2],
Cast((CASE dbcs.is_failover_ready WHEN 1 THEN 0 ELSE ISNULL(DATEDIFF(ss, dbr.last_commit_time, dbrp.last_commit_time), 0) END) as nvarchar(10)) AS [EstimatedDataLoss2],
--'<td align="center">' + Cast(ROUND(ISNULL(CASE dbr.redo_rate WHEN 0 THEN -1 ELSE CAST(dbr.redo_queue_size AS float) / dbr.redo_rate END, -1),-1) as nvarchar(10)) + '</td></tr>' AS [EstimatedRecoveryTime]
Cast(ROUND(ISNULL(CASE dbr.redo_rate WHEN 0 THEN -1 ELSE CAST(dbr.redo_queue_size AS float) / dbr.redo_rate END, -1),-1) as nvarchar(10)) AS [EstimatedRecoveryTime]
INTO #tt
FROM #tmpag_availability_groups AS AG
INNER JOIN #tmpdbr_availability_replicas AS AR ON AR.group_id = AG.group_id
INNER JOIN #tmpdbr_database_replica_cluster_states AS dbcs ON dbcs.replica_id = AR.replica_id
LEFT OUTER JOIN #tmpdbr_database_replica_states AS dbr ON dbcs.replica_id = dbr.replica_id AND dbcs.group_database_id = dbr.group_database_id
LEFT OUTER JOIN #tmpdbr_database_replica_states_primary_LCT AS dbrp ON dbr.database_id = dbrp.database_id
INNER JOIN #tmpdbr_availability_replica_states AS arstates ON arstates.replica_id = AR.replica_id
WHERE
--(AG.name='Nastro') and
ISNULL(CASE dbr.redo_rate WHEN 0 THEN -1 ELSE CAST(dbr.redo_queue_size AS float) / dbr.redo_rate END, -1) > @Delay
--(AG.name=@AGN) and ISNULL(CASE dbr.redo_rate WHEN 0 THEN -1 ELSE CAST(dbr.redo_queue_size AS float) / dbr.redo_rate END, -1) > @Delay

IF EXISTS (SELECT * from #tt)
BEGIN
    declare @tableHTML nvarchar(max)
    set @tableHTML = N'<H3><FONT SIZE="3" FACE="Tahoma">AlwaysOn Status </FONT></H3>'
    set @tableHTML = @tableHTML
        + N'<table border="1" bgcolor=D7D1F8>' + -- change the background color if you want
        N'<FONT SIZE="2" FACE="Calibri">' +
        N'<tr><th align="center">Server Group DB</th>' +
        N'<th align="center">LastRedoneTime</th>' +
        N'<th align="center">EstimatedDataLoss</th>' +
        N'<th align="center">EstimatedRecoveryTime</th>' +
        N'</tr>' +
        ISNULL(CAST((SELECT * from #tt FOR XML PATH('tr'), TYPE) AS NVARCHAR(MAX)), '') +
        N'</FONT>' + N'</table>';

    --send email
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'mail', -- change here !!
        @recipients = 'Admin1@yandex.ru;Admin2@yandex.ru', -- change here !!
        @subject = 'AlwaysON Report',
        @body = @tableHTML,
        @body_format = 'HTML';
END

DROP TABLE #tmpdbr_availability_replicas
DROP TABLE #tmpdbr_database_replica_cluster_states
DROP TABLE #tmpdbr_database_replica_states
DROP TABLE #tmpdbr_database_replica_states_primary_LCT
DROP TABLE #tmpdbr_availability_replica_states
drop table #tmpag_availability_groups

SELECT * from #tt
drop table #tt
Another common cause of desynchronization is index maintenance. For example, while reorganizing a clustered index on a 500 GB table, we watched the replicas stay out of sync for 10 hours.
The easiest way to shorten this window is the MAXDOP option when creating or rebuilding indexes. I usually set it to 2 for indexes of 10-20 GB, and to 4 for smaller ones.
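For example (the index and table names are hypothetical):

-- Rebuild a large index with limited parallelism to keep
-- the redo queue on the secondaries manageable
ALTER INDEX IX_Payments_Date ON dbo.Payments
REBUILD WITH (MAXDOP = 2, ONLINE = ON);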
A nuance of read queries against Read-Only database replicas: setting ApplicationIntent=ReadOnly in the connection string is not enough; you also have to configure routing. Of course, this is the kind of setting you configure once and never touch again, but His Majesty Chance has broken sturdier schemes than this. For example, in the course of operations you may need to redirect queries from the RO replica to the primary.
For that, we used to change the settings of the replica itself, simply forbidding queries to the RO copy. But we had to change the approach after several unpleasant incidents caused by how switching via replica settings behaves:
The switch was abrupt: queries currently running against the RO replica were killed, which is not great.
It later turned out that the more correct way to redirect RO queries is to change the routing.
Here is an example for a configuration with load distributed across two nodes:
-- send read-only requests to SecondaryDB01 first
ALTER AVAILABILITY GROUP [AGGroupName]
MODIFY REPLICA ON N'PrimaryDB01'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('SecondaryDB01','PrimaryDB01')));

-- send all read-only requests to PrimaryDB01 itself
ALTER AVAILABILITY GROUP [AGGroupName]
MODIFY REPLICA ON N'PrimaryDB01'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('PrimaryDB01')));
While the routing is being changed, the queries already running complete normally, and subsequent ones go straight to the primary replica.
In four years of operating the system, we have had exactly one failure on the RO replica: one of the volumes failed with a CRC error in a data file. The primary node kept working, but the database on the RO replica did not.
Handling a failure of the database on the RO node and automatically switching traffic to the primary can be implemented with the same routing mechanism (a sketch follows the list):
create a job with a query against any table in the database;
if the query fails, check the database on the primary node the same way; if the primary is healthy, switch the read-only routing to it.
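A hypothetical sketch of such a job step, assuming a linked server RO_REPLICA (with RPC OUT enabled) pointing at the secondary, and the group and replica names from the example above:

BEGIN TRY
    -- probe any table on the RO replica
    EXEC ('SELECT TOP (1) 1 FROM dbName.dbo.Payments') AT [RO_REPLICA];
END TRY
BEGIN CATCH
    -- probe failed: route read-only traffic to the primary itself
    ALTER AVAILABILITY GROUP [AGGroupName]
    MODIFY REPLICA ON N'PrimaryDB01'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = ('PrimaryDB01')));
END CATCH;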
As for restoring the database after such a failure, it is easier to remove a large database from the availability group (on the primary node it will keep working, even through the Listener) and restore the backup on the Read-Only node in no-recovery mode. It is then enough to add the restored database back to the availability group in Join Only mode.
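Roughly, on the Read-Only node (paths and names are placeholders):

-- restore the copy without recovery so it can catch up from the primary
RESTORE DATABASE dbName FROM DISK = N'\\backup\dbName_full.bak' WITH NORECOVERY;
RESTORE LOG dbName FROM DISK = N'\\backup\dbName_log.trn' WITH NORECOVERY;

-- join the database to the availability group (Join Only)
ALTER DATABASE dbName SET HADR AVAILABILITY GROUP = [AGGroupName];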
Addendum: Slow in the app, fast in SSMS ... (thanks to AlanDenton for the link).
Source: https://habr.com/ru/post/326436/