Help, my database is corrupt. Now what?

A damaged database is probably one of the worst nightmares of most database administrators. The result of damage are downtime, yelling managers and all sorts of other unpleasant things.
In this article I will explain what to do with a damaged database and describe some of the things that need to be done, some types of damage and how they can be fixed.

How to detect that the database is corrupt

Usually, damage is perfectly detected when attempting to access a damaged page. Requests, backups, or re-indexing procedures end with errors with high severity levels.
Here are a couple of examples of system messages for detecting database corruption:

SQL Server detected a logical consistency-based error / error: incorrect checksum (expected: 0xfdff74c9; actual: 0xfdff74cb). It took place during a read of page (1: 69965) in database ID 13 at offset 0x0000002229a000 in file 'D: \ Develop \ Databases \ Broken1.mdf'.

Attempt to fetch logical page 1: 69965 in database 13 failed. It belongs to the allocation unit 72057594049069056 not to 281474980642816.

The main problem is that if database integrity checks are not performed on an ongoing basis, the damage can be detected after hours, days, and even months, after it has been formed, at that moment when it will be difficult to fix something. .
I will not describe the situation when the database went into the "suspect" state ( "suspicious" in the Russian edition of SQL Server - approx. Translator ). A description of all sorts of reasons why the database can go into “suspect” and the many options for correcting this are the topic of a separate article if not a book.

What to do if the database is still damaged

Do not panic
Do not detach it.
Do not restart SQL Server
Do not start recovery immediately
Run integrity check
Find the cause

Do not panic

The most important thing when detecting a database damage is not to panic. Any decisions must be carefully weighed, all possible factors must be taken into account. Damn just worsen the situation by not making a deliberate decision.
')

Do not detach database

In most cases, when SQL Server detects database corruption, this means that there are actually damaged pages in the database. Attempting to convince SQL Server that this is not the case, by detaching (detach) and reattaching (attaching) the database, backing up and restoring, restarting the SQL Server service, or restarting the server, will not cause the error to disappear.
If the database is corrupted and SQL Server detects this when attaching, it will not be able to attach it. There are several ways to make him see this database, but it is much better not to disconnect it.

Do not restart SQL Server

In the same way as during a detach-attach, restarting the SQL Server service will not be able to correct the detected errors (if any).
Restarting the service can make the situation worse. If SQL Server detects errors during the recovery phase of the database after restarting, it will mark it as “suspect”, which will greatly complicate the database recovery process.

Do not start recovery immediately

You may be tempted to simply run DBCC CHECKDB with one of the “recovery” parameters (usually allowing data loss) and hope that everything will get better ( in my experience, the first thing recommended in the non-core SQL Server forums is to run DBCC CHECKDB REPAIR_ALLOW_DATA_LOSS - a comment of the translator ). In many cases, running such a recovery is not recommended. It does not guarantee the correction of all errors and may lead to unacceptable data loss.
This recovery is the last step in correcting errors. It should be launched only if you have no other choice, but not in the first place.

Run integrity check

In order to decide how to fix the database, we definitely need to know what is damaged. The only way we can figure this out is to run DBCC CHECKDB with the All_ErrorMsgs parameter (in SQL Server 2005 SP3, SQL Server 2008 SP1 and in older versions, this parameter is enabled by default, it is not necessary to specify it). Remember that if you run DBCC CHECKDB without the No_InfoMsgs parameter, the output of this procedure will contain information on the number of rows and pages in each table, which is unlikely to interest you in analyzing errors.
DBCC CHECKDB may take a long time to run on large databases, but you must wait until this procedure completes. A competent recovery strategy can be built only if there is information about all the problems in the database.

Find the cause

After the errors are corrected, the work cannot be considered finished. If the cause of these errors is not established, they may occur again. Usually, the main cause of errors are problems with the I / O subsystem, but they can also be caused by improper operation of “low-level software” (like antivirus), human actions, or bugs of SQL Server itself.

What's next

Further corrective actions depend entirely on the results of CheckDB. A little further, I will show a few of the most frequently occurring errors (note that this article does not claim to be the title of a complete description of all kinds of errors).
The described errors range in order of increasing severity level - from the least serious to the most serious. In general, for the most serious errors found by CheckDB, there is a description of the available methods for resolving them.
If you suddenly find an error not described in the article, pay attention to the last section - “Search for help”.

Invalid information about free space on the page

Msg 2508, Level 16, State 3, Line 1
RSVD page count for object “Broken1”, index ID 0, partition ID 76911687695381, allocate unit ID 769116876953838 (type In-row data) is incorrect. Run DBCC UPDATEUSAGE.

In SQL Server 2000, the number of rows and pages in a table or index stored in metadata might not be true (and even negative) and DBCC CHECKDB did not see anything wrong with that. In SQL Server 2005, this number should be correct and CheckDB will issue a warning if it finds a discrepancy.
This is not a serious problem and is very easily resolved. As they say in the message, you just need to run DBCC UPDATEUSAGE in the context of the desired database and the warning will disappear. This error is often found in databases updated with SQL Server 2000 and should not appear in databases created in SQL Server 2005/2008.

Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information page (1: 26839) in object ID 181575685, index ID 1, partition ID 293374720802816, alloc unit ID 769116876953838 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL.

This error appears when the PFS page (Page Free Space), which takes into account how full the pages in the database are, contains incorrect values. This error, as mentioned earlier, is not serious. The algorithm that determined how full the pages were was was not always correctly worked out in SQL Server 2000. To solve this problem, you need to run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS parameter and, if this is the only error in the database, no data will actually be affected.

Damage to nonclustered indexes only

If all the errors found by CheckDB are related to indices with ID = 2 and more, this means that only nonclustered indexes were damaged. Since the information contained in nonclustered indexes is “redundant” ( the same data is stored on the heap, or in a clustered index - approx. Translator ), this damage can be corrected without losing any data.
If all errors found by CheckDB refer to nonclustered indexes, the recommended “recovery level” for DBCC CHECKDB is REPAIR_REBUILD. Examples of such errors (in fact, errors of this type are much larger):

Msg 8941, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 4, page (3: 224866). Test (sorted [i] .offset> = PAGEHEADSIZE) failed. Slot 159, offset 0x1 is invalid.

Msg 8942, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 4, page (3: 224866). Test (sorted [i] .offset> = max) failed. Slot 0, offset 0x9f overlaps with the prior row.

In this case, the damage can be completely corrected by deleting the damaged nonclustered indexes and re-creating them. Rebuilding the index (ALTER INDEX REBUILD) in on-line mode (and sometimes off-line) reads the pages of the old index to create a new one and, therefore, ends with an error. Therefore, it is necessary to remove old indexes and recreate them.
This is exactly what DBCC CHECKDB will do with the REPAIR_REBUILD parameter, but the database must be in single user mode. That is why it is usually better to manually perform these operations so that you can continue to work with the database until the indices are re-created.
If you do not have enough time to re-create the necessary indexes and there is a “clean” (without errors) full backup and backups of the transaction log with an unbroken chain of logs, you can restore the damaged pages from them.

Damage to LOB pages

Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 181575685, index ID 1, partition ID 72057594145669120, alloc unit ID 72057594087800832 (type LOB data). The off-row data node at page (1: 2444050), slot 0, text ID 901891555328 is not referenced.

The error indicates that there are LOB pages (Large OBjects) that are not referenced by any data page. This can happen if a cluster index (or a heap) was previously damaged and its damaged pages were deleted.
If CheckDB says only about such errors, then you can run DBCC CHECKDB with the parameter REPAIR_ALLOW_DATA_LOSS - these pages will be destroyed. Since you still do not have pages with data that link to these pages, there will be no more data loss.

Errors due to out of range

Msg 2570, Sev 16, State 3, Line 17
Page (1: 1103587), slot 24 in object ID 34, index ID 1, partition ID 281474978938880, alloc unit ID 281474978938880 (type "In-row data"). Column “modified” for date data type. Update column to a legal value.

These errors indicate that the column has values outside the acceptable range. This may be a datetime value, suggesting that more than 1440 minutes have passed since midnight, a Unicode string in which the number of bytes is not divisible by 2, or float / real with an incorrect precision value.
Checking for these errors is not performed by default, for databases upgraded from version of SQL Server 2000 or earlier, if the DBCC CHECKDB command with the DATA_PURITY parameter turned on has never been executed before.
CheckDB will not be able to correct these errors, since it is unknown what values to put in place of the wrong ones. Correction of such errors does not require much effort, but is done manually. Wrong values should be replaced by something acceptable. The main problem is finding invalid values. This knowledge base article has step-by-step instructions.

Clustered index or heap corruption

If the clustered index heap or leaf pages of the index are found to be damaged, this means that the data on them is lost. Leaf level pages of a clustered index contain directly data pages and for them redundancy is not provided.
If CheckDB reports damage to the cluster-level leaf level pages, the necessary “recovery level” for DBCC CHECKDB is REPAIR_ALLOW_DATA_LOSS.
Examples of such errors:

Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 181575685, index ID 1, partition ID 76911687695381, alloc unit ID 76911687695381 (type In-row data). Although its parent (1: 479) and previous (1: 715544) refer to it.

Server: Msg 8939, Level 16, State 1, Line 2
Table error: Object ID 181575685, index ID 0, page (1: 168576). Test (m_freeData> = PAGEHEADSIZE && m_freeData <= (UINT) PAGESIZE - m_slotCnt * sizeof (Slot)) failed. Values are 44 and 8028.

It should be remembered that if the errors returned by CheckDB refer to index id = 0 or 1, this means that the data is directly damaged.
This type of error is fixed, but the fix is to destroy lines or whole pages. When CheckDB deletes the error correction data, the constraints imposed by the foreign keys are not checked and no triggers work. Lines or pages are simply deleted. As a result, the data may be inconsistent, or logical integrity may be violated (not a single line can refer to LOB pages anymore, or nonclustered index lines may indicate “nowhere”). Due to these effects, such a restoration is not recommended.
If you have a “clean” backup, recovering from it is usually preferable to correct such errors. If the database is in the full recovery model and you have transaction log backups with an unbroken chain of logs (starting with the last “clean” full backup), you can backup the active part of the log and restore the entire database (or only damaged pages), resulting in no data being lost at all.
If there is no backup with intact data, you have only one option left - run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option. This will require transferring the database to single user mode for the entire duration of this procedure.
And although you have no way to avoid data loss, you can see which data will be deleted from the clustered index. For this, see this post by Paul Ranadala.

Metadata corruption

Msg 3853, Level 16, State 1, Line 1
Attribute (object_id = 181575685) of row (object_id = 181575685, column_id = 1) in sys.columns does not have a matching row (object_id = 181575685) in sys.objects.

Such errors usually occur in databases upgraded from SQL Server 2000, when someone poked around directly in the system tables.
In system tables of any version of SQL Server, foreign keys are not used, so in SQL Server 2000 it was possible to remove a row from sysobjects (for example, a table) and leave rows in the syscolumns and sysindexes tables that refer to the deleted row.
In SQL Server 2000, CheckDB did not check the integrity of the system catalog and such problems often hung unnoticed. In SQL Server 2005, CheckDB checks the integrity of the system directory and such errors can occur.
Correcting these errors is not the easiest thing. CheckDB cannot fix them, since the only thing that can be done is to delete the records from the system tables, which, in turn, can cause the loss of a large amount of data. If you have a backup of this database made before the upgrade to SQL Server 2005 and the update was quite recent, you can deploy it to SQL Server 2000, manually tweak the system tables on it and transfer the database to SQL Server 2005 again.
If you do not have a database backup on SQL Server 2000 or the update is too long and data loss is unacceptable, there are two ways. The first is to edit the system tables in SQL Server 2005, but note that this is a rather complicated and risky process, since the system tables are not documented and are much more complex than in earlier versions. In this post you can find additional information.
The second way is to script all the database objects and export all the data, then create a new database, restore the objects and fill the data. This option is preferred.

Uncorrectable damage

CheckDB cannot fix everything. Any errors like the ones listed below are incorrigible and the only option is to restore the database from the backup, which does not have these damages. If you have a full backup and the chain of logs is not broken up to the current time, you can save the final fragment of the transaction log and the database can be restored without losing any data.
If there are no such backups, the only thing you can do is to script those objects and unload the data that is still available. It is likely that due to damage, not all data will be available, and, most likely, not all objects will be scripted without errors.

System table corruption

Msg 7985, Level 16, State 2, Line 1
System table pre-checks: Object ID 4. Could not read and latch page (1: 358) with latch type SH.
Check statement terminated due to unrepairable error.

Msg 8921, Level 16, State 1, Line 1
Check terminated. Failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent.

CheckDB depends on several critical system tables in order to get an idea of what should be in the database. If these tables themselves are damaged, then CheckDB cannot even guess what should be in the database and what to compare with the current state of affairs, let alone to fix something.

Damage to distribution maps

Msg 8946, Level 16, State 12, Line 1
Table error: Allocation page (1: 2264640) has invalid PFS_PAGE page header values. Type is 0. Check type, allocate the unit ID and page ID on the page.

Msg 8998, Level 16, State 2, Line 1
Pages Preventing IDF Pages from (1: 2264640) to (1: 2272727)

In this case, one or several pages determining the placement of data in the database ( distribution maps - approx. Translator ) are damaged. These pages are used to determine which pages and extents in the database are used and which are free. CheckDB cannot fix such errors, since it is almost impossible to determine (without these pages) which extents are used to host data, and which are not. Simply deleting such a “distribution map” is impossible, since deleting any one of them would entail deleting 4 GB of data.

Help search

If you are not sure what you need to do - ask for help. If suddenly you receive a message about database corruption, which is not clear to you and which is not described above - ask for help. If you are not sure that you have chosen the best recovery method - ask for help.
If you have a Senior DBA, refer to it. If you have a "mentor" - ask him. Ask for advice on the forums, but remember that not all the advice received on the forums is helpful. In fact, it is there that from time to time absolutely wrong and even dangerous decisions are published.
Contact Microsoft Support Finally. It will not be free, but they really know what can be done with a damaged database and it is likely that if your database is critical for the enterprise, then the cost of downtime during the independent search for a solution will be much higher than the cost of contacting support.

Conclusion

In this article, I gave a few examples of what can be done when a damaged database is detected and, more importantly, what should not be done. I hope that now you better understand what methods you can use to solve the problems described and how important it is to have good backups ( and choose the correct recovery model - note of the translator ).

Note: this is my first translation, which, moreover, was done not at once, but in several approaches, in the evenings when free time appeared, so the text as a whole may perhaps seem somewhat inconsistent to someone. If somewhere I was overly tongue-tied and some part of the text suddenly turns out to be difficult to understand, I will be happy to hear all the comments.
Regards, unfilled.
PS When I was about to click on the “Publish” button, I received an email from SQL Server Central with a comic like this.

Source: https://habr.com/ru/post/136979/

All Articles