Based on real events.
Any repetition of these actions, or any rash decision, can lead to complete data loss. This is by no means a HowTo; the material serves only to reconstruct the picture of how data is laid out on disk media.
So let's get started. Input data:
- 7 disks, 2 primary partitions on each;
- 1st partition: 7-way mirroring (RAID1);
- 2nd partition: RAID5, with LVM running on top.
Two disks failed overnight due to a power surge and some other hardware problems. Attempts to reassemble the array failed. The system ran on autopilot on the dead RAID for two hours; on top of that, the disks came back to life and then died again, and the kernel lost track of which disk sat in which slot at that point, so what got written to them, and how, is anyone's guess.
In short, we have a completely lost RAID, and mdadm is powerless here.
What's done cannot be undone; the data has to be restored somehow, because, as usual, there are no backups. The action plan:
- copy the surviving data to new disk(s);
- restore the original order of the disks;
- identify the dead disk (see point 1);
- determine the RAID metadata format: as it turned out, soft-RAID, starting from some version, reserves an area at the beginning of the disks for itself, whereas earlier versions stored this data at the end of the disk;
- determine the chunk/stripe size, whose default was also changed from 64k to 512k;
- reassemble the disks;
- recover LVM and copy off the logical volumes.
On point 1: everything is trivial. We buy new disks of larger capacity, create LVM on them, allocate a separate LV for each disk, and copy the 2nd partition of each with dd. We'll be playing with this data for quite a while.
Now for the rest of the items.
The theory is as follows: to determine the order of the disks and the chunk size, we need to find some log file that contains event dates and times. No sooner said than done. In my case the log files lived on a separate LV. We will also need a working directory of sufficient size (I allocated 200 GB for myself). Let's get to work. Take the first 64 KB from each disk:
for i in abcdefg; do dd if=/dev/jbod/sd${i} bs=64 count=1024 of=/mnt/recover/${i}; done
We get 7 files of 64 KB each. From these we can already tell which md metadata format is in use. For the version-1 family of metadata (here 1.2, with the superblock at offset 0x1000) it looks like this:
mega@megabook ~ $ dd if=/dev/gentoo/a bs=1024 count=64 | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 fc 4e 2b a9 01 00 00 00 00 00 00 00 00 00 00 00 |.N+.............|
00001010 ee 6f de dc c3 94 9c 58 47 d0 cc 91 9c f7 c5 35 |.o.....XG......5|
00001020 6d 65 67 61 62 6f 6f 6b 3a 30 00 00 00 00 00 00 |megabook:0......|
00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00001040 96 e2 6c 4e 00 00 00 00 05 00 00 00 02 00 00 00 |..lN............|
00001050 00 5c 00 00 00 00 00 00 00 04 00 00 07 00 00 00 |.\..............|
00001060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001080 00 04 00 00 00 00 00 00 00 5c 00 00 00 00 00 00 |.........\......|
00001090 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000010a0 00 00 00 00 00 00 00 00 64 23 f5 c9 5f 2a 64 68 |........d#.._*dh|
000010b0 e8 92 f2 1a 8c ca ad 98 00 00 00 00 00 00 00 00 |................|
000010c0 9a e2 6c 4e 00 00 00 00 12 00 00 00 00 00 00 00 |..lN............|
000010d0 ff ff ff ff ff ff ff ff f6 51 38 f5 80 01 00 00 |.........Q8.....|
000010e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001100 00 00 01 00 02 00 03 00 04 00 05 00 fe ff 06 00 |................|
00001110 fe ff fe ff fe ff fe ff fe ff fe ff fe ff fe ff |................|
*
00001400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00010000
64+0 records in
64+0 records out
65536 bytes (66 kB) copied, 0.000822058 s, 79.7 MB/s
mega@megabook ~ $ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Sep 11 20:32:22 2011
Raid Level : raid5
Array Size : 70656 (69.01 MiB 72.35 MB)
Used Dev Size : 11776 (11.50 MiB 12.06 MB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Sun Sep 11 20:32:26 2011
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : megabook:0 (local to host megabook)
UUID : ee6fdedc:c3949c58:47d0cc91:9cf7c535
Events : 18
Number Major Minor RaidDevice State
0 253 21 0 active sync /dev/dm-21
1 253 22 1 active sync /dev/dm-22
2 253 23 2 active sync /dev/dm-23
3 253 24 3 active sync /dev/dm-24
4 253 25 4 active sync /dev/dm-25
5 253 26 5 active sync /dev/dm-26
7 253 27 6 active sync /dev/dm-27
This is an example of metadata 1.2. A distinctive feature: the superblocks on all the disks have almost identical content (apart from the device number), and the rest of the space is filled with NULLs. The RAID information lives in the range 0x00001000-0x00001fff.
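As a quick sanity check of the format, a sketch along these lines can look for the version-1 superblock magic (0xa92b4efc, stored little-endian as fc 4e 2b a9) at offset 4096 in each dumped header; the helper name and paths are illustrative:

```shell
# has_md12_magic FILE: succeeds if FILE carries the md version-1
# superblock magic 0xa92b4efc (bytes fc 4e 2b a9) at offset 4096,
# i.e. where metadata 1.2 keeps its superblock.
has_md12_magic() {
  magic=$(dd if="$1" bs=1 skip=4096 count=4 2>/dev/null | hexdump -v -e '4/1 "%02x"')
  [ "$magic" = "fc4e2ba9" ]
}

# check the 64 KB dumps made earlier
for f in /mnt/recover/?; do
  if has_md12_magic "$f"; then echo "$f: metadata 1.2 superblock found"; fi
done
```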
In earlier metadata versions the RAID information was stored at the end of the disk, so data starts right at the beginning of the device. For metadata 0.9 it looks something like this:
~ # dd if=/dev/jbod/sdb bs=1024 count=1 | hexdump -C
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.0200084 s, 51.2 kB/s
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000200 4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00 |LABELONE........|
00000210 1b 72 36 1f 20 00 00 00 4c 56 4d 32 20 30 30 31 |.r6. ...LVM2 001|
00000220 66 6d 59 33 4a 35 6b 72 46 73 6d 52 51 41 47 66 |fmY3J5krFsmRQAGf|
00000230 4c 30 72 53 6b 69 59 6e 31 43 6c 72 66 61 66 70 |L0rSkiYn1Clrfafp|
00000240 00 00 fa ff ed 02 00 00 00 00 06 00 00 00 00 00 |................|
00000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000260 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
00000270 00 f0 05 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400
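Similarly, a raw LVM2 PV can be recognised by the LABELONE signature at the start of sector 1, exactly where it appears in the dump above; a hypothetical helper:

```shell
# has_lvm_label FILE: succeeds if FILE carries the LVM2 label
# signature "LABELONE" at offset 512 (start of sector 1).
has_lvm_label() {
  sig=$(dd if="$1" bs=1 skip=512 count=8 2>/dev/null)
  [ "$sig" = "LABELONE" ]
}
```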
A curious detail surfaced while studying the first 64 KB of the disks further: as it turns out, LVM stores its volume layout metadata in plain text. It looks like this:
vg0 {
    id = "xxxx"
    seqno = 64
    status = ["RESIZEABLE", "READ", "WRITE"]
    extent_size = 8192          # 4 Megabytes
    max_lv = 0
    max_pv = 0

    physical_volumes {
        pv0 {
            id = "xxxxx"
            device = "/dev/md2"    # Hint only
            status = ["ALLOCATABLE"]
            dev_size = 5848847616  # 2.72358 Terabytes
            pe_start = 384
            pe_count = 713970      # 2.72358 Terabytes
        }
    }

    logical_volumes {
        --//--
        log {
            id = "l8OVMc-BUAj-YrIT-w8mh-YkvH-riS3-p1h6OY"
            status = ["READ", "WRITE", "VISIBLE"]
            segment_count = 2

            segment1 {
                start_extent = 0
                extent_count = 12800   # 50 Gigabytes
                type = "striped"
                stripe_count = 1       # linear
                stripes = [
                    "pv0", 410817
                ]
            }
            segment2 {
                start_extent = 12800
                extent_count = 5120    # 20 Gigabytes
                type = "striped"
                stripe_count = 1       # linear
                stripes = [
                    "pv0", 205456
                ]
            }
        }
        --//--
}
Super! What do we need from this? The key numbers to go by are:
- extent_size = 8192 # 4 Megabytes
- pe_start = 384
- stripes = [pv0, 410817]
- extent_count = 5120
What do these numbers mean?
- extent_size is something like a cluster size: the minimum unit into which the entire VG space is divided. Why 4 megabytes, and in what units is it measured? I asked the same question when I first saw it, but it turns out to be simple: the unit is a 512-byte sector, and 512 bytes * 8192 = 4 MB.
- pe_start: where the LVM header ends and the data begins (also in 512-byte sectors).
- extent_count: the number of extents allocated to the volume.
- stripes = [pv0, xxxx]: on which PV, and starting from which extent, the volume sits.
Don't forget that we have RAID5 on 7 disks, so we divide all the numbers by 6 (the 7th block is parity).
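As a sketch of how these numbers combine (before the per-disk division by 6), the byte offset of the log LV's first segment on the assembled PV would be:

```shell
# all LVM figures here are in 512-byte sectors / 4 MiB extents
SECTOR=512
EXTENT_SIZE=8192       # sectors per extent (4 MiB)
PE_START=384           # sectors
START_EXTENT=410817    # from: stripes = ["pv0", 410817]

echo $(( (PE_START + EXTENT_SIZE * START_EXTENT) * SECTOR ))  # 1723091582976
```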
Now we try to find at least some log. Scanning through by hand and eye would be utopian, so we write a one-liner:
dd if=/dev/recover/sda bs=512 skip=$[(8192*410817+384)/6] | hexdump -C | grep 'Aug 28' | head
Where:
skip = (extent_size * stripes + pe_start) / 6
bs = 512 is the sector size.
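For reference, the skip expression from the command above evaluates to a per-disk offset of:

```shell
# (extent_size * start_extent + pe_start) / 6 data disks, in 512-byte sectors
echo $(( (8192 * 410817 + 384) / 6 ))  # 560902208
```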
After a short rustle of the disks, we get something like this at the output:
02d08100 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 36 20 |Aug 28 00:06:06 |
02d081c0 3d 0a 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 |=.Aug 28 00:06:0|
02d08410 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 37 20 |Aug 28 00:06:07 |
02d08570 70 0a 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 |p.Aug 28 00:06:0|
02d085d0 65 78 74 3d 0a 41 75 67 20 32 38 20 30 30 3a 30 |ext=.Aug 28 00:0|
02d086b0 64 3d 2a 29 29 22 0a 41 75 67 20 32 38 20 30 30 |d=*))".Aug 28 00|
02d08710 65 73 74 61 6d 70 0a 41 75 67 20 32 38 20 30 30 |estamp.Aug 28 00|
02d08770 73 3d 30 20 74 65 78 74 3d 0a 41 75 67 20 32 38 |s=0 text=.Aug 28|
02d089c0 6f 72 64 3d 2a 29 29 22 0a 41 75 67 20 32 38 20 |ord=*))".Aug 28 |
02d08a20 6d 65 73 74 61 6d 70 0a 41 75 67 20 32 38 20 30 |mestamp.Aug 28 0|
We take an address, say 02d08770, and divide it by 512 (to get sectors), then by 2048 more (to get megabytes):
mega@megabook ~ $ echo $[0x02d08770/512]
92227
mega@megabook ~ $ echo $[0x02d08770/512/2048]
45
We add a couple of megabytes on top (48 MB from the start of the segment) and look at what's there:
for i in abcdefg; do dd if=/dev/recover/sd${i} of=/mnt/recover/${i} bs=512 count=1024 skip=$[(8192*410817+384)/6+(48*2048)] ; done
We get 7 files of 512 KB each. Open them in a text editor and look at the dates, and at which of the files begins with parity checksums. From this we work out the order of the disks and also estimate the chunk size: if the parity blocks recur every 64 KB, the chunk is 64 KB; if not, it is most likely 512 KB or larger. Then repeat the step with an offset of 1 block:
for i in abcdefg; do dd if=/dev/recover/sd${i} of=/mnt/recover/${i}.1 bs=512 count=1024 skip=$[(8192*410817+384)/6+(48*2048)+1] ; done
We draw up a table on paper: which disk comes first, where the parity block sits. It is worth noting that RAID5 has four layout orders: left-asymmetric, left-symmetric, right-asymmetric, and right-symmetric. The details are described here:
www.accs.com/p_and_p/RAID/LinuxRAID.html .
As for ruling out the broken disk, that is a creative task, and the material above should be enough for it.
Now, knowing the chunk size, the metadata format, and the order of the partitions, we can try playing with mdadm to recreate the RAID. The strategic trick is that mdadm can create a degraded array if you write the word missing in place of a real disk. Armed with this knowledge, we try to create the array.
Be sure to specify the metadata type! And under no circumstances repeat the following commands without thoroughly studying the material above! To verify that the array was assembled correctly, I used a 2 GB virtual-machine partition:
lvcreate -L2G -nnagios recover
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/sd[a-f] missing
dd if=/dev/md0 bs=512 skip=$[8192*XXXX+384] count=$[8192*512] | dd of=/dev/recover/nagios
cfdisk /dev/recover/nagios
If cfdisk says the partition table is invalid, stop the array and recreate it with the disks rotated, moving the first one to the end, i.e.:
mdadm -S /dev/md0
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/sd[b-f] missing /dev/recover/sda
Copy the partition, inspect the contents, and so on until we find the right sequence. You could, of course, sit down with a calculator instead and, from the numbers we obtained while examining the log partition and its addresses, compute the exact order.
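The "move the first disk to the end" step between retries can be scripted; a minimal sketch (the helper name and the disk order are illustrative):

```shell
# rotate_left "a b c d e f": move the first name to the end of the list,
# giving the next candidate disk order for mdadm -C.
rotate_left() { set -- $1; first=$1; shift; echo "$* $first"; }

order="a b c d e f"
order=$(rotate_left "$order")
echo "$order"  # b c d e f a
```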
Note that instead of the seventh disk, sdg, I wrote missing, and the order of the disks should not simply be a through g but whatever you arrived at in the previous step. Let me remind you that missing makes the RAID skip recalculating the parity blocks, because it considers itself degraded and runs in emergency mode.
Once you find which partition comes first, pick some small LV and copy it out the same way I copied nagios, then try to mount it. If you guessed the missing disk wrong, dmesg will most likely report problems with the filesystem journal (since one of the disks contributes corrupted data). In that case repeat the stop/create/copy cycle, shifting the missing position: for example, put the sdg disk, which I had marked missing, back in place, and write missing instead of sdf:
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/sd[b-e] missing /dev/recover/sdg /dev/recover/sda
And so on, until the recovered data comes out as consistent as possible.
On that note, I think, the story itself can end.
I'll just add a few words about restoring LVM (so as not to suffer with all those addresses).
Everything is simple. If the array comes up correctly, LVM may activate by itself; if not, grab the backups, which normally live in /etc/lvm/backup/VG-NAME. Since my root partition sat on that same LVM, I keep this directory in /boot/lvm with symlinks pointing to it. After that it's simple:
vgcfgrestore -f /path/to/backup-file vg-name
If that does not help (for example, it starts complaining about checksums), you can hack the case a little:
dd if=/dev/zero of=/dev/md0 bs=512 count=10
pvcreate /dev/md0
pvdisplay /dev/md0 | grep 'PV UUID'
Then we edit the backup file, replace the UUID in it, and create a VG on this partition with the same name it had before:
vgcreate vg0 /dev/md0
vgcfgrestore -f /path/to/backup-file vg0
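The UUID edit can be done with a small helper (the name is illustrative, and it assumes the UUID contains no characters special to sed):

```shell
# replace_pv_uuid FILE OLD NEW: swap the PV id inside the LVM backup
# file so vgcfgrestore accepts the freshly created PV.
replace_pv_uuid() { sed -i "s/$2/$3/g" "$1"; }
```

The new UUID is the one reported by `pvdisplay /dev/md0 | grep 'PV UUID'` above.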
After these machinations everything should be restored, and the VG will be consistent as of the time the backup file was created.
All in all, I consider the material sufficiently covered. Let me stress once more that these are all extreme measures, and at the very least you should have a full copy of the disks, i.e. work not on the original disks but on their images.