
Repairing a remote server after removing part of the OS

Recently there have been a couple of interesting articles on Habr (for example, one, two, and several more in colleagues' blogs), and after reading them I thought: why not share with the community my experience of getting out of the interesting situations that come up while working in outsourcing.

Today's story is about restoring access to a Linux virtual machine after part of the operating system was deleted.

If this content is interesting to the audience, I will try to write a few more articles in the same vein.

Problem statement


In early 2012 we had to do a v2v migration of several Linux virtual machines between different data centers. I don't remember exactly why, but we chose to do it by rsync-ing all partitions (followed, of course, by the usual witchcraft with kernel settings / modules / initrd / grub and so on; standard procedures).

Since the servers were large and important, replication was not an option, the channels were narrow, the downtime requirements were strict, the deadlines were "yesterday", and so on, we drew up a detailed work plan with several stages of synchronization followed by a final catch-up sync of the changes. An important point was the lack of access to the console of the virtual machine we were migrating from, since it lived in a private cloud. The tight conditions were dictated by the Customer (the classic desire to do everything quickly, cheaply and well), so we covered ourselves as much as possible: we warned about the risks, made our own backups and asked the Customer to make backups on their side as well, the service users were notified of the work, the whole procedure was worked out down to the smallest detail, and the maintenance windows were scheduled for the night. Nothing foreshadowed trouble.
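For context, the staged approach boils down to something like the sketch below; the destination host name, the mounted root path and the exclude list are illustrative assumptions, not the actual values we used:

     # First pass(es): copy the bulk of the data while the services are still running.
     # "target" and /mnt/newroot are placeholders for the destination VM and its mounted root.
     rsync -aHAXv --numeric-ids \
         --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' --exclude='/run/*' \
         / root@target:/mnt/newroot/

     # Maintenance window: stop the services, then run a final catch-up pass
     # that also removes files deleted on the source since the previous pass.
     rsync -aHAXv --numeric-ids --delete \
         --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' --exclude='/run/*' \
         / root@target:/mnt/newroot/

After the final pass come the usual adjustments on the target (fstab, initrd, grub) so that the copied system can boot.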

Beginning of work


Then, true to Murphy's law: if something can go wrong, it will.

The human factor intervened: during one of the synchronization passes the administrator mixed up the rsync arguments and accidentally swapped the source and destination. What saved us from a full-blown data loss was that the synchronization was running on the /usr mount point, and the administrator noticed in time and quickly interrupted the command. Still, the directories /usr/{bin,lib*,sbin} were hit: some of the files were deleted. The main problem was that, with many libraries gone, it was impossible to open a new session, run rsync, or restore anything through the package manager; all that remained was the current shell plus basic utilities without exotic dependencies. The evening stopped being languid...
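Purely as a hypothetical illustration of what such a mix-up looks like (these are not the exact commands that were run): the intended direction was local to remote, and with the arguments swapped rsync starts making the local /usr match the still-incomplete remote copy, deleting local files along the way:

     # Intended: push the local /usr to the destination VM ("target" is a placeholder)
     rsync -aH --delete /usr/ root@target:/usr/

     # Accidentally run: source and destination swapped, so --delete now removes
     # local files under /usr that do not yet exist on the remote side
     rsync -aH --delete root@target:/usr/ /usr/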

Then it began: brainstorming, testing hypotheses, organizing a way out. Everyone had "fun". In parallel we worked out the fallback of rebuilding from scratch and bringing up what was left, but that would have meant missing the deadlines and losing the changed data.

Recovery description


Let me say right away that this algorithm is not a panacea and is not necessarily the optimal solution, but it has one big advantage: it worked, quickly and successfully.

The idea was as follows: find a donor, copy over just enough of the missing data to get rsync running, run rsync and restore all the libraries and commands, verify package integrity and diff the system against the backup, and then carry on with the data that actually needed to be transferred.

Copies of the "broken" server were immediately brought up to work on, and a working server with a similar OS was found, which we decided to use as a donor to copy libraries and programs from.

The question of transport remained open: practically nothing worked on the broken server, except telnet. Then we remembered that telnet can easily be used to send GET requests. If you can, you must! Since everything in *nix is a file, you can take an archive with the necessary data, turn it into a uuencoded text file and transfer it as plain text (we did not manage to find a better way to move data more or less reliably over telnet).

On the problem server the task was the inverse: receive the data, convert it back into an archive, unpack it, and get rsync back into working order; after that everything runs like clockwork.

On the production (donor) server we performed roughly the following actions:

  1. Pack what is needed into an archive:

    tar zcf /tmp/usr.tgz /usr... 

  2. Convert the archive from binary to text format (a round-trip sanity check is sketched after this list):

     cat /tmp/usr.tgz | busybox uuencode -m - > usr.txt 

  3. Move the file to the web server's document root, which is reachable from the broken server:

     mv usr.txt /var/www/html 
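Before publishing usr.txt it is cheap to check on the donor that the text file really decodes back into the original archive, keeping in mind that base64 inflates the data by roughly a third. A small sanity-check sketch (same filenames as above; the temporary file name is arbitrary):

     # Decode the text file back and compare checksums with the original archive
     md5sum /tmp/usr.tgz
     busybox uudecode -o /tmp/usr.check.tgz usr.txt
     md5sum /tmp/usr.check.tgz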

On the broken server (again, roughly):

  1. Download the file: we type the request into an interactive telnet session and redirect the output (a non-interactive alternative is sketched after this list):

     telnet workserver 80 > usr.txt
     GET /usr.txt

  2. Check how many lines are taken up by the service headers with head -30 usr.txt (you could cut everything before the base64 off right away with sed, but to be on the safe side we decided to eyeball it first):

     Trying workserver...
     Connected to workserver.
     Escape character is '^]'.
     HTTP/1.1 200 OK
     Date: Fri, 17 Jan 2013 13:13:50 GMT
     Server: Apache
     Last-Modified: Fri, 17 Jan 2013 13:11:32 GMT
     Accept-Ranges: bytes
     Content-Length: 262635116
     Connection: close
     Content-Type: text/plain; charset=utf-8

     begin-base64 644 -
     bmV0LnNoLm1lAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

  3. Delete the leading HTTP lines (lines 1 through 12 inclusive, everything before the begin-base64 line):

     sed -i -e '1,12d' usr.txt

  4. Convert back from text format to binary:

     busybox uudecode -o /root/usr.tgz usr.txt 

  5. Unpack /root/usr.tgz and manually put back the files that had been deleted;
  6. Check whether rsync works again; if not, see what is still missing; if yes, run rsync and synchronize the system directories with a working copy;
  7. Find out what has changed and reinstall the affected packages (roughly; a follow-up verification is sketched after this list):

     # rpm -Va lists changed/missing files; strip the "c"/"d" attribute markers and keep only the paths
     rpm -Va | tr -s ' ' | sed -e 's/\ d\ /\ /g' | sed -e 's/\ c\ /\ /g' | cut -f2 -d' ' > va2.txt
     # find the packages that own those paths and reinstall them
     rpm -qf $(cat va2.txt) | sort -n | uniq > reinstall.pkg
     yum -y reinstall $(cat reinstall.pkg)

  8. Depending on the affected packages, a reboot may be required (after backing up the changes, working out options for rolling back to another server, and triple-checking that everything will come up again, preferably after reinstalling the kernel).
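A note on steps 1 through 3 above: typing the request into an interactive telnet session and cutting the headers by hand works, but it is fragile. Assuming the broken server still has a working bash built with /dev/tcp support, the same download can be scripted, and the payload can be extracted by keeping only the lines from the begin-base64 marker onwards. An alternative sketch, not what we actually typed that night:

     # Open a TCP connection to the web server on fd 3 and send a plain HTTP/1.0 request
     exec 3<>/dev/tcp/workserver/80
     printf 'GET /usr.txt HTTP/1.0\r\nHost: workserver\r\n\r\n' >&3
     cat <&3 > usr.raw
     exec 3<&-

     # Keep only the uuencoded payload, from the begin-base64 line to the end of the file
     sed -n '/^begin-base64/,$p' usr.raw > usr.txt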
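And a follow-up to step 7: once the reinstall has finished, it is worth re-running the verification and making sure nothing is still reported as missing (an illustrative check, not part of the original procedure):

     # After "yum reinstall", this should print no "missing" entries for the affected directories
     rpm -Va | grep '^missing'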

Results


We managed to fit within the allotted time, and the business did not feel any problems. Nevertheless, we carried out a thorough analysis of the situation, and such mistakes did not happen again.
On the technical side, the following points were noted:

Source: https://habr.com/ru/post/325746/

