Background: we had a server leased from hosting.ua that ran a dozen or so of our clients' sites (combined traffic in the tens of thousands of visitors), including shops with decent turnover, plus a hosted bug tracker, Jira, and CSN.
They say people fall into two types: those who don't make backups yet, and those who already do. We were somewhere in the middle.
Over the years we had already seen hard drives die, motherboards burn out, and file systems get corrupted. So a daily backup ran to a hard drive on the same server, and, occasionally, a copy was made to another server in the same data center. With such a scheme, losing a production hard drive is a nuisance, but a rather rare one, and for the backups on both servers to disappear at the same time, something incredible would have to happen (or so I thought at the time).
Every article on backups recommends keeping them in physically different places, "in case of fire, actions of the authorities, or other disasters." Sounds funny, doesn't it...
Saturday. Evening.
At 21:30 I got the first alert from the server availability monitoring system. I tried to find out what had happened, and it turned out the whole data center was down: neither the hoster's site nor my server responded. I decided it was a problem with the uplink (that had happened more than once before) and calmly left for a weekend trip to Russia (well, what could possibly happen, a fire or something?).
Sunday. Morning.
Arriving at the first stopover around midnight, I noticed the monitoring system still hadn't calmed down, so I simply turned off the SMS alerts. Waking up in the morning in a village near the border, I was unpleasantly surprised by the pile of SMS messages saying the server was still unavailable. Despite the early hour, I called our administrator and asked him to find out what was going on.
Ten minutes later I received a message containing a single six-letter word "denoting the complete collapse of all hopes." I called back immediately, and after the conversation it was clear that the word described the situation very accurately. According to rumors (!), a fire had broken out in the data center, the automatic fire suppression system had failed, and whatever survived the fire was generously doused with water by the firefighters... There was no official information, and support was not responding.
Sunday. Afternoon.
From then on, everything went as if we were at war:
1. Ordered a new server with "instant" activation; which exact one didn't really matter, getting it quickly did.
2. Notified key clients about what had happened, our plan of action, and the possible consequences.
3. Called our "special forces" into the office: the administrator and the key developers.
4. Started trying to reach the hoster to find out from official sources what had happened and what state our server was in.
5. Brought up DNS on the new server and pointed all the domains under our control at it.
6. Restored the mail (MX) records for our domains so as not to lose correspondence.
7. Set all incoming requests to return a page with a 503 error and an explanatory text (a minimal sketch of the idea follows this list).
8. Sat down to write a Yandex cache grabber to save the indexed content in case all the data turned out to be lost, and to keep the sites from looking dead to search engines during the restoration work (a sketch of that, too, is below).
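A note on item 7: the post above doesn't say how exactly the stub responses were served; in practice this is usually a one-line rule in the front-end web server configuration. Purely as an illustration of the idea (answer every request with 503 plus a Retry-After header, so both visitors and search-engine crawlers treat the outage as temporary), here is a minimal, hypothetical Python sketch:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical explanatory text; the real page carried contact details etc.
    PAGE = ("<html><body><h1>Temporary outage</h1>"
            "<p>The data center hosting this site has had a fire. "
            "We are restoring service; please check back later.</p>"
            "</body></html>").encode("utf-8")

    class MaintenanceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 503 + Retry-After tells clients and crawlers the outage is temporary,
            # so search engines do not drop the indexed pages.
            self.send_response(503)
            self.send_header("Retry-After", "3600")
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)

        do_POST = do_GET

    if __name__ == "__main__":
        # Port 8080 for the sketch; a real setup would answer on 80/443.
        HTTPServer(("", 8080), MaintenanceHandler).serve_forever()

The important part is the status code: serving 200 or 404 during a long outage risks the pages being re-indexed as the error page or dropped from the index.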
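As for item 8, conceptually the grabber was nothing sophisticated: walk the list of indexed URLs and save the search engine's cached copy of each page. The sketch below is a hypothetical reconstruction in Python; the cache URL template, the urls.txt file, and the output layout are placeholders rather than the actual Yandex interface:

    import time
    import urllib.parse
    import urllib.request
    from pathlib import Path

    # Placeholders: the real saved-copy URL format of the search engine and the
    # list of indexed URLs (collected beforehand) are not reproduced here.
    CACHE_URL = "https://cache.example.com/copy?url={url}"   # hypothetical template
    URL_LIST = Path("urls.txt")                               # one indexed URL per line
    OUT_DIR = Path("grabbed")

    def grab(url: str) -> None:
        """Fetch the cached copy of one URL and store it on disk."""
        cache_url = CACHE_URL.format(url=urllib.parse.quote(url, safe=""))
        req = urllib.request.Request(cache_url,
                                     headers={"User-Agent": "cache-grabber/0.1"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            html = resp.read()
        # Mirror the original path layout so the content can be re-served later.
        parsed = urllib.parse.urlparse(url)
        target = OUT_DIR / parsed.netloc / (parsed.path.lstrip("/") or "index.html")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(html)

    if __name__ == "__main__":
        for line in URL_LIST.read_text().splitlines():
            url = line.strip()
            if not url:
                continue
            try:
                grab(url)
            except Exception as exc:   # a missing cache entry should not stop the run
                print(f"failed {url}: {exc}")
            time.sleep(1)              # be polite to the cache service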
Thanks to the coordinated work back at the office, I didn't have to rush back, although I did spend a lot of time on the phone.
Monday.
By mid-Monday we had grabbed the key sites from the cache and finally managed to get through to the hoster, who told us that our rack (c) had not suffered much in the fire and there was a chance the data had survived. When that was relayed to me, I exhaled for the first time.
Tuesday.
On Tuesday I was back in the office, and in the morning we started bringing up a "plywood" version of the main site. By lunchtime visitors could already see the content in a decent design and follow links; when they tried to order something, they got a message about the incident and a request to call the store directly.
In parallel, we kept trying to get access to the data sitting in the data center. The problem was complicated by the distance (we are in Minsk, the DC is in Odessa) and by the fact that the hoster's billing system had been destroyed, so, strictly speaking, they did not know whose server was where (even though 2.5 days had passed since the accident). What saved us was that on Monday, while the trail was still hot, we had managed to agree that our hard drive would be handed over to us in exchange for a deposit and a written statement. There was no time to lose, so we started looking for someone who could handle all this on the spot. It turned out we had several candidates, but none of them would answer the phone. The owner of the largest store had already booked tickets for his assistant to fly to Odessa, and I was simply asking everyone I ran into: "do you know any reliable Linux guys in Odessa?". By some miracle a person was found (let's call him the Admin) who agreed to help us.
The first trip to the DC was unsuccessful: they said they would not hand over any hard drives to anyone. We had to phone again and remind them of their promises. On the second visit (in the evening) the drive was retrieved. In the state it was received it could not even be powered on, so it went straight to a workshop for urgent repairs.
Wednesday.
Fifteen hours and $250 later it came back to the Admin, who restored the file system and uploaded the data to our new server.
At that point everyone exhaled. By nightfall the flagship site was already running, and in the evening the first official statement from hosting.ua finally appeared on their website (that a fire had occurred and further information would be published there). By the end of Thursday most of the sites were back up, and we are gradually finishing the restoration work on the rest.
What have we learned?
1. Openness helps a lot. Being able to report on the emergency and our actions relatively quickly saved a lot of nerves for us and our clients. Had we stayed silent like hosting.ua, we would have lost almost all of them.
2. Keeping backups in physically different places is very important. We won't make that mistake twice: from now on backups live on different continents (in case of war). This insurance costs about $40/month; the losses from three days of downtime cost more than two years of such "insurance." (A sketch of such an off-site copy follows this list.)
3. You need an emergency plan, so that everyone knows in advance who does what. In our case we were lucky that I was reachable and had a laptop with all the passwords for managing the domains. Had we not been "lucky" in that respect, the consequences would have been far more dramatic.
4. A grabber is a good thing =)
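To make point 2 concrete: the "insurance" boils down to a nightly off-site copy to a machine in another location, driven by cron. The sketch below is a minimal illustration, not our actual script; the host names and paths are made up, and it simply wraps rsync over ssh:

    #!/usr/bin/env python3
    """Nightly off-site copy of the local backup directory (illustrative sketch).

    Run from cron on the production server, e.g.:
        30 4 * * * /usr/local/bin/offsite_backup.py
    """
    import subprocess
    import sys

    SOURCE = "/var/backups/daily/"                                  # where local dumps land
    DESTINATION = "backup@offsite.example.net:/srv/backups/prod/"   # made-up remote host

    def main() -> int:
        # rsync over ssh: archive mode, compression, and --delete so the
        # remote mirror stays an exact copy of the local backup directory.
        result = subprocess.run(
            ["rsync", "-az", "--delete", "-e", "ssh", SOURCE, DESTINATION],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Print to stderr so cron mails the failure to the admin.
            print(result.stderr, file=sys.stderr)
        return result.returncode

    if __name__ == "__main__":
        sys.exit(main())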
I hope this experience helps someone learn from our mistakes and join the category of those who already do backups, without any serious shocks.