
Availability of web projects - good night, ProductOwner

You are a ProductOwner responsible for a group of web projects. When the websites hang, become unavailable, or show Clients debugging output such as “Exception in object COrderController constructor ...”, people start calling your mobile phone, tweeting, and so on.

It is even more fun when you get pulled away ... in the evening at dinner, at other times during the execution of marital duties, or on vacation :-)
Let us look at the common cases and the key principles of keeping web projects available, and try to build a "Tranquil vacation" checklist.


Reducing the response time


Ideally, you should learn about a problem with a web project BEFORE a site User runs into it, and try to correct the situation. It is worse when tens (or thousands) of Users and your manager have already learned about the problem, discussed the “hanging” on Twitter, and taken a screenshot of the error.
It often goes like this:

“Thanks for telling us, we will restart Apache now ...” In other words, until you push, no one does anything.
As a result, the web project's downtime stretches to tens of minutes. You then have to explain to the Users why your cool, high-tech web project holding valuable Customer data was hanging.
There is a great, inexpensive business process that lets you react proactively. We do the following:
  1. Agree with the system administration service on setting up automated monitoring of your web projects. There is plenty of effective free software for this: we use nagios, many others use zabbix and similar tools. Setting up such software takes a few hours. What should be tested? The simplest case is the page load time of your web project and the presence of a unique signature on the page, for example a phone number in the footer.
  2. “Someone” has to respond when the monitoring system reports a problem. If “someone” is eating or smoking for half an hour, the web projects will hang for half an hour. Having the monitoring system send SMS messages to the sysadmins' mobile phones helps a lot. On mail.ru you can send SMS from your mailbox, although with restrictions: no more than once every half hour. If you buy a subscription to an SMS mailing service, messages are delivered faster and without restrictions. Setting up SMS delivery from the monitoring system to the system administrators' mobiles takes about 30 minutes. If you do not get approval for an SMS service, at least have the monitoring system send mail to all system administrators, to you and to your colleagues; you can then hope that one of the administrators is at their desk and responds.
  3. “Someone” has to react to a problem with your web project ... on weekends, at night, and on holidays (for example, in early January). I have often run into cases where web projects hanging over a weekend or the New Year holidays were discovered ... not immediately, but hours or days later, simply because none of the system administrators was at work :-) The remedy is to agree with the technical support service on an on-call rota outside working hours, so that there is always “someone” who can and, most importantly, must react.
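
The simplest check from step 1 (page load time plus a unique signature) can be sketched roughly as follows. This is not nagios or zabbix configuration, only a standalone illustration of what such a check does; the names `http_fetch` and `check_page`, the 5-second limit, and the example URL and phone number are assumptions made for the example.

```python
import time
import urllib.request


def http_fetch(url, timeout=10.0):
    """Download a page body as text using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_page(url, signature, max_seconds=5.0, fetch=http_fetch):
    """Return (ok, message) for one monitored page.

    The page is healthy only if it loads, contains the unique signature
    (for example, the phone number from the footer) and loads fast enough.
    `fetch` is injectable so the logic can be tried without a network.
    """
    started = time.monotonic()
    try:
        body = fetch(url)
    except Exception as exc:  # any failure to load counts as an outage
        return False, f"{url}: failed to load ({exc})"
    elapsed = time.monotonic() - started

    if signature not in body:
        # The server answered, but with an error page or debugging output.
        return False, f"{url}: signature {signature!r} not found"
    if elapsed > max_seconds:
        return False, f"{url}: loaded in {elapsed:.1f}s, limit is {max_seconds:.1f}s"
    return True, f"{url}: OK in {elapsed:.1f}s"
```

A call such as `check_page("https://shop.example/", "+7 495 000-00-00")` (hypothetical site and number) returns a pair that the notification step can turn into an SMS or an email. Checking the signature matters because a page showing an exception can still load quickly.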

At this point we can hope that a problem with your web project will DEFINITELY BE NOTICED by someone from the system administration service before the Clients notice it, or at the same time, and that this person will respond AT ANY TIME.
In serious organizations, you can address rapid response by trying to agree an SLA with the IT service that covers the tasks above.
I recommend placing the monitoring machine in a different data center from the one it watches. Data centers do lose power for several hours: this often happens with domestic hosting providers, and, as we remember, one Amazon data center where our machines were located recently “fell” hard. If the monitoring machine sits in the same data center, it shuts down too, and if the incident happens on a weekend, no one inside the company will know anything :-)

Proactive monitoring - outside


Your web project surely provides Clients with various services: sending keys, mail notifications about orders, file downloads, and so on. These services also need to be included in the monitoring system. The front page of the website may open while file downloads in the Clients' personal section do not work.
Therefore we require continuous monitoring of all N services of our web project. The hope is that when a notification such as “Order processing service is not working” arrives, we learn about the problem right away and those responsible have already begun dealing with it.
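
One possible way to picture monitoring of N services is a table of named probes, where each failed probe becomes a notification. The sketch below is an assumption-laden illustration: `run_service_checks`, the service names, and the stand-in probes are invented for the example, and real probes would place a test order, download a test file, and so on.

```python
def run_service_checks(checks):
    """Run every named check and return alert messages for the failures.

    `checks` maps a service name to a zero-argument callable that returns
    True when the service works. A probe that raises counts as a failure.
    """
    alerts = []
    for name, probe in checks.items():
        try:
            healthy = bool(probe())
        except Exception as exc:
            alerts.append(f"{name} is not working ({exc})")
            continue
        if not healthy:
            alerts.append(f"{name} is not working")
    return alerts


def failing_download():
    # Stand-in for a probe that tries to download a test file and times out.
    raise TimeoutError("no response from file storage")


# Hypothetical services; the lambdas stand in for real probes.
example_checks = {
    "Order processing service": lambda: False,
    "Mail notification service": lambda: True,
    "File download service": failing_download,
}
example_alerts = run_service_checks(example_checks)
```

Keeping one probe per service means the front page check can pass while the file download probe still raises its own alert.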

Proactive monitoring - from within


Web projects often break down ... gradually. Free disk space on the server shrinks, internal services responsible for backups and for operating under load stop working; no one reacted, although it was possible ...
It is therefore important that automated monitoring checks not only the availability of your sites but also the health of servers, services, databases, and so on. Make sure the system administration service does this, or starts doing it systematically.
Again, this is done with free software that can be set up fairly quickly.
As a result, we can hope that failures which indirectly affect the web projects are continuously monitored and corrected and do not accumulate to a critical mass: the server hardware, the “health” of hard drives, network routers, and so on are all checked.
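
As an illustration of a check "from within", here is a minimal sketch of a free disk space test. Monitoring packages ship their own versions of such checks; the function name `disk_space_alert` and the 10% threshold are assumptions made for the example.

```python
import shutil


def disk_space_alert(path="/", min_free_ratio=0.10, usage=None):
    """Return an alert message if free space on `path` is too low, else None.

    `usage` may be passed as a (total, used, free) tuple in bytes; by default
    it is read from the operating system with shutil.disk_usage().
    """
    usage = shutil.disk_usage(path) if usage is None else usage
    total, _used, free = usage
    free_ratio = free / total
    if free_ratio < min_free_ratio:
        return f"{path}: only {free_ratio:.0%} of disk space is free"
    return None
```

Run on a schedule, a check like this reports the shrinking disk days before backups or the database start failing because of it.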

Hands off - development is done separately


Terrible in its cynicism, yet common: developers change the code of the web project directly on the production (“combat”) servers, often breaking the project's functionality in the middle of the day and deleting (by accident, of course) site pages and data ...
For developers, the easiest way is to log in, fix or break something, and see the result immediately, together with the Client :-)
How to deal with this nightmare:

If the developers ask you for resources to build a subsystem for internal automated code testing (the technology of “continuous integration”, Continuous Integration, comes from roughly the same universe) and the team can be trusted, go ahead. This reduces the risk of breaking the project in places A, B, and C after a change to functionality D.
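
To give a feel for what automated code testing means in practice, here is a minimal sketch using Python's standard `unittest` module. The function `order_total` is a hypothetical piece of project code invented for the example; the point is only that such tests run automatically on every change and fail loudly when an edit elsewhere breaks the behaviour.

```python
import unittest


def order_total(prices, discount_percent=0):
    """Hypothetical project code: total price of an order after a discount."""
    subtotal = sum(prices)
    return round(subtotal * (100 - discount_percent) / 100, 2)


class OrderTotalRegressionTest(unittest.TestCase):
    """Pins down current behaviour so later changes cannot silently break it."""

    def test_plain_order(self):
        self.assertEqual(order_total([10.0, 5.5]), 15.5)

    def test_discount_is_applied(self):
        self.assertEqual(order_total([200.0], discount_percent=25), 150.0)

    def test_empty_order_costs_nothing(self):
        self.assertEqual(order_total([]), 0)


def run_suite():
    """Run the tests the way a build server would and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTotalRegressionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A continuous integration server would run such a suite on every commit and block the release whenever `run_suite().wasSuccessful()` is false.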
A fashionable tale has been going around lately: web projects are permanently in a raw (beta) state, it is no big deal if Clients find bugs in unfinished, hastily released functionality, and there is no need to keep a team of testers. In practice, if you are the one who has to sort through the Clients' emails about lost orders and other problems -… :-)

Datacenters and their number


Your web projects “live” on servers that are most likely located in a single, “very reliable” data center. Unfortunately, data centers break down: lightning strikes them, Uncle Vasya on an excavator cuts the power cables, the cleaner goes crazy and pours water on a server, and so on. In short, your web projects can become unavailable for anywhere from several hours to several days.
If you are ready to live with that, skip to the next section. Otherwise, the following scheme works well: it lets you survive a data center failure with minimal downtime (with some effort, downtime can be brought down to a few minutes).

Amazon provides an elegant and simpler solution to this problem of “fast migration between DCs”. Its data centers are connected by high-speed links, and machines can be “brought up” in another data center from fresh snapshots in minutes!

How to lose customer data forever?


Yes, of course everyone knows that a web project's data must be backed up. Most likely it is being done ... But have you tried restoring from those backups? :-) And do you know that the data may fail to restore because the archive copies are “corrupted”?
To combat irresponsibility and poorly organized backups, it is useful to agree with the IT unit on a plan of “restoration exercises”, in which a test restore of your web projects is performed on a separate machine, for example once a month.
Better still, add several tests to the monitoring system we have set up that will raise an alert if, for any reason, backup copies are not being made or cannot be read during recovery.
Breaking the backup process is easy even with a high level of professionalism. I have personally encountered a situation where everyone thought backups were being made, while in fact the disk had been full for a long time :-)
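
A monitoring test for backups could look roughly like the sketch below. It assumes, purely for illustration, that backups are stored as tar archives (optionally compressed) and that a fresh one should appear at least every 26 hours; the name `check_backup` and both assumptions are mine, not part of any particular backup tool. Reading an archive end to end is still weaker than a real test restore, so it complements the “restoration exercises” rather than replacing them.

```python
import tarfile
import time
import zlib
from pathlib import Path


def check_backup(path, max_age_hours=26, now=None):
    """Return a list of problems found with the backup archive at `path`.

    An empty list means the archive exists, is fresh enough, and every file
    inside it could be read to the end.
    """
    path = Path(path)
    if not path.is_file():
        return [f"backup missing: {path}"]

    problems = []
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"backup is stale: {age_hours:.1f} hours old (limit {max_age_hours})"
        )

    try:
        with tarfile.open(path) as archive:
            for member in archive:
                if member.isfile():
                    stream = archive.extractfile(member)
                    # Read all the data so truncation or corruption surfaces.
                    while stream.read(1 << 20):
                        pass
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        problems.append(f"backup unreadable: {exc}")
    return problems
```

A stale timestamp is exactly what the “disk was full for a long time” situation looks like from the outside: the backup job keeps running, but no new archive appears.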
It is worth asking the IT service: “How long will it take to recover a web project from a backup after data loss or a data center accident?” Technically, with replication in place (see above), this can be done in minutes (or tens of minutes). However, you may hear the following answer instead: “We will restore the data from the previous day, and the database will be loaded from the backup of hour 3” :-). Be careful.

Summary


With the will and a certain assertiveness, you can quickly set up a business process for monitoring and restoring the web projects entrusted to you, especially if they are hosted in the cloud. The technical possibility ... of a ProductOwner's permanent creative enjoyment is there :-)

Source: https://habr.com/ru/post/127062/

