Joint promotion with ISPsystem: a free ISPmanager 5 Lite license for all cloud VPS, plus a report on the incident of August 31, 2015

Our partners at ISPsystem suggested a joint promotion: give a free ISPmanager 5 Lite license to every cloud VPS in the Netherlands and the US until November. We thought, why not...

Especially since a big sale is on right now, so the VPS itself costs little more than the license. But that's not all: we decided to cut prices across the entire line, not just for the S and M servers, because we have put new storage into operation that runs exclusively on SSD drives. Cloud VPS are now faster and more productive and, most importantly, the service is stable again. Not long ago the cloud platform had serious problems caused by the SAN storage, and some of our subscribers were affected; the incident report is further down this post.

ORDER A CLOUD SERVER AT A MAGIC PRICE

S

Cores (vCPU) 1
Memory (vRAM) 1 GB
40 GB disk quota (SSD Storage)
1000 Mbps port
Premium traffic 4 TB
Cisco ASA 5500 firewall included!

~~$ 9.00~~ $ 3.99 / month

M

Cores (vCPU) 2
Memory (vRAM) 2 GB
60 GB disk quota (SSD Storage)
1000 Mbps port
Premium traffic 6 TB
Cisco ASA 5500 firewall included!

~~$ 19.00~~ $ 7.99 / month

L

Cores (vCPU) 4
Memory (vRAM) 4 GB
80 GB disk quota (SSD Storage)
1000 Mbps port
Premium traffic 8 TB
Cisco ASA 5500 firewall included!

~~$ 39.00~~ $ 19.99 / month

XL

Cores (vCPU) 8
Memory (vRAM) 8 GB
160 GB disk quota (SSD Storage)
1000 Mbps port
Premium traffic 10 TB
Cisco ASA 5500 firewall included!

~~$ 59.00~~ $ 32.99 / month
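For readers who want to compare the discounts across plans, here is a minimal sketch that works out the percentage reduction for each one. The prices are copied from the list above; the names `PLANS`, `discount_percent` and `discount_table` are made up for this illustration and are not part of any UA-Hosting or ISPsystem tool.

```python
# Monthly prices in USD copied from the plan list above: (old price, sale price).
PLANS = {
    "S": (9.00, 3.99),
    "M": (19.00, 7.99),
    "L": (39.00, 19.99),
    "XL": (59.00, 32.99),
}


def discount_percent(old_price: float, new_price: float) -> float:
    """Return the price reduction as a percentage of the old price."""
    return (old_price - new_price) / old_price * 100


def discount_table(plans: dict) -> dict:
    """Map each plan name to its discount, rounded to one decimal place."""
    return {
        name: round(discount_percent(old, new), 1)
        for name, (old, new) in plans.items()
    }


if __name__ == "__main__":
    for name, pct in discount_table(PLANS).items():
        print(f"{name}: {pct}% off")
```

By this calculation the S and M plans are reduced by a little over half, and the L and XL plans by a little under half.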

Clouds fall, and ours is no exception: a report on the cloud platform incident

Following the unfortunate incident that affected our Virtual Cloud Server / VPS service (S, M, L, XL),
and now that work to restore the service to correct operation is complete, we have decided to cover the matter publicly in more detail.

[ORIGINAL MESSAGE]
The problem first appeared on August 31, 2015 at about 20:00 CEST.
A fault was noticed in one of the storage platforms, which made virtual machines run unstably.

Data center staff, together with the equipment supplier, began a detailed analysis of the situation
as soon as increased load was detected on one of the nodes.
Work was carried out to reduce the load and restore correct operation of the platform.

[UPDATE SEPT 1st, 09:45 CEST]
After careful analysis by data center staff and the equipment supplier, the faulty
equipment was replaced at 9:00 CEST. However, contrary to calculations and expectations, the load did not fall
to normal, and work continued.

[UPDATE SEPT 1st, 12:15]
For the duration of the repair work, it was decided to limit the storage bandwidth in order to reduce the load.

[UPDATE SEPT 1st, 17:15 CEST]
The investigation is still underway and the cause of the failure has not yet been identified, but data center staff and
the equipment supplier are making every effort to get the platform running again as soon as possible.

[UPDATE SEPT 1st, 23:00 CEST]
The platform has been stabilized, and the engineers plan to bring all VPS back into working condition within a few hours.
To keep the platform stable while the work is in progress, clients' ability to
power on, power off, and restart their servers has been temporarily disabled to prevent the load from rising.

[UPDATE SEPT 2nd, 09:30 CEST]
The data center engineers worked all night to stabilize the platform.
Operation has been restored, and some of the affected VPS are back to normal. The remaining VPS are now recovering automatically. Engineers are also manually double-checking the performance of every VPS affected by this incident.

Plans have been announced to move to a different storage platform that is fully SSD-based.

[UPDATE SEPT 2nd, 14:00 CEST]
The platform's performance has been restored, and all affected VPS will be fully restored between 16:00 and 17:00 CEST today.

The migration of all VPS to the new storage platform will begin soon. The platform has already been tested, and preparations for the migration have begun.

[UPDATE SEPT 2nd, 15:30 CEST]
The high-load problem has recurred, affecting the performance of a significant
part of the VPS.

[UPDATE SEPT 2nd, 18:40 CEST]
The problem recurred at 15:30 CEST. After analysis and recovery work, the data center engineers and
the equipment supplier managed to stabilize the platform at 17:30 CEST.

Preparations for the migration to the new platform are complete, and the migration is scheduled to start after 20:00 CEST.

[UPDATE SEPT 3rd, 01:00 CEST]
As previously reported, the migration of VPS to the new SSD platform has begun.
The first batch of VPS has been migrated successfully, and data center staff are working to restore them to full working order.
According to the plan, restoring correct operation of the first batch of VPS on the new platform will take about two hours.

[FINAL UPDATE SEPT 3rd, 09:30 CEST]
Apology…
The VPS failures were caused by a hardware failure in part of the storage platform, which put excessive load on the nodes and led to errors in VPS operation.
The move to a faster and more reliable all-SSD storage had already been planned, and this incident only accelerated it.
Most of the servers have already been migrated to the new platform.
Within an hour, the restrictions on VPS controls (reboot, power off / power on) will be lifted, along with the limits on resource consumption that we were forced to apply to ensure a smooth migration.
The remaining VPS will be migrated next. Each VPS pool will take approximately
4 hours, and each period of downtime will not exceed 5-15 minutes.

This incident is not typical for us or for our partners (service providers).
We, the data center staff, and its equipment supplier have made every effort to minimize VPS downtime and, with it, our customers' losses.

We apologize once again for this incident. All affected subscribers have received compensation in the form of free service for up to 3 months. We hope for your understanding: hardware failures in new products are very difficult to rule out completely, and no one is immune to mistakes. Every cloud falls sooner or later; what matters is the measures taken. Make backups and build in redundancy. For our part, we will try to make the service as stable as possible.

Yours sincerely, UA-Hosting team.

Source: https://habr.com/ru/post/267019/
