
Two months ago, we began a public beta test of web hosting on the Jet9 platform. During this time, with the help of the test participants, we verified the operation of the platform's subsystems: the fault-tolerant cluster, the CDN and web accelerators, and the web hosting environment for sites, and we gathered feedback on how users interact with the platform. In some cases the expected results were confirmed; in others, the test revealed exactly the kind of deficiencies we had hoped to find. In parallel, we optimized the web hosting environment for typical PHP/MySQL sites and improved the operation of user web containers.
A week ago the testing was completed. We have summarized the results, and Jet9 hosting is now in production, with all services and the declared SLAs available to customers.
Test results
We described Jet9's design in general terms in the article "Testing Jet9 - fault-tolerant hosting of sites with geographic optimization". The platform consists of three layers:
- frontends - a geographically distributed network of web accelerators
- backends - the working environment for hosted sites and applications
- failover clusters - with nodes in different data centers
In simplified form, the scheme is: site visitors connect to the nearest frontend (web accelerator), the frontend forwards requests to the backend serving the site, and the backends run on failover clusters spanning two data centers.
Frontends are distributed throughout Russia and other countries. Each frontend serves the site visitors closest to it and either returns data from the web accelerator's cache or forwards the request to the backend serving the site.
The backends run user web containers, each holding one or more sites. A web container is allocated server resources according to the tariff; the Apache, MySQL, PostgreSQL or other applications inside it belong to the user.
Backends run on high-availability clusters in a master-backup scheme, where each cluster node is located in a separate, independent data center. Cluster nodes either share common network connectivity (HA SLA Standard) or are located in different autonomous systems (HA SLA Business and HA SLA Corporate).
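To make the request path more concrete, here is a minimal sketch of the routing logic a frontend could apply: answer from the accelerator cache when possible, otherwise forward the request to the master backend and fall back to the backup if the master is unreachable. The hostnames, the cache dictionary and the function itself are hypothetical illustrations, not the actual Jet9 implementation.

```python
# Minimal illustration of frontend routing: cache first, then master, then backup.
# The backend addresses and the in-memory "cache" are placeholders for this sketch.
import urllib.request
import urllib.error

MASTER = "http://backend-master.example.net"   # hypothetical master backend
BACKUP = "http://backend-backup.example.net"   # hypothetical backup backend

cache = {}  # simplified stand-in for the web accelerator cache

def handle_request(path, timeout=2.0):
    """Serve a request the way a frontend node might."""
    if path in cache:
        return cache[path]                      # cache hit: no backend involved
    for backend in (MASTER, BACKUP):            # master first, backup on failure
        try:
            with urllib.request.urlopen(backend + path, timeout=timeout) as resp:
                body = resp.read()
                cache[path] = body              # populate the accelerator cache
                return body
        except (urllib.error.URLError, OSError):
            continue                            # backend unreachable: try the next one
    raise RuntimeError("both master and backup backends are unavailable")
```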
We already use these subsystems separately: to operate internal TrueVDS services, to provide clients with services such as High Availability Virtual Servers, and as components deployed and maintained in customer projects. In Jet9, all three subsystems were refined and integrated into a single platform. They are managed and interact automatically, both in normal operation and during failures. The site owner or application developer does not need to understand how this is implemented internally, or how to configure and maintain it. Testing therefore had two goals: to evaluate how convenient the hosting is for ordinary users, and to verify the assembled platform in normal and emergency modes.
Convenience of working with hosting for ordinary users
As the default control panel we chose ISPManager 5, which is popular and familiar to many. With plug-ins we integrated it with our platform and gave users a familiar site management interface. All the complex mechanics of setting up a web container, operating the cluster, and placing and configuring a site in the CDN are hidden behind the standard form for adding a site, which consists of a domain name field and an "Add" button. For typical PHP sites on 1C-Bitrix, UMI.CMS, WordPress, Drupal, etc., no further action is required.
Since, to users, the system looks no different from regular hosting, testing raised no difficulties or additional questions. Sites were created and hosted in the usual way, and the CDN and web accelerators were transparent to both site administrators and visitors. There were still flaws, though: some details were missing from the emails we sent out, for example FTP access credentials, and it was hard to find information about the DNS servers in use. The missing information was added to the email templates and to the FAQ. Judging by the current size of the FAQ, not many questions have come up, and working with the sites has not caused users any difficulties.
Platform Test
The high availability clusters are built on a combination of Pacemaker, Corosync and DRBD. We have been using this scheme for a long time and have studied its behavior in various situations quite well. For the clusters, the presence or absence of frontends makes no difference; their logic does not depend on it. For the frontends, however, it matters that the backends run on master-backup clusters: a frontend must route requests to the correct backend and must react correctly when the backend moves from the master to the backup node.
We therefore tested the platform both in normal operation and while simulating failures in the cluster and on the frontends. We also observed the system's behavior in real emergencies: when the channel between data centers disappeared and when connectivity degraded in one of the uplink directions.
Interaction of the subsystems in normal mode
In normal mode, there are three types of processes that involve all layers of the platform:
- creating and deleting user accounts
- creating and deleting sites
- serving requests to sites
These processes break down into separate operations for each subsystem (a sketch of the "adding a site" flow follows the list):
- Account creation
  - Backend
    - creating a user web container with dedicated resources and isolation from other containers
    - setting up the web environment for the selected preset (for example, LAMP)
- Adding a site
  - Backend
    - creating the site environment in the user's web container
    - generating a DNS zone with records for geo-acceleration
    - registering the site's web container (discovery) on the backend
  - Frontends
    - assigning the frontends that will serve the site
    - placing the domain's DNS zone
    - binding the domain on the frontends to the master and backup cluster nodes
- Request to a site
  - Frontends
    - returning content to the visitor from the frontend closest to them
    - determining the state of the backend and sending the request to the master or the backup, depending on availability
  - Backend
    - passing the request on to the user's web container
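To tie these steps together, here is a runnable sketch of how the "adding a site" flow could be orchestrated. Every function is a stand-in for a real subsystem call; all names, identifiers and printed messages are invented for this example and do not reflect the actual Jet9 code.

```python
# Purely illustrative sketch of the "adding a site" flow described above.

def backend_create_site_environment(account, domain):
    print(f"[backend] creating site environment for {domain} in container of {account}")

def backend_generate_dns_zone(domain):
    print(f"[backend] generating DNS zone for {domain} with geo-acceleration records")
    return {"domain": domain}

def backend_register_container(account, domain):
    print(f"[backend] registering (discovery) web container for {domain}")

def frontends_assign(domain):
    print(f"[frontends] assigning frontends to serve {domain}")
    return ["fe-1", "fe-2", "fe-3"]          # hypothetical frontend identifiers

def frontends_place_zone(zone, frontends):
    print(f"[frontends] placing DNS zone for {zone['domain']} on {frontends}")

def frontends_bind_domain(domain, frontends, master, backup):
    for fe in frontends:
        print(f"[{fe}] binding {domain} to master={master}, backup={backup}")

def add_site(account, domain, master, backup):
    """Orchestrates the per-subsystem steps from the list above."""
    # Backend steps
    backend_create_site_environment(account, domain)
    zone = backend_generate_dns_zone(domain)
    backend_register_container(account, domain)
    # Frontend steps
    frontends = frontends_assign(domain)
    frontends_place_zone(zone, frontends)
    frontends_bind_domain(domain, frontends, master, backup)

add_site("user1", "example.com", master="dc1-node", backup="dc2-node")
```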
Before the public beta we had run internal tests, during which most shortcomings were eliminated. But the beta test participants helped us discover two more bugs: one in our platform, the other at some domain registrars.
Our bug occurred when binding a domain to the frontends: under a race condition the operation was marked as completed even though the domain had not yet been attached. The error did not affect domains that were already bound and showed up quite rarely, only at the moment a site was being connected. Thanks to the users who placed several dozen domains in the first days of testing, it was quickly detected and fixed. There were no other errors in the platform's operation.
The registrar bug shows up during delegation of a domain to new name servers, when the registrar verifies the zone and the DNS servers. In our scheme, to improve reliability, each DNS server name had several A records. Some registrars handled such a naming scheme correctly when verifying the delegation and completed it without errors. Others refused to delegate the domain, claiming an incorrect configuration. They probably take only the first of the several A records during the check, and if it turns out to be the same for all servers, they treat this as an error. Although the bug is in the registrars' DNS server checks, to simplify life for customers of such registrars we had to fall back to the old naming scheme, with a single A record per server.
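To illustrate what a correct check looks like, here is a small sketch that resolves a domain's NS names and prints every A record behind each of them, which is what a delegation check should take into account. It assumes the third-party dnspython package; the domain name is a placeholder.

```python
# Sketch of a delegation check that looks at *all* A records of each NS name,
# not just the first one. Requires dnspython (pip install dnspython).
import dns.resolver

DOMAIN = "example.com"   # placeholder domain

for ns in dns.resolver.resolve(DOMAIN, "NS"):
    ns_name = str(ns.target).rstrip(".")
    addresses = [str(a) for a in dns.resolver.resolve(ns_name, "A")]
    # A registrar that takes only addresses[0] may wrongly conclude that
    # two NS names point at the same server, even though each name
    # actually has several distinct A records.
    print(f"{ns_name}: {', '.join(addresses)}")
```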
Response to various types of failures
We tested the platform's response to the following types of failures:
- power loss on the master server
- failure of internal routing in the primary data center
- partial failures in uplink routing
- failure of the communication channel between the cluster's master and backup
- failure of the backup server
- failure of one of the frontends
In each case, the platform has to determine whether a problem has actually occurred and whether any reaction is required, and, if the problem is real, change the topology to eliminate it. When the cluster master fails, this means not only migrating the backend to the backup server, but also reconfiguring the frontends to redirect requests to the new backend.
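As a rough illustration of that sequence, here is a minimal sketch of handling a master failure: confirm that the problem is real, switch the active backend to the backup, then rebind the frontends to the new backend. All names and data structures are invented for this example; in reality the cluster part is handled by Pacemaker, Corosync and DRBD and the platform's own tooling.

```python
# Purely illustrative sketch of master-failure handling:
# confirm the failure, promote the backup, then repoint the frontends.
import time

def probe(node):
    return node.get("up", False)   # placeholder health check

def is_alive(node, checks=3, interval=1.0):
    """Poll a node several times before declaring it dead, to avoid
    reacting to a single lost probe."""
    for _ in range(checks):
        if probe(node):
            return True
        time.sleep(interval)
    return False

def handle_master_failure(cluster, frontends):
    if is_alive(cluster["master"]):
        return "no action needed"              # the problem did not really occur
    # Migrate the backend to the backup node (done by Pacemaker/DRBD in reality)
    cluster["active"] = cluster["backup"]
    # Reconfigure every frontend to send requests to the new backend
    for fe in frontends:
        fe["backend"] = cluster["active"]["address"]
    return "failed over to backup"

cluster = {"master": {"up": False, "address": "dc1-node"},
           "backup": {"up": True, "address": "dc2-node"}}
frontends = [{"name": "fe-1", "backend": "dc1-node"},
             {"name": "fe-2", "backend": "dc1-node"}]
print(handle_master_failure(cluster, frontends))
```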
All of these situations were handled correctly, and service was restored within the planned time: up to one and a half minutes in the most difficult cases. For Internet services built on a master-backup cluster without frontends, even with an optimally chosen DNS and IP routing setup, a failure of the master leads, if not to a complete loss of service, then to a noticeable degradation of quality for a period on the order of hours. Adding the frontend layer to the Jet9 platform, among other things, reduced the time of service degradation during failures to a few minutes.
Improving the web hosting environment
In parallel with the beta testing of the Jet9 platform, we improved the user web containers it runs on. In addition to guaranteed resources and load isolation between users, we implemented the ability to fully manage the software used inside a container:
- run your own httpd and mysqld
- choose arbitrary versions of the programs (supported and updated by us)
  - PHP 5.2, 5.3, 5.4, 5.6, 7 RC2
  - MySQL 5.6, MariaDB 10.0, Percona 5.6
  - PostgreSQL 8.4, 9.4
- configure Apache via httpd.conf
  - switching modules on and off
  - tuning the number of httpd and FastCGI processes
- enable, disable or configure PHP modules
- configure the MySQL server
Among other things, this allowed us to optimize the web environment for maximum performance on CMSs that place heavy demands on PHP.
Putting the service into operation
Public beta testing confirmed the correct operation of the Jet9 platform and uncovered several shortcomings. Many thanks to the test participants for their help! After improving the web containers and eliminating the identified shortcomings, Jet9 web hosting went into production on September 7, and we are now accepting orders.
In addition to hosting on a failover cluster, we also added hosting plans with the usual level of reliability (Standard Server). They run on the same optimized and managed web containers with guaranteed resources and scaling, and are integrated with the network of geographic optimization and web accelerators. But thanks to the simpler configuration and the absence of duplicated hot-standby equipment, their cost is much lower and matches average market prices for VPS or dedicated servers of similar capacity.
Once again we thank all the testing participants and remind you that, to receive the discounts for taking part in the testing, it is enough to mention the login used during testing in the comments of your order, or to send this information in a support ticket.