
We have built Jet9, a platform for running web applications, and are now running a public beta test of a web hosting service built on this platform. In this article we describe what it is, what problems it solves, and how everything is organized.
In subsequent articles we will cover Jet9's internals in more detail: the technical solutions used for its various components, the pitfalls we encountered, and how to eliminate or work around them.
The purpose of these publications is to involve specialists in testing and collect bug reports, to inform potential clients about the project, and to share experience with colleagues. As the articles appear here, we will also publish them on our site.
What is Jet9
We call the service "Failover Hosting with Geo-Optimized Sites". Despite its verbosity, the name reflects only a small part of the platform's capabilities, though the most visible one.
The main functions of Jet9 are increased fault tolerance, an integrated CDN/ADN, and guaranteed resource allocation over a wide range. All of this comes in a single ready-made solution, without the customer having to wire together a large stack of components or change the site's architecture. The result is stable, fast operation of the site with a minimum of downtime or degradation. The solution is aimed at web projects that already have these requirements but for which implementing and maintaining them in-house would be too difficult or too expensive.
The private installation of the platform (Private Jet9) is designed for small and medium-sized projects that need anywhere from a few servers to a few hundred. The web hosting service (PaaS Jet9) offers both minimal plans for small, low-traffic sites and large plans, taking up nearly all the resources of a powerful server, for resource-intensive, high-load sites: up to several hundred requests per second and hundreds of thousands of visitors per day.
The Jet9 web hosting runs on standard
TrueVirtual V8 and
TrueVirtual T4 servers with network storage and a local SSD cache.
Development involved several pieces of know-how and patent-pending inventions, but the bulk of the work consisted of extensive research, painstaking engineering to connect many components and refine them, programming the missing pieces, long testing of behavior under various combinations of conditions, documenting everything at every stage, and drawing up procedures for routine and emergency operations.
How Jet9 works
Web environment

Control Panels and Web Stacks
The user interface is a control panel designed for shared web hosting; currently ISPManager 5 is used. SSH, SFTP and FTPS are available for automated deployment.
The web environment currently corresponds to the common LAMP set: Linux, Apache, MySQL, PHP. In addition to plain CGI, you can run FastCGI applications (Perl, Python); that is, everything available on regular shared hosting. Private Jet9 also supports the Unicorn, Thin and Puma application servers for Ruby on Rails, Tomcat and Jetty for Java/JavaEE, WSGI applications in Python, and PostgreSQL, MongoDB and CouchDB. On the web hosting, at the current testing stage, these stacks are not available: LAMP only.
Resource Accounting and Load Isolation
Long ago we implemented accounting and isolation of hosting clients on FreeBSD 4.1. That required many patches to the kernel, to Apache and to some system utilities, and a lot could be written about it. Today the story is short: cgroups for the different memory classes, for the CPU and for disk operations; rlimits on processes and open files. Each user gets their own Apache instance, which simplifies managing user privileges for the web server and scripts, and simplifies control over resource consumption.
Kernel modifications are needed only for more flexible access control and additional user isolation, and are applied only to the Jet9 web hosting.
Deviations from ideal load sharing via cgroups are compensated by mandatorily setting aside at least 10% of all resources as a shared reserve, which gives each user up to 10% of headroom above their plan limits.
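As an illustration only (the group names and limit values here are invented, not Jet9's actual configuration), per-user limits of this kind could be described for libcgroup's cgconfig.conf roughly like this:

```
# Hypothetical per-user limits in a shared-hosting cgroup hierarchy
group hosting/user1001 {
    memory {
        memory.limit_in_bytes = "536870912";   # 512 MB hard memory limit
    }
    cpu {
        cpu.shares = "512";                    # relative CPU weight
    }
    blkio {
        blkio.weight = "200";                  # relative disk I/O weight
    }
}
```

The per-user Apache instance would then be started inside these groups, so the web server and all of the user's scripts are accounted and limited together.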
Backups
Backups are stored in an independent long-term archival storage system located in a third, geographically remote data center. Incremental copies are made daily, at the file-system level via rsync. Block-device snapshots are not used because, if the file-system metadata were damaged, both the replica in the cluster and the backup copy would be damaged.
Archive rotation follows a multi-cycle scheme that thins out older copies so that fewer old and more recent copies are kept. For example, with seven archives stored, they will be of approximately the following ages: 1 year, 6 months, 3 months, 1 month, 6 days, 2 days, 1 day.
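One way to implement such thinning is a sketch like the following; the doubling age buckets here are our own simplification, not the article's exact 1d/2d/6d/1m/3m/6m/1y schedule. It keeps at most one snapshot per bucket, so old copies grow sparser with age:

```shell
# Keep at most one snapshot per doubling age bucket:
# [1,2) days, [2,4), [4,8), [8,16), ...
keep_ages() {
    # stdin: snapshot ages in days, one per line, sorted ascending
    # stdout: the ages to keep
    last_bucket=-1
    while read -r age; do
        bucket=0
        b=1
        while [ "$age" -ge $((b * 2)) ]; do
            b=$((b * 2))
            bucket=$((bucket + 1))
        done
        if [ "$bucket" -ne "$last_bucket" ]; then
            echo "$age"
            last_bucket=$bucket
        fi
    done
}

# Eight snapshots collapse to one per age bucket:
KEPT=$(printf '%s\n' 1 2 3 5 9 17 40 100 | keep_ages | xargs)
echo "$KEPT"    # → 1 2 5 9 17 40 100
```

Running this daily after the rsync pass and deleting the snapshots it does not list yields the desired thinning automatically.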
Failover Cluster

Each web backend runs on top of an HA cluster with replicated storage. The two sides of the cluster, master and backup, are located in two independent data centers. Only one side can be active at a time: either the master or the backup. For web hosting, the cluster policy forbids split brain, the situation where both master and backup run simultaneously. This policy follows from the requirement to guarantee strict consistency. Private installations can use other policies that allow split brain, to ensure maximum service availability even at the cost of data inconsistency.
Each side of the cluster has its own storage, against which all work is done; its changes are replicated in real time from master to backup, to the other data center and availability zone. For us this is more convenient than the alternative, a shared storage distributed across both data centers. Building a cluster on replicated storage is generally much more complicated than on shared storage, but it offers a significant advantage: lower latency requirements for the link between data centers and significantly lower bandwidth requirements, and as a result the ability to build higher-performance systems. We currently have three data centers: two of them have a direct link, and both are connected to the third over the Internet. HA clusters run both on master-backup pairs connected by the direct link and on pairs connected over the Internet.
When we first started using Pacemaker for internal services, it ran on top of heartbeat, and we added our own arbitration mechanisms to protect against split brain. In Jet9 we switched to Pacemaker and Corosync with quorum. Pacemaker is a good, powerful product, but it has many inconveniences and quirks that complicate its use with a large number of clusters and on unreliable or complex networks. We have therefore developed our own cluster controller, better suited to our tasks. It is still insufficiently battle-tested, so in production we continue to use Pacemaker.
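As a sketch of what such a configuration might look like in crmsh syntax: the resource names are invented, and DRBD is our assumption for the replicated storage layer, since the article does not name the replication technology. Stopping resources when quorum is lost corresponds to the no-split-brain policy described above:

```
property no-quorum-policy=stop
primitive p_drbd ocf:linbit:drbd params drbd_resource=web \
    op monitor interval=15s
ms ms_drbd p_drbd meta master-max=1 clone-max=2 notify=true
primitive p_fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/var/www fstype=ext4
colocation col_fs_on_master inf: p_fs ms_drbd:Master
order ord_fs inf: ms_drbd:promote p_fs:start
```

The colocation and ordering constraints ensure the file system is only mounted on the side currently holding the storage master role.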
The master and backup use different IP addresses from differently routed networks, so that a routing failure cannot make both the master and the backup unavailable. This is more reliable than a migrating IP address, which creates a serial chain of two points of failure: external routing (BGP) and internal routing (OSPF).
In Jet9 web hosting, local storage is cached on fast SSDs using bcache in writeback mode.
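For reference, a minimal bcache setup of this kind might look as follows; the device names are invented examples, these commands require root and the actual block devices, and the UUID placeholder must be filled in from the output of make-bcache:

```
make-bcache -C /dev/sdb              # register the SSD as a cache device
make-bcache -B /dev/md0              # register the backing device
# attach the backing device to the cache set by its UUID
echo "<cache-set-uuid>" > /sys/block/bcache0/bcache/attach
# switch from the default writethrough to writeback caching
echo writeback > /sys/block/bcache0/bcache/cache_mode
```

Writeback mode acknowledges writes once they hit the SSD and flushes them to the backing store in the background, which is what makes slower network storage usable for a loaded web backend.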
Geographic optimization

All sites are served through a network of web accelerators: a geographically distributed network of caching web servers that deliver site content at maximum speed and cache it for repeated delivery. Requests to a site are served by the web accelerator closest to the visitor, so all pages open much faster.
Unlike CDNs for static files, which require modifying site code so that files can be uploaded to them, the Jet9 CDN works with websites transparently and without alteration, fetching and delivering all content itself. The CDN and web accelerators are connected to the HA cluster automatically when a site is created and require no DNS or site configuration.
An additional advantage over foreign services is proper coverage in Russia: the mirror closest to Tyumen is not in the Netherlands but in Yekaterinburg. The Jet9 test installation uses a small network: the UK, Moscow, St. Petersburg and Novosibirsk. In production, Rostov-on-Don, Samara, Yekaterinburg and the Netherlands are added. The Far East remains an open question: the large imbalance between the cost of connectivity and the population makes it economically unjustified for now, but we will keep working on it.
Geographic balancing uses a hybrid scheme: DNS anycast plus speed and distance calculations on the DNS servers. Squid is used as the reverse proxy, with Nginx in front of it for SSL.
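The Nginx-in-front-of-Squid arrangement for SSL might look roughly like this; the server name, ports and certificate paths are invented examples, not Jet9's actual configuration:

```
# Nginx terminates SSL and hands plain HTTP to a local Squid accelerator
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/ssl/example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/example.com.key;

    location / {
        proxy_pass http://127.0.0.1:3128;   # Squid in accelerator mode
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Squid then does the actual caching and forwards cache misses to the site's backend in the HA cluster.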
Testing
For us, the main goal of this test is to find problems in the domain-management API on the frontends and backends, in the integration of ISPManager with our web environment on the backends, and in the integration of ISPManager with the web accelerators and geo-balancing on the frontends. Planned testing period: 1-2 months.
We do not expect problems in the HA clusters and web accelerators themselves, since we have been using them in production for a long time. But support for the web hosting control panel, the domain-management API on the frontends and backends, and the frontend-backend integration were only completed this spring, so there is a fair chance of problems we have not found ourselves.
During testing we will regularly, with prior notice, stage failures at various points to check that the automatic response to them is correct and to see how they affect the operation of the sites.
Testing participants will receive an additional 10% discount for 2 years on all Jet9 products: both web hosting and licenses for private installations.
To join the testing, simply place an order at
jet9.ru with the required amount of resources and your details. There is no need to mention testing separately; at the moment all applications are automatically processed as test ones.
Besides direct testing, we are very interested in tricky technical questions: if there is a question we cannot answer, a bug may be hiding behind it. In the follow-up articles, where we will cover each component in more detail, we will also include the questions asked here, together with the answers.