In a startup, everything is focused on winning the market. Every effort should go into what is needed here and now. This also applies to server infrastructure. A fleet of backup servers in geographically remote data centers is, of course, cool and reliable. But when you only have dozens of customers, what's the point?
We proceeded from the same approach when we started developing the
Okdesk cloud service. The product saw the light of day on a "minimally viable infrastructure": a virtual machine in a Western data center running both the application and the DBMS (over time, the search engine moved to a separate virtual machine). Around this setup we arranged minimal monitoring via ping-admin and regular backups to another provider's cloud.
In this article: the reasons for the move, choosing how to switch between data centers, choosing a data center, and the first results.
With this infrastructure, we held out "without falling over" and achieved
99.97% service availability (about 2 hours of downtime per year, including planned maintenance), and in all that time we never received a single customer complaint about slowness or outages.
But as the client base grew, it was not only revenue and the power of the virtual machine that grew. Nervous tension grew too. Sleep became more and more uneasy, and even while out walking with the kids, the internal on-duty engineer was waiting for a phone alert about an infrastructure problem. When the number of active Okdesk clients approached a hundred, it became clear we could not go on like this. The "minimally viable infrastructure" had exhausted itself: a serious infrastructure problem could mean 2 hours or more of downtime. And it is one thing when the server is down for 2 hours for ten clients, and quite another when hundreds of clients have no access. The first can somehow be survived; after the second, restoring our reputation would be much harder than restoring the service.
So we decided to build a more reliable infrastructure for Okdesk. Here is what came of it...
Business requirements come first
Where do you start designing a cloud service's infrastructure? No, not with a survey of modern approaches and technologies :) You start by defining business requirements.
Okdesk is a SaaS for managing customer requests in service businesses, which makes it a business-critical system for its users. Based on this, we formulated the following requirements:
- in case of problems in the data center (both local, when a specific server fails, and global, when access to the entire data center is degraded or lost), the downtime from the client's point of view must not exceed 15 minutes. A slight performance degradation while problems in the main data center are being resolved is acceptable;
- in case of an accident in which the production DBMS is irretrievably lost, no more than a few seconds' worth of customer data preceding the accident may be lost.
These two requirements guided the design of the infrastructure and the choice of the final solution.
From requirements to implementation
General approach
We did not reinvent the wheel and took the following approach as a first step. In the main data center, the application and the DBMS run on two physical servers. The backup data center has one physical server hosting both the database and the application. Replication is configured between the DBMS in the main data center and the DBMS in the backup one. In the event of an accident in the main data center, we redirect traffic to the backup data center (see the next section); according to the plan, switching traffic should take no more than 5 minutes (the decision to switch is made by a person, so this time does not include incident detection and response).
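For illustration: keeping the second business requirement (losing no more than a few seconds of data) honest means keeping an eye on replication lag on the standby. The article does not say which DBMS we use, so the sketch below assumes PostgreSQL streaming replication; the host name, credentials and threshold are hypothetical.

```python
# Minimal replication-lag check, assuming PostgreSQL streaming replication.
# Host, credentials and the 10-second threshold are illustrative assumptions.
import os

import psycopg2

LAG_THRESHOLD_SECONDS = 10  # alert if the standby falls more than ~10 s behind

conn = psycopg2.connect(
    host="replica.backup-dc.internal",  # hypothetical standby in the backup DC
    dbname="okdesk",
    user="monitor",
    password=os.environ.get("PGPASSWORD", ""),
)
with conn.cursor() as cur:
    # On a standby this shows how far the replayed data is behind "now".
    cur.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
    )
    lag = cur.fetchone()[0]
conn.close()

if lag is None:
    print("Nothing replayed yet (or this server is not a standby)")
elif lag > LAG_THRESHOLD_SECONDS:
    print(f"WARNING: replication lag is {lag:.1f} s")
else:
    print(f"Replication lag OK: {lag:.1f} s")
```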
As the next step, when the load on the application requires more powerful servers (in about 4-6 months by our estimate), we plan to move to a cluster model (a load balancer plus application nodes). A single powerful server would be cheaper, but if that one large server fails, traffic has to be switched to the backup data center (several minutes of service unavailability), whereas if one small node fails, the most customers might notice is a temporary, non-critical performance degradation. In the backup data center it is exactly the opposite: as we grow, we plan to simply increase the capacity of the single backup server, since the risk of the main data center and the backup failing at the same time is minimal, so it is acceptable to save a little there.
How to switch between data centers?
As we found out, there are three common ways to quickly switch the IP address that a domain name points to:
- VRRP;
- DNS hosting with a short TTL;
- web services (such as Cloudflare).
Let's look at the pros and cons of each method.
VRRP
Pros:
- fast switching;
- low cost.
Cons:
- not offered by all data centers;
- configured by the data center, so if changes are needed, the process may drag on;
- it is unclear how well redundancy is handled.
DNS hosting with a short TTL (for example, Amazon Route 53)
Pros:
- full control over the switching logic;
- low cost;
- the zone is delegated to several DNS servers, so there is real redundancy.
Cons:
- some DNS resolvers ignore the TTL set by the zone owner. According to our information, the number of such stubborn servers is very small, but they do exist (i.e. for some users switching between IPs may not happen quickly); a quick way to check what TTL a given resolver actually returns is sketched just below.
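As a sanity check of that last point, you can ask a particular public resolver what TTL it actually returns for your record. A minimal sketch using the dnspython library; the domain and the resolver address are placeholders.

```python
# Ask a specific resolver what TTL it reports for an A record.
# The domain and resolver IP below are illustrative placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]  # the resolver you want to inspect

answer = resolver.resolve("app.example.com", "A")  # hypothetical domain
print("Remaining TTL reported by the resolver:", answer.rrset.ttl)
for rdata in answer:
    print("A record:", rdata.address)
```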
Web services (using Cloudflare as an example)
Pros:
- easy setup;
- almost instant switching between IP addresses.
Cons:
- it adds another potential point of failure; in our view, a conditional Cloudflare is more likely to fail than Amazon's DNS or VRRP;
- since all requests to the service would pass through Cloudflare, and our users are mostly located in the CIS, this would increase response times from the server.
In the end we settled on Amazon Route 53. The solution is not perfect, but you always have to sacrifice something; meanwhile, we will keep exploring other options.
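For illustration, switching traffic with Route 53 comes down to updating an A record with a short TTL so that it points at the backup data center. A minimal sketch using boto3; the hosted zone ID, domain name and IP address are hypothetical.

```python
# Point the application's A record at the backup data center via Route 53.
# Zone ID, domain, IP and TTL below are illustrative assumptions.
import boto3

route53 = boto3.client("route53")

response = route53.change_resource_record_sets(
    HostedZoneId="ZXXXXXXXXXXXXX",  # hypothetical hosted zone ID
    ChangeBatch={
        "Comment": "Fail over to the backup data center",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com.",  # hypothetical domain
                    "Type": "A",
                    "TTL": 60,  # short TTL for fast failover
                    "ResourceRecords": [
                        {"Value": "203.0.113.20"}  # backup DC IP (placeholder)
                    ],
                },
            }
        ],
    },
)
print("Change status:", response["ChangeInfo"]["Status"])
```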
Monitoring
For historical reasons, our monitoring consists of the following components:
- Ping-admin: makes HTTP requests to specified addresses and, if something is wrong with the response, sends SMS messages and calls the specified numbers. It is inexpensive and easy to set up. Not bad as a starting minimum: we can recommend it to anyone who is just launching and wants to always know whether their service is up (a minimal sketch of such a check in code appears below);
- Monit: installed on the server, it monitors basic metrics: free disk space, CPU load and free RAM. Free, but requires minimal knowledge of Unix systems and the ability to read manuals;
- Scoutapp: profiles the application for performance and bottlenecks in the code. It lets you analyze data over different time slices and drill into individual requests;
- a client-server utility (added after Monit): it tracks a large number of server metrics, can track changes in business parameters (for example, the number of objects in the database over a period of time), draws nice graphs and sends notifications.
Some of these monitoring functions overlap, but this is probably a case where more is better than less.
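For those who want to start with something even simpler than an external service, here is a minimal sketch of the kind of HTTP check Ping-admin performs; the URL and the expected keyword are hypothetical, and alerting is reduced to print statements.

```python
# A bare-bones availability check in the spirit of Ping-admin.
# URL, keyword and timeout are illustrative assumptions; real alerting
# (SMS, phone calls) would replace the print statements.
import requests

URL = "https://app.example.com/health"  # hypothetical endpoint
EXPECTED_KEYWORD = "ok"                 # expected fragment in the response body

try:
    response = requests.get(URL, timeout=10)
    if response.status_code != 200:
        print(f"ALERT: {URL} returned HTTP {response.status_code}")
    elif EXPECTED_KEYWORD not in response.text.lower():
        print(f"ALERT: {URL} responded, but the body looks wrong")
    else:
        print(f"OK: {URL} is up ({response.elapsed.total_seconds():.2f} s)")
except requests.RequestException as exc:
    print(f"ALERT: {URL} is unreachable: {exc}")
```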
Choosing a data center
When choosing a data center, we started from the following requirements:
- availability of a site in Russia (to comply with the personal data protection law);
- at least two sites overall.
Based on these requirements, the choice came down to Selectel and Servers.ru.
Each of the providers has its pros and cons.
Selectel:
Pros:
- a wide range of dedicated servers;
- reasonable prices;
- many years on the market and lots of additional services (for example, VRRP, which Servers.ru does not offer);
- VKontakte is among its customers, so one can expect this motivates the provider to stay at the forefront of technology.
Cons:
- lately we have heard many reports from friends and industry colleagues about stability problems.
Servers:
Pros:
- data centers in several countries, and servers in different data centers can be combined into one network "out of the box" (i.e. this is not an extra service). For now our main business is in the CIS and the foreign data centers do not matter much, but in the foreseeable future we plan to enter other markets, so staying within one provider's infrastructure will be a big help;
- new brand-name Dell servers;
- flexibility and customer focus in support (replies come quickly) and in sales (for example, it is easy to agree on changes to a standard configuration: adding disks or memory).
Cons:
- there are big gaps between server configurations: for example, the simplest configuration has 1 processor with 4 cores, and the next one has 2 processors with 16 cores (at twice the price);
- all else being equal, the prices are higher.
In the end we chose Servers.ru, since the difference in hosting price in absolute terms turned out to be insignificant compared to the risk of unstable operation.
Interim results
As a result of the move, we not only started sleeping better, we also significantly improved performance: for example, the average response time dropped by a factor of 4.

And Okdesk customers can continue not to worry about the safety and availability of their data.
P.S. In this article we deliberately did not dive into the technical details of the infrastructure. We wanted to emphasize that the choice of technical solutions should be driven by business requirements. As far as possible, we will answer technical questions in the comments.