How to develop a cloud service: ensure smooth operation and add features

To use a well-known comparison, developing a cloud service is like replacing the engine of a plane in flight. But there is no alternative: if you do not change constantly, you will fall hopelessly behind. The MoySklad service has been changing constantly for 7 years. In this post we describe how that happens.

Loads

If your service is B2B (that is, for companies rather than individual users), its load is fairly easy to predict. It has no unexpected spikes like the Habr effect. In our domain, trade, there are two well-known peaks during the year: before the New Year holidays and March 8. The peaks amount to a few tens of percent, not a several-fold increase.

Since the load varies little (the picture shows the actual load graph of one of our servers), there is little need for the elasticity that various PaaS offerings provide. So instead of Amazon EC2 or other cloud hosting, we rent ordinary dedicated physical servers. With a stable high load, this appears to be the optimal solution.

If everything goes well, the user base grows continuously. This is a steady trend: each month the load and the size of the databases grow by 5-10%. In practice, this means that every few months we must either install new servers or improve the architecture of the service (or both).

But the main reason to keep a solid load margin is bugs. Some bugs reach production along with updates. The one I remember best is an exotic problem where writing stack traces to the log, for no apparent reason, catastrophically slowed down the application on servers with AMD processors. (On servers with Intel processors, everything worked fine.)

Other bugs wait patiently for their chance. In practice, most of our problems were caused by exotic templates that sent the document-printing code into an infinite loop. Therefore, we set a minimum performance margin for all servers and components: a factor of 2-3. In most cases, it lets users work normally even when something is going wrong inside.
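
To make the numbers concrete, here is a rough sketch of how long a margin lasts when the load compounds at 5-10% per month. It assumes capacity is fixed between upgrades, that a server starts at a 3x margin, and that 2x is the floor; these starting values and the function name are illustrative assumptions, not figures from our monitoring.

```python
import math


def months_until_margin_floor(current_margin, min_margin, monthly_growth):
    """Months until capacity/load drops from current_margin to min_margin.

    Assumes fixed capacity and a load that compounds by monthly_growth
    (0.05 means 5%) every month. All inputs are hypothetical examples.
    """
    if current_margin <= min_margin:
        return 0.0  # already at or below the floor
    if monthly_growth <= 0:
        return math.inf  # load is not growing, so the margin never shrinks
    # load(t) = load(0) * (1 + g)^t, so solve (1 + g)^t = current / min for t
    return math.log(current_margin / min_margin) / math.log(1.0 + monthly_growth)


for growth in (0.05, 0.10):
    months = months_until_margin_floor(3.0, 2.0, growth)
    print(f"{growth:.0%} monthly growth: about {months:.1f} months of headroom")
```

Under these assumptions the margin shrinks from 3x to 2x in roughly eight months at 5% growth and roughly four months at 10%, which is consistent with having to add servers or rework the architecture every few months.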

In my opinion, there is no need to prepare the architecture for super-high loads right away. In the B2B world nothing happens instantly, and with smooth growth there is time to adapt the architecture gradually to the growing load.

MoySklad was launched as one monolithic Java EE application with one database. A few years after launch, the single database server stopped coping with the load. We said “ok” and split the database across several physical servers. After some time, the application itself approached its load limit. We said “ok” and moved the long-running, heavy tasks (data import and export, the API) to a separate server. After some more time, we split the main application across several servers.

You should not spend too much time preparing in advance for future ultrahigh loads. Most likely, everything will have to be redone by then anyway.

Technology

Technology is what helps or hampers the evolution of an architecture most. The right technologies make life simple and enjoyable. The wrong ones bring pain and suffering.

I will give an example. To join several JVMs into a cluster, we used the Infinispan distributed cache in JBoss. Up to a certain point everything went well, but then (perhaps because the load had grown) regular failures began. About once a week, Infinispan lost the connection between individual JVMs.

Infinispan has a huge number of settings, and the Internet is full of tips on how to tweak them to solve the same or similar problems. We spent several months trying the reasonable options. The cache kept failing regularly.

The solution turned out to be simple. In one week, we moved the cache implementation from Infinispan to Hazelcast. This implementation worked right away.

If the technology does not work, it is best to replace it as soon as possible.

Releases

An important task in product development is keeping the product balanced. What does that mean?

Like any fairly mature product, MoySklad has a hefty backlog of things that should have been done long ago. It is so large that working with it is quite difficult. You can pull features into the next release more or less at random, but is that the optimal approach?

What seriously reduces the chaos is understanding that all features fall clearly into three groups.

1) Platform features. Outwardly they give users nothing, and they often get in the way, because their release carries the risk of the most severe glitches and slowdowns. However, if you do not keep improving the platform, the product will soon stagger on its thin legs and collapse heavily under the ever-increasing load.

2) Reworks of existing functionality. They add no new capabilities, but they put in order what was done crookedly. Reworks cause massive suffering among users who are used to working the old way. Still, they have to be done, because without them the product will quickly become hopelessly confusing and usable only by a chosen few: geeks.

3) Genuinely new features. Yes, less is more, the minimum viable product rules, users buy a solution to their problems rather than a set of features, and so on. But in the end, a particular product is chosen precisely because its set of capabilities matches the user's tasks more or less accurately. New features should appear continuously.

With features split into these three groups, balancing them becomes easier. You just need to organize development in three parallel streams. How best to group them into releases?

The main advice from our experience is to never combine new features and platform changes in one update. Such combined releases are too complex, difficult to prepare and difficult to test. Therefore, we divide releases into platform and functional ones.

The peculiarity of platform releases is that under heavy load, on real user data, something can go wrong. Therefore, we make them reversible. If problems arise, the old version is restored within a few minutes and the development team starts preparing a second attempt.

Functional releases are harder to roll back. As a rule, they require changes to the database schema and data conversions, so returning to the old version is not so easy. But that is not really needed, because a functional release is easier to test.

A small life hack: it is better to roll out massive updates on the weekend. On Saturday, and especially on Sunday, the service has several times fewer users.
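
The rule about not mixing release types can be checked mechanically when a release is assembled. The sketch below is only an illustration: the `Kind`, `Change`, and `release_type` names are hypothetical, and it assumes that reworks ship together with new features in functional releases, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PLATFORM = "platform"  # infrastructure changes invisible to users
    REWORK = "rework"      # redoing existing functionality
    FEATURE = "feature"    # genuinely new functionality


@dataclass(frozen=True)
class Change:
    title: str
    kind: Kind


def release_type(changes):
    """Return "platform" or "functional"; reject empty or mixed releases."""
    if not changes:
        raise ValueError("a release needs at least one change")
    kinds = {change.kind for change in changes}
    if Kind.PLATFORM in kinds and kinds != {Kind.PLATFORM}:
        # Platform work must ship alone so it can be rolled back quickly.
        raise ValueError("platform changes must not be combined with other changes")
    return "platform" if kinds == {Kind.PLATFORM} else "functional"


# Hypothetical backlog items, for illustration only.
print(release_type([Change("switch cache implementation", Kind.PLATFORM)]))
print(release_type([Change("new report", Kind.FEATURE),
                    Change("simplify document form", Kind.REWORK)]))
```

A check like this could run while the release plan is being put together, so that a mixed release is rejected before anyone starts testing it.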

Results

We have described our approaches and techniques for continuously developing a successful cloud service. Let me repeat three important rules for working under constant updates:

1. The main reason to keep a substantial reserve of resources under load is possible development errors, not a hypothetical future increase in the number of users. It is not worth spending time on preparing in advance for future ultrahigh loads.
2. There is no point in spending too long trying to get external libraries and components to work: if a technology does not work, it is best to replace it as soon as possible.
3. Once everything is done and you are ready to release updates, never combine new features and platform changes in one release.

Source: https://habr.com/ru/post/230991/
