
How we started building a cloud in 2009, and where we went wrong



In October 2009, we double-checked everything. We had to build a data center for 800 racks, based on our intuition, market forecasts and what was happening in America. It all sounded logical, but it was scary.

At the time there was no cloud computing in Russia, and no cloud hosting either; the term itself was barely used on the market. But we had already seen that such installations were in demand in America. We had large projects building HPC clusters for aircraft designers, 500 nodes each, and we believed that a cloud was simply another equally large computing cluster.
Our mistake was that in 2009 nobody imagined clouds would be used for anything other than distributed computing. Everyone needs CPU time, we thought. So we started building the architecture the same way HPC clusters were built for research institutes.

Do you know how such a cluster differs from a modern cloud infrastructure? It barely touches the disks, and what reading it does is more or less sequential. There is one task; it is split into pieces, and each machine works on its own piece. At the time nobody seriously considered that the disk subsystem load profile is fundamentally different for HPC clusters and for clouds: in the first case it is sequential read/write, in the second it is fully random. And this was not the only problem we had to face.
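To illustrate the difference, here is a minimal, self-contained sketch (not part of the original platform; the file name and sizes are made up) that reads the same data sequentially and then at random offsets. On spinning disks, and on flash once the OS page cache is out of the picture, the random pattern is dramatically slower.

# Sketch: same data volume, two I/O profiles.
# HPC-style sequential reads vs cloud-style random reads.
import os
import random
import time

PATH = "scratch.bin"      # hypothetical scratch file
BLOCK = 4096              # 4 KiB per request
BLOCKS = 25_000           # ~100 MiB of data in total

def prepare() -> None:
    with open(PATH, "wb") as f:
        f.write(os.urandom(BLOCK * BLOCKS))

def timed_read(offsets) -> float:
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

if __name__ == "__main__":
    prepare()
    sequential = [i * BLOCK for i in range(BLOCKS)]
    shuffled = random.sample(sequential, len(sequential))
    print(f"sequential read: {timed_read(sequential):.2f} s")
    print(f"random read:     {timed_read(shuffled):.2f} s")

On a warm page cache the gap narrows; on a real storage backend under many tenants it is exactly this random profile that dominates.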

Network architecture


The first important choice was InfiniBand versus Ethernet for the network inside the main cloud site. We compared them for a long time and chose InfiniBand. Why? First, I repeat, because we looked at the cloud as an HPC cluster, and second, because back then everything was assembled from 10Gb links. InfiniBand promised miraculous speeds, simpler support and lower network operation costs.

The first leg of the network in 2010 ran on 10G Ethernet. At that time we were among the first in the world to use Nicira's first SDN solution, which VMware later bought for a lot of money and which is now called VMware NSX. While we were learning to build clouds, the Nicira team was learning to do SDN in much the same way. Needless to say, it did not go without problems; a couple of times everything went down quite spectacularly. The network cards of the era would also "fall off" after long uptimes, which only added to the excitement. In short, it was rough. For a good while after each major Nicira update, the operations team lived on valerian drops. However, by the time 56G InfiniBand was rolled out, we and our colleagues from Nicira had cured most of the problems, the storm subsided and everyone breathed a sigh of relief.

If we were designing the cloud today, we would probably build it on Ethernet, because that is the direction the architecture ultimately went. But it was InfiniBand that gave us huge advantages, which we were able to use later.

First growth


In 2011-2012 the first stage of growth began. "We want the same as Amazon, but cheaper and in Russia" was the first category of customers. "We want special magic" was the second. Because everyone back then advertised clouds as a miracle tool for uninterrupted infrastructure, we had some misunderstandings with customers. Large customers were used to near-zero downtime of physical infrastructure: a server goes down, the head of the department gets a reprimand. A cloud, with its extra layer of virtualization and a pool of orchestration, runs on physical servers that are individually slightly less stable. Nobody wanted to deal with VM failures, because in the cloud everything was set up by hand and nobody used the automation and cluster solutions that could have improved the situation. Amazon says: "Anything in the cloud may fail," but the market was not comfortable with that. Customers believed the cloud was magic: everything should work without interruptions and virtual machines should migrate between data centers by themselves... Everyone arrived with one server instance mapped to one virtual machine. And the level of IT maturity at the time meant there was little automation: things were set up by hand once, under the ideology "if it works, don't touch it." So when a physical host was restarted, all the virtual machines had to be brought back up manually. For a number of customers our support team did this too. This was one of the first things solved by an internal service.
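The article does not describe that internal service, but as a rough sketch of the kind of automation that was missing (an assumption on our part, using the standard libvirt Python bindings rather than anything from the original platform), bringing guests back after a host reboot can be as simple as:

# Sketch: boot every guest that is defined on the host but not running.
import libvirt  # pip install libvirt-python

def start_inactive_domains(uri: str = "qemu:///system") -> None:
    conn = libvirt.open(uri)
    try:
        for dom in conn.listAllDomains():
            if not dom.isActive():       # defined, but not running after the reboot
                print(f"starting {dom.name()}")
                dom.create()             # equivalent of `virsh start <name>`
    finally:
        conn.close()

if __name__ == "__main__":
    start_inactive_domains()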

Who came to the cloud? All sorts of people. Distributed online stores were among the first to arrive. Then people began bringing business-critical services with a proper architecture. Many saw the cloud as a failover site, something like a backup for their data center. Later they moved to it as the primary site, keeping the second site as the reserve. Most of the customers who committed to such an architecture early on are still very happy with it. A properly configured failover migration scheme was our pride: it was very satisfying to watch a major outage happen in Moscow while the customer's services automatically migrated and redeployed on the reserve site.

Disks and flash


The first growth was very fast, faster than we could predict when designing the architecture. We bought hardware quickly enough, but at some point we hit a ceiling on the disks. Around that time we broke ground on the third data center, the second one for the cloud: the future Compressor, with Uptime Institute Tier III certification.

By 2014 we had very large customers, and we faced the following problem: the storage systems sagged under load. When you have 7 banks, 5 retail chains, a travel company and a few research institutes doing geological exploration, their load peaks can suddenly coincide.

The typical storage architecture of the time did not assume that users had quotas on write speed. Read and write requests went into a single live queue, and the storage system processed them in order. Then came a Black Friday sale, and we saw throughput for the other users of the storage system drop almost 30-fold: the retailer hogged nearly all the capacity with its write requests. A medical center's website went down, with pages taking 15 minutes to open. Something had to be done urgently.

Even on the highest-performance disk arrays, usually very expensive ones, there was no way to differentiate performance priorities, so customers could still affect each other. We either had to rewrite the driver in the hypervisor or invent something else, and urgently.

We solved the problem by buying all-flash arrays rated at about a million IOPS each. That worked out to 100,000 IOPS per virtual disk. The performance was more than enough, but we still had to implement read/write limits. At the disk-array level the problem was unsolvable at the time (end of 2014). Our cloud platform is built on non-proprietary KVM, so we were free to dig into its code. In about 9 months we carefully rewrote and tested the functionality.
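The article does not show the actual KVM changes, but the general idea of a per-disk read/write ceiling can be sketched with the stock libvirt/QEMU block I/O throttling API (an assumption for illustration only; the domain and disk names are made up):

# Sketch: cap combined read+write IOPS on one virtual disk.
import libvirt  # pip install libvirt-python

IOPS_LIMIT = 100_000    # the guaranteed per-disk figure from the text
DISK = "vda"            # hypothetical disk target inside the guest

def cap_disk_iops(domain_name: str, disk: str = DISK, iops: int = IOPS_LIMIT) -> None:
    conn = libvirt.open("qemu:///system")
    try:
        dom = conn.lookupByName(domain_name)
        params = {"total_iops_sec": iops}   # 0 would mean "no limit"
        flags = (libvirt.VIR_DOMAIN_AFFECT_LIVE |
                 libvirt.VIR_DOMAIN_AFFECT_CONFIG)
        dom.setBlockIoTune(disk, params, flags)
    finally:
        conn.close()

if __name__ == "__main__":
    cap_disk_iops("customer-vm-01")         # hypothetical domain name

With a ceiling like this enforced per virtual disk, one tenant's Black Friday traffic can no longer starve everyone else's queue.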

At that point the combination of InfiniBand and all-flash gave us a completely wild thing: we were the first on our market to launch a service with guaranteed disk performance, backed by severe penalties written into the SLA. Competitors looked at us wide-eyed. We said: "We give 100,000 IOPS per disk." They said: "That's impossible..." We said: "And we guarantee it." They said: "You're out of your minds." For the market it was a shock. Of 10 major tenders, we won 8 because of the disks. Then we pinned the medals on our chests.

16 arrays, each delivering a million IOPS, 40 terabytes apiece! They are still directly connected to the servers via InfiniBand. And they blew up where no one ever expected. We had run six months of tests, and there had not been even a hint of trouble.

The thing is, when an array controller fails on InfiniBand, the routes take about 30 seconds to rebuild. You can cut this to 15 seconds, but no further, because of limitations of the protocol itself. It turned out that once the number of virtual disks (the ones customers created for themselves) reached a certain point, a rare heisenbug appeared in the all-flash storage controller. On a request to create a new disk, the controller could go haywire, hit 100% load, go into thermal shutdown and trigger that very 30-second failover. Disks would drop off the virtual machines. There we were. For several months we hunted for the bug together with the storage vendor. In the end we found it, and they fixed the controller microcode on the arrays for us. In the meantime we wrote a whole software layer around those arrays that worked around the problem, and we had to rewrite almost the entire management stack.

Demotivational posters about those arrays are still hanging in the support room.

The present day


Then there were problems with the software for remote workstations. The solution there was proprietary, and the dialogue with the vendor went like this:
- Could you help us?
- No.
- To hell with you, we'll complain about you.
- Please, go ahead.
At that point we decided to move away from proprietary components and cover the remaining needs with our own development. Now we invest in open-source projects: at one time, for example, we provided almost half a year's worth of the ALT Linux budget, and sometimes a request from us sharply accelerated the development of features we needed. On that same wave we brought our own development to a state our European colleagues described as "damn amazing."

Today we look at the cloud with an experienced eye and understand how to develop it for several years ahead. And, again, we know that we can do whatever we need with KVM, because we have the development resources for it.

Links


Source: https://habr.com/ru/post/352042/

