Today we want to tell you how our guys quadrupled the productivity of our software-testing cluster in three hours, simply by “switching on their brains”.
Upd. This post is NOT a SCALE test; it is a real story from practice, with some funny moments. We increased the density of VMs fourfold. If you expect comparative testing, graphs, and performance analysis, this is not the place. Today's post is more of a light read.
First, a few remarks to make clear where this topic comes from. A peculiarity of how Virtuozzo works is that the development department and all the programmers are in Moscow (a legacy of SWsoft and our alma mater, PhysTech), while the head office is in Seattle (USA). For today's post this matters only because our HPC cluster for software testing is also in the USA, while the main “customers” for test jobs are in Moscow. Even with full remote access this can be a problem: there are 11 time zones between the two cities, so the working day in Seattle begins just as it ends in Moscow, and physically changing anything on the servers is not easy.

Launched, but not tuned
But let's be more specific: to test new versions of the Virtuozzo software, we launched a large cluster of 10 machines. We installed our virtualization system on them, and then, at the VM level, we load our software once more for numerous test runs. Although development engineers continuously monitor the process, more than 99% of the load on the cluster is created by automated bots that try to launch as many testing subtasks as possible at any given moment.
The cluster was launched relatively recently, and there is no permanent Virtuozzo staff at the data center where we rent space. It might seem that this should not be a problem, since everything can be done remotely... everything except physical reconfiguration, that is, and that was exactly what our guys needed: we could run only 5-7 nested VMs when we wanted a lot more.
It turned out that 10 servers with Xeon L5640 and Xeon X5650 processors can take on a fairly high load, even though they also run the Virtuozzo Storage data storage system. But memory and disks had been distributed among them without regard to the upcoming tasks, and the additional network cards that had been installed gave no performance boost, because they were simply “not where they should be”.

After analyzing the cluster, we realized that it had been a mistake not to draw up a workload model before assembling it, because:
- User traffic to the VMs (mostly from bots) was mixed with storage traffic, clogging the channel
- Virtual machines were being launched needlessly on nodes with little RAM, overloading them
- The additional network cards simply sat idle because there were no rules for redistributing traffic
To put an end to this mess, we decided to rebuild a number of servers according to the following rules:
- Install 2 network cards in every server (or 4 in servers with VZ Storage)
- In servers with less powerful processors, insert the largest disks and combine the additional network interfaces (for VZ Storage) into bonds
- In servers with more powerful processors, insert smaller disks but the maximum amount of RAM.
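For those who think better in code, the three rules above can be written down as a tiny planning function. This is only a sketch under our own assumptions: we treat the Xeon L5640 as the “less powerful” processor and the X5650 as the “more powerful” one (the rules themselves do not name the models), we assume the weaker servers are the ones serving VZ Storage, and the names `Plan`, `plan_for`, and the role labels are invented for illustration.

```python
from dataclasses import dataclass

# Assumption: the L5640 is the "less powerful" CPU and the X5650 the
# "more powerful" one; the rebuild rules do not say which is which.
LESS_POWERFUL_CPUS = {"Xeon L5640"}


@dataclass
class Plan:
    role: str         # "storage" or "compute" (hypothetical labels)
    nics: int         # number of network cards to install
    bond_extra: bool  # bond the additional interfaces for storage traffic
    disks: str        # which disks to insert
    ram: str          # how much RAM to install


def plan_for(cpu: str) -> Plan:
    """Apply the three rebuild rules to one server, keyed by its CPU model."""
    if cpu in LESS_POWERFUL_CPUS:
        # Weaker CPU: assumed to serve VZ Storage, so it gets the largest
        # disks and 4 network cards, with the extra interfaces bonded.
        return Plan("storage", nics=4, bond_extra=True,
                    disks="largest", ram="whatever is left")
    # Stronger CPU: runs the test VMs, so smaller disks and maximum RAM.
    return Plan("compute", nics=2, bond_extra=False,
                disks="smaller", ram="maximum")


if __name__ == "__main__":
    for cpu in ("Xeon L5640", "Xeon X5650"):
        print(cpu, "->", plan_for(cpu))
```

Had a table like this existed before the cluster was assembled, the disks, RAM, and network cards would probably have landed in the right servers the first time.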
From Brighton Beach to Deribasovskaya
To carry out this “castling”, we needed “our own man” in Seattle, and that turned out to be our colleague Kirill Kolyshkin. Fortunately, he had access to the data center, and although he was not the cluster administrator, he was glad to help us.
We sat down at the end of the working day fully ready to work, but Kirill got stuck in a traffic jam and reached the data center only at 20:30 Moscow time. It was Friday evening and everyone wanted to go home, but there was work to do. So we began discussing in the general chat what needed to be installed and where.
“How should I know? In this case I am playing the role of a hardware engineer, I do not understand anything about your systems” is one of the most important phrases from our engineer.
Yes, he worked blind and on our instructions, so we had some very interesting moments. So as not to spoil the feel of the process, we will simply quote the conversations from the chat where it all happened:
kir [9:15 PM] I dropped a couple of bolts, I wanted to ask someone where to find them
[9:15] okay, I'll look for it myself
[9:30] still looking for the bolts
[9:40] to hell with them, with the bolts
[9:19] guys, I hit my head on the server
[9:19] I'll go stop the blood
[9:19] (this is not a joke)
Along the way, we learned a lot about our systems that sit quietly in the USA:
kir [9:51 PM] Machine 118 has a bent rail on the right, it almost fell on my leg, I barely got it back in
apershin [9:52 PM] didn't they hand out helmets at the entrance?)) like in hazardous industries)))
kir [9:52 PM] it's actually held on only one side there, hanging, or rather lying on the previous one
Of course, you cannot get through a situation like this without humor, but at one point we even went too far. After all, the chat was unprotected...
Alexandr: the Americans will think: those crazy Russian hackers are plotting something again, surely an attack on Hillary's headquarters))))
apershin [11:05 PM] We'll chat ourselves into trouble, they'll come for Kir over there)))
Kirill, of course, really wanted to leave the server room and stop doing things that were not actually his job:
[11:41] I'm ready to leave here.
[11:42] tell me when it will be possible
[11:42] Lunch time is long past, you know
[11:45] Mister, hey, mister
[1:11] je ne mange pas six jours
kir [10:47] in short, all the screws are on my cart
A few hours and the result
But we could not let Kirill go too early, because not everything worked right away. It turned out that we had more network cards than we thought, that not all the cables worked well and, finally, that the servers had different BIOS settings, so some of them simply did not restart after the configuration change.
We checked the links, swapped patch cords, and reinstalled the system. As a result, at around one in the morning Moscow time, we released Kirill, with a bruised leg, an injured head, and an empty stomach, to get back to his own work (he had long since missed lunch).
What we got in the end is a more productive testing cluster: instead of 5-7 virtual machines in each environment, we were able to run 15-20. At the same time, Storage now worked on a separate network through a dedicated switch, without interfering with requests from bots and users. So our team proved its cohesion, and the servers began to work much more efficiently thanks to a better distribution of components. So do not be afraid of working with servers remotely. The main thing is to have a reliable person on site who is not afraid of injuries or hunger.