Hello. I'd like to present a new utility for high-speed provisioning of bare-metal servers.
Its competitors are xCAT, Warewulf, and Rocks. It uses BitTorrent to distribute OS images. Supported OS: the RHEL family; Debian/Ubuntu support is in the works. The largest test so far: a cold boot of a 512-node HPC cluster completes in 4 minutes. Node names can be detected automatically based on the switch-port pair a node is connected to.
Link: https://github.com/dchirikov/luna
So. One day, while installing yet another cluster with xCAT and getting frustrated by its endless bash post-scripts, I developed a burning desire to write my own, better tool.
At first it was a classic pet project, but it has since grown up: my colleagues and I have already used it to install close to 10 clusters ranging from 30 to 500 nodes. We rolled out the first release that isn't embarrassing to show people not long ago, so there's a reason to show off on Habr :)
Drawing on experience with other systems and conversations with customers' engineers, I settled on 2 killer features: images instead of kickstart, and BitTorrent for distribution.
The first is used, for example, by Bright, and by xCAT for its diskless method. The torrent idea was suggested by one of our customers from Sweden.
It is brilliant in its simplicity. An HPC cluster means hundreds of compute nodes with exactly the same configuration, so the system image is identical on all of them. Instead of each node pulling its own copy of the operating system from the master server and saturating that poor server's interface, let the nodes share what has already been downloaded. The working images on the nodes differ only slightly: literally the hostname and IP addresses.
So we create this image on the master node. This is elementary: creating a chroot environment is quite a trivial task. The whole tree is then packed into a tarball and published as a torrent.
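For illustration, preparing such an image by hand on a RHEL-family master might look roughly like this (a minimal sketch with arbitrary paths and package set; luna automates the equivalent steps):

# Build a minimal RHEL-family tree in a chroot directory
mkdir -p /opt/luna/os/compute
yum -y --installroot=/opt/luna/os/compute --releasever=7 \
    install @core kernel openssh-server

# Pack the tree into a tarball; --numeric-owner preserves
# uid/gid numbers regardless of the master's own /etc/passwd
tar -C /opt/luna/os/compute --numeric-owner -czf compute.tgz .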
The secret sauce is the dracut module that lives in the initrd. It knows how to talk to the server (and how to find it from /proc/cmdline), carries a self-written lightweight torrent client, and can execute simple commands; literally curl | bash. A frequently asked question: "Is the torrent client always running?" Answer: no, only during boot. Before pivot_root, all services belonging to the luna module shut down.
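To give a feel for how the module can find its server, here is a minimal sketch of pulling a parameter out of /proc/cmdline (the parameter name luna.url and the /install endpoint are assumptions for illustration, not the module's actual protocol):

# Scan the kernel command line for a hypothetical luna.url= parameter
for param in $(cat /proc/cmdline); do
    case "$param" in
        luna.url=*) LUNA_URL="${param#luna.url=}" ;;
    esac
done

# Fetch the next task from the server and run it: literally curl | bash
curl -s "$LUNA_URL/install" | bash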
Why do nodes need to boot quickly? First of all, this beauty^W feature reduces the time needed to put a cluster into operation. Within a rather limited installation window, it lets us run many tests with different parameters and configs. HPC is a thing in itself: if HPL (Linpack) runs on 4 nodes, it is by no means guaranteed to start on 400. Secondly, the customer can power nodes on only when there is work for them, without waiting half an hour for them to boot: the boot time does not depend on the size of the task queue in the scheduler (Slurm, for example), and there is no longer any difference between turning on 2 nodes or 100.
Using OS images also reduces the pain of experimenting with a cluster: every package and config installed in the chroot environment will end up on the node, with no more surprises when xCAT slyly refuses to install a package or simply falls over for no apparent reason.
Moreover, it is possible to configure a node by hand and then "grab" it into an image. This greatly simplifies installing drivers or exotic packages, such as the ones needed to bring up RDMA devices.
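The "grab" idea itself can be illustrated with plain rsync (a concept sketch, not luna's actual code; the exclude file stands in for the osimage exclude-list described below):

# Pull a live node's filesystem back into the image tree,
# skipping volatile paths listed in a hypothetical exclude file
rsync -aH --numeric-ids \
    --exclude-from=/opt/luna/os/compute.excludes \
    root@node001:/ /opt/luna/os/compute/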
If you're still interested, I'll go further into the technical details, the utilities, and the architecture.
luna is the main configuration utility. It operates on a number of sub-objects; I will describe the main ones, roughly in the order you would need them to configure a cluster.
luna cluster - I decided to abandon text configs, so everything you would expect to find in them is configured here: port ranges, the directory for storing files, the network for DHCP, and so on.
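For a flavor of the CLI, initializing a cluster looks something like this (treat the option names as illustrative rather than exact syntax; see the project docs):

# Create the cluster object and point it at the provisioning address
luna cluster init
luna cluster change --frontend_address 10.30.255.254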
luna network - network definitions. Networks are assigned to interfaces in groups, which, in turn, are assigned to nodes.
luna osimage - an operating system image. It describes the path to the directory tree and includes the kernel version and kernel parameters used for booting. It also stores the exclude-list used when "grabbing" a node.
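Registering the tree from the earlier sketch as an image might then look like this (again, illustrative syntax rather than verified options):

# Turn the chroot tree into an osimage object
luna osimage add --name compute --path /opt/luna/os/compute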
luna group - node groups. As I said, nodes have little individuality, so most of the configuration happens here. Interfaces are created (em2, for example), a network is assigned, and additional parameters are written that end up in the ifcfg-em2 file. BMC/IPMI parameters are assigned here too. There are also 3 types of scripts: pre-, part-, and post- (a group setup sketch follows below). These are ordinary bash scripts executed at different stages of the installation.
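A rough illustration of setting up a group (option names are assumptions based on the description above, not verified syntax):

# Create a group backed by an osimage, then give it an interface
# attached to one of the defined networks
luna group add --name compute --osimage compute
luna group change compute --interface em2 --setnet cluster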
The pre-script is rarely used. In theory, anything that should run before the node is configured can be described here: a BMC reboot, say, or a BIOS firmware flash (though I strongly advise against the latter!).
The part-script, short for "partitioning", is responsible for configuring the disks (if there are any) and mounting them under /sysroot; if there are no disks, it mounts tmpfs instead. Once everything is mounted, the OS image is downloaded and unpacked there.
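A part-script can be as small as one line in the diskless case; a disk-backed variant is only slightly longer (a sketch; the device name is an example):

# Diskless: the root filesystem lives in RAM
mount -t tmpfs tmpfs /sysroot

# With a local disk (example device /dev/sda):
# parted -s /dev/sda mklabel msdos mkpart primary ext4 1MiB 100%
# mkfs.ext4 -q /dev/sda1
# mount /dev/sda1 /sysroot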
The post-script writes fstab, installs grub, and so on. At this stage we can chroot into /sysroot and do whatever we need, no longer in the dracut environment but in a practically full-featured OS.
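A post-script sketch along the same lines (assumes the disk-backed layout from the part-script example above; diskless nodes would skip grub entirely):

# Record the root filesystem and install the bootloader
echo '/dev/sda1 / ext4 defaults 0 1' >> /sysroot/etc/fstab
mount --bind /dev  /sysroot/dev
mount --bind /proc /sysroot/proc
mount --bind /sys  /sysroot/sys
chroot /sysroot grub2-install /dev/sda
chroot /sysroot grub2-mkconfig -o /boot/grub2/grub.cfg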
Oh, and I forgot to mention that the dracut module brings up the network interfaces and an ssh daemon, so it can be used as a rescue environment: most of the usual binaries and utilities are available (awk, sed, parted, etc.). There is also a service mode, in which a node does not try to load an image but stays in dracut.
The last object I would like to mention is luna node. Individual node parameters are configured here: the name, IP addresses, MAC address, and the switch and port the node is connected to.
The MAC address deserves a closer look. It is, in effect, the node's central identifier, but typing 6 octets by hand for a thousand nodes is a sad prospect, so there are 2 other assignment modes. In the first, the admin picks the node's name from a menu at boot time; luna uses iPXE, so fairly advanced menus are possible. The second is to assign a switch and port to the node: for example, any node connected to the first port of the first switch will be named node001, and its MAC address will be written into the database. MongoDB is used as the database, by the way.
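Tying a node to a switch port might look roughly like this (illustrative syntax; the switch names are hypothetical):

# Name whatever boots from port 1 of switch01 "node001"
luna node change node001 --switch switch01 --port 1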
In addition to luna, there are several executables:
lweb - a web service that serves out tasks (bash scripts) and boot menus, which are sent to the nodes for execution. lweb also includes the torrent tracker. All traffic stays inside the cluster; nothing leaks out.
ltorrent - the torrent "server" that seeds the images.
lpower - a wrapper around ipmitool for quickly resetting or powering nodes on and off. It takes the logins and passwords from the luna database.
lchroot - a wrapper around chroot. It lets you quickly "drop" into the chroot environment of an OS image. On top of that, it can fake uname -r :)
[root@master ~]# uname -r
3.10.0-327.36.1.el7.x86_64
[root@master ~]# lchroot compute
IMAGE PATH: /opt/luna/os/compute
chroot [root@compute /]$ uname -r
4.8.10-1.el7.elrepo.x86_64
Phew... that's about all, I think. Next time I can tell you what's planned for the future.
P.S. Why "luna"? Just because.
Source: https://habr.com/ru/post/324750/