How to move from ESXi to KVM / LXD and not go crazy
Maxnet Systems had long used the free version of VMware ESXi as its hypervisor, starting with version 5.0. The paid vSphere was off-putting because of its licensing model, while the free version had a number of flaws absent in the paid one, but they were tolerable. But when, in the newer ESXi releases, the new web interface refused to work with the old one and monitoring of the RAID arrays stopped showing signs of life, the company decided to look for a more universal and open solution. The company already had good experience with and a pleasant impression of LXC, Linux Containers. So it became obvious that the dream hypervisor would be hybrid, combining KVM and LXD, the evolutionary continuation of LXC, for different workloads. Searching for information on KVM, the company ran into misconceptions, rakes and harmful practices, but tests and time put everything in its place.
Lev Nikolaev (maniaque), administrator and developer of highly loaded systems and an IT trainer, will tell how to cope with the move from ESXi to KVM without stepping on every rake along the way. We will talk about networking, storage, containers, KVM, LXD, LXC, provisioning and convenient virtual machines.
Prologue
First, let's outline the key points, then analyze them in more detail.
Network. As long as your interface speeds do not exceed 1 Gbit/s, a bridge will be enough. As soon as you want to squeeze out more, it will limit you.
Storage. Create shared network storage. Even if you are not ready to use 10 Gbit/s inside the network, even 1 Gbit/s gives you 125 MB/s of storage. For a number of workloads that is enough with a margin, and migrating virtual machines becomes a trivial matter.
Container or KVM? Pros, cons, pitfalls. What kind of load is better to put in a container, and which ones should be left in KVM?
LXD or LXC. Is LXD the same as LXC? A newer version? A layer on top? What is it, anyway? Let's dispel the myths and understand the differences between LXD and LXC.
Convenient provisioning. Which is more convenient: to take the same image every time, or to install the system from scratch? And how to do it quickly and accurately every time?
A convenient virtual machine. There will be scary stories about bootloaders, partitions and LVM.
Miscellaneous. Many small questions: how to quickly drag a virtual machine from ESXi to KVM, how to migrate smoothly, how to virtualize disks properly?
Reason for moving
Where did we get the crazy idea of moving from ESXi to KVM/LXD? ESXi is popular among small and medium businesses. It is a good and cheap hypervisor, but there are nuances.
We started with version 5.0: convenient, everything works! Version 5.5 too.
From version 6.0 it got harder. The web interface on ESXi did not become free right away, only from version 6.5; before that it required a utility under Windows. We put up with it. Whoever runs OS X buys Parallels and installs that utility. That is a well-known pain.
Monitoring periodically broke. We had to restart the management services in the server console, after which the CIM Heartbeat came back to life. We endured it, since it did not fall off every time.
ESXi version 6.5 is trash, garbage and atrocity. A horrible hypervisor, and here is why.
Angular crashes with an exception right at the entrance to the web interface. As soon as you enter your username and password, you immediately get an exception!
Remote monitoring of the RAID array status, the way we found convenient, no longer works. It used to be convenient, but in version 6.5 everything is bad.
Weak support for modern Intel network cards. Intel NICs and ESXi cause pain. The ESXi support forum has a whole thread of agony about this. VMware and Intel are not on friendly terms, and the relationship will not improve in the near future. The sad part is that even customers of the paid solution experience these problems.
No migration within ESXi, unless you count the pause-copy-start procedure as migration. We pause the machine, quickly copy it and start it elsewhere. But you cannot call that migration: there is still downtime.
Having looked at all of that, we got the crazy idea of moving off ESXi 6.5.
The wish list
To begin with, we wrote a wish list for the ideal future we were moving toward.
Management over SSH, with the web interface and everything else optional. The web interface is great, but when you are on a business trip with only an iPhone, it is inconvenient and difficult to get into the ESXi web interface and do anything there. So the only way to manage everything is SSH; there will be no other.
Windows Virtualization. Sometimes clients ask for strange things, and our mission is to help them.
Always fresh drivers and the ability to tune the network card. A reasonable wish, but unrealizable under plain ESXi.
Live migration, not clustering. We want to be able to drag machines from one hypervisor to another without noticing any delays, downtime or inconvenience.
The wish list was ready; then the hard search began.
The agony of choice
The market revolves around KVM or LXC under various sauces. Sometimes it seems that Kubernetes is somewhere up above, where everything is fine, sun and paradise, while down below live the Morlocks: KVM, Xen and the like...
For example, Proxmox VE is Debian with a kernel pulled in from Ubuntu. It looks strange, but do you want to take that into production?
Our downstairs neighbors are Alt Linux. They came up with a beautiful solution: they packaged Proxmox VE, so you install it with a single command. That is convenient, but we do not run Alt Linux in production, so it did not suit us.
Take KVM
In the end, we chose KVM. We did not take Xen, for example, because of the community: KVM's is much larger. It seemed we would always find an answer to our question. Later we found out that community size does not affect its quality.
Initially, we expected to take a bare metal machine, add Ubuntu, which we already work with, and roll KVM/LXD on top. We counted on being able to run containers. Ubuntu is a familiar system with no surprises for us in terms of solving boot/recovery problems. We know where to kick if the hypervisor does not start. Everything is clear and convenient for us.
KVM Crash Course
If you are from the world of ESXi, then you will find a lot of interesting things. Learn three words: QEMU, KVM and libvirt.
QEMU translates the wishes of the virtualized OS into the calls of an ordinary process. It works almost everywhere, but slowly. QEMU itself is a separate product that virtualizes a pile of other devices as well.
Next comes the QEMU-KVM bundle: a Linux kernel module for QEMU. Virtualizing every instruction is expensive, so the KVM kernel module intercepts only a few of them. As a result, it is significantly faster, because only a few percent of the total instruction set has to be handled. That is the whole cost of virtualization.
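As a quick sanity check (not from the talk, just a general recipe), you can verify that hardware virtualization and the KVM module are actually available on the host:
# the CPU must expose VT-x (vmx) or AMD-V (svm)
$ egrep -c '(vmx|svm)' /proc/cpuinfo
# the kvm and kvm_intel / kvm_amd modules should be loaded
$ lsmod | grep kvm
# and the /dev/kvm device should exist
$ ls -l /dev/kvm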
If you only have QEMU, starting a virtual machine without any wrapper looks like this:
$ qemu <many, many parameters>
In the parameters you describe the network, block devices and so on. It all works, but it is uncomfortable. That is why libvirt exists.
The task of libvirt is to be a single tool for all hypervisors. It can work with anything: KVM, LXD and so on. It would seem that all that remains is to learn the libvirt syntax, but in reality it works worse than in theory.
These three words are all that is needed to raise the first virtual machine in KVM. But again, there are nuances ...
libvirt has a config where the virtual machines and other settings are stored. It keeps the configuration in XML files: stylish, fashionable and straight out of the 90s. If you want, you can edit them by hand, but why, when there are convenient commands. It is also convenient that changes to the XML files version nicely. We use etckeeper to version the /etc directory. It is high time to use etckeeper.
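For reference, a few basic libvirt commands of the kind meant here; the machine name my-vm is just an example:
# list defined and running machines
$ virsh list --all
# dump the XML definition (handy to keep under version control)
$ virsh dumpxml my-vm > my-vm.xml
# edit the definition in $EDITOR; libvirt validates the XML on save
$ virsh edit my-vm
# start and gracefully stop
$ virsh start my-vm
$ virsh shutdown my-vm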
LXC Crash Course
There are many misconceptions about LXC and LXD.
LXC is the ability of a modern kernel to use namespaces: to pretend that it is not at all the same kernel it originally was.
Any number of these namespaces can be created, a set for each container. Formally, the kernel is one, but it behaves like many identical kernels. LXC lets you run containers, but it provides only the basic tools.
Canonical, which stands behind Ubuntu and aggressively pushes containers forward, has released LXD, an analogue of libvirt. It is a wrapper that makes launching containers easier, but inside it is still LXC.
LXD is a container hypervisor that is based on LXC.
Enterprise reigns in LXD. LXD stores its configuration in its own database: in /var/lib/lxd, in SQLite. There is no point in copying that database around, but you can write down the commands you used to build the container configuration.
There is no config dump as such, but most changes can be automated with commands. The result is a sort of Dockerfile, only driven by hand.
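A sketch of what such a hand-driven "Dockerfile" can look like; the container name and the limits are example values, not ours:
# create a container from the stock Ubuntu 18.04 image
$ lxc launch ubuntu:18.04 my-container
# pin resources
$ lxc config set my-container limits.cpu 2
$ lxc config set my-container limits.memory 2GB
# show the resulting configuration
$ lxc config show my-container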
Production
So what did we run into when we sailed into production with all this?
Network
There is an incredible amount of hellish trash and delirium on the Internet about networking in KVM. 90% of the materials say: use a bridge.
Stop using bridge!
What is wrong with it? Lately I get the feeling that madness reigns around containers: we put Docker on top of Docker so that you can run Docker in Docker while watching Docker. Most people do not understand what a bridge actually does.
It puts your network controller into promiscuous mode and accepts all traffic, because it does not know which traffic is for it and which is not. As a result, all bridged traffic goes through the wonderful, fast Linux network stack, with a lot of copying along the way. In the end everything is slow and bad. So do not use a bridge in production.
SR-IOV
SR-IOV is the ability to virtualize inside the network card. The card itself can carve off a part of itself for virtual machines, which requires hardware support. And that is exactly what gets in the way of migration: migrating a virtual machine to a host where SR-IOV is missing is painful.
SR-IOV should be used where every hypervisor involved in migration supports it. If not, then macvtap is for you.
macvtap
This is for those whose network card does not support SR-IOV. It is the lightweight version of a bridge: different MAC addresses are hung on the same network card, and unicast filtering is used: the card accepts not everything, but strictly by a list of MAC addresses.
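For reference, this is roughly how a macvtap interface is declared in a libvirt domain XML (type='direct'); the physical NIC name eno1 is an example, and mode='bridge' here refers to macvtap's own bridge mode, not a Linux bridge:
<interface type='direct'>
  <source dev='eno1' mode='bridge'/>
  <model type='virtio'/>
</interface>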
90% of the materials on how to build a network in KVM are useless.
If someone says that bridge is cool, don't talk to this person anymore.
With macvtap, about 30% of CPU is saved thanks to fewer copies. But promiscuous mode has its nuances: you cannot connect from the hypervisor host itself to the guest's network interface. Toshiaki's talk describes this in detail; in short, it will not work.
You rarely need to go from the hypervisor itself over SSH anyway; it is more convenient to open a console to the guest from there. You can "watch" traffic on the interface: connecting over TCP is impossible, but the traffic is visible from the hypervisor.
If your speed is above 1 Gigabit - choose macvtap.
At interface speeds up to about 1 Gbit/s you can use a bridge. But if you have a 10 Gbit card and want to actually make use of it, only macvtap remains. There are no other options, apart from SR-IOV.
systemd-networkd
This is a great way to store the network configuration on the hypervisor itself. In our case it is Ubuntu, but systemd-networkd works on other systems too.
We used to have a single /etc/network/interfaces file where we kept everything. One file is inconvenient to edit every time; systemd-networkd lets you split the configuration into a scattering of small files. This is convenient and works with any version control system: push it to Git and see when and what changed.
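A sketch of what such a scattering of files can look like, assuming a physical interface eno1 and a VLAN with ID 100 (names and the address are examples):
# /etc/systemd/network/10-eno1.network
[Match]
Name=eno1

[Network]
VLAN=vlan100

# /etc/systemd/network/20-vlan100.netdev
[NetDev]
Name=vlan100
Kind=vlan

[VLAN]
Id=100

# /etc/systemd/network/20-vlan100.network
[Match]
Name=vlan100

[Network]
Address=192.0.2.10/24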
There is a flaw that our network engineers discovered. When I need to add a new VLAN on the hypervisor, I go and configure it, then run systemctl restart systemd-networkd. At that moment everything is fine for me, but if BGP sessions are established from this machine, they get torn down. Our network engineers do not approve.
For the hypervisor itself nothing terrible happens. Systemd-networkd is not suitable for border routers and servers with established BGP sessions, but for hypervisors it is excellent.
Systemd-networkd is far from final and will never be finished. But it is more convenient than editing one huge file. The alternative to systemd-networkd in Ubuntu 18.04 is Netplan: a "cool" way to configure the network and step on yet another rake.
Network device
After installing KVM and LXD on the hypervisor, the first thing you see is two bridges. One was made by KVM for itself, the other by LXD.
Both LXD and KVM try to deploy their own network.
If you still need a bridge, for test machines or to play around, kill the bridge that is enabled by default and create your own, the way you want it. KVM and LXD do this terribly: they slip in dnsmasq, and the horror begins.
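Getting rid of the default networks looks roughly like this, assuming the stock names: default for the libvirt network and lxdbr0 for the LXD bridge:
# libvirt side
$ virsh net-destroy default
$ virsh net-autostart default --disable
$ virsh net-undefine default
# LXD side (if the default profile still uses the bridge, detach it first: lxc profile device remove default eth0)
$ lxc network delete lxdbr0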
Storage
Whichever implementation you prefer, use shared storage.
For example, iSCSI for virtual machines. You will not get rid of the single point of failure, but you will be able to consolidate storage in one place. This opens up interesting new possibilities.
For this you need interfaces of at least 10 Gbit/s inside the data center. But even if you only have 1 Gbit/s, do not despair: that is about 125 MB/s of storage, quite decent for hypervisors that do not need heavy disk load.
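With open-iscsi, hooking such storage up to a hypervisor looks roughly like this (the portal address is an example):
# discover targets on the storage box
$ iscsiadm -m discovery -t sendtargets -p 192.0.2.50
# log in to the discovered target
$ iscsiadm -m node --login
# the LUN then shows up as an ordinary block device (see lsblk) and can be handed to LVM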
KVM can migrate storage along with the machine. But dragging a virtual machine of a couple of terabytes under a working load is painful. With shared storage, only the RAM needs to be transferred, which is trivial. This shortens the migration time.
As a result, LXD or KVM?
Initially, we assumed we would take LXD for all virtual machines whose kernel matches the host system, and KVM wherever a different kernel is needed.
In reality, the plans did not take off. To understand why, let's take a closer look at LXD.
LXD
The main advantage is the memory saved on the kernel. The kernel is shared, and when we launch new containers it is still the same kernel. That is where the pros ended and the cons began.
A block device with rootfs needs to be mounted. It's harder than it looks.
There is effectively no migration. It exists, and it is based on the wonderfully gloomy tool CRIU, which our compatriots are hacking on. I am proud of them, but in simple cases CRIU does not work.
zabbix-agent behaves strangely in a container. If you run it inside a container, you will see data from the host system, not from the container. So far nothing can be done about it.
Looking at the process list on the hypervisor, it is impossible to quickly work out which container a particular process comes from. It takes time to figure out which namespace is which and what lives where. If the load jumps somewhere above the usual, you cannot pin it down quickly. This is the main problem: it limits your ability to react. A mini-investigation is needed for every case.
The only plus of LXD is saving memory on the kernel and reducing overhead.
But Kernel Shared Memory (KSM) in KVM saves memory too.
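KSM lives in sysfs, so it is easy to check whether it is running and how much it actually merges (these paths are the standard kernel interface, nothing specific to our setup):
# is KSM on? (1 means yes)
$ cat /sys/kernel/mm/ksm/run
# turn it on if it is not
$ echo 1 | sudo tee /sys/kernel/mm/ksm/run
# how many pages are shared and how many pages map onto them
$ cat /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing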
I see no reason to bring LXD into serious production. Despite Canonical's best efforts in this area, LXD in production brings more problems than it solves. In the near future the situation will not change.
But you can't say that LXD is evil. It is good, but in limited cases, which will be discussed later.
CRIU
CRIU is a gloomy utility.
Create an empty container, which arrives with a DHCP client, and tell it: "Suspend!" You get an error, because there is a DHCP client: "Horror, horror! It has a raw socket open, what a nightmare!" It could hardly be worse.
Impressions of containers: no migration, and CRIU works every other time.
I like the recommendation from the LXD team about what to do with CRIU so that there are no problems:
- Get a fresher version from the repository!
Maybe it could just be installed from a package, so we would not have to chase the repository?
Conclusions
LXD is wonderful if you want to build CI/CD infrastructure. We take LVM, the Logical Volume Manager, make a snapshot, and start a container on it. Everything works great! In a second a new clean container is created, which we use for testing and for rolling out Chef cookbooks; we use this actively.
LXD is weak for serious production. We cannot figure out what to do with LXD in production when it does not work well.
Choose KVM and only KVM!
Migration
I will say this briefly. For us, migration turned out to be a wonderful new world that we like. Everything is simple there: there is a command for migration and two important options:
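In virsh terms it looks roughly like the following; the machine and host names here are examples of mine, not from the talk:
$ virsh migrate --live --persistent --undefinesource my-vm qemu+ssh://target-hypervisor/system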
If you type "KVM migration" into Google and open the first result, you will see the migration command, but without those last two flags. Nothing mentions that they are important: "Just run this command!" You run the command, and it does migrate, but how exactly?
Important migration options.
undefinesource removes the virtual machine from the hypervisor you are migrating away from. If you migrate without this option and that hypervisor later reboots, it will start the machine up again. You would be surprised, but that is normal.
Without the second parameter, persistent, the hypervisor you moved to does not consider the migration permanent at all. After a reboot it will not remember anything.
You can check whether the destination considers the machine permanent:
$ virsh dominfo <vm> | grep -i persistent
Without this option the virtual machine is like ripples on the water. If the first option is given without the second, guess what happens.
There are many such moments with KVM.
Network: everyone tells you about the bridge, and it is a nightmare. You read and wonder: how can that be?
Migration: nobody will tell you anything intelligible about it either, until you bang your own head against that wall.
Where to begin?
It is too late to start from scratch; I am talking about something else.
Provisioning: how to deploy it
If you are satisfied with the standard installation options, then the preseed mechanism is beautiful.
After ESXi, we used virt-install. This is the standard way to deploy a virtual machine. It is convenient: you create a preseed file that describes your Debian/Ubuntu image, feed a new machine an ISO distribution and the preseed file, and the machine rolls itself out. You connect to it over SSH, hook it into Chef, roll out the cookbooks, and off it goes to production!
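A sketch of such a virt-install run, under the assumption that the disk is an LVM volume, the network goes through macvtap on eno1, and preseed.cfg lies in the current directory; all names and the installer URL are example values, not the ones from the talk:
$ virt-install \
    --name test-vm \
    --memory 2048 --vcpus 2 \
    --disk path=/dev/vg0/test-vm \
    --network type=direct,source=eno1,source_mode=bridge,model=virtio \
    --location http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/ \
    --initrd-inject preseed.cfg \
    --extra-args "auto=true priority=critical console=ttyS0" \
    --graphics none \
    --os-variant ubuntu18.04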
But if virt-install is enough for you, I have bad news: it means you have not yet reached the stage where you want to do things differently. We came to understand that virt-install is not enough. We arrived at a kind of "golden image" that we clone and then start virtual machines from.
And how should a virtual machine be laid out?
Why did we come to this image, and why does provisioning matter at all? Because the community still poorly understands that there are big differences between a virtual machine and an ordinary one.
A virtual machine does not need a complicated boot process and a clever bootloader. It is much easier to attach a virtual machine's disks to another machine that has a full set of tools than to struggle somewhere in recovery mode.
A virtual machine needs a simple layout. Why have partitions on a virtual disk at all? Why do people take a virtual disk and put partitions on it instead of LVM?
A virtual machine needs maximum room to grow, and virtual machines usually do grow. Growing a partition in an MBR is a "fun" process: you delete the partition, wipe the sweat from your forehead thinking "please, nothing write to it right now, please", and create it again with new boundaries.
LVM + lilo
In the end we arrived at LVM plus lilo, a bootloader that is configured from a single file. If configuring GRUB means editing a special file that drives a template engine and builds the monstrous grub.cfg, then with lilo there is one file and nothing more.
LVM without partitions keeps the system perfectly simple. The problem is that GRUB cannot live without an MBR or GPT and just gives up. We tell it "GRUB goes here", but it cannot, because there are no partitions.
LVM allows you to quickly expand and make backups. Standard dialogue:
- Guys, how do you back up a virtual machine?
- ...we take the block device and copy it.
- Have you tried restoring from that copy?
- Well, no, but everything works for us!
You can copy off a virtual machine's block device at any moment, but if there is a file system on it, any write to that file system takes several movements, and the procedure is not atomic.
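With LVM under every virtual disk, a copy is one snapshot away. A minimal sketch, where the volume group, names and sizes are examples:
# freeze a point-in-time view of the machine's volume
$ lvcreate --snapshot --name my-vm-backup --size 10G /dev/vg0/my-vm
# copy the snapshot away while the machine keeps running
$ dd if=/dev/vg0/my-vm-backup of=/backup/my-vm.img bs=4M status=progress
# drop the snapshot
$ lvremove -f /dev/vg0/my-vm-backup
Such a copy is only crash-consistent, which is exactly where the next point comes in.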
If you take a snapshot of a virtual machine from the inside, it can negotiate with the file system so that it reaches the right consistent state. But that does not suit everything.
How to build a container?
There are standard tools for creating and starting containers from templates. LXD offers Ubuntu 16.04 or 18.04 templates. But if you are an advanced fighter and want not a stock template but your own custom rootfs, the question arises: how do you create a container from scratch in LXD?
Container from scratch
Prepare the rootfs. debootstrap will help here: explain which packages are needed and which are not, and it installs them.
Explain to LXD that we want to create a container from a specific rootfs. But first we create an empty container with a short command (the full sequence is sketched after these steps):
A thoughtful reader will ask: where is the rootfs of my-container? Where do we say where it lives? But I never said that was all!
Mount the container's rootfs where it is going to live. Then tell LXD that the container's rootfs will live there:
lxc config set my-container raw.lxc "lxc.rootfs=/containers/my-container/rootfs"
Again this is automated.
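Put together, the whole sequence might look roughly like this. This is a sketch under assumptions: the paths, the Ubuntu suite bionic and the container name are examples, and lxc init from a stock image is only one possible way to get the empty container, since the exact command is not shown above.
# 1. prepare a minimal rootfs (suite, path and mirror are examples)
$ debootstrap --variant=minbase bionic /containers/my-container/rootfs http://archive.ubuntu.com/ubuntu
# 2. create the container record in LXD (one possible way: from any stock image)
$ lxc init ubuntu:18.04 my-container
# 3. point the underlying LXC at our custom rootfs
$ lxc config set my-container raw.lxc "lxc.rootfs=/containers/my-container/rootfs"
# 4. start it
$ lxc start my-container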
Container life
The container does not have its own kernel, so it boots more simply: systemd, init, and off it goes.
If you are not using the standard LXD storage tools, then in most cases you will need to mount the container's rootfs on the hypervisor in order to start the container.
I sometimes come across articles that advise autofs. Do not do that. Systemd has automount units that work; autofs does not. So use systemd automount units, and autofs is not worth it.
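A sketch of such an automount pair for a container rootfs kept on LVM; the unit names must match the mount path, and the volume, path and container name here are examples:
# /etc/systemd/system/containers-mycontainer-rootfs.mount
[Unit]
Description=rootfs for mycontainer

[Mount]
What=/dev/vg0/mycontainer
Where=/containers/mycontainer/rootfs
Type=ext4

# /etc/systemd/system/containers-mycontainer-rootfs.automount
[Unit]
Description=automount for mycontainer rootfs

[Automount]
Where=/containers/mycontainer/rootfs

[Install]
WantedBy=multi-user.target
Then enable the automount: systemctl enable --now containers-mycontainer-rootfs.automount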
Conclusions
We like KVM with migration. We are not on the way with LXD yet, although we use it for tests and for building infrastructure where there is no production load.
We like KVM's performance. It is more familiar to look at top, see a process there that corresponds to a particular virtual machine, and understand who is doing what. That is better than using a set of strange container utilities to figure out what is knocking under the hood.
We are delighted with migration. This is largely thanks to shared storage. If we had to drag disks around when migrating, we would not be so happy.
If you, like Lev, are ready to talk about overcoming the difficulties of operation, integration or support, now is the time to submit a talk to the autumn DevOpsConf conference. And we on the program committee will help you prepare a talk as encouraging and useful as this one.
We are not waiting for the Call for Papers deadline and have already accepted several talks into the conference program. Subscribe to the newsletter and the Telegram channel to stay up to date on the preparations for DevOpsConf 2019 and not miss new articles and videos.