Anyone who has ever gone looking for high-performance software-defined storage has probably heard of DRBD at some point, and may even have worked with it.
True, at the peak of popularity of Ceph and GlusterFS, which work well in principle and, most importantly, right out of the box, everyone simply forgot about it a little. Besides, the previous version did not support replication across more than two nodes, which often led to split-brain problems and obviously did not add to its popularity.
The solution is indeed not new, but it is quite competitive. With relatively low CPU and RAM costs, DRBD provides really fast and secure synchronization at the block-device level. All this time LINBIT, the developers of DRBD, have not been standing still and are constantly refining it. Starting with DRBD9, it ceases to be just a network mirror and becomes something more.
First, the idea of creating a single distributed block device for several servers has receded into the background, and now LINBIT is trying to provide orchestration and management tools for multiple DRBD devices in a cluster, created on top of LVM and ZFS volumes.
For example, DRBD9 supports up to 32 replicas, RDMA and diskless nodes, while the new orchestration tools let you use snapshots, online migration and much more.
Although DRBD9 has integration tools for Proxmox, Kubernetes, OpenStack and OpenNebula, they are currently in a transitional state: the new tool is not yet supported everywhere, and the old one will soon be declared deprecated. These tools are DRBDmanage and Linstor.
I will take advantage of this moment not to dive deeply into each of them, but instead to look more closely at the setup and working principles of DRBD9 itself. You will have to deal with it anyway, if only because the fault-tolerant configuration of the Linstor controller implies installing it on one of these devices.
In this article I would like to tell you about DRBD9 and the possibility of using it in Proxmox without third-party plugins.
First, it is worth mentioning DRBDmanage once again: it integrates very well into Proxmox. LINBIT provides a ready-made DRBDmanage plugin for Proxmox that lets you use all of its functions directly from the Proxmox interface.
It looks really awesome, but unfortunately it has some downsides: among other things, DRBDmanage keeps its own control data on a dedicated drbdpool volume and relies on rather complex internal logic to communicate with the nodes. As a result, LINBIT decided to replace all the complex logic of DRBDmanage with a simple application that talks to the nodes over an ordinary TCP connection, without any magic. This is how Linstor appeared.
Linstor really works very well. Unfortunately, the developers chose Java as the main language for the Linstor server, but don't let that scare you: Linstor itself only handles distributing DRBD configs and carving out LVM / ZFS volumes on the nodes.
Both solutions are free and distributed under the GPLv3 license.
You can read about each of them, and about setting up the above-mentioned plugin for Proxmox, on the official Proxmox wiki.
Unfortunately, at the time of this writing Linstor has a ready-made integration only with Kubernetes. Drivers for the rest — Proxmox, OpenNebula, OpenStack — are expected by the end of the year.
But so far there is no ready-made solution, and we don't like the old one anyway. So let's try using DRBD9 the old-fashioned way to organize NFS access to a shared partition.
This solution is not without its advantages, though: the NFS server will let several servers access the storage file system concurrently, without the need for complex cluster file systems with DLM such as OCFS2 and GFS2.
At the same time, you will be able to switch the Primary / Secondary roles of the nodes simply by migrating the container with the NFS server in the Proxmox interface.
You can also store any files inside this file system, as well as virtual disks and backups.
In case you use Kubernetes, you can arrange ReadWriteMany access for your PersistentVolumes.
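For instance, a PersistentVolume pointing at this NFS export could look roughly like this (a sketch, assuming the server address and export path that appear later in this article):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: drbd-nfs-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.11   # address of the NFS container
    path: /data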
Now the question is: why Proxmox?
In principle, we could build such a scheme with Kubernetes, or with a classic cluster-manager setup. But Proxmox provides a ready-made, very versatile and at the same time simple and intuitive interface for almost everything you need. It supports clustering out of the box and a softdog-based fencing mechanism. And when using LXC containers, it allows you to achieve minimal switchover timeouts.
The resulting solution will not have a single point of failure.
In essence, we will use Proxmox primarily as a cluster manager, where a single LXC container can be treated as a service running in a classic HA cluster, with the only difference that the container also ships with its own root filesystem. That is, you do not need to install several instances of the service on each server separately; you only do it once, inside the container.
If you have ever worked with cluster-manager software and set up HA for applications, you will understand what I mean.
Our solution will resemble a standard database replication scheme.
DRBD9 has one very cool feature that simplifies things a lot:
A DRBD device automatically becomes Primary the moment it is mounted on some node. If the device is marked Primary, any attempt to mount it on another node will result in an access error. This provides locking and guaranteed protection against simultaneous access to the device.
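To make this concrete, here is roughly how it plays out on the command line (an illustrative session using the device and resource names from later in this article; exact error messages vary):

# on the node where the device is needed (e.g. pve1):
mount /dev/drbd100 /mnt        # the resource is automatically promoted to Primary
drbdadm status nfs1            # now shows role:Primary on this node
# meanwhile, the same mount on another node (e.g. pve2) fails with an access error,
# because the device is already Primary elsewhere
umount /mnt                    # back on pve1: the resource drops to Secondary again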
Why does this simplify things so much? Because when the container starts, Proxmox automatically mounts this device and it becomes Primary on that node, and when the container stops, Proxmox unmounts the device and it becomes Secondary again.
So we no longer need to worry about switching Primary / Secondary devices; Proxmox will do it automatically. Hurray!
Alright, we have sorted out the idea; now let's move on to the implementation.
By default, the Linux kernel ships with the version 8 DRBD module. Unfortunately, it does not suit us, so we need to install the version 9 module.
Connect the LINBIT repository and install everything you need:
wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add -
echo "deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
  > /etc/apt/sources.list.d/linbit.list
apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop
pve-headers - kernel headers necessary for building the module
drbd-dkms - kernel module in DKMS format
drbd-utils - basic utilities for managing DRBD
drbdtop - an interactive tool like top, for DRBD only

After installing the module, check if everything is in order with it:
# modprobe drbd
# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
If you see version 8 in the command output, something went wrong and the in-tree kernel module has been loaded. Check dkms status to find out the reason.
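On a healthy node the DKMS state should look roughly like this (illustrative output; the exact version strings depend on your kernel and DRBD packages):

# dkms status | grep drbd
drbd, 9.0.14-1, 4.15.18-1-pve, x86_64: installed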
Each node will run the same DRBD device on top of ordinary partitions. First we need to prepare such a partition for DRBD on each node.
Any block device can serve as such a partition: an LVM volume, a ZVOL, a disk partition or the whole disk. In this article I will use a separate NVMe disk with a partition for DRBD: /dev/nvme1n1p1
It is worth noting that device names tend to change from time to time, so it is better to get into the habit of using a persistent symlink to the device right away.
You can find such a symlink for /dev/nvme1n1p1 like this:
# find /dev/disk/ -lname '*/nvme1n1p1'
/dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
/dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
/dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
/dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1
We describe our resource on all three nodes:
# cat /etc/drbd.d/nfs1.res
resource nfs1 {
  meta-disk internal;
  device /dev/drbd100;
  protocol C;
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on pve1 {
    address 192.168.2.11:7000;
    disk /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
    node-id 0;
  }
  on pve2 {
    address 192.168.2.12:7000;
    disk /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
    node-id 1;
  }
  on pve3 {
    address 192.168.2.13:7000;
    disk /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
    node-id 2;
  }
  connection-mesh {
    hosts pve1 pve2 pve3;
  }
}
It is advisable to use a separate network for DRBD synchronization.
Now create the metadata for drbd and launch it:
# drbdadm create-md nfs1
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
# drbdadm up nfs1
Repeat these actions on all three nodes and check the status:
# drbdadm status
nfs1 role:Secondary
  disk:Inconsistent
  pve2 role:Secondary
    peer-disk:Inconsistent
  pve3 role:Secondary
    peer-disk:Inconsistent
Now our disk is Inconsistent on all three nodes; this is because DRBD does not know which disk should be taken as the original. We need to mark one of them as Primary so that its state is synchronized to the other nodes:
drbdadm primary --force nfs1
drbdadm secondary nfs1
Immediately after this, synchronization will start:
# drbdadm status
nfs1 role:Secondary
  disk:UpToDate
  pve2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:26.66
  pve3 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.20
We don't have to wait for it to finish; we can carry out the next steps in parallel. They can be performed on any node, regardless of the current state of its local disk in DRBD. All requests will automatically be redirected to a device in the UpToDate state.
Don't forget to enable autostart of the drbd service on the nodes:
systemctl enable drbd.service
I will skip the configuration of the three-node Proxmox cluster itself; this part is well described in the official wiki.
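For reference, it boils down to just a couple of commands (a minimal sketch; the cluster name here is made up):

# on the first node (pve1):
pvecm create pve-drbd-cluster
# on pve2 and pve3:
pvecm add pve1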
As I said before, our NFS server will run in an LXC container. We will keep the container itself on the /dev/drbd100 device we have just created.
First we need to create a file system on it:
mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100
Here we also enable multi-mount protection at the file-system level right away. In principle we could do without it, because by default DRBD has its own protection and will simply refuse a second Primary for the device, but a little extra caution won't hurt.
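If you want to make sure the MMP flag really ended up on the file system, you can check the superblock (a quick sanity check; exact field names may differ between e2fsprogs versions):

# dumpe2fs -h /dev/drbd100 | grep -i mmp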
Now download the Ubuntu template:
# wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/
And create from it our container:
pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
  --hostname=nfs1 \
  --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
  --rootfs=volume=/dev/drbd100,shared=1
In this command we specify that the container's root filesystem will live on the /dev/drbd100 device, and we add the shared=1 parameter to allow the container to be migrated between nodes.
If something goes wrong, you can always fix it through the Proxmox interface or in the container config file /etc/pve/lxc/101.conf.
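For reference, the relevant part of the resulting config should look roughly like this (an illustrative excerpt, not the complete file):

# cat /etc/pve/lxc/101.conf
hostname: nfs1
net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24
rootfs: /dev/drbd100,shared=1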
Proxmox will unpack the template and prepare the container root filesystem for us. After that we can start our container:
pct start 101
By default, Proxmox does not allow running an NFS server inside a container, but there are several ways to enable it.
One of them is simply to add lxc.apparmor.profile: unconfined to the config of our container, /etc/pve/lxc/101.conf.
Alternatively, we can enable NFS for all containers permanently; to do this, we need to update the standard LXC AppArmor template on all nodes by adding the following lines to /etc/apparmor.d/lxc/lxc-default-cgns:
  mount fstype=nfs,
  mount fstype=nfs4,
  mount fstype=nfsd,
  mount fstype=rpc_pipefs,
After the changes, restart the container:
pct shutdown 101
pct start 101
Now let's log in to it:
pct exec 101 bash
Install updates and the NFS server:
apt-get update
apt-get -y upgrade
apt-get -y install nfs-kernel-server
Create an export:
echo '/data *(rw,no_root_squash,no_subtree_check)' >> /etc/exports
mkdir /data
exportfs -a
At the time of writing, the Proxmox HA manager has a bug that prevents the HA container from shutting down cleanly: NFS server processes that are not fully killed in kernel space prevent the DRBD device from going Secondary. If you run into this situation, don't panic; simply execute killall -9 nfsd on the node where the container was running, and the DRBD device should then be "released" and go back to Secondary.
To fix this bug, execute the following commands on all nodes:
sed -i 's/forceStop => 1,/forceStop => 0,/' /usr/share/perl5/PVE/HA/Resources/PVECT.pm
systemctl restart pve-ha-lrm.service
Now we can go to the HA-manager configuration. Create a separate HA group for our device:
ha-manager groupadd nfs1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1
Our resource will work only on the nodes specified for this group. Add our container to this group:
ha-manager add ct:101 --group=nfs1 --max_relocate=3 --max_restart=3
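You can check that the container is now under HA control (illustrative, trimmed output; the active master and node will of course vary):

# ha-manager status
quorum OK
master pve1 (active, ...)
service ct:101 (pve1, started)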
That's all. Simple, isn't it?
The resulting NFS share can be immediately connected to Proxmox for storing and running other virtual machines and containers.
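For example, it can be added as NFS storage roughly like this (a sketch; the storage name and content types are just an example, and 192.168.1.11 is the container's address from earlier):

pvesm add nfs drbd-nfs --server 192.168.1.11 --export /data --content images,rootdir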
As I noted above, it is always advisable to use a separate network for replication. It is highly desirable to use 10-gigabit network adapters, otherwise you will run into the port speed limit.
If replication seems too slow, try tuning some DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:
# cat /etc/drbd.d/global_common.conf
global {
  usage-count yes;
  udev-always-use-vnr;
}
common {
  handlers {
  }
  startup {
  }
  options {
  }
  disk {
    c-fill-target 10M;
    c-max-rate 720M;
    c-plan-ahead 10;
    c-min-rate 20M;
  }
  net {
    max-buffers 36k;
    sndbuf-size 1024k;
    rcvbuf-size 2048k;
  }
}
You can get more information about each parameter from the official DRBD documentation.
Increasing the total number of running NFS server instances may help speed up the NFS server. The default is 8; in my case, increasing this number to 64 helped.
To do this, update the RPCNFSDCOUNT=64 parameter in /etc/default/nfs-kernel-server.
And restart the daemons:
systemctl restart nfs-utils
systemctl restart nfs-server
Do you know the difference between NFSv3 and NFSv4? NFSv4 is the newer protocol and, unlike NFSv3, it is stateful.
However, when you run showmount -e nfs_server, the NFSv3 protocol is used. Proxmox also uses NFSv3. NFSv3 is also commonly used for booting machines over the network.
In general, if you have no particular reason to use NFSv4, try to use NFSv3, as it is less painful during any failures, since it keeps no state as such.
You can mount the share over NFSv3 by specifying the -o vers=3 parameter of the mount command:
mount -o vers=3 nfs_server:/share /mnt
If you wish, you can disable NFSv4 on the server entirely; to do this, add the --no-nfs-version 4 option to the RPCNFSDCOUNT variable and restart the server, for example:
RPCNFSDCOUNT="64 --no-nfs-version 4"
In a similar way, an ordinary tgt daemon can be configured inside the container; iSCSI will deliver noticeably better I/O performance, and the container will run more smoothly because the tgt server works entirely in user space.
Typically, an exported LUN is then sliced into multiple pieces using LVM. However, there are several nuances worth considering, for example: how LVM locking is handled so that the exported volume group can be shared across several hosts.
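As a teaser, a minimal tgt export could look roughly like this (purely an illustrative sketch; the IQN and the backing LVM volume name are made up):

# /etc/tgt/conf.d/drbd-lun.conf
<target iqn.2018-07.local.pve:drbd.lun1>
    # a logical volume carved out of the DRBD-backed volume group
    backing-store /dev/vg_drbd/lun1
</target>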
Perhaps I will describe these and other nuances in the next article.
Source: https://habr.com/ru/post/417473/