
Secure storage with DRBD9 and Proxmox (Part 2: iSCSI + LVM)



In a previous article, I looked at the possibility of creating a fault-tolerant NFS server using DRBD and Proxmox. It turned out pretty well, but we will not stop there; now we will try to "squeeze all the juice" out of our storage.


In this article I will show how to create a fault-tolerant iSCSI target in a similar way, which we will then carve into small pieces with LVM and use for virtual machine disks.


This approach reduces the load and increases data access speed several times over. It is especially beneficial when concurrent access to the data is not required, for example when you need to organize storage for virtual machines.


A few words about DRBD


DRBD is a fairly simple and mature solution; the code of the eighth version is included in the mainline Linux kernel. In essence, it is a network mirror, a RAID1 over the network. The ninth version adds support for quorum and replication across more than two nodes.


In fact, it allows you to combine block devices on several physical nodes into one shared network storage.


Using DRBD you can achieve very interesting configurations. Today we will talk about iSCSI and LVM.


You can learn more about it by reading my previous article, where I described this solution in detail.


A couple of words about iSCSI


iSCSI is a protocol for delivering a block device over a network.


Unlike NBD, it supports authentication, handles network failures gracefully, offers many other useful features, and, most importantly, shows very good performance.


There are a huge number of implementations, some of them included in the kernel, and they present no particular difficulty in configuration or connection.
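For reference, here is roughly how a client connects to a target with the open-iscsi initiator; this is just a sketch, and the portal address and IQN are taken from the example configuration shown later in this article.


 # install the initiator and discover targets on the portal
 apt-get -y install open-iscsi
 iscsiadm -m discovery -t sendtargets -p 192.168.1.11
 # log in to the discovered target and check that a new block device appeared
 iscsiadm -m node -T iqn.2018-07.org.example.tgt1:disk1 -p 192.168.1.11 --login
 lsblk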


A couple of words about LVM


It is worth mentioning that LINBIT has its own solution for Proxmox; it should work out of the box and allow you to achieve a similar result. However, in this article I do not want to focus on Proxmox alone, but rather describe a more universal solution that suits both Proxmox and anything else. In this example Proxmox is used only as a means of container orchestration; in fact, you can replace it with another solution, for example running the target containers in Kubernetes.


As for Proxmox specifically, it works fine with a shared LUN and LVM using only its own standard drivers.


The advantage of LVM is that it is not something revolutionarily new and insufficiently tested; on the contrary, it offers the kind of unexciting stability that is usually required of storage. It is worth mentioning that LVM is actively used in other environments as well, for example in OpenNebula or Kubernetes, and is supported quite well there.


Thus, you get universal storage that can be used in different systems (not only in Proxmox), using only ready-made drivers and without much need for additional tinkering.


Unfortunately, when choosing a storage solution you always have to make compromises, and this one will not give you the same flexibility as, for example, Ceph.
The virtual disk size is limited by the size of the LVM volume group, and the area allocated for a specific virtual disk is necessarily preallocated. This greatly improves data access speed, but rules out thin provisioning (where a virtual disk occupies only as much space as is actually written to it). It is also worth mentioning that LVM performance drops noticeably when snapshots are used, so their free use is often ruled out.


Yes, LVM supports thin pools, which do not have this drawback, but unfortunately they can only be used within a single node; there is no way to share one thin pool between several nodes in a cluster.


But despite these shortcomings, LVM's simplicity still keeps competitors from bypassing it and pushing it off the battlefield entirely.


With a fairly small overhead, LVM still represents a very fast, stable and reasonably flexible solution.


General scheme



Setup


With the idea sorted out, let's move on to the implementation.


By default, the Linux kernel ships the drbd module of the eighth version; unfortunately it does not suit us, and we need to install the module of the ninth version.


Connect the LINBIT repository and install everything you need:


wget -O- https://packages.linbit.com/package-signing-pubkey.asc | apt-key add -
echo "deb http://packages.linbit.com/proxmox/ proxmox-5 drbd-9.0" \
    > /etc/apt/sources.list.d/linbit.list
apt-get update && apt-get -y install pve-headers drbd-dkms drbd-utils drbdtop


After installing the module, check if everything is fine with it:


 # modprobe drbd
 # cat /proc/drbd
 version: 9.0.14-1 (api:2/proto:86-113)

If you see the eighth version in the output of the command, something went wrong and the in-tree kernel module is loaded. Check the dkms status for the reason.
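A quick way to check which module was actually built and loaded (a small sketch; the exact output depends on your kernel and package versions):


 # check that the dkms module was built for the running kernel
 dkms status | grep drbd
 # the version reported here should start with 9, not 8
 modinfo drbd | grep ^version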


Each node will run the same drbd device on top of an ordinary partition. First we need to prepare this partition for drbd on each node.


Such a partition can be any block device: an lvm volume, a zvol, a disk partition or an entire disk. In this article I will use a separate nvme disk with a partition for drbd: /dev/nvme1n1p1


It is worth noting that device names sometimes change, so it is better to get into the habit of referring to the device by a persistent symlink right away.


You can find such a symlink for /dev/nvme1n1p1 like this:


 # find /dev/disk/ -lname '*/nvme1n1p1'
 /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2
 /dev/disk/by-path/pci-0000:0e:00.0-nvme-1-part1
 /dev/disk/by-id/nvme-eui.0000000001000000e4d25c33da9f4d01-part1
 /dev/disk/by-id/nvme-INTEL_SSDPEKKA010T7_BTPY703505FB1P0H-part1

We describe our resource on all three nodes:


 # cat /etc/drbd.d/tgt1.res
 resource tgt1 {
     meta-disk internal;
     device    /dev/drbd100;
     protocol  C;
     net {
         after-sb-0pri discard-zero-changes;
         after-sb-1pri discard-secondary;
         after-sb-2pri disconnect;
     }
     on pve1 {
         address   192.168.2.11:7000;
         disk      /dev/disk/by-partuuid/95e7eabb-436e-4585-94ea-961ceac936f7;
         node-id   0;
     }
     on pve2 {
         address   192.168.2.12:7000;
         disk      /dev/disk/by-partuuid/aa7490c0-fe1a-4b1f-ba3f-0ddee07dfee3;
         node-id   1;
     }
     on pve3 {
         address   192.168.2.13:7000;
         disk      /dev/disk/by-partuuid/847b9713-8c00-48a1-8dff-f84c328b9da2;
         node-id   2;
     }
     connection-mesh {
         hosts pve1 pve2 pve3;
     }
 }

It is advisable to use a separate network for drbd synchronization.
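For example, a dedicated replication interface on pve1 might be described like this in /etc/network/interfaces; the interface name ens19 is an assumption, substitute your own.


 # hypothetical dedicated replication interface on pve1
 auto ens19
 iface ens19 inet static
     address 192.168.2.11
     netmask 255.255.255.0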


Now create the metadata for drbd and launch it:


 # drbdadm create-md tgt1
 initializing activity log
 initializing bitmap (320 KB) to all zero
 Writing meta data...
 New drbd meta data block successfully created.
 success
 # drbdadm up tgt1

Repeat these actions on all three nodes and check the status:


 # drbdadm status
 tgt1 role:Secondary
   disk:Inconsistent
   pve2 role:Secondary
     peer-disk:Inconsistent
   pve3 role:Secondary
     peer-disk:Inconsistent

Now the disk is Inconsistent on all three nodes because drbd does not know which one should be taken as the original. We need to mark one of them as Primary so that its state is synchronized to the other nodes:


 drbdadm primary --force tgt1
 drbdadm secondary tgt1

Immediately after this, synchronization will start:


 # drbdadm status
 tgt1 role:Secondary
   disk:UpToDate
   pve2 role:Secondary
     replication:SyncSource peer-disk:Inconsistent done:26.66
   pve3 role:Secondary
     replication:SyncSource peer-disk:Inconsistent done:14.20

We don't have to wait for it to finish; we can carry out the next steps in parallel. They can be performed on any node, regardless of the current state of its local disk in DRBD. All requests will be automatically redirected to a device in the UpToDate state.


Don't forget to enable the drbd service to start automatically on the nodes:


 systemctl enable drbd.service 

Configure LXC Container


I will omit the configuration of the three-node Proxmox cluster itself; this part is well described in the official wiki.


As I said before, our iSCSI target will work in an LXC container. We will keep the container itself on the /dev/drbd100 device we just created.


First we need to create a file system on it:


 mkfs -t ext4 -O mmp -E mmp_update_interval=5 /dev/drbd100 

Proxmox enables multi-mount protection at the file system level by default. In principle we could do without it, since DRBD has its own protection by default and will simply refuse a second Primary for the device, but extra caution will not hurt.
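If you want to make sure that MMP really ended up enabled on the new file system, a quick check might look like this:


 # the MMP feature and its update interval should show up in the superblock
 dumpe2fs -h /dev/drbd100 | grep -i mmp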


Now download the Ubuntu template:


 # wget http://download.proxmox.com/images/system/ubuntu-16.04-standard_16.04-1_amd64.tar.gz -P /var/lib/vz/template/cache/ 

And create our container from it:


 pct create 101 local:vztmpl/ubuntu-16.04-standard_16.04-1_amd64.tar.gz \
     --hostname=tgt1 \
     --net0=name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24 \
     --rootfs=volume=/dev/drbd100,shared=1

In this command we specify that the root file system of our container will be located on /dev/drbd100 and add the parameter shared=1 to allow the container to migrate between nodes.


If something went wrong, you can always fix it through the Proxmox interface or in the container config file /etc/pve/lxc/101.conf
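For reference, after creation the config should look roughly like the sketch below; values such as arch, memory and the generated MAC address are placeholders and will differ on your system.


 # /etc/pve/lxc/101.conf (approximate contents)
 arch: amd64
 hostname: tgt1
 memory: 512
 net0: name=eth0,bridge=vmbr0,gw=192.168.1.1,ip=192.168.1.11/24
 ostype: ubuntu
 rootfs: /dev/drbd100,shared=1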


Proxmox will unpack the template and prepare the container root file system for us. After that we can start our container:


 pct start 101 

Setting up an iSCSI target


Of the whole range of targets, I chose istgt, since it has the highest performance and works in user space.


Now let's log in to our container:


 pct exec 101 bash 

Install the updates and istgt:


 apt-get update
 apt-get -y upgrade
 apt-get -y install istgt

Create the file that we will serve over the network:


 mkdir -p /data
 fallocate -l 740G /data/target1.img

Now we need to write the istgt config, /etc/istgt/istgt.conf:


 [Global]
   Comment "Global section"
   NodeBase "iqn.2018-07.org.example.tgt1"
   PidFile /var/run/istgt.pid
   AuthFile /etc/istgt/auth.conf
   MediaDirectory /var/istgt
   LogFacility "local7"
   Timeout 30
   NopInInterval 20
   DiscoveryAuthMethod Auto
   MaxSessions 16
   MaxConnections 4
   MaxR2T 32
   MaxOutstandingR2T 16
   DefaultTime2Wait 2
   DefaultTime2Retain 60
   FirstBurstLength 262144
   MaxBurstLength 1048576
   MaxRecvDataSegmentLength 262144
   InitialR2T Yes
   ImmediateData Yes
   DataPDUInOrder Yes
   DataSequenceInOrder Yes
   ErrorRecoveryLevel 0

 [UnitControl]
   Comment "Internal Logical Unit Controller"
   AuthMethod CHAP Mutual
   AuthGroup AuthGroup10000
   Portal UC1 127.0.0.1:3261
   Netmask 127.0.0.1

 [PortalGroup1]
   Comment "SINGLE PORT TEST"
   Portal DA1 192.168.1.11:3260

 [InitiatorGroup1]
   Comment "Initiator Group1"
   InitiatorName "ALL"
   Netmask 192.168.1.0/24

 [LogicalUnit1]
   Comment "Hard Disk Sample"
   TargetName disk1
   TargetAlias "Data Disk1"
   Mapping PortalGroup1 InitiatorGroup1
   AuthMethod Auto
   AuthGroup AuthGroup1
   UseDigest Auto
   UnitType Disk
   LUN0 Storage /data/target1.img Auto

Restart istgt:


 systemctl restart istgt 

At this point, the target setup is complete.
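Before moving on, it is worth a quick sanity check that the target is actually up and listening; a minimal sketch:


 # inside the container: istgt should be running and listening on the iSCSI port
 systemctl status istgt
 ss -tlnp | grep 3260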


HA Setup


Now we can go to the HA-manager configuration. Create a separate HA group for our device:


 ha-manager groupadd tgt1 --nodes pve1,pve2,pve3 --nofailback=1 --restricted=1 

Our resource will work only on the nodes specified for this group. Add our container to this group:


 ha-manager add ct:101 --group=tgt1 --max_relocate=3 --max_restart=3 
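You can check that the container is now under HA control and, optionally, test a failover by migrating it to another node:


 # show the HA stack view of our resource
 ha-manager status
 # optionally test migration of the target container to another node
 ha-manager migrate ct:101 pve2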

Recommendations and tuning


DRBD

As I noted above, it is always advisable to use a separate network for replication. It is highly desirable to use 10-gigabit network adapters, otherwise you will run into the port speed limit.
If replication seems too slow, try tuning some DRBD parameters. Here is the config that, in my opinion, is optimal for my 10G network:


 # cat /etc/drbd.d/global_common.conf
 global {
     usage-count yes;
     udev-always-use-vnr;
 }
 common {
     handlers {
     }
     startup {
     }
     options {
     }
     disk {
         c-fill-target 10M;
         c-max-rate   720M;
         c-plan-ahead    10;
         c-min-rate    20M;
     }
     net {
         max-buffers  36k;
         sndbuf-size 1024k;
         rcvbuf-size 2048k;
     }
 }

You can find more information about each parameter in the official DRBD documentation.


Open-iSCSI

Since we do not use multipathing, in our case it is recommended to disable periodic connection checks on the clients, as well as to increase the session recovery timeouts in /etc/iscsi/iscsid.conf.


 node.conn[0].timeo.noop_out_interval = 0
 node.conn[0].timeo.noop_out_timeout = 0
 node.session.timeo.replacement_timeout = 86400
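Note that iscsid.conf only affects targets discovered after the change; for targets that are already recorded, the same settings can be updated per node, for example (the IQN is the one from this article's configuration):


 # update the stored settings of an already discovered target
 iscsiadm -m node -T iqn.2018-07.org.example.tgt1:disk1 -p 192.168.1.11 \
   -o update -n node.session.timeo.replacement_timeout -v 86400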

Usage


Proxmox


The resulting iSCSI target can be connected to Proxmox right away; just do not forget to uncheck Use LUN Directly.
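The same can be done from the command line with pvesm; the storage ID tgt1-iscsi below is arbitrary, and content none corresponds to the unchecked Use LUN Directly option:


 pvesm add iscsi tgt1-iscsi --portal 192.168.1.11 \
     --target iqn.2018-07.org.example.tgt1:disk1 --content none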



Immediately after this, you can create LVM on top of it; do not forget to tick the shared checkbox:
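If you prefer the command line, a rough equivalent could look like the sketch below; the LUN device name /dev/sdb is an assumption (check ls -l /dev/disk/by-path/ | grep iscsi on your host), and the storage and volume group names are arbitrary.


 # create a volume group on the attached LUN
 pvcreate /dev/sdb
 vgcreate drbdvg /dev/sdb
 # register it in Proxmox as shared LVM storage
 pvesm add lvm drbd-lvm --vgname drbdvg --shared 1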



Other environments


If you plan to use this solution in a different environment, you may need to install a cluster extension for LVM; at the moment there are two implementations: CLVM and lvmlockd.


Setting up CLVM is not trivial and requires a running cluster manager.
The second option, lvmlockd, is not yet fully battle-tested and is only just starting to appear in stable repositories.


I recommend reading an excellent article on locking in LVM.


When using LVM with Proxmox, the cluster extension is not required, since volume management is handled by Proxmox itself, which updates and monitors LVM metadata on its own. The same goes for OpenNebula, as clearly stated in the official documentation.



Source: https://habr.com/ru/post/417597/

