
Proxmox cluster storage. Part Three: Nuances

Hello!



The third part of this article is a kind of appendix to the previous two, in which I talked about working with a Proxmox cluster. In this part I will describe the problems we ran into while working with Proxmox and how we solved them.



Authenticated iSCSI connections



If you need to specify credentials when connecting to iSCSI, it is better to do this bypassing Proxmox. Why?




It is easier to connect manually:



root@srv01-vmx:~# iscsiadm -m discovery -t st -p 10.11.12.13
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.authmethod --value=CHAP
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.username --value=Admin
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.password --value=Lu4Ii2Ai
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --login


These commands must be executed on every node of the cluster, for each portal that serves the target we need. Alternatively, you can run them on a single node and then copy the resulting configuration files for this connection to the other nodes. The files live in the "/etc/iscsi/nodes" and "/etc/iscsi/send_targets" directories.
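For example, a minimal sketch of pushing the prepared configuration from one node to the rest and logging in there (the node names srv02-vmx and srv03-vmx are just placeholders for the remaining cluster nodes; rsync and key-based ssh between nodes are assumed):

for node in srv02-vmx srv03-vmx; do
    # copy the discovered targets and the session settings (including the CHAP credentials)
    rsync -a /etc/iscsi/nodes/ ${node}:/etc/iscsi/nodes/
    rsync -a /etc/iscsi/send_targets/ ${node}:/etc/iscsi/send_targets/
    # log in to the target on the remote node as well
    ssh ${node} iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --login
done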



Mounting GFS2 on a new node



In order to mount a GFS2 file system on a new node, another journal has to be added to it (to the file system). This is done as follows: on any cluster node where the required FS is already mounted, run the command:



 root@pve01:~# gfs2_jadd -j 1 /mnt/cluster/storage01 


The " -j " parameter specifies the number of logs to add to FS .



This command may fail with the error:



 create: Disk quota exceeded 


Causes of the error:



A GFS2 volume actually contains not one file system but two: the second one is used for service purposes. If desired, it can be mounted by adding the "-o meta" option; changes inside this FS can potentially corrupt the data file system. When a journal is added to the FS, this meta file system is mounted into a "/tmp/TEMP_RANDOM_DIR" directory, and the journal file is created inside it. For reasons we have not yet figured out, the kernel sometimes decides that the quota for creating objects in the mounted meta-FS has been exceeded, which is what produces this error. The way out is to remount the GFS2 data file system (naturally, all virtual machines living on this FS have to be stopped first) and run the journal-adding command once again. You also need to unmount the meta-FS left over from the last unsuccessful attempt to add the journal:



 cat /proc/mounts | grep /tmp/ | grep -i gfs2 | awk '{print $2}' | xargs umount 
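After that cleanup, the remount and the retry might look roughly like this (just a sketch: the device path /dev/mapper/cluster-storage01 is a placeholder, and all virtual machines living on this FS must be stopped beforehand):

root@pve01:~# umount /mnt/cluster/storage01
root@pve01:~# mount -t gfs2 /dev/mapper/cluster-storage01 /mnt/cluster/storage01
root@pve01:~# gfs2_jadd -j 1 /mnt/cluster/storage01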


Mounting the data source inside the container



Container virtualization technology is good in that the host has almost unlimited possibilities for interacting with the virtual machine.



When vzctl starts a container, it tries to execute the following set of scripts (if they exist):



vps.mount
CTID.mount
CTID.start

When the container is stopped, the following scripts are executed:



CTID.stop
CTID.umount
vps.umount

where " CTID " is the container number. The " vps. * " Scripts are executed during operations with any container. The scripts " * .start " and " * .stop " are executed in the context of the container, all the others are in the context of the host. Thus, we can script the process of starting / stopping the container, adding data mounting to it. Here are some examples:



Mounting the data directory inside the container


If a container works with a large amount of data, we try not to keep that data inside the container but to mount it from the host. This approach has two advantages:



  1. The container stays small and is quickly backed up by Proxmox. We can quickly restore or clone the container's functionality at any moment.
  2. The container's data can be backed up centrally by a grown-up backup system with all the facilities it provides (multi-level backups, rotation, statistics, and so on).


Contents of the file " CTID.mount ":



#!/bin/bash
. /etc/vz/vz.conf          # global OpenVZ config; among other things it defines ${VE_ROOT} - the container root on the host
. ${VE_CONFFILE}           # config of the container being started
DIR_SRC=/storage/src_dir   # directory on the host whose data we want to see inside the container
DIR_DST=/data              # mount point inside the container where $DIR_SRC will appear
mkdir -p ${VE_ROOT}/${DIR_DST}                               # create the mount point inside the container
mount -n -t simfs ${DIR_SRC} ${VE_ROOT}/${DIR_DST} -o /data  # mount the host directory via simfs
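To check from the host that the data is actually visible inside the container, something like this can be used (the container ID 101 is just an example):

root@srv01-vmx:~# vzctl exec 101 df -h /data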


Mounting the file system inside the container


On the host there is a volume that needs to be handed over to the container. Contents of the file "CTID.mount":



#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
UUID_SRC=3d1d8ec1-afa6-455f-8a27-5465c454e212   # UUID of the file system that will be mounted into the container
DIR_DST=/data                                   # mount point inside the container
mkdir -p ${VE_ROOT}/${DIR_DST}
mount -n -U ${UUID_SRC} ${VE_ROOT}/${DIR_DST}


Mounting a file system stored in a file inside the container


Why might this be needed? Sometimes a tricky product (for example, Splunk) flatly refuses to work on simfs, or we are not satisfied with how GFS2 performs under certain conditions. For example, we may have some kind of cache kept in a heap of small files, and GFS2 is not particularly fast with large numbers of small files. In that case we can create a file system other than GFS2 (say, ext3) on the host and attach it to the container.



We mount a file-backed loop device into the container:



First create the file:



 root@srv01:/storage# truncate -s 10G CTID_ext3.fs 


Format the FS in the file:



root@srv01:/storage# mkfs.ext3 CTID_ext3.fs
mke2fs 1.42 (29-Nov-2011)
CTID_ext3.fs is not a block special device.
Proceed anyway? (y,n) y
...


Contents of the file " CTID.mount ":



#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
CFILE_SRC=/storage/CTID_ext3.fs   # file containing the FS that will be mounted into the container
DIR_DST=/data                     # mount point inside the container
mkdir -p ${VE_ROOT}/${DIR_DST}
mount -n ${CFILE_SRC} -t ext3 ${VE_ROOT}/${DIR_DST} -o loop


Unmounting external data when the container is stopped


When a container is stopped, the system automatically tries to unmount all file systems attached to it. But in particularly exotic configurations it does not manage to do this, so, just in case, here is an example of a simple "CTID.umount" script:



#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
DIR=/data
if mountpoint -q "${VE_ROOT}${DIR}" ; then
    umount ${VE_ROOT}${DIR}
fi


Working in a cluster with a non-cluster file system



If for some reason you do not want to use a cluster FS (you are not happy with its stability, not happy with its speed, and so on), but you still want to work with a single shared storage, this option is possible too. All we really need for it is shared block storage with CLVM on top of it.





Procedure:



Each cluster node is allocated its own logical volume in CLVM, which is then formatted.
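For example, a sketch of such an allocation for one node (the volume group name "vg-cluster" and the size are placeholders; the VG itself must already exist as a clustered VG, as described in the previous parts):

root@srv01-vmx:~# lvcreate -n srv01-storage -L 500G vg-cluster
root@srv01-vmx:~# mkfs.ext3 /dev/vg-cluster/srv01-storage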



We create the main storage. Create a directory with the same name on all nodes of the cluster (for example, "/storage"). Mount the node's own logical volume into it. Then create a storage of type "Directory" in the Proxmox admin panel, call it, say, "STORAGE", and state that it is not shared.



We create the backup storage. Create a directory with the same name on all nodes of the cluster (for example, "/storage2"). Create a storage of type "Directory" in the Proxmox admin panel, call it, say, "STORAGE2", and again state that it is not shared. If one of the nodes dies or is shut down, we will mount its volume into the "/storage2" directory on the node that takes over the load of the deceased.
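The corresponding entries in "/etc/pve/storage.cfg" might then look roughly like this (a sketch only; the exact set of content types is up to you):

dir: STORAGE
        path /storage
        content images,rootdir
        shared 0

dir: STORAGE2
        path /storage2
        content images,rootdir
        shared 0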



What we have in the end: a kind of semi-failover. Theoretically, when one node dies, its containers can be brought back up on a surviving node.



Why "semi", and why "theoretically":



The virtual machines live in the "STORAGE" storage, which is located in the "/storage" directory. The disk from the dead node will be mounted into the "/storage2" directory, where Proxmox will see the containers but will not be able to launch them from there. To bring up the virtual machines located in that storage, you need to do three things:



  1. Tell the evacuated containers that their new home is not the "/storage" directory but "/storage2". To do this, in every "*.conf" file in the "/etc/pve/nodes/<dead_node_name>/openvz" directory, change the value of the VE_PRIVATE variable from "/storage/private/CTID" to "/storage2/private/CTID".
  2. Tell the cluster that the virtual machines of the dead node now live on this live one. To do this, it is enough to move all the files from the "/etc/pve/nodes/<dead_node_name>/openvz" directory to "/etc/pve/nodes/<live_node_name>/openvz". Perhaps there is some correct API call for this, but we did not bother looking for it :)
  3. Reset the quota for each evacuated container (just in case):



     vzquota drop CTID 




That's it. The containers can now be started.
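The three steps above are easy to wrap into a script. A rough sketch (the node names "deadnode" and "livenode" are placeholders for the dead node and the surviving one):

#!/bin/bash
DEAD=/etc/pve/nodes/deadnode/openvz
LIVE=/etc/pve/nodes/livenode/openvz
for conf in ${DEAD}/*.conf; do
    CTID=$(basename ${conf} .conf)
    # step 1: point the container at its new home
    sed -i 's|^VE_PRIVATE=.*|VE_PRIVATE="/storage2/private/'${CTID}'"|' ${conf}
    # step 3: reset the quota, just in case
    vzquota drop ${CTID}
done
# step 2: hand the container configs (and action scripts, if any) over to the live node
mv ${DEAD}/* ${LIVE}/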



If the containers from the dead node take up little space, or we have incredibly fast disks, or we can simply afford to wait, we can skip the first and third steps by moving the containers we need from "/storage2/private" to "/storage/private".



If the cluster has collapsed



A cluster is a capricious creature, and there are cases when it throws a sulk: for example, after a massive network problem or a massive power failure. It looks like this: any access to the cluster storage blocks the current session, polling the fence domain status produces alarming messages such as "wait state messages", and connection errors keep appearing in dmesg.



If no attempt to revive the cluster succeeds, the simplest thing to do is to disable automatic joining of the fence domain on all cluster nodes (the "/etc/default/redhat-cluster-pve" file) and then reboot all the nodes. Be prepared for the fact that the nodes may not be able to reboot on their own. Once all the nodes are back up, we manually join the fence domain, start CLVM, and so on. How to do this was described in the previous parts.
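A rough sketch of both stages (assuming that the automatic join is controlled by the FENCE_JOIN variable in that file and that CLVM is started by its usual init script; check both on your installation):

# on every node, before rebooting: do not join the fence domain automatically
root@srv01-vmx:~# sed -i 's/^FENCE_JOIN=.*/FENCE_JOIN="no"/' /etc/default/redhat-cluster-pve
root@srv01-vmx:~# reboot

# on every node, after all of them are back up: rejoin manually
root@srv01-vmx:~# fence_tool join
root@srv01-vmx:~# /etc/init.d/clvm start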



That is probably all.



In the next part I will talk about how we automate the work in the cluster.



Thanks for your attention!



Source: https://habr.com/ru/post/178429/


