
Creating reliable iSCSI storage on Linux, part 2



We continue building the cluster that we started in the first part.
This time I will talk about configuring the cluster manager.

Last time we left off having started the DRBD synchronization.
If we chose the same server as Primary for both resources, then after synchronization completes we should see something like this in /proc/drbd:
# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@debian-service, 2013-04-30 07:43:49
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate B r-----
    ns:0 nr:190397036 dw:190397036 dr:1400144904 al:0 bm:4942 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate B r-----
    ns:0 nr:720487828 dw:720485956 dr:34275816 al:0 bm:3749 lo:468 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

The most interesting field here is ds:UpToDate/UpToDate, which means that both the local and the remote copy are up to date.
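As a cross-check, drbdadm can report the same state per resource. A quick sketch using the resource names from part one:

# Connection, disk and role state for one DRBD resource.
drbdadm cstate VM_STORAGE_1   # expected: Connected
drbdadm dstate VM_STORAGE_1   # expected: UpToDate/UpToDate
drbdadm role VM_STORAGE_1     # expected: Secondary/Primary or Primary/Secondary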
After that, we switch the resources to Secondary mode; from now on they will be managed by the cluster:
# drbdadm secondary VM_STORAGE_1
# drbdadm secondary VM_STORAGE_2

Pacemaker


So, the cluster manager.

In short, it is the brain of the whole system; it manages abstractions called resources.
A cluster resource can be practically anything: IP addresses, file systems, DRBD devices, service daemons, and so on. It is fairly easy to create your own resource, which is what I had to do to manage iSCSI targets and LUNs; more on that later.

Install:
 # apt-get install pacemaker 

Corosync

Pacemaker uses the Corosync infrastructure for communication between the cluster nodes, so Corosync has to be configured first.

Corosync is quite feature-rich: it supports several modes of communication between nodes (unicast, multicast, broadcast) and implements RRP (Redundant Ring Protocol), which lets you use several independent paths between the cluster nodes to minimize the risk of split-brain, the situation where the link between the nodes is completely lost and each of them believes the other has died. As a result both nodes switch to active mode and chaos begins :)

Therefore we will use both the replication interfaces and the external interfaces to keep the cluster connected.

Proceed to setup

First you need to generate an authorization key:
 # corosync-keygen 

It must be placed on both servers as /etc/corosync/authkey.
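The key only needs to be generated on one node and then copied to the other. A minimal sketch, assuming the second node is reachable over SSH as server2:

# Copy the generated key to the second node and restrict its permissions;
# Corosync refuses keys that are readable by anyone but root.
scp /etc/corosync/authkey root@server2:/etc/corosync/authkey
chmod 400 /etc/corosync/authkey
ssh root@server2 chmod 400 /etc/corosync/authkey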

Next, create the config; it must be identical on both nodes:

/etc/corosync/corosync.conf
compatibility: none

totem {
        version: 2
        secauth: on
        threads: 3
        rrp_mode: active
        transport: udpu

        interface {
                member {
                        memberaddr: 10.1.0.100
                }
                member {
                        memberaddr: 10.1.0.200
                }
                ringnumber: 0
                bindnetaddr: 10.1.0.0
                mcastport: 5405
                ttl: 1
        }

        interface {
                member {
                        memberaddr: 192.168.123.100
                }
                member {
                        memberaddr: 192.168.123.200
                }
                ringnumber: 1
                bindnetaddr: 192.168.123.0
                mcastport: 5407
                ttl: 1
        }
}

amf {
        mode: disabled
}

service {
        ver: 1
        name: pacemaker
}

aisexec {
        user: root
        group: root
}

logging {
        syslog_priority: warning
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on

        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

Here we describe two communication rings: an internal one (over the replication ports) and an external one (through the switches), choose the udpu transport (UDP unicast) and list the IP addresses of the nodes in each ring. I also had the idea of connecting the nodes with a null-modem cable, bringing up a PPP link over it and running a third ring through it, but common sense suggested in time that two rings would do.

That's it, Pacemaker can now be started (it will start Corosync first):
 # /etc/init.d/pacemaker start 

The entire configuration of Pacemaker is done via the crm utility; it can be run on any server in the cluster and will automatically propagate configuration changes to all nodes.

Let's see the current status:
# crm status
============
Last updated: Mon Jan 20 15:33:29 2014
Last change: Fri Jan 17 18:30:48 2014 via cibadmin on server1
Stack: openais
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ server1 server2 ]

If everything looks roughly like this, the connection is established and the nodes see each other.
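The health of both rings can also be checked directly with the standard Corosync tooling, for example:

# Show the status of ring 0 and ring 1 on the local node;
# each ring should report no faults.
corosync-cfgtool -s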

Now we need resource agents to manage SCST.
I once found them somewhere on the Internet, modified them to fit my needs and put them on GitHub.

From there we need two files: SCSTLun and SCSTTarget.

These are, in fact, ordinary bash scripts that implement the simple Pacemaker resource agent API.
Put them in /usr/lib/ocf/resource.d/heartbeat so that the cluster manager can see them.
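To give an idea of what such a script looks like, here is a minimal sketch of the OCF contract (not the actual SCSTLun/SCSTTarget code): Pacemaker calls the script with start, stop, monitor and meta-data actions and interprets the standard OCF exit codes. It assumes the ocf-shellfuncs helper shipped with the resource-agents package.

#!/bin/bash
# Minimal OCF resource agent skeleton (illustration only).

: ${OCF_ROOT:=/usr/lib/ocf}
# Provides $OCF_SUCCESS, $OCF_NOT_RUNNING, $OCF_ERR_UNIMPLEMENTED, ocf_log, ...
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

STATEFILE="/var/run/skeleton-ra.state"

case "$1" in
    start)
        # Bring the managed thing up (here: just drop a state file).
        touch "$STATEFILE"
        exit $OCF_SUCCESS
        ;;
    stop)
        # Tear it down; stop must succeed even if it was not running.
        rm -f "$STATEFILE"
        exit $OCF_SUCCESS
        ;;
    monitor)
        # Tell Pacemaker whether the resource is running on this node.
        [ -f "$STATEFILE" ] && exit $OCF_SUCCESS
        exit $OCF_NOT_RUNNING
        ;;
    meta-data)
        # Pacemaker uses this XML to learn the agent's parameters and actions.
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="skeleton" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">Skeleton resource agent</longdesc>
  <shortdesc lang="en">Skeleton</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
END
        exit $OCF_SUCCESS
        ;;
    *)
        exit $OCF_ERR_UNIMPLEMENTED
        ;;
esac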

Next, run crm and enter configuration mode:
# crm
crm(live)# configure
crm(live)configure# edit

A text editor opens (usually nano) and you can begin to describe the resources and their interactions.

I will give an example configuration:
node server1
node server2
primitive DRBD_VM_STORAGE_1 ocf:linbit:drbd \
        params drbd_resource="VM_STORAGE_1" drbdconf="/etc/drbd.conf" \
        op monitor interval="29" role="Master" \
        op monitor interval="31" role="Slave"
primitive DRBD_VM_STORAGE_2 ocf:linbit:drbd \
        params drbd_resource="VM_STORAGE_2" drbdconf="/etc/drbd.conf" \
        op monitor interval="29" role="Master" \
        op monitor interval="31" role="Slave"
primitive IP_iSCSI_1_1 ocf:heartbeat:IPaddr2 \
        params ip="10.1.24.10" cidr_netmask="24" nic="int1.24" \
        op monitor interval="10s"
primitive IP_iSCSI_1_2 ocf:heartbeat:IPaddr2 \
        params ip="10.1.25.10" cidr_netmask="24" nic="int2.25" \
        op monitor interval="10s"
primitive IP_iSCSI_1_3 ocf:heartbeat:IPaddr2 \
        params ip="10.1.26.10" cidr_netmask="24" nic="int3.26" \
        op monitor interval="10s"
primitive IP_iSCSI_1_4 ocf:heartbeat:IPaddr2 \
        params ip="10.1.27.10" cidr_netmask="24" nic="int4.27" \
        op monitor interval="10s"
primitive IP_iSCSI_1_5 ocf:heartbeat:IPaddr2 \
        params ip="10.1.28.10" cidr_netmask="24" nic="int5.28" \
        op monitor interval="10s"
primitive IP_iSCSI_1_6 ocf:heartbeat:IPaddr2 \
        params ip="10.1.29.10" cidr_netmask="24" nic="int6.29" \
        op monitor interval="10s"
primitive IP_iSCSI_2_1 ocf:heartbeat:IPaddr2 \
        params ip="10.1.24.20" cidr_netmask="24" nic="int1.24" \
        op monitor interval="10s"
primitive IP_iSCSI_2_2 ocf:heartbeat:IPaddr2 \
        params ip="10.1.25.20" cidr_netmask="24" nic="int2.25" \
        op monitor interval="10s"
primitive IP_iSCSI_2_3 ocf:heartbeat:IPaddr2 \
        params ip="10.1.26.20" cidr_netmask="24" nic="int3.26" \
        op monitor interval="10s"
primitive IP_iSCSI_2_4 ocf:heartbeat:IPaddr2 \
        params ip="10.1.27.20" cidr_netmask="24" nic="int4.27" \
        op monitor interval="10s"
primitive IP_iSCSI_2_5 ocf:heartbeat:IPaddr2 \
        params ip="10.1.28.20" cidr_netmask="24" nic="int5.28" \
        op monitor interval="10s"
primitive IP_iSCSI_2_6 ocf:heartbeat:IPaddr2 \
        params ip="10.1.29.20" cidr_netmask="24" nic="int6.29" \
        op monitor interval="10s"
primitive ISCSI_LUN_VM_STORAGE_1 ocf:heartbeat:SCSTLun \
        params iqn="iqn.2011-04.ru.domain:VM_STORAGE_1" device_name="VM_STORAGE_1" \
        lun="0" path="/dev/drbd0" handler="vdisk_fileio"
primitive ISCSI_LUN_VM_STORAGE_2 ocf:heartbeat:SCSTLun \
        params iqn="iqn.2011-04.ru.domain:VM_STORAGE_2" device_name="VM_STORAGE_2" \
        lun="0" path="/dev/drbd1" handler="vdisk_fileio"
primitive ISCSI_TGT_VM_STORAGE_1 ocf:heartbeat:SCSTTarget \
        params iqn="iqn.2011-04.ru.domain:VM_STORAGE_1" \
        portals="10.1.24.10 10.1.25.10 10.1.26.10 10.1.27.10 10.1.28.10 10.1.29.10" \
        tgtoptions="InitialR2T=No ImmediateData=Yes MaxRecvDataSegmentLength=1048576 MaxXmitDataSegmentLength=1048576 MaxBurstLength=1048576 FirstBurstLength=524284 MaxOutstandingR2T=32 HeaderDigest=CRC32C DataDigest=CRC32C QueuedCommands=32 io_grouping_type=never" \
        op monitor interval="10s" timeout="60s"
primitive ISCSI_TGT_VM_STORAGE_2 ocf:heartbeat:SCSTTarget \
        params iqn="iqn.2011-04.ru.domain:VM_STORAGE_2" \
        portals="10.1.24.20 10.1.25.20 10.1.26.20 10.1.27.20 10.1.28.20 10.1.29.20" \
        tgtoptions="InitialR2T=No ImmediateData=Yes MaxRecvDataSegmentLength=1048576 MaxXmitDataSegmentLength=1048576 MaxBurstLength=1048576 FirstBurstLength=524284 MaxOutstandingR2T=32 HeaderDigest=CRC32C DataDigest=CRC32C QueuedCommands=32 io_grouping_type=never" \
        op monitor interval="10s" timeout="60s"
group GROUP_ISCSI_1 IP_iSCSI_1_1 IP_iSCSI_1_2 IP_iSCSI_1_3 IP_iSCSI_1_4 \
        IP_iSCSI_1_5 IP_iSCSI_1_6 ISCSI_TGT_VM_STORAGE_1 ISCSI_LUN_VM_STORAGE_1
group GROUP_ISCSI_2 IP_iSCSI_2_1 IP_iSCSI_2_2 IP_iSCSI_2_3 IP_iSCSI_2_4 \
        IP_iSCSI_2_5 IP_iSCSI_2_6 ISCSI_TGT_VM_STORAGE_2 ISCSI_LUN_VM_STORAGE_2
ms MS_DRBD_VM_STORAGE_1 DRBD_VM_STORAGE_1 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
        notify="true" target-role="Master"
ms MS_DRBD_VM_STORAGE_2 DRBD_VM_STORAGE_2 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
        notify="true" target-role="Master"
location PREFER-1 MS_DRBD_VM_STORAGE_1 50: server1
location PREFER-2 MS_DRBD_VM_STORAGE_2 50: server2
colocation COLOC_ALL_1 inf: GROUP_ISCSI_1 MS_DRBD_VM_STORAGE_1:Master
colocation COLOC_ALL_2 inf: GROUP_ISCSI_2 MS_DRBD_VM_STORAGE_2:Master
order ORDER_ALL_1 inf: MS_DRBD_VM_STORAGE_1:promote GROUP_ISCSI_1:start
order ORDER_ALL_2 inf: MS_DRBD_VM_STORAGE_2:promote GROUP_ISCSI_2:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        default-action-timeout="240" \
        last-lrm-refresh="1367942459"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

General Cluster Settings

They are at the bottom of the config. The important ones here are no-quorum-policy="ignore" and expected-quorum-votes="2": we have a cluster of two servers, so there can be no real quorum, and we simply ignore its absence.
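For reference, the same settings can also be applied without entering the editor, via one-line crm commands (a sketch equivalent to the property and rsc_defaults sections above):

# One-liners equivalent to the property/rsc_defaults sections of the config.
crm configure property no-quorum-policy=ignore
crm configure property expected-quorum-votes=2
crm configure property stonith-enabled=false
crm configure rsc_defaults resource-stickiness=100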

Resources

Normally a resource has two states: running or stopped, Started/Stopped.
For example, ocf:heartbeat:IPaddr2 brings IP addresses up on the interfaces and removes them, and also sends out gratuitous ARP to update the ARP tables. For this resource we specify the IP address, netmask and interface.
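Once such a resource is started, the address it manages can be seen directly on the node that owns it, for example for IP_iSCSI_1_1 (interface name taken from the config above):

# The cluster-managed address appears as an additional address on the VLAN interface.
ip -4 addr show dev int1.24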

There are also special resources, for example DRBD (ocf:linbit:drbd), which have Master/Slave states.
When a node becomes active, the cluster manager promotes the resource to Master and vice versa: DRBD switches from Secondary to Primary. For it, we specify the resource name and the path to the DRBD config (it can probably be omitted, I don't remember exactly).
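In essence, promotion and demotion boil down to the same drbdadm calls we issued by hand earlier; a simplified sketch of what it amounts to (the linbit agent itself does much more checking):

# What "promote" / "demote" effectively mean for a DRBD resource:
drbdadm primary VM_STORAGE_1     # promote: this node becomes the active side
drbdadm secondary VM_STORAGE_1   # demote: give up the active role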

Next come our own hand-written resources.
For ocf:heartbeat:SCSTLun we specify the IQN of the target it will be added to, the device name, the LUN number (the target must have a LUN 0, otherwise some initiators lose their minds), the path to the exported device and the handler.

The handler deserves a closer look: it defines how SCST will work with our device.

The one of interest to us is vdisk_fileio.

vdisk_fileio has an important parameter that affects speed, nv_cache=1; it is hardcoded in SCSTLun.
This parameter tells SCST to ignore the initiator's commands to flush the cache to the device. Potentially this can lead to data loss if the storage shuts down abruptly, because the initiator will think the data is on disk while it is still in memory. So use it at your own risk.
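For the curious, this is roughly how a device with nv_cache ends up registered through SCST's sysfs management interface (a hedged sketch based on the SCST sysfs documentation; the actual SCSTLun agent may issue the commands differently):

# Register /dev/drbd0 with the vdisk_fileio handler, with NV cache enabled.
echo "add_device VM_STORAGE_1 filename=/dev/drbd0; nv_cache=1" \
    > /sys/kernel/scst_tgt/handlers/vdisk_fileio/mgmt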

Next comes the ocf:heartbeat:SCSTTarget resource, for which we specify the IQN, portals (the list of IP addresses on which this target will be available) and tgtoptions (iSCSI options; there is plenty of material to read about them).

The group, ms, location, colocation and order directives are responsible for the behavior of the cluster when starting and stopping resources.

After setting up the resources, exit the editor and apply the changes:
crm(live)configure# commit
crm(live)configure# exit

After that you can see the current situation:
# crm status
============
Last updated: Mon Jan 20 17:04:04 2014
Last change: Thu Jul 25 13:59:27 2013 via crm_resource on server1
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
20 Resources configured.
============

Online: [ server1 server2 ]

 Resource Group: GROUP_ISCSI_1
     IP_iSCSI_1_1           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_1_2           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_1_3           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_1_4           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_1_5           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_1_6           (ocf::heartbeat:IPaddr2):       Stopped
     ISCSI_TGT_VM_STORAGE_1 (ocf::heartbeat:SCSTTarget):    Stopped
     ISCSI_LUN_VM_STORAGE_1 (ocf::heartbeat:SCSTLun):       Stopped
 Resource Group: GROUP_ISCSI_2
     IP_iSCSI_2_1           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_2_2           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_2_3           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_2_4           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_2_5           (ocf::heartbeat:IPaddr2):       Stopped
     IP_iSCSI_2_6           (ocf::heartbeat:IPaddr2):       Stopped
     ISCSI_TGT_VM_STORAGE_2 (ocf::heartbeat:SCSTTarget):    Stopped
     ISCSI_LUN_VM_STORAGE_2 (ocf::heartbeat:SCSTLun):       Stopped
 Master/Slave Set: MS_DRBD_VM_STORAGE_1 [DRBD_VM_STORAGE_1]
     Slaves: [ server1 server2 ]
 Master/Slave Set: MS_DRBD_VM_STORAGE_2 [DRBD_VM_STORAGE_2]
     Slaves: [ server1 server2 ]

We can see that the resources are in the inactive state and DRBD is in Slave (Secondary) mode.

Now you can try to activate them:
# crm resource start MS_DRBD_VM_STORAGE_1
# crm resource start MS_DRBD_VM_STORAGE_2

Starting these resources also triggers the start of all the other resources, because they are tied together by the colocation constraints and are started in a strictly defined sequence by the order constraints: first the DRBD devices are promoted to Primary, then the IP addresses come up, the iSCSI targets are created and the LUNs are added to them.

See the result:
# crm status
============
Last updated: Tue Jan 21 11:54:46 2014
Last change: Thu Jul 25 13:59:27 2013 via crm_resource on server1
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
20 Resources configured.
============

Online: [ server1 server2 ]

 Resource Group: GROUP_ISCSI_1
     IP_iSCSI_1_1           (ocf::heartbeat:IPaddr2):       Started server1
     IP_iSCSI_1_2           (ocf::heartbeat:IPaddr2):       Started server1
     IP_iSCSI_1_3           (ocf::heartbeat:IPaddr2):       Started server1
     IP_iSCSI_1_4           (ocf::heartbeat:IPaddr2):       Started server1
     IP_iSCSI_1_5           (ocf::heartbeat:IPaddr2):       Started server1
     IP_iSCSI_1_6           (ocf::heartbeat:IPaddr2):       Started server1
     ISCSI_TGT_VM_STORAGE_1 (ocf::heartbeat:SCSTTarget):    Started server1
     ISCSI_LUN_VM_STORAGE_1 (ocf::heartbeat:SCSTLun):       Started server1
 Resource Group: GROUP_ISCSI_2
     IP_iSCSI_2_1           (ocf::heartbeat:IPaddr2):       Started server2
     IP_iSCSI_2_2           (ocf::heartbeat:IPaddr2):       Started server2
     IP_iSCSI_2_3           (ocf::heartbeat:IPaddr2):       Started server2
     IP_iSCSI_2_4           (ocf::heartbeat:IPaddr2):       Started server2
     IP_iSCSI_2_5           (ocf::heartbeat:IPaddr2):       Started server2
     IP_iSCSI_2_6           (ocf::heartbeat:IPaddr2):       Started server2
     ISCSI_TGT_VM_STORAGE_2 (ocf::heartbeat:SCSTTarget):    Started server2
     ISCSI_LUN_VM_STORAGE_2 (ocf::heartbeat:SCSTLun):       Started server2
 Master/Slave Set: MS_DRBD_VM_STORAGE_1 [DRBD_VM_STORAGE_1]
     Masters: [ server1 ]
     Slaves: [ server2 ]
 Master/Slave Set: MS_DRBD_VM_STORAGE_2 [DRBD_VM_STORAGE_2]
     Masters: [ server2 ]
     Slaves: [ server1 ]

If everything looks like this, you can congratulate yourself: the cluster is up and running!
Each resource group runs on its own server, as dictated by the location constraints in the config.
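A simple way to test failover is to put one node into standby and watch its group move to the neighbour (remember to bring the node back online afterwards):

# Move everything off server1 and watch "crm status" on server2 ...
crm node standby server1
# ... then return server1 to the cluster.
crm node online server1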

For confirmation you can look at the kernel log (dmesg), where DRBD and SCST print their diagnostics.
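A quick way to pull only the relevant lines out of the kernel log:

# DRBD logs role/connection changes, SCST logs target and device events.
dmesg | grep -iE 'drbd|scst'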

The end of the second part


In the third and final part I will show how to configure the ESXi servers for optimal performance with this cluster.

Source: https://habr.com/ru/post/209666/

