Organizing distributed disk storage with unlimited expansion using LVM and ATAoE

Task

When disks were small and the Internet was large, owners of private FTP servers faced the following problem:
On each hard disk a folder such as Video or Soft was created, so adding a new hard disk meant creating Video2, Soft2, etc. on it.
Replacing a hard disk with a larger one meant moving the data somewhere else first, which was nontrivial and caused long downtime.
The system we developed in 2005 let us assemble a reliable and fast 3-terabyte array that is scalable and can be expanded online by adding disks or entire servers with disks.
The price of the entire solution was 110% of the cost of the disks themselves, i.e. essentially free, with a slight overhead.

Here is an example diagram of our storage device:

Implementation

The idea is this: there is a supervisor and there are nodes. The supervisor is the public server that clients log in to; it has several bonded gigabit interfaces facing outward and a few facing inward, to our nodes. The supervisor takes the arrays or individual disks that the nodes export with vblade over ATAoE, builds an LVM volume on top of them, and makes that volume available via FTP. The supervisor is also the diskless boot server for the nodes: it holds the entire node file system, which the nodes mount over NFS once the kernel has been downloaded. The nodes are nothing but disks: they boot via PXE, then our etherpopulate script starts and all disks are exported.

1. Setup for remote node booting

Node layout

ftp # ls /diskless/
bzImage node pxelinux.0 pxelinux.cfg

The kernel, the node directory (the nodes' rootfs) and the configs for PXE.

ftp # ls /diskless/node/
bin boot dev etc lib mnt proc root sbin sys tmp usr var
ftp # chroot /diskless/node/
ftp / # which vblade
/usr/sbin/vblade
ftp / # vblade
usage: vblade [ -m mac[,mac...] ] shelf slot netif filename
ftp / #

All of this will be available on the nodes.

ftp node # cat /diskless/pxelinux.cfg/default
DEFAULT /bzImage
APPEND ip=dhcp root=/dev/nfs nfsroot=172.18.0.193:/diskless/node idebus=66

The config for PXE, with nfsroot specified.

The dhcpd config; do not forget to start the daemon.

ftp etc # more dhcp/dhcpd.conf
option domain-name "domain.com";
default-lease-time 600;
max-lease-time 7200;
ddns-update-style none;

option space PXE;
option PXE.mtftp-ip code 1 = ip-address;
option PXE.mtftp-cport code 2 = unsigned integer 16;
option PXE.mtftp-sport code 3 = unsigned integer 16;
option PXE.mtftp-tmout code 4 = unsigned integer 8;
option PXE.mtftp-delay code 5 = unsigned integer 8;
option PXE.discovery-control code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr code 7 = ip-address;

subnet 172.16.0.0 netmask 255.255.0.0 {
}

subnet 172.18.0.192 netmask 255.255.255.192 {
    class "pxeclients" {
        match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
        option vendor-class-identifier "PXEClient";
        vendor-option-space PXE;
        option PXE.mtftp-ip 0.0.0.0;
        filename "pxelinux.0";
        next-server 172.18.0.193;
    }

    host node-1 {
        hardware ethernet 00:13:d4:68:b2:7b;
        fixed-address 172.18.0.194;
    }
    host node-2 {
        hardware ethernet 00:11:2f:45:e9:fd;
        fixed-address 172.18.0.195;
    }
    host node-3 {
        hardware ethernet 00:07:E9:2A:A9:AC;
        fixed-address 172.18.0.196;
    }
}
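
The fixed addresses and next-server in this config should all fall inside the 172.18.0.192 / 255.255.255.192 subnet, the same one that is later allowed in the NFS exports. A minimal sketch of that check in shell; ip_to_int and in_subnet are hypothetical helper names, not part of dhcpd or any standard tool.

```shell
#!/bin/sh
# Hypothetical helpers for checking that an address belongs to a subnet.

# Convert a dotted quad (a.b.c.d) to a single integer.
# The body runs in a subshell so that the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# in_subnet IP NETWORK NETMASK: succeeds if IP lies inside NETWORK/NETMASK.
in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

for addr in 172.18.0.193 172.18.0.194 172.18.0.195 172.18.0.196; do
    if in_subnet "$addr" 172.18.0.192 255.255.255.192; then
        echo "$addr: inside the node subnet"
    else
        echo "$addr: OUTSIDE the node subnet"
    fi
done
```

With the addresses from the config, all four lines should report "inside"; an address such as 172.18.0.10 would not.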

The tftpd config.

ftp etc # more /etc/conf.d/in.tftpd
# /etc/init.d/in.tftpd

# Path to server files from
INTFTPD_PATH="/diskless"

INTFTPD_USER="nobody"
# For more options, see tftpd(8)
INTFTPD_OPTS="-u ${INTFTPD_USER} -l -vvvvvv -p -c -s ${INTFTPD_PATH}"

Is tftpd running?

ftp etc # ps -ax |grep tft
Warning: bad ps syntax, perhaps a bogus '-'? See procps.sf.net/faq.html
5694 ? Ss 0:00 /usr/sbin/in.tftpd -l -u nobody -l -vvvvvv -p -c -s /diskless
31418 pts/0 R+ 0:00 grep tft

The NFS config; do not forget to start the NFS server.

ftp etc # more exports
/diskless/node 172.18.0.192/255.255.255.192(rw,sync,no_root_squash)

Setup for remote boot is complete, all nodes are registered.
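
Before powering on a node, it can help to confirm that the TFTP root really holds the items shown in the /diskless listing. A minimal sketch, assuming the layout above; check_diskless_tree is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: verify that a TFTP root contains the kernel, the PXE
# loader, the PXE config and the node rootfs directory.
check_diskless_tree() {
    root=$1
    missing=0
    for f in bzImage pxelinux.0 pxelinux.cfg/default; do
        if [ ! -f "$root/$f" ]; then
            echo "missing file: $root/$f"
            missing=1
        fi
    done
    if [ ! -d "$root/node" ]; then
        echo "missing directory: $root/node"
        missing=1
    fi
    # Exit status 0 means everything was found, 1 means something is missing.
    return "$missing"
}
```

On the supervisor it would be called as check_diskless_tree /diskless; it prints one line per missing item and returns a non-zero status if anything is absent.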

2. Software for automating array assembly

The software that runs on the nodes assembles md RAID arrays and exports them over ATAoE to the supervisor.

ftp# chroot /diskless/node
ftp etc # more /usr/sbin/etherpopulate
#!/usr/bin/perl

my $action = shift();

#system('insmod /lib/modules/vb-2.6.16-rc1.ko')
#    if ( -f '/lib/modules/vb-2.6.16-rc1.ko');

# Get information on node_id's of ifaces:
# the node ID is the last octet of each non-loopback IPv4 address
my @ifconfig = `ifconfig`;
my $int;
my %iface;
foreach my $line (@ifconfig) {
    if ($line =~ /^(\S+)/) {
        $int = $1;
    }
    if ($line =~ /inet addr:(\d+\.\d+\.\d+\.)(\d+)/ && $1 ne '127.0.0.' && $int) {
        $iface{$int} = $2;
        $int = "";
    }
}

my $vblade_kernel = ( -d "/sys/vblade" )?1:0;
if ( $vblade_kernel ) {
    print " Using kernelspace vblade\n" if ($action eq "start");
} else {
    print " Using userspace vblade\n" if ($action eq "start");
}

# Run vblade
foreach my $int (keys %iface) {
    my $node_id = $iface{$int};
    open(DATA, "/etc/etherpopulate.conf");
    while (<DATA>) {
        # Strip comments and surrounding whitespace, skip empty lines
        chomp;
        s/#.*//;
        s/^\s+//;
        s/\s+$//;
        next unless length;

        if ($_ =~ /^node-$node_id\s+(\S+)\s+(\S+)\s+(\S.*)/) {
            my $cfg_action = $1;
            my $command = $2;
            my $parameters = $3;

            # Export disk over ATAoE
            if ($action eq $cfg_action && $command eq "ataoe" && $parameters =~ /(\S+)\s+(\d+)/) {
                my $disk_name = $1;
                my $disk_id = $2;
                if ($vblade_kernel) {
                    if ( $disk_name =~ /([a-z0-9]+)$/ ) {
                        my $disk_safe_name = $1;
                        system("echo 'add $disk_safe_name $disk_name' > /sys/vblade/drives");
                        system("echo 'add $int $node_id $disk_id' > /sys/vblade/$disk_safe_name/ports");
                    }
                } else {
                    # Braces keep Perl from reading "$node_id_" as a variable name
                    system("/sbin/start-stop-daemon --background --start --name 'vblade_${node_id}_$disk_id' --exec /usr/sbin/vblade $node_id $disk_id eth0 $disk_name");
                }
                print " Exporting disk: $disk_name [ $node_id $disk_id ] via $int\n";
            }

            # Execute specified command
            if ($action eq $cfg_action && $command eq "exec") {
                system($parameters);
            }
        }
    }
    close(DATA);
}

The etherpopulate config for three nodes. Two additional drives, one from each of the first two nodes, are exported for other purposes (backup on RAID 1).

ftp sbin # more /diskless/node/etc/etherpopulate.conf
# ----------------------
# Node 194 160gb
node-194 start exec /sbin/mdadm -A /dev/md0 -f /dev/hd[a-h] /dev/hdl
node-194 start ataoe /dev/md0 0 # Vblade FTP array
node-194 start ataoe /dev/hdk 1 # Vblade BACKUP disk
node-194 stop exec /usr/bin/killall vblade
node-194 stop exec /sbin/mdadm -S /dev/md0

# ----------------------
# Node 195 200 gb
node-195 start exec /sbin/mdadm -A /dev/md0 /dev/hd[ab] /dev/hd[ef] /dev/hd[gh] /dev/sd[a-c]
node-195 start ataoe /dev/md0 0 # Vblade FTP array
node-195 start ataoe /dev/sdd 1 # Vblade BACKUP disk
node-195 stop exec /usr/bin/killall vblade
node-195 stop exec /sbin/mdadm -S /dev/md0

# ----------------------
# Node 196 200 gb
node-196 start exec /sbin/mdadm -A /dev/md0 /dev/hd[a-f]
node-196 start ataoe /dev/md0 0 # Vblade FTP array
node-196 stop exec /usr/bin/killall vblade
node-196 stop exec /sbin/mdadm -S /dev/md0
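
To see what a given node would do with this config without touching mdadm or vblade, the matching logic of etherpopulate can be approximated as a dry run. This is only a sketch under the assumptions visible in the script: the node ID is the last octet of the node's IP address, and it is passed to vblade as the shelf number while the disk ID becomes the slot. The function names are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print what etherpopulate would do for one node.
# Nothing is executed; function names are hypothetical.

# The node ID is the last octet of the node's IP (172.18.0.194 -> 194).
node_id_from_ip() {
    echo "${1##*.}"
}

# list_node_actions CONF NODE_ID ACTION
# Prints the "exec" and "ataoe" entries that match the node and the action.
list_node_actions() {
    conf=$1
    node_id=$2
    action=$3
    # Strip comments and surrounding whitespace, as the Perl script does.
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$conf" |
    while read -r name cfg_action command first rest; do
        [ "$name" = "node-$node_id" ] || continue
        [ "$cfg_action" = "$action" ] || continue
        case $command in
            exec)  echo "run: $first${rest:+ $rest}" ;;
            ataoe) echo "export: $first as shelf $node_id slot ${rest%% *}" ;;
        esac
    done
}
```

For example, list_node_actions /diskless/node/etc/etherpopulate.conf "$(node_id_from_ip 172.18.0.194)" start would list the mdadm assembly command followed by the two exports of node 194.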

3. Finishing touches

Make the hard drives on the nodes run at maximum speed, at the expense of reliability:
hd*_args="-d1 -X69 -udma5 -c1 -W1 -A1 -m16 -a16 -u1"

Make sure the supervisor's kernel is built with ATAoE support. On the nodes themselves, the export over ATAoE takes place in userland, using vblade.

ftp good # grep -i OVER_ETH .config
CONFIG_ATA_OVER_ETH=y
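
The same check can be wrapped so that it also distinguishes a built-in driver from a module; in the module case the aoe module has to be loaded (modprobe aoe) before /dev/etherd appears. A minimal sketch; aoe_kernel_support is a hypothetical helper name and the path to .config is whatever your kernel tree uses.

```shell
#!/bin/sh
# Hypothetical helper: report how ATA over Ethernet is configured in a
# kernel .config file. Prints "builtin", "module" or "missing".
aoe_kernel_support() {
    case $(grep '^CONFIG_ATA_OVER_ETH=' "$1" 2>/dev/null || true) in
        *=y) echo builtin ;;
        *=m) echo module ;;
        *)   echo missing ;;
    esac
}
```

With the .config shown above, aoe_kernel_support .config would print builtin.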

The state of a node right after boot, once etherpopulate has run according to the config:

node-195 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [raid4] [raid6] [multipath] [faulty]
md0 : active raid5 hda[0] sdc[8] sdb[7] sda[6] hdh[5] hdg[4] hdf[3] hde[2] hdb[1]
1562887168 blocks level 5, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

unused devices: <none>

node-195 ~ # ps -ax | grep vblade | grep md
Warning: bad ps syntax, perhaps a bogus '-'? See procps.sf.net/faq.html
2182 ? Ss 2090:41 /usr/sbin/vblade 195 0 eth0 /dev/md0

node-195 ~ # mount
rootfs on / type rootfs (rw)
/dev/root on / type nfs (ro,v2,rsize=4096,wsize=4096,hard,nolock,proto=udp,addr=172.18.0.193)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
udev on /dev type tmpfs (rw,nosuid)
devpts on /dev/pts type devpts (rw)
none on /var/lib/init.d type tmpfs (rw)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec)

We assemble the LVM volume from the disks on the supervisor. This only has to be done once; afterwards, vgscan alone will find a volume that is ready for mounting.

ftp / # ls -la /dev/etherd/*
c-w--w---- 1 root disk 152, 3 Jun 7 2008 /dev/etherd/discover
brw-rw---- 1 root disk 152, 49920 Jun 7 2008 /dev/etherd/e194.0
brw-rw---- 1 root disk 152, 49936 Jun 7 2008 /dev/etherd/e194.1
brw-rw---- 1 root disk 152, 49920 Jun 7 2008 /dev/etherd/e195.0
brw-rw---- 1 root disk 152, 49936 Jun 7 2008 /dev/etherd/e195.1
brw-rw---- 1 root disk 152, 49920 Jun 7 2008 /dev/etherd/e196.0
cr--r----- 1 root disk 152, 2 Jun 7 2008 /dev/etherd/err
c-w--w---- 1 root disk 152, 4 Jun 7 2008 /dev/etherd/interfaces

The first two nodes each export one array and one disk; the third node exports only the array.
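
The device names follow directly from the etherpopulate config: the node ID is used as the ATAoE shelf and the disk ID as the slot, so "node-194 start ataoe /dev/md0 0" shows up on the supervisor as /dev/etherd/e194.0. A small sketch of that mapping in both directions; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the /dev/etherd/e<shelf>.<slot> naming scheme.

# aoe_device SHELF SLOT -> device path on the supervisor
aoe_device() {
    echo "/dev/etherd/e$1.$2"
}

# aoe_shelf_slot PATH -> "SHELF SLOT"
aoe_shelf_slot() {
    name=${1##*/e}                      # e.g. 194.0
    echo "${name%%.*} ${name#*.}"
}

aoe_device 194 0                        # prints /dev/etherd/e194.0
aoe_shelf_slot /dev/etherd/e195.1       # prints 195 1
```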
Before LVM on the supervisor can use these devices, each of them has to be initialized as a physical volume, so that LVM writes its internal identifiers to the disk.

# pvcreate /dev/etherd/e194.0
...
...

The disks are ready to use. Create a volume group:
# vgcreate cluster /dev/etherd/e194.0 /dev/etherd/e195.0 /dev/etherd/e196.0

Although the group becomes active immediately, it can also be activated explicitly
# vgchange -ay cluster

and deactivated
# vgchange -an cluster

To add a device to the volume group, use
# vgextend cluster /dev/*...

Create a logical volume named hyperspace that takes all available space. Each PE (physical extent) defaults to 4 MB, so:

# vgdisplay cluster | grep "Total PE"
Total PE 1023445
# lvcreate -l 1023445 cluster -n hyperspace
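
The extent count passed to lvcreate translates into capacity by simple multiplication. A short worked example, assuming the default 4 MiB extent size mentioned above; vg_size_mib is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: volume group capacity in MiB from the extent count
# and the extent size in MiB.
vg_size_mib() {
    echo $(( $1 * $2 ))
}

total_mib=$(vg_size_mib 1023445 4)      # "Total PE" from vgdisplay, 4 MiB per PE
echo "$total_mib MiB, about $(( total_mib / 1024 )) GiB"
# prints: 4093780 MiB, about 3997 GiB
```

Current LVM2 releases also accept a percentage, as in lvcreate -l 100%FREE -n hyperspace cluster, which avoids copying the number by hand; whether the tools we used at the time supported it is not something this setup relied on.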

You can inspect the result with vgdisplay, lvdisplay and pvdisplay.
You can extend everything with vgextend, lvextend and resize_reiserfs.
Read more here: http://tldp.org/HOWTO/LVM-HOWTO/
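
The order of the extension steps matters: the new device first becomes a physical volume, then joins the volume group, then the logical volume grows, and only then the file system. A dry-run sketch that merely prints the commands; plan_grow is a hypothetical helper, /dev/etherd/e197.0 is an invented device name for a fourth node, and the +100%FREE form assumes a reasonably recent LVM2.

```shell
#!/bin/sh
# Hypothetical dry run: print the commands needed to grow the storage by
# one newly exported ATAoE device. Nothing is executed.
plan_grow() {
    dev=$1
    vg=${2:-cluster}
    lv=${3:-hyperspace}
    echo "pvcreate $dev"
    echo "vgextend $vg $dev"
    echo "lvextend -l +100%FREE /dev/$vg/$lv"
    # Without -s, resize_reiserfs grows the file system to fill the device.
    echo "resize_reiserfs /dev/$vg/$lv"
}

plan_grow /dev/etherd/e197.0
```

Review the printed commands and run them by hand; on a production volume it is worth having a backup before the resize step.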

We end up with /dev/cluster/hyperspace; run mkreiserfs on it and mount it. Everything is ready. Setting up the FTP server is omitted here. Ta-dah!

Reuse

On the supervisor itself, if it is restarted, it is sufficient to perform

more runme.sh
#!/bin/sh
vgscan
vgchange -ay
mount /dev/cluster/hyperspace /mnt/ftp

to use a pre-created array.
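
Because the nodes may still be booting when the supervisor comes up, and because proftpd must not start before the volume is mounted, the script can first wait for the ATAoE devices to appear. A minimal sketch of such a wait; wait_for_devices is a hypothetical helper and the timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: wait until every listed path exists, giving up after
# TIMEOUT seconds in total. Returns 0 on success, 1 on timeout.
wait_for_devices() {
    timeout=$1
    shift
    waited=0
    for dev in "$@"; do
        while [ ! -e "$dev" ]; do
            [ "$waited" -ge "$timeout" ] && return 1
            sleep 1
            waited=$(( waited + 1 ))
        done
    done
    return 0
}
```

In runme.sh it could precede vgscan, for example wait_for_devices 120 /dev/etherd/e194.0 /dev/etherd/e195.0 /dev/etherd/e196.0 || exit 1, with the FTP server started only after the mount command has succeeded.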

Disadvantages

in our particular case, the mistake was the choice of hard drives. For some reason we picked Maxtor, and almost the entire batch of 30 disks failed within a year;
hot swapping was not used, since these were still IDE drives. With hot-plug SATA, drive-failure notification would have to be set up at the mdadm level on the nodes themselves;
proftpd must be started only after the LVM volume built from the ATAoE devices is mounted on the supervisor. If proftpd was started earlier, it had no idea what had happened;
we experimented with kernel-space and userspace vblade on the nodes for a long time, but ATAoE was in its infancy then and things worked only with some luck. But they did work;
either reiserfs or xfs can be used as the file system; at that time only they supported online resizing when the device under them grew;
patches that let md expand a RAID-5 array online had only just started to appear;
ATAoE had a limit of 64 slots per "shelf", and there could be 10 shelves, so in principle there was a ceiling of something like 640 devices :)
there are many performance nuances, but all of them can be solved to some degree. In short, do not be discouraged if the speed is poor at first; there is always room for improvement;
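
For the failure notification mentioned in the hot-swap item, mdadm's own monitor mode (mdadm --monitor) is the usual tool. The check underneath it is simple enough to sketch: in /proc/mdstat, a missing member shows up as an underscore in the status map, e.g. [UUUU_UUUU] instead of [UUUUUUUUU]. The helper name md_degraded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the names of md arrays whose status map in an
# mdstat-format file contains "_" (a missing or failed member).
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]+\]/ && name != "" {
            if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    ' "$1"
}
```

Run on a node as md_degraded /proc/mdstat; empty output means every array has all of its members.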

Conclusions

The solution is certainly interesting, and I would like to rebuild it with terabyte drives, hot-plug SATA and fresh versions of the software. But who will take on such a feat is unknown. Maybe you, %username%?