Good day, dear community!
In this article I would like to share the experience of building a disk storage system: an experience full of experiments, trial and error, and discoveries, seasoned with bitter disappointments, which finally ended in a rather interesting, relatively cheap and fast storage.
If you have a similar task, or the title simply caught your interest, then welcome under the cut.
Prologue
So, our department was recently tasked with providing a cluster of VMware ESXi 5.1 hypervisors with a large storage system. We planned to host the encrypted maildir for dovecot and the "cloud" file storage on it. A prerequisite for the budget being allocated was that storage for company-critical information be provided, and that this partition be encrypted.
Hardware
Unfortunately, and perhaps fortunately, we were not given a large budget for such ambitious tasks. So, being true maximalists, we could not afford any brand-name storage, and within the allocated funds we chose the following hardware:
- Server chassis Supermicro CSE-836BE16-R920B
There was a lot of deliberation: the number of rack units, the size and speed of the hard drives, whether to buy a bare case or a complete platform. We reviewed many options, dug through the Internet, and eventually settled on this one as the best fit for our tasks.
- Motherboard Supermicro MBD-X9DRI-FO
The main requirement was four PCI-E x8 ports.
- Intel Xeon E5-2603 processors
The choice was simple: whatever the money allowed. We also had to install both processors at once, rather than starting with one and adding the second later if needed, because with only one socket populated just three PCI-E slots work, and we needed four.
- Seagate Constellation ES.3 ST3000NM0033 disks
SATA, because it is cheaper: for the same money we got several times more capacity than we would have with SAS.
- Adaptec ASR-7805Q RAID controller
Since this is a storage system, we did not skimp on the controller. This series has SSD caching, which would be very useful to us, and a BBU comes included, which is also a very handy option.
- Intel SSD SSDSC2CW240A310
Needed solely to make MaxCache (aka the SSD cache) work.
- Intel X520-DA2 network cards
To avoid a bottleneck on the network interfaces, we had to provide a 10Gb link between the ESXi nodes and the storage. After studying what the market had to offer, we arrived at a not very elegant, but price- and speed-wise suitable option using 10-gigabit network cards.
All this cost us about 200 thousand rubles.
Implementation
We decided to serve the target, that is, to hand storage resources out to consumers, over iSCSI and NFS. The most sensible and fastest option would, of course, have been FCoE, so as not to get into TCP with its corresponding overhead, and our network cards could in principle do it. Unfortunately, we had no SFP+ switch with FCoE support, and buying one was out of the question, since it would have cost us another 500 thousand rubles on top.
Having dug through the Internet once more, we found a possible way out in vn2vn technology, but ESXi will only learn to work with vn2vn in version 6.x, so, without further deliberation, we moved on to choosing what to build all this on.
Our corporate standard for Linux servers is CentOS, but in its current kernel (2.6.32-358) encryption is very slow, so we had to use Fedora as the OS. Yes, it is essentially Red Hat's testing ground, but recent Linux kernels encrypt data practically on the fly, and that is exactly what we needed.
In addition, the current version 19 will serve as the basis for RHEL 7, which will let us safely migrate to CentOS 7 later on.
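A quick way to check for yourself how a given machine and kernel handle encryption (this is my suggestion rather than part of our original procedure) is the built-in benchmark of cryptsetup 1.6 or newer, plus a check that the CPU exposes AES-NI:
[root@nas ~]$ grep -o aes /proc/cpuinfo | head -1
[root@nas ~]$ cryptsetup benchmark
If the aes-xts lines of the benchmark report throughput in the gigabytes per second, dm-crypt is unlikely to be the bottleneck of the storage.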
Targets
In order not to bloat the article and not to stray from the topic, I will omit all the uninteresting parts such as assembling the hardware, wrestling with the controller, installing the OS, and so on. I will also try to describe the target itself as briefly as possible and limit myself to how it interacts with the ESXi initiator.
From the target we wanted the following:
- correctly working caching - the disks themselves are rather slow and can squeeze out only about 2000 IOPS;
- maximum speed of the disk subsystem as a whole (as many IOPS as possible); a sketch of the kind of measurement behind the numbers below follows this list.
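The figures quoted for each candidate came from synthetic load against the exported LUNs. A rough sketch of an equivalent test is below; fio is assumed to be installed, /dev/sdX is a scratch LUN whose contents may be destroyed, and the exact parameters of our own runs have not been preserved:
[root@test ~]$ fio --name=randwrite --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --time_based --runtime=60 --group_reporting
[root@test ~]$ fio --name=seqread --filename=/dev/sdX --direct=1 --ioengine=libaio --rw=read --bs=1M --iodepth=32 --time_based --runtime=60 --group_reporting
The first run gives random-write IOPS at queue depth 32, the second gives sequential read throughput.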
So, meet the contenders.
LIO
linux-iscsi.org
With Linux kernel 3.10.10 it showed me 300 MB/s on writes and 600 MB/s on reads in blockio mode. It showed the same numbers with fileio and even with a RAM disk. The graphs showed the write speed jumping around a lot, most likely because the ESXi initiator requires write synchronization. For the same reason, the number of write IOPS was the same with fileio and blockio.
On the mailing lists it was recommended to disable emulate_fua_write, but that changed nothing. Moreover, with kernel 3.9.5 it showed better results, which also made us wonder about its future.
Judging by the description, LIO can do a great deal, but most of the features are only available in the commercial version. The website, which in my view should first and foremost be a source of information, is plastered with advertising, which leaves a bad impression. In the end we decided to pass on it.
istgt
www.peach.ne.jp/archives/istgt
The target used in FreeBSD.
The target works quite well, apart from a few "buts".
Firstly, it cannot do blockio; secondly, it cannot use different MaxRec and MaxXtran values, at least I did not manage to make it. With small MaxRec values the sequential write speed did not exceed 250 MB/s, while reads stayed at a quite decent 700 MB/s. On random 4k writes with a queue depth of 32 I got roughly 40K IOPS. With a larger MaxRec the write speed rises to 700 MB/s, reads drop to 600 MB/s, and IOPS fall to 30K for reads and 20K for writes.
In other words, one could probably find a middle ground by juggling the settings, but that did not feel like a serious approach.
STGT
stgt.sourceforge.net
With this target there were problems interfacing with the hypervisor. ESXi constantly confused the LUNs: it mistook one for another or stopped seeing them altogether. There was a suspicion that the problem was incorrect binding of serial numbers, but specifying them explicitly in the configs did not help.
The speed was not impressive either: we could not get more than 500 MB/s out of it on either reads or writes. IOPS came to about 20K for reads and roughly 15K for writes.
The bottom line: configuration problems and low speeds. Rejected.
IET
iscsitarget.sourceforge.net
Worked almost flawlessly: 700 MB/s on both reads and writes, about 30K IOPS on reads, but only 2000 on writes.
The ESXi initiator forced the target to write data to disk immediately, bypassing the system cache. A few reviews on the mailing lists were also somewhat scary: many reported unstable behavior under load.
SCST
scst.sourceforge.net
And finally we get to the winner of our race.
After rebuilding the kernel and a minimal configuration of the target itself, we got 750 MB/s on reads and 950 MB/s on writes. IOPS in fileio mode: 44K for reads and 37K for writes. Right away, with almost no fiddling around.
This target seemed to me the perfect choice.
iSCSI for VMware ESXi 5.1 on SCST and Fedora
And now, the part we all actually gathered here for.
A short guide to configuring the target and the ESXi initiator. I did not immediately decide to try writing an article for Habr, so the instructions will not be step-by-step (I am restoring them from memory), but they contain the key settings that let us achieve the desired results.
ESXi 5.1 Preparation
The following settings have been made in the hypervisor:
- In the iSCSI initiator settings, delayed ACK was disabled for all targets, as described in kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002598
- the initiator parameters were changed to match the parameters of the target (a possible target-side counterpart is sketched right after the list):
InitialR2T = No
ImmediateData = Yes
MaxConnections = 1
MaxRecvDataSegmentLength = 1048576
MaxBurstLength = 1048576
FirstBurstLength = 65536
DefaultTime2Wait = 0
DefaultTime2Retain = 0
MaxOutstandingR2T = 32
DataPDUInOrder = No
DataSequenceInOrder = No
ErrorRecoveryLevel = 0
HeaderDigest = None
DataDigest = None
OFMarker = No
IFMarker = No
OFMarkInt = Reject
IFMarkInt = Reject
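For completeness: the same session parameters can also be pinned on the SCST side, per target. Below is a sketch of the corresponding fragment of scst.conf; the attribute names are the usual iscsi-scst ones, but I am reconstructing this from memory rather than quoting our config, so check them against your SCST version:
TARGET_DRIVER iscsi {
    TARGET iqn.2013-09.local.nas:raid10-ssdcache {
        InitialR2T No
        ImmediateData Yes
        MaxRecvDataSegmentLength 1048576
        MaxBurstLength 1048576
        FirstBurstLength 65536
        MaxOutstandingR2T 32
        HeaderDigest None
        DataDigest None
    }
}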
You will need to disable Interrupt Moderation and LRO for network adapters. You can do this with the commands:
ethtool -C vmnicX rx-usecs 0 rx-frames 1 rx-usecs-irq 0 rx-frames-irq 0
esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"
The reasons why it is worth doing:
www.odbms.org/download/vmw-vfabric-gemFire-best-practices-guide.pdf
www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf
To avoid setting these values by hand again, you can add them to this script:
/etc/rc.local.d/local.sh
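A possible version of that script is below: just the commands from above wrapped to run at boot. vmnic2 and vmnic3 are placeholders for your actual 10GbE uplinks, and since local.sh already exists on ESXi, these lines go before its final exit 0:
# re-apply NIC tuning on every boot (vmnic2/vmnic3 are placeholders)
ethtool -C vmnic2 rx-usecs 0 rx-frames 1 rx-usecs-irq 0 rx-frames-irq 0
ethtool -C vmnic3 rx-usecs 0 rx-frames 1 rx-usecs-irq 0 rx-frames-irq 0
esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled
The ixgbe module parameter set via esxcli is stored persistently, so it does not need to be repeated here.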
Fedora preparation
Download and install the latest version of Fedora with a minimal package set.
Update the system and reboot:
[root@nas ~]$ yum -y update && reboot
The system will only be reachable from the local network, so I turned off the firewall and SELinux:
[root@nas ~]$ systemctl stop firewalld.service
[root@nas ~]$ systemctl disable firewalld.service
[root@nas ~]$ cat /etc/sysconfig/selinux
SELINUX=disabled
SELINUXTYPE=targeted
Configure the network interfaces and disable NetworkManager.service: it does not get along with bridge interfaces, which we needed for NFS.
[root@nas ~]$ systemctl disable NetworkManager.service
[root@nas ~]$ chkconfig network on
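The bridge itself is configured with ordinary initscripts files. A sketch of what this can look like (br0, p1p1 and the addresses are placeholders, substitute your own interfaces and subnet):
[root@nas ~]$ cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.2
NETMASK=255.255.255.0
ONBOOT=yes
[root@nas ~]$ cat /etc/sysconfig/network-scripts/ifcfg-p1p1
DEVICE=p1p1
BRIDGE=br0
ONBOOT=yes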
LRO is disabled on the network cards:
[root@nas ~]$ cat /etc/rc.d/rc.local
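The contents of that file did not survive to the article, but it boiled down to ethtool calls of roughly this kind (p1p1 and p1p2 are, again, placeholders for the 10GbE ports):
#!/bin/sh
ethtool -K p1p1 lro off
ethtool -K p1p2 lro off
Do not forget to make the file executable (chmod +x /etc/rc.d/rc.local), otherwise systemd's rc-local.service will not run it.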
Following the recommendations from Intel, the following system parameters have been changed:
[root@nas ~]$ cat /etc/sysctl.d/ixgbe.conf
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000
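These settings are picked up at boot; to load them immediately without rebooting, something like this should do (sysctl --system reads /etc/sysctl.d/, and a plain sysctl -p /etc/sysctl.d/ixgbe.conf works on older procps versions):
[root@nas ~]$ sysctl --system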
Target preparation
To use SCST it is recommended to patch the kernel. This is optional, but performance is higher with the patches.
At the time of writing, the latest kernel in the repository was 3.10.10-200. By the time you read this the kernel may well have been updated, but I do not think that will change the process much.
Creating an rpm package with a modified kernel is described in detail here:
fedoraproject.org/wiki/Building_a_custom_kernel/en
But to avoid any difficulties, I will describe the preparation in detail.
Create a user:
[root@nas ~]$ useradd mockbuild
Switch to its environment:
[root@nas ~]$ su mockbuild
[mockbuild@nas root]$ cd
Install the build packages and prepare the kernel sources:
[mockbuild@nas ~]$ su -c 'yum install yum-utils rpmdevtools'
[mockbuild@nas ~]$ rpmdev-setuptree
[mockbuild@nas ~]$ yumdownloader --source kernel
[mockbuild@nas ~]$ su -c 'yum-builddep kernel-3.10.10-200.fc19.src.rpm'
[mockbuild@nas ~]$ rpm -Uvh kernel-3.10.10-200.fc19.src.rpm
[mockbuild@nas ~]$ cd ~/rpmbuild/SPECS
[mockbuild@nas ~]$ rpmbuild -bp --target=`uname -m` kernel.spec
Now we need the patches. Check out SCST from the svn repository:
[mockbuild@nas ~]$ svn co https://scst.svn.sourceforge.net/svnroot/scst/trunk scst-svn
Copy the necessary patches into ~/rpmbuild/SOURCES/:
[mockbuild@nas ~]$ cp scst-svn/iscsi-scst/kernel/patches/put_page_callback-3.10.patch ~/rpmbuild/SOURCES/
[mockbuild@nas ~]$ cp scst-svn/scst/kernel/scst_exec_req_fifo-3.10.patch ~/rpmbuild/SOURCES/
Add a line to the kernel config:
[mockbuild@nas ~]$ vim ~/rpmbuild/SOURCES/config-generic
...
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION=y
...
Let's start editing kernel.spec.
[mockbuild@nas ~]$ cd ~/rpmbuild/SPECS
[mockbuild@nas ~]$ vim kernel.spec
Change the commented-out buildid line (# define buildid .local) to:
%define buildid .scst
Add our patches, preferably after all the existing Patch entries:
Patch25091: put_page_callback-3.10.patch
Patch25092: scst_exec_req_fifo-3.10.patch
Add the commands to apply the patches, preferably after all the existing ApplyPatch entries:
ApplyPatch put_page_callback-3.10.patch
ApplyPatch scst_exec_req_fifo-3.10.patch
After all that, start building the kernel rpm packages, including the firmware files:
[mockbuild@nas ~]$ rpmbuild -bb --with baseonly --with firmware --without debuginfo --target=`uname -m` kernel.spec
After the build completes, install the kernel, firmware, devel and headers packages:
[mockbuild@nas ~]$ cd ~/rpmbuild/RPMS/x86_64/
[mockbuild@nas ~]$ su -c 'rpm -ivh kernel-firmware-3.10.10-200.scst.fc19.x86_64.rpm kernel-3.10.10-200.scst.fc19.x86_64.rpm kernel-devel-3.10.10-200.scst.fc19.x86_64.rpm kernel-headers-3.10.10-200.scst.fc19.x86_64.rpm'
Reboot.
After a (hopefully) successful boot, go to the directory with the SCST sources and, as root, build the target itself:
[root@nas ~]$ make scst scst_install iscsi iscsi_install scstadm scstadm_install
After the build, add the service to autostart:
[root@nas ~]$ systemctl enable "scst.service"
And set up the configuration in /etc/scst.conf. For example, mine:
[root@nas ~]$ cat /etc/scst.conf
HANDLER vdisk_fileio {
    DEVICE mail {
        filename /dev/mapper/mail
        nv_cache 1
    }
    DEVICE cloud {
        filename /dev/sdb3
        nv_cache 1
    }
    DEVICE vmstore {
        filename /dev/sdb4
        nv_cache 1
    }
}
TARGET_DRIVER iscsi {
    enabled 1
    TARGET iqn.2013-09.local.nas:raid10-ssdcache {
        LUN 0 mail
        LUN 1 cloud
        LUN 2 vmstore
        enabled 1
    }
}
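If SCST is already running, the same file can also be applied on the fly with scstadmin instead of restarting the service, roughly like this:
[root@nas ~]$ scstadmin -config /etc/scst.conf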
Create the files that allow or deny connections to the target from specific addresses, if you need this:
[root@nas ~]$ cat /etc/initiators.allow
ALL 10.0.0.0/24
[root@nas ~]$ cat /etc/initiators.deny
ALL ALL
Once the configuration files are in place, start SCST:
[root@nas ~]$ /etc/init.d/scst start
If everything was done correctly, then the corresponding target will appear in ESXi.
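If it does not, the quickest check on the storage side is whether iscsi-scst actually enabled and exported the target. With the SCST sysfs interface it looks roughly like this (the paths may differ slightly between SCST versions, so treat this as a sketch):
[root@nas ~]$ cat /sys/kernel/scst_tgt/targets/iscsi/enabled
[root@nas ~]$ ls /sys/kernel/scst_tgt/targets/iscsi/iqn.2013-09.local.nas:raid10-ssdcache/sessions/
On the ESXi side, add the storage address to dynamic discovery and rescan the software iSCSI adapter.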
Thank you for reading to the end!