
ZFS Filer for Cloud Infrastructure - NexentaStor

As part of Hostkey's move of its main data center from Stordata to MegaFon, the relocation of our main virtualization site, and the transition to Windows Server 2012, we had to build a new high-performance filer to serve iSCSI / NFS / SMB targets to our clusters and to clients organizing private clouds and clusters. We chose and deployed NexentaStor; here is what came of it and how we did it.

Last Friday we wrote about our experience deploying 180 VPSes via SolusVM, which we liked in every respect except the lack of fine-grained resource allocation. That allocation was important to us for the continuous operation of clusters based on KVM for Linux and on Hyper-V for Windows Server 2012.

For the last two years we had worked, with varying success, with StarWind. The "varying success" was that the highly available cluster of two equivalent filers for Hyper-V (as originally planned) had to be abandoned: it fell apart about once a week. In the end it worked more or less reliably only as a primary plus a backup filer with asynchronous replication between them. Through MPIO settings we configured the Hyper-V cluster to use the backup filer only in case of a failure, but sometimes targets still slipped over to the backup filer on their own.

Another drawback is that you cannot do ANYTHING with a volume once it is created: no migration, no expansion, no changes to its settings. When the volume is around 8 TB and under a constant load of 1000 to 2000 IOPS, that is a real problem, and every migration becomes an admin's nightmare with a month of preparation and replacement servers. That is leaving aside things like resynchronization that takes about 3 days, blue screens of death, unstable deduplication and so on. Trying to get to the root of a problem through support was a dead end: you send logs and get silence, or a vague answer that misses the essence of the issue. Any version upgrade == reload the target == 2-3 days of resync == nightmare.
That system ran on Windows Server 2008 Standard edition: 2 machines with 2 processors each, at ~2,000 rubles per month in SPLA fees per machine, or 48,000 rubles a year. Support for our 8 TB HA license was another roughly 60,000 rubles. In total, the baseline TCO came to about 100,000 rubles per year.

Nexenta


We had been looking for an alternative for a long time and began taking a closer look at NexentaStor, a system based on ZFS / OpenSolaris (yes, I know Solaris is not exactly open there; that is not the point). Nexenta has a competent small local partner who helped us pick the right configuration and set everything up properly. The whole process, from the start of testing to launch, took about a month.

I will not go into ZFS theory; plenty of posts have already been written about it.

I just want to say: the HCL (hardware compatibility list) is everything. Do not even try to use hardware that is not listed in it; no good will come of it. As a hoster with a fleet of a thousand of our own machines this was relatively easy for us: we could go to the spare-parts cabinet or a free server and pull out an alternative part. A corporate client will have a much harder time after a wrong choice. Solaris simply does not have the breadth of drivers that other popular operating systems do.
The 8 TB license with support cost us $1,700; there is also a free version without support for up to 18 TB, which is the same thing but without the useful and necessary plug-ins. The TCO is just the support contract, about 30,000 rubles per year, so we immediately save about 70,000 rubles.

The result of our collaboration was a server with the following configuration:

Hardware



A purpose-built Supermicro filer chassis, SuperChassis 847E16.

4U, redundant power supplies, 36 3.5" hot-swap drive bays on both sides, 2 high-quality backplanes, large hot-swappable fans, and room for a standard motherboard with 8 low-profile expansion cards.
Supermicro motherboard, dual socket 1366, 6x PCIe x8, 2x SAS 2.0, two Xeon E5520 processors, 192 GB DDR3 LV RAM.
3x LSI 9211-8i HBA controllers: Nexenta wants raw disks, no hardware RAID.
1x LSI 9200-8e HBA controller for future expansion with SAS disk shelves.
1x 4-port Adaptec controller for connecting SSDs.
1x Intel 520 10G network adapter with two SFP+ ports; it connects to the switch via copper direct-attach cables.

Drives: 35x Hitachi Ultrastar SAS 300 GB 15K.
One bay is occupied by an AgeStar hot-swap box with two small 2.5" SSDs for the ZFS intent log (ZIL); these are consumables and need to be quick and easy to replace (the box fits a 3.5" chassis slot as if it were native). The box is connected to the Adaptec controller.
One 300 GB server-class Intel 710 SSD serves as the central ZFS read cache. Consumer drives are not suitable; they wear out many times faster. It is also connected to the Adaptec controller (a sketch of how such log and cache devices attach to a pool follows after the hardware list).
Nexenta caches in RAM and keeps the deduplication table and metadata in memory, so the more memory the better. We installed the maximum: 12x 16 GB = 192 GB.
The budget for the whole build: ~350,000 rubles.
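
For reference, a minimal sketch of how such log (ZIL) and cache (L2ARC) devices would be attached to a ZFS pool. NexentaStor normally does this through its web UI or console; the pool and device names below are placeholders, and the code simply wraps the standard ZFS commands:

```python
import subprocess

POOL = "tank"  # hypothetical name for the pool built from the 15K SAS drives

def zpool(*args):
    """Run a zpool command, raising if it fails."""
    subprocess.run(["zpool", *args], check=True)

# Mirror the two small SSDs from the AgeStar box as a separate intent log (ZIL),
# so losing a single log SSD does not lose in-flight synchronous writes.
# Device names are placeholders for the Solaris-style c*t*d* identifiers.
zpool("add", POOL, "log", "mirror", "c4t0d0", "c4t1d0")

# Add the Intel 710 SSD as an L2ARC read-cache device.
zpool("add", POOL, "cache", "c4t2d0")

# Inspect the resulting layout.
zpool("status", POOL)
```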

Network



The server has 2 Gigabit ports on the motherboard and 2 10G ports on the Intel 520 adapter. The 10G ports are connected to a Cisco 2960S switch with 24 1G ports and two 10G SFP+ ports (~90,000 rubles). The link uses copper SFP+ direct-attach cables, which are inexpensive: about 1,000 rubles per meter of cable with SFP+ connectors.
In the near future an Extreme Summit 640 switch with 48 10G SFP+ ports (~350,000 rubles) will arrive; then we will re-cable the filer to it and keep the Cisco for 1G distribution.

Disk layout


As it turned out, the most reliable layout is an analogue of RAID50: the disks are grouped into three-disk RAIDZ1 vdevs, which are combined into a common pool. A certain number of disks are set aside as hot spares. As you know, ZFS has a special mechanism, scrubbing, which in idle time compares the actual contents of the disks with the block checksums and immediately repairs any discrepancy it finds. If a disk accumulates too many errors, it is dropped from the array and a spare takes its place.
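
A back-of-the-envelope sketch of that layout, assuming 33 of the 35 drives go into eleven 3-disk RAIDZ1 vdevs and two are kept as hot spares (the exact split is an assumption on my part, and the device names are placeholders):

```python
# Rough capacity estimate for the RAID50-like layout described above.
DRIVE_GB = 300   # Hitachi Ultrastar 15K SAS drives
VDEVS = 11       # assume eleven 3-disk RAIDZ1 groups out of the 35 drives
SPARES = 2       # with two drives left as hot spares

usable_tb = VDEVS * 2 * DRIVE_GB / 1000   # each triple stores ~2 data disks + 1 parity
print(f"{VDEVS * 3 + SPARES} drives -> ~{usable_tb:.1f} TB usable")

# The equivalent plain-ZFS commands (Nexenta wraps them in its UI; device names
# here are placeholders):
#
#   zpool create tank \
#       raidz1 c1t0d0 c1t1d0 c1t2d0 \
#       raidz1 c1t3d0 c1t4d0 c1t5d0 \
#       ...                              # eleven triples in total
#       spare  c3t4d0 c3t5d0
#
#   zpool scrub tank    # walk the pool, verify checksums, repair from parity
```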


Management



Everything is managed through a simple, intuitive web interface, and there is also a console with straightforward syntax and the usual ZFS commands. The web UI provides statistics, generation of iSCSI / NFS / SMB targets, and creation and management of mount points. And here is the most important reason we chose Nexenta: thin provisioning and deduplication. In our virtualization world deduplication is everything: in each of hundreds of Windows VPSes, roughly 10 GB of data is practically identical. In Nexenta it is deduplicated at the block level.
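
For those who prefer the console, this is what enabling it looks like at the plain-ZFS level; the dataset name is hypothetical, and the checkbox in the Nexenta web UI sets the same property:

```python
import subprocess

DATASET = "tank/vms"   # hypothetical dataset holding the VM volumes

# Turn on block-level deduplication for everything created under this dataset.
# The dedup table is held in RAM, which is why the box carries 192 GB of it.
subprocess.run(["zfs", "set", "dedup=on", DATASET], check=True)

# The achieved ratio is reported per pool (the dedupratio property).
subprocess.run(["zpool", "get", "dedupratio", "tank"], check=True)
```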

Furthermore, customers order large disks for their virtual machines and do not use them fully; average usage including the OS is about 15 GB. In Nexenta we can create a virtual iSCSI target (a shared iSCSI mount point) of arbitrary size, for example 500 TB. By itself this consumes no disk space; space is used only as new data appears.
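
At the ZFS level such a target is backed by a sparse (thin-provisioned) zvol. A minimal sketch, with illustrative names and sizes; Nexenta's UI does the equivalent when you create a zvol without a reservation:

```python
import subprocess

def zfs(*args):
    subprocess.run(["zfs", *args], check=True)

# Create a thin-provisioned (sparse, -s) 500 TB volume to export as an iSCSI LUN.
# No space is reserved up front; blocks are allocated only as clients write data.
zfs("create", "-s", "-V", "500T", "tank/client-lun0")

# A smaller block size can suit random VM I/O better (optional, illustrative):
# zfs("create", "-s", "-b", "8K", "-V", "500T", "tank/client-lun0")

# Watch actual consumption versus the advertised size:
subprocess.run(["zfs", "list", "-o", "name,volsize,used", "tank/client-lun0"], check=True)
```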

The capacity of the pool can be expanded at any time by adding new disks. Unlike expanding a 10 TB RAID6 array on a traditional controller, this is not a dramatic event: you simply add disks and the system starts using them. Nothing is rebuilt and no data is moved anywhere.
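
A sketch of what that expansion looks like in plain ZFS terms, with placeholder device names; it is a single online operation:

```python
import subprocess

POOL = "tank"

# Append one more 3-disk RAIDZ1 vdev to the existing pool. The operation is
# online; nothing is rebuilt and existing data stays where it is.
subprocess.run(["zpool", "add", POOL, "raidz1", "c2t0d0", "c2t1d0", "c2t2d0"], check=True)

# The extra capacity is available immediately.
subprocess.run(["zpool", "list", POOL], check=True)
```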

Nexenta provides complete statistics on how it feels and performs: CPU load, network interfaces, memory, cache, disk layout, IOPS per drive and per target, and so on.


Backup


Snapshots are simple by the very nature of ZFS. The system can take snapshots of mount points at a specified interval and ship them off to external storage, and you can do asynchronous replication to a neighboring filer. The virtual machines themselves can be backed up by whatever means the virtualization system in use provides.
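
Under the hood this is the standard snapshot / send / receive mechanism. Below is a minimal sketch of such a periodic job, assuming a hypothetical dataset tank/vms and a backup host reachable over SSH; Nexenta's replication plug-ins wrap the same idea:

```python
import subprocess
from datetime import datetime, timezone

DATASET = "tank/vms"            # hypothetical dataset to protect
REMOTE = "root@backup-filer"    # hypothetical archive box
REMOTE_DATASET = "backup/vms"

def run(cmd, **kw):
    return subprocess.run(cmd, check=True, **kw)

# 1. Take a new snapshot named after the current time.
snap = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
run(["zfs", "snapshot", f"{DATASET}@{snap}"])

# 2. Find the previous snapshot so only the delta has to be sent.
out = run(["zfs", "list", "-t", "snapshot", "-H", "-o", "name", "-s", "creation",
           "-r", DATASET], capture_output=True, text=True).stdout.split()
snaps = [s for s in out if s.startswith(f"{DATASET}@")]
prev = snaps[-2] if len(snaps) > 1 else None

# 3. Ship it: incremental if a previous snapshot exists, full otherwise.
if prev:
    send = subprocess.Popen(["zfs", "send", "-i", prev, f"{DATASET}@{snap}"],
                            stdout=subprocess.PIPE)
else:
    send = subprocess.Popen(["zfs", "send", f"{DATASET}@{snap}"], stdout=subprocess.PIPE)
run(["ssh", REMOTE, "zfs", "receive", "-F", REMOTE_DATASET], stdin=send.stdout)
send.stdout.close()
if send.wait() != 0:
    raise RuntimeError("zfs send failed")
```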

Performance


Yes, 10G plus multi-tiered storage work almost miracles. On a test bench running Windows 2008 Standard edition with the regular iSCSI initiator and a QLogic card, we measured latency of about 2-4 ms, sequential transfers of about 700-800 MB per second, and more than 31,000 read IOPS at a queue depth of 16. Writes came in at about 25,000 IOPS, though in our mass-virtualization environment the write-to-read ratio is roughly 1 to 10. At a queue depth of 1 we get about 3,000 IOPS. The test file size does not affect the speed; make it 1 TB if you like.
As the disks fill up and the load grows, we will add more disks, which will add both speed and space.
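
For the curious, here is a very rough way to approximate such a random-read test from a Linux initiator. This is not the tool we used (the numbers above come from the Windows test bench), and the device path is a placeholder:

```python
import os
import random
import threading
import time

DEVICE = "/dev/sdX"   # placeholder: the iSCSI LUN as seen by the initiator
BLOCK = 4096          # 4 KiB random reads
QUEUE_DEPTH = 16      # matches the queue depth of 16 quoted above
DURATION = 30         # seconds to run

stop = False
counts = [0] * QUEUE_DEPTH

def worker(idx: int, dev_size: int) -> None:
    fd = os.open(DEVICE, os.O_RDONLY)
    try:
        while not stop:
            # Pick a random block-aligned offset and read one block.
            offset = random.randrange(dev_size // BLOCK) * BLOCK
            os.pread(fd, BLOCK, offset)
            counts[idx] += 1
    finally:
        os.close(fd)

fd = os.open(DEVICE, os.O_RDONLY)
dev_size = os.lseek(fd, 0, os.SEEK_END)
os.close(fd)

threads = [threading.Thread(target=worker, args=(i, dev_size)) for i in range(QUEUE_DEPTH)]
for t in threads:
    t.start()
time.sleep(DURATION)
stop = True
for t in threads:
    t.join()

# Caveat: without O_DIRECT the initiator's page cache inflates the result,
# so run against a device much larger than the client's RAM.
print(f"approx. random-read IOPS: {sum(counts) / DURATION:.0f}")
```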


Usage


The main purpose of this solution is to serve iSCSI targets to a highly available Windows Server 2012 cluster for Cluster Shared Volumes, and to KVM nodes running under SolusVM, where they back the LVM that the virtual machines live on. Deduplication and all the goodies described above let us keep prices at an excellent level.

The second purpose is to offer iSCSI targets to clients as a pay-per-use service, so that clients who need a cluster no longer have to build a filer of their own. Since we specialize in large dedicated servers with a 4-hour SLA, we know what they are used for. Shared, centralized infrastructure makes serious savings possible.

We can now serve targets at both 1G and 10G. Client targets are automatically snapshotted and replicated to archive storage.

I hope this post proves useful and helps you avoid the usual pitfalls when choosing a budget multi-tier storage solution. As we keep running it, I will write up further observations. Comments are welcome, and welcome to HOSTKEY.

Source: https://habr.com/ru/post/171321/

