
Nexenta operating experience, or 2 months later

A couple of months ago we launched a storage system based on Nexenta for mass virtualization. We put it under serious load, and not everything went by the book. If you are interested in the gory details and operating experience, read on.


As a reminder, Nexenta is commercial storage software based on ZFS / OpenSolaris. It can do a lot, but what we needed was to feed our virtualization nodes running KVM and a Hyper-V / Windows 2008 HA cluster over iSCSI, with deduplication, caching, and all the other goodies.
The box was specced out properly: 36 x 300 GB 15K SAS disks, a 300 GB Intel 710-series SSD (hard to do better at the time), and so on.

We assembled the server and installed Nexenta; it was delivered and set up by Nexenta's official partners with certified engineers. Everything came up and worked. We solemnly enabled deduplication and rejoiced. We set up backups.
We began moving virtual machines there from the Hyper-V / Windows 2008 R2 cluster. There were a lot of them, several hundred, and they are not small: 20 GB on average, ranging from 5 GB to 200 GB.

We split the cluster: the old one stayed in the old data center, and the new cluster was already assembled in the new one. So the VMs migrated not over the local LAN but over a VLAN of a few hundred megabits, and migrating a VM with a 200 GB disk dragged on for several hours.

Everything would have been fine, but we hit the first snag within a week: as we migrated onto Nexenta, the memory for the deduplication table ran out. Note that in Nexenta the deduplication table belongs to the disk pool, not to a specific volume. And running out of memory does not mean the smart system will disable the fancy feature in advance. The smart system instead goes into a stupor, with performance degrading by one or two orders of magnitude. That is, where you used to push VM images at 80-90 MB/s through a gigabit port, you now get 3-4 MB/s and 200-300 IOPS total.

As it turns out, each filesystem block in the Nexenta pool needs roughly 500 bytes of memory for the deduplication table. With 128 GB of memory in the system, Nexenta can keep information about roughly 256,000,000 blocks: the maximum amount of occupied pool space is about 1 TB with 4K blocks, 4 TB with 16K blocks, 8 TB with 32K blocks, and 16 TB with 64K blocks. Write that on the wall with a red marker! Cross the line and you can kiss your bonus goodbye =)
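For reference, here is a back-of-the-envelope version of that calculation. It is just arithmetic on the ~500 bytes per block figure; the exact per-entry size depends on the ZFS version, so treat the results as estimates, not Nexenta specs.

```python
# Rough sizing of the ZFS dedup table (DDT), using the ~500 bytes per block
# figure from the text. Estimates only; the real entry size varies by version.

DDT_BYTES_PER_BLOCK = 500          # approximate in-core size of one DDT entry
RAM_BYTES = 128 * 10**9            # 128 GB of RAM in the storage head

max_blocks = RAM_BYTES // DDT_BYTES_PER_BLOCK    # ~256,000,000 blocks fit in RAM
print(f"blocks tracked in RAM: {max_blocks:,}")

for block_kib in (4, 16, 32, 64):
    pool_tb = max_blocks * block_kib * 1024 / 10**12
    # prints roughly 1 / 4.2 / 8.4 / 16.8 TB, which the article rounds to 1/4/8/16
    print(f"{block_kib:>2}K blocks -> ~{pool_tb:.1f} TB of pool data before the DDT spills out of RAM")
```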

If you didn't pick the right block size at the very beginning, nothing helps except migrating to a new volume with the correct block size. Simply turning off deduplication changes nothing: the deduplication table has nowhere to go, including out of memory. All you can do is create another, new volume without deduplication and copy the old one onto it. And yes, if the dedup table has already overflowed by that point, the copy runs at about 3-4 GB per minute. In our case the volume was about 4 TB, and the migration (zfs send / zfs receive) took about 16 hours. Then we destroyed the deduplicated volume, and after a couple of hours the machine felt so much better that we could bring everything back up.
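A minimal sketch of that migration, assuming a pool called "tank" with a deduped zvol "tank/vm-old" (both names are made up for illustration): make sure new writes land without dedup, then replicate the old volume with zfs send / zfs receive.

```python
# Sketch: migrate a deduped zvol onto a fresh, dedup-free volume on the same pool.
# Pool and volume names are placeholders, not anything from the original setup.
import subprocess

def run(cmd, **kw):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, **kw)

# New data written to the pool should no longer be deduplicated
# (existing DDT entries stay around until the old volume is destroyed).
run(["zfs", "set", "dedup=off", "tank"])

# Snapshot the source and stream it into a new volume. With the DDT overflowing
# RAM this crawled at ~3-4 GB/min for us (about 16 hours for a 4 TB volume).
run(["zfs", "snapshot", "tank/vm-old@migrate"])
send = subprocess.Popen(["zfs", "send", "tank/vm-old@migrate"], stdout=subprocess.PIPE)
run(["zfs", "receive", "tank/vm-new"], stdin=send.stdout)
send.stdout.close()
send.wait()

# Only once clients are repointed at tank/vm-new and the copy is verified:
# run(["zfs", "destroy", "-r", "tank/vm-old"])
```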

Note that we had free space on the volumes, so we could do this. But if there is not enough room inside the storage for such a migration, because of a shortage of disks or license limits, then welcome to a world of wonders and adventures. iSCSI in this state barely works and constantly drops off from Windows on timeouts, while Linux stoically endures the slowdown. You will have to mount the block device somewhere and spend many hours pulling the contents off it; the other option is to cram more memory into the Nexenta server (which as a rule means replacing the motherboard) so that it comes back to its senses...

A few more words about block-level deduplication: it does not work, forget about it. Keeping hundreds of client VM .vhd files, all built from 3-4 templates, on an array with 4 KB blocks (sic!), we expected it to be like in the NetApp book, a ratio of 8-10. That did not happen; the dedup ratio was about 2-3. In other words, enabling plain compression gives a similar result, around 40%. To reach ratios of 8-10 you probably need offline deduplication with a sliding block, as in the storage pools of the new Windows 2012 or in ONTAP.
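If you want to see what deduplication is actually buying you, ZFS reports a pool-wide ratio; a quick sketch, with "tank" as a placeholder pool name:

```python
# Print the pool-wide deduplication ratio as ZFS itself reports it.
# "tank" is a placeholder pool name; the 2.31x below is example output.
import subprocess

out = subprocess.run(["zpool", "get", "dedupratio", "tank"],
                     capture_output=True, text=True, check=True).stdout
print(out)
# NAME  PROPERTY    VALUE  SOURCE
# tank  dedupratio  2.31x  -        <- we saw 2-3x here, not the hoped-for 8-10x
# ("zdb -DD tank" additionally prints a DDT histogram, handy alongside the
#  memory estimate above.)
```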

Moving on. We turned off deduplication and everything got going again. The volumes are large, 7 TB each. We continued migrating Hyper-V VMs from the old cluster to the new one and noticed on MRTG that the write speed was dropping. Not immediately, but slowly: over a day of migration at 50-60 MB/s it fell roughly by half.

Why? Here comes the second snag, one you will not find in the marketing materials right away. As a ZFS pool fills up, its performance degrades radically. With more than 30% free space the effect is barely noticeable (though it shows on the graph); between 30% and 10% free, everything runs about three times slower; and once free space drops below 10%, everything runs an order of magnitude slower.

The moral: watch your free space. If the pool runs out of space it is a catastrophe; you will have to expand the pool out of cycle by shoving in disks / shelves / licenses, or migrate data to neighboring storage systems or to the nodes' local disks. Nexenta stays healthy when it has more than 30% free space; plan for that when sizing your servers.
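A trivial watchdog in that spirit; this is our own sketch, not anything Nexenta ships. It polls the pool's capacity and complains at the 70% / 90% full marks where we saw things slow down. The pool name and thresholds are placeholders.

```python
# Warn before a ZFS pool crosses the fill levels where performance falls off.
# Pool name and thresholds are assumptions for illustration.
import subprocess

def pool_capacity_percent(pool: str) -> int:
    # "capacity" is the percent-used figure zpool reports, e.g. "68%".
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "capacity", pool],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out.rstrip("%"))

cap = pool_capacity_percent("tank")
if cap >= 90:
    print(f"CRITICAL: pool is {cap}% full -- expect an order-of-magnitude slowdown")
elif cap >= 70:
    print(f"WARNING: pool is {cap}% full -- writes are already noticeably slower")
else:
    print(f"OK: pool is {cap}% full")
```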

But that is not the most unpleasant part. The most unpleasant part is when Hyper-V loses the quorum disk, or an entire 4 TB Cluster Shared Volume feeding three large nodes, and it cannot be mounted on any of the nodes by any means. How to achieve this magical result: put a hundred VMs on one CSV volume under Windows, start them, and then start migrating a few more VMs onto the same volume. At first everything copies along cheerfully, then the copy stalls, and then the CSV drops off. Yes, on the whole cluster. The iSCSI target is still visible, but the volume cannot be mounted from it.

In some cases it can be mounted back; in others, after a day of back-and-forth with MS support, the volume from Nexenta can be mounted on one of the nodes of the former cluster, where the .vhd files live, but without the VM configurations. Then you can copy them, at the speed quoted above, to the node's good old local hard drives and bring them up one at a time, registering them in Hyper-V by hand.

This applies only to Hyper-V / Windows 2008 R2 and not to KVM. It also does not apply to StarWind, which works smoothly in this mode. Our opinion is that the Windows 2008 iSCSI stack and Nexenta are not quite compatible with each other: massive parallel writing and reading leads to locks, timeouts, and failures. Support at both companies told us nothing intelligible. The forums say that certain Windows patches help in some cases, that the LUN type should be changed, and so on.

Furthermore, upgrading the Nexenta system requires a reboot. If you do not have a high-availability storage cluster, be prepared to announce a maintenance window or to migrate all VMs to neighboring storage. Keep this in mind; the changelog between Nexenta versions is huge.
Also, the Nexenta disk pool should ideally consist of identical disks assembled into identical raid groups. You will not be able to expand these groups later or change their raid type.
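To illustrate that constraint, here is a sketch of building a pool out of identical raidz2 groups; the disk names are hypothetical. Later growth means adding another whole group of the same shape (zpool add); you cannot stretch an existing group or convert its raid level.

```python
# Build a zpool create command from identical raidz2 groups of identical disks.
# Disk names and group size are placeholders for illustration.
import subprocess

disks = [f"c0t{i}d0" for i in range(12)]          # 12 identical disks (hypothetical names)
group_size = 6

cmd = ["zpool", "create", "tank"]
for i in range(0, len(disks), group_size):
    cmd += ["raidz2"] + disks[i:i + group_size]   # two identical 6-disk raidz2 groups

print(" ".join(cmd))
# subprocess.run(cmd, check=True)                 # uncomment on a real system
```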

Those, in general, are our impressions of Nexenta after two months of operation. Right now Linux virtualization works with Nexenta fairly smoothly and quickly, the Windows 2012 cluster works with StarWind, and we are testing the speed of Storage Pools / SMB 3.0 and how to use them to build highly available CSVs. Yes, offline deduplication works there and we see good ratios, but that is a topic for another article.

A few words about overall performance and the I/O pattern. For VPS workloads the SAS disks in the pool are not really needed; 12 x 1 TB SATA disks would have been enough for us. We had planned for everything to run with deduplication, but it did not work out; had we known in advance, we would have saved the money. The built-in disk monitor is convenient: you can immediately see how many operations each disk is handling. The cache SSD must be server class; it is under constant, serious load for both reads and writes.

And how do you run Nexenta under load with volumes of tens of terabytes? Without load, everything is beautiful.

Source: https://habr.com/ru/post/179151/
