
NetApp - Reality vs. Marketing

Good day

It just so happens that I have been working with data storage systems for the last five years, four of which I spent on EMC mid-range systems, which on the whole I was pleased with. I may devote a separate post to EMC; this one is about NetApp storage systems, which I have been dealing with over the past year in rather complex configurations. This is the view of a buyer, user and administrator, without deep technical detail or pretty pictures.

If you're interested, welcome under the cut.

How it all began

It all started when we decided to build a second, remote data center. The old storage system had run out of headroom, and since the new DC had to be built from scratch anyway, we decided to buy something that could handle a load of about 1,000 virtual machines and about 20 production Oracle databases, plus a lot of development environments. And since it was a remote DC, that meant replication and fault tolerance at the level of the entire data center. I will not dwell on every aspect of the selection, but the candidates were Hitachi, EMC and NetApp. We chose NetApp because, after testing, we liked how Oracle performed over NFS, and dropping FC SAN as a class meant we could use the existing 10 Gb network. And that settled it.

What tasks were set

  1. Remote sites in active-active mode (meaning some databases run at one site and some at the other, not RAC)
  2. Zero Oracle data loss if one of the sites fails
  3. Database switchover to the other site in up to 5 minutes, using cluster software (Veritas)
  4. Continuous availability of the VMware virtual machines

What is the result?

We ended up with two FAS6280 systems, two controllers each, one per site, for the databases, replicated with SnapMirror, plus one FAS3270 system in a MetroCluster (SyncMirror) configuration for the virtual machines. ONTAP version 8.1.2 on all systems; FlexVol and RAID-DP everywhere.

I will say right away: there are no problems with the MetroCluster on the FAS3270. None. At all. It works and does exactly what it was bought for, carrying about 1,000 virtual machines split evenly between the sites, with no pitfalls. If a controller goes down for a reboot, the VMs freeze on I/O for about 15 seconds and then carry on as if nothing had happened; the giveback when the controller returns takes about as long. I am 200% satisfied with this system. But, to be honest, it is still only about 50% full on capacity and at roughly 25-30% disk I/O utilization (about 4,500 disk IOPS across 75 active disks per site). As practice shows, that is exactly the headroom that lets it run without problems.

Together with NetApp we drew up a specification for the FAS6280, ran a sizing exercise, and went over all the pitfalls that could arise with our workload. We were assured that everything would work as discussed. And the picture we discussed was this:

At the time of launch, utilization at the site was about 40% per controller and disk I/O utilization about 20%, from which one could conclude that the system was simply idling.
Now controller utilization is around 85%, and disk utilization about 70% on average, with peaks of 90%. To put the percentages in absolute terms, 100% utilization corresponds to 32,500 disk operations per controller. Disk operations are not the same thing as NFS IOPS.
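
To make the relationship between those percentages and real numbers concrete, here is a small back-of-the-envelope sketch. The 32,500 figure is the one quoted above; the linear scaling and the NFS-to-disk-op ratio are my own assumptions for illustration, since the real ratio depends entirely on the workload.

```python
# Back-of-the-envelope sketch: relate controller utilization to back-end disk ops
# and front-end NFS IOPS. The 32,500 figure comes from the article; the linear
# scaling and the NFS-to-disk-op ratio are assumptions for illustration only.

DISK_OPS_AT_100_PCT = 32_500   # disk ops per controller at 100% utilization (from the article)

def disk_ops_at(utilization_pct: float) -> float:
    """Estimate back-end disk operations at a given controller utilization,
    assuming roughly linear scaling (a simplification)."""
    return DISK_OPS_AT_100_PCT * utilization_pct / 100.0

def nfs_iops_estimate(disk_ops: float, disk_ops_per_nfs_op: float) -> float:
    """Front-end NFS IOPS implied by a back-end disk-op rate, given an assumed
    amplification factor (WAFL writes, parity and readahead make this workload-dependent)."""
    return disk_ops / disk_ops_per_nfs_op

if __name__ == "__main__":
    for util in (40, 70, 85, 90):
        ops = disk_ops_at(util)
        # 1.5 back-end ops per NFS op is a purely hypothetical ratio for illustration.
        print(f"{util:>3}% utilization ~ {ops:>8,.0f} disk ops "
              f"~ {nfs_iops_estimate(ops, 1.5):>8,.0f} NFS IOPS (at an assumed 1.5x ratio)")
```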

Problem One - Synchronous SnapMirror

What was promised: 30 synchronous replication sessions per controller. That's it. No details, other than the standard caveat that synchronous replication is very sensitive to network latency. Nothing about load or replication direction.

In reality, synchronous replication falls apart with as few as 2 sessions running in opposite directions (that the direction might be the key was something we only realized after almost 2 months).
It looked like this:
[fas6280a_1:wafl.nvlog.sm.sync.abort:notice]: NVLOG synchronization aborted for snapmirror source VOL_prod_ru, reason='Out of NVLOG files'
[fas6280a_1:snapmirror.sync.fail:notice]: Synchronous SnapMirror from fas6280a_1:VOL_prod_ru to fas6280a_2:VOL_prod_ru failed.

We opened a case and talked directly to NetApp engineers; the verdict was always the same: you have network problems. We spent almost a month on our own investigation, and our verdict was that the network was fine - latency under load around 0.5 ms and stable. The hint came from one of the engineers deep inside NetApp: he said that replication in opposite directions does not work because of something in the Consistency Point (CP) mechanism itself. We never got the details, but as soon as we turned all the replication to run in one direction, everything was fine.
Since that did not suit us, we bought four shelves of disks, moved all the Redo/Undo onto the FAS3270 MetroCluster, and forgot about the problems of synchronous replication as a class. True, the bursty load from the virtual machines, the high CPU utilization and the strict response-time requirements for Redo did not let us stay with that solution, but that is a separate story.
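
For anyone running into the same wall: a trivial log watcher like the sketch below would have flagged these aborts for us much earlier. It only looks for the two message patterns shown above; the log path and the way you feed it logs are placeholders, not anything NetApp-specific.

```python
# Minimal log watcher for the SnapMirror Sync aborts shown above.
# The log path and notification are placeholders; only the message
# substrings are taken from the actual ONTAP messages we saw.
import re
import sys

PATTERNS = [
    re.compile(r"wafl\.nvlog\.sm\.sync\.abort"),   # 'Out of NVLOG files' aborts
    re.compile(r"snapmirror\.sync\.fail"),          # synchronous SnapMirror failure
]

def scan(log_path: str) -> int:
    """Print every line that matches one of the failure patterns; return the count."""
    hits = 0
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if any(p.search(line) for p in PATTERNS):
                hits += 1
                print(line.rstrip())
    return hits

if __name__ == "__main__":
    # Example: python sm_sync_watch.py /path/to/filer/messages
    path = sys.argv[1] if len(sys.argv) > 1 else "messages"
    found = scan(path)
    print(f"{found} synchronous SnapMirror failure events found", file=sys.stderr)
```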

Problem Two - Asynchronous SnapMirror

Everything is simpler here: if we schedule synchronization once a minute, we get a never-ending session, because the data does not manage to transfer in the allotted time and computing the changes takes too long. We settled on a 5-minute schedule, offset by 1 minute for each volume. Once the load grew to about 80% disk utilization, the heaviest databases lagged behind by 15-20 minutes all the time, and at peak times (during backups, for example, with disks at 90-95% utilization) the lag grows without bound. It follows that a 5-minute switchover is impossible: just catching up the outstanding changes takes at best 20 minutes, plus the reverse resynchronization after reversing the replication, which on large volumes takes a very long time - the same 20 minutes at best.
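
The arithmetic behind "the lag grows without bound" is simple, and the toy model below shows it: as soon as the change rate during an interval exceeds what a transfer can ship in that interval, the backlog only grows. The rates in the example are made up for illustration, not measurements from our systems.

```python
# Toy model of asynchronous SnapMirror lag: every `interval_min` minutes a transfer
# starts and ships the accumulated backlog at `transfer_rate_gb_min`. When the change
# rate exceeds the sustainable transfer rate, the lag grows without bound.
# All numbers are illustrative, not measurements.

def simulate_lag(change_rate_gb_min: float, transfer_rate_gb_min: float,
                 interval_min: float, steps: int) -> list[float]:
    """Return the backlog (GB not yet replicated) after each schedule interval."""
    backlog = 0.0
    history = []
    for _ in range(steps):
        backlog += change_rate_gb_min * interval_min           # new changes accumulate
        shipped = min(backlog, transfer_rate_gb_min * interval_min)
        backlog -= shipped                                      # transfer runs for one interval
        history.append(backlog)
    return history

if __name__ == "__main__":
    # Off-peak: the transfer keeps up, the backlog stays bounded.
    print("off-peak:", [round(x, 1) for x in simulate_lag(1.0, 1.5, 5, 6)])
    # Peak (backups, disks at 90-95% utilization): the backlog grows every cycle.
    print("peak    :", [round(x, 1) for x in simulate_lag(2.0, 1.2, 5, 6)])
```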

Problem Three - QoS, or in NetApp terminology, FlexShare

It is managed by the system scheduler and works at the volume level: you can set a priority from 1 to 99 for each volume relative to the other volumes, plus a priority relative to system processes. The description is gorgeous - it all seems as simple as two times two. In reality:

It is because of the last of these that prioritization is turned off on our systems. We opened a case on the issue - there is no result.
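
For completeness, the behavior the FlexShare description promises is essentially weighted sharing, and the sketch below shows what "a priority from 1 to 99 per volume" would mean under that reading. This is purely my illustration of the concept, not NetApp's scheduler; the volumes other than VOL_prod_ru are hypothetical.

```python
# Conceptual illustration of volume-level weighted prioritization: each volume
# gets a share of the available I/O budget proportional to its priority.
# This is how the FlexShare description reads; it is not NetApp's implementation.

def expected_shares(priorities: dict[str, int], total_iops: float) -> dict[str, float]:
    """Split an I/O budget across volumes proportionally to their 1-99 priorities."""
    total = sum(priorities.values())
    return {vol: total_iops * prio / total for vol, prio in priorities.items()}

if __name__ == "__main__":
    # VOL_dev and VOL_test are hypothetical names used only for this example.
    volumes = {"VOL_prod_ru": 90, "VOL_dev": 30, "VOL_test": 10}
    for vol, share in expected_shares(volumes, 32_500).items():
        print(f"{vol:<12} -> {share:>8,.0f} IOPS of the budget")
```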

Problem Four - free space on the aggregate (a set of several RAID groups combined into one pool of space)

Those who dealt with NetApp in earlier years probably know that once an aggregate is more than 90% full, the system becomes extremely sluggish and refuses to do real work. Marketing stubbornly claims that this problem has been completely eliminated since version 8.1 (or maybe even 8.0, I honestly don't remember exactly). In reality - a complete lie. We failed to keep an eye on the space and filled an aggregate to 93% - that's it, hello. The response time of the system as a whole roughly doubled. And that is with no deduplication, compression or other goodies - only thin volumes (thin provisioning). The system only recovered once we freed space back down to 85%. Period.
Moreover, try this: fill the aggregate to 85-95%, create a dozen SnapMirror sessions in different directions, load the system to about 80% CPU, and then delete a volume one fifth the size of the aggregate. The result: more than an hour of unusable databases, because the system withdrew into itself and the response time on all volumes ranged from 300 ms to 5 s. It did not even respond on the console; we even considered rebooting the controller, but first we urgently started taking all the load off it and killed the replication on the receiving side, and after a while the system began to recover. NetApp's recommendation: delete files one at a time, do not remove 12 TB in one go via vol destroy.
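
Given how badly the 90% threshold bit us, even a trivial capacity check like the one below would have saved an outage. The thresholds are the ones from our experience (comfortable up to about 85%, painful past 90%); the input is just a name-to-capacity mapping you collect yourself, no particular ONTAP command output is assumed.

```python
# Trivial aggregate-fill check based on the thresholds from our experience:
# the system is comfortable up to ~85% full and degrades badly past ~90%.
# Input is a plain name -> (used_tb, total_tb) mapping, i.e. data you collect
# yourself; no specific ONTAP command output format is assumed.

WARN_PCT = 85.0
CRIT_PCT = 90.0

def check_aggregates(aggrs: dict[str, tuple[float, float]]) -> list[str]:
    """Return human-readable alerts for aggregates above the thresholds."""
    alerts = []
    for name, (used_tb, total_tb) in aggrs.items():
        pct = 100.0 * used_tb / total_tb
        if pct >= CRIT_PCT:
            alerts.append(f"CRIT {name}: {pct:.1f}% full - expect response time to degrade")
        elif pct >= WARN_PCT:
            alerts.append(f"WARN {name}: {pct:.1f}% full - free space before it reaches 90%")
    return alerts

if __name__ == "__main__":
    # Hypothetical figures for illustration (aggr_db at 93%, like the case above).
    sample = {"aggr_db": (46.5, 50.0), "aggr_vm": (30.0, 50.0)}
    for line in check_aggregates(sample):
        print(line)
```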

Problem Five - two aggregates

The joke is that if the disks in one aggregate are 100% loaded, the second aggregate, loaded at 10%, will also refuse to work, and its latency will be comparable to the aggregate that is at 100%. This follows directly from how NetApp writes data.
Unlike classic block arrays, which can push data straight through to the disks (write-through) when the block size exceeds a certain threshold (usually 16 KB or 32 KB, but configurable), NetApp never does that: because of how WAFL works, it essentially always writes in write-back mode. It can afford to thanks to the large amount of memory on the controllers, even on the younger models, and this is precisely the source of the super-low write latency NetApp is so proud of - as long as the system is not heavily loaded. But the cache is shared across all aggregates: once it overflows, writes become impossible even on an unloaded aggregate.
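
Here is a toy model of why the idle aggregate suffers: writes for both aggregates land in one shared cache, and each aggregate drains it only as fast as its own disks allow. Once the busy aggregate's dirty data fills the cache, writes aimed at the idle aggregate stall too. Both the numbers and the model itself are a deliberate simplification, not WAFL/NVRAM internals.

```python
# Toy model of a write cache shared by two aggregates. Writes for both aggregates
# occupy the same cache; each aggregate destages its own dirty data to its own
# disks at its own speed. When the busy aggregate cannot keep up, the shared cache
# fills and writes aimed at the *idle* aggregate stall as well.
# A deliberate simplification for illustration, not a model of WAFL/NVRAM internals.

CACHE_CAPACITY = 100.0   # arbitrary units of dirty data the cache can hold

def simulate(steps: int, incoming: dict[str, float], destage: dict[str, float]) -> None:
    dirty = {aggr: 0.0 for aggr in incoming}
    for t in range(steps):
        # Admission: writes for both aggregates compete for the same free space.
        for aggr, rate in incoming.items():
            free = max(CACHE_CAPACITY - sum(dirty.values()), 0.0)
            accepted = min(rate, free)
            if accepted < rate:
                print(f"t={t:>2} {aggr}: {rate - accepted:.0f} units of writes stalled (shared cache full)")
            dirty[aggr] += accepted
        # Destage: each aggregate drains to its own disks at its own back-end speed.
        for aggr, rate in destage.items():
            dirty[aggr] = max(dirty[aggr] - rate, 0.0)

if __name__ == "__main__":
    # aggr_busy receives more writes than its disks can absorb; aggr_idle is lightly
    # loaded, yet once the shared cache fills, its writes start stalling as well.
    simulate(steps=20,
             incoming={"aggr_busy": 10.0, "aggr_idle": 2.0},
             destage={"aggr_busy": 4.0, "aggr_idle": 4.0})
```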

Conclusion: do not load the disks beyond about 60% if you want consistently low latency from the system. And there is little point in creating a separate aggregate if it is built from the same kind of drives.

Small conclusions


That, on the whole, is probably all of my observations from a year of working with these systems. The systems are actually not bad at all and are very easy to learn; administration is, on the whole, a pleasure.

Source: https://habr.com/ru/post/212453/

