Good day

It so happens that I have been working with data storage systems for the last 5 years, 4 of which I dedicated to EMC mid-tier systems, which I was generally pleased with. I may devote a separate post to EMC; this one is about NetApp storage systems, which I have been dealing with over the past year in rather complex configurations. This is the view of a buyer, user and administrator, without deep technical details or pretty pictures.
Those who are interested, welcome under the cut.
How it all began
It all started when we decided to build a second, remote data center. Since the old storage system had run out of steam and the new DC had to be built from scratch anyway, we decided to buy something that could handle a load of about 1000 virtual machines and about 20 Oracle production databases plus a lot of development. And since it is a remote DC, that automatically means replication and fault tolerance at the level of the entire data center. I will not dwell on all the aspects of the choice; I will just say that the candidates were Hitachi, EMC and NetApp. We chose NetApp because after the tests we liked how Oracle behaves on NFS, and because of the absence of an FC SAN as a class: we could use the existing 10 Gb network. And that settled it.
What tasks were set
- Remote sites in active-active mode (meaning some databases live on one site and some on the other, not RAC)
- Zero Oracle data loss if one of the sites fails
- Database failover time to the other site of up to 5 minutes, using cluster software (Veritas)
- Continuous availability of the VMware virtual machines
What is the result?
We ended up with two FAS6280 systems, 2 controllers each, one per site, for the databases, with SnapMirror replication between them, plus one FAS3270 system in a MetroCluster (SyncMirror) configuration for the virtual machines. Data ONTAP 8.1.2 on all systems. FlexVol and RAID-DP everywhere.
I will say right away: there are no problems with the MetroCluster on the FAS3270. None. At all. It works and does exactly what it was bought for, carrying about 1000 virtual machines split evenly between the sites. No problems, no pitfalls. If a controller goes down for a reboot, the VMs freeze on I/O for about 15 seconds and then carry on as if nothing had happened. The giveback, when the controller comes back, takes about the same amount of time. I am 200% satisfied with this system. But, to be honest, it is still only about 50% full in terms of space and about 25-30% loaded in terms of disk I/O (about 4,500 disk IOPS per 75 active disks per site). As practice shows, that is exactly what lets it run without problems.
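For reference, the takeover behaviour described above can also be watched and driven by hand from the 7-Mode console; these are standard HA-pair commands, not something specific to our setup:

cf status      # shows whether the partner is alive and whether takeover is possible
cf takeover    # manually take over the partner (the same thing that happens automatically on a reboot)
cf giveback    # hand the resources back once the partner has returned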
Together with NetApp we drew up a specification for the FAS6280, did the sizing, and discussed all the pitfalls that might come up with our workload. We were assured that everything would work as discussed. And the picture we discussed was the following (a sketch of the corresponding snapmirror.conf follows the list):
- Each database consists of 3 volumes: data + index, archivelog, redo + undo.
- 20 databases spread across 4 controllers, with replication between each pair of controllers.
- Two volumes, data and arch, replicate with asynchronous SnapMirror once a minute; the redo volume replicates with synchronous SnapMirror.
- Development databases are spread in a thin layer across all 4 controllers.
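To give an idea of what this looks like in practice, on 7-Mode the whole scheme boils down to a few lines in /etc/snapmirror.conf on the destination controller. The volume names below are made up for illustration; the schedule syntax is the real one (minute hour day-of-month day-of-week, or sync):

# data and archivelog volumes: asynchronous, every minute
fas6280a_1:VOL_prod_data  fas6280a_2:VOL_prod_data  -  0-59 * * *
fas6280a_1:VOL_prod_arch  fas6280a_2:VOL_prod_arch  -  0-59 * * *
# redo volume: fully synchronous SnapMirror
fas6280a_1:VOL_prod_redo  fas6280a_2:VOL_prod_redo  -  sync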
At launch time, controller utilization was about 40% on each controller and disk I/O utilization about 20%, from which you could conclude that the system was basically idle.
Now controller utilization is around 85%, disk utilization about 70% on average and up to 90% at peaks. To put numbers on it: 100% utilization corresponds to 32,500 disk operations per controller. Note that disk operations != the number of IOPS seen over NFS.
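For those who want to watch these numbers on their own systems, on 7-Mode the quickest way is the built-in statistics; nothing vendor-exotic is needed:

sysstat -x 1    # per-second CPU utilization, NFS/CIFS ops, disk throughput and disk utilization
sysstat -u 1    # a shorter view that also shows CP (Consistency Point) activity, which comes up later in this story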
Problem One - Synchronous SnapMirror
What was promised: up to 30 synchronous replication sessions per controller. That's it. No details, except for the warning that synchronous replication is very sensitive to network latency. Nothing about load or direction.
In reality, synchronous replication falls apart even with just 2 sessions running in opposite directions (the fact that the direction would turn out to be the key thing we only realized after almost 2 months).
It looked like this:
[fas6280a_1:wafl.nvlog.sm.sync.abort:notice]: NVLOG synchronization aborted for snapmirror source VOL_prod_ru, reason='Out of NVLOG files'
[fas6280a_1:snapmirror.sync.fail:notice]: Synchronous SnapMirror from fas6280a_1:VOL_prod_ru to fas6280a_2:VOL_prod_ru failed.
We opened a case and talked directly to NetApp engineers; the verdict was always the same: you have network problems. We spent almost a month on our own investigation of the subject, and our verdict was that the network is fine, latency under load sits around 0.5 ms and does not jump. The solution came from one of the engineers deep inside NetApp: he said it does not work with sessions going in different directions because of something in the Consistency Point (CP) mechanism itself. We never got to the bottom of the details, but as soon as we turned all the replication around to run in one direction, everything was fine.
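A small practical note: the state of a synchronous relationship, including whether it has dropped out of sync, is visible from the console, which is what we spent those two months staring at:

snapmirror status -l VOL_prod_ru    # detailed state, lag and transfer status for the relationship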
Since that did not suit us, we bought 4 shelves of disks, moved all the redo/undo volumes to the FAS3270 MetroCluster, and forgot about the problems of synchronous replication as a class. True, the bursty load from the virtual machines, the high CPU load and the strict response time requirements for redo did not let us stay with this solution, but that is a separate story.
Problem Two - Asynchronous SnapMirror
Everything is simpler here. If you set the schedule to once a minute, you get a never-ending transfer: the data does not manage to get across in the allotted time, and calculating the changes takes too long. We settled on a 5-minute schedule, shifted by 1 minute for each volume (see the sketch below). Once the load grew to 80% disk utilization, the heaviest databases started lagging by 15-20 minutes all the time, and during peaks (backups, for example, with the disks at 90-95% utilization) the lag grows towards infinity. It follows that there can be no failover in 5 minutes: just rolling forward the outstanding changes takes 20 minutes at best, plus the reverse resynchronization after reversing the replication, which on large volumes takes a very long time, again 20 minutes at best.
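The staggered schedule itself is nothing more than offsetting the minutes field in /etc/snapmirror.conf; the volume names are again illustrative:

# every 5 minutes, shifted by 1 minute per volume so the transfers do not all start at once
fas6280a_1:VOL_prod_data  fas6280a_2:VOL_prod_data  -  0-55/5 * * *
fas6280a_1:VOL_prod_arch  fas6280a_2:VOL_prod_arch  -  1-56/5 * * *
fas6280a_1:VOL_dev_data   fas6280a_2:VOL_dev_data   -  2-57/5 * * *

The current lag is shown in the Lag column of snapmirror status, which is where those 15-20 minute figures come from.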
Problem Three - QoS, or in NetApp terminology, FlexShare
It drives the scheduler, works at the volume level, and lets you set a priority from 1 to 99 for each volume relative to the other volumes and relative to system processes. The description is gorgeous; it would seem everything is as simple as 2x2. In reality:
- There is no way to cap a volume from above, so if some application goes haywire and starts hammering the I/O, everyone suffers.
- No matter what priorities you set, say 99 for production and 1 for development, the development load still affects production. Though yes, the response time on the development volume will be slightly worse, as long as its load is constant.
- If the development load is not constant but bursty, the priority does not work at all; it needs time to adapt.
- Priority does not speed up SnapMirror transfers, no matter how high the system priority is set.
- Priorities add to the overall latency through the system, though not by much; it is noticeable only under very high load.
- Under very high controller load most commands become impossible to run: the result comes back 3-5 minutes after you issue the command, so you cannot even check the replication status, and that blocks the cluster software.
It is because of the last point that priorities are turned off on our systems. We opened a case about it; there is no result.
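For completeness, this is roughly what enabling and configuring FlexShare looks like on 7-Mode; the volume names and priority levels here are purely illustrative:

priority on                                              # enable FlexShare
priority set volume VOL_prod_data level=VeryHigh system=Medium
priority set volume VOL_dev_data  level=VeryLow
priority show volume                                     # check what is configured
priority off                                             # what our systems ended up with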
Problem Four - free space on the aggregate (a set of several RAID groups united into one pool of space)
Those who dealt with NetApp in earlier years probably know that once an aggregate is more than 90% full of data, the system becomes extremely sluggish and does not want to work. Marketing stubbornly claims that this problem has been completely eliminated since version 8.1 (or maybe even 8.0, I honestly do not remember exactly). In reality, that is a complete lie. We did not keep an eye on the space and filled an aggregate to all of 93%, and that was it, hello. The response time of the system as a whole nearly doubled. And that is with us not using deduplication, compression or the other goodies, only thin volumes (thin provisioning). The system only let go once space was freed back down to 85%. Period.
Better yet, try to fill an aggregate to 85-95%, create a dozen SnapMirror sessions in different directions, load the system to about 80% on I/O, and then delete a volume one fifth the size of the aggregate. The result: more than an hour of non-working databases because the system withdrew into itself, with response times on all volumes ranging from 300 ms to 5 s. It did not respond to the console; we even considered rebooting the controller, but first we urgently started taking all the load off it and killed the replication on the receiving side, and after a while the system began to recover. NetApp's recommendation: delete files one by one, do not remove 12 TB in one go via vol destroy.
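Keeping an eye on aggregate fill is trivial, which makes it all the more annoying that we missed it; the aggregate name below is hypothetical:

df -A -h aggr_db01            # used and available space per aggregate
aggr show_space -h aggr_db01  # per-volume breakdown, including WAFL and snapshot reserves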
Problem Five - two aggregates
The joke is that if the disks of one aggregate are loaded to 100%, the second aggregate, loaded at 10%, will also refuse to work, and the latency there will be comparable to the aggregate running at 100%. This is directly related to how NetApp writes data.
The thing is that, unlike classic block storage systems, which can push data straight to the disks (write-through) once the block size exceeds a certain value (usually 16 KB or 32 KB, but configurable), NetApp never does that and essentially always writes in write-back mode because of how WAFL works. Hence the large amount of memory on the controllers, even on the younger models. And this is exactly where the super-low write latency that NetApp is so proud of comes from, as long as the system is not heavily loaded. But the cache is shared by all aggregates: once it overflows, writes stall even on an unloaded aggregate.
Conclusion: do not load the disks beyond roughly 60% if you want stable low latency from the system. And there is no point in carving out a separate aggregate for isolation if it is built from the same kind of disks.
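To figure out which aggregate's disks are actually saturated, the per-disk statistics are the most honest source. On 7-Mode that means statit, which has to be run from the advanced privilege level:

priv set advanced
statit -b      # start collecting detailed statistics
# ...wait 30-60 seconds of representative load...
statit -e      # print them; the disk section shows ut% per disk, grouped by aggregate and RAID group
priv set admin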
A few conclusions
- If the task is to build a DR solution within 50 km, take MetroCluster and do not think twice. The price difference is laughable, and it is completely transparent to the hosts. On the minus side, dedicated dark fiber between the DCs is highly desirable. Although in practice it also works over shared links, we checked. You can even squeeze it into an existing SAN, but you did not hear that from me.
- If the task is to get low latency, do not let disk utilization go above 50-60% (90-110 IOPS per 3.5" SAS disk).
- Try to avoid bursty load; the more even the load, the easier it is for NetApp to digest it.
- Try to avoid mixing different block sizes in the load; WAFL takes that very badly.
- Do not let an aggregate fill beyond 85%.
- Be prepared to replace hardware under warranty. A lot. Often. But these days that applies to all vendors, first and second tier alike.
That is probably all of my observations from a year of living with these systems. The systems are actually not bad at all, and they are very easy to learn. Pleasant administering to all.