
How I implemented Quest (now Dell) Foglight

An opus on how not to choose and implement a monitoring system

Hello, dear Habr readers.

Let me tell you the long story of one company with a very small hosting team that suddenly decided to upgrade its monitoring system. It is a story of a long and thorny path, one that only now, after almost two years, has reached that remarkable and ambiguous state known as maintenance mode. If the story sounds interesting to you, read on.

So, two years ago it was decided that SolarWinds ipMonitor, which we had used successfully for many years, had exhausted its possibilities. The company grew, the number of servers in each cell grew, the number of cells grew as well, and it was decided that ping, telnet and searching for a word in the page source were no longer enough. Alongside this system there were also a great many scripts written by various engineers and, naturally, left undocumented. The scripts broke regularly, sometimes in non-obvious ways, and the quality of the service we provided suffered as a result.

At one of the VMware presentations, my boss noticed a monitoring system with “enormous potential”: a bunch of indicators, buttons, graphs and analysis tools, in short, a lot of eye candy for the unjaded head of a five-person hosting department. This miracle was called Quest Foglight Monitoring System (FMS below). Without delay, the senior engineer was asked to contact the vendor and do a test deployment. After several weeks of “hard work,” the engineer gave the go-ahead.

Of course, the boss suggested that we all familiarize ourselves with the system before the purchase and asked for our opinions. That was the point of no return: we accepted the senior engineer's arguments blindly, since no one had relieved us of our main work, and spending time on something the senior engineer had already signed off on seemed pointless.

Then the price was announced. Of course, we wanted absolutely all the functionality available, and the price had quite a bite. The vendor tried to persuade us to buy several months of their professional services (PS), but someone found those services too expensive. After all, we had somehow managed with what we already had, so we would manage this too, right? O Great Vishnu, how wrong that opinion turned out to be. Instead, a three-day training package was bought for the whole team, and one week of PS and “some customizations” were ordered as well. Experienced IT people from larger mid-sized businesses are probably already giggling and twirling a finger at their temples. Hosting people probably just sigh, perhaps amazed at the boundless short-sightedness of everything described above.
The problems started a minute after the consultant's time ran out and we were handed over to the customer support department. It began with the fact that our senior engineer had given the vendor a deployment plan that described his test sandbox. The vendor was probably happy to sell a top-configuration monitoring system to people with three dozen virtual machines and one database, but in reality we had several hundred virtual machines on several chassis, several clusters of database servers, and all of it spread across opposite ends of the continent. At that point we could not imagine how voracious FMS would be in terms of resources. After creating all the agents for the databases, vCenter and the infrastructure, we suddenly found that the system had hung solid. We called support, and they rubbed our noses in the deployment plan and declared that, had we told them the real size of our needs in advance, the requirements would have been different. Two days later, the senior engineer quit. This is where I appear on the scene: still far from senior, and with no say in choosing my own projects.

My first thought was “Should I quit now, too?” But Russians don't give up, right? First, I secured dedicated hardware for this fun: two old Dell 2950 servers running ESXi. I could not get a separate server for the database, so it had to live in a virtual machine on the same hosts.

A short description of the FMS architecture


FMS consists of:
1. Management Server. There can be several of these in an active/passive cluster of the vendor's own implementation. This is the central point that controls everything.
2. Foglight Agent Manager. The agent manager is a Windows service (or a daemon, if you prefer) that can be installed in several instances for different purposes. We split ours into VMware, SQL staging, SQL production and OS, so that a problem with one type of agent does not force us to interrupt all monitoring.
3. Foglight Agent. Agents exist for every occasion, both bought from the vendor and written in-house.
4. Database. Nothing special here: we use SQL Server 2008.
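
The split across agent managers in item 2 can be pictured as a simple mapping. The sketch below is only an illustration in Python, not Foglight configuration: the manager and agent names are hypothetical, and the real product is configured through its own tools.

```python
# Hypothetical layout: one agent manager per monitored domain.
# All names are made up for illustration; they are not Foglight identifiers.
AGENT_MANAGERS = {
    "am-vmware": ["vcenter-agent"],
    "am-sql-staging": ["sql-staging-agent"],
    "am-sql-production": ["sql-prod-agent-1", "sql-prod-agent-2"],
    "am-os": ["os-agent-web", "os-agent-app"],
}


def agents_interrupted(manager):
    """Agents that stop reporting while the given manager is down."""
    return list(AGENT_MANAGERS[manager])


def agents_unaffected(manager):
    """Agents on all other managers, which keep reporting."""
    return [
        agent
        for name, agents in AGENT_MANAGERS.items()
        if name != manager
        for agent in agents
    ]
```

With this layout, restarting the hypothetical `am-vmware` manager interrupts only the vCenter agent, while the SQL and OS agents keep collecting data.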

Pretty quickly, I realized that working with this was simply impossible. The system was slow even with adequate resources. The rule manager page could take anywhere from five to fifteen minutes to load the list of rules. The call to support had an unexpected result: they knew about the problem and promised to fix it in the next version... a quarter later. In the meantime, management demanded results and accepted no excuses about our version being slow, because a considerable amount of money had been spent. Gritting my teeth and inventing workarounds, I got everything more or less working in another six weeks, and then the clocks changed.

What does daylight saving time (DST) have to do with it, you may fairly ask? The fact is that this rather mature system had a bug. No, surely that cannot happen. Surely such bugs never reach production. With the clock change, the database began to grow uncontrollably. Two days later, when it hit the limits of the disk, monitoring messages suddenly stopped arriving, and that is exactly when an outage occurred and we learned about it from our clients. It was very unpleasant: there was a debriefing, lost bonuses and other nice things. Another call to support: “Yes, of course, we know about this problem, here's a script for you.” Support could not give a clear answer about when a fix would be available, and the patch did not come out before the next clock change. This time, though, we knew about the problem and were waiting for it to show up, and it did not disappoint.
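
One takeaway from this incident is that the monitoring system's own database needs a check that does not depend on the monitoring system itself. Below is a minimal sketch in Python, assuming you can collect (day, used GB) samples of the database disk by some independent means. The function names and the seven-day threshold are made up for illustration, and nothing here is Foglight functionality.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk is full from (day, used_gb) samples.

    Uses the average growth rate between the first and the last sample.
    Returns None when usage is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    rate = (u1 - u0) / (t1 - t0)  # GB per day
    if rate <= 0:
        return None
    return max(0.0, (capacity_gb - u1) / rate)


def should_alert(samples, capacity_gb, min_days=7.0):
    """True when the projected time to a full disk is below min_days."""
    remaining = days_until_full(samples, capacity_gb)
    return remaining is not None and remaining < min_days
```

For example, a database that grows from 100 GB to 150 GB in one day on a 200 GB disk has about one day left, so `should_alert([(0, 100.0), (1, 150.0)], 200.0)` returns `True`. A linear projection is a deliberately crude choice here; it is enough to catch runaway growth of the kind described above.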

After using the system for some time, we began to understand that, first, the customizations we had ordered simply did not work and, second, we simply did not need them. We needed different ones, but here came the bad luck: the vendor was bought by Dell, and the pricing policy changed somewhat. Management demanded that I urgently write the required customizations myself. The thought that it would be nice to quit visited me again, because I have never been a programmer. My heart is just not in it, and that's that. But Russians don't give up, right? I learned Groovy, the scripting language all of this runs on. While learning, I realized that almost half of the functionality we had purchased could work better if I simply rewrote it for our specific needs. I rewrote it, and at the same time I stopped telling management that I hated this product, because by then it was already 30% my own product: in all the time the implementation took, no other engineer touched it at all, even though I asked for help.

And then came the cherished hour: a new version was released in which, O Great Vishnu, both the slow page loading and the hated DST bug were fixed. I confess, I celebrated that day. It was the end of the constant nervous tics and the coffee runs “while the page loads”. This event finally ushered in the long-awaited maintenance mode. Now I only occasionally change alert thresholds at colleagues' request and occasionally write new agents that have nothing to do with the infrastructure but simply notify customers of purely customer-side problems, such as users of our product being locked out. Now I am a lead, and now I know for sure how to choose and implement software.

I will try to present my seemingly obvious conclusions.

1. Do not buy the full feature set up front unless you are firmly convinced that you need it. Checking whether you really need it is easy: hire a consultant with experience in this particular software. Believe me, that is much cheaper than what we paid for cartridges that are no longer used.

2. Do not rush. Nothing terrible would have happened if we had stayed on what we already had for another six months. You can always find a few old servers, and no one except the vendor's sales manager is pushing you to pay here and now.

3. Understand the specifics of the staff you have. Do not entrust the evaluation to a single person, especially if that person is poorly motivated.

4. Do not skimp on implementation. Really, it is not worth it. The vendor usually does want to get you into production as soon as possible, because that is when they get paid in full, and the consultants also benefit when everything goes well. If the vendor says it will take months with their own staff, that is most likely true. If the budget has no money for this, stop, because you will pay anyway, only more.

Source: https://habr.com/ru/post/214409/

