Don Jones "Creating a unified IT monitoring system in your environment." Chapter 6. Unified management with examples

Well, we have finally reached the last chapter of the book. It looks at some practical examples; for ethical reasons, the author names almost no specific systems, apart from very well-known ones. The chapter compares the state of affairs before and after the introduction of unified management systems.

Content

Chapter 1. Managing your IT environment: four things you do wrong
Chapter 2. Eliminating siloed management practices in IT management
Chapter 3. Combining everything into a single IT management cycle
Chapter 4. Monitoring: a look beyond the data center
Chapter 5. Turning problems into solutions
Chapter 6. Unified management with examples

Chapter 6. Unified management with examples

In the final chapter of the book, I would like to run through the contents of the previous five chapters once more, this time in the form of case studies. I was lucky enough to talk with several clients I had consulted for; they were trying to overcome problems like the ones discussed earlier. Not long ago they began using solutions built on an approach that matches the one described in this book. They agreed to let me retell their stories without naming the companies or the people involved, so we can see what they had before, what they ended up with, and how “unified management” should actually work. I will also share some of the obstacles they ran into and the challenges they had to overcome. The transition to unified management is not always painless, and I think you will appreciate seeing how it was done and what the initial plans were.

This chapter also includes practical information on unified management that was not covered in previous chapters. I will summarize the functions of a unified management solution, so that you can keep the list in front of you when you evaluate specific products. We will also look at the various sales models offered by vendors, so that you have an idea of the flexibility available when choosing and implementing your solution.

Case studies

A unified management solution needs functionality that I personally divide into two fairly broad areas. The first helps you respond to problems; the second helps you handle requests that are not related to actual problems, such as requests for changes in the environment. I have a story for each of them. Both come from the same customer, although in each conversation you will meet different people in different parts of the organization.

Problem detection and resolution

Lisa is a senior system administrator who is mainly responsible for the Windows-based systems in her environment. Her colleague Peter is responsible for the Unix and Linux server infrastructure. Their areas overlap significantly, because most of the company's business applications depend on both Windows and *nix resources. “Of course, it's not just the servers,” Lisa told me. “It's also everything that runs on those servers: databases, web services, well, you know. There are other people who support all these different parts, so we sometimes spent a lot of time arguing about whose fault a problem was.”

I asked her for an example of how things worked before they implemented a unified management system. She laughed and showed me a file she kept at the time. It looked like a collection of notes from help desk tickets. I reproduce its contents here with the names changed. I have added a few [editorial] notes where I had to ask Lisa for further explanation.

OPENED HelpDesk 2009-06-06 13:34
The user insists that the BOS [business application] is extremely slow.
Several emails on the same topic are already in the q [queue]. Server BOSDB02 responds slowly to pings.

ASSIGNED LHART [this is Lisa]

NOTES LHART 2009-06-14 15:26
BOSDB02 is working fine, except that SQL is eating 100% of the CPU. Passing to the DBMS administrator.

ASSIGNED DShields [this is the DBMS administrator]

NOTES DShields 2009-06-14 16:53
Probably the indexes again; SQL is taking longer than it should to run queries. We plan to rebuild the indexes this evening.

NOTES HelpDesk 2009-06-15 10:44
We are still getting calls about this.

NOTES DShields 2009-06-15 11:12
Indexes rebuilt.

NOTES HelpDesk 2009-06-15 11:34
Still getting calls about BOSDB02 - slow ping.

NOTES DShields 2009-06-15 13:12
SQL is still slow - looks like a disk I/O problem. Disk fragmentation? Need server support.

ASSIGNED LHART

NOTES LHART 2009-06-15 13:47
Server disk fragmentation is under 2% - the problem is not there. I/O is slow because SQL hits the disks very often. Perhaps the database is fragmented. Sending it back to you.

ASSIGNED DShields

At this point the conversation moved offline, because the next entry simply says “problem solved.” Unfortunately, there is no official record of what went wrong or what was done to fix it, but Lisa explained: “We kept passing the problem back and forth. Peter would see something in Performance Monitor that suggested the server was slow, so he would throw it to me, and I would tell him his SQL Server was to blame and send the problem back. I did not have the permissions to see what was going on inside SQL Server, and he just wanted to get the ticket out of his queue.”

“In the end it all came down to a problem with the SAN, which Peter was responsible for. Something had happened to our main link to the SAN and we were running over a slow backup connection; on top of that, the link was misconfigured, so it was not running at full speed. We saw slow disk throughput because, as far as Windows was concerned, the SAN was just one large, logically attached drive. We ran every test we could on the server and on SQL Server itself, trying to find the source of the problem, but none of our tools could show that the real problem was somewhere else entirely.”

Peter also remembered the incident. “The strange thing was that everything looked fine from the outside, and none of the systems I used to monitor the SAN raised any alarms. The problem was in the configuration of several of our hosts. But the utilities did not report any trouble, even though server access to the SAN was much slower than usual.

“The real problem was that it showed up on several servers at once. We did not connect the dots right away, because each host used the SAN in its own way. The storage network served not only a large DBMS but also a small web farm and a file server. The symptoms users noticed were all different, and the problems kept landing with different specialists. I got mine from the guys who run the file servers. They saw how fast the disk queues were growing, knew it might have something to do with the SAN, and so they pulled me in.”

“That turned out to be the source of the problem, but only after we had spent a lot of time,” said Lisa. “Each of us thinks first about what we are responsible for, but there are now so many overlaps and interdependencies between systems that when a problem occurs we cannot see it from our own level, because we are tied to our own tools.”

I also talked to Kevin, who runs the company's help desk. He said that cases like this are particularly hard on his team: users keep calling, and the help desk has no idea where to route them and cannot say anything about the cause of the fault or the current state of affairs. “Users describe the problem in different words, and each help desk operator opens a new ticket. Of course, we would only slow the specialists down by distracting them with more tickets on the same topic, but we really had no way to link them. Normally, when you take a call, you check whether a similar ticket is already open, but we had no single place where we could track all currently open problems. In the end I even put up a board listing the issues that needed special attention, so that on an incoming call the operator could at least see whether the problem was already open, then find the associated ticket and tell the user on the phone where things stood.”

I asked Lisa how things work now that the company has implemented a unified management system. “We have been working with it for about a year,” she told me, “and everything is different.” She showed me a ticket for a problem that had occurred just recently: “This is what we see now.”

ALARM 2011-06-14 12:13:42
NODE Windows Server BOSDB02
SQL Server instance = DEFAULT
SYMPTOM: SQL Server response time is outside acceptable limits.

IP: 10.10.15.212

SQL Server database shows 34% free
SQL Server database fragmentation <5%
Disk Queue <1
Network utilization <40%
CPU utilization <60%
Memory utilization <75%

RELATED ALARM 2011-06-14 12:10:52
NODE MBS3667 Router
Interface fault

“Look, here you can start guessing right away what the matter might be.” She showed me the monitoring console that the entire IT department now works in, with information similar to Figure 6.1. “You can see a simple network diagram. It shows not only servers and services but also network elements such as switches and routers. When a server signals that something is wrong, the system also collects alerts from everything that server depends on, such as a router. In this case the culprit was a router interface that had started dropping packets. The system assigned the problem to the specialist responsible for that area and also raised an alarm on every server connected to the router, because the clients and the monitoring system saw response times starting to grow. Having this data saved us a lot of time hunting for the source of the problem. The system is configured to run basic checks automatically, so when a problem occurs it gathers the preliminary data itself, without our involvement.”

Figure 6.1: Visual tracing of alarms.
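
The dependency-aware correlation Lisa describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of the customer's product: the dependency map, the function name related_alarms and the host WEB01 are invented, and the link from BOSDB02 to the MBS3667 router is inferred from the ticket above.

```python
from collections import deque

# Hypothetical dependency map: each node lists the nodes it depends on.
# BOSDB02 and MBS3667 come from the ticket above; the link between them
# is an assumption, and WEB01 is an invented example host.
DEPENDS_ON = {
    "BOSDB02": ["MBS3667"],
    "WEB01": ["MBS3667", "BOSDB02"],
    "MBS3667": [],
}


def related_alarms(node, active_alarms, depends_on=DEPENDS_ON):
    """Return active alarms on anything `node` depends on, directly or transitively."""
    seen = {node}
    queue = deque(depends_on.get(node, []))
    found = []
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also protects against dependency cycles
        seen.add(dep)
        if dep in active_alarms:
            found.append((dep, active_alarms[dep]))
        queue.extend(depends_on.get(dep, []))
    return found


# Alarms currently raised, keyed by node name.
active = {
    "BOSDB02": "SQL Server response time outside acceptable limits",
    "MBS3667": "Interface fault",
}

for dep, message in related_alarms("BOSDB02", active):
    print(f"RELATED ALARM on {dep}: {message}")
```

A breadth-first walk is used so that the nearest dependencies are reported first, which roughly matches the order in which an administrator would want to check them.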
Lisa also said that the team now spends significantly less time passing tickets back and forth. Once the system is viewed as a whole, it becomes much clearer where a failure occurred. “The problem becomes huge when it is outside the data center. We have a lot of applications that run through SalesForce.com, and if something happens on their side or, more often, one of the providers starts running slower than usual, our users see ‘our’ application slowing down. But the monitoring system knows about the dependencies, and by that point it has usually already notified us that a problem is starting. We send out a message about the applications that depend on those services, and we call the service provider to open a ticket with them.”

Kevin says that these notifications are a real help to the help desk. “We have a web portal where users can open tickets, and it also shows the current system status. Before they open a ticket, they can look and see what we already know about the problem. Once we taught them to use the system and trust it, they stopped opening duplicate tickets.”

He acknowledged that this training was a big step forward. “We did not do it at first, but once users realized that we were being honest with them and were on top of the problems, they began to trust us more. We put a lot of effort into it, and now we even have mailing lists that users can subscribe to so that they get a message if something happens to a system. When we work proactively, ahead of the curve, it takes a very heavy load off us.”
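
The “single place to track open problems” that Kevin first improvised with a board, and that the portal now provides, might look roughly like the following sketch. It is an assumption-laden illustration, not the portal's real design: the class names Incident and StatusBoard, their methods, and the idea of keying incidents by affected service are all invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    summary: str
    reporters: list = field(default_factory=list)


class StatusBoard:
    """One place that tracks open incidents, keyed by affected service (hypothetical)."""

    def __init__(self):
        self.open_incidents = {}

    def report(self, service, user, summary):
        """Attach the user to the open incident for a service, or open a new one.

        Returns (incident, created), where created is False for a duplicate report.
        """
        incident = self.open_incidents.get(service)
        created = incident is None
        if created:
            incident = Incident(service, summary)
            self.open_incidents[service] = incident
        if user not in incident.reporters:
            incident.reporters.append(user)
        return incident, created

    def resolve(self, service):
        """Close the incident and return the users who should be notified."""
        incident = self.open_incidents.pop(service, None)
        return incident.reporters if incident else []


board = StatusBoard()
board.report("BOS", "user1", "BOS is extremely slow")
_, created = board.report("BOS", "user2", "application hangs")
print("new ticket opened:", created)
print("notify on resolution:", board.resolve("BOS"))
```

The second report does not create a new ticket; the caller is simply added to the list of people to notify, which is the behaviour the mailing lists give Kevin's team.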

The advantages of a unified management system for this team were clear: faster problem resolution, fewer tickets passed back and forth, and more active communication with end users. What were the biggest problems they ran into?

“Trust,” Lisa told me. “We had to trust the new monitoring system the way we trusted the tools we already knew. At first, when something went wrong, we went back to those tools to solve the problem, but once we realized we were seeing the same data, we began to trust the new system more, and from a certain point on we relied on it alone. We still dig out our good old tools from time to time when we need to go deep into a misbehaving system, but by then we already know exactly where the problem lies, so it does not take long. At that point there is no need to play football with the ticket: you are already in the right problem area, and all that remains is to establish the exact cause.”

Fulfilling user requests

Kevin talked about another side of unified management. “We don't only open tickets for problems. We also open tickets for routine change requests.” I asked him for an example of how this was done before the unified management system was implemented. He showed me a ticket from the archive:

OPENED HelpDesk 2010-08-10 12:50
User BDOUDS needs a new SharePoint site hosted at
intranet/projects/universitybid. The user will be the site administrator.

ASSIGNED JHOLTZ

NOTES JHOLTZ 2010-08-13 08:27
Message sent to Bill's manager for confirmation. Also sent a message to the special projects department.

NOTES JHOLTZ 2010-08-16 11:12
Bill's manager, KHiki, approved the request. Still waiting for confirmation from the special projects department.

NOTES JHOLTZ 2010-08-18 11:05
Still waiting for a response from special projects. I have put work on the virtual machine on hold for now.

NOTES HelpDesk 2010-08-20 10:34
User requests status.

NOTES JHOLTZ 2010-08-20 11:34
Tell him to get in touch with the special projects department. I need confirmation from them, since this comes out of their budget.

NOTES JHOLTZ 2010-08-22 13:11
Confirmation from special projects received. Brought up the site and assigned user BDOUDS as the site administrator.

STATUS SET COMPLETED 2010-08-22 13:12

“That happened all the time. Someone would call us about access or something else. We would assign the ticket to someone in IT, and then they would start working out who was actually responsible for it. We ended up compiling a thick book,” he added, pointing to a fat three-ring binder on his shelf, “so we could look up who was responsible for what. And then you had to try to get an answer out of them and wait... How long could it take? This particular request took us two weeks. It is idiotic, of course, but all that time users kept calling us for a status update and we could not tell them anything, because we did not know anything. The work itself, after Jeff received the approval, took just 10 minutes.”

And what does this look like with unified management in place?

“Actually, pretty good,” Kevin said. “We now have a large online catalog with everything a user might need. It works like an online store: the user places a request and the system automatically opens a ticket. Each catalog item is tied to a workflow, so IT hears nothing about the request until it has passed through everyone who has to review and approve it. By the time we see it, that part is already done, and all we have to do is start and finish our work. For some things we initially had to rework the underlying scripts, so now a lot of the load is off us.” The organization developed and documented the desired workflow for each product (internal service). Kevin gave an example of that documentation, shown in Figure 6.2. “This kind of process documentation matters because we put a lot of effort into implementing the workflows. Business owners can use these diagrams on their own once they are linked to the relevant products in the catalog.”

Figure 6.2: Documented workflow used for automated approvals when a catalog item is requested.
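
The catalog-plus-workflow arrangement Kevin describes, where a request reaches IT only after every approver has signed off, can be sketched as a small state machine. Everything named here is hypothetical: the CATALOG structure, the role names and the Request class are assumptions made for illustration, loosely modeled on the SharePoint request from the archived ticket.

```python
from dataclasses import dataclass, field

# Hypothetical catalog: item name -> ordered list of approver roles.
CATALOG = {
    "sharepoint_site": ["requester_manager", "special_projects"],
}


@dataclass
class Request:
    item: str
    requester: str
    approvals: list = field(default_factory=list)

    @property
    def pending(self):
        """Approver roles that have not signed off yet, in workflow order."""
        return [role for role in CATALOG[self.item] if role not in self.approvals]

    @property
    def status(self):
        """Status text the requester could see in the catalog at any time."""
        if self.pending:
            return f"waiting for approval: {self.pending[0]}"
        return "assigned to IT"

    def approve(self, role):
        """Record an approval; only the next role in the workflow may approve."""
        if not self.pending or role != self.pending[0]:
            raise ValueError(f"{role} is not the next approver")
        self.approvals.append(role)


req = Request("sharepoint_site", "BDOUDS")
print(req.status)
req.approve("requester_manager")
print(req.status)
req.approve("special_projects")
print(req.status)
```

Because the status is derived from the recorded approvals, the user can always see whose desk the request is sitting on, without calling the help desk.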

As an example, we discussed access permissions, and I asked how access used to be managed once someone had obtained it. “Nothing was done,” Kevin admitted. “Once a user got access, they kept it until they left the company. We did not track it. Now it is visible in the catalog. If you no longer need something, you can ‘return it to the store’; the return goes through its own approval workflow and we get a ticket that says which access to remove and from where. Managers periodically review who has access to their resources and tell us what should be removed or kept. IT no longer does that work itself.”

I pointed out that an automated workflow does not necessarily guarantee a quick response. “Oh yes, for some requests users still sometimes wait two weeks for approval, but if they place the request through the catalog they can check its status themselves. They can see that it has not reached us yet, and they can try to speed it up on their own by chasing their managers or the resource owners. We do not handle anything that is still in the approval cycle, and users know that; besides, the status shows that we do not have the request yet.” Systems like this keep users better informed and help them understand where, and at which stage, their request has stalled.

A checklist for choosing a unified IT management system

I would like to use this section to list what I consider the required features of a unified management system. As you evaluate candidate solutions, make sure this functionality is present, and check that it works the way you expect and is useful in your environment.

Workflows. Unified management solutions should offer workflows that help automate the coordination and management of services. Workflows should be built as far as possible with ordinary mouse actions, so that programming is kept to a minimum.
[...]
Interface. [...]
[...]
SLA. [...]
Trends. [...]
[...]

Obviously this list is not exhaustive, but it gives you a starting point. If a candidate solution offers this functionality and meets your organization's specific needs, it may be worth a close look and a live trial. Do not just tick the box: make sure you understand in detail how each feature is implemented in the particular system, and check that it meets your organizational requirements.

Ways of acquiring a unified IT management system

I would like to briefly describe the different approaches vendors take to delivering unified management solutions. Let me stress at the outset that I do not consider any of them “correct” or “incorrect.” The correct option is the one that is right for you, and only you can decide what that is.

Solutions of this kind are usually priced by the number of nodes you need to manage, and possibly also by the number of users in your organization. A “node” usually means any managed device: a router, a server, and so on. Some vendors are more creative with their licensing models than others, but do not let the complexity intimidate you. In some cases more complex licensing rules work in your favor, because the vendor is trying to accommodate a wide variety of customer situations. Pay more attention to what it is you are licensing.

At one end of the spectrum you will find what I call monolithic solutions. Here you get, and pay for, every function, whether you need it right now or not. I think it is very important to know that you are getting a solution that does everything you need, although I am not sure you want to pay for everything on the list. Sometimes it makes more sense to roll a solution out in stages, licensing only the functionality needed for a particular project phase. That lets you grow the product's capabilities over time and save on full licensing. The advantage of monolithic products is that they often have good internal integration, because everything is built as one system.

Next come pluggable frameworks. I would put solutions like HP OpenView in this category. With these systems you buy a base product and then buy additional parts and modules for it. They offer more flexibility, and if you work with a major vendor you will find a solution to almost every task in its catalog. The risk is that they turn into massive projects that take a lot of time and effort, and the modules may not be as well integrated as you need. The licensing scheme can be very complicated, because plug-ins are licensed separately from the base product.

Another licensing model is pay as you go. In this model the solution offers all the functionality you might need, but you do not switch it all on at once. Instead you activate only what you need and pay only for that. As your needs grow, you pay a little more. This is closer to the “cloud” model, where your needs increase gradually but you pay only for what you actually use. You do not need to acquire plug-ins separately here, and where plug-ins do exist they are usually supplied by the same vendor as the solution. A growing number of my clients favor this approach.
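
To compare the three models for your own environment, it can help to put them side by side as simple per-node arithmetic. The sketch below is purely illustrative: the function names, the feature names and every price are invented, and real vendor pricing will be structured differently.

```python
def monolithic_cost(nodes, price_per_node):
    """Every function is licensed for every node, whether it is used or not."""
    return nodes * price_per_node


def modular_cost(nodes, base_per_node, module_prices, modules):
    """Base product plus each purchased module, all licensed per node."""
    return nodes * (base_per_node + sum(module_prices[m] for m in modules))


def pay_as_you_go_cost(nodes, feature_prices, active_features):
    """Only the features that have been activated are billed, per node."""
    return nodes * sum(feature_prices[f] for f in active_features)


# Hypothetical numbers, chosen only to make the comparison concrete.
NODES = 200
MODULE_PRICES = {"sql": 15, "network": 10, "helpdesk": 20}
FEATURE_PRICES = {"monitoring": 30, "sql": 15, "network": 10, "helpdesk": 20}

print("monolithic:   ", monolithic_cost(NODES, 100))
print("modular:      ", modular_cost(NODES, 40, MODULE_PRICES, ["sql", "network"]))
print("pay as you go:", pay_as_you_go_cost(NODES, FEATURE_PRICES, ["monitoring", "sql"]))
```

Rerunning the comparison with the features planned for each project phase shows how the cost of the modular and pay-as-you-go models grows over time, while the monolithic price stays fixed.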

The last thing to think about is where the solution will be deployed. In the age of “clouds” you have a choice: host your monitoring and management solution inside your own data center, or buy it as a service hosted in a provider's data center. Either way, software agents are installed in your environment. I will not go deep into the “on-premises versus hosted” debate; you may already have decided what works for you and what does not, but you will certainly need to consider it for any specific solution. Whichever strategy you choose, it is good if your solution supports both options.

Conclusion

This is what unified IT management looks like. The idea that runs through this whole book is simple enough: “bring everything together in one place, on one page.” The only revolutionary thing about it, compared with the disconnected approach, is that our existing technologies are pushing us toward it one way or another.

Of course, I do not expect you to drop everything and start implementing a new monitoring and management solution right away. These things can be done in small steps that have little impact on your organization but let you learn the approaches and techniques in a natural, non-disruptive way.

The main goal is to stop wasting time switching between tools and to bring everything and everyone into a single top-level view of your organization. Integrate it all with the help desk, which will keep all stakeholders informed and give you the metrics you need for an objective analysis of how the IT infrastructure is performing.

Good luck.

Source: https://habr.com/ru/post/175975/
