Of all a system administrator's duties, the most interesting, complex and productive, in my opinion, is the detective work that follows an incident. And unlike real criminology, the system administrator is both the detective and the forensic expert.
I will leave aside incidents involving intelligent malicious intent; that is a separate topic. We are talking about natural problems (the server crashed or hung, the virtual machine started to slow down and then stopped, the application lost a zillion transactions and is convinced that everything is fine).
The essence of the incident
Sometimes it is trivial (“the server spontaneously rebooted”, or “the plane fell”). Sometimes it is extremely difficult to pin down (“customers complain that they cannot change the region”, while every employee with a client account can change the region just fine). Most often, the farther the source of the complaint is from the system administrator, the vaguer the complaint becomes: “a customer says that after ordering a teddy bear in the online store he cannot change the region in IE7 over an LTE connection through a USB modem, and he gets a 500 error when he tries to cancel the operation and press ‘back’”.
Even more difficult is the case when several problems merge together: “the server suddenly rebooted, and on another server there were database timeouts, and at the same time clients wrote that pictures were not loading”. How many problems are there? One, two, three, maybe more? Which problems should be silently merged (the database and the missing pictures), and which should be considered separately? And if at that moment another complaint comes in that a user cannot log in to the system - is that the usual “forgot the password”, or also a symptom? And if there are two such users? Or someone mentions in passing, “my mail isn't getting through”?
Subconsciously, once problems start, each new complaint immediately merges with the existing ones (even though it may have nothing to do with them), and stress rises sharply because you now have to hold not three symptoms in your head but, say, eight - and my head holds only seven well. At the same time, it has happened in my practice that an incoming “new” symptom quickly and easily led to the essence of a problem and its elimination... except that the serious problem (the one it all started with) had nothing to do with the joyfully and quickly repaired nonsense. And the time is gone.
There is no simple advice for such situations. In difficult cases I usually write out everything I hear and notice, without analyzing it, just recording the time.
That is, the journal (kept in sticky notes) looks like this:
- Monitoring triggered on srv1 (22:05)
- (name) mentioned problems with mail (22:07)
- Can't log in to srv12 (22:08) / connection refused - logged in at 22:16, dmesg clean, uptime long
- Can't log in to srv13 (22:10) (timeout) - office wifi dropped (22:11)
- The panel does not open (22:12)
- Support writes that a client complains that nothing works (22:15)
Don't get carried away (there is no time to type much), but the symptoms should be written down. Whether this is one incident or several, and whether a given symptom matters or not, will become clear later. I usually start writing things out at about the third distracting request.
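For those who prefer the terminal to sticky notes, here is a minimal sketch of the same habit: a tiny append-only journal that stamps each note with the time. The file name and format are my own invention, not a recommendation of any particular tool.

```python
# A minimal sketch of "write everything down with the time": append-only notes.
# The file name and format are arbitrary; adapt to taste.
import sys
from datetime import datetime

JOURNAL = "incident.log"

def note(text):
    stamp = datetime.now().strftime("%H:%M")
    with open(JOURNAL, "a") as fh:
        fh.write(f"{stamp}  {text}\n")

if __name__ == "__main__":
    # usage: python note.py monitoring triggered on srv1
    note(" ".join(sys.argv[1:]))
```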
The second aspect of the problem is proving that the problem exists at all. The most hated phrase, and one that cannot be avoided:
Everything works
After Enia Airlines complained to the manufacturer that its planes sometimes fall out of the sky, the developer checked that planes take off and land, and closed the ticket as 'Unable to reproduce'. Enia Airlines support staff continue to collect statistics on crashes and try to learn how to reproduce a crash in the lab.
Finding a problem is often harder than fixing it. The worst scenario imaginable is a stream of complaints from users with completely green monitoring. Moreover, sometimes you look with your own eyes and hands - everything is fine. And the complaints keep coming. That is, “there is a problem” and “there is no problem” at the same time. This is the most disgusting state, and if we are talking not about a primitive “works / does not work” but about subtler things, then with an unfortunate set of circumstances it can drag on for weeks or even months. A real example from my practice: for five months there was a latency of 300-400 s (!) under ordinary load on a disk array. There were no other symptoms. In the lab it did not show up. Debug it however you like.
Surprisingly enough, you feel real joy when you finally learn to reproduce a problem reliably. A real example: at one time it took me more than a year to formulate a scenario for freezing an enclosure with SATA disks. In production I saw this filth regularly, but I could not reproduce it on a test stand or with a synthetic load. In the end I managed it by sheer chance. To LSI's disgrace, there has been no reaction to the bug reports to this day, but at least I have an iron-clad argument in any discussion of the (in)applicability of enclosures under serious load. I must say that I still have no equivalent scenario for SAS disks, although I am almost certain the problem reproduces there just as well. So you could say I have been carrying the same bug for 8 years, but with SAS I have not learned to reproduce it reliably.
Another misfortune of the same kind is interaction between departments. When there is no obvious, reproducible evidence, it is very hard to complain about vague pains and glitches, because the other person sees that “everything works for him”. On trust, a person may go through some checking and debugging motions, but as long as he keeps seeing that everything is fine, the words “actually, everything is bad” rapidly lose their credibility. This can and must be fought, but the phenomenon exists and is objective. It has to be called out in others and restrained in yourself when it is someone else complaining about problems.
Another serious dilemma is the choice between “repairing” and “studying”.
To fix
It does not work. But it should. It must be repaired. Right now. Unfortunately, the repair often destroys all the evidence and circumstances (for example, a server restart loses the state of the kernel, the processes and the contents of tmpfs). So, if there is any opportunity, it is better to move the load onto the servers that are still running and escape the time pressure without getting rid of the evidence.
To study
This is my favourite mode of operation. The problem is localized and excluded from production (from its sensitive places), but not eliminated. You can carefully and gently poke it from every angle, feel the problem out more precisely, understand its scope, study the circumstances and all the direct symptoms, write out all the characteristic lines from all the consoles and describe the behaviour for yourself and for posterity. Once the patient has been squeezed dry, you can try to fix it, starting with the least intrusive methods and moving on to harsher ones only if the gentler ones fail. This yields the maximum experience-per-incident ratio.
Depth of study
After the cause has been found (“the plane fell because its engines stopped working”) and eliminated (“the plane was replaced, after the restart the flight was normal, no complaints have been received from the new passengers, support is handling the complaints and corpse-identification requests of the previous passengers in normal mode”), a little worm of doubt remains: was the cause really identified? The more complex the installation, the more connections there are between the components, the less obvious and deterministic the state of each of them, the less unambiguous the link between the presumed effect and the presumed cause. The question: “why did the engines stop working?” Reading the logs of the nagios in the cockpit, we find the answer: “because they did not receive fuel”. Are the engines bad? No, it seems they did everything right. No fuel - no thrust. The incident report starts to look more meaningful: “fuel stopped flowing to the engines, they stopped and the plane crashed. The pilot redundancy system in the cockpit did not work (the reasons are being investigated).”
How deep should you go in trying to understand “why?” Each next step opens another abyss (investigating the subtlest problems means reading the source code), it is exhausting and takes a lot of time. There is even a saying (alas, not a joke): “Today I learned one more thing I didn't need to know.” This usually happens after reading yet another “masterpiece” of someone's code.
A living example. There is a virtualization monster called XenServer. A mighty corporation (Citrix), enterprise, millions in sales, thousand-dollar licenses, everything done by the book.
... Until an old, tortured sysadmin discovers the following construction in the code that serves virtual machine disks: launching the ls program from Python and parsing its output to get the list of files in a directory. One more thing he did not need to know.
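For illustration only (this is not the actual XenServer code), the anti-pattern looks roughly like the first function below, while the second shows the standard-library call that makes the whole exercise unnecessary:

```python
import os
import subprocess

def list_files_badly(path):
    # The anti-pattern: spawn ls and parse its output. Fragile with unusual
    # file names, locale-dependent, and needlessly forks a process.
    out = subprocess.run(["ls", path], capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def list_files(path):
    # The idiomatic equivalent.
    return os.listdir(path)
```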
... The senior pilot shivered nervously, remembering that the fuel hose is attached to the tank with a perforated condom, because the tank outlet is 11 mm and the hose diameter is 19 mm. The main thing is not to think about it. Maybe it will hold... At least the take-off deadline was not missed.
Each step back requires, first, a deeper understanding of how the next component works and, second, ever greater breadth of coverage, because most often each step back is a search for the superposition of effects from a growing number of components, often beyond the scope of routine investigation, and in each of them you have to dig deep.
Sometimes it is “turtles all the way down” - that is, the breadth of the search becomes too great, the volume of places to look at too vast, and the logs ever scarcer (and less relevant). In that case the search peters out under the growing complexity of the analysis and the lack of resources (time / enthusiasm).
And sometimes the answer is found at the very bottom. For those who have not read the earlier story, a good example:
Admin tales: in pursuit of the fragmentation of tunnels in the overlay network .
But there is a third option: the “breaking change” is found, but questions remain. For example: “why?” (usually its obscene equivalent is what actually gets said). Given that the process starts to resemble an existential search for the light of reason in humanity, the question really is important. It may be stupidity, negligence, a mistake. But what if not?
And then another layer of questions unfolds: why do they (the developers) keep doing things we cannot work with? It is clear that they (the developers) are rascals and villains, and we (the system administrators) are all princes on white horses. But what if not? What if we are the ones using it wrong?
This, in turn, can lead to a major revision of the architecture. A petty, rare problem can suddenly expose a logical inconsistency or a violation of the promises and expectations between components. (Note: “fix it and forget it” gives you no chance to see exactly such things.) A real example: live migration of a virtual machine would sometimes, extremely rarely, fail with error 0x4c (with no further explanation) and leave the virtual machine inoperative in the middle of the migration. After curses on the heads of the developers, the drafting of migration regulations and other nonsense, one day, during one of these hangs, we got deep into the system and began to study the “stuck” machine. Usually such a machine was repaired by restarting it, but this time there was a chance to study it carefully, because the machine was “ours” and could be left in that state for a long time. A close analysis of events from the logs showed that if the virtual machine's memory was increased after the migration had started but before a certain point (a window of about 2 s), the hypervisor got stuck, unable to copy the remaining memory. Are the programmers guilty? Well, yes, a little; they should have thought of this. Are the architects of the final system guilty? Guilty, and more so than the programmers. There is no reason to arrange memory transitions in the middle of a migration. A small fix disabled memory adjustment during migration, and error 0x4c has not been seen since.
The scene of the incident
We return to the detective metaphor.
The more servers in the farm, the harder it becomes to define the “place”. If there is a single server in the farm, then whatever happens (short of losing power / Internet / cooling), all problems are on the Server. The one and only. If there are several servers in the farm, the picture gets more complicated, but not radically: you can examine all the servers in turn and find the “crime scene”.
But what if there are several thousand servers?
Let us assume that by this point it is at least clear what is happening. Next you need to understand the workflow behind it. In complex systems it can be very tangled (off the top of my head: LB / CDN / cache / application server / database; the database is several servers, and you need to figure out whether it is the master or one of the slaves; the application may also live on other servers; the CDN in turn may be tied to object storage (a few dozen more servers) or to a storage system). Some API requests pass through several applications and terminate outside the servers altogether (IPMI, switches, requests to the aliens for random numbers via HWRNG, and so on).
So the workflow can be unwound in reverse order (from the result), or on a hunch, from the suspected cause.
Reverse order is a boring but reliably working method. We go to the load balancer or its logs and see where the requests went; we go to the cache and check whether the response was cached or not; we go to the corresponding application, figure out which instance, read its logs, see which database and which server; we read the database logs and find that the log is crammed with errors - happy end. Long. Dozens of complex queries to the logging system, hundreds of kilobytes of output to sift through, and so on.
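As a sketch of that reverse walk, assuming each component writes plain log files and a common request ID survives across them (the paths and the ID below are hypothetical, not from any particular stack):

```python
# Walk the chain component by component and print every line mentioning the
# request ID. Log paths are placeholders for whatever your stack actually uses.
LOGS = [
    ("balancer", "/var/log/haproxy.log"),
    ("cache",    "/var/log/nginx/cache.log"),
    ("app",      "/var/log/app/app.log"),
    ("db",       "/var/log/postgresql/postgresql.log"),
]

def trace_request(request_id):
    for component, path in LOGS:
        try:
            with open(path, errors="replace") as fh:
                hits = [line.rstrip() for line in fh if request_id in line]
        except OSError as exc:
            print(f"[{component}] cannot read {path}: {exc}")
            continue
        print(f"[{component}] {len(hits)} line(s)")
        for line in hits:
            print("   ", line)

trace_request("6f1c9a7e")   # the ID taken from the user's error page (made up)
```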
There is another approach - the intuitive one. “Something about this behaviour reminds me of... It doesn't look like the DB, this new application is the hooligan here” - and we go off to read the logs chosen, essentially at random, by intuition. For all the arbitrariness of the choice, experience and intuition, guided by secondary but important signs, are often right about where the problem is. The secondary signs can be completely unexpected things: the time (3:59? cron.daily on the database?), the peculiar formatting of the error page (looks odd - styles not fully loaded? let's look at the backends serving static files), features of the message (if it looks like strerror output, you need to dig on the host where it appeared, and if it resembles a python / java traceback, even a fragmentary one, it is better to read around the application), the latency of responses (I don't like the way ssh is twitching - enough log reading, let's go debug the network), the interval between flapping (flapping is when a service bounces between “working” and “not working”) - looks like the orchestration is misbehaving...
I usually make two or three attempts to find the problem by intuition, after which I fall back to the slow path. Sometimes, when along the way it seems that “now I understand everything”, I jump ahead, but in general I have to unwind the chain all the way to the right place.
Event time
General relativity notwithstanding, in the space-time continuum, as in server administration, time is not at all a trivial thing to establish. First, not every event has a time. For example, the application started writing convulsive lines into an unexpected log file. The lines are important, but they carry no timestamp (and the log does not go to logstash, because nobody expected a log to appear there, in /tmp). Or there is a time, but only in very coarse strokes. For example, from atop you can see that the application died somewhere between 12:20 and 12:30. Coarse.
Sometimes, to establish the correct sequence, you need synchronized time from different servers with sub-second accuracy, because it matters whether the request came first and then the trace, or first the trace and then a response to a request that has nothing to do with the accident.
And sub-second accuracy is already quite demanding. First, ntp itself can wander a bit. A real example from practice: in the dark times of openvswitch 1.1, it had the bad habit of adding extra (random) latency to new flows (and eating memory until it stalled, but that is a separate story). And ntp was running on all the hosts. And ntp decided that the jitter was too large. And time on the hosts slowly began to drift apart, not by a little - by 30-90 seconds. This despite the fact that they were all synchronized from a single server on the LAN with a ping of 0.0001 s. The result was a separate nagios check that the time on the servers does not diverge.
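A sketch of that kind of check, assuming the third-party ntplib package and nagios-style exit codes; the reference host name and thresholds below are made up:

```python
#!/usr/bin/env python3
# Compare the local clock with a reference NTP server and alert on divergence.
import sys
import ntplib

REFERENCE = "ntp.example.internal"   # the single LAN time source (placeholder)
WARN, CRIT = 1.0, 5.0                # seconds of allowed divergence

def main():
    try:
        offset = ntplib.NTPClient().request(REFERENCE, version=3).offset
    except Exception as exc:
        print(f"UNKNOWN: cannot query {REFERENCE}: {exc}")
        return 3
    if abs(offset) >= CRIT:
        print(f"CRITICAL: clock offset {offset:.3f}s")
        return 2
    if abs(offset) >= WARN:
        print(f"WARNING: clock offset {offset:.3f}s")
        return 1
    print(f"OK: clock offset {offset:.3f}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```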
Second, logging can write the timestamp after the fact (for heavy traces the delay can be considerable). It is even more fun when one of the victims of the problem is ntpd itself, or simply the time on the server. In my practice there was a case where a program got the time wrong (and stopped accepting signed messages). Forensics showed that the timestamps written into the log differed from the file's modification time, and that became the key to finding the bug (which occurred at the switch between summer and winter time).
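The trick itself is easy to reproduce; here is a sketch that compares the timestamp written inside the last log line with the file's mtime (the path and the timestamp format are assumptions, adjust to the real log):

```python
import os
import re
from datetime import datetime

LOG = "/var/log/app/app.log"                 # placeholder path
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

with open(LOG, errors="replace") as fh:
    last = None
    for line in fh:
        m = STAMP.match(line)
        if m:
            last = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")

mtime = datetime.fromtimestamp(os.path.getmtime(LOG))
if last:
    print(f"last in-log timestamp: {last}")
    print(f"file mtime:            {mtime}")
    print(f"discrepancy:           {mtime - last}")
```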
It also happens that different sources have no common ntp at all, or only a very rough notion of time (support said two hours ago that five minutes earlier a client had said that ten minutes before that everything fell over).
In such conditions, when the chronology no longer fits in your head, you have to start writing again, this time the chronological sequence.
Motive
Programs never do anything for no reason, especially in the region of all the bad code branches. Why did the program get stuck on output? Why didn't the kernel return the IO? Why did nothing appear in the log?
For each actor you need to write out its motive. What for? Why? If there is no motive but there is a suspicion of involvement, you should at least write out what it did and what it was supposed to do.
In an ideal world programs contain no errors; in the real world they do, and people have been working with them for years. Many errors “lie quietly and bother no one” in rare corners of the code until their turn finally comes. Sometimes the fix is not to find and correct the error but to eliminate the circumstances that lead into the corresponding piece of code; both the “with the error” mode and the “without the error” mode are then more or less regular operation. The most striking example of this (not from my practice) is hitting the 512k-prefix threshold of the BGP full view on some older routers (see elsewhere for more information).
The actors
Another important question: who is involved in creating the emergency? It often turns out that the culprit is not only the prime suspect. From practice: a billing request would not work. A glance showed that the mongodb storing temporary samples was not running. Time pressure demanded “restart everything”, so the state of the processes was lost. Subsequent proceedings showed that mongo kept being terminated (an entry in dmesg). However, there were no OOM kills. After a few minutes of studying the vague logs it turned out that mongodb was shutting down quite politely because it lacked the memory to answer a complex query.
And the atop logs showed that ceilometer-collector was swelling up next to it, squeezing and squeezing mongo out of memory. Had there been no atop, nobody would have guessed anything, because it was restarted together with mongo. Although mongo is also to blame, for it did not write clearly what exactly it did not like.
Chain of events
It often happens that the surprise in one process really is a bug - but one triggered by another cause, often a completely unrelated one.
Sometimes the assumption that the system behaves statelessly is violated. Everyone thinks that a restart on the same data gives the same results, but in reality it turns out that the first run and the next one are two very different things. In that case the accident happens on the first start, and after a restart everything passes and works as expected.
The debugging chain gets even more serious if the source of the problem is data in a database. Where did it come from? The data can lie there for a long, long time until the code stumbles over it. A fresh example from practice: ceilometer started crashing on image metadata containing a dot in the key name. The image had picked up the metadata when a snapshot of an instance was created. And the instance had taken the metadata from the host it was launched on (a year earlier). The typo in the host metadata had long since been fixed. And the problem bit a year later.
Building such chains is the most fascinating part of the process. At that moment all the connections in the system, under the suspicious eye of the detective, become crystal clear, and often horrify with their absurdity (once something breaks, you see just how bad it really was). A simple example from practice: in my youth the following decision was made: put an nfs share with temporary files and one-off scripts on a group of servers of the same type. Very convenient, because a file from one server is often needed on another, and keeping them all in one place is handy. It was - until there were problems on the server hosting the nfs. After the reboot... yes, suddenly it turned out that this directory was in PATH, and an attempt to run any program triggered a scan of the nfs share... at the login stage. That time we broke through with ssh /bin/bash -i. I have not given up the practice of such shares since, but I no longer put them in PATH.
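A sketch of the sanity check that follows from this story: warn if any PATH entry lives on a network filesystem, so a dead NFS server cannot hang logins (the list of filesystem types is my own, extend as needed):

```python
import os

NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "glusterfs"}

def network_mounts():
    # /proc/mounts: device mountpoint fstype options dump pass
    mounts = []
    with open("/proc/mounts") as fh:
        for line in fh:
            _, mountpoint, fstype, *_ = line.split()
            if fstype in NETWORK_FS:
                mounts.append(mountpoint)
    return mounts

def risky_path_entries():
    nets = network_mounts()
    return [d for d in os.environ.get("PATH", "").split(":")
            if any(d == m or d.startswith(m + "/") for m in nets)]

if __name__ == "__main__":
    for entry in risky_path_entries():
        print("PATH entry on a network filesystem:", entry)
```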
Digging into the causes often lets you suddenly and very deeply understand what is going on in the system from the architectural point of view. When everything works, it is taken for granted. And when it doesn't... then it becomes clear how things are connected to each other, how they work, and how they do not.
In a sense, I do not feel that I understand how a system works until I have seen how it breaks. Because at the moment of partial breakdown, all the design errors, unfortunate decisions and missing safety margins become visible. Everything breaks; the question is with what consequences. One piece of software reports a fatal error and exits; another ends up in a code branch where handling of such errors was never provided, with consequences of a corresponding scale.
Investigation Conclusion
Writing it is a mandatory part of the forensics. Human memory is bad, especially for illogical and poorly structured data. The write-up is divided into four parts: the characteristic signs (above all, quotations from the logs), a description of what happened, a method of early (or at least some) diagnosis - most often yet another monitoring check - and recommendations for elimination. All of this goes onto the wiki (you do have a corporate wiki, right?) and gets scattered into a set of tasks to carry out. There is another bad tendency that has to be fought: “this has never happened before and will never happen again”. If something strange happened, it must not be ignored. Monitoring, monitoring, monitoring. Plus writing out everything that happened. It sometimes happens that a couple of extremely rare (one-off) incidents suddenly line up into a coherent picture much later... if fragments of the picture remain. So do not be lazy to write it out, do not be lazy to take it apart, and cover it with tests.