
I would like to draw the attention of fellow system administrators, as well as those who hire them, to two diametrically opposed approaches to system administration. It seems to me that understanding the difference between these approaches can significantly reduce the mutual disappointment of both parties.
This is nothing new, but in the almost 15 years I have worked in this field I have seen so many problems, misunderstandings, and even conflicts caused by a failure (or unwillingness) to understand the difference between these two approaches that it seems worth raising the topic once again. If you are a sysadmin who does not feel at ease at work, or a manager who hires sysadmins, this article may be for you.
I will deliberately exaggerate a little to make the difference more vivid.
Black Box Administration
By Black Box administration (by analogy with an opaque, tightly closed box with buttons and lights) I mean a situation where there is a system, instructions for operating it, a set of tricks, and questions and answers on Google, but no information about how the system works internally. We do not know (or do not want to know) what is inside it, how it works, or how its parts interact. And it does not matter: as long as the system is operated under normal conditions, we know how to use it and what to expect. The set of commands and actions leading to the results we need is described in advance, and how the system achieves them is irrelevant; we simply "order" and get the result. Figuratively speaking, it is known in advance (or found by trial and error) which button must be pressed for the desired combination of lights to come on.
Accordingly, administration in this case comes down to, first, maintaining the standard operating conditions under which the system behaves as specified, and, second, "pressing the right buttons" at the request of clients and users. This is done strictly according to the known mapping of buttons to lights; variations are not just unwelcome but contraindicated, because a different path can have unexpected side effects and take the system out of its normal mode into a state not described in the documentation, and then nobody knows how to fix it.
If something goes wrong and the system stops working as it should, the first step is to find out what is wrong with the operating conditions: what differs from the state in which it worked, how to get back to it, how to "turn off all the lights" and return the system to its original state. If that does not help, google the experts' "secret button combinations" and try them all, in decreasing order of similarity to the described situation, until the cherished light comes on or the system returns to a known state. If that does not help either, it is a dead end: either roll back to backups, contact support (if the system has any), or replace (reinstall) the system.
It is worth noting that for a certain number of systems this approach is the only one possible. For example, when the internals of the system are a trade secret of its manufacturer, and studying them is severely restricted by contractual obligations and the company's internal rules. Or in the case of huge complex systems with intricate internal relationships, whose components are supported by different departments that have no access to information and no right to do anything outside their own area of responsibility. In addition, there are many cases where this approach is simply more appropriate. For example, Windows is often easier to reinstall than to figure out what is broken in its depths.
White Box Administration
Accordingly, White Box is when the box is transparent. We have the opportunity to see (and also to understand) how the system works. In this scenario the instructions are secondary: they show how the system is supposed to be used and how it is arranged, but they do not limit us. There is an understanding of how the system works and, as a result, how it will behave in different conditions, including those not described in the documentation.
After some time has been spent studying the system, it becomes clear why the buttons must be pressed in exactly this sequence and why the system should be operated under these conditions. The same result can be achieved in different ways, because we can predict the possible side effects in advance and therefore choose the most effective path for the current task. If something goes wrong, we can see what broke and how, which gear got wedged. We can consciously return the system to its original state, or change the one factor (and only that factor) that prevents the system from working normally. That is, we can proceed from the internal state and needs of the system rather than from the available documentation and past experience.
In this situation the ability to solve problems increases many times over: the "dead end" is reached far less often, and the system can be operated more fully, more flexibly, and more efficiently. But this approach requires mastering, digesting, and keeping in one's head an order of magnitude more information, which takes much longer and is much harder.
There are cases where this approach, in turn, is the only one available. For example, when the system is an in-house development that is constantly changing and being extended, so nobody knows quite what to expect from it, and documentation is often missing. Here, situations will regularly arise that are simply impossible to resolve in reasonable time without understanding how the system is built and without being able to "dig into its gears".
The essence of the problem
From my personal experience, most system administrators feel more comfortable (and are more effective) with one of these two approaches, and accordingly feel ill at ease when they have to spend most of their time working within the other.
Let us consider both options on a real-life example (slightly simplified, so as not to waste time on insignificant but time-consuming details).
A certain site runs on two 8-core machines with 8 GB of memory each: Apache 2 + PHP + MySQL + memcache. During peak hours the system periodically began to slow down terribly; the site responded with delays of 10-30 seconds or did not respond at all.
To begin with, the problem was approached Black Box style.
On both servers the top command showed that there was almost no free memory, the load average hovered around 20, swap was actively used, and the system could not climb out of iowait. Restarting Apache returned everything to normal. After that, an Apache restart was added to cron once an hour, and the problem was safely forgotten for another six months...
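The Black Box workaround above can be sketched in a few lines of shell. This is a minimal illustration, not the original setup: the service name `apache2` and the load threshold are assumptions introduced here.

```shell
#!/bin/sh
# Black Box workaround sketch: look at the symptoms, then blindly restart.
# The service name and the threshold are illustrative assumptions.

# What 'top' showed: load average ~20, swap in use, system stuck in iowait.
LOAD=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average, e.g. "20.13"
LOAD_INT=${LOAD%.*}                   # integer part, for the comparison below

THRESHOLD=15
if [ "$LOAD_INT" -ge "$THRESHOLD" ]; then
    echo "load $LOAD: restarting apache2"
    # service apache2 restart        # commented out: requires root
else
    echo "load $LOAD: within normal range"
fi

# The article's "solution": restart Apache from cron once an hour.
# Illustrative /etc/cron.d entry:
# 0 * * * * root service apache2 restart
```

The point of the sketch is the shape of the approach: no diagnosis of the cause, just a known action that returns the lights to the expected combination.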
What exactly happened and why remained unknown; the actual complaint was "the site is slow and does not open", and that problem was solved: the site no longer slowed down. Diagnostics took 3 minutes, the fix another 5. That is, in under 10 minutes the problem was resolved while knowing almost nothing about its cause. There was no certainty that this would help for long, or that it was a real solution at all, but (!) it took 10 minutes, and the site in fact worked without problems for almost half a year.
After six months the problem began to appear again despite the hourly Apache restarts. The restart interval was reduced, and complaints appeared that connections to the site sometimes simply dropped and pages loaded incompletely. That is, the solution itself began to create new problems.
Then the same system was examined in more detail, White Box style.
I will omit the details of the process by which the system was studied almost under a microscope and go straight to the conclusions. It revealed:
- Different requests consume very different amounts of memory: a small number of requests eat up to 200 MB, while the bulk consume no more than 5-10 MB. PHP frees the memory, but Apache does not release it from a child process; it keeps it in case it is needed again. As a result, sooner or later at least one heavy request passes through each child, after which that child holds far more memory "for the future" than most subsequent requests will ever need.
- The number of Apache children was quite large, 250, which, as they gradually "fattened" up to 200 MB each, smoothly but inevitably led to memory consumption far beyond what the system had. The system started swapping, everything slowed down, requests were processed more slowly while arriving at the same rate, so the number of simultaneous requests grew, until all 250 children were busy with a queue behind them, all actively "fattening" and swapping.
- In addition, this growing snowball was accelerated by a number of long-polling requests constantly hanging in the background, keeping extra Apache children busy and preventing Apache from reaping idle processes for having "too many unused children".
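A back-of-the-envelope check makes the first finding concrete: with 250 children each eventually growing to around 200 MB, the worst-case footprint dwarfs the machines' 8 GB of RAM. The figures are from the text above; the script is just illustrative arithmetic.

```shell
#!/bin/sh
# Worst-case Apache prefork memory footprint, using the figures from the text.
CHILDREN=250        # number of Apache children
PER_CHILD_MB=200    # what a child can grow to after one "heavy" request
RAM_MB=8192         # 8 GB per machine

WORST_CASE_MB=$((CHILDREN * PER_CHILD_MB))
echo "worst case: ${WORST_CASE_MB} MB resident vs ${RAM_MB} MB RAM"
# prints: worst case: 50000 MB resident vs 8192 MB RAM
```

Even if only a fraction of the children have "fattened", the total easily exceeds physical memory, which is exactly the swapping spiral described above.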
The solution (taking into account project specifics not described here) was as follows:
- nginx was put in front; it does not "fatten" with the number of requests or the length of the queue.
- Based on URL matching, nginx sent heavy requests to a separate Apache instance with mod_fpm, eliminating the "locked up" memory problem at the root; that instance was allowed at most 25 parallel processes and only 5 spare ones (max spare children).
- The "light" requests went to the regular Apache, which stopped "fattening"; just in case, a limit of 1000 requests per child was still set, so that if something did happen, memory would still be released from time to time.
- With the programmers' help, long polling was moved entirely to a small node.js server, which for each client proxies a request to Apache once a second, after first checking a "fresh data" flag in memcache and skipping the call if nothing new has appeared. For Apache these requests are very light: they are no longer long-polling and occur only when there really is new data, so they fly by in microseconds, are barely noticeable, and hardly occupy Apache children at all.
- In addition (thanks to pinba), the scripts themselves were slightly tuned; some of them began to eat less memory and run faster.
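The routing part of the fix might look roughly like the nginx fragment below. This is a hedged sketch only: the location pattern, listen ports, and upstream names are invented for illustration, since the original values are not given in the text.

```nginx
# Illustrative sketch of the split described above; names and ports are assumptions.
upstream apache_light { server 127.0.0.1:8080; }  # regular Apache, light requests
upstream apache_heavy { server 127.0.0.1:8081; }  # constrained instance, max 25 workers

server {
    listen 80;

    # Heavy URLs (the pattern here is hypothetical) go to the constrained instance.
    location ~ ^/(export|report)/ {
        proxy_pass http://apache_heavy;
    }

    # Everything else goes to the regular Apache.
    location / {
        proxy_pass http://apache_light;
        # nginx buffers Apache's response (proxy buffering is on by default)
        # and feeds slow clients itself, freeing the Apache child sooner.
    }
}
```

The design point is that nginx holds the cheap concurrent connections and the queue, while each Apache pool is capped at a process count its memory budget can actually sustain.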
As a result, during peak hours no more than 25-30 light Apache children run at once, plus 5-10 heavier PHP processes via mod_fpm; node.js ticks along, occupying no more than 2-5 Apache processes at a time with its micro-requests. If a queue forms, nginx holds it effortlessly while consuming almost no CPU, since it does nothing but proxy, and thanks to its architecture it has no difficulty maintaining several hundred simultaneous sessions. In addition, nginx buffers Apache's responses and delivers them to slow clients itself, which lets Apache finish with each request sooner.
The average load average on the servers at "rush hour" now hovers around 0.2-0.5. Memory consumption is about 2-3 GB for all processes; the rest is cache. Swap is not used. Response time no longer changes at peak and is roughly the same as during quiet hours (when only 2-3 clients are on the site).
The number of clients the site can serve without load problems has increased roughly tenfold (beyond that, difficulties with the database begin).
That is, the problem is solved again, but this time with a huge margin and with a clear idea of what to expect and how long it will keep working. Everything is reasoned, thought through, balanced.
Solution time: 2 weeks...
Results
At the risk of earning the title of "Captain Obvious", I will move on to the consequences of this divide:
- Before hiring a sysadmin, estimate which type of system he will have to service at your company. A black-box guru on complex, home-grown, ever-changing systems is unlikely to be as useful as a white-box guru, and is unlikely to be satisfied with the work. A white-box guru on stable, well-established systems that should not be "climbed inside" will find no place for himself and will most likely work only formally, spending all his free time on personal projects and experiments, or will constantly try to "redo everything here properly, not the way it is now".
- A sysadmin should understand for himself which approach is closer to his heart, and choose a job with that understanding in mind.
- A black-box guru solves problems very quickly and just as quickly takes over maintenance of new, well-documented, widely used systems. The results are stable and predictable. He prefers to solve problems uniformly and predictably (often a huge plus, especially when working in a team), but not always optimally.
- A white-box guru spends considerable time studying a system, but then produces much more effective solutions. He can handle more complex tasks, including those where a black-box guru has hit a dead end, but not as quickly and not in such volume. At the same time he is practically useless at rapid "firefighting": instead of simply restarting Apache, he will examine what is happening "hot on the heels" and study the system in its unhealthy state "while it is visible".
- A large company cannot do without admins of both types on the team: while some quickly "put out the fire", making the system work at least somehow, others calmly get to the roots of the problems and make sure they never recur. And the second group should not be forced to do what the first does well, or vice versa; nothing good will come of it.
- The most valuable, and also the rarest, people are those who can work successfully with both approaches, though even they prefer (are more comfortable with) one of the two.
When choosing an employee or a job, remember these few points, and you may save time, money, and nerves. That is probably all on this topic. Thanks to those who read it. :)