Hello! My name is Dmitry Chepel, and I am an expert at Acronis. Somehow it has worked out that people come to me with problems my colleagues could not solve. Some think I have a certain "developer's flair", an intuition. Whatever you may believe, I consider intuition a development tool like any other, one that can and should be improved and trained.

Intuition as a tool
If we treat intuition as a development tool (and it really is a tool), it immediately becomes clear that, besides having the tool itself, it helps to have the skills to use it. Those skills are your personal experience of developing, debugging, and integrating systems. The more experience, the sharper the "instinct" for where problems might be. The broader your horizons and the deeper you know the system you work with, its environment, and how that environment interacts with the hardware it runs on, the easier it is to understand the entire vertically integrated stack, from hardware to your product. Let me illustrate with a couple of examples.
Example 1: Self-corrupting archive
Imagine that every once in a while you receive a report that data has been corrupted. There are many clients, each with different conditions, but from time to time someone notifies you that an archived copy refuses to be restored. The blame, of course, falls on your "buggy" software, your "crooked" hands, and everything else about you. We start testing the code and checking archive creation: everything is in order. Ten people look through the listing and see no problems. The data format and the way the archive is created imply integrity checks and protection against failures, so in theory the problem is not in our product.
(Note: when creating the archive, we use Reed-Solomon codes, which let us achieve the same reliability as Amazon S3 with only 40% redundancy instead of S3's 200%. We will definitely cover this in one of the upcoming articles on our blog, so subscribe and don't miss it!)
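To give a feel for those numbers, here is a back-of-the-envelope sketch. The 10+4 block split is my own illustrative assumption; the article only gives the 40% and 200% figures:

```python
# Back-of-the-envelope comparison of storage overhead (illustrative numbers:
# the 10+4 Reed-Solomon layout is an assumption; the article only states 40%).

def overhead(extra_blocks: int, data_blocks: int) -> float:
    """Redundancy as extra storage relative to the payload."""
    return extra_blocks / data_blocks

# Reed-Solomon RS(14, 10): 10 data blocks + 4 parity blocks.
# Any 10 of the 14 blocks suffice to reconstruct the data,
# so we survive the loss of any 4 blocks at 40% overhead.
rs = overhead(extra_blocks=4, data_blocks=10)           # 0.4 -> 40%

# Plain triple replication (the 200% figure quoted for S3-style storage):
# two full extra copies survive the loss of any 2 copies at 200% overhead.
replication = overhead(extra_blocks=2, data_blocks=1)   # 2.0 -> 200%

print(f"Reed-Solomon: {rs:.0%} overhead, tolerates 4 lost blocks out of 14")
print(f"Replication:  {replication:.0%} overhead, tolerates 2 lost copies out of 3")
```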
Still, the problem exists and has to be solved. Knowledge of the product helped in the hunt for the error. A peculiarity of our archive format is that it keeps two copies of the metadata, so I proposed comparing them to see whether they matched. They did not. We started digging: when opened in a hex editor, one of the copies showed an obvious pattern, every 64 bytes of data were separated by 16 bytes of "digital garbage". Such a pattern clearly correlated with the file system and the way data was laid out on the server. Knowledge of the environment plus knowledge of the product let us apply an intuitive approach to finding, identifying, and eliminating the problem.
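For illustration, here is roughly how that hex-editor comparison could be automated. This is a minimal sketch: the file names are hypothetical, and it assumes the corruption is aligned to the start of the copy:

```python
# A minimal sketch of the check we effectively did by eye in the hex editor:
# compare the two metadata copies and test whether the corrupted byte offsets
# follow a periodic pattern (file names and alignment are assumptions).

def diff_offsets(a: bytes, b: bytes) -> list[int]:
    """Byte offsets at which the two metadata copies disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

with open("meta_copy_1.bin", "rb") as f1, open("meta_copy_2.bin", "rb") as f2:
    offsets = diff_offsets(f1.read(), f2.read())

# 64 bytes of intact data followed by 16 bytes of garbage means every
# corrupted offset falls into the same 16-byte window of an 80-byte stride.
DATA, STRIDE = 64, 64 + 16
if not offsets:
    print("Metadata copies match")
elif all(off % STRIDE >= DATA for off in offsets):
    print(f"Corruption is periodic: garbage occupies bytes {DATA}..{STRIDE - 1} "
          f"of every {STRIDE}-byte block -> suspect the storage layer, not the code")
else:
    print("Copies differ, but with no obvious periodic pattern")
```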
Broad outlook -> broad experience -> fast bug hunting
As the previous example shows, without knowing how the data is stored, we could have searched for the error for a very long time. A good developer usually looks for problems in his own code first, but one of the traps of debugging is that it is hard to break out of the debugging "loop": re-checking the data, the algorithm, and the result under refined conditions produces no "wrong" answers, so the developer runs the test again, or dives into the sources, finds nothing, and repeats until the working day is over. It is very important to be able to switch contexts and step beyond your own area of responsibility: the error is not necessarily in the code; the execution environment itself and the hardware everything runs on can be to blame, and they must not be ruled out. I have another good example of exactly that problem.
Example 2: The client times out waiting for the result of an operation and fails with an error
The situation is this: the program has been running for a long time and everything is fine. Requests, responses, calculations, results: no problems, nothing crashes anywhere, the outgoing data looks as expected, the incoming data is processed correctly. Then at some point the following error started to appear: the client waits for about a minute, does not receive a response from the server, and then returns an error. The developer stares hard at what seems to be perfectly working code, which had previously run without any problems for a day, for two, for ten. Checks, reading and re-reading the logs, bring no results. There is no escape from the vicious debugging loop. At a certain point we decided to look at the problem more broadly, trace all events, and examine how the server's response time had evolved before it began producing errors. We noticed that from a certain moment, the response time had been growing almost linearly, until it reached that fateful minute. We logged in over SSH, forced a re-indexing, and everything worked again. In retrospect the solution looks obvious and simple, but the developer found it hard to get out of the debugging loop and look at the problem from the system level rather than from the inside.
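Here is a minimal sketch of that "look at the history, not at the single failure" idea: sample the response time, fit a line, and estimate when it will cross the timeout. The endpoint URL, the sampling interval, and the 60-second limit are assumptions taken from the story:

```python
# A rough sketch: sample server response times, fit a line to them, and
# extrapolate when latency will hit the client's timeout.  The endpoint,
# the sampling schedule, and the 60 s limit are assumptions, not our API.

import statistics
import time
import urllib.request

TIMEOUT_S = 60.0                          # the "fateful minute" from the story
URL = "http://server.example/api/ping"    # hypothetical endpoint

samples: list[tuple[float, float]] = []   # (timestamp, response time)

for _ in range(20):
    start = time.monotonic()
    urllib.request.urlopen(URL, timeout=TIMEOUT_S).read()
    samples.append((time.monotonic(), time.monotonic() - start))
    time.sleep(30)

# Linear regression over (timestamp, latency): a steadily positive slope
# means latency is growing and will eventually cross the timeout.
xs, ys = zip(*samples)
slope = statistics.linear_regression(xs, ys).slope  # Python 3.10+
if slope > 0:
    last_t, last_latency = samples[-1]
    eta_s = (TIMEOUT_S - last_latency) / slope
    print(f"Latency grows ~{slope * 3600:.1f} s per hour; "
          f"timeout expected in ~{eta_s / 3600:.1f} h")
```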
Example 3: Machines that fail to boot after restoring from backup
Here is another interesting story from the recent past. At Acronis, as usual, we constantly rework and improve things, or even redevelop them from scratch with previous experience in mind. In this particular case, the developers were working on a new archive format for disk backups, which would become (or rather, has already become) part of Acronis Cloud Storage. The code was written and reviewed, and the time came for field tests. We loaded the data, made a backup, sent it to the cloud, and then tried to restore the machines. We check the results: some work, some do not, the computers simply refuse to boot. A tedious series of debugging sessions, logging, and error checks in automatic and manual mode begins, plus a direct scan of the file system after full recovery: everything is in order, the volume mounts, all files are readable. We found and fixed a couple of small things, but they had no effect on the machines that would not boot. All in all, we spent several sleepless nights trying to catch the cause of the boot failures (and, I repeat, only some of the computers failed to boot). In short, the developers were stuck in a debugging loop with no exit. Quite by accident, we noticed that the MFT on the affected machines was heavily fragmented (about one and a half thousand fragments). The idea immediately came up that the boot loader simply lacked the capacity to process such an amount of information. We dove into the code, optimized the file system parser, changed a few lines, and the problem disappeared instantly. In this particular case, what pulled us out of the debug loop was plain attention to every possible trifle. And again, in retrospect the solution seems simple and obvious, while at development time the very idea of checking the fragmentation and the parser felt like an insight, a sudden stroke of Developer's Grace.
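To make the failure mode concrete, here is a toy model of that class of bug. Everything here is hypothetical (the fixed 1024-extent limit, the data structures, and Python itself; the real loader is native code), but it shows why only heavily fragmented disks failed:

```python
# A toy model of the bug class we hit: a boot-time file system parser with a
# fixed-size extent table silently truncates once the MFT has more fragments
# than the table can hold.  The 1024 limit and all names are hypothetical.

MAX_EXTENTS = 1024  # a "surely enough" compile-time limit chosen long ago

def load_mft_extents(data_runs: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Collect (start_cluster, length) runs describing where the MFT lives."""
    extents = []
    for run in data_runs:
        if len(extents) >= MAX_EXTENTS:
            # The silent truncation that made only *heavily* fragmented disks
            # unbootable: lightly fragmented volumes never hit the limit,
            # so all the lab tests passed.
            break
        extents.append(run)
    return extents

# ~1500 fragments, as on the machines that refused to boot
fragmented_mft = [(i * 8, 8) for i in range(1500)]
loaded = load_mft_extents(fragmented_mft)
print(f"loaded {len(loaded)} of {len(fragmented_mft)} extents")
# -> loaded 1024 of 1500 extents: part of the MFT is invisible to the loader,
#    so files needed for boot cannot be found.  The fix is to grow the table
#    dynamically (or stream the runs) instead of capping it.
```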
Instead of a conclusion
Know your product, know your code, and know your environment. Do not fixate on a single method of catching bugs.

Check not only the program and the algorithm, but also everything related to their execution: adjacent products, responses from the OS, responses from the hardware and/or the server. Look wider, dig deeper, and intuition will start helping you find problems, and your product (a good one, no doubt) will be closer to your clients and users. Thank you for your attention.