When the butler is a victim

Today we move away from the classic cliché of second-rate detectives and tell you about a case from our practice, when the butler himself acted as a victim, and the search for a real criminal led us to an unexpected result. But do not be afraid. This, of course, is not about real people, but about programs. And that we are not afraid to admit our own mistakes: we know that not a single team of developers is insured against them.

Introduction

At the beginning - a little theory. We are developing PVS-Studio - static code analyzer for C, C ++ and C #. For internal testing of our product, we use a variety of tools and techniques: Unit and Visual Studio UI tests, joint code review, specialized application testers. I will dwell in more detail on the last paragraph.

When developing new diagnostic rules, as well as when changing the internal mechanisms of the analyzer, it is always necessary to understand how exactly the changes made affected the quality of the analysis. To solve this problem, we perform test runs of the analyzer on a set of large open projects. About 150 projects are used for C / C ++ code, and 52 for C #. The tester for analyzing C / C ++ projects in Windows is called SelfTester. Next, it will be about this tool. There is also a specialized system for Linux, but now it will remain behind the scenes.

All work is done on the build server with Windows 10. As the build system, we use Jenkins, which is configured to run regularly overnight, including testers. Running Jenkins itself is done from cmd-file with the command:
')

java -jar jenkins.war > %JENKINS_PROJECTS%\Logs\%YYYY%.%MM%.%DD%_%HH%.%MI%.log 2>&1

The JENKINS_PROJECTS variable sets the location of the Jenkins project folder on the local disk. The variables YYYY , MM , DD , HH and MI contain, respectively, the date and time at the time of execution. Thus, after starting Jenkins, the output of its console will be redirected to a file with the name of the form:

 2017.11.08_17.58.log

Cmd-file starts using the Windows Task Scheduler. Running Jenkins as a Windows service for us turned out to be inapplicable, since in this case there were problems in the work of Visual Studio UI tests: the service does not have a desktop. Thus, the run chain looks like this:

 Windows Task Manager -> cmd.exe -> java.exe

So, having read a brief introductory, we turn to the question of what went wrong.

Problem

About six months ago, we began to notice that from time to time Jenkins closed during nightly assemblies. As a result, you had to manually start it, check the logs and restart some tasks. The problem was reproduced quite regularly. It was decided to start the investigation with a more careful study of the Jenkins behavior, because it is he who does all the “dirty work” and is the first to fall under suspicion. Yes, yes, that same butler.

Jenkins

First of all, the configuration and settings of Jenkins tasks were checked. The check revealed nothing serious. Especially, if an error had been made, for example, in the order, priority, or time of launching tasks, then the problem would probably manifest itself differently. Yes, and it is unlikely that Jenkins fell due to configuration errors.

As a temporary measure, the Windows scheduler task was migrated to periodically launch Jenkins every 30 minutes. To protect against restarting, a check was made for the presence of an already running process.

The next step is to study the Jenkins console logs. As you probably remember, with us they are saved in the % JENKINS_PROJECTS% \ Logs folder, and the file names contain the creation timestamp (in fact, the Jenkins restart time). After some time, we examined the contents of this folder and, as expected, we found a fairly large number of files there. This indicated that Jenkins continues to restart. In this case, the Jenkins console logs themselves did not contain anything criminal. And the problem has acquired a slightly different character: some of the night tests have ceased to be performed. They just did not start. And those that started, often ended with errors.

All this was in favor of the fact that Jenkins himself was probably not at fault. Someone from the outside impedes his work, "killing" the java.exe process.

Windows

Windows was chosen as the next suspect. Well, really, who else but the operating system should be responsible for the unplanned completion of processes. The system may restart after installing updates, or for some reason it closes only the java.exe process. This is possible when someone prevents the restart and it is not performed, but the system has already managed to kill some of the processes. One way or another, but the analysis of the Windows logs gave no results. Yes, sometimes the system really rebooted, but everything was happening in normal mode. And the time and place of the "crime" did not coincide.

At about the same time as we were conducting our investigation, another incident occurred, which was not immediately given much attention. One night our assembly server “fell” due to the emergence of BSOD. An automatic reboot did not happen, so in the morning we could see a very bleak picture.

Windows Debugger (WinDbg) was used to analyze the memory dump. The utility was found along the path "C: \ Program Files (x86) \ Windows Kits \ 10 \ Debuggers \ x64 \ windbg.exe". The dump was in the standard path "C: \ Windows \ MEMORY.DMP". After analyzing the dump in WinDbg, the cause and originator of the BSOD was discovered:

 CRITICAL_PROCESS_DIED (ef) CRITICAL_PROCESS: svchost.exe .... FAILURE_BUCKET_ID: 0xEF_svchost.exe_BUGCHECK_CRITICAL_PROCESS_TERMINATED_BY_ SelfTester.exe_6771c7c0

As you understand, forcibly closing instances of svchost.exe processes usually results in BSOD. You can check.

Yes, our flawless SelfTester tool caused Windows to crash. But since this was the first such incident in the entire long history of the tester’s existence, and the last changes to its code were made a long time ago, it was decided to consider what “it seemed to us”.

Meanwhile, the investigation went on as usual. To find out the culprit of the completion of the java.exe process, a regular Windows utility was used to configure the registry keys “Global Flags”. Among other things, it allows you to specify the monitoring mode for the completion of the specified process with the event logged in the system log:

In case of a forced termination of a process by a third-party process, the Application log of Windows should contain an event with an Event ID = 3001 from the source Process Exit Monitor.

As often happens in such cases, it was worth setting the traps, as the prey was immediately hidden. The next couple of weeks the server worked perfectly, the tasks were successfully completed, and the Windows log did not contain the required events.

Finally, one fine morning, the problem repeated, and the following entry appeared in the system log:

The process 'C: \ Program Files \ java \ jre1.8.0_102 \ bin \ java.exe was terminated by the process' D: \ SelfTester \ SelfTester.exe with termination code 0. 0x01d3541edc4f867b.

SelfTester again. It turns out that it was due to his fault that the java.exe process was completed, which generated it. Incredible deceit.

SelfTester

Finding the culprit is only half the battle. Now it is necessary to understand what exactly is happening and how it can be corrected. To do this, we examine the mechanisms of the tester.

To analyze each C ++ project, the tester starts the analyzer process (PVS-Studio_cmd.exe) with parameters. Upon completion of the analysis, the tester does not wait for the process to finish directly, but responds to the event that a verification report (plog) has been created, after which, “just in case”, ends the analyzer process and its child processes. Probably at this very moment something unexpected happens.

In fact, this algorithm is a legacy of the past and now there is no such necessity in it. Earlier, devenv.exe was run for analysis with parameters. The check was carried out by the PVS-Studio plugin for Visual Studio. In the course of this, many child processes were spawned, which had to be terminated.

To understand what exactly is happening, it was necessary to add a logging mechanism for terminating processes. After studying the logs, we managed to find out the following. Child processes for PVS-Studio_cmd.exe in the tester are searched for with the following WMI request:

 SELECT * FROM Win32_Process WHERE ParentProcessId={0}

The problem is that at the time of the execution of this request, the process may no longer exist. He honestly worked and ended, but the tester does not know. Sometimes a situation arises where the process ID matches (was reused) with the process that originally spawned cmd.exe, on which java.exe hangs (Task Scheduler process). At the same time, Windows remembers the IDs of the parent processes, even if they are already completed. So we get this cmd.exe and terminate it, as well as its child processes (recursively).

Decision

The problem was solved by adding two conditions to the tester code, under which the child processes cannot be terminated:

The parent process has already ended.
The parent process was not completed, but it started later than its intended subsidiaries. In this case, he could re-use the ID of an already completed process.

Conclusion

So, the investigation is complete. All suspicions from innocent people were removed, and the guilty were found and severely punished (corrected).

In conclusion, I want to note that the situation described is, in my opinion, a good example of how even a small oversight can lead to a difficult to reproduce error.

We are not ashamed to admit our mistakes and, I think, readers were interested in meeting one of them.

Bezdazhnogo code and successful builds!

Source: https://habr.com/ru/post/344142/

All Articles