Elusive bugs: errors that have escaped all tests and checks

Even in relatively simple products, bugs occasionally slip past every test and get released. The more complex the application, the more likely this is. In products with millions of lines of code it is unrealistic to catch every error; you can only minimize their number before the next release. And after the release, some of these bugs make themselves felt. Alexander Grechishkin, a project manager at Parallels, told us how we hunt for ninja bugs and what we do about them.

Ninja bugs

For us this is a sensitive issue, because more than 5 million people use our products. When a new version comes out, most users switch to it within 1-2 weeks. That is often when a bug surfaces that was not detected during testing but shows up quite frequently for users.
We recently wrote about how our reporting system works. To recap briefly: when our program crashes, the user is asked to send a report. The causes vary, for example an access to unavailable or protected memory. The report is a set of files: logs, internal information about the application's operation, the computer configuration, and the system environment. A separate system collects these reports, processes and analyzes them, and generates a statistical report that is updated in near real time.

In this report, recorded crashes can be grouped by various attributes: by type, by IP, and so on. The report lets us monitor the current crash dynamics and spot bad trends in time. It also helps us identify and catch the very ninja bugs that escaped all the tests and made it into production.
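The grouping described above can be pictured with a minimal sketch. It is not the actual Parallels reporting system: the record fields (`user`, `type`, `version`), their values, and the `group_by` helper are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical crash report records; field names and values are invented
# for this sketch and are not taken from the real reporting system.
reports = [
    {"user": "u1", "type": "access_violation", "version": "12.0.1"},
    {"user": "u2", "type": "access_violation", "version": "12.0.1"},
    {"user": "u2", "type": "unhandled_exception", "version": "12.0.1"},
    {"user": "u3", "type": "unhandled_exception", "version": "11.2.0"},
]


def group_by(records, key):
    """Count crash reports per distinct value of the given field."""
    return Counter(record[key] for record in records)


# The most frequent values come first, which is what a trend view needs.
print(group_by(reports, "type").most_common())
print(group_by(reports, "version").most_common())
```

Grouping by any other recorded attribute works the same way, as long as every record carries that field.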

Individual service

When 100 different users each have one crash, that is bad. But when the application crashes 100 times for one user, it is much worse for us: most likely that person has hit some tricky bug, and repeated crashes do nothing for their confidence in our product. So we try to monitor such situations and respond to them as quickly as possible.
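As a rough sketch of how such users could be spotted, assuming each crash report carries some user identifier (the identifiers, the threshold, and the function name below are hypothetical):

```python
from collections import Counter


def users_with_repeated_crashes(crash_user_ids, threshold=10):
    """Return {user_id: crash_count} for users at or above the threshold.

    `crash_user_ids` holds one entry per crash report, so a user who
    crashed 100 times appears in it 100 times.
    """
    counts = Counter(crash_user_ids)
    return {user: n for user, n in counts.items() if n >= threshold}


# 100 users with one crash each, plus one user with 100 crashes.
crashes = [f"user-{i}" for i in range(100)] + ["unlucky-user"] * 100
print(users_with_repeated_crashes(crashes))
```

Both halves of the sample contribute the same number of reports, but only the single user with 100 crashes is flagged for individual attention.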

Sometimes the reports let us quickly find the cause of a unique bug and figure out how to help the person. But we cannot release a product update very quickly, even if a developer shows up with a ready fix. Our codebase contains tens of millions of lines of code, and we never release such a product on a whim. We have already seen a one-letter change in the code leave the product completely unusable. So any rebuild that includes a fix has to pass all the automated tests and release checklists, which takes from 2 days to a week, depending on the resources available.

How, then, can we quickly help individual users with unique bugs? For this, we added to the crash report form a field for the user's email address and a checkbox labeled "Send me a solution when it is available." If a bug can be worked around without changing the product, we email instructions to the users who left their addresses in the report. If the crash is already fixed in a newer release and the user is still on an old one, then when the crash is detected we send an email suggesting an update. Users often simply turn off automatic updates in the product.

For example, there was this bug: for some users, the crash-report sending system itself crashed. A funny situation, but still. The cause was an excessive number of files in the log folder. The workaround was simple: write to the user and ask them to go to the Logs folder and delete all the files. We fixed the bug in the code right away and shipped the fix to everyone in the next update.
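The article does not say how the fix in the code worked. One plausible approach, shown here purely as an assumption, is to cap the number of files kept in the log folder. The `prune_logs` name and the `keep` limit are hypothetical.

```python
from pathlib import Path


def prune_logs(log_dir, keep=50):
    """Delete all but the `keep` most recently modified files in log_dir.

    Returns the number of files removed. Subdirectories are left alone.
    """
    files = sorted(
        (p for p in Path(log_dir).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = files[keep:]
    for path in stale:
        path.unlink()
    return len(stale)
```

Running something like this before a report is assembled would keep the folder bounded, whereas the manual workaround above asked the user to delete everything by hand.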

Now imagine how this looks from the user's point of view. The program crashes, the user swears, presses the button to send the report, and a couple of minutes later gets a notification about an email from Parallels: “We see you have a problem, it can be solved this way.” Individual service practically in real time! Users have even written us thank-you letters because within minutes we had offered a solution to a problem that was keeping them from working.

The arbitrariness of Mac OS

Crashes are often caused not by bugs in our products but by the system environment. Oddly enough, before the release of Mac OS 10.11 the main cause of crashes was Mac OS itself. Version 10.9 introduced so-called silent updates. For example, you leave the computer, come back the next day, and it says: "I installed some updates, but don't worry, I have a resume function." That is, all the applications that were running before the update was installed are launched again. And it seems that nothing has changed: the same sites are open in Safari, the unfinished letter is in the mail client, the conversation is in the messenger.

Because of a bug in Mac OS, the system killed the WindowServer without waiting for applications to finish. Ours is a heavy application and it needs some time to suspend running virtual machines, but Mac OS decided that it was responding too slowly and killed it with kill -9. And we have a multiprocess system: two kernel drivers, a virtual machine process, a graphical interface, integration components. Inside Windows, something could be downloading or installing at that moment. For example, updates were being installed and a mass of registry entries was being changed, and in no case should the machine be switched off then. Yet Mac OS either killed the WindowServer before the applications using it had exited, or killed Parallels Desktop because the virtual machines were slow to stop.
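Why kill -9 is so destructive can be shown with a small sketch of standard POSIX signal behavior; it is not Parallels code, and the handler and state names are hypothetical. A process can catch SIGTERM and use it to start an orderly suspend, but SIGKILL (signal 9) cannot be caught, so no cleanup code ever runs.

```python
import signal

# Hypothetical shared state; a real application would use it to start
# suspending its virtual machines.
state = {"shutdown_requested": False}


def request_shutdown(signum, frame):
    """SIGTERM handler: only record the request, do the work elsewhere."""
    state["shutdown_requested"] = True


def install_handlers():
    """Ask to be notified of SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)


def sigkill_is_catchable():
    """Try to install a handler for SIGKILL; POSIX systems refuse."""
    try:
        signal.signal(signal.SIGKILL, request_shutdown)
    except OSError:
        return False
    return True
```

On Linux and macOS `sigkill_is_catchable()` returns False, which is exactly the problem described above: an application that is still suspending its virtual machines gets no chance to finish.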

Naturally, applications started crashing en masse, and Mac OS itself actively generated reports. Our applications checked the system logs, found these records, and honestly informed the users, who then sent the reports to us. Keep in mind that, according to our statistics, only one user in four presses the “Send Report” button. So for us it was a nightmare. We even had to remove this crash signature from the reporting system so as not to upset users with crash reports.

In the end, we managed to convince Apple to fix this bug, and starting with Mac OS 10.11, it no longer bothers us.

Not all drivers are equally useful

Another typical example of how the environment affects our products is crashes in third-party drivers. In particular, we support 3D in Linux (OpenGL) and in Windows (DirectX and OpenGL). While the nVidia drivers are reasonably stable, Intel drivers tend to crash at the slightest provocation. The consequences are dire: the application crashes, and the OS and the programs in use restart. With data loss, naturally.

For example, a person launches a resource-intensive program. Say, a game on Mac OS. It starts using memory, the processor, and the video card to the maximum. What happens inside the 3D drivers is known only to their authors. The user plays the game, and the drivers fail to free some buffer or change some settings in the system. Afterwards, or at the same time, the person launches our product, which also needs resources from the video card, and quite a lot of them: after all, it is a computer emulator. It begins to use 3D graphics heavily and, because of errors in the Mac graphics card drivers, it crashes.

BSOD forever

We must also not forget that Windows itself has errors that can affect the stability of our products. For example, you are working away quietly, and suddenly Windows reboots. What happened? The same kind of crash occurred there, or an exception, or an access to nonexistent memory. But our program cannot reboot silently. It has to tell the user something, so it reports an error. And then we collect and analyze these errors.

Most of them are one-offs, and we do not even look into those: in our experience, such problems are not related to our program and we cannot fix them. Moreover, the user is unlikely to see such an error again. First of all, we pay attention to the bugs that recur most often in the report.

On test benches we could never have foreseen a specific combination of factors, because the variety of users' software environments is endless. The main difficulty is that other people's programs run inside our products. For example, Windows runs inside Parallels Desktop, and many more applications run inside Windows. It is simply impossible to cover all the combinations with tests.

Windows and third-party programs have bugs too. When analyzing such problems, we first try to get hold of the program in question and run it on a real computer to check how it actually behaves. Unfortunately, the software our users run is often very specialized and, let's face it, not free, which limits what we can do.

Unpredictable Qt

Another common source of failures is the Qt framework. We use it to develop our products. This came about historically: we had Parallels Desktop for Linux, Windows, and Mac, so we had to choose Qt as the development framework. By rights, if something works, you should not change it. But we are a progressive company and we like to use progressive technologies, so we keep our Qt up to date. We have been through versions 3.0, 3.2, 4.0, 4.2, 4.5, 4.8, 5.2, and 5.5. The trouble is that in Qt a change in the second digit is almost always a catastrophe. In most programs the first digit marks a major update and the third digit a fix, but in this framework a change in the second digit can be comparable to a change in the first. The difference between 5.0 and 5.5 is striking. The authors throw out entire classes, refactor a huge number of components, and move to a new API. And new code never comes without errors. So we regularly find that our products crash because of Qt.

At some point our patience ran out. We pulled all the Qt source code into our repository and started making local patches for it without submitting them to the main trunk. We had about 250 of them, and they solved rather serious problems with inefficient or incorrect code. Recently, while moving to the next version, we started contributing patches to the Qt trunk itself.

Here is a simple example of how framework code affects us. In some situations our applications slowed down badly on users' computers, while everything worked perfectly for us. It turned out that the Qt network component was included in our product build. We do not use it, but it is in the library, and it somehow affects the work of the entire framework. As soon as it was removed, the slowdowns disappeared immediately.

Finally, a funny incident comes to mind. Analyzing the reports, we noticed that one user's copy of our application kept crashing, and it crashed because it was pirated and badly cracked. We wrote a letter: “Sorry, we are from Parallels and we want to help. Your copy of the product is not entirely honest, and that is the whole problem!” The girl apologized, said that a friend had installed the application for her, and 10 minutes later had already paid for it. Since then we have had no problem reports from her.

Source: https://habr.com/ru/post/317270/
