
The life of insects, or How we catch "bugs" in the updates of anti-virus databases

Unfortunately, everyone makes mistakes, and Kaspersky Anti-Virus is no exception. Our updates occasionally contain "bugs", some of which cause users real trouble. We thoroughly investigate every such case, draw conclusions, and keep improving our testing technologies.

And how are anti-virus updates generally tested?


For obvious reasons, the technical details of testing are usually a closely guarded secret in the antivirus industry. Try searching the Internet: you will find no useful information on the subject.
On the other hand, testing updates is a very interesting topic, worthy of the reader's attention. And we have something to share here.
In the late 1990s, Kaspersky Lab was one of the first in the industry to automate this process, and we have been developing it continuously for about 15 years.

How we test anti-virus database updates

First, some interesting numbers:
Public anti-virus databases are released on average every 2 hours; anti-spam databases, as often as every 5 minutes. Every release is tested for false positives, missed detections (malware that slips through), crashes, system load, and performance in each product, along with a number of other criteria.

Chief among these are false positives and missed detections: the quality of protection depends on them, so we approach this part of the process especially thoroughly. The databases are tested automatically against a large collection (clean software, corrupted and broken files, and so on); on an ordinary computer this would take a day. We cannot afford that luxury, so the tests run on a dedicated computing cluster driven by our in-house Distributed Data Processing System (DDPS), which can scan an 80 TB collection in 6 hours.
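The idea behind such distributed scanning can be sketched in a few lines. This is only an illustration, not the actual DDPS code: the sharding scheme, the worker pool, and the toy "engine" that flags names containing "mal" are all assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(files, num_workers):
    """Split the collection into roughly equal shards, one per worker node."""
    shards = [[] for _ in range(num_workers)]
    for i, name in enumerate(files):
        shards[i % num_workers].append(name)
    return shards

def scan_shard(files, detect):
    """Scan one shard; return the names the (stand-in) engine flags."""
    return [name for name in files if detect(name)]

def distributed_scan(files, detect, num_workers=4):
    """Fan the shards out to workers and merge their verdicts."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(lambda s: scan_shard(s, detect), shard(files, num_workers))
    return sorted(hit for part in parts for hit in part)

# Toy "engine": flag any sample whose name contains "mal".
collection = [f"clean_{i:03d}" for i in range(100)] + ["mal_001", "mal_002"]
hits = distributed_scan(collection, lambda name: "mal" in name)
```

The real system scans files rather than matching names, and runs across many machines rather than threads, but the shard/scan/merge shape is the same.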

How the databases behave in a particular product on a specific OS is also critical (who needs good databases in a non-working product?). This is tested by a dedicated robot on virtual machines covering every combination of supported product versions and operating systems (Windows, various Linux and Unix flavors, Mac OS). The cluster holds more than 1,300 virtual machines, so we can potentially test 1,300 (sic!) combinations simultaneously.
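Such a combination matrix is a simple Cartesian product. A minimal sketch, with entirely hypothetical product, version, and OS names standing in for the real supported list:

```python
from itertools import product

# Hypothetical names purely for illustration; the real supported
# matrix is far larger (1,300+ combinations).
products = ["KAV", "KIS", "KES"]
versions = ["2012", "2013"]
oses = ["WinXP", "Win7", "Win8", "Ubuntu", "CentOS", "MacOS"]

# One virtual machine per (product, version, OS) combination.
matrix = list(product(products, versions, oses))
```

Each tuple in `matrix` maps to one virtual machine that receives the freshly built databases and runs the product-level checks.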
A cluster is a cluster, even in Russia

Updatable modules are tested separately (the antivirus engine, anti-rootkit, script emulator, IDS, and so on; more than 20 modules in all). This is the job of a dedicated team of testers based in Moscow, St. Petersburg, and Beijing (10 people in total). The process is semi-automated: both robots and specialists work on each task.

After testing, a release set is assembled and uploaded to public servers: more than 60 of them in total, on every continent except Antarctica. The antivirus installed on your computer gets its updates from these servers. The upload is handled by another in-house development, DRS (Distributed Replication System). Thanks to DRS, the process is very fast, multi-threaded, and highly reliable. One telling figure: replicating an update to all of these servers takes only a few minutes, and we push out about 20 releases per hour.
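The fan-out step can be sketched as a parallel push to all mirrors. This is a stand-in, not DRS itself: the hostnames are placeholders and the "transfer" only records the payload size, where the real system would do an actual multi-threaded transfer with retries and integrity checks.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder hostnames; the real server list is, of course, private.
SERVERS = [f"upd{i:02d}.example.com" for i in range(60)]

def push_update(server, payload):
    # A real implementation would do an rsync/HTTP transfer with retries;
    # here we only record that the server received the full payload.
    return server, len(payload)

def replicate(payload, servers, threads=16):
    """Fan the release set out to all mirrors in parallel."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(lambda s: push_update(s, payload), servers))

status = replicate(b"update-bytes", SERVERS)
```

Pushing to all 60 servers concurrently is what keeps the end-to-end replication time down to minutes even at ~20 releases per hour.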

Controlled updates


What else is important in this whole machinery of producing, testing, and delivering updates?

The results of every update are monitored by another tool of ours: the KSN cloud system. When a "bug" slips through, KSN signals the problem an hour (sometimes several hours, or even days, depending on the nature of the bug) before the first reports from users reach technical support. This lets us respond faster and even resolve some incidents BEFORE they spread.
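The core of such early warning is rate-based anomaly detection on client telemetry. A minimal sketch, assuming a simple threshold over a rolling baseline (KSN's actual signals and models are not public):

```python
from statistics import mean

def is_anomalous(history, current, factor=3.0):
    """Trip the alarm when the current per-minute crash-report count
    exceeds the mean of a recent history window by a given factor."""
    return current > mean(history) * factor

# Hypothetical numbers: normal background noise of ~10 reports/min.
history = [9, 11, 10, 12, 8]
quiet = is_anomalous(history, 15)   # within normal variation
noisy = is_anomalous(history, 35)   # a bad update makes itself known fast
```

Because the telemetry arrives continuously from millions of clients, even a modest threshold like this fires long before individual users get around to filing support tickets.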

The total cost of our update-testing infrastructure is about $3 million per year. Not cheap, but this money buys what matters most: the quality of protection, the stability of the products, and, above all, users' comfort and trust. Over the past year the testing system has caught and prevented 4 major incidents, as well as a fair number of false positives. We will keep developing this infrastructure; it is no place for savings.

Recently we had two serious incidents with updates (in December and in February). We analyzed them carefully, drew conclusions, and have already made several important changes, both organizational and technical.

First, we tightened the rules for issuing "dangerous" updates that could potentially cause problems on the client side, and increased control: every manipulation of every file is now logged, so we can instantly tell what was released, how, and by whom; and any deviation of the infrastructure from its normal state, however minor, or any incident under investigation can halt an update (and thereby prevent a new incident).

Second, we introduced new crash tests into the update-verification procedure. One of them uses a special collection of files whose scanning exercises the maximum amount of code in the anti-virus databases, giving poor-quality code the greatest chance to show itself. We also finalized the system for collecting, recording, attributing, and aggregating information about client-side crashes (dumps): now not a single crash gets lost, and every one is investigated. The first runs showed the procedure to be very effective; it has already caught one very unpleasant bug.
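The attribution-and-aggregation step typically means bucketing dumps by a stack signature so duplicates are counted rather than drowned out. A small sketch under that assumption, with hypothetical frame names (our real pipeline's signature scheme is not public):

```python
from collections import Counter

def signature(stack_frames, depth=3):
    """Attribute a dump to a bucket by its topmost stack frames."""
    return tuple(stack_frames[:depth])

def aggregate(dumps):
    """Count crashes per signature so no single crash gets lost."""
    return Counter(signature(frames) for frames in dumps)

# Hypothetical frame names purely for illustration.
dumps = [
    ["av_engine!scan", "av_engine!unpack", "ntdll!read"],
    ["av_engine!scan", "av_engine!unpack", "ntdll!read"],
    ["gui!render", "gui!paint", "user32!dispatch"],
]
buckets = aggregate(dumps)
```

Sorting buckets by count immediately surfaces the crash signature a bad database release triggers en masse.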

Third, we formulated requirements for future products to make them more resilient to various kinds of update problems. No details here, so as not to give away descriptions of future patents :-).

Finally, we are launching a new corporate crisis-management procedure that covers all the departments involved and ensures maximum speed and transparency as information passes along the chain from developer to client.

Important: The above is only a small part of the improvements being worked on this year.

Despite this powerful testing system, mistakes still happen, and no one is immune to them. We are people too, even if we command robots and automation. Errare humanum est. But what matters is that the humanum draws conclusions from each errare and constantly improves, because perfection, alas (or fortunately?), has no limits.

Posted by Nikolay Grebennikov, Kaspersky Lab CTO

Source: https://habr.com/ru/post/173753/

