BASS - framework for automatic synthesis of antivirus signatures

Hello. Fewer than ten days remain before the “Reverse Engineering” course begins, so we would like to share one more interesting translation on the topic. Let's go!

Overview

The threat landscape changes quickly: new attacks appear constantly, and old ones become more sophisticated. Under these conditions, security specialists face ever harder tasks. Every day they have to process and analyze millions of previously unknown malware samples, develop effective antivirus signatures that describe entire malware families, and keep their tools scalable as the number of samples grows. At the same time, automated malware analysis tools have only limited resources available. To help with these challenges, Talos is releasing a new open source framework called BASS.

BASS (pronounced “bæs”) is a framework for automatically generating antivirus signatures from samples that belong to previously generated malware clusters. It is designed to reduce the resource consumption of the ClamAV engine by increasing the share of pattern-based signatures relative to hash signatures, and to ease the workload of analysts who write pattern-based signatures. Thanks to its use of Docker containers, the framework scales well.

It is worth noting that only an alpha version of BASS is available so far, and much remains to be done. The project is open source and we are actively working on it, so we welcome any feedback and suggestions for improvement from the community. The BASS source code is available here.

The BASS project was announced in 2017 at the REcon conference in Montreal, Canada.

Motivation

Talos receives more than 1.5 million unique samples every day. Most of them are known threats and are filtered out immediately by a malware scanner (ClamAV). After scanning, however, many files remain that need further analysis. We run them in a sandbox and perform dynamic analysis, which lets us classify them as malicious or benign. The malicious samples identified at this stage are then processed to create ClamAV signatures that will catch these threats earlier, at scan time.

Over three months, from February to April 2017, the ClamAV database grew by 560,000 signatures, an increase of 9,500 signatures per day. A significant share of them were generated automatically as hash signatures. Such signatures have one major drawback compared to pattern-based or bytecode signatures (the two other types supported by the ClamAV engine): one hash signature matches exactly one file. In addition, the growing number of hash signatures makes the ClamAV database take up more memory. That is why we prefer pattern-based signatures. They are much easier and faster to maintain than bytecode signatures, and at the same time they can describe entire clusters of files.
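To make the one-signature-per-file limitation concrete, here is a minimal sketch that builds entries in ClamAV's hash signature layout (`MD5:FileSize:MalwareName`, as used in `.hdb` files). The sample bytes and the malware names are made up for illustration; this is not part of BASS.

```python
import hashlib


def hash_signature(data: bytes, name: str) -> str:
    """Build one line in ClamAV's .hdb layout: MD5:FileSize:MalwareName."""
    return f"{hashlib.md5(data).hexdigest()}:{len(data)}:{name}"


# Two hypothetical variants of the same sample that differ in a single byte.
variant_a = b"MZ" + b"\x90" * 64 + b"payload-v1"
variant_b = b"MZ" + b"\x90" * 64 + b"payload-v2"

sig_a = hash_signature(variant_a, "Win.Trojan.Example-1")
sig_b = hash_signature(variant_b, "Win.Trojan.Example-2")

# Same size, different digest: each variant needs its own database entry.
print(sig_a)
print(sig_b)
```

A single changed byte produces a completely different digest, so every variant adds another line to the database, whereas one pattern over the shared bytes could cover both.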

BASS

The BASS framework is designed to make it easier to create pattern-based ClamAV signatures. It generates them automatically from segments of binary executable code.

BASS takes malware clusters as its input but does not include the means of creating them, which keeps the technology convenient and flexible. We deliberately made the input interface generic so that it is easy to adapt to new cluster sources. We currently use several such sources, including clusters based on indicators of compromise (IoC) from our sandbox, structural hashing (where we start from a known-malicious executable and look for additional samples with a similar structure), and malware collected from spam campaigns.

In the first stage, the malicious samples pass through the unpackers of the ClamAV engine. The engine can unpack archives in various formats and compressed executables (for example, UPX), and it can extract embedded objects (such as EXE files inside Word documents). The resulting artifacts are inspected and information about them is collected. Currently, the next stage, filtering, uses their sizes and UNIX magic strings.

The malware cluster is then filtered. Files that do not meet BASS's requirements are removed from the cluster (for now the platform works only with PE executables, although adding support for ELF and Mach-O binaries would not be difficult). If too few files remain, the cluster is rejected entirely.
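The filtering step can be sketched roughly as follows. This is an illustration only, not BASS's code: the `MIN_CLUSTER_SIZE` threshold and the function names are assumptions, and the PE check simply looks for the `MZ` header and the `PE\0\0` signature it points to instead of using the magic string mentioned above.

```python
import struct
from typing import Dict, Optional

MIN_CLUSTER_SIZE = 2  # assumed threshold; the article does not state the real one


def is_pe(data: bytes) -> bool:
    """Return True if data has an MZ header that points at a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def filter_cluster(samples: Dict[str, bytes]) -> Optional[Dict[str, bytes]]:
    """Drop non-PE samples; reject the whole cluster (None) if too few remain."""
    kept = {name: data for name, data in samples.items() if is_pe(data)}
    return kept if len(kept) >= MIN_CLUSTER_SIZE else None
```

Returning `None` for a rejected cluster keeps the caller's decision explicit: a cluster either survives with at least the minimum number of PE files or is skipped altogether.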

The filtered cluster proceeds to the signature generation stage. First, the binaries are disassembled. We use IDA Pro for this, but it could easily be replaced with another disassembler with similar capabilities, such as radare2.

After disassembly, the code that the samples have in common must be identified so that signatures can be generated from it. This step is important for two reasons. First, the signature generation algorithm is computationally expensive and works best on short code segments. Second, it is preferable to derive signatures from code that is similar not only syntactically but also semantically. To compare the code we use BinDiff. It, too, is easy to replace, and in the future we may integrate other comparison tools into the framework.

If the cluster is small, BinDiff compares each executable with every other one. For larger clusters the set of comparisons is reduced, because otherwise the process could take too long. From the results a graph is built in which the vertices are functions and the edges represent their similarity. To find a good common function, it is enough to find a connected subgraph with high overall similarity.
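One simple way to pick such a subgraph is sketched below. It is an assumption-laden illustration, not the selection logic BASS actually uses: edges below an assumed similarity threshold are discarded, the remaining graph is split into connected components, and components are ranked by size and then by mean edge similarity. The function name, the threshold, and the input format are all hypothetical.

```python
from collections import defaultdict


def best_candidate(edges, threshold=0.8):
    """edges: iterable of (func_a, func_b, similarity in [0, 1]).

    Returns (sorted function names, mean similarity) for the strongest
    connected component once weak edges are discarded, or None if no edge
    reaches the threshold.
    """
    graph = defaultdict(dict)
    for a, b, sim in edges:
        if sim >= threshold:
            graph[a][b] = sim
            graph[b][a] = sim

    seen, best = set(), None
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk that collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node].keys() - component)
        seen |= component
        # Count every undirected edge once.
        weights = [graph[a][b] for a in component for b in graph[a] if a < b]
        score = (len(component), sum(weights) / len(weights))
        if best is None or score > best[0]:
            best = (score, component)
    if best is None:
        return None
    return sorted(best[1]), best[0][1]
```

With made-up similarity scores such as `[("f1", "f2", 0.9), ("f2", "f4", 0.85), ("f4", "f6", 0.95), ("f1", "f6", 0.9), ("f3", "f5", 0.99), ("f2", "f3", 0.3)]`, the weak `f2`/`f3` edge is dropped and the four-function component `f1, f2, f4, f6` is preferred over the smaller `f3, f5` pair.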

The subgraph ƒ1, ƒ2, ƒ4, ƒ6, whose vertices are highly similar to one another (see figure above), is an excellent candidate for a common function.

Once several such candidates have been collected, we check them against a whitelist to avoid generating signatures from ordinary library functions that are statically linked into the sample. To do this, the functions are submitted to an instance of Kam1n0, whose database we have pre-populated with functions from known-clean samples. If a clone of any function is found, subgraph selection is repeated to pick the best of the remaining candidates. If the check finds nothing, the set of functions is passed on to the next stage.

Then signature generation proper begins. Pattern-based ClamAV signatures are designed to match subsequences in binary data. We therefore apply a longest common subsequence (LCS) algorithm to all of the extracted functions.
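The idea can be illustrated with a small sketch for just two inputs. It uses the textbook dynamic-programming LCS, not the variant BASS relies on, and then joins the runs that are contiguous in both inputs with ClamAV's `*` wildcard (which matches any number of bytes). The function names, the byte strings, and the `min_run` cutoff for overly short fragments are assumptions made for illustration.

```python
def lcs_pairs(a: bytes, b: bytes):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward through the table to recover the matching positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def to_pattern(a: bytes, b: bytes, min_run: int = 2) -> str:
    """Join runs contiguous in both inputs; put '*' wherever they diverge."""
    runs, current, prev = [], bytearray(), None
    for i, j in lcs_pairs(a, b):
        if prev is not None and (i != prev[0] + 1 or j != prev[1] + 1):
            runs.append(bytes(current))
            current = bytearray()
        current.append(a[i])
        prev = (i, j)
    if current:
        runs.append(bytes(current))
    # Drop fragments shorter than min_run (assumed cutoff) as too generic.
    return "*".join(run.hex() for run in runs if len(run) >= min_run)


# Two made-up function bodies that differ in an immediate and a call target.
func_a = bytes.fromhex("558bec83ec10e811223344c9c3")
func_b = bytes.fromhex("558bec83ec20e8aabbccddc9c3")
print(to_pattern(func_a, func_b))  # 558bec83ec*c9c3
```

The table has `len(a) * len(b)` cells, which hints at why the exact approach becomes costly as inputs grow and as more samples are added.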

Computationally, this algorithm is quite expensive even for two samples and considerably more so for several, so we use a heuristic variant described by Christian Blichmann. The result might look something like this:

Finally, the signature has to be tested before it is published. We automatically check it for false positives against our test set. For closer inspection we use Sigalyzer, a new feature of our CASC IDA Pro ClamAV plug-in for generating and analyzing signatures (an update to the plug-in will follow later). Sigalyzer highlights the parts of a binary that match the ClamAV signature that fired on it, which gives a visual representation of the signature.

Architecture

BASS is implemented as a cluster of Docker containers. The framework is written in Python and talks to all of the required tools through web services. The architecture was modeled on VxClass, which also generated ClamAV signatures using IDA Pro and BinDiff, but was later discontinued and, unlike BASS, was not available to the general public.

Limitations

BASS works exclusively with binary executables, since signatures are generated from the samples' code. In addition, it analyzes only x86 and x86_64 executables. Support for other architectures may be added in the future.

So far, BASS does not cope well with file infectors, which insert small and widely varying code snippets into host files, or with backdoors that consist mostly of non-malicious (often stolen) binary code supplemented with malicious functions. We are addressing these shortcomings by working to improve the clustering stage.

Once again, we want to stress that BASS is in alpha, and not everything works smoothly yet. We hope that developing this framework as an open source project will benefit the community, and we welcome any ideas and criticism.

Appendix

The difference between the longest common substring and the longest common subsequence

The following illustration shows the difference between the longest common substring and the longest common subsequence. In this article, the longest common subsequence is abbreviated as LCS.
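The same distinction can be shown in a few lines of code. The sketch below uses straightforward dynamic programming for both problems; the function names and the example strings are made up for illustration. A substring must be contiguous in both inputs, while a subsequence only has to keep the order of its characters and may skip over gaps.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def longest_common_subsequence(a: str, b: str) -> str:
    """Longest sequence of characters that appears in order, gaps allowed."""
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + a[i - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[-1][-1]


print(longest_common_substring("XABCDY", "ZABQCDW"))    # AB
print(longest_common_subsequence("XABCDY", "ZABQCDW"))  # ABCD
```

The extra `Q` in the second string breaks the contiguous match after `AB`, but the subsequence simply skips it and continues with `CD`.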

That's all. On June 20 you can learn about the course program in detail at the open house, which will be held as a webinar.

Source: https://habr.com/ru/post/456426/
