Viruses. Viruses? Viruses! Part 1

Shall we talk about computer viruses? No, not about whatever your antivirus caught yesterday. Not about the thing you downloaded disguised as the installer for the next Photoshop. Not about the rootkit sitting on your server, disguised as a system process. Not about search bars, downloaders and other malware. Not about code that does bad things on your behalf and wants your money. No, all of that is commerce, no romance...

We will talk about computer viruses as code that is capable of producing copies of itself, changing from generation to generation. Code which, like its biological counterparts, needs a carrier file that works, and keeps working, in order to give life to new generations of the virus. Code which needs a fertile environment for reproduction: plenty of tasty executable files, and plenty of careless, active users to run them. So the name “virus” is not just a pretty label for a malicious program: a computer virus, in its classical sense, is an entity very close to its biological counterpart. Mankind, as has been proved repeatedly, is capable of very sophisticated solutions, especially when it comes to creating something harmful to other people.

So, a long time ago, after DOS reached the masses and every programmer got a little universe of his own, where the address space was shared and file permissions were always rwx, someone wondered whether a program could copy itself. “Of course it can!” said the programmer, and wrote code that copies its own executable file. The next thought was “can two programs be merged into one?” “Of course they can!” said the programmer, and wrote the first infector. “But what for?” he wondered, and that was the beginning of the era of computer viruses. As it turned out, trying in every possible way to avoid detection on a computer is a lot of fun, and creating viruses is very interesting from a system programmer's point of view. Besides, the antiviruses that appeared on the market presented virus writers with a serious professional challenge.
Well, that is enough lyricism for one article, let's get down to business. I want to talk about the classic virus, its structure, the basic concepts, detection methods and the algorithms both sides use to win.

Virus anatomy

We will talk about viruses that live in executable files of the PE and ELF formats, that is, viruses whose body is executable code for the x86 platform. In addition, let our virus not destroy the original file: it fully preserves the file's functionality and correctly infects any suitable executable. Yes, breaking things is much easier, but we agreed to talk about proper viruses, right? To keep the material relevant, I will not spend time on the old COM-format infectors, although it was on them that the first advanced techniques for working with executable code were honed.

The main parts of a virus's code are the infector and the payload. The infector is the code that searches for files suitable for infection and injects the virus into them, trying to hide the injection as well as possible without damaging the file's functionality. The payload is the code that performs the actions the virus writer actually wants, for example sending spam, DoS-ing someone, or simply leaving a text file saying “Virya was here” on the machine. What exactly is inside the payload does not matter to us; what matters is that the virus writer tries in every way to hide its contents.

Let's start with the properties of virus code. To make the code easier to embed, one does not want to keep code and data separate, so the data is usually integrated directly into the executable code. For example, like this:

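    jmp message

the_back:
    mov eax, 0x4        ; sys_write
    mov ebx, 0x1        ; stdout
    pop ecx             ; pop the address of the «Hello, World» string off the stack
    mov edx, 0xF        ; length of the string
    int 0x80
    ...

message:
    call the_back       ; pushes the return address onto the stack, i.e. the address of «Hello, World\n»
    db "Hello, World!", 0Dh, 0Ah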

Or like this:

    push 0x68732f2f     ; “hs//”
    push 0x6e69622f     ; “nib/”
    mov ebx, esp        ; ESP now holds the address of the string «/bin/sh»
    mov al, 11
    int 0x80

Under certain conditions, any of these code variants can simply be copied into memory and run with a JMP to its first instruction. Written correctly, with care taken over offsets, system calls, keeping the stack clean before and after execution and so on, such code can be embedded in a buffer that holds someone else's code.

Suppose the virus writer is able to write virus code in this style and now needs to inject it into an existing executable file. He has two things to take care of:

Where to put the virus? He has to find a place large enough to hold the virus and write it there, if possible without breaking the file, and in such a way that code execution is allowed in the region where the virus ends up.
How to transfer control to the virus? Simply placing the virus in the file is not enough: something must also jump to its body and, once it has done its work, return control to the victim program. Or in some other order, but in any case we agreed not to break anything, right?

So, let's deal with injection into the file. The modern executable formats for the x86 platform on Windows and Linux are PE (Portable Executable) and ELF (Executable and Linkable Format). Their specifications are easy to find in the system documentation, and if you deal with protecting executable code you will certainly not get by without them. Executable formats and the system loader (the operating system code that launches an executable file) are one of the “elephants” on which the operating system stands. Launching an .exe file is algorithmically very complex, with a heap of nuances, and could fill a dozen articles, which you will surely find for yourself if the topic interests you. I will confine myself to a simple overview, sufficient for a basic understanding of the launch process. To avoid having tomatoes thrown at me: by “compiler” I will mean the whole toolchain that turns source code into a finished executable file, that is, in effect, compiler plus linker.

An executable file (PE or ELF) consists of a header and a set of sections. Sections are aligned (see below) buffers holding code or data. When the file is launched, the sections are copied into memory, and memory is allocated for them, not necessarily in the same amount they occupied on disk. The header contains the section layout: it tells the loader how the sections are arranged in the file while it lies on disk, and how they must be arranged in memory before control is transferred to the code inside the file. We are interested in three key parameters of each section: psize, vsize and flags. Psize (physical size) is the size of the section on disk. Vsize (virtual size) is the size of the section in memory once the file has been loaded. Flags are the section's attributes (rwx). Psize and vsize can differ significantly: for example, if the programmer declares an array of a million elements but is going to fill it at run time, the compiler will not increase psize (the contents of the array need not be stored on disk before launch), whereas vsize will grow by a million elements (enough memory for the array must be allocated at run time).
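To make the three parameters concrete, here is a minimal sketch in Python that decodes one 40-byte PE section header and prints psize, vsize and the flags. The names Section and parse_section are made up for this illustration, and the header bytes are built by hand rather than read from a real file.

```python
import struct
from dataclasses import dataclass

# Memory-access bits of the PE section "Characteristics" field.
MEM_EXECUTE = 0x20000000
MEM_READ = 0x40000000
MEM_WRITE = 0x80000000

# Layout of one 40-byte PE section header (IMAGE_SECTION_HEADER).
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


@dataclass
class Section:
    name: str
    vsize: int  # size in memory after loading
    psize: int  # size of the section's data on disk
    flags: str  # access attributes rendered as "rwx"


def parse_section(raw: bytes) -> Section:
    """Decode one section header; only the fields discussed above are kept."""
    (name, vsize, _vaddr, psize, _praw,
     _relocs, _lines, _nrelocs, _nlines, chars) = SECTION_HEADER.unpack(raw)
    flags = "".join(
        letter if chars & bit else "_"
        for letter, bit in (("r", MEM_READ), ("w", MEM_WRITE), ("x", MEM_EXECUTE))
    )
    return Section(name.rstrip(b"\0").decode("ascii", "replace"), vsize, psize, flags)


# Hand-made header (an assumption for the example): a data section that
# needs a million bytes at run time but stores only 0x200 bytes on disk.
example = SECTION_HEADER.pack(b".data", 1_000_000, 0x3000, 0x200, 0x800,
                              0, 0, 0, 0, MEM_READ | MEM_WRITE)
section = parse_section(example)
print(section)
```

A real parser would first locate the section table through the PE headers; that part is omitted here.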

The flags (access attributes) will be assigned to the memory pages onto which the section is mapped. For example, a section with executable code will have the attributes r_x (read, execute), and a data section will have rw_ (read, write). A processor that tries to execute code on a page without the execute flag raises an exception, and the same goes for an attempt to write to a page without the w attribute; so when placing the virus code, the virus writer must take into account the attributes of the memory pages where it will end up. Until recently, the standard sections of uninitialized data (for example, the program's stack area) had the attributes rwx (read, write, execute), which made it possible to copy code straight onto the stack and execute it there. Nowadays that is considered unfashionable and insecure, and in recent operating systems the stack area is for data only. Of course, the program itself can change the attributes of a memory page at run time, but that complicates the implementation.

The header also contains the Entry Point: the address of the first instruction, from which execution of the file begins.

We must also mention a property of executable files that is important to the virus writer: alignment. So that a file can be read from disk and mapped into memory optimally, the sections in executable files are aligned on boundaries that are multiples of powers of two, and the free space left over from alignment (padding) is filled with something at the compiler's discretion. For example, it is logical to align sections to the size of a memory page: then a section is convenient to copy into memory whole and to assign attributes to. I will not even try to recall all these alignments: wherever there is a more or less standard piece of data or code, it is aligned (any programmer knows that a kilometer is exactly 1024 meters). And the descriptions of the Portable Executable (PE) and Executable and Linkable Format (ELF) standards are desk reference books for anyone who works with methods of protecting executable code.
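The alignment arithmetic is easy to show in a few lines of Python. This is only a sketch: the helper names align_up and padding_after and the sizes used are hypothetical, chosen for illustration.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def padding_after(content_size: int, alignment: int) -> int:
    """Number of filler bytes between the end of the content and the aligned end."""
    return align_up(content_size, alignment) - content_size


# Hypothetical numbers: 0x1234 bytes of code, 0x200 alignment on disk,
# 0x1000 (one memory page) alignment in memory.
print(hex(align_up(0x1234, 0x200)))   # 0x1400
print(padding_after(0x1234, 0x200))   # 460
print(hex(align_up(0x1234, 0x1000)))  # 0x2000
```

A detector can use the same arithmetic to learn how much filler a section should contain, and then check whether that filler really holds only the compiler's fill bytes.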

Since the addresses within all of these sections are interlinked, simply shoving a piece of code into the middle of a section and “bandaging” it with a JMP will not work: the original file will break. Therefore, the popular places to inject virus code are:

The main code section (the virus overwrites the beginning of the executable code, starting right at the Entry Point).
The padding between the end of the header and the first section. There is nothing there, and a small virus (or its loader) can quite possibly fit there without breaking the file.
A new section, which can be added to the header and placed in the file after all the others. In this case no internal offsets break, and there is no problem with space either. True, a last section of the file in which execution is allowed will of course attract the attention of a heuristic analyzer.
The padding between the end of a section's contents and its aligned end. This is much harder, because that very “end” first has to be found, and there is no guarantee that we will be lucky and there will be enough space. But for some compilers this place can be found simply by its characteristic bytes.

There are more cunning ways as well; I will describe some of them in the second article.

Now about the transfer of control. For the virus to work, its code must somehow get control. The most obvious way: first the virus gets control, and then, once it has finished, the host program does. This is the simplest approach, but variants in which the virus gets control after the host has finished its work, or in the middle of execution, “replacing” the execution of some function, also have a right to exist. Here are a few control-transfer techniques (the term Entry Point, or EP, used below is the address to which the system loader transfers control once it has prepared the executable file for launch):

A JMP to the body of the virus replaces the first bytes at the file's Entry Point. The virus saves the overwritten bytes in its body and, when its own work is done, restores them and transfers control to the beginning of the restored buffer.
A method similar to the previous one, but instead of raw bytes the virus saves several whole machine instructions from the Entry Point; then, without restoring anything (only making sure the stack is cleaned up correctly), it can execute them after finishing its own work and transfer control to the address of the instruction following the “stolen” ones.
As with injection, there are more cunning ways as well, but we will consider them below or postpone them until the next article.

All of these are ways to insert a buffer of code into an executable file correctly. At the same time, items 2 and 3 imply functionality that can tell which bytes are instructions and where the boundaries between instructions lie. After all, we cannot “tear” an instruction in half: everything would break. So we come smoothly to the subject of disassemblers in viruses. We will need an idea of how disassemblers work in order to consider all the proper techniques for working with executable code, so it does no harm if I describe it a little now.

If we insert our code at a position exactly between two instructions, we can save the context (stack, flags) and, after executing the virus code, restore everything, returning control to the host program. Of course, this too can run into problems if the host uses code integrity checks, anti-debugging and so on, but more on that in the second article. To find such a position, we need to:

put the pointer exactly at the beginning of some instruction (you cannot just take a random place in the executable section and start disassembling from there: the same byte may be an instruction opcode or data)
determine the length of the instruction (on the x86 architecture, instructions have different lengths)
move the pointer forward by that length; we are now at the beginning of the next instruction
repeat until we decide to stop
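The walk over instruction boundaries described in these steps can be sketched in Python with a toy length table. This is a loud simplification: the table below covers only the six fixed-length encodings that appear in this article's examples, whereas real x86 decoding has to handle prefixes, ModRM/SIB bytes and displacements. The names TOY_LENGTHS, instruction_length and instruction_starts are invented for the sketch.

```python
# Toy table: opcode -> total instruction length in bytes.
# Assumption: only these fixed-length encodings occur in the input.
TOY_LENGTHS = {
    0x68: 5,  # push imm32
    0x8B: 2,  # mov r32, r/m32 (register-to-register form only)
    0xB0: 2,  # mov al, imm8
    0xCD: 2,  # int imm8
    0x90: 1,  # nop
    0xE9: 5,  # jmp rel32
}


def instruction_length(code: bytes, pos: int) -> int:
    """Return the length of the instruction that starts at pos."""
    opcode = code[pos]
    length = TOY_LENGTHS.get(opcode)
    if length is None:
        raise ValueError(f"opcode {opcode:#04x} is outside the toy table")
    if pos + length > len(code):
        raise ValueError("truncated instruction")
    if opcode == 0x8B and code[pos + 1] < 0xC0:
        raise ValueError("memory operands are not handled by this sketch")
    return length


def instruction_starts(code: bytes, start: int = 0) -> list[int]:
    """Walk from a known instruction start and collect every boundary."""
    starts = []
    pos = start
    while pos < len(code):
        starts.append(pos)
        pos += instruction_length(code, pos)
    return starts


code = bytes.fromhex("682f2f7368 682f62696e 8bdc b011 cd80")
print(instruction_starts(code))  # [0, 5, 10, 12, 14]
```

Starting the walk at offset 1 instead of 0 lands in the middle of the first push, and the toy table rejects the byte found there, which illustrates why the first step insists on a known instruction start.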

This is the minimum functionality needed to avoid landing in the middle of an instruction, and a function that takes a pointer to a byte string and returns the length of the instruction is called a length disassembler. For example, the infection algorithm might look like this:

Choose a tasty executable file (fat enough to hold the body of the virus, with the required layout of sections, and so on).
Read our own code (the code of the virus body).
Take the first few instructions from the victim file.
Append them to the virus code (that is, save the information needed to restore the victim's functionality).
Append to the virus code a jump to the instruction that continues execution of the victim's code. Thus, after executing its own code, the virus will correctly execute the prologue of the victim's code.
Create a new section, write the virus code there and edit the header.
In place of those first instructions, put a jump to the virus code.

This is a variant of a fully correct virus, which can penetrate an executable file, break nothing, conceal its code and return execution to the host program. Now, let's catch it.

Detector anatomy

Suddenly, out of nowhere, a knight appears on a white computer, a debugger in his left hand and a disassembler in his right: the antivirus company programmer. Where did he come from? You guessed it, of course. With high probability, he came from the “adjacent field”. The antivirus field is highly respected among programmers in the know, because these people have to deal with very sophisticated algorithms under rather cramped conditions. Judge for yourself: at the input you have some hundred thousand specimens of every kind of infection and an executable file, you must work in near real time, and the cost of an error is very high.

For an antivirus, as for any automaton that makes a binary yes/no decision (infected/healthy), there are two types of error: false positive and false negative (a file mistakenly recognized as infected, and an infected file mistakenly let through). Clearly, the total number of errors must be reduced in any scenario, but for an antivirus a false negative is much more unpleasant than a false positive. “After downloading the torrent, turn off your antivirus before installing the game”: sound familiar? That is a “false positive”: crack.exe, which writes something into an executable .exe file, looks like a virus to a reasonably smart heuristic analyzer (see below). As the saying goes, better to overdo it than to fall short.

I don't think I need to describe the components of a modern antivirus to you; they all revolve around one piece of functionality, the antivirus detector. The monitor that checks files on the fly, disk scanning, checking of mail attachments, quarantine and remembering already-checked files are all a wrapper around the main detection core. The second key component of an antivirus is the regularly updated signature databases, without which the antivirus cannot be kept up to date. The third component, important enough to deserve a separate series of articles, is monitoring the system for suspicious activity.

So (we are considering classic viruses), at the input we have an executable file and one of hundreds of thousands of potential viruses inside it. Let's detect it. Suppose this is a piece of the virus's executable code:

    XX XX XX XX XX XX                       ; beginning of the virus, N bytes long
    . . .
    68 2F 2F 73 68      push 0x68732f2f     ; “hs//”
    68 2F 62 69 6E      push 0x6e69622f     ; “nib/”
    8B DC               mov ebx, esp        ; ESP holds the address of the string «/bin/sh»
    B0 11               mov al, 11
    CD 80               int 0x80
    XX XX XX XX                             ; end of the virus, M bytes long
    . . .

The first impulse is simply to take the run of opcodes (68 2F 2F 73 68 68 2F 62 69 6E 8B DC B0 11 CD 80) and search the file for this byte string. If it is found: gotcha, you reptile. But alas, it turns out that the same run of bytes occurs in other files too (well, who knows who else calls the command interpreter), and there are hundreds of thousands of such strings to look for; if you search for each of them, no optimization will save you. The only fast and correct way to check whether such a string is present in a file is to check for it at a FIXED offset. Where do we get that offset?

Let us recall the “adjacent field”, especially the parts about where the virus puts itself and how it transfers control to itself:

The virus is embedded in the padding between the header and the beginning of the first section. In this case, the byte string can be checked for at the offset “header length” + N (where N is the number of bytes from the beginning of the virus to the byte string).
The virus lies in a new, separate section. In this case, the byte string can be checked for relative to the beginning of every section that contains code.
The virus has squeezed into the padding between the end of the code and the end of the code section. Here a negative offset from the end of the section can be used: “end of the code section” - M - “length of the byte string” (where M is the number of bytes from the end of the byte string to the end of the virus code).

Now, from the same place, about the transfer of control:

The virus has written its instructions directly over the instructions at the Entry Point. In this case we look for the byte string simply at the offset “Entry Point” + N (where N is the number of bytes from the beginning of the virus to the byte string).
The virus has written a JMP to its own body at the Entry Point. In this case we must first calculate where this JMP leads, and then look for the byte string at the offset “JMP target address” + N (where N is the number of bytes from the beginning of the virus to the byte string).
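For the last case, “calculating where the JMP leads” can be sketched as follows. The sketch assumes the simplest encoding, a near JMP with opcode E9 and a 32-bit relative displacement, and a flat buffer in which offsets and addresses coincide; the function names jmp_target and signature_offset are hypothetical.

```python
import struct
from typing import Optional


def jmp_target(image: bytes, ep: int) -> Optional[int]:
    """Return where a near JMP (opcode E9, rel32) located at ep leads, else None."""
    if ep < 0 or ep + 5 > len(image) or image[ep] != 0xE9:
        return None
    (rel32,) = struct.unpack_from("<i", image, ep + 1)
    # The displacement is counted from the end of the 5-byte instruction.
    return ep + 5 + rel32


def signature_offset(image: bytes, ep: int, n: int) -> Optional[int]:
    """Offset to check: "JMP target" + N, or None if there is no JMP at ep."""
    target = jmp_target(image, ep)
    return None if target is None else target + n


# Hand-made buffer: a JMP at offset 0 that skips 11 filler bytes.
image = b"\xE9" + struct.pack("<i", 11) + b"\x90" * 11 + b"\xCC" * 8
print(jmp_target(image, 0))           # 16
print(signature_offset(image, 0, 4))  # 20
```

In a real file the target is a virtual address that still has to be translated into a file offset through the section table; that step is left out of the sketch.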

I have grown tired of writing “byte string”. It has variable length, storing it in the database is inconvenient and entirely unnecessary, so instead of the byte string itself we will use its length plus its CRC32. Such a record is very short and the comparison is fast, because the CRC32 algorithm is not slow. There is no point in chasing collision resistance for the checksum, since the probability of a collision at fixed offsets is negligible. Besides, even if a collision does occur, the error will be of the “false positive” kind, which is not so terrible. Summing up all of the above, here is the approximate structure of a record in our antivirus database:

virus ID
flags indicating what the offset is counted from (from the EP, from the end of the header, from the end of the first section, from the beginning of every section, from the target address of the JMP instruction at the EP, and so on)
offset
signature length (Lsig)
CRC32 of the signature (CRCsig)

We optimize the input (keep only the signatures that “fit” this file, and prepare the set of required base offsets from the header right away), and then:

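    {
        # for each signature
        - using the flag, compute the base offset in the file (start of the code section, entry point, etc.)
        - add offset to it
        - read Lsig bytes
        - compute the CRC32 of them
        - if it matches: we have caught the virus
    }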
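The record structure and the loop above translate almost directly into Python. This is a minimal sketch, not a real engine: the Signature class, the base names such as "entry_point" and the virus ID "Example.A" are all invented for the illustration, and the base offsets are passed in ready-made instead of being derived from a parsed header.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    virus_id: str
    base: str    # what the offset is counted from, e.g. "entry_point"
    offset: int  # added to the base offset (may be negative)
    length: int  # Lsig
    crc32: int   # CRCsig


def make_signature(virus_id: str, base: str, offset: int, sample: bytes) -> Signature:
    """Build a database record from a known byte string."""
    return Signature(virus_id, base, offset, len(sample), zlib.crc32(sample))


def scan(data: bytes, bases: dict[str, list[int]], database: list[Signature]) -> list[str]:
    """Check every signature at its fixed offset and return the IDs that match."""
    found = []
    for sig in database:
        # One flag may yield several bases, e.g. the start of every section.
        for base in bases.get(sig.base, []):
            start = base + sig.offset
            if start < 0 or start + sig.length > len(data):
                continue
            if zlib.crc32(data[start:start + sig.length]) == sig.crc32:
                found.append(sig.virus_id)
                break
    return found


pattern = bytes.fromhex("682f2f7368682f62696e8bdcb011cd80")
database = [make_signature("Example.A", "entry_point", 6, pattern)]

# Hand-made "file": entry point at offset 64, byte string 6 bytes after it.
infected = b"\x90" * 64 + b"\xAA" * 6 + pattern + b"\x90" * 10
clean = b"\x90" * 128
print(scan(infected, {"entry_point": [64]}, database))  # ['Example.A']
print(scan(clean, {"entry_point": [64]}, database))     # []
```

Because each record costs one short read and one CRC32, the running time grows with the number of records that apply to the file rather than with the file's size, which is the point of the fixed offsets.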

Hurray, here is our first antivirus. It is pretty good, because with a fairly complete signature database, sensibly chosen flags and good optimization, this detector can catch 95% of all infections very quickly (the vast majority of modern malware consists of plain executable files with no ability to mutate at all). After that the game becomes “who updates the signature database faster” and “who gets sent a new specimen of some nasty thing first”.

Collecting and cataloguing this “filth” is a quite non-trivial task, but it is absolutely necessary for testing the detector properly. Assembling a reference collection of executable files is not easy: you have to try to find every specimen of infected files (several specimens for the complex cases), catalogue them, mix them with “clean” files and regularly run the detector over the lot in order to reveal detection errors. Such a collection takes years to build and is a very valuable asset of antivirus companies. Perhaps I am mistaken and one can in fact be obtained (all sorts of online virus-checking services are quite capable of providing some analogue of it), but when I was working on this problem, nothing of the kind was available (at least for Linux).
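Running a detector over such a labelled collection and counting the two kinds of error might look roughly like this. Everything here is a stand-in: the four-item collection is made up, and toy_detector is deliberately naive so that both error types show up.

```python
from typing import Callable, Iterable


def error_counts(samples: Iterable[tuple[bytes, bool]],
                 detector: Callable[[bytes], bool]) -> dict[str, int]:
    """Compare detector verdicts with the known labels of a reference collection."""
    counts = {"false_positive": 0, "false_negative": 0, "correct": 0}
    for data, infected in samples:
        flagged = detector(data)
        if flagged and not infected:
            counts["false_positive"] += 1
        elif infected and not flagged:
            counts["false_negative"] += 1
        else:
            counts["correct"] += 1
    return counts


def toy_detector(data: bytes) -> bool:
    """Naive stand-in: flags any file that contains the bytes CD 80 anywhere."""
    return b"\xcd\x80" in data


# Made-up collection of (file contents, is it really infected?) pairs.
collection = [
    (b"\x90\xcd\x80", True),   # infected and flagged
    (b"\x90\x90", False),      # clean and not flagged
    (b"data\xcd\x80", False),  # clean but flagged: false positive
    (b"\x00", True),           # infected but missed: false negative
]
print(error_counts(collection, toy_detector))
```

The third sample shows the problem with searching for a byte string anywhere in the file: a clean file that merely contains the same bytes gets flagged.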

Heuristic analyzer

What a scary phrase, “heuristic analyzer”; you will no longer see it in antivirus interfaces (it probably frightens users). It is one of the most interesting parts of an antivirus, since everything that does not fit into any of the other engines (neither the signature engine nor the emulator) gets shoved into it, and it resembles a doctor who sees that the patient is coughing and sneezing but cannot identify the specific disease. This is code that checks the file for certain characteristic signs of infection. Examples of such signs:

an incorrect file header (corrupted by a virus, but still workable)
a JMP right at the entry point
“rwx” on the code section

And so on. Besides indicating the fact of infection, a heuristic can help decide whether a “heavier” analysis of the file is worth running. Each sign has its own weight, from “somewhat suspicious” to “I don't know what it is, but the file is definitely infected”. It is these signs that produce most of the “false positive” errors.
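A weighted combination of such signs can be sketched in a few lines of Python. The sign names, weights and thresholds below are invented for the illustration; a real analyzer would tune them against a reference collection of files.

```python
# Hypothetical weights for the signs listed above.
WEIGHTS = {
    "malformed_header": 40,
    "jmp_at_entry_point": 30,
    "rwx_code_section": 30,
}
SUSPICIOUS = 30  # from here on, a heavier analysis is worth running
INFECTED = 70    # from here on, the file is reported as infected


def verdict(signs: set[str]) -> str:
    """Sum the weights of the observed signs and map the score to a verdict."""
    score = sum(WEIGHTS.get(sign, 0) for sign in signs)
    if score >= INFECTED:
        return "infected"
    if score >= SUSPICIOUS:
        return "suspicious"
    return "clean"


print(verdict(set()))                                     # clean
print(verdict({"jmp_at_entry_point"}))                    # suspicious
print(verdict({"malformed_header", "rwx_code_section"}))  # infected
```

Lowering the thresholds catches more infected files at the price of more false positives, which is exactly the trade-off described above.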

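EPO (Entry Point Obscuring) is a technique in which the virus hides its point of entry: instead of patching the Entry Point itself, it replaces an instruction somewhere deeper in the host code, such as a JMP, CALL or RET, with a transfer of control to its own body.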

Source: https://habr.com/ru/post/228681/
