What is a computer virus?

In the first , second, and third issues of the blog, we talked about how autoprogrammability, a congenital feature of von Neumann computers, led to the appearance of viruses in them. The theme of the fourth release was the fundamental method of anti-virus protection - multi-level fixing of programs based on the separation of programs and data. This method is considered the most reliable of all. It was shown its relationship with other major methods.

In this article we will look at three different definitions of the notion “computer virus”. The first two were formulated by Frederick Cohen and Leonard Adleman in the fundamental works of 1984-1992 and are considered classic in modern computer science. Unfortunately, it automatically follows from them that the problem of recognizing an arbitrary virus is insoluble. As these scientists proved in the same works where they gave their definitions.
')
Therefore, we specially formulate the third definition in such a way that it makes the same problem completely solvable. To do this, we abandon several poorly formalized concepts that underlie classical definitions, and assign a wider range of information objects to viruses. Such an approach — solving a complex problem by replacing categories in which it is formulated — is often convenient and useful in the practical work of an information security specialist.

The problem of computer viruses clearly manifested about 35 years after the emergence of the concept of "stored program computer". The first authors and researchers of viruses were strongly impressed by the unprecedented phenomenon of self-replication (in the biological sense of the term) of man-made objects. Later, experts realized that this phenomenon is not so much these objects themselves as the performing environment - even as relatively simple as a Turing machine or von Neumann machine with a typical operating system, not to mention more complex architectures and network environments. It became clear that mass replication of objects is not a specific property of any one class of programs, but quite typical for all levels of a complex auto-programmable system. For example, massively replicated blocks of code in the OOP paradigm or application programs downloaded by users from a central server. But despite the discovery of the strategic role of the environment, replication is still associated in the majority of people with viruses — and forms the basis of their classical definitions.

Cohen virus definition

Fred Cohen, a graduate student at the University of Southern California, became interested in the problem of computer viruses in the early 1980s. In 1984, with the support of his scientific adviser Leonard Adleman, he wrote the first article on this topic, where he cited a very far from perfect wording: “We’re possibly evolved copy of itself. " At first, everything seemed simple, but as the topic was further developed, Cohen discovered its extreme complexity and key environmental value - both for the replication process and for deciding whether an arbitrary information object under study is a virus.

In his book in 1985 and his dissertation in 1986, Cohen already gave a rigorous definition of a virus, in which he concentrated on his only property — recursive replication. The definition was given only for an abstract model based on a Turing machine (note that a real computer usually has less predictability than its ideal model). No other properties than recursive replication are considered in Cohen’s model. That is, it is a good particular case model of recursively reproducible algorithms, but a poor model of real computer viruses — especially given the observed diversity of their types and the need for strict recursion for distribution.

But the Cohen model of the virus became classical because it turned out to be simple, visual, and taking into account the role of the performing environment, which is necessary for understanding the essence of the problem. In this model, in order to determine whether a studied information object is a virus (finite sequence of characters, program, code), it should be considered only in the context of a computer-environment, in a pair with it and in interaction with it - for as many steps forward as it takes execution code by machine. The virus was defined as a code whose execution will result in writing a copy of this code on the tape of the Turing machine ahead of it, i.e. in future.

But for the Turing machine (as proved by its author in 1936) it is impossible to predict the future. For an arbitrary code, the result of its execution is unpredictable, i.e. a sequence of machine tape (memory) states for an unlimited number of ticks ahead. The only way to figure out how this code will end is to test it in practice. In other words, to find out whether the code being studied is a virus, you need to run it and see what happens. Given the uncertainty of the result, in a real system it is not safe. In addition, the waiting time for the execution of the code can be arbitrarily large (infinite). And without information about what the code will lead to, you cannot judge whether it is a virus.

Please note that analyzing the code without starting it is impossible even from the most convenient position - an external observer who is not limited in the means for studying the machine and the code that are in the original static states. Moreover, there can be no question of analyzing the code without starting it with the tools of the machine itself, as well as analyzing the code in the process of the execution of another code by the machine.

In addition to the disappointing conclusion about the impossibility of reliable recognition of viruses in his model, Cohen proved the following:
a) to an arbitrary code you can always choose a machine that interprets it as a virus; for example, for some machine, the virus will be a single-byte code - 42;
b) some machines interpret any code as a virus;
c) some machines do not interpret any code as a virus.

Subsequently, Cohen returned to this topic more than once. So, in 1992, he published an article where a strict definition of the concept “computer worm” was cited (a special case of a virus for some environments, in particular, multiprocessor ones). The article confirmed the previous conclusion: the recognition of viruses and similar objects in general is impossible.

Adleman virus detection

The topic of computer viruses has never been central to Leonard Adleman (co-author of public-key encryption technology and winner of the 2002 Turing Award), who pay much more attention to other problems of information technology, mathematics and molecular biology. But as a result of his collaboration with Cohen, he found it necessary, somewhat later, in 1990, to write and publish An Abstract Theory of Computer Viruses . The Adleman model is very different from the Cohen model. But even for this model an evidentiary conclusion is made about the impossibility of solving the problem of recognizing an arbitrary virus.

Adleman, as a professional mathematician, defined the virus in categories and terms of recursive functions. For the definition, a new concept was brought in - the original, reference, program not yet infected with this virus. The virus was defined as a recursive function that meets some of the described criteria and performs a mapping ("map") of the reference program to another that is considered infected. As actions of the infected program, Adleman called “damage” (information to which this program has access), “infection” (other information objects) and “imitation” (normal behavior of the original, uninfected program). As in Cohen's work, the virus is considered in conjunction with the environment (one function - in interaction with another). Recursive replication remained the decisive criterion of the virus, and another decisive criterion was harmfulness (formalization of which in the Adleman model is difficult to understand and use as a practically applicable algorithm).

Adleman’s definition is extremely latitude and abstract. We provided a link to the original work, so that readers can independently draw conclusions about how far this definition is proportionate, complete, accurate and applicable to real computer viruses in a real environment 20 years after publication.

In 2005, Chinese researchers Zhihong Zuo, Mingtian Zhou and Qing-xin Zhu published a paper , On the Time Complexity of Computer Viruses , extending Adleman’s model to complex types of viruses, in particular, to polymorphic viruses. This required a significant complication of the model, but did not change its essence. The paper confirms the previous conclusion that it is impossible to recognize an arbitrary virus, more precisely, the possibility of the existence of viruses that are fundamentally unrecognizable within the framework of the model used.

In the work of Adleman two points seem to us especially important. First, as a strategic — and, in fact, the only — means of protection against viruses, he rightly called complete and unconditional isolation of a computer, i.e. closeness of the environment (with regard to alternative methods of protection, the work only raised the question of how real they are). Secondly, the scientist showed the need to bring the concept of the original, uninfected program as a prerequisite for determining the virus.

Recall that the isolation of computer systems in relation to viruses always means the mutual isolation of systems controlled by different hosts. Of course, there is no need for mutual isolation of individual elements of any closed system, which is under the complete control of one owner. The owner programs his own system as he sees fit (this is the difference between the owner and the nominal owner). Therefore, it is hardly possible to use the concept of “harmfulness” outside the social context, exclusively in technological categories: no processes of changing programs and data themselves have “harmfulness”.

Virus detection by code violation

The impossibility of reliable recognition of viruses in the framework of classical models gives reason to think about the approach to solving the problem from the other side.

In the traditional approach, the virus model is created as accurate as possible, based on the concepts of “recursive replication” and, optionally, “harmfulness”. Within the framework of the model created, the definition of a virus is given. Then, it is stated that it is impossible to use this definition to assess the compliance of the information objects being examined with it.

But you can do the opposite: aiming at the practical applicability of the definition of a virus for a simple and guaranteed solution to the problem of its recognition — and for a comprehensive solution to the problem of digital infections in general. Then, using this definition, you can find out which model of the virus corresponds to it.

For information security purposes, the most convenient will be a strictly formalized, proportionate and system definition, which makes it easy to carry out an assessment of compliance with arbitrary information objects. The procedure for comparing the properties of each object under study with the definition should be in the form of a short, reliable and well-controlled algorithm, so that it can be automated and put into practice.

The simplest and most obvious solution is to consider a virus any intrusion of a third-party code (that is, an arbitrary sequence of characters) into a reference, source, uninfected program. This means the rejection of the resource-intensive and in the general case the unsolvable task of analyzing the processes of interaction between the code being tested and the machine - and the rejection of attempts to formalize these processes. A virus is determined not through the properties of a sequence of characters, but through its location in a machine (computer, computer system). In other words, the same sequence of characters may or may not be considered a virus, depending on where it is located in relation to the “coordinate system” of the machine — exactly where the Turing tape, exactly which area of RAM, hard disk, and t .P. A virus is any sequence of characters that is “in the wrong place”, i.e. in the area reserved for another information object - the protected code.

A rigorous definition for a conventional SISD machine, which can be adapted for a wider range of information systems, is formulated as follows: a computer virus is a modified part of the code compared to a standard.

Using in practice a model that meets the above definition of a virus, you are reinsured and go into a deep and very reliable defense. Instead of analyzing unknown code to accurately answer the question about the possibility of initiating recursive replication, you assume that this possibility is not excluded. Instead of analyzing the unknown code to accurately answer the question of its harmfulness, you proceed from the fact that entering into the protected code any, even the smallest change with a certain probability causes a violation of its functions - just as a violation of the functions of biological DNA can be caused by substitution single nucleotide in sequence.

For the convenience of solving many practical problems in complex information systems, the definition can be extended to those data areas that should have the same immunity in the system as the executable code. In general, any protected information object is declared infected, within the boundaries of which another object has fallen.

Note that in the absence of a reference for an arbitrary code, all this code, by this definition, is considered a virus. Although it sounds unusual, it must be well understood and always kept in mind. About where the benchmark comes from, we'll talk in the next blog posts: this will require the involvement of social factors.

Linguistic reference. The word "virus" appeared in modern languages a little over 100 years ago, after the discovery of biological viruses. Their characteristic feature is replication, executed by the corresponding medium. But the word "virus" has another, original meaning: in Latin, this word means poison. No connection with replication: a virus is an object characterized by a violation of the functions of an external object into which it has penetrated. The use of the term “computer virus” in the context of destructive impact is as correct as in the context of replication. The use of informal terms “malicious” or “malicious” is undesirable, especially in technological terminology, outside the social context. The term “anti-virus protection” (rather than “anti-malware” and “anti-malware”) is most stable in Russian and has obvious connotations.

Which virus definition is better?

Compare the approaches used in different models and definitions of the virus.

A virus is:
a) a function corresponding to certain criteria that displays a reference, uninfected object to an infected object that is different from it;
b) any function that maps a reference, uninfected object to an infected object that is different from it.

To recognize a virus in a scanned object, you need:
a) conduct a full analytical or algorithmic study of the properties and behavior of the system consisting of the object being tested and the machine (environment);
b) determine the location of the inspected object in the machine (environment).

The definition of a virus is based on:
a) the property of recursive replication;
b) method of comparing a sample with a reference.

We emphasize once again the essential role of the environment in the replication process. For example, the sequences of characters “SENDMAIL” (8 bytes), “COPY” (4 bytes) or “CP” (2 bytes), which can be contained in a simple virus and with which it can operate, do not carry anything like that would be specific to the replication process (especially recursive). Nothing of the kind also contains a short byte sequence that calls the copying function from the operating system API in native code. In this sequence, if we consider it outside the environment, there is neither replication mechanism, nor its description, nor even a hint of it.

Moreover: the real virus may not even contain this information - literally! Instead, a virus, for example, can include an arbitrarily complex cryptoalgorithm, and the key to this algorithm can be an arbitrary combination of environmental objects that can appear in it in an uncertain future. This means that until some unknown moment, before the medium performs some unknown condition, the fundamental absence in the virus code of any connections with the concepts of both replication and recursion can be guaranteed with the accuracy of a mathematical proof. But even in this case, the virus, merged in its properties with the environment and forming a complex system distributed in space and time with its elements, retains the possibility of reproducing the code.

Thus, in modern large information systems, the formalization and search for signs of recursive replication in separate objects seem useless even from a theoretical point of view. And to consider each object in the aggregate with the whole environment is impossible. If the input data for the analysis of a single object requires the substitution into the abstract formula of complete information about the state of a practically infinite (due to its huge size and permanent variability) environment, then a dead end path is chosen. Earlier we have already shown that the Internet is a unified auto-programmable medium, a unified computing system, events in which can change the code in a single system unit. Therefore, all these events should be taken into account in the formal definition of a virus, if it is based on the properties of the code, both in the abstract model and in practice. Both are excluded.

In the conditions of the global network, where unlimited combinations of direct and reverse connections between elements are possible, methods of mass replication of code that are not related to classical, recursive ones are obvious and used for a long time. If a dangerous code infects an arbitrary system block of this network, then it does not matter whether the infection is recursive — or the child code is formed by the parent using a much more complex algorithm. The algorithm can be any. And on the basis of an unknown algorithm it is impossible to build any exact models applicable in practice.

It should not be forgotten that in real computing systems viruses are programmed (i.e., controlled) by humans. And all that is controlled by man, almost defies formalization.

It follows from the above that a formal definition of a computer virus, based not on the basis of recursive replication, but on fundamentally different signs, may be the best in many cases.

This does not mean that the concept of recursive function has ceased to have meaning in the context of viruses. It remains extremely important, although it cannot be regarded as decisive. It does not allow to accurately position the goal, the phenomenon, the object of the threat and its semantic essence with the use of an acceptable mathematical apparatus. Therefore it is better to always keep in mind alternatives. A clear understanding by the information security specialist of several different definitions of a virus, their corresponding theoretical models and aspects of anti-virus protection provides the greatest opportunities for solving practical problems.

PS Of course, other definitions are possible, except those given in the article. You can search for them on the Web or create one yourself. The main thing to keep in mind is that according to the norms of formal logic, a competent definition should allow, at a minimum, to determine the desired object among others :)

* * *

Read in the next issue:
On the dangers of copyright and the benefits of licensed programs.

Source: https://habr.com/ru/post/123180/

All Articles

What is a computer virus?

Cohen virus definition

Adleman virus detection

Virus detection by code violation

Which virus definition is better?

More articles: