CleanTalk Malware Scanner - Heuristic Code Analysis

We already covered the launch of our WordPress security service in a previous article. Today we want to talk about the launch of heuristic analysis for identifying malicious code.

The mere presence of malicious code can get a site banned from search results or flagged in the results with a warning that it is infected. Search engines do this to protect users from potentially dangerous content.

You can also look for malicious code yourself, but this is a lot of work, and most WordPress users lack the skills needed to find and remove unwanted lines of code.

Authors of malicious code often camouflage it, which makes it hard to detect by signature. The malicious code itself can be located anywhere on the site: for example, obfuscated PHP code can sit in a logo.png file and be invoked by a single inconspicuous line in index.php. For this reason, using a plugin to search for malicious code is preferable.
On the first scan, CleanTalk checks all WordPress core files, plugins, and themes. On repeated scans, it checks only the files that have changed since the last scan. This saves resources and speeds up scanning.
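The article does not say how changed files are detected. As a minimal sketch, assuming the scanner keeps a map of file paths to content hashes from the previous run, the selection could look like this (find_changed_files and the map layout are hypothetical, not the plugin's actual API):

```php
<?php
// Hypothetical helper, not CleanTalk's real code.
// Returns [path => hash] for files that are new or whose content hash
// differs from the one recorded during the previous scan.
function find_changed_files(array $paths, array $previousHashes): array
{
    $changed = [];
    foreach ($paths as $path) {
        $hash = hash_file('sha256', $path);
        if ($hash === false) {
            continue; // unreadable file; a real scanner would report it
        }
        if (!isset($previousHashes[$path]) || $previousHashes[$path] !== $hash) {
            $changed[$path] = $hash;
        }
    }
    return $changed;
}
```

Only the returned files would then be passed to the slower analysis, and their new hashes merged into the stored map.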

How heuristic analysis works

One of the main drawbacks of heuristic analysis is that it is quite slow, so we use it only when it is really needed. First, we break the source code into lexemes (the smallest language constructs) and discard everything unnecessary:

Whitespace characters.
Comments of all kinds.
Non-PHP code (anything outside the <?php ?> tags).
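This step can be sketched with PHP's built-in tokenizer. The example below is an illustration and not the plugin's code; strip_noise_tokens is a hypothetical name. It relies on token_get_all, which marks whitespace, comments, and text outside the PHP tags with their own token types:

```php
<?php
// Illustrative sketch: returns the token list of $source without
// whitespace, comments, and text outside the PHP tags.
function strip_noise_tokens(string $source): array
{
    $skip = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_INLINE_HTML];
    $kept = [];
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ';' come back as plain strings,
        // all other tokens as [id, text, line] arrays.
        if (is_array($token) && in_array($token[0], $skip, true)) {
            continue;
        }
        $kept[] = $token;
    }
    return $kept;
}
```

For a file such as `<h1>Hi</h1><?php /* c */ echo 'a'; ?>`, only the open tag, the echo, the string, the semicolon, and the close tag remain.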

Next, we recursively simplify the code until no “complex structures” remain:

We evaluate string concatenations.
We substitute the values of variables into other variables.
And so on.
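As an illustration of the first simplification, here is a minimal sketch that folds concatenations of single-quoted string literals on the token level. It is written under simplifying assumptions: literal_value and fold_concatenation are hypothetical names, double-quoted strings and variables are ignored, and operator precedence around the concatenation is not checked, which a real implementation would have to handle.

```php
<?php
// Decodes a single-quoted PHP string literal; returns null for anything else.
function literal_value($token): ?string
{
    if (!is_array($token) || $token[0] !== T_CONSTANT_ENCAPSED_STRING || $token[1][0] !== "'") {
        return null;
    }
    return strtr(substr($token[1], 1, -1), ["\\\\" => "\\", "\\'" => "'"]);
}

// Repeatedly folds 'a' . 'b' into 'ab' until no such pair is left.
function fold_concatenation(string $source): array
{
    // Drop whitespace first so that operands and '.' are adjacent tokens.
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn($t) => !is_array($t) || $t[0] !== T_WHITESPACE
    ));
    do {
        $changed = false;
        for ($i = 0; $i + 2 < count($tokens); $i++) {
            $left = literal_value($tokens[$i]);
            $right = literal_value($tokens[$i + 2]);
            if ($left === null || $right === null || $tokens[$i + 1] !== '.') {
                continue;
            }
            // var_export() re-quotes the merged text as a valid literal.
            $merged = [T_CONSTANT_ENCAPSED_STRING, var_export($left . $right, true), $tokens[$i][2]];
            array_splice($tokens, $i, 3, [$merged]);
            $changed = true;
            break; // restart the pass, the merged token may fold again
        }
    } while ($changed);
    return $tokens;
}
```

With this sketch, `$f = 'base' . '64_' . 'decode';` ends up with a single string token `'base64_decode'`, which later checks can match directly.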

While simplifying the code, we also track the origin of variables and much more.

As a result, we get clean code that can be analyzed. It is very important that we obtain the code not as a string but as a sequence of tokens. This way we know which token is a string containing the text we are looking for and which token is a function call.

This distinction matters when searching for a “bad construction” such as eval. Compare:

<?php echo 'eval("echo \"some\"")'; ?>

- in this case there is no T_EVAL lexeme; instead there is a T_CONSTANT_ENCAPSED_STRING token containing 'eval("echo \"some\"")'.

 <?php eval('echo "some"'); ?>

- in this case the T_EVAL token is present, and this is the variant we detect.
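The difference between the two snippets can be checked with PHP's tokenizer. The following is a small sketch; contains_eval_token is a hypothetical helper name, and the nowdoc strings simply hold the two examples from above:

```php
<?php
// Returns true when the source contains a real eval construct (a T_EVAL token).
function contains_eval_token(string $source): bool
{
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_EVAL) {
            return true;
        }
    }
    return false;
}

$evalInsideString = <<<'PHP'
<?php echo 'eval("echo \"some\"")'; ?>
PHP;

$realEval = <<<'PHP'
<?php eval('echo "some"'); ?>
PHP;

var_dump(contains_eval_token($evalInsideString)); // bool(false)
var_dump(contains_eval_token($realEval));         // bool(true)
```

A plain substring search for "eval" would match both snippets, while the token check matches only the second.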

We look for such constructions and group them by severity:

Critical:
- eval
- include* and require*
  - with a bad file extension
  - of nonexistent files (will be removed in future versions)
  - of remote files
Dangerous:
- system
- passthru
- proc_open
- exec
- include* and require*
  - with the error suppression operator (will be removed in future versions)
  - with variables that depend on POST or GET
Suspicious:
- base64_encode
- str_rot13
- syslog
Other.
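A classification along these lines can be sketched as a lookup over the token stream. This is a simplified illustration and not the plugin's implementation: SEVERITY_BY_FUNCTION and classify_constructs are hypothetical names, the include/require rules are left out because they depend on the simplification results, and fully qualified calls such as \exec() are not handled.

```php
<?php
// Severity table copied from the list above (function calls only).
const SEVERITY_BY_FUNCTION = [
    'system' => 'dangerous',
    'passthru' => 'dangerous',
    'proc_open' => 'dangerous',
    'exec' => 'dangerous',
    'base64_encode' => 'suspicious',
    'str_rot13' => 'suspicious',
    'syslog' => 'suspicious',
];

// Returns a list of [severity, construct, line] findings for $source.
function classify_constructs(string $source): array
{
    $noise = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn($t) => !is_array($t) || !in_array($t[0], $noise, true)
    ));
    $findings = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            continue;
        }
        [$id, $text, $line] = $token;
        $name = strtolower($text);
        if ($id === T_EVAL) {
            $findings[] = ['critical', 'eval', $line];
            continue;
        }
        // Skip method calls and declarations such as $db->exec() or function exec().
        $prev = $tokens[$i - 1] ?? null;
        $isMember = is_array($prev)
            && in_array($prev[0], [T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_FUNCTION], true);
        $isCall = ($tokens[$i + 1] ?? null) === '(';
        if ($id === T_STRING && $isCall && !$isMember && isset(SEVERITY_BY_FUNCTION[$name])) {
            $findings[] = [SEVERITY_BY_FUNCTION[$name], $name, $line];
        }
    }
    return $findings;
}
```

Because the check works on tokens, a function name that appears only inside a string literal or a comment is not reported.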

We are constantly improving this analysis: we add new constructions to search for, reduce the number of false positives, and optimize the code simplification.

We plan to teach it to detect and decode URL-encoded, Base64-encoded, and other encoded strings.

The plugin itself is available in the WordPress directory.

Source: https://habr.com/ru/post/351572/
