
Monitoring Linux system calls



If you are an engineer at an organization that runs Linux in production, I have two quick questions for you.


  1. How many unique outgoing TCP connections have your servers established in the last hour?
  2. Which processes and users initiated those connections?

If you can answer both questions, great: you can stop reading. If you can't, go-audit can get you that information.




System calls (syscalls) are the means by which user programs communicate with the Linux kernel. They are used for things like connecting network sockets, reading files, loading kernel modules, spawning new processes, and so on. If you have ever used strace , dtrace , ptrace or anything else with "trace" in its name, system calls are nothing new to you.


Most people use those trace tools to debug isolated, one-off problems, but at Slack we collect system calls as a data source for continuous monitoring, and we recommend you do the same.




Linux Audit has been part of the kernel since the 2.6 series. The audit system consists of two major components: kernel code that monitors system calls, and a userspace daemon that logs the resulting events.


Kauditd and go-audit architecture


Let's walk through an example of using auditd . Suppose we want to log every read of the file /data/topsecret.data . ( Please do not store sensitive information in a file called topsecret.data ). With auditd, we first have to tell the kernel that we want to hear about these events. We do that with the auditctl command, run as root:


 auditctl -w /data/topsecret.data -p rwxa 

Now the kernel will generate an event every time someone accesses /data/topsecret.data (including via a symlink ). The event is sent to a userspace process (usually auditd ) over something called a netlink socket. ( In short: with netlink, a process registers its PID on a special socket, and the kernel delivers messages to it there. )


In most Linux distributions, the userspace process auditd writes this data to /var/log/audit/audit.log . If no process is connected to the netlink socket, the messages appear on the console and can be viewed with dmesg .


All this is great, but watching a single file is too simple an example. Let's try something more interesting, something involving the network.


Daemons ( and even humble netcat ) use the listen system call to accept incoming network connections. For example, if Apache wants to listen on port 80, it asks the kernel accordingly. To log these events, we once again tell the kernel our intentions via auditctl :


 auditctl -a exit,always -S listen 

Now, every time a process starts listening on a socket, we get a message. Great! Logging like this can be set up for any system call. To answer the questions posed at the beginning of this article, we want connect . To monitor every new process and command, use execve .
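As a sketch, those two rules could also be made persistent in an audit rules file; the path below is hypothetical, and on many distributions persistent rules live in /etc/audit/audit.rules or /etc/audit/rules.d/ :

```
# Hypothetical /etc/audit/rules.d/monitoring.rules -- the same rule syntax
# that auditctl accepts on the command line, loaded at daemon start.

# Log every connect() at syscall exit: who is making outgoing connections?
-a exit,always -S connect

# Log every execve(): every new process and the command that started it
-a exit,always -S execve
```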


An important note: we are not limited to user actions. Think about interesting events like Apache spawning bash, or an outgoing connection to some unfamiliar IP address, and what those actions might mean.




So now we have a stream of events in /var/log/audit/audit.log , but log files by themselves are not a monitoring system. What do we do with all the data we collect? Unfortunately, the auditd log format has several properties that make it hard to work with.


  1. The data is formatted mostly as key=value pairs.
  2. A single event can span one or more lines.
  3. Events can interleave and appear out of order.

Sample auditd output


There are several tools for working with these events (for example, aureport and ausearch ), but they appear focused on investigating an event after the fact rather than on continuous monitoring.




We saw a lot of value in auditd's data, but we needed a way to scale this kind of monitoring. So we decided to write go-audit as a replacement for the userspace portion of auditd. These are the goals we focused on:


  1. Convert multi-line auditd events into a single JSON blob.
  2. Talk to the kernel directly over netlink.
  3. High performance.
  4. Minimal (ideally zero) event filtering on the monitored nodes.

The first three goals are probably self-explanatory; the fourth deserves some explanation.
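To make the first goal concrete, here is a toy sketch of flattening one auditd-style key=value record into JSON. The sample record is invented for illustration, and this is not go-audit's actual parser: real events span multiple lines, interleave, and use quoting and escaping that this naive version deliberately ignores.

```shell
# A single invented auditd-style record; real events are multi-line and
# interleaved, which is exactly what makes the real problem hard.
line='type=SYSCALL msg=audit(1480889011.395:1337): syscall=42 success=yes exit=0 comm="curl"'

# Naive key=value -> JSON conversion: split each whitespace-separated
# field on "=", strip quoting, and emit one flat JSON object.
json=$(echo "$line" | awk '{
  printf "{"
  sep = ""
  for (i = 1; i <= NF; i++)
    if (split($i, kv, "=") == 2) {
      gsub(/"/, "", kv[2])          # strip auditd quoting from values
      printf "%s\"%s\":\"%s\"", sep, kv[1], kv[2]
      sep = ","
    }
  printf "}"
}')
echo "$json"
```

Running this prints `{"type":"SYSCALL","msg":"audit(1480889011.395:1337):","syscall":"42","success":"yes","exit":"0","comm":"curl"}` — a single line that a downstream log pipeline can index, which is the point of goal 1.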




Why would we want to avoid filtering events on the monitored nodes themselves? An example will make this clearer.



Imagine that curl is installed on your servers ( it probably is ). During a security exercise, a member of the red team uses curl to download a rootkit and then to exfiltrate data. Having learned this lesson, you start logging every command, filtering out everything except curl , and raising an alert every time curl is run.


There are several problems with this approach.


  1. There are roughly 92,481,124.5 ways to download a rootkit and exfiltrate data without using curl . We could never list them all.
  2. An attacker might read your auditd rules and notice that you are watching for curl .
  3. There are perfectly legitimate uses of curl .

We need something better ...


What if, instead of filtering for specific commands on the hosts themselves, we sent all the data to a centralized logging and alerting system? This approach has several surprisingly useful properties.


  1. An attacker does not know which system calls you are interested in. ( And as Rob Fuller aptly noted, tripwires an attacker cannot see are the stuff of their nightmares. )
  2. We can correlate different events to determine when a given execution of curl is entirely legitimate and when it is not.
  3. New rules can be evaluated and tested against archived data.
  4. We now have an independent, off-host store of forensic evidence.



So, friends, meet go-audit . We are releasing the tool as open source software, free to use. Our Secops team wrote the first version of go-audit more than a year ago, and it has been running in production for almost as long. It is a small but very important piece of our monitoring infrastructure. I also recommend my earlier article on our approach to alerting.


The go-audit repository contains a number of example configurations for collecting data. Here at Slack we use rsyslog with RELP, because we want to get data off the server as quickly as possible while still being able to spool events to disk if syslog is temporarily unable to send them over the network. We could easily switch to another log-shipping mechanism, and we would be glad to hear your ideas.
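For illustration only, an rsyslog rule in that spirit might look roughly like the fragment below. The hostname, port, and file names are invented, and the exact match depends on how go-audit is configured to write to syslog:

```
# Hypothetical /etc/rsyslog.d/30-go-audit.conf
module(load="omrelp")                 # RELP output module

if $programname == 'go-audit' then {
    action(type="omrelp"
           target="logs.example.com"  # invented central log host
           port="2514"
           # disk-assisted queue: spool to disk when the network is down
           queue.type="LinkedList"
           queue.filename="go-audit-relp"
           queue.saveOnShutdown="on")
    stop
}
```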


We welcome new contributors and hope that go-audit will be useful beyond Slack. Over the past year the repository has been open to a number of people outside Slack, and some of our friends already run it in production.




You may have noticed that the word "security" has not appeared in this article until now. Let's talk about that. I believe that general-purpose tools can often be applied to security, but the reverse is usually not true. Auditd presents its security monitoring data in a way that is hard to use for anything else, whereas go-audit was designed as a general-purpose tool. Its usefulness is immediately apparent to operations engineers and developers, who can use go-audit to debug problems across a wide range of modern systems, including very large ones.


Let's return to the questions posed at the beginning of the article. Any company with IDS/IPS/Netflow/PCAP capture at the network edge can say a lot about its network connections and can perhaps answer the first question. But none of those solutions provide the user / pid / command context needed to answer the second. And that is exactly the difference between "something somewhere on our network connected to this IP" and "Mallory ran curl as root on bigserver01 and connected to 1.2.3.4 on port 1337".


At Slack we often say: "Don't let perfect be the enemy of good." Go-audit is not perfect, but we believe it is very good, and we are glad to share it with you.




FAQ:


Why auditd and not sysdig or osquery?

Osquery is a great tool; in fact, we use it at Slack too. But for production servers we prefer go-audit . The reason is that those systems run 24/7, and we want a continuous stream of data. Osquery gives you a snapshot of the machine's current state: if something starts and finishes between polls, you can easily miss it. I think that model is fine for laptops and other end-user devices, but for always-on, high-availability systems I prefer streaming the data continuously.


Sysdig is also a great debugging tool, and I have used it quite a bit. The main problem is that it requires a kernel module on every machine. Sysdig Falco looks useful, but it filters events on each monitored host, and, as discussed above, we prefer to keep our rules centralized, out of reach of an attacker who has compromised the system.


Auditd's strength is that it has been around for a long time and is part of the mainline kernel. It is the most widespread mechanism for auditing system calls in the Linux world. That is why we chose it.


What do we do with all these alerts?

They go to an Elasticsearch cluster, where we use ElastAlert to generate alerts from the continuously arriving data and for general monitoring. There are other popular logging systems that can handle data at this volume, but ( my personal opinion, which need not coincide with my employer's ) I have a fundamental problem with pricing models that encourage you to log less data in order to save money.


How much log data are we talking about?

The short answer: it varies enormously.


The long answer: it depends on which system calls you log and how many servers are doing it. We write hundreds of gigabytes per day. That sounds like a lot, but we currently stream events from 5,500 machines continuously. You should also plan extra cluster capacity to absorb event floods, such as during a DoS attack.


Why rsyslog?

We have a lot of experience with rsyslog , and it has some very useful features. We strongly recommend version 8.20+, which contains fixes for several bugs we found. We could have made go-audit handle message delivery itself, but the benefits of that approach did not outweigh those of a tool that has served us reliably for years.


Thanks



Source: https://habr.com/ru/post/316902/

