
Five steps to rescue a crashed Linux server

I have seen many Linux servers that have run for years in 24x7 mode without a single reboot. But no computer is immune to surprises, which can take the form of hardware, software, and network failures. Even the most reliable server can fail one day. What do you do then? Today you will learn what to do first in order to find the cause of the problem and get the machine back into service.


And, by the way, at the very beginning, right after the failure, it is worth answering a very important question: is the server even to blame for what happened? It is quite possible that the source of the problem lies elsewhere. But let's not get ahead of ourselves.

Troubleshooting: then and now


When I started working as a Unix system administrator in the 1980s, long before Linus Torvalds came up with the idea of Linux, a misbehaving server was a real headache. There were relatively few troubleshooting tools back then, so getting a failed server working again could take a long time.

Things are very different now. One system administrator once told me, quite seriously, about a problem server: "I destroyed it and spun up a new one."
In the old days this would have sounded crazy, but today, when IT infrastructures are built on virtual machines and containers, deploying new servers as needed is routine in any cloud environment.

Add to this DevOps tools such as Chef and Puppet, which make it easier to build a new server than to diagnose and "repair" the old one. And with higher-level tools such as Docker Swarm, Mesosphere, and Kubernetes, a failed server's workload is often restored automatically before the administrator even learns about the problem.

This approach has become so common that it has its own name: serverless computing. Platforms that offer it include AWS Lambda, Iron.io, and Google Cloud Functions.

With this approach, the cloud service takes care of server administration, handles scaling, and deals with a host of other tasks in order to give the client the computing power needed to run its applications.

Serverless computing, virtual machines, containers: all these layers of abstraction hide the real servers from users and, to some extent, from system administrators. Underneath it all, however, there are still physical hardware and operating systems, and if something at that level suddenly goes wrong, someone has to put it back in order. That is why what we are talking about today will never lose its relevance.

I remember a conversation with one system operator. Here is what he said about how to act after a failure: "Reinstalling the server is a road to nowhere. That way you never figure out what happened to the machine or how to prevent it in the future. No decent administrator does that." I agree. Until the original source of the problem is found, the problem cannot be considered solved.

So, we have a server that has failed, or at least we suspect it is the source of the trouble. Let's walk through five steps that are a good place to start when looking for, and fixing, the problem.

Step one. Hardware check


First of all, check the hardware. I know it sounds trivial and old-fashioned, but do it anyway. Get up from your chair, walk over to the server rack, and make sure the server is properly connected to everything it needs for normal operation.

I can't count how many times the hunt for the cause of a problem ended at a cable connection. One glance at the LEDs and it becomes clear that the Ethernet cable has been pulled out or the server has been powered off.

Of course, if everything looks more or less fine, you can skip the trip to the server room and check the state of the Ethernet connection with this command:

$ sudo ethtool eth0 

If its output reports "Link detected: yes", the interface in question is able to exchange data over the network.
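The exact output depends on the network driver, but on a healthy gigabit link the report usually ends with lines like these (the values below are only illustrative):

 $ sudo ethtool eth0 | grep -E "Speed|Duplex|Link detected"
 Speed: 1000Mb/s
 Duplex: Full
 Link detected: yes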

Still, do not pass up the chance to inspect the machine in person. That is how you find out, for example, that someone pulled out an important cable and cut power to the server or to the entire rack. Yes, it is ridiculously simple, but it is surprising how often exactly this turns out to be the cause of a failure.

Another common hardware problem cannot be spotted with the naked eye: bad memory, which causes all sorts of trouble.

Virtual machines and containers can mask these problems, but if you keep running into failures tied to a specific physical dedicated server, check its memory.

To see what BIOS / UEFI reports on computer hardware, including memory, use the dmidecode command:

 $ sudo dmidecode --type memory 
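To narrow the output down to the fields that matter most for this check, you can filter it; the field names below are the ones dmidecode normally prints for memory devices:

 $ sudo dmidecode --type memory | grep -E "Size|Locator|Speed"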

Even if everything looks normal here, it may not actually be so: SMBIOS data is not always accurate. So if the memory is still under suspicion after dmidecode, it is time for Memtest86. This is an excellent memory-testing program, but it is slow. If you run it on a server, do not expect to use that machine for anything else until the check is complete.

If you run into a lot of memory problems (I have seen this in places with unstable power), you should load the edac_core Linux kernel module. This module continuously checks memory for errors. To load it, use the following command:

 $ sudo modprobe edac_core 

Wait a while, then see whether anything shows up by running this command:

 $ sudo grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count 

This command gives you a summary of error counts broken down by memory module (the entries that start with csrow). If you compare this information with the dmidecode data on memory channels, slots, and component serial numbers, it will help you identify the bad memory stick.
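Illustrative output looks something like the following (the paths and counts will differ on your hardware); a nonzero correctable-error count on a particular csrow and channel points at the corresponding module:

 /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:27
 /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0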

Step two. Finding the true source of the problem


So, the server has started behaving strangely, but it is not smoking yet. Is the server even the culprit? Before trying to solve the problem, you first need to pin down its source. For example, if users complain about strange behavior in a server application, first make sure the problem is not actually a failure on the client side.

For example, a friend once told me how his users reported that they could not work with IBM Tivoli Storage Manager. At first, of course, the server seemed to be to blame. But in the end the administrator found that the problem had nothing to do with the server side at all: the cause was a bad Windows client patch, 3076895. The way that security update failed simply made it look like a server-side problem.

You also need to figure out whether the problem comes from the server itself or from a server application. For example, the server program may be barely limping along while the hardware is in perfect order.

Start with the most obvious question: is the application running at all? There are many ways to check. Here are two of my favorites:

 $ sudo ps -ef | grep apache2
 $ sudo netstat -plunt | grep apache2

If it turns out that, for example, the Apache web server is not working, you can start it with the following command:

 $ sudo service apache2 start 
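On distributions that use systemd, you can also check, and if necessary restart, the service through systemctl; the unit name apache2 below assumes a Debian or Ubuntu style installation:

 $ sudo systemctl status apache2
 $ sudo systemctl restart apache2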

In a nutshell: before diagnosing the server and digging for the cause of the problem, find out whether the server is at fault at all, or whether something else is. Only once you understand exactly where the failure lives will you be able to ask the right questions and move on to deeper analysis of what happened.

You can compare this to a car that suddenly stops. You know the car will not go any further, but before you tow it to the shop, it is worth checking whether there is gas in the tank.

Step three. Using the top command


So, if it turns out that all paths lead to the server, the next important tool for examining the system is the top command. It lets you see the server's load average and swap usage, and find out which system resources each process is consuming. The utility shows general system information and data about every running process on the Linux server; a detailed description of everything it displays can be found in its man page. There is a lot of information here that can help you track down server problems. Below are a few useful ways to work with top when hunting for trouble spots.

To find the process that is consuming the most memory, sort the process list interactively by typing M. To find the application using the most CPU, sort the list by typing P. To sort processes by running time, type T. And to make it easier to see which column the list is sorted on, press the b key.
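If you prefer to start top already sorted, recent procps-ng versions also accept a sort field on the command line (a convenience, not a requirement; the field names follow top's column headers):

 $ top -o %MEM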

In addition, the process data top displays can be filtered interactively by typing O or o. The following prompt will appear, asking you to add a filter:

 add filter #1 (ignoring case) as: [!]FLD?VAL 

You can then enter a pattern to filter on, say, a specific process. For example, with the COMMAND=apache filter the program will display only information about Apache processes.

Another useful top feature is showing each process's full path and launch arguments. To view this data, press the c key.

A related feature is toggled with the V key: it switches to a hierarchical (tree) view of the processes.

You can also view the processes of a specific user with the u or U keys, or hide processes that are not consuming CPU by pressing the i key.
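The per-user view is also available straight from the command line; the www-data account below is just an example, substitute the user you are interested in:

 $ top -u www-data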

Although top has long been the most popular interactive Linux utility for viewing what is going on in the system, it does have alternatives. For example, htop offers an extended feature set and a simpler, more convenient ncurses interface. When working with htop you can use the mouse and scroll the process list vertically and horizontally to see every process and its full command line.

I do not expect top to tell me what the problem is. Rather, I use it to find something that makes me think "now that is interesting" and inspires further digging. Based on what top shows, I know, for example, which logs are worth looking at first. I view the logs using a combination of less, grep, and tail -f.

Step four. Check disk space


Even today, when you can carry terabytes of data in your pocket, a server can run out of disk space completely unnoticed. When that happens, you can see very strange things.

The good old df command, whose name is short for “disk filesystem”, will help us to deal with disk space. With it, you can get a summary of free and used disk space.

The df command is usually run with the -h flag, which reports used and free space in human-readable units.

Another useful flag is -T. It adds the file system type of each mount; for example, a command like $ sudo df -hT shows both how much disk space is used and what file system each volume carries.
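Keep in mind that a file system can also run out of inodes while still showing plenty of free space, which produces equally strange symptoms; df reports inode usage with the -i flag:

 $ sudo df -h
 $ sudo df -i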

If something seems strange to you, you can dig deeper with the iostat command. It is part of sysstat, an advanced system monitoring toolkit, and it reports CPU statistics along with I/O statistics for block devices, partitions, and network file systems.

Probably the most useful way to call this command is:

 $ iostat -xz 1 

This command shows how much data is being read from and written to each device. It also shows the average time I/O operations take, in milliseconds. The larger this value, the more likely it is that the drive is overloaded with requests or that there is a hardware problem. Which is it? You can use the top utility to find out whether MySQL (or whatever other DBMS is running on the server) is loading the machine. If no such application can be found, there is a good chance that something is wrong with the disk itself.

Another important indicator is in the %util column, which shows device utilization, that is, how hard the device is working. Values above 60% indicate poor disk subsystem performance. If the value is close to 100%, the disk is working at full capacity.
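If you want to keep an eye on a single drive, you can filter the output; sda below is an assumption, substitute your own device name:

 $ iostat -xz 1 | grep -E "Device|sda"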

When working with utilities for checking disks, pay attention to what exactly you are analyzing.

For example, 100% utilization on a logical volume made up of several physical disks may only mean that the system is constantly processing some I/O. What matters is what is happening on the physical disks underneath. So if you are analyzing a logical volume, keep in mind that the disk utilities' numbers will not tell you the whole story.

Step five. Check logs


Last on our list, but only in order and not in importance: check the logs. They can usually be found under /var/log, in separate folders for the various services.

To Linux newcomers, log files can look like a terrible mess. They are text files recording what the operating system and applications are doing. There are two kinds of entries: some record events happening in the system or in a program, for example each transaction or data transfer; the others are error messages. A log file can contain both, and these files can get huge.

The data in log files usually looks pretty cryptic, but you will still have to make sense of it. Here, for example, is a good introduction to the topic from DigitalOcean.

There are many tools to help you check the logs. One of them is dmesg, which displays kernel messages. There are usually a lot of them, so use the following simple pipeline to view the last ten entries:

 $ dmesg | tail 
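If your dmesg comes from util-linux (as on most modern distributions), you can also add human-readable timestamps and filter by severity, which makes the output far easier to scan:

 $ dmesg -T --level=err,warn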

Want to follow what is happening in real time? I certainly need this when I am hunting for problems. To do it, use the tail command with the -f flag. It looks like this:

 $ tail -f /var/log/syslog

This command watches the syslog file and, as new events are logged, prints them to the screen.

Here is another handy command line script:

 $ sudo find /var/log -type f -mtime -1 -exec tail -Fn0 {} + 

It follows every log file under /var/log that has been modified in the last day and prints new entries as they appear, which makes it easier to spot developing problems across all the logs at once.

If your system uses systemd, you will be working with its built-in journaling tool, journalctl. systemd centralizes logging through the journald daemon. Unlike other Linux logs, journald stores its data in binary rather than text format.

It can be useful to configure journald so that it keeps its logs across reboots. Start by creating the journal directory with the following command:

 $ sudo mkdir -p /var/log/journal 

To enable persistent record keeping, you will need to edit the /etc/systemd/journald.conf file to include the following:

 [Journal]
 Storage=persistent
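For the new setting to take effect, restart the journal daemon (or simply reboot the server):

 $ sudo systemctl restart systemd-journald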

The most common way to work with these journals is the following command:

 journalctl -b 

It will show all log entries since the last boot. If the system has been rebooted, you can see what happened before the reboot with the following command:

 $ journalctl -b -1 

This will allow you to view the log entries made in the previous server session.
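journalctl can also slice the journal by unit, time, or priority, which is handy once you know roughly where to look; the apache2 unit name below is just an example:

 $ journalctl -u apache2 --since "1 hour ago"
 $ journalctl -p err -b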
Here is helpful material on how to use journalctl .

Logs can be very large and hard to work with. So although you can deal with them using command line tools such as grep and awk, it is often worth using dedicated programs for viewing logs.

For example, I like the open source Graylog logging system. It collects, indexes, and analyzes all kinds of information, relying on MongoDB for data storage and Elasticsearch for searching log files. Graylog makes it easy to keep track of server state, and compared with the built-in Linux tooling it is simpler and more convenient. Among its useful features is integration with many DevOps systems, such as Chef, Puppet, and Ansible.

Results


No matter how well you treat your server, it will probably never make it into the Guinness Book of Records as the longest-running one. But striving to make the server as stable as possible, getting to the root of problems and fixing them, is a worthy goal. Hopefully, what we have covered today will help you achieve it.

Dear readers! How do you usually deal with servers that have gone down?

Source: https://habr.com/ru/post/330350/

