
Kdump - diagnosis and analysis of the causes of kernel failures


Although the kernel in modern Linux systems is quite stable, serious system errors can still occur. When the kernel encounters an unrecoverable error, it enters a state called kernel panic: the standard handler prints information intended to help troubleshoot the problem and then goes into an infinite loop.

To diagnose and analyze the causes of kernel failures, developers at Red Hat created a specialized tool, kdump. Its principle of operation can be described briefly as follows. Two kernels are involved: the main kernel and a crash kernel, which is used to collect the memory dump. When the main kernel boots, a fixed amount of memory is reserved for the crash kernel. When the main kernel panics, kexec boots the crash kernel, which then collects the dump.
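
Whether a crash kernel is currently loaded, and how much memory is reserved for it, can usually be read from sysfs. The following is a minimal sketch that assumes the kernel exposes the files kexec_crash_loaded and kexec_crash_size (the latter in bytes) under /sys/kernel; the function name and its optional directory argument are illustrative.

```shell
#!/bin/sh
# Report whether a crash kernel is loaded and how much memory is reserved for it.
# $1: directory holding the kexec_* files (defaults to /sys/kernel; overridable for testing).
crash_kernel_state() {
    dir=${1:-/sys/kernel}
    # Fall back to 0 when a file is missing (kernel built without kexec support).
    loaded=$(cat "$dir/kexec_crash_loaded" 2>/dev/null || echo 0)
    size=$(cat "$dir/kexec_crash_size" 2>/dev/null || echo 0)
    if [ "$loaded" = "1" ]; then
        echo "crash kernel loaded, $((size / 1048576)) MB reserved"
    else
        echo "crash kernel not loaded, $((size / 1048576)) MB reserved"
    fi
}

# Usage on a live system:
#   crash_kernel_state
```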

In this article we discuss in detail how to configure kdump and use it to analyze system errors. We cover the specifics of working with kdump on Ubuntu; in other distributions, the installation and configuration procedures differ significantly.

Install and configure kdump


Install kdump with the command
 $ sudo apt-get install linux-crashdump kdump-tools


The kdump settings are stored in the /etc/default/kdump-tools configuration file.

 # kdump-tools configuration
 # ---------------------------------------------------------------------------
 # USE_KDUMP - controls whether kdump will be configured
 #     0 - kdump kernel will not be loaded
 #     1 - kdump is configured
 # KDUMP_SYSCTL - controls when a panic occurs, using the sysctl
 #     interface.  The contents of this variable should be the
 #     "variable=value ..." portion of the 'sysctl -w' command.
 #     If not set, the default value "kernel.panic_on_oops=1" will
 #     be used.  Disable this feature by setting KDUMP_SYSCTL=""
 #     Example - also panic on oom:
 #         KDUMP_SYSCTL="kernel.panic_on_oops=1 vm.panic_on_oom=1"
 #
 USE_KDUMP=1
 # KDUMP_SYSCTL="kernel.panic_on_oops=1"


To activate kdump, edit this file and set USE_KDUMP=1.
The configuration file also contains other parameters, each described by the comments in the file itself.
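
A script can confirm that kdump is enabled before relying on it. The sketch below assumes the file uses ordinary shell assignment syntax with no spaces around the equals sign; the function names are illustrative and not part of kdump-tools.

```shell
#!/bin/sh
# Print the effective USE_KDUMP value from a kdump-tools style config file.
# Commented-out lines are ignored; the last assignment wins, as it would
# when the file is sourced by a shell.
use_kdump_value() {
    sed -n 's/^[[:space:]]*USE_KDUMP=\([0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Succeed only if the file enables kdump.
kdump_enabled() {
    [ "$(use_kdump_value "$1")" = "1" ]
}

# Usage:
#   kdump_enabled /etc/default/kdump-tools && echo "kdump is enabled"
```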


After setting all the necessary parameters, run the update-grub command and, when prompted, choose to install the package maintainer's version.

Then restart the system and make sure that kdump is ready for operation:
 $ cat /proc/cmdline

 BOOT_IMAGE=/boot/vmlinuz-3.8.0-35-generic root=UUID=bb2ba5e1-48e1-4829-b565-611542b96018 ro crashkernel=384M-:128M quiet splash vt.handoff=7


Pay particular attention to the crashkernel=384M-:128M parameter. It means that on a system with at least 384 MB of RAM, 128 MB of memory is reserved for the crash kernel at boot. You can also specify crashkernel=auto in the GRUB configuration: in this case, the memory for the crash kernel will be allocated automatically.
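
To make the range syntax concrete, here is a minimal sketch of how a value such as 384M-:128M could be evaluated for a given amount of RAM. It assumes the documented form start-[end]:size[,start-[end]:size...][@offset], with the start of each range inclusive and the end exclusive. The helper names are illustrative, and the real parsing is done by the kernel.

```shell
#!/bin/sh
# Convert a size such as 384M, 2G or 512K to whole megabytes.
to_mb() {
    case "$1" in
        *[Gg]) echo $(( ${1%[Gg]} * 1024 )) ;;
        *[Mm]) echo $(( ${1%[Mm]} )) ;;
        *[Kk]) echo $(( ${1%[Kk]} / 1024 )) ;;
        *)     echo $(( $1 / 1048576 )) ;;      # plain byte count
    esac
}

# $1: crashkernel value, $2: total RAM in MB.
# Prints the size that would be reserved, or "none" if no range matches.
# The body runs in a subshell so the IFS change stays local.
crashkernel_reservation() (
    spec=${1%%@*}                 # drop an optional @offset suffix
    ram_mb=$2
    IFS=,
    for entry in $spec; do
        case "$entry" in
            *:*) ;;                               # range:size entry, handled below
            *)   echo "$entry"; exit 0 ;;         # plain size, always reserved
        esac
        range=${entry%%:*}
        size=${entry#*:}
        start=$(to_mb "${range%%-*}")
        end=${range#*-}                           # empty means "no upper bound"
        if [ "$ram_mb" -ge "$start" ] &&
           { [ -z "$end" ] || [ "$ram_mb" -lt "$(to_mb "$end")" ]; }; then
            echo "$size"
            exit 0
        fi
    done
    echo none
)

# Usage:
#   crashkernel_reservation 384M-:128M 8192
```

With the value from the command line above, a machine with 8 GB of RAM falls into the open-ended range starting at 384 MB, so 128M is printed; a machine with 256 MB would get no reservation.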

To analyze the dump with the crash utility, we also need a vmlinux file containing debugging information:

 $ sudo tee /etc/apt/sources.list.d/ddebs.list << EOF
 deb http://ddebs.ubuntu.com/ $(lsb_release -cs) main restricted universe multiverse
 deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-security main restricted universe multiverse
 deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-updates main restricted universe multiverse
 deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-proposed main restricted universe multiverse
 EOF
 $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01
 $ sudo apt-get update
 $ sudo apt-get install linux-image-$(uname -r)-dbgsym
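
It can be worth confirming that the debug image actually landed where crash will look for it. This sketch assumes the dbgsym package installs the file as /usr/lib/debug/boot/vmlinux-<release>, which is the path used with crash later in this article; the function names and the optional root prefix are illustrative.

```shell
#!/bin/sh
# Print the expected location of the debug vmlinux for a kernel release.
# $1: kernel release (defaults to the running kernel)
# $2: optional root prefix, useful for testing or for a mounted image
debug_vmlinux_path() {
    echo "${2:-}/usr/lib/debug/boot/vmlinux-${1:-$(uname -r)}"
}

# Succeed only if that file exists and is readable.
have_debug_vmlinux() {
    [ -r "$(debug_vmlinux_path "$@")" ]
}

# Usage:
#   have_debug_vmlinux || echo "debug symbols for $(uname -r) are missing" >&2
```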


Once the installation is complete, check the kdump status again:

 $ kdump-config status


If kdump is operational, the following message will be displayed on the console:

 current state: ready to kdump
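
For use in scripts, the status text can be checked automatically. The following sketch assumes only the "current state" line format shown above and treats every other state as not ready; the function name is illustrative.

```shell
#!/bin/sh
# Succeed only if the status text on stdin reports that kdump is ready.
# The pattern requires "ready to kdump" directly after the colon, so a line
# such as "current state: Not ready to kdump" does not match.
kdump_ready() {
    grep -Eq '^[[:space:]]*current state[[:space:]]*:[[:space:]]*ready to kdump'
}

# Usage:
#   kdump-config status | kdump_ready || echo "kdump is NOT ready" >&2
```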


Test kdump


Trigger a kernel panic with the following command:

 $ echo c | sudo tee /proc/sysrq-trigger


As a result, the system will “hang”.

Within a few minutes a dump will be created; after the reboot it will be available in the /var/crash directory.
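
To locate the newest dump after the reboot, a small helper can be used. This is a sketch that assumes the <timestamp>/dump.<timestamp> layout used in the example in this article, so that sorting by name is also chronological; the function name is illustrative.

```shell
#!/bin/sh
# Print the most recent dump file under a crash directory (default /var/crash).
# Prints nothing if no dump has been written yet.
latest_dump() {
    find "${1:-/var/crash}" -type f -name 'dump.*' 2>/dev/null | sort | tail -n 1
}

# Usage (reading /var/crash may require root):
#   sudo sh -c '. ./latest_dump.sh; latest_dump'
```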

Information about a kernel crash can be viewed using the crash utility:

 $ sudo crash /usr/lib/debug/boot/vmlinux-3.13.0-24-generic /var/crash/201405051934/dump.201405051934
 crash 7.0.3
 Copyright (C) 2002-2013 Red Hat, Inc.
 Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
 Copyright (C) 1999-2006 Hewlett-Packard Co
 Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
 Copyright (C) 2006, 2007 VA Linux Systems Japan KK
 Copyright (C) 2005, 2011 NEC Corporation
 Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
 Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
 This program is free software, covered by the GNU General Public License,
 and you are welcome to change it and/or distribute copies of it under
 certain conditions.  Enter "help copying" to see the conditions.
 This program has absolutely no warranty.  Enter "help warranty" for details.

 GNU gdb (GDB) 7.6
 Copyright (C) 2013 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
 and "show warranty" for details.
 This GDB was configured as "x86_64-unknown-linux-gnu" ...

       KERNEL: /usr/lib/debug/boot/vmlinux-3.13.0-24-generic
     DUMPFILE: /var/crash/201405051934/dump.201405051934 [PARTIAL DUMP]
         CPUS: 4
         DATE: Mon May 5 19:34:38 2014
       UPTIME: 00:54:46
 LOAD AVERAGE: 0.14, 0.07, 0.05
        TASKS: 495
     NODENAME: Ubuntu
      RELEASE: 3.13.0-24-generic
      VERSION: #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014
      MACHINE: x86_64 (2675 Mhz)
       MEMORY: 8 GB
        PANIC: "Oops: 0002 [#1] SMP" (check log for details)
          PID: 7826
      COMMAND: "tee"
         TASK: ffff8800a2ef8000 [THREAD_INFO: ffff8800a2e68000]
          CPU: 2
        STATE: TASK_RUNNING (PANIC)

 crash>


Based on the above output, we can conclude that the system failure was preceded by the event “Oops: 0002 [#1] SMP”, which occurred on CPU 2 while the tee command was running.
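
When many dumps have to be triaged, individual fields can be pulled out of this summary automatically. The sketch below assumes the summary has been saved to a text file in the "NAME: value" layout shown above; the function name and the file name in the usage line are hypothetical.

```shell
#!/bin/sh
# Print the value of one field (e.g. PANIC, CPU, COMMAND) from a crash
# summary read on stdin.
crash_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)             # names are right-aligned: strip padding
            if (index(line, key ":") == 1) {     # the name must start the line exactly
                sub(/^[^:]*:[ \t]*/, "", line)   # drop "NAME:" and following blanks
                print line
                exit
            }
        }'
}

# Usage:
#   crash_field PANIC < summary.txt
```

Because the field name must be followed directly by a colon, asking for CPU does not accidentally match the CPUS line.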
The crash utility also offers a wide range of capabilities for diagnosing the causes of a kernel crash. Let us look at them in more detail.

Diagnosing the causes of failure using the crash utility


The crash utility gives access to information about the system events that preceded the kernel crash. With it, you can reconstruct the state of the system at the time of the failure: find out which processes were running, which files were open, and so on. This information helps to make an accurate diagnosis and prevent future kernel failures.

The crash utility has its own set of commands:

 crash> help
 *              files          mach           repeat         timer
 alias          foreach        mod            runq           tree
 ascii          fuser          mount          search         union
 bt             gdb            net            set            vm
 btop           help           p              sig            vtop
 dev            ipcs           ps             struct         waitq
 dis            irq            pte            swap           whatis
 eval           kmem           ptob           sym            wr
 exit           list           ptov           sys            q
 extend         log            rd             task

 crash version: 7.0.3 gdb version: 7.6
 For help on any command above, enter "help <command>".
 For help on input options, enter "help input".
 For help on output options, enter "help output".

 crash>


For each of these commands, you can display a brief manual, for example:

 crash> help set


We will not describe all the commands (detailed information can be found in the official Red Hat user manual), but will cover only the most important ones.

First of all, pay attention to the bt command (short for backtrace). It shows the kernel stack trace of the task that was active at the time of the crash (for details and usage examples, enter help bt). However, in many cases the log command, which displays the contents of the kernel message buffer in chronological order, is enough to determine the cause of a system failure.

Here is a fragment of its output:
 [3288.251955] CPU: 2 PID: 7826 Comm: tee Tainted: PF O 3.13.0-24-generic #46-Ubuntu
 [3288.251957] Hardware name: System manufacturer System Product Name/P7P55D LE, BIOS 2003 12/16/2010
 [3288.251958] task: ffff8800a2ef8000 ti: ffff8800a2e68000 task.ti: ffff8800a2e68000
 [3288.251960] RIP: 0010:[<ffffffff8144de76>] [<ffffffff8144de76>] sysrq_handle_crash+0x16/0x20
 [3288.251963] RSP: 0018:ffff8800a2e69e88 EFLAGS: 00010082
 [3288.251964] RAX: 000000000000000f RBX: ffffffff81c9f6a0 RCX: 0000000000000000
 [3288.251965] RDX: ffff88021fc4ffe0 RSI: ffff88021fc4e3c8 RDI: 0000000000000063
 [3288.251966] RBP: ffff8800a2e69e88 R08: 0000000000000096 R09: 0000000000000387
 [3288.251968] R10: 0000000000000386 R11: 0000000000000003 R12: 0000000000000063
 [3288.251969] R13: 0000000000000246 R14: 0000000000000004 R15: 0000000000000000
 [3288.251971] FS: 00007fb0f665b740(0000) GS:ffff88021fc40000(0000) knlGS:0000000000000000
 [3288.251972] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 [3288.251973] CR2: 0000000000000000 CR3: 00000000368fd000 CR4: 00000000000007e0
 [3288.251974] Stack:
 [3288.251975] ffff8800a2e69ec0 ffffffff8144e5f2 0000000000000002 00007fff3cea3850
 [3288.251978] ffff8800a2e69f50 0000000000000002 0000000000000008 ffff8800a2e69ed8
 [3288.251980] ffffffff8144eaff ffff88021017a900 ffff8800a2e69ef8 ffffffff8121f52d
 [3288.251983] Call Trace:
 [3288.251986] [<ffffffff8144e5f2>] __handle_sysrq+0xa2/0x170
 [3288.251988] [<ffffffff8144eaff>] write_sysrq_trigger+0x2f/0x40
 [3288.251992] [<ffffffff8121f52d>] proc_reg_write+0x3d/0x80
 [3288.251996] [<ffffffff811b9534>] vfs_write+0xb4/0x1f0
 [3288.251998] [<ffffffff811b9f69>] SyS_write+0x49/0xa0
 [3288.252001] [<fffffffff1717663f>] tracesys+0xe1/0xe6
 [3288.252002] Code: 65 34 75 e5 4c 89 ef e8 f9 f7 ff ff eb db 0f 1f 80 00 00 00 66 66 66 66 90 55 c7 05 94 68 a6 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 be
 [3288.252025] RIP [<ffffffff8144de76>] sysrq_handle_crash+0x16/0x20
 [3288.252028] RSP <ffff8800a2e69e88>
 [3288.252029] CR2: 0000000000000000


One of the lines of output will indicate the event that caused the system error:

 [3288.251889] SysRq: Trigger a crash
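
When the log is long, it can help to reduce the call trace to a plain list of function names. The sketch below assumes the log has been saved to a text file and that stack frames have the "[<address>] function+offset/size" form shown in the fragment above; newer kernels print frames differently, so the patterns may need adjusting. The function name and the file name in the usage line are hypothetical.

```shell
#!/bin/sh
# List the function names from the first "Call Trace:" section of a kernel
# log read on stdin, innermost frame first.
call_trace_functions() {
    awk '
        /Call Trace:/ { in_trace = 1; next }
        in_trace && /\[<[0-9a-f]+>\]/ {
            sub(/^.*\[<[0-9a-f]+>\][ \t]*/, "")   # drop timestamp and address
            sub(/[ \t]*\+.*$/, "")                 # drop the +offset/size suffix
            print
            next
        }
        in_trace { exit }                          # first non-frame line ends the trace
    '
}

# Usage:
#   call_trace_functions < kernel-log.txt
```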


Using the ps command, you can display a list of processes that were running at the time of the crash:

 crash> ps
    PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
       0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
       1      0   0  ffff88013e7db500  IN   0.0   19356   1544  init
       2      0   0  ffff88013e7daaa0  IN   0.0       0      0  [kthreadd]
       3      2   0  ffff88013e7da040  IN   0.0       0      0  [migration/0]
       4      2   0  ffff88013e7e9540  IN   0.0       0      0  [ksoftirqd/0]
       7      2   0  ffff88013dc19500  IN   0.0       0      0  [events/0]
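
On a busy system the process list is long, and a count of tasks per state is a quick first overview. This is a sketch that assumes the ps output has been saved as text with the state in the fifth column, as in the listing above; the function name and the file name in the usage line are hypothetical.

```shell
#!/bin/sh
# Count tasks per state (RU, IN, UN, ...) in crash "ps" output read on stdin.
ps_state_counts() {
    awk '
        { sub(/^>/, "") }                   # crash marks active tasks with a leading ">"
        $1 ~ /^[0-9]+$/ && NF >= 9 {        # skip the header and any stray lines
            count[$5]++
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Usage:
#   ps_state_counts < ps-output.txt
```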


To view information about the use of virtual memory, use the vm command:

 crash> vm
 PID: 5210   TASK: ffff8801396f6aa0  CPU: 0   COMMAND: "bash"
        MM               PGD          RSS    TOTAL_VM
 ffff88013975d880  ffff88013a0c5000  1808k   108340k
       VMA           START       END     FLAGS FILE
 ffff88013a0c4ed0     400000     4d4000 8001875 /bin/bash
 ffff88013cd63210 3804800000 3804820000 8000875 /lib64/ld-2.12.so
 ffff880138cf8ed0 3804c00000 3804c02000 8000075 /lib64/libdl-2.12.so


The swap command displays information about swap space usage:

 crash> swap
 FILENAME  TYPE         SIZE      USED   PCT  PRIORITY
 /dm-1     PARTITION  2064376k      0k    0%     -1


CPU interrupt information can be viewed using the irq command:

 crash> irq -s
            CPU0
   0: 149 IO-APIC-edge timer
   1: 453 IO-APIC-edge i8042
   7: 0 IO-APIC-edge parport0
   8: 0 IO-APIC-edge rtc0
   9: 0 IO-APIC-fasteoi acpi
  12: 111 IO-APIC-edge i8042
  14: 108 IO-APIC-edge ata_piix


The files command lists the files that were open at the time of the crash:

 crash> files
 PID: 5210   TASK: ffff8801396f6aa0  CPU: 0   COMMAND: "bash"
 ROOT: /    CWD: /root
  FD       FILE            DENTRY           INODE       TYPE PATH
   0 ffff88013cf76d40 ffff88013a836480 ffff880139b70d48 CHR  /tty1
   1 ffff88013c4a5d80 ffff88013c90a440 ffff880135992308 REG  /proc/sysrq-trigger
 255 ffff88013cf76d40 ffff88013a836480 ffff880139b70d48 CHR  /tty1


Finally, you can get a summary of the general state of the system using the sys command:

 crash> sys
       KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
     DUMPFILE: /var/crash/127.0.0.1-2014-03-26-12:24:39/vmcore [PARTIAL DUMP]
         CPUS: 1
         DATE: Wed Mar 26 12:24:36 2014
       UPTIME: 00:01:32
 LOAD AVERAGE: 0.17, 0.09, 0.03
        TASKS: 159
     NODENAME: elserver1.abc.com
      RELEASE: 2.6.32-431.5.1.el6.x86_64
      VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
      MACHINE: x86_64 (2132 Mhz)
       MEMORY: 4 GB
        PANIC: "Oops: 0002 [#1] SMP" (check log for details)
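
The commands covered in this section can also be run in one batch instead of being typed at the prompt. The sketch below only builds the input file; it assumes that crash accepts such a file through its -i option and suppresses the banner with -s, as described in its man page. The helper name and the /tmp path are illustrative.

```shell
#!/bin/sh
# Write the given crash commands one per line, followed by "exit",
# so that crash terminates after running them.
make_crash_input() {
    for cmd in "$@"; do
        printf '%s\n' "$cmd"
    done
    printf 'exit\n'
}

# Usage (paths taken from the example earlier in this article):
#   make_crash_input sys bt log ps > /tmp/crash.cmds
#   sudo crash -s -i /tmp/crash.cmds \
#       /usr/lib/debug/boot/vmlinux-3.13.0-24-generic \
#       /var/crash/201405051934/dump.201405051934 > report.txt
```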


Conclusion


Analyzing and diagnosing the causes of kernel crashes is a highly specific and complex topic that cannot be covered within a single article. We will return to it in future publications.


Source: https://habr.com/ru/post/226487/

