
SLES 12, Watchdog Timer and IBM / Lenovo Servers

UPD: The latest findings are described in a follow-up article, "IBM / Lenovo servers and watchdog: episode II". The text below is a prologue to that article.

I ran into a significant regression in SLES 12 related to watchdog timer support (the /dev/watchdog device) on IBM / Lenovo servers.

First, a short primer for anyone unfamiliar with the topic: how should it work, and why is it needed? If you already know the subject, feel free to skip the next paragraph.

Server and industrial platforms include a special circuit, the watchdog timer. Once activated, it counts down a set interval (for example, one minute). If nothing accesses the timer during that interval, a hardware reset is performed when it expires. If the timer is accessed, the countdown starts over. This makes it possible to recover the computer automatically when the operating system hangs or an important software service stops responding. The mechanism is mandatory in high availability (HA) clusters and other applications that require constant system availability. On Intel-architecture computers, several hardware watchdog interfaces are in use, depending on the system manufacturer; Intel TCO (iTCO) is the most common. In Linux, watchdog drivers are implemented as kernel modules that expose the timer to programs as the /dev/watchdog device.
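To make the "access the timer before it expires" cycle concrete, here is a minimal Python sketch of a userspace keepalive loop. The function name pet_watchdog and its parameters are hypothetical, chosen for illustration. It assumes the driver follows the usual Linux convention that any write to the device restarts the countdown and that writing the character 'V' just before closing disarms the timer. Pointing it at a real /dev/watchdog arms the hardware, so treat it as an illustration rather than a production daemon.

```python
import time


def pet_watchdog(device_path="/dev/watchdog", interval_s=10.0, beats=3, sleep=time.sleep):
    """Open the watchdog device, send `beats` keepalives, then try to disarm it.

    Returns the number of keepalive writes performed.
    All names and defaults here are illustrative assumptions.
    """
    # Opening the device arms the timer; buffering=0 makes each write
    # reach the driver immediately instead of sitting in a buffer.
    with open(device_path, "wb", buffering=0) as wd:
        sent = 0
        for _ in range(beats):
            wd.write(b"\0")  # any write restarts the countdown
            sent += 1
            sleep(interval_s)  # must stay shorter than the timer interval
        # "Magic close": 'V' written right before close asks the driver to
        # stop the timer (a driver configured with nowayout ignores this).
        wd.write(b"V")
    return sent
```

A real service would loop for as long as the system is healthy and simply stop writing when it detects a fault, leaving the hardware to reset the machine.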
In IBM servers, which are now manufactured by Lenovo, the watchdog timer is exposed through the Intel TCO hardware interface and supported by the iTCO_wdt Linux kernel module. With SLES 11 this appeared to work automatically: the /dev/watchdog device, supported by the iTCO_wdt driver, showed up on the system with the default settings. In SLES 12, however, the iTCO_wdt driver was rewritten, shrinking to a third of its size, and something broke. As it later turned out, things were bad in SLES 11 as well: the /dev/watchdog file was created, but it was not connected to the driver and provided no timer functionality.

Here is what happens now in SLES 12. The iTCO_wdt module loads, writes the diagnostic "iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS" to the system log and stays loaded in memory, but it does nothing, and the /dev/watchdog device does not appear. Loading and unloading the module by hand changes nothing. Neither the BIOS settings nor the Integrated Management Module (IMM) have any effect on this. The problem is identical on several IBM / Lenovo HS23 and x3250 servers. If you boot SLES 11 on the same machine, everything works fine.
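The symptom can be checked for mechanically by searching the kernel log for the diagnostic quoted above. The following Python sketch does that on log text you supply; the function name itco_disabled_by_firmware is hypothetical, and the pattern assumes the message wording given in this article.

```python
import re

# Message as quoted in the article. The pattern tolerates a timestamp
# prefix on the line and optional spaces around the slash.
DISABLED_RE = re.compile(
    r"iTCO_wdt: unable to reset NO_REBOOT flag, "
    r"device disabled by hardware\s*/\s*BIOS"
)


def itco_disabled_by_firmware(kernel_log):
    """Return True if the log text contains the iTCO_wdt 'disabled' diagnostic."""
    return any(DISABLED_RE.search(line) for line in kernel_log.splitlines())
```

On a live system the text could come from the output of dmesg, for example via subprocess.run(["dmesg"], capture_output=True, text=True).stdout, which may require root privileges.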

A workaround is to register the softdog module in /etc/modules-load.d; it provides a watchdog timer interface through software emulation at the OS kernel level. In practice, though, this is only a stub: it does nothing to address a failure of the operating system itself.

Worse, one of the recent interim updates of SLES 12 loaded the softdog driver by default. Although this behavior was turned off very soon afterwards, you can no longer be sure whether a hardware or a software driver is providing the watchdog service until you check on the specific Linux version.
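One rough way to run that check is to look at which watchdog modules are loaded and whether the device node exists. The Python sketch below works on the text of /proc/modules; both function names are hypothetical, and the verdicts are only guesses, since (as described above) iTCO_wdt can stay loaded while doing nothing.

```python
def loaded_watchdog_modules(proc_modules_text, candidates=("iTCO_wdt", "softdog")):
    """Return the candidate module names found in /proc/modules-style text."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return [name for name in candidates if name in loaded]


def describe_watchdog(proc_modules_text, device_exists):
    """Give a rough, human-readable guess about what backs /dev/watchdog."""
    mods = loaded_watchdog_modules(proc_modules_text)
    if not device_exists:
        return "no /dev/watchdog (loaded: %s)" % (", ".join(mods) or "none")
    if "softdog" in mods:
        # softdog is checked first: iTCO_wdt may be loaded yet inactive.
        return "softdog loaded: the timer may be software-only"
    if "iTCO_wdt" in mods:
        return "iTCO_wdt loaded: likely hardware-backed"
    return "/dev/watchdog present, driver unknown"
```

A typical call would be describe_watchdog(open("/proc/modules").read(), os.path.exists("/dev/watchdog")), with os imported by the caller.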

I passed the diagnostic information and a description of the error to the kernel developers at Novell and have been working on the incident with IBM / Lenovo support, but after two months the situation has not been resolved, even though SLES 12 is formally a fully supported and recommended operating system for these servers. So if you suddenly find that the watchdog timer does not work (leaving you, for example, unable to start a cluster), or that it works only partially because the hardware driver has been replaced with a software one, at least you will know where to dig.

It seems a way to solve the problem has been found; I wrote a new article about it, which is mentioned at the top.

Source: https://habr.com/ru/post/276025/

