A short list of what needs to be monitored on a virtualization host running Xen. It does not quite qualify as full-fledged reading material, but those who work with Xen should find it useful. Additions and clarifications are welcome.
This note is about monitoring the host, not the virtual machines running on it or their services.
So let's start with the simple things:
Ping to the host and availability of management SSH. No comment needed.
Messages from IPMI / iLO / whatever other management system you have. This catches MCEs (hardware failures of the CPU or motherboard), memory errors, failed power supplies and fans, mischievous hands opening the case, and so on. It is better to monitor via the external IPMI interface, although running ipmitool sel list on the host itself will do for lack of anything better.
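For illustration, a minimal sketch of these two checks in shell, assuming a Nagios-style exit-code convention; the host name, the state file and the "count SEL entries" trick are just placeholders for whatever your monitoring system does:

```
#!/bin/sh
# Minimal sketch: reachability plus "anything new in the IPMI SEL".
# HOST is a hypothetical name; exit codes follow the usual 0=OK / 2=CRITICAL scheme.
HOST=xenhost1

# Ping and management SSH availability
ping -c 3 -W 2 "$HOST" >/dev/null || { echo "CRITICAL: $HOST does not answer ping"; exit 2; }
nc -z -w 5 "$HOST" 22             || { echo "CRITICAL: ssh on $HOST is unreachable"; exit 2; }

# IPMI SEL: alert if the event count grew since the previous run.
# Querying the BMC over the network is preferable; running ipmitool locally
# on the host is the "for lack of anything better" fallback mentioned above.
STATE=/var/tmp/sel.count
NOW=$(ipmitool sel list 2>/dev/null | wc -l)
LAST=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$NOW" > "$STATE"
if [ "$NOW" -gt "$LAST" ]; then
    echo "CRITICAL: $((NOW - LAST)) new IPMI SEL entries"
    exit 2
fi
echo "OK"
```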
Dom0 state (the typical checks; a rough sketch of them follows after this list):
LA (load average). High values indicate problems: on a healthy dom0 the LA should not exceed 0.1, and anything above 2-3 is a problem.
CPU usage. Awkward to monitor directly (it needs a measurement interval), so it is most often done through zabbix / cacti / munin.
Free space on the partitions holding logs. XenServer hosts, for example, like to start acting up once /var runs out of space, and by default they have only 4 GB for everything.
Free memory (of dom0 itself). If dom0 applications start hitting swap, every virtual machine on the host is in trouble.
The number of open network connections. The threshold is picked empirically; it simply should not be excessive.
The state of the RAID array and the hard drives. Even if the host's disks are used "only" for the root filesystem (with VM data kept elsewhere), a failed or degraded disk and a crawling /var/log can fray your nerves. Pay particular attention to hardware RAID: you have to find the vendor utility and use it. Software RAID is handled perfectly well by mdadm, provided you configure mail notifications for it. The drives themselves are watched with smartmontools or a vendor tool.
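A rough sketch of the dom0 checks from this list, suitable for a cron job or an agent script; every threshold in it (LA 2, 500 MB on /var, 256 MB of RAM, 5000 connections) is an example to be tuned, and the /proc/mdstat test is a crude heuristic rather than a replacement for mdadm's own mail:

```
#!/bin/sh
# Dom0 health sketch; thresholds are examples, tune them to your environment.

# Load average over the last minute
awk '{ if ($1 > 2.0) print "WARN: dom0 load average is " $1 }' /proc/loadavg

# Free space on the partition holding logs (XenServer: /var is small by default)
FREE_MB=$(df -Pk /var | awk 'NR==2 {print int($4/1024)}')
[ "$FREE_MB" -lt 500 ] && echo "WARN: only ${FREE_MB} MB left on /var"

# Free memory in dom0 itself
AVAIL_MB=$(awk '/^MemAvailable/ {print int($2/1024)}' /proc/meminfo)
[ "$AVAIL_MB" -lt 256 ] && echo "WARN: dom0 has only ${AVAIL_MB} MB of available memory"

# Number of established network connections (threshold picked empirically)
CONNS=$(ss -tan state established | tail -n +2 | wc -l)
[ "$CONNS" -gt 5000 ] && echo "WARN: $CONNS established connections"

# Software RAID: an underscore in the [UU] status means a missing member.
# Hardware RAID needs the vendor utility instead; the drives themselves need smartctl.
grep -q '\[.*_.*\]' /proc/mdstat 2>/dev/null && echo "CRIT: degraded md array"
```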
Xen state (again, a sketch follows after the list):
The number of domains. Going above a certain count is fraught with problems when the next VM starts: loop devices, minor numbers for tapdisk, LVM, iSCSI and so on get exhausted.
The presence of forgotten domains. Some toolstacks can forget a domain in the 's' (shutdown) state; if such a domain hangs in the domain list for more than 1-3 seconds, something is wrong.
The presence of zombie domains. If a domain carries the 'd' and 's' flags at the same time for more than a second, the host is in real trouble: there are memory pages shared between dom0 and the domain that have not been returned to the hypervisor, so the domain cannot be killed. In my practice the culprit was most often a misbehaving tapdisk.
Free hypervisor memory. At least 100-200 MB of memory must be reserved for the hypervisor so that shadow pages and live migration off the host remain possible. Note that this is a different task from "load balancing": it is a pre-failure condition of the host itself, so it must be monitored independently.
The presence on the host (for more than a second) of domains with the UUID 00000000-0000-0000-0000-000000000000 (a domain initialization error) or with a UUID starting with deadbeaf-dead-beaf-... (a sign of a failure inside xapi; a swinish way to encode an error, but it still has to be watched for).
The presence of new (unexpected) messages in the Xen console ring. Anything showing up there usually indicates a problem or misbehaving guests (for example, an attempt to write to an MSR).
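A sketch of the Xen-level checks for the xl toolstack (an xapi installation would use xe vm-list / xe task-list instead); the domain-count limit and the memory threshold are placeholders, and a production check should sample twice a few seconds apart before alerting, as described above:

```
#!/bin/sh
# Xen state sketch for the xl toolstack; numbers are placeholders.

# Total number of domains (including dom0)
NDOM=$(xl list | tail -n +2 | wc -l)
[ "$NDOM" -gt 100 ] && echo "WARN: $NDOM domains on this host"

# Forgotten ('s') and zombie ('d' + 's') domains.
# A real check should re-test a few seconds later before alerting.
xl list | awk 'NR > 1 && $5 ~ /s/ && $5 !~ /d/ { print "WARN: domain " $1 " stuck in shutdown" }'
xl list | awk 'NR > 1 && $5 ~ /d/ && $5 ~ /s/ { print "CRIT: zombie domain " $1 }'

# Free hypervisor memory (MB)
FREE=$(xl info | awk '/^free_memory/ {print $3}')
[ -n "$FREE" ] && [ "$FREE" -lt 200 ] && echo "WARN: hypervisor has only ${FREE} MB free"

# xapi-specific: domains with the tell-tale broken UUIDs (silently skipped elsewhere)
xe vm-list params=uuid 2>/dev/null \
    | grep -E '00000000-0000-0000-0000-000000000000|deadbeaf-dead-beaf' \
    && echo "CRIT: domain with a bogus uuid"

# Anything new in the hypervisor console ring since the previous run
# (the very first run will warn once because there is no saved copy yet)
xl dmesg > /var/tmp/xl-dmesg.new
cmp -s /var/tmp/xl-dmesg.new /var/tmp/xl-dmesg.old 2>/dev/null \
    || echo "WARN: new messages in the Xen console ring"
mv /var/tmp/xl-dmesg.new /var/tmp/xl-dmesg.old
```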
Dom0 services status (a few of these checks are sketched after the list):
The offset between ntpd and its reference. A discrepancy of more than 0.2 s (pick the exact figure based on the quality of your network) indicates problems and can directly affect guests, whose clocks will start to drift.
The size of the ARP table. If it grows too large, SAN performance can degrade noticeably because of constant ARP re-resolution, which means lag for the guests.
Load on the network interfaces used for management / SAN. If it sits above a certain threshold (for a SAN, above 30-40%), that is a potential source of slowdowns. This applies to 10-20G links; a gigabit interface will inevitably bump into its gigabit ceiling from time to time anyway.
Validity of all SSL certificates. Unlike an "oops, the API server is unavailable" alert, this check lets you say "WARNING, the certificate expires in a month".
For toolstacks that have a task queue (xapi), the depth of that queue. 30-50 tasks in the queue is usually a sign of problems. The same check also verifies connectivity between the slaves and the master, so it is better to run it on the slaves.
Monitoring dmesg. Most often this is done with grep, and not on the host but outside it. Shipping logs via syslog, and netconsole for forwarding dmesg, is mandatory practice. Monitoring with grep looks undignified, but a crude alert is better than architecturally silencing the problem.
Kernel traces. The causes vary: most often OOM, getting stuck in I/O, and so on. A trace showing up in dmesg is a clear sign of a problem.
Segmentation faults. There should be no user programs inside dom0, so any crashing program is either a malfunction or a future exploit.
Signs of flapping (link down / link up) on network interfaces. If an interface starts flapping, it should be taken out of service immediately and the culprit identified. Flapping can go on for a long time without anyone noticing while the underlying problem keeps getting worse: it may be a bad cable, a dying SFP, or a dying network card.
Reports of NFS / iSCSI timeouts and multipath path switches (or whatever you use in your SAN). Some of the timeouts are "soft", i.e. they never reach the guests, but you still need to know about them. The exact wording of the messages is determined experimentally (take down a test storage box and watch what appears).
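A sketch of a few of the service-level checks from this list; the certificate path and the log file are assumptions for illustration, the field positions assume classic ntpq -pn output, and the grep patterns are only a starting point, since, as noted, the exact wording of storage timeouts has to be found experimentally:

```
#!/bin/sh
# Dom0 service checks sketch; paths, thresholds and grep patterns are examples.

# NTP offset against the selected peer (ntpq reports the offset in milliseconds)
OFFSET=$(ntpq -pn 2>/dev/null | awk '$1 ~ /^\*/ { print ($9 < 0) ? -$9 : $9 }')
[ -n "$OFFSET" ] && awk -v o="$OFFSET" 'BEGIN { if (o > 200) print "WARN: ntp offset " o " ms" }'

# ARP (neighbour) table size
NEIGH=$(ip -4 neigh show | wc -l)
[ "$NEIGH" -gt 1000 ] && echo "WARN: $NEIGH ARP entries"

# Certificates: warn if one expires within 30 days (2592000 seconds)
openssl x509 -checkend 2592000 -noout -in /etc/ssl/certs/xapi.pem 2>/dev/null \
    || echo "WARN: certificate expires within a month"

# xapi task queue depth (XenServer/XCP only; best run on the slaves)
TASKS=$(xe task-list 2>/dev/null | grep -c '^uuid')
[ "$TASKS" -gt 30 ] && echo "WARN: $TASKS tasks queued in xapi"

# The undignified grep over forwarded logs: kernel traces, segfaults,
# link flaps, NFS/iSCSI timeouts, multipath path failures
grep -Ei 'call trace|segfault|link (is )?down|not responding|connection.*timed out|remaining active paths' \
    /var/log/kern.log | tail -n 5
```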
Monitoring storage parameters. This area goes well beyond Xen, so I will name only the bare minimum relevant to keeping Xen running (disk monitoring, array state, interconnects, SAS loop cabling and so on will not be discussed); a host-side sketch follows after the list:
Latency on the exported LUNs / volumes / directories. Set yourself a reasonable limit (5-10 ms) and watch it: as soon as the 95th percentile (i.e. 5% of requests) crosses that line, it is a clear sign of trouble ahead.
IOPS. Whether to monitor the maximum is a business question, but monitoring the minimum is definitely necessary. If the load on your storage dropped from 25k IOPS to 200 IOPS, you need to find out why.
The number of active connections from the hosts. A change in this value when no maintenance is going on and no hosts are being added or removed is a sign of problems (which the host itself may not even suspect).
Space utilization. Games with thin provisioning, oversubscription, deduplication, compression and other space-saving technologies can end painfully if you do not keep track of whether the promises can actually be met.
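Most of these checks live on the storage array itself, but a crude host-side approximation of the latency and IOPS items can be scraped from /proc/diskstats; the device name and thresholds below are placeholders, and the result is an average over the interval rather than the 95th percentile the item above really calls for:

```
#!/bin/sh
# Host-side view of backend latency / IOPS via /proc/diskstats.
# DEV is a placeholder for the device backing the SR / VG.
DEV=sdb
INTERVAL=10

# Fields 4+8 = completed reads+writes, fields 7+11 = ms spent reading+writing
snapshot() { awk -v d="$DEV" '$3 == d { print $4 + $8, $7 + $11 }' /proc/diskstats; }

set -- $(snapshot); IO1=$1 MS1=$2
[ -z "$IO1" ] && { echo "UNKNOWN: device $DEV not found"; exit 3; }
sleep "$INTERVAL"
set -- $(snapshot); IO2=$1 MS2=$2

awk -v io=$((IO2 - IO1)) -v ms=$((MS2 - MS1)) -v dt="$INTERVAL" 'BEGIN {
    iops = io / dt
    lat  = (io > 0) ? ms / io : 0      # average ms per completed request, not a percentile
    if (lat  > 10)  print "WARN: average I/O latency " lat " ms"
    if (iops < 200) print "WARN: only " iops " IOPS (did the load really drop?)"
}'
```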