
Protection mechanisms in the HP Superdome X

Hi, Habr! In this post we'll talk about the HP Superdome X again, or rather about some of the protection mechanisms built into it.



Advanced Error Recovery mechanism


This section shows how the HP Superdome X firmware handles uncorrectable memory errors, using Linux as the example.

When a user application hits a memory error, the HP Superdome X server firmware detects it, and the firmware's MCA Recovery mechanism identifies the memory regions that contain the error. How often do memory errors occur in a server? More often than you might expect. A Google study of its own data centers showed that uncorrectable memory errors are quite common (the cause can be either application errors or external factors such as cosmic radiation).
Moreover, according to the study, a module that has already had even one corrected error (fixed by ECC or Chipkill) has a 70-80% probability of going on to suffer an uncorrectable one, so take a closer look at such modules in your server fleet and, if possible, keep critical applications off them. About 8% of the memory modules in Google's data centers experience faults. The report also reveals an interesting fact: memory is susceptible to an "aging effect".



HP's own research confirms these data. The chart below shows annual failure statistics for server components (ACR, annual crash rate), ordered according to the Pareto principle. The servers did not use memory mirroring; they ran in SDDC+1, the standard correction mode most popular with customers, in which one chip out of every four is corrected:



That is why it is so important for servers of this class to have a mechanism that can isolate memory errors without bringing down the application, the operating system, or the server, especially for business-critical workloads.

Errors are found by the Patrol Scrubber, a mechanism that checks memory continuously. When it detects an error, the server hardware tries to correct it (using ECC or Chipkill). If the hardware cannot correct the error, the OS is alerted. The HP Memory Quarantine mechanism then isolates the error before it can damage data, which reduces how often applications fail because of uncorrectable memory errors. The bad region is fenced off to prevent further accesses, and the memory module can be replaced during the next routine diagnostic procedure.

In parallel, the Linux kernel records the failed memory address and sends a SIGBUS signal to the application that uses this region. An application that receives the signal moves the affected data to another memory address without suspending its operation.
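
The flow described above (quarantine the bad region, then let the application relocate its data) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the ToyMemory class, its methods, the recover function, and the backing_store dictionary are hypothetical names invented for illustration, and nothing here touches real hardware, firmware, or the Linux kernel. It also assumes the application keeps a clean copy of its data that it can reload.

```python
# Toy model of memory quarantine and recovery; every name here is hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    free_frames: list                                   # frames still available
    page_table: dict = field(default_factory=dict)      # virtual page -> frame
    frames: dict = field(default_factory=dict)          # frame -> stored bytes
    quarantined: set = field(default_factory=set)       # frames never reused

    def map_page(self, vpage, data):
        """Place data for a virtual page into the next free frame."""
        frame = self.free_frames.pop(0)
        self.page_table[vpage] = frame
        self.frames[frame] = data
        return frame

    def quarantine(self, frame):
        """Isolate a bad frame and return the virtual pages that used it."""
        self.quarantined.add(frame)
        self.frames.pop(frame, None)
        affected = [vp for vp, fr in self.page_table.items() if fr == frame]
        for vp in affected:
            del self.page_table[vp]
        return affected


def recover(memory, backing_store, affected_pages):
    """Application side: reload clean data for each affected page elsewhere."""
    for vpage in affected_pages:
        memory.map_page(vpage, backing_store[vpage])


backing_store = {10: b"orders", 11: b"customers"}
memory = ToyMemory(free_frames=[0, 1, 2, 3])
for vpage, data in backing_store.items():
    memory.map_page(vpage, data)

affected = memory.quarantine(0)      # frame 0 reports an uncorrectable error
recover(memory, backing_store, affected)
print(memory.page_table)
print(memory.quarantined)
```

The point of the model is that the quarantined frame never returns to the free list, so the relocated page cannot land on the bad region again.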

In addition to the HP Superdome X, this mechanism is used in HP DL580 Gen9 4-socket servers.


Stages of the Advanced Error Recovery mechanism

A video showing how the HP Superdome X deals with an uncorrectable memory error is available here.

HP has also published a report on increasing the uptime of the HP Superdome X by using special firmware to track uncorrectable memory errors in multiprocessor servers.

Live Error Recovery mechanism in HP Superdome X


This section shows how the HP Superdome X firmware handles I/O errors on the fly, using Linux as the example. As we know, the PCI bus is built on a serial bus architecture, which means that errors occurring on the bus can potentially spread to other devices attached to it and lead to inconsistent data. More than 18 possible I/O errors have been documented, and the chance of such an error grows as PCI devices are added. With this in mind, HP added to its firmware a mechanism that works with Intel Live Error Recovery to isolate I/O when such errors occur.

When an I/O error appears, Intel Live Error Recovery isolates it, preventing the OS or the application from crashing. At the same time, Intel Live Error Recovery notifies the HP firmware of the error, after which all I/O is stopped to keep corrupted data from leaking outside the server. The HP firmware then notifies the upstream I/O device driver and the OS that an error has occurred.
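
The order of events in this paragraph (contain first, notify afterwards) can be shown with a toy model. This is a sketch only: ToyIOPort, its methods, the "nic0" name, and the error text are hypothetical and do not correspond to any real Intel, HP, or Linux interface. The model simply assumes that once an error is raised, nothing more is delivered through the affected port.

```python
# Toy model of I/O error containment; every name here is hypothetical.

class ToyIOPort:
    """Stands in for one I/O port that can be contained after an error."""

    def __init__(self, name):
        self.name = name
        self.contained = False
        self.delivered = []   # payloads that left the server
        self.events = []      # what a syslog-style report might record

    def send(self, payload):
        """Deliver a payload unless the port has been contained."""
        if self.contained:
            self.events.append(f"{self.name}: blocked {payload!r}")
            return False
        self.delivered.append(payload)
        return True

    def raise_error(self, description, driver_callback):
        """Contain the port first, then tell the driver what happened."""
        self.contained = True
        self.events.append(f"{self.name}: error: {description}")
        driver_callback(self.name, description)


notifications = []
port = ToyIOPort("nic0")
port.send("packet-1")
port.raise_error("malformed transaction",
                 lambda name, desc: notifications.append((name, desc)))
port.send("packet-2")   # blocked, so possibly corrupted data stays inside
print(port.delivered)
print(port.events)
```

Containing the port before calling the driver callback mirrors the idea that data must stop flowing before anyone reacts to the error.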

Firmware features allow Linux to produce an extended report (syslog) so that an administrator or help desk can analyze the I/O error in detail.

In addition, the firmware includes the Error Analysis Engine, which analyzes I/O errors and gives service personnel recommendations about their possible causes. The demo video compares how a network card error is handled on a standard server and on the HP Superdome X, and compares the log files of the two servers.

A video shows how the HP Superdome X handles this error. For more information about other RAS functionality implemented in the HP Superdome X, see the document "HP Superdome X system architecture and RAS".

Thus, what makes the HP firmware in HP Superdome X multiprocessor systems unique is that it lets you use the full capabilities of the server components (processors, memory, devices) for reliability, availability, and serviceability (RAS features). IDC has published a report that analyzes the HP Superdome X and its suitability for mission-critical tasks.

Partition and error isolation (passive midplane) mechanism that provides electrical isolation of blade servers


An important feature of the new HP Superdome X is the electrical isolation of its blade partitions. Partitioning lets you configure the HP Superdome X either as one large system made up of several blade servers or as several independent, isolated smaller systems. Each partition has its own independent set of CPUs, memory, and I/O, which allows the system to stay operational even if entire blade servers fail, unlike multiprocessor systems that share a common PCI bus.



Systems with a common midplane between CPUs (A) are potentially vulnerable to error propagation between nodes and depend on a shared bus, which also limits the performance of the whole system because it cannot quickly process a large number of CPU-to-CPU accesses. The electrically independent nPar partitions in the HP Superdome X (B) have none of these shortcomings.

This functionality was carried over from the Superdome Integrity platform and allows the resources of an HP Superdome X enclosure to be divided flexibly among different tasks. For example, for databases you can run several environments at once on a single HP Superdome X (production, test, and development), add virtualization, and place several database containers within a single partition. This approach requires no physical movement of components and can be done from the administrator console.


Flexible division of resources among different tasks in the HP Superdome X enclosure

Container-based database placement is supported by SAP and Oracle products. In one of our deployments, a customer running the SAP HANA platform used this kind of container-based resource allocation; the isolation of HP Superdome X partitions allowed them to run OLAP and OLTP workloads on a single platform, which is not yet possible on standard x86 systems.

OK, but what about protecting the application? Don't worry: HP has a well-proven tool in its arsenal, HP Serviceguard, which supports a large number of applications, including critical ones such as databases. HP Serviceguard closely monitors the hardware, network, storage, OS, and hypervisor. As soon as a failure occurs, it automatically resumes the service on redundant cluster nodes. Serviceguard also supports horizontally scalable (scale-out) systems, which standard Linux clusters cannot yet do. For disaster-tolerant systems, there is support for geographically distributed clusters (Metroclusters). We will cover this product in more detail in a separate article.

FAQ


Here we repeat the useful material and questions from the first part.

Q: Are there any public performance test results for the HP Superdome X?
A: Yes. The HP Superdome X showed high performance in the standard SPECjbb2013 benchmark and was the first x86 system to pass the 1 million jOPS mark.

June 2014 | November 2014 | December 2014

SPEC CPU2006 test

Q: I've heard that performance does not grow linearly as the number of processors in a system increases. Is that true?
A: Yes, with the standard Intel architecture this is true, but in the HP Superdome X, adding processors yields an almost linear performance increase thanks to the high-performance crossbar architecture (a factor of 1.92x when the system grows from 4 to 8 sockets and 1.86x when it grows from 8 to 16 sockets; this is confirmed by the test results above).
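
To see what the two factors quoted in this answer mean together, here is a short calculation. It is only arithmetic on the numbers from the answer (1.92x and 1.86x); the function names are made up for this illustration, and "efficiency" is assumed to mean measured speedup divided by the ideal speedup for the same growth in socket count.

```python
# Arithmetic on the scaling factors quoted in the answer; names are illustrative.

def cumulative_speedup(step_factors):
    """Multiply per-step speedups to get the overall speedup."""
    total = 1.0
    for factor in step_factors:
        total *= factor
    return total


def scaling_efficiency(speedup, socket_ratio):
    """Measured speedup as a fraction of the ideal (linear) speedup."""
    return speedup / socket_ratio


step_factors = [1.92, 1.86]             # 4 -> 8 sockets, then 8 -> 16 sockets
total = cumulative_speedup(step_factors)
efficiency = scaling_efficiency(total, 16 / 4)

print(f"4 -> 16 sockets: {total:.2f}x speedup, {efficiency:.0%} of linear")
```

In other words, quadrupling the socket count gives roughly 3.57 times the performance, or about 89% of perfectly linear scaling.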

Q: Are there any publicly known HP Superdome X deployments among Russian customers?
A: Yes, for example at MTS.

Q: Are there any figures on HP Superdome X database performance?
A: Yes, for example for SQL 2014.

Q: Are there any documents describing HP Superdome X tests with Oracle?
A: Yes, for Oracle 12c. There are also real customers who have tested their data on the HP Superdome X under Oracle; these references are not public, but the figures can be shared in discussion.

Q: Can a hypervisor be installed on the HP Superdome X?
A: Yes, for example VMware; you can check this in the compatibility matrix (http://www.vmware.com/resources/compatibility/search.php).

Further reading


» Running Linux on BL920c Gen8
» Running Windows on HP Superdome X
» Running SQL 2014 on HP Superdome X - reference guide
» Best Practices for Optimizing Superdome X Performance in Linux: NUMA, Power Consumption, Network, I/O

Conclusions


1. You can move your business-critical workloads to a standard x86 platform at competitive prices. According to two IDC reports (IDC's Server Workloads 2008, June 2008; IDC Special Study, Server Workloads Forecast and Analysis Study, 2008-2013 (IDC #219746)), 85% of modern heavy workloads, including BI, CRM, and ERP, can run on x86 servers;
2. The openness of the x86-based HP Superdome X platform reduces the cost of acquiring the hardware and speeds up deployment compared with closed architectures;
3. A wide range of applications is available for the HP Superdome X: Xeon E7 processors support operating systems such as Linux and Windows, which speeds up application development;
4. Low total cost of ownership (TCO): moving to systems with Xeon E7 processors reduces TCO by 20-50% on average compared with RISC systems (ITIC report, 2013);
5. The availability of the HP Superdome X based on Intel Xeon E7 reaches 99.999% or more, which is comparable to that of modern RISC systems (see reports one and two);
6. The HP Superdome X provides long-term investment protection: this year you can already install blade servers with different processor generations, Ivy Bridge and Haswell, in the same Superdome X enclosure, and new Intel processors will be supported in Superdome X servers in the future.
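
To put the 99.999% availability figure from item 5 into perspective, it can be converted into allowed downtime per year. The sketch below is plain arithmetic; the function name is made up for this illustration, and it assumes a 365-day year and that availability is measured over all hours of the year.

```python
# Convert an availability percentage into downtime per year (365-day year).

MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent):
    """Minutes per year a system may be down at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for level in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(level)
    print(f"{level}% availability -> {minutes:.2f} minutes of downtime per year")
```

At "five nines" this comes to a little over five minutes of downtime per year.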

Source: https://habr.com/ru/post/262733/

