
What hung the system: a bug or a solar flare?



Source data: three Oracle database systems that actively exchange data with each other through Oracle's distributed transaction mechanism. At some point, the admins of all the servers began to observe queues forming whenever an application tried to insert or update data. Kill signals were immediately sent to the processes that were blocking users' work. After half an hour of troubleshooting, it was decided to restart all three systems. We then ran the procedures for clearing the hung distributed transactions that were stuck in the COLLECTING and PREPARED states. Details on what to do in this case can be found in the MOS document: How To Resolve Stranded DBA_2PC_PENDING Entries [ID 401302.1]. After the systems were brought back into working condition, the analysis began.
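For reference, a minimal sketch of the kind of cleanup that note describes (run as a privileged user; the local transaction ID '1.21.17' is purely illustrative, not from our incident):

```sql
-- Inspect stranded distributed transactions left after the restarts.
SELECT local_tran_id, state, fail_time
  FROM dba_2pc_pending
 WHERE state IN ('collecting', 'prepared');

-- For an in-doubt transaction whose outcome is known, force it:
--   COMMIT FORCE '1.21.17';    -- or: ROLLBACK FORCE '1.21.17';

-- For dead entries that only exist in the bookkeeping tables,
-- purge the row (transaction ID is illustrative):
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('1.21.17');
COMMIT;
```

The exact path (force vs. purge) depends on the state of each entry; the MOS note walks through the decision tree.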

The main question: what caused the crash? During the incident, one of the three systems (let's call it X) was found to behave anomalously at the initial diagnostics stage: logging in to the OS worked, but it was impossible to connect to the database. Moreover, the health-check notification systems reported that system X was in perfect order during the hang, so nobody paid attention to it. The problems were initially observed on the two other systems (let's call them Y and Z), so all attention went to them.

We began collecting diagnostics. Some had already been gathered before and during the hang. Looking ahead, I will say that screenshots of the profiles of the processes, captured from the monitoring system just before those processes were killed, really helped.
  1. On systems Y and Z, statistics for the processes that were killed first were exported from the active session history. Before being killed, these processes appeared in the active session history, each of them waiting on the 'SQL*Net message from dblink' event from remote system X. And here the first anomaly showed up: after the 12:32:08 timestamp, the active session history rows contained empty data in the consumed-CPU and DB-time columns (TM_DELTA_CPU_TIME, TM_DELTA_DB_TIME).
  2. Analysis of the database trace files and the listener on system X. In the database alert.log, the last timestamp before the system restart is Time: 06-SEP-2017 12:27:18; in the listener log, Wed Sep 06 12:27:18 2017. In the trace file directory, the last change to the NMON file was at 12:26.
  3. The active session history on system X showed that the last entry in the 12:20-14:00 range was made by background processes at 12:27:33. In other words, the Oracle server stopped saving history data at 12:27:33.

  4. Obviously, something had happened to the database on the system X side. We decided to look in detail at the OS statistics (CPU, I/O, etc.) of server X.



    There were significant waits: Wait% spikes while user processes were waiting to read or write data on disk (accompanied by a complete absence of kernel read/write system calls and network I/O, and a drop in the run queue). The waits also begin around 12:28.
  5. With the DB and OS covered, we decided to look at the underlying level: the file system and the storage. Nothing was found on the storage array, but on the Veritas side the output of the errpt command showed the following picture (heavily trimmed):
    BC3BE5A3 0906140317 PS SRC SOFTWARE PROGRAM ERROR
    A6DF45AA 0906140317 IO RMCdaemon The daemon is started.
    BC3BE5A3 0906140217 PS SRC SOFTWARE PROGRAM ERROR
    D221BD55 0906140217 IO perftune RESTRICTED TUNABLES MODIFIED AT REBOOT
    B258EF25 0906140117 TS VxDRV: vxdmp vxdmp driver
    B258EF25 0906140117 TS VxDRV: vxdmp vxdmp driver
    B258EF25 0906140117 TS VxDRV: vxdmp vxdmp driver
    ... trimmed ...
    B258EF25 0906122817 TS VxDRV: vxdmp vxdmp driver
    B258EF25 0906122817 TS VxDRV: vxdmp vxdmp driver
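The active-session-history exports from steps 1 and 3 above can be approximated with a query along these lines (the event name and time window are from the incident; the column selection is one illustrative way to slice the data):

```sql
-- Sessions stuck on the dblink to system X, with the CPU / DB time
-- deltas that went empty after 12:32:08 (step 1). Run on system X
-- itself (step 3), the same window shows no samples at all after
-- the 12:27:33 background-process entries.
SELECT sample_time,
       session_id,
       session_type,
       event,
       tm_delta_cpu_time,
       tm_delta_db_time
  FROM v$active_session_history
 WHERE sample_time BETWEEN TIMESTAMP '2017-09-06 12:20:00'
                       AND TIMESTAMP '2017-09-06 14:00:00'
   AND (event = 'SQL*Net message from dblink'
        OR session_type = 'BACKGROUND')
 ORDER BY sample_time;
```

For windows older than what V$ACTIVE_SESSION_HISTORY retains in memory, the persisted DBA_HIST_ACTIVE_SESS_HISTORY view exposes the same columns.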

Interim summary


The root cause of the problems was a denial of service on system X at around 12:27-12:28 local server time. This is clearly visible in the database and listener logs, the trace files, the active session history data, and the Veritas FS logs. It is also confirmed by the absence of activity after 12:32 on systems Y and Z, whose processes were stuck in network waits while inserting data and committing distributed transactions. These processes blocked other sessions, which led to large queues of processes forming on server Y. The blocking sessions were subsequently terminated abnormally.

The difference between the timestamps of the service stopping on Y and Z versus X is due to server X's clock lagging about 4 minutes behind the other systems. If we subtract this delta from the time activity ended on Y and Z (see item 1 above), we get 12:27-12:28, exactly the last-activity figures shown by the DB logs, the listener and the active session history on system X.

Restoring the sequence of events gives the following chain: system X becomes unreachable -> queues appear on Y and Z -> administrators kill the blocking processes -> avalanche-like escalation with hung distributed transactions across the remaining system components.

So, the trail leads to the Veritas VxFS file system on which the database files, the database trace files and the listener log were located. The FS froze, and the rest of the processes waited until their I/O operations completed. We reproduced the problem with the fsfreeze command on test systems and got a very similar picture. A search on the Veritas vendor site turned up a description of a bug with similar symptoms: Doc Id: 3898129, "IO's hung in DRL".

An unexpected turn


The next day I opened my news feed, and there were reports of flares on the Sun on September 6. On September 7, news headlines were full of stories about how this anomaly had affected communication systems, people, and so on.

September 6, 2017 14:05, rg.ru, “Scientists: A powerful flare on the Sun will impact the Earth”
September 6, 2017 14:44, rbc.ru, “Astronomers have registered the most powerful flare in the Sun in 12 years”
September 6, 2017 15:34, rg.ru, “Video: Due to a powerful flare on the Sun, Moscow will see auroras”
September 6, 2017 23:54, rg.ru, “Scientists have told about the consequences of a powerful flare on the Sun”

Some colleagues thought that, alongside the bug in Veritas FS, this version should also be treated as a working hypothesis. At the time, I did not attach much importance to it.

The other day I decided to look into this question properly. I went to the website of the Laboratory of X-ray Astronomy of the Sun (FIAN), which publishes solar activity statistics from the GOES-15 satellite. I opened the page with the data for September 6, and what do you know: in the title chart, the surge of activity leading up to the C2.7 flare falls exactly on the start of the problems on server X, and is very close to the flare maximum at 12:34. I double-checked the time: isn't it UTC? No, the time is local. This can't be right. A coincidence?

However, further searching led me to an article on Geektimes: "Cosmic radiation can affect systems".

It turns out that this is rare, but it does happen. Such events are called single-event upsets (SEU). They occur most often at high altitude or in near-earth orbit, but they also happen on the ground. A quote:

Although there are quite vivid examples of equipment failures, SEUs remain an extremely rare phenomenon. But experts note that electronic chips are used in more and more household appliances, and both the density of transistors on chips and their overall number keep growing. Because of this, the likelihood of encountering a "cosmic crash" grows every year. Electronics manufacturers are studying the problem.

It seems that solar activity may well affect the stability and resilience of information systems. What will their reliability be as process nodes shrink further and chips are packed ever more densely? An open question.

That's all.
Take care of yourself!

Alexander Kardapolov, service engineer at Jet Infosystems

Source: https://habr.com/ru/post/346502/

