
So what happened to Sberbank?

Instead of an epigraph:
Let them call me an Old Believer,
I don't care, I'm even glad:
I write in Goblin's meter,
And sing, friends, in the old way.
(almost M. Yu. Lermontov)

So,

- What happened to the processing of Sberbank?
- An Oracle DBMS error occurred that caused the instance to stop.
- What kind of strange things did the bank's vice president write? What is this “removal of events from the logs”?
- Note that the comment was written well after midnight, not on a specialized resource like sql.ru, and not by an Oracle administrator but by a vice president. It would be strange to expect a deep technical description in such a situation. Still, the description is quite accurate. The technical details follow below, where I try to put things “for the common people” more clearly than Orlovsky did.

- Well, let's have the technical details!
- The Oracle redo log is implemented as a ring buffer. The LGWR process writes new database changes to the “head”, and the “tail” is released by the CKPT process once the changes recorded in the “tail” have been written to the data files by the DBWn processes. The “head” is the current log file; the “tail” is the active files. Trimming the tail means marking log files as reusable (inactive). The problem was that this “cleanup” stopped: all the online log files became active, and the database instance had nowhere to write new changes.
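
The ring-buffer behaviour described above can be illustrated with a toy model. This is only a sketch, not Oracle code: the names `RedoRing`, `switch`, `checkpoint` and `simulate` are invented for the illustration, and the real LGWR/CKPT/DBWn interaction is far more involved. It shows just one thing: if active groups are never marked inactive, the writer runs out of reusable groups after one pass around the ring.

```python
CURRENT, ACTIVE, INACTIVE = "CURRENT", "ACTIVE", "INACTIVE"


class RedoLogFull(Exception):
    """Raised when the next log group in the ring cannot be reused."""


class RedoRing:
    """Toy model of a ring of redo log groups (hypothetical, not Oracle's API)."""

    def __init__(self, n_groups):
        self.status = [INACTIVE] * n_groups
        self.head = 0
        self.status[self.head] = CURRENT  # the "head" the writer fills

    def switch(self):
        """The writer has filled the current group and needs the next one."""
        nxt = (self.head + 1) % len(self.status)
        if self.status[nxt] != INACTIVE:
            raise RedoLogFull(f"group {nxt} is still {self.status[nxt]}")
        self.status[self.head] = ACTIVE  # joins the "tail"
        self.status[nxt] = CURRENT
        self.head = nxt

    def checkpoint(self):
        """Changes from the tail are in the data files: release the tail."""
        self.status = [INACTIVE if s == ACTIVE else s for s in self.status]


def simulate(n_groups, switches, cleanup_works):
    """Return how many log switches succeed before the ring fills up."""
    ring = RedoRing(n_groups)
    done = 0
    for _ in range(switches):
        if cleanup_works:
            ring.checkpoint()
        try:
            ring.switch()
        except RedoLogFull:
            break  # nowhere to write new changes: the "instance" stops
        done += 1
    return done


if __name__ == "__main__":
    print(simulate(3, 10, cleanup_works=True))   # all 10 switches succeed
    print(simulate(3, 10, cleanup_works=False))  # stops after 2 switches
```

With three groups and working cleanup, every switch finds a reusable group. With cleanup stopped, the third switch finds the next group still active, which is the situation described in the answer above.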

- And what, such an important database has no working standby node?
- It does, but in this case it turned out to be useless, because all the errors were replicated byte for byte to the standby node. In such cases a low-level copy (storage-level replication or a standby database) does not help; only application-level replication can. The performance of IBM InfoSphere CDC for Oracle Replication was not sufficient to replicate a database like this one. GoldenGate has not been tested on it yet. The most correct option would be for the application itself to write data to several databases at once, but how many applications do you know that can do that?
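
To make the last option concrete, here is a minimal sketch of an application that writes the same logical change to several independent databases. Everything in it is an assumption made for illustration: the `payments` table, the `open_db` and `record_payment` helpers, and the use of in-memory SQLite in place of real database servers. It has nothing to do with how Sberbank's processing actually works.

```python
import sqlite3


def open_db():
    """Create an independent in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def record_payment(databases, payment_id, amount):
    """Apply one logical change to every database; return how many accepted it.

    Each database receives the change as a regular SQL statement, not as a
    copy of another database's blocks, so a low-level fault in one copy is
    not carried over to the others.
    """
    accepted = 0
    for conn in databases:
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT INTO payments (id, amount) VALUES (?, ?)",
                    (payment_id, amount),
                )
            accepted += 1
        except sqlite3.Error:
            pass  # this copy is unavailable; the others keep accepting writes
    return accepted


if __name__ == "__main__":
    primary, secondary = open_db(), open_db()
    print(record_payment([primary, secondary], 1, 500))  # both copies accept
    primary.close()  # simulate the primary instance stopping
    print(record_payment([primary, secondary], 2, 700))  # only one copy accepts
```

The sketch also hints at why few applications do this: after the failure the two copies have diverged, and the application would need its own logic to detect and reconcile that, which the sketch deliberately leaves out.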

- And why doesn't Sberbank have a support contract with Oracle?
- There is a contract. Premium, or platinum, or something of the sort; the only thing cooler is Everest on a moonless night. In fact, Oracle experts helped solve the problem. Believe me, three hours to restore service after an accident like this is not that long.

- How can it be: such a database without RAC!
- RAC is no panacea. For example, CFT, one of the leading vendors of banking software in Russia, has adapted its applications for Exadata. At the same time, it officially states that previous versions of the software will not work on Exadata; that is, the software needs significant rework even for the best RAC implementation, to say nothing of makeshift “do-it-yourself” builds. The Way4 software has not yet been adapted for RAC, although work is underway. And the pace of that work depends not only on Sberbank.

- Hire some normal administrators at last!
- Sberbank's administrators are good. But, alas, they are not wizards.

- But with PostgreSQL you could fix all the bugs yourself, without waiting for the vendor's mercy!
- Believe it or not, database performance tuning, backup, hardware sizing and so on require a completely different set of skills and competencies than writing fairly low-level code. That is why even large companies buy Postgres in the form of EnterpriseDB or Greenplum.

- Metalink has long had a solution to any problem!
- Have you never noticed that a problem description there comes with a detailed description of the hardware configuration, version, OS...? The Oracle installation used in Sberbank's processing is unique in its own way: there are not that many fully configured IBM P795 servers in the world. So a patch may simply not exist. The server itself was released only in October 2010, so its period of “childhood diseases” is not over yet. It is therefore quite likely (and a number of indirect signs point to this) that the error is not in Oracle but in AIX.

- Well, open an SR and they will release a patch for you right away!
- Look at the history of patches on that same Metalink: when a problem was discovered, when it was confirmed, when it was fixed. It never fits into 24 hours.

- But why does nothing crash at other banks!
- First, all other banks combined have almost as many cards as Sberbank alone. So their processing systems run under lighter load on simpler hardware. Second, literally the day after the failure under discussion, a colleague of mine could not perform a single operation with an “A ...” card all day. But who cares?

- But Sberbank's ATMs run Windows, what kind of reliability can we even talk about?
- ATMs have nothing to do with processing. Management of ATMs and POS terminals at Sberbank is handled by another system. Which, by the way, kept working during Friday's crash, letting the creative class buy iPhones in boutiques with cards from other banks. And at other banks, obviously, ATMs run entirely different operating systems. And the ATMs themselves are made at other factories. By elves, obviously :)

- Sberbank came up with a platform for discussing the causes of the accident suspiciously quickly!
- Sberbank tries to follow fashionable trends. Today, “crowdsourcing”, “social networks” and other “web-dance” matter no less to a bank's image than reliable operation of its core systems. Hence the ready-made platform for discussion.

- Evidently there are no specialists left, since they have turned to “collective intelligence”. And who will pay the “collective intelligence” for solving the problem?
- As far as I understand, the goal is different: a specialist who sees the logs will understand that the problem really is serious. A non-specialist, given the chance to speak, will sling mud at Sberbank; well, for God's sake, let him: “the dog barks, but the caravan moves on.” And there is no need to worry that more information than is allowed will end up in the open. All of the information is vetted.

Source: https://habr.com/ru/post/148102/

