
Oracle RAC. General Description / Part 1

Highly loaded sites, availability of "five nines"... In the backend, a mass of information is being processed in a database. But what if the hardware goes berserk, some long-dormant bug in the OS brings the system down, or a network interface drops? What happens to the availability of the data? Out of pure curiosity, I decided to look at what solutions Oracle offers for the problems listed above. The recent versions, unlike Oracle 9i, are called Oracle 10g (or 11g), where g stands for "grid", i.e. distributed computing. The foundation of distributed computing, whichever way you look at it, is the cluster, plus additional data replication technologies (DataGuard, Streams). This article outlines how the cluster in Oracle 10g is organized. It is called Real Application Cluster (RAC).

The article does not claim to be complete or comprehensive, and it leaves configuration details out (so as not to grow in size). The point is simply to give an idea of RAC technology.

P.S. Beware: long read ahead.
Article continuation


I wanted to write the article as accessibly as possible, so that it would be interesting to read even for someone not familiar with Oracle DBMS. So I will venture to begin the description with the most common database configuration - single-instance, where one Oracle server (RDBMS) runs on one physical server. This is not directly related to the cluster, but the basic requirements and principles of operation are the same.

Introduction. Single-instance.


The most common configuration, a database installed on a single physical server, is called single-instance. So far I have not attached much importance to the distinction between a database instance and the database as a whole. Here I want to stress that an instance is the software (processes, threads, services) that lives in RAM and processes the data (sorting, buffering, maintenance) read directly from disk. The database as a whole, then, is the combination of the instance (the processes and memory structures) and the physical files on disk that hold the data.


In all modern relational databases, data is stored in tables. Tables, indexes, and other objects in Oracle live in logical containers - tablespaces. Physically, a tablespace consists of one or more files on disk, and the data inside is organized as follows:
Each database object (table, index, rollback segment, etc.) is stored in its own segment - a region of disk space that may occupy room in one or more files. A segment, in turn, consists of one or more extents. An extent is a contiguous chunk of space within a single file. Extents consist of blocks. The block is the smallest unit of space allocation in Oracle, 8 KB by default. Blocks hold data rows, index entries, or other intermediate data, and it is in whole blocks that the Oracle server usually reads from and writes to disk. Every block has an address, the so-called DBA (Database Block Address).
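To make the tablespace → segment → extent → block chain a bit more tangible, here is a small SQL sketch using standard dictionary views (the table name EMP and its data are assumptions made purely for illustration):

    -- How a segment is spread over extents and blocks:
    SELECT segment_name, segment_type, tablespace_name, bytes, extents, blocks
      FROM dba_segments
     WHERE segment_name = 'EMP';

    -- Each extent is a contiguous run of blocks inside one datafile:
    SELECT extent_id, file_id, block_id, blocks
      FROM dba_extents
     WHERE segment_name = 'EMP'
     ORDER BY extent_id;

    -- DBMS_ROWID decodes the block address (file# and block#) of a given row:
    SELECT dbms_rowid.rowid_relative_fno(rowid) AS file#,
           dbms_rowid.rowid_block_number(rowid) AS block#
      FROM emp
     WHERE ROWNUM = 1;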



Whenever DML (Data Manipulation Language) touches the database, Oracle loads the relevant blocks from disk into RAM, specifically into the buffer cache (unless they happen to be there already, in which case the disk does not need to be touched at all). If the request changed the data (update, insert, delete), the changes to the blocks are made directly in the buffer cache, and the blocks are marked dirty. But blocks are not flushed to disk immediately: the disk is the bottleneck of any database, so Oracle tries to touch it as rarely as possible. Dirty blocks are written to disk automatically by the background DBWn process when a checkpoint is passed or when the redo log is switched.
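As a quick way to see this in action, the V$BH view exposes the contents of the buffer cache, including the dirty flag; a minimal sketch using nothing beyond standard dynamic views:

    -- Cached block copies that have been modified but not yet written by DBWn:
    SELECT file#, block#, status, dirty
      FROM v$bh
     WHERE dirty = 'Y'
       AND ROWNUM <= 10;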


Suppose a long-running transaction has started reading data, and somewhere in the middle of its execution another transaction is launched with the intention of changing one of the blocks being read. How will the server reconcile these requests? The question really splits in two:
  1. What happens if Oracle crashes somewhere in the middle of a long transaction that has already made changes?
  2. What data will the first transaction read if, right "under its nose", another transaction has changed the block in the cache?


To answer these questions, let us look at the mechanism that ensures consistent reads - CR (consistent read). The whole magic is in the transaction logs, which in Oracle come in two kinds: redo and undo.

When a change request arrives at the database, Oracle applies it in the buffer cache and, in parallel, writes enough information to repeat the action into the redo log buffer in RAM. As soon as the transaction completes and is confirmed, the server flushes the contents of the redo log buffer to the online redo log on disk in append-write mode and commits the transaction. This approach is far cheaper than writing the modified block itself to disk. If the server crashes, the cache and all the changes in it are lost, but the redo log files remain. On startup, Oracle begins by reading them and re-applying the changes (transactions) that have not yet made it into the datafiles. This is called "rolling" the changes forward from redo (roll-forward). The online redo log is flushed to disk by LGWR when a transaction is confirmed, when a checkpoint is passed, or every 3 seconds (by default).
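The online redo log can be observed (and switched) from SQL*Plus with standard views and DBA commands; a minimal sketch, assuming nothing beyond a running instance:

    -- Redo log groups, their size and status:
    SELECT group#, sequence#, bytes, members, status
      FROM v$log;

    -- The physical members (files) of each group:
    SELECT group#, member
      FROM v$logfile
     ORDER BY group#;

    -- A DBA can force a checkpoint or a log switch manually:
    ALTER SYSTEM CHECKPOINT;
    ALTER SYSTEM SWITCH LOGFILE;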

With undo, things are a bit more complicated. Alongside each table, an associated undo (rollback) segment is kept in a neighbouring segment. When a DML request loads the table blocks, the data from the rollback segment is loaded as well and also kept in the buffer cache. When the table data is changed in the cache, the undo data in the cache is changed too: the "counter-actions" are written there. That is, if an insert goes into the table, a delete goes into the rollback segment; for a delete, an insert; for an update, the previous value of the row is recorded. The blocks (and the corresponding undo data) are marked dirty and go into the redo log buffer. Yes, the redo log records not only the changes to be made (redo) but also the counter-actions (undo) for them. Since LGWR flushes the redo log buffer every 3 seconds, if a long transaction (one running for a couple of minutes) is interrupted - say the server crashes a minute into it - the redo records will contain no commit. On waking up, Oracle rolls them forward (roll-forward) and then, using the rollback segments recovered in memory from the redo log, rolls back (roll-back) all uncommitted transactions. Justice is restored.
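A tiny illustration of undo at work (the ACCOUNTS table here is hypothetical, used only for the example):

    -- The change is applied in the cache, and an undo record (the old value) is kept:
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    -- Other sessions still see the old balance (read consistency via undo).
    ROLLBACK;  -- the stored counter-action restores the previous row value

    -- Undo (rollback) segments are visible through the data dictionary:
    SELECT segment_name, tablespace_name, status
      FROM dba_rollback_segs;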

Another indisputable advantage of the undo segment deserves a brief mention. In the second scenario (from the diagram), when the select reaches the block with DBA 500, it suddenly discovers that this block has already been changed in the cache (marked dirty), so it turns to the rollback segment to obtain the corresponding previous state of the block. If such a previous (flashback) copy is not in the cache, it reads it from disk and continues executing the select. Thus, even during a long "select count(money) from bookkeeping", debit and credit will match. This is consistent reading (CR).
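The same undo data also powers Flashback Query, which lets a session read a table as it looked at a past moment; a hedged example with the hypothetical BOOKKEEPING table from the paragraph above:

    -- Read the table as of five minutes ago (the old versions come from undo):
    SELECT SUM(money)
      FROM bookkeeping AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '5' MINUTE);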

But I digress. It's time to look at approaches to the cluster configuration. =)


Level of access to data. ASM.




Storage (datastorage) in large databases is almost always a SAN (Storage Area Network), which gives the servers a transparent interface to the disk arrays.
Third-party vendors (Hitachi, HP, Sun, Veritas) offer complete solutions for building such SANs on top of a number of protocols (the most common being Fibre Channel), with extra functionality: mirroring, load balancing, hot-plugging disks, repartitioning space between partitions, and so on.
Oracle Corporation's position on building a database of any scale boils down to this: all you really need is the right Oracle software (with the right licences), and the hardware is whatever you can manage (whatever money is left after buying Oracle :). So a heavily loaded database can be built without expensive SPARC servers and fully stuffed SANs, on servers running free Linux and cheap RAID arrays.

At the level of access to data and disks, Oracle offers its own solution - ASM (Automatic Storage Management). It is a separately installed mini Oracle instance (INSTANCE_TYPE = ASM) that provides disk services to each node of the cluster.

Oracle tries to avoid hitting the disk, because that is probably the main bottleneck of any database. Oracle does its own data caching, but file systems buffer writes to disk as well - and why buffer the data twice? Moreover, once Oracle has confirmed a transaction and been notified that the changes have reached the files, it is highly desirable that they really are on disk and not sitting in a cache, in case the server goes down. For this reason it is recommended to use raw devices (disks without a file system), which is exactly what ASM works with.

ASM works on top of raw devices. One of its central concepts is the disk group - a combination of several disks. When files are written to the disks, the data is laid out in 1 MB extents, distributed across all disks in the group. This is done to ensure high availability, since parts of the same table (from its tablespace) end up scattered across different physical disks.
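If an ASM instance is running, its disk groups and member disks can be inspected through the standard V$ASM views; a minimal sketch (the actual group names and paths depend on the installation):

    -- Disk groups, their redundancy type and free space:
    SELECT name, type, total_mb, free_mb
      FROM v$asm_diskgroup;

    -- The disks that make up each group, with their failure groups:
    SELECT group_number, disk_number, name, path, failgroup
      FROM v$asm_disk;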

Among ASM's capabilities is protection against a single point of failure. Suppose several disks are attached to one controller and thus form an SPF (single point of failure): if the controller fails, we lose the entire disk array. For this case ASM can define Failure Groups within a Disk Group. With this mechanism, mirroring scatters copies of extents across disks belonging to different failure groups, so that the death of, say, a SAN switch or a RAID controller does not take the data with it.
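As a sketch of how this is declared, a disk group can be created with explicit failure groups so that mirrored extent copies never end up behind the same controller (the disk paths and names below are assumptions for illustration):

    CREATE DISKGROUP data NORMAL REDUNDANCY
      FAILGROUP ctrl1 DISK '/dev/raw/raw1', '/dev/raw/raw2'
      FAILGROUP ctrl2 DISK '/dev/raw/raw3', '/dev/raw/raw4';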

Thus, the cluster can now store and read data from shared storage.
Time to move up a level.


Clusterware. CRS.



At this level we need to ensure coordination and joint operation of the cluster nodes, i.e. a clusterware layer sitting somewhere between the database instances themselves and the disk storage:

CRS (Cluster-Ready Services) is a set of services that provides joint operation of the nodes, fault tolerance, high availability, and recovery of the system after a failure. CRS looks like a "mini-instance" of the database (software) installed on every node of the cluster. Installing CRS is mandatory for building Oracle RAC. In addition, CRS can be integrated with third-party clusterware solutions, such as those from HP or Sun.

Again a bit of "terminology" ...

CRS consists of 3 main components:
Process | Purpose (briefly)                                                | Runs as | If the process dies, what restarts
CSSD    | Synchronization mechanism for node communication in the cluster | user    | the process
CRSD    | The main "engine" that keeps resources available                | root    | the host
EVMD    | Notifies about events occurring in the cluster                  | user    | the process
Cluster settings are stored in the OCR (Oracle Cluster Registry). The OCR is a special file of node profiles that holds their current configuration: node availability, distribution of services (several databases can be served by different groups of nodes within one cluster), network settings, and so on. Physically, the OCR resides in shared storage. While the cluster is running, each node keeps a copy of the OCR in memory, and only one node (the master) updates the OCR on disk.

As the table above makes clear, the most important process, the "mightiest daemon", is CRSD (Cluster Ready Services Daemon). Its responsibilities include starting and stopping the node, generating failure logs, and reconfiguring the cluster when a node goes down; it is also responsible for recovery after failures and for maintaining the OCR profile file. If this daemon crashes, the node reboots entirely. CRS manages the OCR resources: Global Services Daemon (GSD), ONS Daemon, Virtual Internet Protocol (VIP), listeners, databases, instances, and services.

The CSSD (Cluster Synchronization Services Daemon) is responsible for coordinating the interaction of the cluster nodes, synchronizing the nodes and the resources between them, and determining their availability. CSSD provides dynamic information about which nodes and instances currently belong to the cluster, and is responsible for locking resources within the cluster.

The EVMD (Event Manager Daemon) plays the role of the cluster's informer: it notifies the nodes about events - a node has started, has lost connectivity, has been restored. It acts as the link between CRSD and CSSD. Notifications are also sent to ONS (Oracle Notification Services), Oracle's universal gateway through which alerts can be forwarded further, for example as SMS or e-mail.

The cluster starts up roughly as follows: CSSD reads the OCR from shared storage to learn the cluster configuration and to find out where the voting disk is; it reads the voting disk to find out how many nodes are (already up) in the cluster and what their names are; then it establishes connections to the neighbouring nodes over the IPC protocol. By exchanging heartbeats it checks whether all the neighbouring nodes have come up and finds out which of them is defined as the master in the current configuration. The first node to start becomes the master node. After startup, all running nodes register with the master and from then on supply it with information about their resources.

One level above CRS on the nodes sit the database instances themselves.
The nodes communicate with each other over a private network - the Cluster Interconnect - using the IPC (Interprocess Communication) protocol. The requirements for it are high bandwidth and low latency. It can be built on high-speed flavours of Ethernet, on third-party solutions (HP, Veritas, Sun), or on InfiniBand, which is gaining popularity. The latter, besides high bandwidth, writes and reads directly to and from the application buffer, without kernel-level calls. On top of IP, Oracle recommends using UDP on Linux and TCP on Windows. Oracle also recommends keeping packet delays over the interconnect within 6-15 ms.
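Jumping slightly ahead: once the RAC instances are up, the global (GV$) dynamic views aggregate data from every instance in the cluster, which gives a quick sanity check from plain SQL. A minimal sketch using standard views only:

    -- Which instances are currently part of the cluster, and on which hosts:
    SELECT inst_id, instance_name, host_name, status
      FROM gv$instance;

    -- Which private interconnect each instance is actually using:
    SELECT inst_id, name, ip_address, is_public, source
      FROM gv$cluster_interconnects;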

Article continuation

Source: https://habr.com/ru/post/72122/

