Continuation of the article about Real Application Clusters (RAC). Concluding part.
Beginning of the article
We assume the cluster is up and everything is humming along.
Interaction between nodes. Cache Fusion.
Many database instances, many disks. User requests come pouring in ... here they are, the clients we have been waiting for. =)
The bottleneck of any database is disk I/O, so every database tries to touch the disks as rarely as possible by deferring writes. In RAC this works just as in a single-instance database: each node keeps an SGA (System Global Area) in RAM, and inside it a buffer cache (database buffer cache). Every block once read from disk lands in this buffer and stays there as long as possible. But the cache is not infinite, so to judge the importance of a cached block Oracle uses the TCA (Touch Count Algorithm), which counts accesses to each block. On first entering the cache, a block is placed at the cache's cold end; the more often the block is accessed, the closer it moves to the hot end. If a block goes "stale", it gradually loses its position in the cache and risks being evicted; eviction starts with the least-used blocks. A node's cache is extremely important for its performance, so to keep performance high in a cluster the caches have to be shared (as you-know-who bequeathed). Blocks held in a cluster node's cache can play a local role, i.e. serve only the node itself, while some will already carry a global mark: those the node, gritting its teeth, shares with the other nodes of the cluster.
The shared-cache technology in a cluster is called Cache Fusion. On each node CRS spawns LMSn processes; collectively, as a service, they are called GCS (Global Cache Service). These processes copy blocks read on their instance (the global ones) from the buffer cache over the network to the instance that requested them, and are also responsible for rolling back uncommitted transactions. A single instance can run up to 36 of them (the GCS_SERVER_PROCESSES parameter); the usual recommendation is one LMSn per two cores, otherwise they consume too many resources. Their coordination is the responsibility of the GES (Global Enqueue Service), represented on each node by the LMON and LMD processes. LMON monitors the global resources of the whole cluster, requests blocks from neighboring nodes, and manages GCS recovery; when a node joins or leaves the cluster, it initiates the reconfiguration of locks and resources. LMD manages the node's own resources, controls access to shared blocks and queues, is responsible for lock requests to GCS, and manages the servicing of the LMSn request queue. LMD is also responsible for resolving global deadlocks that span several cluster nodes.
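To see this machinery on a live cluster, you can peek at the background processes and the parameter that caps them. A minimal sketch, assuming you have SELECT privileges on the GV$ views (names and availability vary slightly by release):

```sql
-- How many LMSn (GCS server) processes is each instance allowed to run?
SELECT inst_id, name, value
FROM   gv$parameter
WHERE  name = 'gcs_server_processes';

-- The Cache Fusion machinery itself: LMSn, LMON, LMD on every node
SELECT inst_id, name, description
FROM   gv$bgprocess
WHERE  paddr <> '00'                 -- only processes that are actually running
AND   (name LIKE 'LMS%' OR name IN ('LMON', 'LMD0'))
ORDER  BY inst_id, name;
```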
But a special role in coordinating cluster resources and unifying the caches belongs to the GRD (Global Resource Directory) table, which constantly records which instance holds which block (or its copy), the mode in which the instance holds the block, the block's role, and the SCN. In a single-instance database the SCN is simply an incrementing counter of changes made to the database; in a cluster, SCNs additionally have to be synchronized between nodes. There are two SCN synchronization methods, depending on the value of the MAX_COMMIT_PROPAGATION_DELAY parameter, specified in hundredths of a second:
- If > 100, SCNs are generated (incremented) on the instances in parallel. The SCN is embedded into every message sent over the interconnect, and in idle periods Oracle additionally polls the instances for their current SCN values at certain intervals, typically up to 700 (i.e. 7 seconds), for mutual synchronization. This suits long transactions with a "late" commit.
- If = 0 (the default), then as soon as a COMMIT occurs on any instance, it immediately broadcasts its new SCN to the rest. The exchange happens very often (within fractions of a second), and therefore generates a lot of interconnect traffic.
As a result, SCNs in RAC are regularly synchronized up to the largest SCN known in the cluster. The SCN is needed to line up block changes in chronological order, which is required when recovering transactions (roll-forward).
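You can watch this synchronization: the current SCN queried on the different instances at (almost) the same moment stays in step. A small sketch, again assuming access to the GV$ views:

```sql
-- Current SCN as seen by every instance; in RAC the values stay
-- closely synchronized across the nodes
SELECT inst_id, current_scn FROM gv$database;

-- The same counter from the session's point of view
-- (requires EXECUTE on DBMS_FLASHBACK)
SELECT dbms_flashback.get_system_change_number AS scn FROM dual;
```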
The GRD table is distributed across the cluster nodes: each node participates in allocating cluster resources, maintaining its own part of the GRD. The parts of the GRD that describe resources (objects: tables, indexes, etc.) are constantly synchronized (updated) between the nodes.
When a node reads a data block from disk, it becomes the master of this resource and makes the corresponding entry in its part of the GRD table. The block is marked as local, since so far the node uses it alone. If another node requests this block, the GCS process marks the block in the table as global ("published" to the cluster) and hands it over to the requesting node.
DBA | location | mode | role | SCN | PI/XI
500 | node 3 | shared | local | 9996 | 0 (XI)
OK, let's put it all together! In the GRD on the master node, the following additional information is recorded with each block to coordinate its distribution among the cluster instances:
- Data Block Address (DBA): the physical address of the block.
- Location: the node on which the block resides.
- Resource mode: determined by who currently holds the block and what operation is going to be applied to it:
- null: the node does not intend to change the block (select only);
- shared: the block is held for read-only access and may be cached by several nodes at once;
- exclusive: the node is about to change (or has already changed) the block. Earlier (consistent) versions of the same block may still be present in the cluster, but they can no longer be modified.
- Resource role: the mode in which the block is held in the node's cache:
- local: the node has simply read the block from disk and has not yet shared it with anyone;
- global: the node originally read the block, but has since transferred it, in some mode, to a node that requested it; copies of the block may now be present on other nodes as well.
- System Change Number (SCN): the SCN at which the block was last changed.
- Image: characterizes a copy of the block when several copies are held simultaneously in the global cache of the whole cluster:
- Past Image (PI): a globally dirty block (an older version, after a change) kept in a node's cache after the node has shipped the block over the network to another node. The block is kept in memory until it, or a later version, is written to disk; GCS notifies the nodes when a PI is no longer needed.
- Current Image (XI): the current, latest copy of the block, held by the last node in the chain of requests for this block.
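The modes and images above can actually be observed in the buffer caches: the GV$BH view shows, for every cached copy of a block, which instance holds it and in what state. A sketch, with a hypothetical file/block pair standing in for the DBA 500 used in the examples:

```sql
-- Every cached copy of one block across the cluster.
-- status: xcur = current, exclusive; scur = current, shared;
--         cr = consistent-read copy; pi = past image
SELECT inst_id, file#, block#, status, dirty
FROM   gv$bh
WHERE  file# = 4 AND block# = 500    -- hypothetical block address
ORDER  BY inst_id;
```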
Some of the most important principles of a (single-instance) database remain true for RAC:
- touch the disk as rarely as possible, by working actively with the cache;
- guarantee consistent read (CR, read consistency): no (parallel) session will ever see the data of an uncommitted transaction.
What distinguishes the RAC environment from an ordinary single-instance database is that, like it or not, locks here operate not at the row level but at the block level: an instance can lock a whole block (with its contents plus other data someone else may need).
Let us walk through several situations typical in the life of a cluster:
- Read/read behavior (no transfer).
Suppose node 4 was the first to read the data of table A; it is the master of this table and maintains the corresponding part of the GRD.
- Node 3 receives a request to read from table A. Node 3 does not have the needed block in its cache; it learns from the GRD that the master of table A is node 4, and turns to it.
- Node 4 checks its GRD for the requested block. If the block were in its cache, it would simply ship it; but suppose it is not there. Node 4 tells node 3 to read the block from disk on its own.
- Node 3 reads the block from disk, for now only for itself, sharing it with no one (local role); later it can grant access to other nodes through the intermediary, the master of this table (shared mode).
- Node 3 reports back to the master of table A, node 4, which makes the corresponding entry in the GRD (on node 4):
DBA | location | mode | role | SCN | PI/XI
500 | node 3 | shared | local | 9996 | 0 (XI)
- Read/read behavior (transfer).
- Suppose now that a request to read the same block lands on node 2. From its local copy of the GRD, node 2 knows that master node 4 is responsible for this resource (table A), and turns to it.
- Node 4 finds out from its GRD table that the block currently resides on node 3 and sends node 3 an instruction to share the block with node 2 in read mode.
- Node 3, having received the command, sends node 2 a copy of the block, indicating in the message header that it is sharing a copy, and keeps a copy of the block for itself.
- Node 2 receives the block and, via GCS, notifies master node 4 that it has got the needed block. The master updates the GRD (on node 4):
DBA | location | mode | role | SCN | PI/XI
500 | node 2 | shared | local | 9996 | 0 (XI)
500 | node 3 | shared | local | 9996 | 0 (XI)
- Read/write behavior.
Finally, a user decides to make changes to table A in this block; suppose the user is connected to node 1.
- Node 1 sends master node 4 a request for exclusive mode on the block.
- Node 4 sends a message to all nodes holding the block except one (say, node 3), telling them to switch the block to a purely local state (null mode, local role): the block leaves the global cache, its locks are released (it may now be overwritten), and it stays in the cache only to serve consistent-read requests.
- The master asks the remaining holder (node 3) to transfer the block in exclusive mode to node 1.
- Node 3 ships the block to node 1, indicating that the block is transferred in exclusive mode and that its own copy therefore leaves the global cache. Node 3 releases its locks on the block (it may now be overwritten), keeping it in the cache only for consistent-read requests.
- The requesting node 1 finally gets its block, makes the needed changes, and assigns a new SCN. It notifies master node 4 that the block has been received and changed; the message additionally reports that node 3 has released the block.
- Node 4 confirms to node 3 that its copy can be dropped. The GRD on node 4 now contains:
DBA | location | mode | role | SCN | PI/XI
500 | node 1 | exclusive | local | 10010 | 0 (XI)
- Write/write behavior.
Suppose now that the same block of table A needs to be updated on another node at the same time. A single node is always the master of a resource, and GCS acts through it consistently, using resource-lock mechanisms.
- An update request arrives at node 2, which has already dropped this block from the global cache. Node 2 again turns to the master with a request for exclusive mode on the block.
- Node 4 asks the current holder of the block to ship its block image to node 2.
- Node 1, after making sure that the redo records with the changes applied to the block have been flushed to disk (important: the transaction on node 1 may well still be running!), converts its copy into a past image (after all, node 2 is about to make changes to it), which it can no longer modify. It then ships the block in exclusive mode to the requesting node 2, additionally noting in the message that it now holds a PI of the block.
- The requesting node 2 receives the block, makes the needed changes, and assigns a new SCN. It notifies master node 4 that the block has been received and changed; the message additionally reports that node 2 now holds the block in exclusive mode, while node 1 keeps its previous version (SCN = 10010) as a PI.
- Node 4 amends the GRD:
DBA | location | mode | role | SCN | PI/XI
500 | node 1 | shared | global | 10010 | 1 (PI)
500 | node 2 | exclusive | local | 10016 | 0 (XI)
- Write/read behavior.
Now the situation is different: the block has been changed on one of the nodes, so the latest version no longer matches what is stored on disk.
- A select request on table A arrives at node 3, which turns to the master of the resource.
- The master learns from the GRD that the latest version of the block is held by node 2, and sends node 2 a request to pass the block to node 3.
- Node 2 had received the block and held it in exclusive mode. As soon as it finishes working with the block, node 2 switches it to shared mode, marks its copy as a past image (SCN = 10016), and ships the block to node 3.
- Node 4 amends the GRD:
DBA | location | mode | role | SCN | PI/XI
500 | node 1 | shared | local | 10010 | 1 (PI)
500 | node 2 | shared | global | 10016 | 1 (PI)
500 | node 3 | shared | global | 10016 | 0 (XI)
- Write-to-disk behavior.
Suppose the time has come to flush data to disk. That moment always comes, however long you put it off:
- the amount of dirty data in the buffer exceeds a certain threshold: the modified data must be written out to free the buffer so that something new can be put into it;
- the buffer runs out of free space;
- DBWR regularly flushes changed data from the buffer to disk (LOG_CHECKPOINT_TIMEOUT);
- by the way, this can also be controlled through another parameter, FAST_START_MTTR_TARGET (e.g. 3 seconds), which determines how often a checkpoint occurs, i.e. how often the records accumulated in the redo log buffer and the buffer cache are flushed, so that recovery (roll-forward) of a node's transactions after a failure takes roughly that long (see the sketch after this scenario).
- Suppose node 1, holding an outdated version of the block, decides to write it to disk; it turns to the master of the resource, node 4, supplying the SCN of its block.
- Node 4 learns from the GRD that the latest version of the block is held by node 2, and sends node 2 a request to write the block to disk.
- While node 2 is writing to disk, nobody may change the block; requests will have to queue up in the GRD. This concerns globally shared blocks (outdated and purely local copies can simply be discarded and are not taken into account).
- Node 2 writes the data to disk.
- Node 2 notifies the master that the data has been written to disk. The block reverts to the local role (only node 2 keeps this block now).
- Having received this message, GCS signals all nodes to discard their PIs, and whoever held the block in exclusive mode switches it to the local role.
- From now on the block can be read from disk again, or obtained from the node that holds it in memory.
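The checkpoint-related knobs from the flush conditions above can be inspected and adjusted directly. A minimal sketch; the 3-second target is illustrative, not a recommendation:

```sql
-- Aim for roughly 3 seconds of roll-forward after an instance crash
ALTER SYSTEM SET fast_start_mttr_target = 3 SCOPE = BOTH;

-- Current checkpoint-related settings
SELECT name, value
FROM   v$parameter
WHERE  name IN ('fast_start_mttr_target', 'log_checkpoint_timeout');
```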
No write to disk happens without a need. A copy of a block is always kept on the node where it is used most often. If some block is not yet in the global cache, then upon a request to the master, the master asks the appropriate node to read the block from disk and share it with the other nodes (as needed).
From the above it becomes clear that Cache Fusion involves two scenarios:
- two nodes participate: when the requesting node needs a block that is held in the master's cache;
- three nodes participate: when the master forwards the request to an intermediate node, which ships the block to the node that requested it.
However many nodes the cluster has, the number of hops (nodes involved in shipping a block) never exceeds 3. This fact explains RAC's ability to scale to a large number of nodes without pain.
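Interconnect block traffic, including how many transfers needed two or three hops, shows up in the statistics. A sketch; the per-hop columns of V$INSTANCE_CACHE_TRANSFER depend on the release, so check your version:

```sql
-- Cumulative Cache Fusion traffic on this instance
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received',
                'gc cr blocks served',   'gc current blocks served');

-- Per-source-instance breakdown: 2-hop vs 3-hop transfers
SELECT instance, class, cr_2hop, cr_3hop, current_2hop, current_3hop
FROM   v$instance_cache_transfer
WHERE  class = 'data block';
```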
Taking fire, need assistance! Workload distribution.
The Cache Fusion machinery described above gives the cluster the ability to respond to the load on its nodes (automatically). This is how workload distribution, or resource remastering (redistribution of resources), happens:
If, say, 1500 users access resource A through node 1 while at roughly the same time 100 users access the same resource A through node 2, node 1 obviously receives far more requests and will read from disk more often. So node 1 will be designated the master for requests to resource A, and the GRD will be created and coordinated starting from node 1. If node 2 needs the same resources, then to get access to them it will have to coordinate its actions with the GCS and the GRD of node 1 and obtain the resources over the interconnect.
If the usage pattern shifts in favor of node 2, then nodes 2 and 1 coordinate their actions over the interconnect, and node 2 becomes the master of resource A, since it now turns to the disk more often.
This is called resource affinity: resources are assigned to the node where more actions to acquire and lock them take place. The resource affinity policy coordinates the nodes' activity so that resources are most available where they are needed most. That, in brief, is all of workload distribution.
Redistribution (remastering) also occurs when a node joins or leaves the cluster. Oracle redistributes resources by an algorithm called lazy remastering, because Oracle takes almost no active steps to redistribute them: if a node crashes, all Oracle does is transfer the resources owned by the crashed node to one of the remaining (less loaded) nodes. After the load stabilizes, GCS and GES once again redistribute the resources (workload distribution), automatically, to the positions where they are most in demand. Much the same happens when a node is added: roughly equal shares of resources are detached from the active nodes and assigned to the newcomer, after which workload distribution takes place again.
As a rule, to initiate a dynamic redistribution, the workload on a particular node should exceed the workload of the rest for more than 10 minutes.
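Which instance currently masters which object, and how many times mastership has already moved, can be seen in a dedicated view (available since 10gR2). A sketch, assuming DBA access; note that instance numbers in this view are 0-based:

```sql
-- Per-object resource mastering: current and previous master,
-- plus how many times the object has been remastered
SELECT data_object_id, current_master, previous_master, remaster_cnt
FROM   gv$gcspfmaster_info
ORDER  BY remaster_cnt DESC;
```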
A bullet whistled by, and... huh? Recovery.
Suddenly a node stops answering heartbeat. The CSSD process on the node that notices this first "sounds the alarm" and reports to the master node (if the master is still reachable; otherwise it has to take over the master's duties itself). The master initiates a "voting" procedure among all nodes: the surviving cluster nodes check in on the voting disk. If the missing node leaves no trace there, the master starts the procedure of evicting the lost node from the cluster. The redo log file will be read twice: once for the redo records, and a second time for the undo records, so that the database becomes available for queries as early as possible.
- The part of the GRD table with the resources of the failed node is "frozen".
- The unresponsive node is marked as "missing", so that the remaining nodes do not try to reach it in vain over the interconnect.
- The node that first detected the loss starts recovering the information that was being processed on the vanished node:
- it slows down the servicing of its own transactions, diverting computing resources to the recovery;
- it turns to the shared file storage (datastorage) and begins applying to itself the online redo logs of the missing node. Taking the SCNs of the blocks into account, it merges them with what it holds in its buffer and rolls them forward (roll-forward) in its cache, skipping those outdated block records (PIs) whose later versions have already been flushed to disk. If blocks being read have resource masters elsewhere in the cluster, the node reports the list of blocks it has read, and those masters place locks on the resources so that the other nodes do not touch them (until they are recovered);
- after that, in a second pass over the redo log, now taking the undo records into account, it rolls back (roll-back) the uncommitted transactions. This uses fast-recovery technology: the transaction rollback is performed by a separate background process. Oracle returns the blocks locked by incomplete (uncommitted) transactions to a consistent state, to their previous values, as soon as a request for those blocks arrives, or the parallel background process restores them by then. The locks in the cluster are thus released earlier, and new user requests can be executed.
- The part of the GRD table that belonged to the failed node is thawed on the recovering node (which is now the master of those resources). Thus the cluster restores the state, as of the moment of the crash, of the transactions being processed on the missing node.
But while all these processes are running, the impatient client needs to be offered something.
While the nodes save each other... Failover.
Failover is the handling of a node-crash situation in the cluster.
It is time to mention one more layer of the cluster environment: the public network, through which clients access the database. Each physical server should preferably have at least two network cards:
- the first network card is assigned a static IP, through which the node exchanges messages with its cluster neighbors (interconnect);
- the second network card is assigned a logical Virtual IP (VIP), to which clients send their requests to the cluster node.
A Virtual IP (VIP) is a logical network address assigned to a node on the external network interface. It gives CRS the ability to quietly start, stop, and move the VIP to another node. A listener (the process accepting connections) on each node listens on its VIP. As soon as a node becomes unavailable, its VIP is picked up by another node in the cluster, which thus temporarily serves both its own requests and those of the failed node.
This is done to reduce the downtime of a client whose transaction was running on the node that dropped off: otherwise the client could wait minutes for a TCP timeout. Here, the VIP is picked up by another node immediately, and events can then unfold in one of two ways, using TAF (Transparent Application Failover) technology:
- Database VIP: the client reconnects through the VIP, but now reaches a different node. The temporarily substituting node answers "logon failed": the VIP is up, but the required database instance is absent on it. The client then immediately retries, against another cluster instance/node from the list in its configuration.
- Application VIP: the same as above, except that on this VIP the application itself can be reached, on whichever node it happens to be running.
Database VIPs can serve applications only on their own node and refuse if the application has migrated; an Application VIP keeps providing the node's functionality even after migration (a plus).
If the failed node recovers and comes back online, CRS recognizes this, asks the substituting node to take the VIP offline, and returns the address to its owner. Note that the VIP belongs to CRS, so it may not be moved at all if it is only the database instance that fails.
It is important to note that failover carries over only select queries, together with open cursors (those returning results). Transactions (PL/SQL, temp tables, insert, update, delete) are not carried over and will always have to be restarted.
There are two ways to configure TAF:
- Connect-time failover and client load-balancing.
The client picks the cluster node to connect to at random, from the list in its network connection configuration. If the node executing the request fails, the client selects another cluster node via TAF and reconnects.
- Preconnect.
The client connects to all cluster nodes at once when the connection is established, even though the request will run on only one instance. If that node fails, the client simply moves the request to one of the nodes it is already connected to. Failover is faster this way, but connection resources are consumed on every node of the cluster.
The client connection has retries and delay parameters: they configure how many times the client will (silently) try to reconnect if a node fails, and what delay to use between attempts. Perhaps most interestingly, when a node fails Oracle can notify the client about it through FAN (Fast Application Notification), which is part of ONS (Oracle Notification Services). If the client connects to Oracle through a thick driver, it can register a callback function before calling the database; the event will arrive in that callback in case of a TAF failover. This can be shown as a "slight hiccup" on the user's screen, or used to drive a manual restart of the request.
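On the server side, the same TAF behavior (including the retries and delay just mentioned) can be attached to a service. A sketch using DBMS_SERVICE; the service name 'oltp' is hypothetical, and in newer releases these attributes are usually managed through srvctl instead:

```sql
-- Hypothetical service 'oltp' configured for TAF with FAN notifications
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name        => 'oltp',
    aq_ha_notifications => TRUE,                              -- enable FAN events
    failover_method     => DBMS_SERVICE.FAILOVER_METHOD_BASIC,
    failover_type       => DBMS_SERVICE.FAILOVER_TYPE_SELECT, -- selects survive
    failover_retries    => 30,                                -- reconnect attempts
    failover_delay      => 5);                                -- seconds between attempts
END;
/

-- Check whether existing sessions actually failed over
SELECT inst_id, username, failover_type, failover_method, failed_over
FROM   gv$session
WHERE  username IS NOT NULL;
```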
Don't go there, go here ... Load-balancing.
While executing any operations, Oracle collects performance-related information (a kind of "debugging" data) into the AWR (Automatic Workload Repository), stored in the SYSAUX tablespace. Statistics collection runs every 60 minutes by default: I/O waits, wait events, CPU used per session, I/O rates on the data files (which file is accessed most often), and so on.
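The snapshot interval is adjustable. A sketch using the standard package; interval and retention are both in minutes:

```sql
-- Take AWR snapshots every 30 minutes and keep them for 14 days
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
    interval  => 30,
    retention => 14 * 24 * 60);
END;
/
```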
The need for load balancing (load distribution) across the nodes of a cluster is judged by a set of criteria: the number of physical connections to a node, the processor (CPU) load, the traffic. It is a pity one cannot load-balance by the average execution time of requests on the nodes; but as a rule that is related in some way to the resources busy on the nodes, and hence to the resources still free.
Client load-balancing was touched on a little above: it simply lets the client connect to a cluster node picked at random from the list in its configuration. To implement server-side load-balancing, a dedicated process, PMON (process monitor), collects information about the load on the cluster nodes. The frequency of these updates depends on how busy the cluster is and can range from roughly 1 minute to 10 minutes; based on this information, the listener on the node the client connects to redirects it to the least loaded node.
Oracle lets the DBA choose the most relevant criterion for load balancing:
- Based on elapsed time (CLB_GOAL_SHORT): by the average execution time of requests on a node;
- Based on number of sessions (CLB_GOAL_LONG): by the number of connections to a node.
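The chosen criterion is set per service. A sketch, again with the hypothetical 'oltp' service:

```sql
-- Connection load balancing goal: CLB_GOAL_SHORT suits short-lived
-- connections, CLB_GOAL_LONG suits long-lived ones (e.g. connection pools)
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name => 'oltp',
    clb_goal     => DBMS_SERVICE.CLB_GOAL_SHORT);
END;
/
```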
If the application uses a connection pool, Oracle offers the Runtime Connection Load Balancing (RCLB) option. Instead of the usual approach, where we try to predict which node will be less loaded and send the request there, a notification mechanism (events) informs the application about the load on the nodes, and the application itself decides where to send each request based on this data. Notification happens through ONS (Oracle Notification Service). RCLB regularly receives feedback from the cluster nodes, and the connection pool hands out connections to clients based on a relative number reflecting what share of connections each instance can take. These metrics (the average load of a node) each node builds for itself in the AWR; from them the required load advisory is formed and placed into an AQ (Advanced Queuing) queue, from which the data is sent to the client via ONS.
The notifications are based on one of two mechanisms:
- Focusing on the time delays of query execution (GOAL_SERVICE_TIME): the time delays in processing previous requests are taken into account.
- Focusing on server load (GOAL_THROUGHPUT): the CPU throughput is taken into account.
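Which of the two metrics feeds the advisory is, again, a per-service setting. A sketch continuing with the hypothetical 'oltp' service:

```sql
-- Feed the runtime load advisory with service-time measurements
-- (use DBMS_SERVICE.GOAL_THROUGHPUT for the CPU-throughput variant)
BEGIN
  DBMS_SERVICE.MODIFY_SERVICE(
    service_name        => 'oltp',
    goal                => DBMS_SERVICE.GOAL_SERVICE_TIME,
    aq_ha_notifications => TRUE);  -- events delivered via ONS/AQ
END;
/
```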
Beginning of the article