Oracle Real Application Clusters (RAC) databases form an increasing proportion of Oracle database systems. RAC was introduced in 2001 as part of Oracle 9i and was based on Oracle’s earlier Oracle Parallel Server architecture. RAC is almost unique as a mature implementation of a shared disk database clustering system: it is the basis for Oracle’s Exadata database machine, and allows for economical scalability and high availability.
All the usual principles of database performance tuning apply to RAC. However, the RAC architecture provides some additional challenges and opportunities.
The key principles in RAC performance management are:
- Maintaining high speed communications between the instances in the cluster.
- Eliminating unnecessary inter-instance traffic.
- Balancing workload between the instances in the cluster.
RAC architecture
RAC is a shared disk clustered databases: every instance in the cluster has equal access to the database’s data on disk. This is in contrast to the shared nothing architecture employed by other RDBMS clusters. In a shared nothing architecture, each instance is responsible for a certain subset of data. Whenever a session needs that data, then the appropriate instance must be involved in serving up the data.
Shared nothing database clusters have their advantages, but transparent and rapid scalability is not one of them. The shared nothing architecture requires that data be distributed across the instances in the cluster. When a new instance is added to the cluster, data needs to be redistributed across the cluster in order to maintain a reasonable balance of load. In contrast, a new instance can be added to a RAC cluster without any data rebalancing being required: the new instance has immediate and independent access to all the data in the database.
It is this ability to rapidly add or remove instances from RAC clusters that justifies the “G for Grid” in the Oracle 10g and 11g naming convention: RAC has the ability to rapidly scale database clusters by adding or removing instances: that capability is required (but perhaps not sufficient) to create a truly grid-enabled database.
The main challenge in the shared disk architecture is to establish a global memory cache across all the instances in the cluster: otherwise the clustered database becomes IO bound. Oracle establishes this shared cache via a high-speed private network referred to as the cluster interconnect.
All the instances in a RAC cluster share access to datafiles on shared disk, though each have private redo logs and undo segments. Each instance has its own SGA and background processes and each session that connects to the cluster database connects to a specific instance in the cluster.
Global Cache Requests
Understanding how instances in the cluster communicate is critical to understanding RAC performance. As with single-instance Oracle, we want to avoid disk IO whenever possible – primarily by keeping frequently-accessed data in memory. In a RAC configuration, the data we want might be in the memory of one of the other instances. Therefore, RAC will use the interconnect to request the required data from another instance that has it in memory, rather than by reading it from disk. Each request across the interconnect is referred to as a Global Cache (GC) request.
To co-ordinate these inter-instance block transfers, Oracle assigns each block to a “master” instance. This instance is essentially responsible for keeping track of which instance has last accessed a particular block of data.
Whenever an Oracle instance wants a block of data that is not in its buffer cache, it will ask the master instance for the block. If the master instance has the data concerned, it will send it back across the interconnect: this is recorded as a “2-way” wait and is referred to as a 2-way get.
If the master instance does not have the block in memory, but has a record of another instance accessing the block, then it will forward the block request to this third instance. The third instance will then return the block to the requesting instance: this is recorded as a “3-way” wait and is referred to as a 3-way get.
If no instance has the block in memory, then the master will advise the requesting instance to retrieve the block from disk: this is recorded as a “grant”.
Regardless of which instance wants the block, which instance has the block, and which instance is mastering the block, the number of instances involved in the transfer will never be more than three. This means that the performance penalty as additional instances are added is minimized. However, as we increase the number of instances the ratio of 3-way waits to 2-way waits will increase and some reduction in Global Cache performance should be expected.
Figure 2 illustrates the sequence of events in 2-way gets, 3-way gets and grants.
Figure 2 two-way gets, three-way gets and grants
The key background process in these scenarios is the LMS (Global Cache Service - previously known as the Lock Management Service; hence the abbreviation) process. One or more of these are initiated at startup, depending on the parameter gcs_server_processes.
Block requests can be either be made for the “current” copy of the block or for a “consistent read” copy. Consistent read blocks are required when performing query processing so that all blocks are consistent as at the start of the query or a read only transaction. Most query blocks will be consistent read blocks. Current blocks are more often associated with DML operations.
RAC tuning principles
The RAC architecture outlined in the previous section leads directly to the general principles of RAC performance. RAC will perform well, and scale well, if the following are true:
- The time taken to request a block across the interconnect (Global Cache requests) is much lower – say ten times less – than the time to retrieve a block from the disk. Global Cache requests are intended to avoid the necessity of a disk read, and sometimes the disk read must occur even after the Global Cache request. If the Global Cache request time is anywhere near the time it takes to read from disk, then the approach back-fires. Luckily, properly optimized Global Cache requests are quick – typically 10 times less than disk read time.
- The cluster is well balanced, or at least there are no overloaded instances in the cluster. Since so many RAC operations involve two or three instances, an overloaded instance might cause problems for its neighbors as well as itself. Indeed, an overloaded CPU on a remote instance is one of the most common causes for long global cache wait times on an otherwise idle local instance.
- The overhead incurred through cluster activities is a small proportion of the total database time. We want our RAC database to be a database first, and a cluster second. If the proportion of time spent performing Global Cache activities is high in proportion to other activities, then we may need to look at ways of reducing the Global Cache traffic.
Three key principles of RAC performance are:
- Global Cache lookups should be much quicker than disk reads
- Instances in the cluster should have a well balanced workload
- Time spent performing cluster related activities should not dominate active database time.
Single instance contention and RAC
Oracle makes a point of emphasizing that applications that don’t scale well on single instance Oracle will probably not scale well on RAC. While this is generally true, the switch from single instance to RAC tends to magnify some issues while alleviating others.
Performance issues that relate to contention for specific “hot” data blocks tend to be magnified in RAC, since these contentions now also take on a cross-instance overhead. For instance, buffer busy waits can occur for a buffer which is on another instance; because of the interconnect overhead, the average time spent waiting for a buffer busy wait to complete may increase. Specific types of contention thatincrease in a RAC database are:
- Buffer busy waits. The sort of operations that cause buffer busy in single-instance will also cause buffer busy in RAC, but the wait will be amplified by the need to transfer the block across the cluster once it is free.
- Cache buffer chains latch contention in a single instance database will probably reduce in RAC, since latches are not acquired across instances. However, for every relatively short latch free wait, you will probably experience a much longer global cache wait when transferring the block concerned across instances.
- Sequence number generation. If there are SQ enqueue waits then these will likely spread and magnify across instances in a RAC environment.
However, it’s also true that some single-instance contention issues can be reduced under RAC. RAC divides up the SGA and sessions across each instance in the cluster, effectively allowing some operations to experience an increase in concurrency. In particular, shared pool related latches and mutexes waits might reduce, since the activity will be distributed across the multiple shared pools in the cluster.
Contention for data and index blocks in a single instance database will probably magnify in a RAC environment. However, other contention points - library cache mutexes for instance - might be reduced.
Cluster overhead
Asides from anything else, we want to make sure that a RAC cluster is able to perform database activities without being impeded by cluster related overheads. In a healthy cluster, the time spent in cluster related activities is mainly determined by the average time to make a Global Cache request (Global Cache latency) multiplied by the number of Global Cache requests which must be made.
ClusterTime = AvgGCLatency x GCRequests
It therefore follows that reducing cluster overhead is mainly a process of minimizing the Global Cache latency and eliminating any unnecessary Global Cache requests. The importance of those optimizations will depend upon the relative time spent in cluster related activities.
We can see the overall contribution of cluster related waits in comparison to other high level time categories in the following query:
SQL> SELECT wait_class time_cat ,ROUND ( (time_secs), 2) time_secs,
2 ROUND ( (time_secs) * 100 / SUM (time_secs) OVER (), 2) pct
3 FROM (SELECT wait_class wait_class,
4 sum(time_waited_micro) / 1000000 time_secs
5 FROM gv$system_event
6 WHERE wait_class <> 'Idle'
7 AND time_waited > 0
8 GROUP BY wait_class
9 UNION
10 SELECT 'CPU',
11 ROUND ((SUM(VALUE) / 1000000), 2) time_secs
12 FROM gv$sys_time_model
13 WHERE stat_name IN ('background cpu time', 'DB CPU'))
14 ORDER BY time_secs DESC;
Time category TIME_SECS PCT
-------------------- ---------- ------
CPU 21554.33 43.45
Cluster 7838.82 15.80
Other 6322.23 12.75
Application 5077.09 10.24
System I/O 3387.06 6.83
User I/O 3302.49 6.66
Commit 1557 3.14
Concurrency 371.5 .75
Network 142.06 .29
Configuration 49.59 .10
As a rule of thumb, we might expect that cluster-related waits comprise less than 10% of total database time. Waits above 20% certainly warrant investigation.
Cluster waits times greater than 10-20% of total database time probably warrant investigation.
Although cluster waits will usually be comprised mainly of straightforward Global Cache request waits, it’s not uncommon for more “sinister” Global Cache waits to emerge: lost blocks, congestion, Global Cache buffer busy waits. Drilling down into the low level
wait events will often reveal these conditions. The following query breaks out the cluster wait times:
SQL> WITH system_event AS
2 (SELECT CASE
3 WHEN wait_class = 'Cluster' THEN event
4 ELSE wait_class
5 END wait_type, e.*
6 FROM gv$system_event e)
7 SELECT wait_type, ROUND(total_waits/1000,2) waits_1000 ,
8 ROUND(time_waited_micro/1000000/3600,2) time_waited_hours,
9 ROUND(time_waited_micro/1000/total_waits,2) avg_wait_ms ,
10 ROUND(time_waited_micro*100
11 /SUM(time_waited_micro) OVER(),2) pct_time
12 FROM (SELECT wait_type, SUM(total_waits) total_waits,
13 SUM(time_waited_micro) time_waited_micro
14 FROM system_event e
15 GROUP BY wait_type
16 UNION
17 SELECT 'CPU', NULL, SUM(VALUE)
18 FROM gv$sys_time_model
19 WHERE stat_name IN ('background cpu time', 'DB CPU'))
20 WHERE wait_type <> 'Idle'
21 ORDER BY time_waited_micro DESC;
Waits Time Avg Wait Pct of
Wait Type 1000 Hours Ms Time
----------------------------------- ----------- ---------- --------- ------
CPU 6.15 43.62
Other 38,291 1.76 .17 12.50
Application 32 1.41 157.35 10.00
User I/O 822 .97 4.25 6.88
System I/O 995 .96 3.46 6.78
gc current multi block request 9,709 .87 .32 6.15
gc cr multi block request 16,210 .48 .11 3.37
Commit 300 .44 5.31 3.13
gc current block 2-way 5,046 .37 .26 2.59
gc current block 3-way 2,294 .28 .43 1.97
gc cr block busy 984 .16 .58 1.11
Cluster waits summary
Here are descriptions for some of the more important Global Cache wait events:
gc cr/current block 2-way: These are waits for Global Cache block requests involving only 2 instances. As outlined at the beginning of this article, these occur when the block master instance is able to forward a block directly to the requesting instance
gc cr/current block 3-way: These waits occur when the block master does not have the block concerned, and forwards the request to a third instance.
gc cr/current multi block request: A wait that occurs when requesting multiple blocks in a single request. This is typically associated with full table or index scans.
gc cr/current grant 2-way: The block master informs the requesting instance that the requested block is not available from another instance. The requesting instance will then perform a disk IO to retrieve the block.
gc cr/current block busy: The requesting instance must wait for the instance that holds the block to complete some other operation before the block can be forwarded. This can happen in the same circumstances as for single instance buffer busy or because the requesting instance must flush redo records to the redo log before shipping a consistent copy.
gc cr/current block congested: This wait can be reported when CPU or memory pressure prevents the LMS process from keeping up with requests. Prior to Oracle 10.2, you could manually set LMS to run at a higher than default OS priority to alleviate this situation. From 10.2, LMS runs at a higher priority by default. Changing the number of LMS processes might mask the symptom; however preventing instances from overloading is a more effective overall solution.
gc cr/current block lost: Lost block waits occur when a block that has been transmitted is not received. If using of UDP – which is an “unreliable” protocol in the sense that a network operation does not require an acknowledgement – then some small number of lost blocks are to be expected. Moderate rates might suggest that the interconnect is overloaded. High rates probably indicate network hardware issues. We’ll look closer at lost blocks later in this article.
Reducing Global Cache latency
The RAC architecture requires and expects that instances will fetch data blocks across the interconnect as an alternative to reading those blocks from disk. The performance of RAC is therefore going to be very sensitive to the time it takes to retrieve a block from the Global Cache; which we will call Global Cache latency.
Some documents or presentations suggest that Global Cache latency is primarily or exclusively Interconnect latency: the time it takes to send the block across the interconnect network. Interconnect latency is certainly an important part of overall Global Cache latency: but it’s not the only part. Oracle processes such as the Global Cache Service (LMS) have to perform a significant amount of CPU intensive processing each time a block is transferred, and this CPU time is usually as least as significant as any other factor in overall Global Cache latency. In certain circumstances non-CPU operations – such flushing redo entries to disk – will also contribute to Global Cache latency.
Interconnect latency is an important factor in Global Cache latency – however Oracle CPU and IO are also important contributors.
To measure Global Cache latency, we use the wait interface as exposed by GV$SYSTEM_EVENT (The “V$” views report data for the current instance: “GV$” views report across the entire cluster.).
The following query reports on average times for each of the Global Cache request types as well as single-block read times (for comparison):
SQL> SELECT event, SUM(total_waits) total_waits,
2 ROUND(SUM(time_waited_micro) / 1000000, 2)
3 time_waited_secs,
4 ROUND(SUM(time_waited_micro)/1000 /
5 SUM(total_waits), 2) avg_ms
6 FROM gv$system_event
7 WHERE wait_class <> 'Idle'
8 AND( event LIKE 'gc%block%way'
9 OR event LIKE 'gc%multi%'
10 or event like 'gc%grant%'
11 OR event = 'db file sequential read')
12 GROUP BY event
13 HAVING SUM(total_waits) > 0
14 ORDER BY event;
Total Time Avg Wait
Wait event Waits (secs) (ms)
------------------------------ ------------ ------------ ---------
db file sequential read 283,192 1,978 6.99
gc cr block 2-way 356,193 396 1.11
gc cr block 3-way 162,158 214 1.32
gc cr grant 2-way 141,016 25 .18
gc cr multi block request 503,265 242 .48
gc current block 2-way 325,065 227 .70
gc current block 3-way 117,913 93 .79
gc current grant 2-way 45,580 20 .44
gc current grant busy 168,459 296 1.76
gc current multi block request 91,690 42 .46
This example output provides reason for concern. The average wait for Global Cache consistent read requests (as shown by ‘gc cr block 2-way’ and ‘gc cr block 3-way’) is more than 1 millisecond and more than 1/10th of the time for a db file sequential read. While the Global Cache is still faster than disk, it’s taking longer than we’d expect if the interconnect and RAC were fully optimized.
Examining the interconnect
When Global Cache waits are high, we should first determine if the latency is primarily the result of interconnect network waits.
The best way to determine the interconnect contribution to overall performance is to use the ping utility to measure latency independently of the Oracle stack. Ping packet handling is not identical to RAC packet handling, but if ping latency is high then you can confidently assume that network responsiveness is an issue.
In Oracle 10g the view X$KSXPIA shows the private and public IP addresses being used by the current instance. In Oracle 11g this information is available in the view Gv$cluster_interconnects. The following query shows us the private interconnect IP address plus other identifying information for the current instance (this query must be run as SYS):
SQL> SELECT instance_number, host_name, instance_name,
2 name_ksxpia network_interface, ip_ksxpia private_ip
3 FROM x$ksxpia
4 CROSS JOIN
5 v$instance
6 WHERE pub_ksxpia = 'N';
Inst Host Net Private
# Name INSTANCE_NAME IFace IP
---- ------------------------- ---------------- ----- ------------
3 melclul32.melquest.dev.me MELRAC3 eth1 192.168.0.12
l.au.qsft
We can then ping the IP address from another node in the cluster to determine average latency. On a Linux system, we can use the “–s 8192” flag to set an 8K packet size so as to align with the block size of this Oracle database. On Windows the appropriate
flag is “-l”:
$ ping -c 5 -s 8192 192.168.0.12
PING 192.168.0.12 (192.168.0.12) 8192(8220) bytes of data.
8200 bytes from 192.168.0.12: icmp_seq=0 ttl=64 time=0.251 ms
8200 bytes from 192.168.0.12: icmp_seq=1 ttl=64 time=0.263 ms
8200 bytes from 192.168.0.12: icmp_seq=2 ttl=64 time=0.260 ms
8200 bytes from 192.168.0.12: icmp_seq=3 ttl=64 time=0.265 ms
8200 bytes from 192.168.0.12: icmp_seq=4 ttl=64 time=0.260 ms
--- 192.168.0.12 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.251/0.259/0.265/0.020 ms, pipe 2
The ping output above indicates low latency – about .25 ms - across the interconnect.
Use the ping utility to measure the interconnect latency independently of the Oracle software stack.
Dell Software’s Spotlight on RAC presents both ping latencies and Global Cache latencies for each instance side by side, as shown in Figure 3[3].
Figure 3 Global Cache and ping latencies in Quest's Spotlight on RAC
Very high network latencies may indicate the need to tune the interconnect as outlined below. However, probably the number one “newbie” error is to inadvertently configure RAC to use the public LAN network rather than the private interconnect network. Before tuning the private interconnect, make absolutely sure that you didn’t accidentally configure RAC to use the public LAN as the interconnect network. Use the X$KSXPIA or V$CLUSTER_INTERCONNECTS view to double check the IP addresses are those associated with the interface that is connected to the switch that implements the private network.
It’s a common mistake to configure RAC to use the public network – rather than the private network – for the interconnect. Before tuning the interconnect make absolutely sure that the interconnect is private and that RAC is configured to use the correct network.
Signs of interconnect problems
Asides from high latencies – as exposed by the ping command - interconnect issues can show up as “lost” or congested blocks.
Lost blocks occur when a block is transmitted, but never received. The following query shows the number of blocks lost compared to the number sent and received:
SQL> SELECT name, SUM(VALUE)
2 FROM gv$sysstat
3 WHERE name LIKE 'gc%lost'
4 OR name LIKE 'gc%received'
5 OR name LIKE 'gc%served'
6 GROUP BY name
7 ORDER BY name;
NAME SUM(VALUE)
---------------------------------------------------------------- ----------
gc blocks lost 99
gc claim blocks lost 0
gc cr blocks received 14207701
gc cr blocks served 14207721
gc current blocks received 14471301
gc current blocks served 14471393
Time spent waiting for lost block retransmission is recorded in the wait events ‘gc cr request retry’, ‘gc cr block lost’ and ‘gc current block lost’. The times associated with these waits should be low: typically less than 1% of total when compared to the total number of blocks recorded in the ‘gc cr/current blocks received/served’ statistics.
If there are very high lost block counts (relative to blocks received) – or if the time associated with lost blocks becomes significant compared to total database time – then the most likely cause is a network hardware issue. This can be as simple as a poorly mounted network card, crimped networking cables or faulty network components.
Moderate lost block counts – especially if associated with very high levels of activity – might indicate an overloaded interconnect. The network optimizations below might alleviate the problem, or you may need to increase the throughput of the interconnect hardware (upgrading to Gigabit Ethernet, 10 Gigabit Ethernet or Infiniband for instance).
Global Cache “lost blocks” can be indicative of an overloaded or miss-configured interconnect or (at high levels) faulty network hardware.
Optimizing the interconnect
If the interconnect is identified as a problem, or even if we just want to optimize it to squeeze the Global Cache latencies down as far as possible, we have a few networking options we can try.
Network hardware and protocols
It’s possible to use dual Network Interconnect Cards (NICs) to reduce points of failure in the overall RAC architecture. If so, you should use NIC “bonding” (also known as link aggregation) to present the two NICs to Oracle as single logical interface. This will allow for the aggregate network bandwidth of both cards to be fully utilized.
The two most commonly used link and transport protocol combinations for the RAC interconnect are:
- Gigabit Ethernet (GBe) or 10 Gigabit Ethernet (10GBe) in combination with UDP.
- InfiniBand in combination with either Reliable Datagram Sockets (RDS) or Internet Protocol (IP).
The GBe/UDP option has the advantage of using standards-based commodity hardware, and being supported across a wide set of hardware and operating systems. InfiniBand offers superior throughput and latency, but at greater cost and administration effort. Note that Oracle uses InfiniBand/RDS inside Exadata both to connect the RAC instances and to attach the database nodes to the storage nodes: it’s clearly the highest performance solution.
However Gigabit Ethernet is able to sustain very high bandwidth – somewhere in the vicinity of 5,000-10,000 Global Cache transfers per second. Most RAC databases – especially those with an OLTP style workload - are unlikely to overload a GBe or 10GBe interconnect.
Many RAC databases – especially OLTP style - will be adequately served by a Gigabit Ethernet or 10 Gigabit Ethernet based interconnect. However, InfiniBand offers superior throughput and scalability.
Ethernet Jumbo frames
By default, the maximum sized packet that can be transmitted across an Ethernet network is only 1500 bytes. This is determined by the Maximum Transmission Unit (MTU). Because Oracle’s default block size is 8K, most Global Cache transfers will have to be made in multiple Ethernet packets. This multiple packets will increase network load and possibly overload the interconnect. If any one of the packets that represent a block is lost, then Oracle will have to re-transmit the lot.
Jumbo frames allows you to set a MTU of up to 9000 – large enough for a single default size Oracle block to transmit in a single packet. However, to increase the MTU you may need to apply changes at the switch, the network card and the Operating System level. Setting jumbo frames at the operating system level, but failing to do so at the switch level is a common mistake. Note too, that not all switches support jumbo frames.
To set jumbo frames at the host level, you use a utility such as ifconfig. On Linux, you can use the ifconfig command to check the MTU size:
#ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:0D:56:18:2C:60
inet addr:10.20.12.154 Bcast:10.20.12.255 Mask:255.255.255.0
inet6 addr: fe80::20d:56ff:fe18:2c60/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:951407 errors:0 dropped:0 overruns:0 frame:0
TX packets:544283 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:119973568 (114.4 MiB) TX bytes:124883921 (119.0 MiB)
Base address:0xdf40 Memory:feae0000-feb00000
The MTU is at the default of 1500. To set it – at the Operating system level – to 9000 we again use the ifconfig command:
# ifconfig eth0 mtu 9000
# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:0D:56:18:2C:60
inet addr:10.20.12.154 Bcast:10.20.12.255 Mask:255.255.255.0
inet6 addr: fe80::20d:56ff:fe18:2c60/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:951649 errors:0 dropped:0 overruns:0 frame:0
TX packets:544386 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:119996835 (114.4 MiB) TX bytes:124902029 (119.1 MiB)
Base address:0xdf40 Memory:feae0000-feb00000
Setting jumbo frames at the OS level will achieve nothing if the NICs or switches are not configured to support an MTU of 9000 – you may need to consult with your MIS department or consult the hardware documentation. You also need to make sure that every
host in the cluster has the same MTU setting.
When enabling Jumbo Frames, be certain that you enable it on all hosts in the cluster and also at the hardware level (switches and NICs).
Each OS supports a different mechanism for enabling Jumbo Frames. On Solaris, you need to edit the interface configuration file (/kernel/drv/e1000g.conf for instance).
See Oracle support note Recommendation for the Real Application Cluster Interconnect and Jumbo Frames (341788.1) for more information.
If your interconnect is configured using Ethernet (GBe or 10GBe) then enabling Jumbo Frames will reduce the packet rate, increasing reliability and throughput.
UDP buffer size
When a RAC host receives a network packet over the interconnect, the packet is held in a small memory buffer until Operating System gets around to processing the buffer. If the system is particularly busy and/or the interconnect is overloaded, then packets might be lost. Increasing the UDP receive buffer size can help.
The OS default value for the UDP receive buffer is generally in the vicinity of 128-256K which can be inadequate for the bursts of load that can occur across the interconnect. If the buffer is too small then various network errors may be reported by utilities such as netstat or ifconfig. These symptoms will include dropped packages, overflows, fragmentation or reassembly errors.
Oracle installation pre-requisites require that the value be increased, typically to about 4M.
In Linux, the kernel parameter net.core.rmem_max controls the receive buffer size. The sysctl command can be used to obtain the current value:
# sysctl -n net.core.rmem_max
4194304
Ensure that your UDP receive buffer size is set above the default value – probably to the OS maximum.
LMS waits
Interconnect performance is at the heart of Global Cache latency, but high Global Cache latencies are as often the result of delays in the Oracle software layers. The LMS service on the remote instances contributes most of the non-network latency to Global Cache requests; it is responsible for constructing and returning the requested blocks. The following query shows LMS latencies for each instance for current and consistent read requests:
SQL> WITH sysstats AS (
2 SELECT instance_name,
3 SUM(CASE WHEN name LIKE 'gc cr%time'
4 THEN VALUE END) cr_time,
5 SUM(CASE WHEN name LIKE 'gc current%time'
6 THEN VALUE END) current_time,
7 SUM(CASE WHEN name LIKE 'gc current blocks served'
8 THEN VALUE END) current_blocks_served,
9 SUM(CASE WHEN name LIKE 'gc cr blocks served'
10 THEN VALUE END) cr_blocks_served
11 FROM gv$sysstat JOIN gv$instance
12 USING (inst_id)
13 WHERE name IN
14 ('gc cr block build time',
15 'gc cr block flush time',
16 'gc cr block send time',
17 'gc current block pin time',
18 'gc current block flush time',
19 'gc current block send time',
20 'gc cr blocks served',
21 'gc current blocks served')
22 GROUP BY instance_name)
23 SELECT instance_name , current_blocks_served,
24 ROUND(current_time*10/current_blocks_served,2) avg_current_ms,
25 cr_blocks_served,
26 ROUND(cr_time*10/cr_blocks_served,2) avg_cr_ms
27 FROM sysstats;
Current Blks Avg CR Blks Avg
Instance Served CU ms Served Cr ms
------------ ------------ ------ ------------ ------
MELRAC1 7,342,829 .03 7,647,581 .05
MELRAC2 7,330,661 .03 7,418,901 .04
MELRAC3 7,310,866 .03 12,696,127 .08
If the network is responsive and fast, but LMS latency is high, then one of the following might be implicated:
- An overloaded instance is unable to respond fast enough to Global Cache requests. In particular, the LMS processes might be overloaded with requests or starved for CPU.
- IO bottlenecks – particularly in redo log IO – are slowing down the response to Global Cache requests.
In the first case, the LMS process on the remote instance is simply too busy to process the Global Cache request. This can be due to an excessive volume of requests or because CPU load on the host is making it impossible for the LMS to obtain CPU. The later situation is less common from Oracle 10.2 onwards, because Oracle now runs the LMS processes at an elevated priority. Severe memory pressure may also lead to a lack of LMS responsiveness.
The “too busy” phenomenon is probably a result of an imbalanced cluster: if any instance in the cluster is significantly overloaded then Global Cache response times on the idle instances will suffer. The best solution is to try and achieve a better cluster balance – see the section on Cluster Balance below.
High Global Cache latencies can occur when one or more instances in the cluster become overloaded. Balancing the workload across the cluster is indicated.
The other typical cause of high latencies is when the LMS process must flush uncommitted changes to the redo log prior to sending the block to the requesting instance. If the application design is such that uncommitted blocks are often in demand across instances
in the cluster then these redo log flushes might become common. If there are bottlenecks in the redo log devices then the IO waits will be magnified.
We can measure the impact on LMS response time by leveraging the timing information in GV$SYSTAT and the FLUSHES statistic in GV$CR_BLOCK_SERVER. Putting the two together we can calculate the proportion of blocks that required flushing and the proportion of LMS time spent performing the flush:
SQL> WITH sysstat AS (
2 SELECT SUM(CASE WHEN name LIKE '%time'
3 THEN VALUE END) total_time,
4 SUM(CASE WHEN name LIKE '%flush time'
5 THEN VALUE END) flush_time,
6 SUM(CASE WHEN name LIKE '%served'
7 THEN VALUE END) blocks_served
8 FROM gv$sysstat
9 WHERE name IN
10 ('gc cr block build time',
11 'gc cr block flush time',
12 'gc cr block send time',
13 'gc current block pin time',
14 'gc current block flush time',
15 'gc current block send time',
16 'gc cr blocks served',
17 'gc current blocks served')),
18 cr_block_server as (
19 SELECT SUM(flushes) flushes, SUM(data_requests) data_requests
20 FROM gv$cr_block_server )
21 SELECT ROUND(flushes*100/blocks_served,2) pct_blocks_flushed,
22 ROUND(flush_time*100/total_time,2) pct_lms_flush_time
23 FROM sysstat CROSS JOIN cr_block_server;
PCT_BLOCKS_FLUSHED PCT_LMS_FLUSH_TIME
------------------ ------------------
.25 36.03
Note how even a very small proportion of block flushes can still account for a large proportion of total LMS time. Tuning the redo log IO layout might be indicated.
The LMS will sometimes need to flush redo entries to disk before returning a block. Redo log IO can therefore be a critical factor in Global Cache latency.
Cluster balance
Achieving balance in a RAC configuration is important for scalability, manageability and performance. While some variation in workload across the cluster is to be expected, in an unbalanced cluster the following undesirable situations can arise:
- Sessions on busy instances get poor service time. Even though there may be spare capacity in the cluster as a whole, sessions on busy instances will be unable to utilize that capacity and will experience poor performance.
- Sessions on idle instances wait for blocks from busy instances. Because a lot of operations result in requests to remote instances, an over-loaded instance can cause performance problems across the entire cluster. A session on an “idle” instance may experience high Global Cache wait times waiting on blocks from the busy instance.
- Benefits of adding new instances may not be realized. If some of the instances in the cluster are subject to a higher workload, then these instances may become bottlenecks to overall database throughput. As instances are added to the cluster, expected performance improvements might be unattainable.
- Tuning is harder because each instance has different symptoms. In an unbalanced cluster, sessions on busy instances may experience high CPU waits while sessions on less busy instances will experience high Global Cache waits. Troubleshooting performance problems on an unbalanced cluster can therefore be more challenging because of the inconsistent symptoms.
We can assess cluster balance fairly easily: the following query reports on CPU, DB time and logical reads on each instance within the cluster since startup:
SQL> WITH sys_time AS (
2 SELECT inst_id, SUM(CASE stat_name WHEN 'DB time'
3 THEN VALUE END) db_time,
4 SUM(CASE WHEN stat_name IN ('DB CPU', 'background cpu time')
5 THEN VALUE END) cpu_time
6 FROM gv$sys_time_model
7 GROUP BY inst_id )
8 SELECT instance_name,
9 ROUND(db_time/1000000,2) db_time_secs,
10 ROUND(db_time*100/SUM(db_time) over(),2) db_time_pct,
11 ROUND(cpu_time/1000000,2) cpu_time_secs,
12 ROUND(cpu_time*100/SUM(cpu_time) over(),2) cpu_time_pct
13 FROM sys_time
14 JOIN gv$instance USING (inst_id);
Instance DB Time Pct of CPU Time Pct of
Name (secs) DB Time (secs) CPU Time
-------- ------------- ------- ------------- --------
MELRAC3 3,705.30 24.48 1,119.99 17.03
MELRAC2 6,278.23 41.48 4,010.85 61.00
MELRAC1 5,150.96 34.03 1,444.06 21.96
In this example it is clear that MELRAC2 is being subjected to a disproportionate level of CPU load: if this is not addressed, increasing cluster workload will almost certainly lead to performance degradation as MELRAC2 becomes the bottleneck for the entire cluster.
The above query summarizes performance since the instances in the cluster were started. Of course, instances in a cluster can start and stop independent of the cluster as a whole, which might result in different totals even if the cluster is experiencing a balanced workload.
Quest Software’s Spotlight on RAC probably has the most advanced RAC balance monitoring. Spotlight on RAC displays cluster balance from a number of perspectives and performs a statistical analysis to determine if the imbalance is systematic or due to short term random fluctuations. Figure 4 shows example output from Spotlight on RAC.
Figure 4 RAC balance display in Quest's Spotlight on RAC
An imbalance in RAC load can be due to a single session – or just a few sessions – placing heavy load on specific instances. These imbalances might be unavoidable although parallelizing operations across the cluster might be an option. Other possible causes include:
- Sessions directly connecting to individual instances in the cluster. This might happen if the TNSNAMES.ORA file contains entries for individual instances as well as cluster entries.
- Out of date TNSNAMES.ORA files on clients or on the servers which are causing RAC load balancing to fail: see the next section for more details.
- Unbalanced services configuration resulting in specific services placing excessive load on a subset of the instances in the cluster. We’ll also be discussing services in detail in the next section.
Balancing load across the instances in the cluster is essential for optimal RAC performance. In particular, ensure that no single instance becomes overloaded. An overloaded instance can lead to high Global Cache waits, inconsistent service times and poor scalability.
Cluster balance and services
Services allow you to allocate workloads to specific instances within a cluster. Instances within the cluster can be associated with specific services and connection requests for those services will then be directed to those instances. Each service will have one or more preferred instances and optionally one or more available instances. Requests for a service will be directed to preferred instances initially; if none of the preferred instances are available then the request will be sent to an available instance.
After installation, each RAC database will have a few default services: one for each instance, one for the cluster as a whole, and special or internal services (SYS$BACKGROUND, SYS$USERS, the XDB service, etc).
Additional services can be configured and managed by the DBMS_SERVICES package, the srvctl command line utility, through the Database Configuration Assistant (DBCA) in Oracle 10g or through Enterprise Manager in Oracle 11g or 12g. In Figure 5 we add a service (BATCHJOBS) that runs on instance rac11ga1 by default, can be run on instance rac11ga2 if rac11ga1 is not available but will not run on instance rac11ga3.
Figure 5 Configuring Services using DBCA
Users request to connect to a specific service when connecting. For instance, the TNSNAMES alias below requests a connection to the BATCHJOBS service; if we arrange for batch jobs to specify the BATCHJOBS TNS alias we can be assured that they will connect to the appropriate service and hence be allocated to the desired instances:
BATCHJOBS=
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = host1-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = host2-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = host3-vip)(PORT = 1521))
(LOAD_BALANCE = YES)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = batchjobs)
)
)
Services serve two main purposes in RAC:
- By partitioning certain types of workload to certain instances, we can reduce the amount of Global Cache traffic, since similar workloads are most likely to utilize similar data blocks.
- Services can help you share a RAC cluster across multiple applications, some of which may have different service level objectives. By allocating more instances in the cluster to a specific service, we effectively allocate the service a bigger share of cluster resources.
When we make use of services, we are deliberately manipulating the workloads on each instance, so we’re unlikely to see the perfect balance that we might aspire to when every instance in the cluster is subject to the same workload. However, avoiding an overloaded instance within the cluster is just as important when using services.
We can view service workload through the GV$SERVICE_STATS view, which shows various workload statistics for each service. This query breaks down service CPU across the cluster, showing the percentage of total CPU is consumed by the service on the instance, and how the service workload is distributed across the cluster:
SQL> BREAK ON instance_name skip 1
SQL> COMPUTE SUM OF cpu_time ON instance_name
SQL>
SQL> WITH service_cpu AS (SELECT instance_name, service_name,
2 round(SUM(VALUE)/1000000,2) cpu_time
3 FROM gv$service_stats
4 JOIN
5 gv$instance
6 USING (inst_id)
7 WHERE stat_name IN ('DB CPU', 'background cpu time')
8 GROUP BY instance_name, service_name )
9 SELECT instance_name, service_name, cpu_time,
10 ROUND(cpu_time * 100 / SUM(cpu_time)
11 OVER (PARTITION BY instance_name), 2) pct_instance,
12 ROUND( cpu_time
13 * 100
14 / SUM(cpu_time) OVER (PARTITION BY service_name), 2)
15 pct_service
16 FROM service_cpu
17 WHERE cpu_time > 0
18 ORDER BY instance_name, service_name;
Instance Service Cpu Pct Of Pct of
Name Name secs Instance Service
-------- --------------- ----------- -------- -------
MELRAC1 MELRACDB 4,292 51.50 34.70
MEL_SRV3 619 7.42 100.00
MEL_SRV4 621 7.45 53.29
SYS$BACKGROUND 0 .00 42.86
SYS$USERS 2,802 33.62 1.32
******** -----------
sum 8,334
MELRAC2 MELRACDB 3,834 1.78 31.00
MEL_SRV1 990 .46 100.00
MEL_SRV4 544 .25 46.71
MEL_SRV5 1,228 .57 100.00
SYS$BACKGROUND 0 .00 28.57
SYS$USERS 208,954 96.94 98.32
******** -----------
sum 215,551
MELRAC3 MELRACDB 4,241 71.87 34.29
MEL_SRV2 385 6.52 100.00
MEL_SRV6 507 8.60 100.00
SYS$BACKGROUND 0 .00 28.57
SYS$USERS 768 13.01 .36
******** -----------
sum 5,901
Other views – such as GV$SERVICE_EVENT – provide additional performance data aggregated to the instance and service level.
Dell’s Spotlight on RAC provides fairly extensive real time services monitoring capabilities: it can show how services are utilizing various resources across instances and how the load on each instance is driven by the various services. It also allows you to view wait time and session information for specific services.
Figure 6 Spotlight on RAC services monitoring
Getting acceptable cluster balance and avoiding overloaded instances when services are deployed requires fairly careful monitoring and adjustment of service configuration. But in general, you achieve better balance by making instances that are under-utilized available to more services and moving services off over-utilized instances.
When using cluster services to distribute workload, ensure that no instances are overloaded, while also avoiding under-utilization of instances.
RAC Load Balancing facilities
By default, Oracle uses client-side round-robin load-balancing: each request from a client connects to a different instance in the cluster and the overall result is that each instance receives approximately the same number of connections.
The client side load balancing is enabled within the TNSNAMES definition for the cluster database service. Unless the LOAD_BALANCE clause is set to OFF, round robin load balancing will be enabled across all the hosts listed in the TNS entry. So for instance the following TNSNAMES definition will result in connections requests being alternated between the hosts host1-vip, host2-vip and host3-vip:
racdb =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = host1-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = host2-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = host3-vip)(PORT = 1521))
(LOAD_BALANCE = YES)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = racdb)
)
)
For client-side load balancing to work properly all TNSNAMES definitions should be kept up to date. For instance, if you add a new instance to the cluster but fail to update TNSNAMES.ORA files on the clients, then the new instance might not pick up its share of the load (although server side load balancing will probably save the day).
To keep client-side load balancing current, it is advisable to keep all TNSNAMES entries up to date when instances are added or removed from a cluster.
If you fail to update the TNSNAMES files on all the clients, load balancing should still occur providing that server-side load balancing is enabled. Server side load balancing works at the listener level. Listeners on each node will direct connection requests
to the instance or node which is currently least heavily loaded.
Server side load balancing requires that the REMOTE_LISTENER parameter in each instance be set to the name of a TNSNAMES entry that defines the listeners on all of the nodes in the cluster. When adding or removing an instance, you should update the TNSNAMES entries on all hosts in the cluster to ensure that the remote listener definitions are up to date.
Server side load balancing requires that the LOCAL_LISTENER parameter point to a TNSNAMES entry that defines all of the listeners in the database cluster.
Advanced load balancing capabilities are available within the Oracle client libraries – such as JDBC and ODP.NET. Applications that take advantage of these interfaces can obtain better load balancing – for instance load balancing at the transaction level rather
than at the connection level. Oracle’s Fast Application Notification (FAN) and Load Balancing Advisories can be used to optimize client workload to match service goals (throughput or response time) and connection duration (long lived or short lived session
durations). See the Oracle RAC Administration and Deployment guide for more details.
Minimizing Global Cache traffic
As we saw earlier, Global Cache requests are integral to RAC and represent both the “cost” of the RAC architecture and the basis of its scalability. Avoiding a disk read by fetching a needed block from another instance prevents RAC databases from becoming IO bound. However, each Global Cache request adds overhead: it’s far better to find the data you want in the local buffer cache than to retrieve it from another instance.
Very high Global Cache request rates generally result in a poorly performing RAC cluster. The overhead of the Global Cache activity probably means that response time for the RAC database will be higher than for its single-instance equivalent and may suggest that scalability problems will emerge as more instances are added.
High Global Cache request rates are generally the result of application processing patterns:
- “Hot” rows or blocks that are needed by every instance and/or every transaction. For instance a table-based sequence number generator may contain a single row which has the next order sequence number. This block will be required every time an order is created on any instance and will therefore be constantly shipped across the cluster.
- Other “hot” blocks. These are the same sorts of blocks that might have caused cache buffers chains latch contention in a single instance database. The cures for cache buffers chains latch contention (reducing rows per block or partitioning) may also be effective in the RAC environment.
- Full table scans that retrieve large numbers of blocks across the interconnect.
To determine how often the database needs to make Global Cache requests, we can compare the number of blocks fetched across the interconnect with the total number of block accessed (e.g., the number of logical reads). The following query performs that calculation as well as determining the ratio of physical to logical reads (yes, the notorious Buffer Cache Hit Ratio):
SQL> WITH sysstats AS (
2 SELECT inst_id,
3 SUM(CASE WHEN name LIKE 'gc%received'
4 THEN VALUE END) gc_blocks_recieved,
5 SUM(CASE WHEN name = 'session logical reads'
6 THEN VALUE END) logical_reads,
7 SUM(CASE WHEN name = 'physical reads'
8 THEN VALUE END) physical_reads
9 FROM gv$sysstat
10 GROUP BY inst_id)
11 SELECT instance_name, logical_reads, gc_blocks_recieved, physical_reads,
12 ROUND(physical_reads*100/logical_reads,2) phys_to_logical_pct,
13 ROUND(gc_blocks_recieved*100/logical_reads,2) gc_to_logical_pct
14 FROM sysstats JOIN gv$instance
15 USING (inst_id);
Instance Logical GC Blocks Physical Phys/Logical GC/Logical
name Reads Received Reads Pct Pct
---------- ------------ ------------ ------------ ------------ ----------
MELRAC3 15,353,311 1,730,818 23,099 .15 11.27
MELRAC2 148,903,331 1,756,882 438,531 .29 1.18
MELRAC1 21,792,614 1,730,366 39,471 .18 7.94
Note how in the above example it’s the least busy instances (in terms of logical reads) that have the highest Global Cache/Logical request ratio: the less busy an instance is, the more likely that the blocks it needs are in the memory of another, more busy, instance.
As is typical in general with “hit” or “miss” ratios, one should be cautious about forming too many conclusions, since the ratio will vary widely depending on workload patterns. However, the higher the percentage of blocks that result in a Global Cache request, the more likely it is that we may achieve performance improvements by reducing either Global Cache latency or Global Cache traffic.
To determine which segments are resulting in the highest rates of Global Cache activity, we can exploit the GV$SEGMENT_STATISTICS view. The following query lists the segments that are associated the highest number of Global Cache blocks received:
SQL> WITH segment_misses AS
2 (SELECT owner || '.' || object_name segment_name,
3 SUM(VALUE) gc_blocks_received,
4 ROUND( SUM(VALUE)* 100
5 / SUM(SUM(VALUE)) OVER (), 2) pct
6 FROM gv$segment_statistics
7 WHERE statistic_name LIKE 'gc%received' AND VALUE > 0
8 GROUP BY owner || '.' || object_name)
9 SELECT segment_name,gc_blocks_received,pct
10 FROM segment_misses
11 WHERE pct > 1
12 ORDER BY pct DESC;
SEGMENT_NAME GC_BLOCKS_RECEIVED PCT
---------------------------------------- ------------------ ------
TRANSIM.GH_SALES_TOTALS 1,641,395 79.31
TRANSIM.G_CART_PK 104,014 5.03
TRANSIM.G_CART 86,185 4.16
SYS.ALERT_QT 39,476 1.91
SYS.SYS_IOT_TOP_8797 22,895 1.11
The “hot row” scenario – for instance an order number generated from a sequence table – is hard to correct without application changes. If many transactions frequently access a specific row, then that row is inevitably going to circulate at high rates throughout the cluster. If this overhead is unacceptable, then you will probably want to isolate the workloads concerned to specific instances in the cluster – perhaps by using services. Alternatively, you could avoid these single-row contention points by changing your application design (perhaps using an Oracle sequence generator rather than a sequence table).
For instance, lets say we are trying to reduce the interconnect load caused by the GH_SALES_TOTALS table from the example output above. We could create a service that is associated with a single instance of the cluster and then allocate all sessions that access GH_SALES_TOTALS to that service. The result would be that all SQLs that access GH_SALES_TOTALS would execute on a specific instance and the interconnect activity would be eliminated.
In the more general case where blocks or specific segments are associated with heavy Global Cache activity, we can attempt to reduce the amount of inter-instance traffic through one of the following techniques:
- Isolating workloads to a particular instance or groups of instances. We can do this through services configuration as discussed earlier. If transactions and queries that access specific segments or blocks are isolated to a smaller number of instances, then the amount of Global Cache traffic will be reduced. However this does make balancing the cluster more difficult.
- Isolating sessions that are likely to work on the same data. This is similar to isolating workloads, but instead of isolating specific transaction types, we isolate sessions that are likely to work on the same sets of data. For instance, we might allocate sessions from the east coast to one set of instances, and west coast to another. Again, this raises concerns about balance: in the East coast/West coast scenarios we imbalance the cluster because of the differences in the number of users in each region and the time zones that correspond to peak processing. Nevertheless, this sort of partitioning will result in a reduction in Global Cache traffic and - if we are careful with our services configuration - we can still achieve an acceptable distribution of work across the cluster.
- Partitioning the segments with the highest levels of Global Cache activity. Hash partitioning can split up the hot blocks, hopefully reducing Global Cache contention for those blocks.
- Range or list partitioning the segments in conjunction with isolation of user populations can also be considered. For instance as well as creating services to support geographical regions (East Coast/West Coast for instance), we could also consider partitioning key tables and indexes on the same geographic basis. Partitions specific to a particular region will naturally tend to become cached (and mastered) on the instances to which those users connect, resulting in fewer Global Cache requests.
- Reverse key indexes can help relieve hot Global Cache index leaf and branch blocks. When a segment has an ascending numeric primary key, all new rows require an update to the leading edge of the index. Freelist management will distribute the new table rows across the cluster, but the leading leaf and branch blocks of the index will be in contention by all instances in the cluster. Reverse key indexes distribute new entries across all the leaf blocks, relieving the contention. However, be aware that reverse key indexes cannot be used for index range scans (though you don’t often need to perform range scand on primary keys).
- If you have legacy tablespaces that don’t use Automatic Segment Storage Management (ASSM), then you must make sure that you use multiple freelist groups on tables that have non-trivial insert rates. Otherwise all new rows will be directed to the same new blocks, creating Global Cache contention for that block.
Very high Global Cache request rates can limit RAC performance and scalability. Global Cache activity can be reduced by adjusting application design, isolating workloads or sessions using services, reverse key primary indexes and by partitioning segments.
Inter-instance parallel query can improve parallel SQL performance but can also increase Global Cache traffic. Restricting parallel execution to a single instance or smaller set of instances will often be indicated. The Oracle 10g INSTANCE_GROUP parameter or
the Oracle 11g PARALLEL_INSTANCE_GROUP parameter can control the instances that are involved in a parallel SQL.
Summary
Most of the performance optimization techniques applicable to single instance databases are just as applicable to RAC databases. In particular, before moving to a RAC architecture make sure you have eliminated any “hot block” forms of contention such as buffer busy or cache buffers chains latch contention as these tend to magnify in RAC.
The most significant difference in a RAC database is the use of Global Cache requests to fetch blocks from other instances in the cluster rather than to read them from disk. RAC will scale and perform well, providing that:
- Global Cache latency is much less than disk read latency. Achieving this involves both optimizing the interconnect network, and making sure that no instances get too busy to respond to Global Cache requests in a timely manner.
- The rate of Global Cache requests is reasonable. In particular, “hot” blocks that are in constant contention across the cluster should be minimized.
- The cluster is reasonably well balanced. In particular, no instance should be overloaded: an overloaded instance is likely to cause performance problems both for itself, and other instances in the cluster.