New Generation Oracle RAC Performance
- 1. New Generation Oracle RAC 19c
Diagnosing Oracle RAC Performance Issues
Copyright © 2020, Oracle and/or its affiliates
Anil Nair
Sr Principal Product Manager,
Oracle Real Application Clusters (RAC) & ASM
@RACMasterPM
http://www.linkedin.com/in/anil-nair-01960b6
http://www.slideshare.net/AnilNair27/
- 2. Joining me today is the Oracle RAC Performance Engineering Team
Elite engineers with the charter to
• Instrument and measure key code areas,
including CPU and memory usage, across
releases and patches
• Provide ideas to improve scalability and
reduce reconfiguration time for High
Availability
• Using new OS features
• Threads vs Processes
• Poll vs Event
• Both on generic systems and Engineered
Systems like Exadata
• Improve Diagnosability by adding additional
statistics to AWR
- 3. How do I ask
Questions?
Please use the Q & A option in Zoom for
questions
- 4. New Generation Oracle RAC Scalability and HA
New Apps based on Converged DB’s using IoT, Kafka, Blockchain & Traditional packaged Apps
[Diagram: instances connected over redundant private networks with failover, serving both New Apps and Traditional Apps]
- 5. Oracle RAC provides HA and Scalability
Available, active-active instances scale write, read, and hybrid workloads
[Chart: SAP SD workload — users supported vs. number of cores, for 2- to 5-node clusters]
Cores:  4     8     32     48     64     80
Users:  2035  4010  15520  22416  30016  37040
*SAP certified SD Benchmark results
- 6. New Generation Oracle RAC scales IoT workloads
[Chart: IoT workload scalability for New Apps]
https://www.oracle.com/a/tech/docs/wp-bp-for-iot-with-12c-042017-3679918.pdf
- 7. Oracle RAC Performance is dependent on Private Network Latency
• Oracle RAC uses the private network to
synchronize changes in memory
• The private network should be a distinct,
isolated, low-latency network
• Also important for security
• Communication on generic systems uses
UDP, with Oracle's own error-correction
mechanism layered on top
• Supports redundant networks via OS NIC
bonding or HAIP
• Supports network virtualization (typically used
in hyper-converged infrastructure) via VLANs
• Ensure adequate QoS settings for the private
network
[Diagram: multiple SGAs, each with Buffer Cache, Shared Pool, In-Memory, and Misc areas, synchronized over the private network]
- 8. TCP vs UDP Communication
[Diagram: TCP exchanges Message/Ack pairs; UDP sends messages without acknowledgements]
TCP
• Built-in support for segment retransmission
and flow control
• Ordered transmission
UDP is a stateless protocol
• No flow control or retransmission
• No ordering guarantees
- 9. Oracle RAC UDP protocol improvements
• Oracle RAC implements its own state and
sequence control on top of UDP
• Processes that send and receive messages
acknowledge receipt of messages
• Processes periodically send a special "side
channel message"
• If the acknowledgement of the special side
channel message arrives before the message
itself, the lost block counter is incremented
and the message is resent
• Lost blocks can be caused by
• A flaky private network
• High CPU load or CPU starvation
[Diagram: messages and a periodic side channel message sent over UDP]
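The sequence-and-side-channel scheme described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not Oracle's actual implementation: the receiver tracks the sequence numbers it has seen, and a side channel probe carrying the sender's highest sequence number exposes any gap as lost blocks.

```python
# Hypothetical sketch of sequence-based loss detection over a
# stateless transport (NOT Oracle's actual implementation).

class Receiver:
    def __init__(self):
        self.received = set()   # sequence numbers of data messages seen
        self.lost_blocks = 0    # analogous to a "blocks lost" counter

    def on_message(self, seq):
        self.received.add(seq)

    def on_side_channel(self, highest_seq):
        # The probe says the sender has transmitted everything up to
        # highest_seq; any gap below that point is counted as lost
        # and would trigger a resend request.
        missing = [s for s in range(1, highest_seq + 1)
                   if s not in self.received]
        self.lost_blocks += len(missing)
        return missing

r = Receiver()
for seq in (1, 2, 4):          # message 3 was dropped by the network
    r.on_message(seq)
print(r.on_side_channel(4))    # [3]
print(r.lost_blocks)           # 1
```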
- 10. Configure One or More Private Networks
• The installer detects all networks and provides a
drop-down list of values to choose from
• ASM Network can be combined with Private
network
• Post Installation, oifcfg can be used to
add/delete/modify private network
• Choosing multiple interfaces results in HAIP
being used
• For more details check out my slideshare
(https://www.slideshare.net/AnilNair27/s
mart-monitoring-how-does-oracle-rac-
manage-resource-state-ukoug19-
220887244)
$oifcfg getif
eth1 10.12.0.0 global cluster_interconnect
eth2 10.13.0.0 global asm
$oifcfg setif -global eth3/192.168.23.0:private
- 11. Confirm correct private network is used – Alert log
Cluster Communication is configured to use IPs
from: GPnP
IP: 169.254.249.167 Subnet: 169.254.0.0
Relies on HAIP
Private Interface 'bondib0' configured from GPnP
for use as a private interconnect.
Relies on Bonding for Private network HA
IPCLW over UDP Oracle RDS/IP
- 12. Confirm correct private network is used –AWR/ADDM
[AWR/ADDM screenshots: generic systems and Exadata]
- 13. Detecting performance issues caused by lost blocks
[AWR screenshot: 348 lost blocks, reported only on Instance 2]
[Diagram: send path — user/sys context switch, IP stack, NIC driver, wire]
- 14. Common cause resulting in Blocks lost
• Incorrectly sized UDP send/receive buffers
• Check 'netstat -s' or 'netstat -su' for
"udpInOverflows",
"packet receive errors",
"fragments dropped" or
"outgoing packet drop"
• High CPU utilization caused by packet
fragmentation
• Check the netstat -s IP stat counters:
3104582 fragments dropped after timeout
34550600 reassemblies required
8961342 packets reassembled ok
3104582 packet reassembles failed
• Check /proc/sys/net/ipv4/
• ipfrag_low_thresh (default = 196608)
• ipfrag_high_thresh (default = 262144)
• Network checksum errors causing send (tx) /
receive (rx) errors
• Mismatched MTU sizes
• Ensure all interfaces and switch settings
can support Jumbo frames if possible
• Ensure all private interconnect links are layer 2,
directly attached to the switch/switches
• ipfilter configuration
• Full duplex / duplex mode mismatch
• Flow control mismatch between interfaces and
links
• Defaults to ON
Additional details in "Troubleshooting gc block
lost and Poor Network Performance in a
RAC Environment" (Doc ID 563566.1)
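The fragment-reassembly counters above can be extracted from `netstat -s`-style output programmatically. A minimal sketch (the counter phrases are assumed to match typical Linux output):

```python
import re

# Minimal sketch: extract IP fragment-reassembly counters from
# `netstat -s`-style text; a high failure ratio suggests the
# fragmentation problems described above.

COUNTERS = {
    "reassemblies required": "required",
    "packets reassembled ok": "ok",
    "packet reassembles failed": "failed",
}

def reassembly_stats(netstat_output):
    stats = {}
    for line in netstat_output.splitlines():
        for phrase, key in COUNTERS.items():
            if phrase in line:
                m = re.search(r"(\d+)", line)
                if m:
                    stats[key] = int(m.group(1))
    return stats

sample = """\
    34550600 reassemblies required
    8961342 packets reassembled ok
    3104582 packet reassembles failed
"""
s = reassembly_stats(sample)
print(s["failed"] / s["required"])  # roughly 0.09, i.e. ~9% of reassemblies fail
```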
- 15. Tools to help check network performance
• ORAchk
• iperf/iperf3
iperf3 -s
iperf3 -u -c 192.168.18.xx -l 8K -b 5G -w 1M -t 120
(-u UDP, -b bandwidth, -w window, -t time, -l block size)
- 17. Speaker background
5 years: RDBMS Sales Consulting
14 years: RAC Development
- 18. Today’s Agenda
Troubleshooting Tips from a Performance Engineer
New Generation Oracle RAC Performance Internals
Questions & Answers
- 20. 2-way Block Transfer
Foreground sends request to
remote LMS, LMS sends a data
block back.
• Master is local, holder is
remote.
• Master is remote, and is also
the holder.
3-way Block Transfer
Foreground sends request to
remote LMS, LMS forwards
request to holder LMS, holder
LMS sends a data block back.
• Master is remote, holder is on
another remote instance.
2-way Lock Grant
Foreground sends request to
remote LMS, LMS sends a lock
grant back.
• Master is remote, buffer is not
held anywhere.
• Data is read from disk.
Common Cache Fusion Messaging Patterns
No matter the size of the cluster,
requests are always satisfied in ≤ 3 hops
[Diagrams: 2-way block transfer (FG → remote LMS → data block back); 2-way lock grant (FG → remote LMS → lock grant, followed by disk I/O); 3-way block transfer (FG → master LMS → forwarded to holder LMS → data block back)]
- 21. Waiting for a buffer (or lock grant) transfer from a remote instance
Common Cache Fusion Wait Events
gc [cr|current] [block|grant] [2-way|3-way|busy|congested]
• "gc": short for "Global Cache"
• [cr|current]: type of block requested (CR: consistent read; current: current pin)
• [block|grant]: request outcome (what was returned from the server)
• [2-way|3-way|busy|congested]: performance hint
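The naming scheme can be decoded mechanically. As a small illustration (a hypothetical helper, not part of any Oracle tooling), here is a parser that splits a Global Cache wait event name into its components:

```python
import re

# Sketch: decompose a Cache Fusion wait event name of the form
# "gc [cr|current] [block|grant] [2-way|3-way|busy|congested]".

PATTERN = re.compile(
    r"^gc (cr|current) (block|grant) (2-way|3-way|busy|congested)$"
)

def parse_gc_event(name):
    m = PATTERN.match(name)
    if not m:
        return None  # not a wait event of this form
    block_type, outcome, hint = m.groups()
    return {
        "block_type": block_type,  # CR (consistent read) or current pin
        "outcome": outcome,        # block transfer or lock grant
        "hint": hint,              # the performance hint
    }

print(parse_gc_event("gc cr block 3-way"))
# {'block_type': 'cr', 'outcome': 'block', 'hint': '3-way'}
```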
- 22. The “performance hint”
• “2-way” / “3-way”
• Received an immediate response back from the master or holder instance.
• “busy”
• If LMS is unable to immediately service the request, processing gets queued.
- A dirty buffer had to be sent, so a REDO log flush was initiated.
- The buffer was pinned by another process.
- The buffer lock was held in an incompatible mode.
- AWR will help explain the reason.
• "congested"
• The request was stuck in the LMS queue before being processed.
- If queuing exceeds a 1 ms threshold, the wait is marked "congested".
• Usually indicates LMS starvation.
Common Cache Fusion Wait Events
[Diagram: in the "busy" case, the holding instance performs a REDO log flush before the data block transfer]
- 23. Details, expectations and probable causes for abnormalities
Common Cache Fusion Wait Events
gc [cr|current] block [2|3]-way
• Expected latency: < 1 ms (< 100 µs on Exadata for 2-way).
• Reasons for abnormal waits: interconnect congestion, high CPU load and priority inversion, LMS not running at real-time (RT) priority.
gc [cr|current] block busy
• Expected latency: log file parallel write + latency of an immediate transfer.
• Reasons for abnormal waits: log flush time, application contention on hot blocks, high CPU load and priority inversion.
gc [cr|current] block congested
• This event should not appear on healthy systems.
• Reasons for abnormal waits: same as the [2|3]-way waits.
- 24. Concurrency waits for globally-busy buffers
Common Cache Fusion Wait Events
gc buffer busy [acquire|release]
• "ACQUIRE": someone wants to read a remote block, and the request has not yet completed.
• The remote block transfer is likely already in progress on behalf of another local client.
• "RELEASE": a block was released (pinged away) because it was requested by another instance.
• Upon completion of this wait, the client may be able to pin the buffer right away (since the
buffer has meanwhile been ACQUIRE'd by another client), or in some cases the lock request
must be restarted.
• Don't confuse these with the "gc [cr|current] [block|grant] busy" waits.
• Under high concurrency there will be many "buffer busy" waits and far fewer transfer waits,
but the root cause of the waits is the transfers.
- 25. Other common wait events
Common Cache Fusion Wait Events
gc [cr|current] request
• Used for in-progress buffer requests (reported in ASH, hang analyze, etc., but never in AWR).
• The wait event is "fixed up" upon completion to:
• "gc [cr|current] [block|grant] [2-way|3-way|busy|congested]"
gc [cr|current] multi block [request|grant|mixed]
• A batch of requests was initiated by the client.
• "request" is used when all requests were fulfilled via block transfers.
• "grant" is used when all requests were fulfilled via lock grants (disk I/Os will follow).
• "mixed" is used when we got a mixture of both.
gc [cr|current] block direct read
• Exadata only (18c+).
• RDMA read of a remote buffer.
- 26. Common reasons and solutions
Common Wait Event Dependencies
log file parallel write → gc [cr|current] block busy
• Log writes on the sending instance block Cache Fusion sends.
• Generally log-I/O related.
• Tune DML schema or SQL to reduce hot blocks.
Often associated with other wait events
• gc [cr|current] block busy → enq: TX - index contention
• gc [cr|current] block busy → enq: TX - row contention
• Dependencies are captured in ASH, based on session blocker/blocked information.
gc [cr|current] block busy → gc buffer busy [acquire|release]
• A session tries to pin a buffer which is/was pinned by another session waiting for a
Cache Fusion transfer from another node.
• gc buffer busy is indicative of serialization.
• Fix the root cause (e.g. slow log writes) or remove hot spots.
- 27. Correlated/Supporting OS and AWR statistics
Common Cache Fusion Wait Events
gc [cr|current] block [2|3]-way →
• Run queue length and load average (either server side or client side).
• LMS priority changes.
• Signs of congestion on the interconnect.
gc [cr|current] block busy →
• Log file parallel write and log file sync on other instances.
• Pin and flush statistics on other instances.
gc [cr|current] block congested →
• LMS CPU utilization.
• Receive queue time.
- 29. Long "Flush" times usually indicate log I/O issues.
Long "Pin" times may indicate:
• CPU load on serving instances, leading to slow release of pins by foregrounds
• Priority inversion on the serving instance (preempted while holding an exclusive pin)
• Inefficient code where pins are held for a long time (contact Oracle Support)
- 31. New Generation Oracle RAC
Performance Internals
- 32. • Exafusion
• Zero copy block sends
• Smart fusion block transfer
• Index contention optimizations
• CR & remastering servers
• Service-oriented buffer cache
• Global enqueue S-lock
• RDMA for undo blocks
• In-memory commit cache
• Scalable sequences
• PDB lock domains
• Oracle RAC Sharding
• Fast index split
• PMEM commit accelerator (X8M)
• RDMA for data block reads
• RDMA for undo header blocks
• Broadcast on Commit RDMA
• More index contention optimizations
Recent Performance Optimizations for RAC
* Exadata-specific optimizations
More details at: https://www.oracle.com/a/otn/docs/oracle-rac-cache-fusion-performance-optimization-on-exadata-wp.pdf
- 33. • Remote Direct Memory Access (RDMA) is the ability of one computer to
access memory on a remote computer without involving the remote OS or CPU
• The network card directly reads/writes memory with no extra copying or
buffering, at very low latency
Exadata Uses RDMA for Extreme Performance
[Diagram: RDMA read/write between Server #1 and Server #2 — the HCA accesses DRAM directly, bypassing the CPU]
- 34. RDMA can eliminate:
• All server-side processing & context switches (up to 80% of the overall round-trip time)
• Server-side queuing effects: even if LMS is completely bottlenecked, RDMA requests are unaffected
• Client-side context switches
RDMA is now supported for the following operations:
• Undo block reads (18c)
• Shared data block reads (Future)
• Undo header reads (Future)
• In-memory commit cache lookups (Future)
RDMA for Cache Fusion
- 35. Master is local,
Holder is remote
• Removes all LMS processing
Master is remote,
and is also the Holder
• Cheaper processing in LMS
Master is remote,
Holder is another remote instance
• Remove LMS-LMS messaging
• Does not invoke holder LMS
New Messaging Patterns With Shared Data Block RDMA
[Diagrams: in each pattern the FG obtains a lock grant (locally or from a remote LMS) and then performs an RDMA direct read of the block from the remote SGA, without involving the holder's LMS]
- 36. Example from a system where LMS is NOT saturated
• Cache Fusion averages are healthy, at around ~120 µs for 2-way and ~200 µs for 3-way transfers
• RDMA is very fast at 8 µs, roughly 15x faster
Cache Fusion RDMA Benefits
Event                         Waits   Total Time (s)  Avg wait  Waits/txn  % DB time
----------------------------  ------  --------------  --------  ---------  ---------
cell single block physical    57,904              16  280.33us        1.6       25.4
log file sync                 35,778              13  355.15us        1.0       19.9
gc cr grant 2-way             75,206               8  112.77us        2.1       13.3
gc current grant 2-way        13,855               1  106.73us        0.4        2.3
gc cr block 2-way             10,213               1  122.94us        0.3        2.0
gc cr block 3-way              4,946               1  202.63us        0.1        1.6
gc current grant busy         10,494               1   86.34us        0.3        1.4
Sync ASM rebalance                25               1   20.99ms        0.0         .8
gc current block direct read  57,550               0    8.20us        1.6         .7
gc current block 2-way         2,567               0  111.98us        0.1         .4
gc current block 3-way         1,271               0  181.92us        0.0         .4
- 37. Example from a system where LMS IS saturated
• With higher load, LMS is now a bottleneck with lots of congested waits
• RDMA is still fast, at 15 µs
• Slower than before due to client-side scheduling delays
Cache Fusion RDMA Benefits
Event                         Waits      Total Time (s)  Avg wait  Waits/txn  % DB time
----------------------------  ---------  --------------  --------  ---------  ---------
log file sync                   944,248             782  828.24us        1.0       16.4
gc cr grant 2-way             1,713,004             778  453.94us        1.8       16.3
cell single block physical    1,855,674             743  400.15us        2.0       15.6
gc cr grant congested            83,226             412    4.94ms        0.1        8.6
gc current grant busy           232,165             325    1.40ms        0.2        6.8
gc cr block congested            42,106             217    5.16ms        0.0        4.6
gc cr block 2-way               438,178             203  464.18us        0.5        4.3
gc cr block 3-way               214,811             186  866.67us        0.2        3.9
(snip)
gc current block direct read    864,609              13   15.11us        0.9         .3
- 38. Various optimizations have been implemented to help improve scalability for these workloads
• Keep branch block at splitter (12c)
• Make sure splits complete as quickly as possible by preventing branch blocks from being pinged away
• Scalable Sequences (18c)
• Sequence values are automatically prefixed with an instance-specific value
• Oracle RAC Sharding (18c)
• Leverages the Oracle Sharding direct-routing APIs to implement data-dependent routing based on an
application-supplied partitioning key
• Delayed ping (19c)
• Allow local sessions to make more changes to a block while REDO I/O is still pending
• This will help increase “local” changes that can be done without Cache Fusion transfers
• Fast index split (19c)
• Allow sessions to wait for leaf block split completion in the buffer cache, instead of global enqueues
Optimizations for Right Growing Index Workloads
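The scalable-sequence idea above can be illustrated with a toy model: prefixing each value with an instance-specific component makes concurrent inserts from different instances land in different regions of the index, instead of all contending on the rightmost leaf block. This sketch does not reflect Oracle's actual sequence value format:

```python
# Hypothetical sketch of a scalable sequence: an instance-specific
# prefix spreads otherwise right-growing values across the keyspace.
# (Oracle's real on-disk format differs; this only illustrates the idea.)

def scalable_next(instance_id, local_value, width=6):
    # Prefix the monotonically increasing local value with the
    # instance id so values from different instances diverge.
    return int(f"{instance_id}{local_value:0{width}d}")

# Two instances generating values concurrently:
inst1 = [scalable_next(1, v) for v in range(1, 4)]
inst2 = [scalable_next(2, v) for v in range(1, 4)]
print(inst1)  # [1000001, 1000002, 1000003]
print(inst2)  # [2000001, 2000002, 2000003]
# Instance 2's values never interleave with instance 1's, so each
# instance appends to its own region of the index.
```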
- 39. PDB Lock Domains Improve Performance
[Diagram: a Container Database (CDB) with PDB 1, PDB 2, and PDB 3 spread across Instances 1-3]
RAC locks are isolated per-PDB
• Lock masters are distributed among the instances
where the PDB is open
• PDB 1 is a “singleton”; it is only open on instance 1
• All locks for PDB 1 are mastered on instance 1, so
there is 100% locality and no Cache Fusion
messaging
• If more compute is needed, one can dynamically
open PDB 1 on instance 2 and/or 3
• Very common use-case in Autonomous Database
• PDB 2 and 3 are RAC clusters
Please set target_pdbs for best performance
(MOS note: 2644243.1)
- 40. RAC (PDB) reconfiguration is only necessary if:
• The PDB is opened or closed on another CDB instance
• If PDB 1 opens on instance 2, RAC reconfiguration is
invoked for PDB 1
• A CDB instance where the PDB is open goes down
(either planned or unplanned)
• If CDB instance 2 goes down, RAC reconfiguration is
invoked for PDBs 2 and 3
Impact is isolated to the affected PDBs only:
• PDB 1 is unaffected when CDB instance 3 crashes
• PDB 2 and 3 are unaffected when PDB 1 is opened on CDB
instance 2
• No PDBs are affected when a 4th CDB instance is brought up
PDB Lock Domains Improve Failure Isolation
[Diagram: a Container Database (CDB) with PDB 1, PDB 2, and PDB 3 spread across Instances 1-3]
- 43. New Generation Oracle RAC
High Availability and Scalability for both
• Traditional Applications
• New Application Paradigms
Provides a single converged and secure
database for all your application needs,
while ensuring complete isolation between
the PDBs