OracleRACInternals_CacheFusion_RACPerfTuning

Oracle Real Application Clusters (RAC)

RAC Internals, Cache Fusion and Performance Tuning

A “BrainSurface” Presentation

www.brainsurface.com

Disclaimer

The views/content in this document are those of the author and do not necessarily reflect those of Oracle Corporation and/or its affiliates/subsidiaries. The material in this document is for informational purposes only and is published with no guarantee or warranty, express or implied.

Agenda

Oracle RAC Internals

• Node & Clusterware stack startup sequence

• Heartbeat mechanism

• Voting disk functionality

• Split-brain resolution

• Node reboot causes

Oracle RAC Internals: Node Startup Sequence

Figure/Diagram from Oracle Documentation

Clusterware startup order is discussed in the coming slides

Oracle RAC Internals: Clusterware Stack Startup Sequence: Pre-11gR2

h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null

h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null

h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

Entries in /etc/inittab (added during root.sh execution)

Startup flow: Node boots up → OS startup → inittab → init.evmd, init.cssd, init.crsd → Clusterware stack startup

Processes in the Clusterware stack: evmd.bin, ocssd.bin, crsd.bin, oprocd.bin, oclsmon.bin

• crsd.bin
– Manages and monitors CRS resources
– Updates the OCR when srvctl is used

• ocssd.bin
– Provides cluster group membership
– Monitors nodes in the cluster via the heartbeat mechanism
– Uses the voting disk

• evmd.bin
– Publishes events upon detecting them
– Responsible for executing callouts
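The pre-11gR2 /etc/inittab entries shown for this stack follow the standard SysV layout id:runlevels:action:process. The following is a minimal sketch of how such a line breaks down; the names InittabEntry and parse_inittab_line are hypothetical helpers introduced only for illustration, not part of any Oracle tool.

```python
from typing import NamedTuple


class InittabEntry(NamedTuple):
    ident: str       # unique id of the entry, e.g. "h2"
    runlevels: str   # runlevels in which the entry is active, e.g. "35"
    action: str      # "respawn" asks init to restart the process when it exits
    process: str     # command line that init executes


def parse_inittab_line(line: str) -> InittabEntry:
    """Split one inittab line into id, runlevels, action and process."""
    # Only the first three colons are separators; the command may contain more.
    ident, runlevels, action, process = line.strip().split(":", 3)
    return InittabEntry(ident, runlevels, action, process)


# The three entries quoted on the slide.
ENTRIES = [
    "h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null",
    "h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null",
    "h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null",
]

parsed = [parse_inittab_line(entry) for entry in ENTRIES]
for entry in parsed:
    script, argument = entry.process.split()[:2]
    print(entry.ident, list(entry.runlevels), entry.action, script, argument)
```

Reading the parsed fields, all three wrappers are respawned in runlevels 3 and 5, and only init.cssd is started with the fatal argument.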

Oracle RAC Internals: Clusterware Stack Startup Sequence: 11gR2

h1:3:respawn:/sbin/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

Entry in /etc/inittab (added during root.sh execution)

Startup flow: Node boots up → OS startup → inittab → init.ohasd → Oracle High Availability Services Daemon

First-level agents and what they start:

• orarootagent.bin
– CSSD Monitor
– CRSD
– CTSSD
– Diskmon
– ACFS Drivers

• oraagent.bin
– MDNSD
– GIPCD
– GPNPD
– EVMD
– ASM

• cssdagent
– OCSSD

Second-level agents and the resources they start:

• oraagent
– ONS
– ASM Instance
– DB Instance
– Listener
– SCAN Listener

• orarootagent
– GSD
– VIP
– SCAN VIP

Oracle RAC Internals: Clusterware Stack Startup Order: 11gR2

Figure: startup order under the Oracle High Availability Services Daemon

Oracle RAC Internals: Clusterware and Heartbeat Mechanism

Two types of heartbeats:

1. Network heartbeat

• Performed once per second.

• A node is evicted from the cluster when it fails to send a network heartbeat within the MissCount time frame (maximum time in seconds).

• Logged by clssnmPollingThread (ocssd.log):

[CSSD]2009-01-27 11:15:37.409 [18] >TRACE: clssnmPollingThread: Eviction started for node usogp06 (6), flags 0x0001, state 3,wt4c 0

2. Disk (voting disk) heartbeat

• Each node of a cluster writes a disk heartbeat to the voting disk every second.

• Each node reads the kill block every second and commits suicide if required.

• A node is evicted from the cluster if no heartbeat is updated within the I/O timeout (MissCount/Disktimeout).

• Logged by clssnmDiskPMT (ocssd.log):

[CSSD]2009-10-11 15:56:23.668 [93645744] >WARNING: clssnmDiskPMT: long disk latency >(45940 ms) to voting disk (0//dev/raw/raw1)

CSS parameters and their default values in 11gR2:

crsctl get css <parameter>

crsctl set css <parameter> <value>

• clusterguid
• disktimeout: 200 seconds
• misscount: 30 seconds (longer when vendor clusterware is configured)
• reboottime: 3 seconds
• priority: 4 (UNIX), 3 (Windows)
• logfilesize: 50 MB
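To make the two timeouts concrete, here is a minimal sketch of the eviction check described for the network and disk heartbeats, using the 11gR2 defaults quoted above (misscount 30 seconds, disktimeout 200 seconds). The class NodeHeartbeats and the function eviction_reasons are hypothetical names for illustration only; the real CSS daemon applies additional rules that this toy model ignores.

```python
from dataclasses import dataclass
from typing import List

MISSCOUNT_S = 30     # default misscount quoted in the text
DISKTIMEOUT_S = 200  # default disktimeout quoted in the text


@dataclass
class NodeHeartbeats:
    name: str
    last_network_hb: float  # time of the last network heartbeat, in seconds
    last_disk_hb: float     # time of the last voting disk heartbeat, in seconds


def eviction_reasons(node: NodeHeartbeats, now: float,
                     misscount: float = MISSCOUNT_S,
                     disktimeout: float = DISKTIMEOUT_S) -> List[str]:
    """Return the reasons (possibly none) why a node would be evicted."""
    reasons = []
    if now - node.last_network_hb >= misscount:
        reasons.append("network heartbeat older than misscount")
    if now - node.last_disk_hb >= disktimeout:
        reasons.append("disk heartbeat older than disktimeout")
    return reasons


# Hypothetical sample: node3 stopped sending network heartbeats 31 seconds ago.
now = 1000.0
healthy = NodeHeartbeats("node1", last_network_hb=999.0, last_disk_hb=999.0)
silent = NodeHeartbeats("node3", last_network_hb=969.0, last_disk_hb=999.0)

print(healthy.name, eviction_reasons(healthy, now))
print(silent.name, eviction_reasons(silent, now))
```

With heartbeats sent once per second, a misscount of 30 corresponds to roughly 30 consecutive missed network heartbeats before eviction starts.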

Oracle RAC Internals: Voting Disk Functionality

• Used by the Cluster Synchronization Service (CSS).

• It records and manages the node membership information.

• At any time, each node of a cluster must be able to access more than half of the voting disks.

• Recommended to have 2n+1 (odd number) voting disk files.

Diagram: Node1, Node2 and Node3 exchange network heartbeats (every second) and each writes a disk heartbeat (once per second) to the voting disk. All 3 nodes can see each other: ALL IS WELL!
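The "more than half" rule and the 2n+1 recommendation can be checked with a little arithmetic. The sketch below is only an illustration of that rule; the function names are hypothetical.

```python
def has_voting_disk_majority(accessible: int, total: int) -> bool:
    """A node must be able to access strictly more than half of the voting disks."""
    return 2 * accessible > total


def tolerated_disk_failures(total: int) -> int:
    """Largest number of inaccessible voting disks that still leaves a majority."""
    return (total - 1) // 2


for total in (1, 2, 3, 4, 5):
    print(total, "voting disks tolerate", tolerated_disk_failures(total), "failure(s)")
```

Under this rule, going from 3 to 4 voting disks does not increase the number of tolerated failures, which is why an odd count (2n+1 disks tolerating n failures) is recommended.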

Oracle RAC Internals: Split-Brain Syndrome

Diagram: Node1, Node2, Node3 and the voting disk in a split-brain scenario

• Node 1 & 2 can see each other but both can’t see Node 3.

• Node 3 can’t see Nodes 1 & 2.

• Nodes 1 & 2: let’s evict Node3.

• Message to Node3 via the voting disk: kill yourself.
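The three-node scenario in the diagram can be sketched as a small graph problem: group the nodes into sub-clusters that can still see each other, then pick one sub-cluster to survive. The rule used here (largest sub-cluster wins, ties go to the sub-cluster containing the lowest node number) is an assumption not stated on the slide, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def sub_clusters(nodes: Iterable[int], links: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group nodes into sub-clusters connected by working interconnect links."""
    links = list(links)
    remaining = set(nodes)
    groups = []
    while remaining:
        start = min(remaining)
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            for a, b in links:
                if a == current:
                    other = b
                elif b == current:
                    other = a
                else:
                    continue
                if other in remaining and other not in group:
                    group.add(other)
                    frontier.append(other)
        remaining -= group
        groups.append(sorted(group))
    return groups


def surviving_sub_cluster(groups: List[List[int]]) -> List[int]:
    # Assumed rule: largest sub-cluster survives; ties go to the lowest node number.
    return min(groups, key=lambda g: (-len(g), g[0]))


# Scenario from the diagram: nodes 1 and 2 see each other, node 3 is isolated.
groups = sub_clusters([1, 2, 3], links=[(1, 2)])
survivors = surviving_sub_cluster(groups)
evicted = [n for g in groups if g != survivors for n in g]
print("sub-clusters:", groups, "survivors:", survivors, "evicted:", evicted)
```

For the diagram's scenario this yields nodes 1 and 2 as survivors and node 3 as the evicted node, matching the "kill yourself (Node3)" outcome.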

Oracle RAC Internals: Split-Brain Resolution

What is Split-Brain?

Quote/Abstract from MOS document:

The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.

Oracle RAC Internals: Node Reboot Causes

When does a node reboot?

• Network failure – interconnect

• Slow interconnect (latency) – must fail 30 consecutive times! | check private interconnect configuration

• Voting disk I/O – cannot read or write | refer to ocssd.log

• CPU-bound – CPU is too busy to maintain heartbeat | configure OSWatcher to verify resource consumption

• Files moved, deleted, changed or some other human error

• Configuration error – wrong network for private interconnect

• ocssd process died

• Some Oracle Clusterware bug

Oracle RAC Internals: Grid Infrastructure: Log Files Hierarchy

What is Cache Fusion? Synopsis & Overview

• Cache Fusion is the driving technology behind Oracle RAC that enables applications to scale out on multiple servers/instances.

• Cache Fusion/Synchronization enables concurrent/simultaneous transaction processing between all instances using the Private Cluster Interconnect.

• DB Blocks are synchronized, NOT mirrored = Faster performance.

• Since the advent of Oracle RAC 9i in 2001, Cache Fusion has provided the following features:

– More nodes can be added/removed in HOT MODE = ZERO DOWNTIME, with zero database downtime, to provide elasticity and scalability.

– Database files residing on a shared-disk cluster file system provide a uniform, fast and read-consistent image to the end user.

– Applications typically scale out of the box with zero/minimal tuning.

Cache Fusion – Synopsis & Overview

• Cache Fusion is very fast because disk writes are eliminated when other instances request blocks for updates.

• Cache Fusion is a mechanism within Oracle RAC that employs a Shared Cache Architecture, fusing the in-memory data buffer caches across all nodes into a single logical read-consistent buffer cache available to all instances.

• DB Blocks are transferred in memory from one instance's cache to another over the Cluster Interconnect when requested, after proper locking procedures are applied.

• Global Cache Service (GCS) is used for FAST instance-to-instance block buffer transfer and establishes/implements Cache Coherency = Never more than 3 hops.

• Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is used for block buffer locking.

• Global Resource Directory (GRD) is used for keeping track of Block Buffer Location/Mode/Role information.

• The Private Cluster Interconnect is used for block transfers amongst instances to enable Cache Fusion.
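The "never more than 3 hops" claim can be illustrated with a toy message-path model. This sketch assumes that each block has one instance acting as its directory "master" (the instance whose GRD entry tracks the block) and one instance currently holding it; the master notion and the function name are assumptions for illustration, not a description of Oracle's actual protocol.

```python
from itertools import product
from typing import List, Tuple


def block_transfer_path(requester: int, master: int, holder: int) -> List[Tuple[int, int]]:
    """Return the (sender, receiver) messages needed to ship a block in the toy model."""
    if requester == holder:
        return []  # block is already in the local cache
    path = []
    if requester != master:
        path.append((requester, master))  # ask the instance tracking the block
    if master != holder:
        path.append((master, holder))     # directory forwards the request to the holder
    path.append((holder, requester))      # holder ships the block over the interconnect
    return path


# Worst case: requester, master and holder are three different instances.
print(block_transfer_path(requester=1, master=2, holder=3))

# The hop count stays at 3 or below no matter how many instances there are.
instances = range(1, 9)
worst = max(len(block_transfer_path(r, m, h)) for r, m, h in product(instances, repeat=3))
print("worst case hops:", worst)
```

In this model the number of messages depends only on the three roles involved, not on the cluster size, which is the point behind the 3-hop limit.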

Cache Fusion Architecture – Overview

Cache Fusion Architecture – Global Resource Directory (GRD)

• GCS & GES maintain the Global Resource Directory (GRD).

• Internal Repository stored by all instances of the RAC Cluster.

• Global Resource Directory (GRD) is used for keeping track of Data Structures, Block Buffer Location, Mode, Role, Inventory etc.

Cache Fusion Architecture – Global Cache Service (GCS)

• The backbone of Cache Fusion: Responsible for Cache Coherence.

• Responsible for maintaining different block modes and transfer of data buffers amongst the instances.

• Implemented by the Global Cache Service Processes (LMSn).

• Lock Manager Server (LMS): Processes that are responsible for remote messaging.

• LMSn: n = 0 – 9: Up to 10 LMS processes: Can be set with the init parameter GCS_SERVER_PROCESSES

Cache Fusion Architecture – Global Enqueue Service (GES)

• Global Enqueue Service (GES), previously known as the Distributed Lock Manager (DLM), is responsible for the locking mechanisms used in Cache Fusion.

• The LMON process is responsible for cluster monitoring & management of global resources: Also known as Cluster Group Services.

• The LMD0 process is responsible for:

– Management of resource requests from RAC instances.

– Distributed Deadlock Detection.

– Processing of Enqueued Requests.

– Access Control to Global Enqueues.

Cache Fusion – Measuring Efficiency

• Global Cache Services (GCS) Waits = Cross-Instance Block Transfer Waits = Measure of Data Block Transfer Efficiency.
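One common way to put a number on block transfer efficiency is the average time taken to receive a block from another instance. The sketch below assumes two cumulative counters in the style of the v$sysstat statistics "gc cr block receive time" (taken here to be in centiseconds) and "gc cr blocks received"; the statistic names, the unit and the sample values are assumptions introduced for illustration and do not come from the slides.

```python
def avg_block_receive_ms(receive_time_cs: float, blocks_received: int) -> float:
    """Average time to receive one block, in milliseconds.

    receive_time_cs is assumed to be a cumulative time in centiseconds,
    so multiplying by 10 converts it to milliseconds.
    """
    if blocks_received == 0:
        return 0.0
    return 10.0 * receive_time_cs / blocks_received


# Hypothetical counter values sampled from one instance.
sample = {"gc cr block receive time": 1200, "gc cr blocks received": 8000}

latency_ms = avg_block_receive_ms(
    sample["gc cr block receive time"],
    sample["gc cr blocks received"],
)
print(f"average CR block receive time: {latency_ms:.2f} ms")
```

Because the counters are cumulative, taking the difference between two samples gives the figure for a specific interval rather than since instance startup.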

Cache Fusion – Dynamic Performance Views

• Some useful Dynamic Performance Views for monitoring Cache Fusion:

– v$gc_element
– v$cache
– v$instance_cache_transfer
– v$cr_block_server
– v$cache_transfer
– v$ges_blocking_enqueue
– gv$file_cache_transfer
– gv$temp_cache_transfer
– gv$cache_transfer
– gv$class_cache_transfer

RAC Performance Tuning: Starting Out

• Nemiec (2004 – 9i RAC)
– App Tuning
– Database Tuning
– OS Tuning
– THEN... RAC Tuning

• Nanda (2009)
– CPU and I/O (not Interconnect) are necessary for RAC Performance

• Lawson (2010)
– “The Essence Of Performance Tuning Is The Same”

• These quotes are from presentations in the RAC SIG library.

RAC Performance Tuning: Approaches

• Top-Down
– Application Responsiveness
– Grid Control Performance Tab
– Statspack/AWR Reports

• Goal: Minimize Response Time or Maximize Throughput

• Bottom-Up
– Storage: Spindles, Controllers, Paths
– OS: I/O times, queues; Network latency; Memory; CPU (each core)

• Goal: Balance & Maximize Utilization

RAC Performance Tuning: Application & Schema Design

• Look Out For:
– Indexes
– Sequences
– “Hot” rows or small tables
– MSSM
– “gc” Wait Events
– High Interconnect Utilization

• Main Principle: parallelize (avoid serialization on any data)

• If it doesn't scale on SMP then it won't scale on RAC
– Same principles of good app design for non-RAC!!

• Decrease rows/block

• Reverse Key or Hash Indexes
– No Range Scans

• Seq NoOrder+Cache

• ASSM (or Freelist Groups)

• Data & Index Partitioning

• App Partitioning
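The reverse key suggestion can be illustrated with a toy encoding. Sequence-generated keys share their leading bytes, so consecutive inserts from all instances land on the same index leaf block; reversing the bytes spreads them out, at the price of losing key order (hence "No Range Scans"). The sketch uses a plain 4-byte big-endian integer, which is an assumption for illustration and not Oracle's internal number format; the function names are hypothetical.

```python
def key_bytes(value: int, width: int = 4) -> bytes:
    """Toy index key: the value as a fixed-width big-endian integer."""
    return value.to_bytes(width, "big")


def reverse_key(value: int, width: int = 4) -> bytes:
    """Toy reverse key: the same bytes in reverse order."""
    return key_bytes(value, width)[::-1]


sequence_values = list(range(1000, 1008))  # eight consecutive sequence numbers

# Keys sharing a 3-byte prefix are treated here as neighbours in the index.
normal_prefixes = {key_bytes(v)[:3] for v in sequence_values}
reversed_prefixes = {reverse_key(v)[:3] for v in sequence_values}

print("distinct prefixes, normal keys:  ", len(normal_prefixes))
print("distinct prefixes, reversed keys:", len(reversed_prefixes))

# Reversed keys no longer sort in value order, so range scans stop being useful.
by_reversed_key = sorted(sequence_values + [2000], key=reverse_key)
print("values in reversed-key order:", by_reversed_key)
```

In this toy model the eight consecutive values collapse onto a single prefix with normal keys and spread over eight prefixes once reversed, which is the contention relief the slide is after.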

RAC Performance Tuning: Tune the Entire System as a Whole

Figure/Diagram from Bert Scalzo


RAC Performance Tuning: Real Life Case Study

RAC Performance Tuning: Configuration Checklist

• Hardware
– All nodes have similar performance characteristics

• Interconnect (the RAC Achilles’ heel)
– Network segment truly private
– Bond NICs to improve throughput
– All nodes set NICs to Jumbo Frames
– Switches / VLANs set to Jumbo Frames
– Consider 10Gbit Ethernet for Interconnect

• Storage
– Multipath
– Verify settings for read & write caching match application nature
– If using iSCSI, treat as similar to interconnect network (see above)

• Software
– All nodes have the exact same OS patches
– All nodes have the exact same Oracle patches
– Oracle both recommends and pushes for using ASM on RAC
– Do NOT rely on non-RAC enabled scripts or tools for handling RAC

RAC Performance Tuning: Block Size is Important

• DBCA’s default block size is 8K

• Many DBAs’ experience is that a bigger block size is better

• So most databases these days have block sizes >= 8K

• But bigger is not always better

• Block size and number of nodes should be considered together

• No matter how fast or good Cache Fusion is – don’t stress it if unnecessary

RAC Performance Tuning: Block Contention

• Example: OLTP application using 8K block size and having 8 nodes

• Larger block size = more rows per block

• More rows per block = more likelihood of block contention

• More nodes (>=4) = more likelihood of block contention

• More block contention means more Cache Fusion work

• Remember, the interconnect is most often RAC’s Achilles’ heel …
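The relationship between block size, node count and contention can be approximated with a birthday-problem style estimate. The sketch below is a toy model under strong assumptions that are not in the slides: each node touches one uniformly random row of a hot table at the same moment, block overhead is ignored, and the row count and row size are made-up sample values.

```python
import math


def blocks_for_table(row_count: int, avg_row_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed for the table, ignoring block overhead."""
    rows_per_block = max(1, block_bytes // avg_row_bytes)
    return math.ceil(row_count / rows_per_block)


def collision_probability(nodes: int, blocks: int) -> float:
    """Chance that at least two nodes pick the same block at the same time."""
    if nodes > blocks:
        return 1.0
    p_all_distinct = 1.0
    for i in range(nodes):
        p_all_distinct *= (blocks - i) / blocks
    return 1.0 - p_all_distinct


ROWS, ROW_BYTES = 10_000, 100  # hypothetical hot table

for block_bytes in (2048, 8192, 16384):
    blocks = blocks_for_table(ROWS, ROW_BYTES, block_bytes)
    for nodes in (2, 4, 8):
        p = collision_probability(nodes, blocks)
        print(f"{block_bytes:>6} byte blocks, {nodes} nodes: {blocks:>4} blocks, "
              f"collision chance {p:.2%}")
```

In this model the collision chance rises both when the block size grows (fewer, fuller blocks) and when more nodes are active, which mirrors the bullets above.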

Summary

To summarize, Oracle RAC is proven, robust and stable, and is used by corporations, organizations & governments across the globe to achieve High Availability, Elasticity & Scalability by providing a lower-cost and higher-ROI alternative to Mainframe-like SMP (Symmetric Multi-Processing) models of computing. Learn more about Oracle RAC at Oracle's RAC homepage.

http://www.oracle.com/technology/products/database/clustering/index.html
