
Oracle Data Guard: Defining the Next Era in Data Availability and Data Protection

Ashish Ray, Group Product Manager, Oracle
Lee Parsons, Database Engineering Manager

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.

Agenda

• Disaster Recovery (DR) – Common Concerns
• Data Guard – A Quick Introduction
• Data Guard – Extending Beyond DR
• Amazon.com – Beyond Custom Physical Standbys

Common Concerns for DR Solutions

• Roadblocks for adoption of DR solutions
  • Perception around the term “Disaster”
    • “Disaster” often linked to destructive events that occur infrequently, so no strong urge to implement a DR solution
      “When it happens, we will see.”
      “We do tape backups, and that should be fine, right?”
  • Shortcomings of existing solutions
      Most DR solutions involve redundant systems that can’t be put to productive use
      The solutions are expensive, with no immediate ROI (until a “disaster” occurs)
      “We don’t have budget for machines basically sitting idle.”

What is a “Disaster”?

• Well-recognized disasters such as headline-grabbing events
  • Fire, earthquake, tsunami, flood, hurricane, …
• What about more mundane events that still cause outages but occur much more frequently?
  • Faulty system components – server, network, storage, software, …
  • Data corruptions
  • Backup/recovery of bad data
  • Wrong batch job
  • Bad HW/SW installations / upgrades / patching
  • Operator errors
  • Power outages
  • Etc.

Real-life “Disaster”: Financial Services Company

• Examples of the errors observed in the alert.log of the production database:
    Errors in file /opt/app/oracle/admin/dg/bdump/dg1.trc:
    ORA-01186: file 93 failed verification tests
    ORA-01122: database file 93 failed verification check
    ORA-01110: data file 93: '/dbmnt/db01/oradata/dg/arch05.dg'
    ORA-01251: Unknown File Header Version read for file number 93
• ORA-01251 – Corrupted file header. This can be caused by a missed read or write, a hardware problem, or a process external to Oracle overwriting the information in the file header.
• Affected database: one of the most critical databases supporting the firm’s retail businesses
  • Supports the firm’s primary customer-facing applications for trade transaction confirmation, new accounts, and customer account information

Traditional DR solutions such as storage mirroring would propagate this corruption to the target storage volumes, rendering them useless as well

Needed: Next-Generation DR Solution – Comprehensive Availability & Protection

• Data Availability
    Outages should be tolerated transparently
    Outages should be recovered from quickly
• Data Protection
    Standby data should be isolated from production faults
    No data should be lost
• Systems Utilization
    Standby resources should be utilized for productive use
• Fully integrated in a cost-effective manner

That’s where Data Guard comes in!

Remember – Real-life “Disaster”? What Did They Do?

• Thankfully, they already had Data Guard implemented
  • Physical Standby, Maximum Availability
  • Data Guard architecture prevented corruption from affecting their standby databases
• Failed over to the standby database
  • New production database up in minutes, no loss of data
• Independently investigated problems at original production server
  • Problem traced to faulty storage array component
  • Took a few days to investigate and fix the problem

What is Data Guard?

• Data Availability & Data Protection solution for Oracle
• Automates the creation and maintenance of one or more synchronized copies (standby) of the production (or primary) database
• If the primary database becomes unavailable, a standby database can easily assume the primary role
• Feature of Oracle Database Enterprise Edition (EE)
  • Feature available at no extra cost
  • Primary and standby databases need to be licensed EE
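
To make the primary and standby roles concrete, here is a minimal sketch of how a DBA might check which role a database currently holds. It assumes a 10g-era database and uses only columns of the v$database view; it is an illustration, not part of the original slides.

```sql
-- Run on any database in the configuration to see its current role.
-- DATABASE_ROLE reports PRIMARY, PHYSICAL STANDBY, or LOGICAL STANDBY.
SELECT name,
       database_role,
       protection_mode,
       protection_level,
       switchover_status
FROM   v$database;
```

Running the same query on the primary and on each standby gives a quick picture of the whole configuration.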

Oracle’s Integrated HA Solution Set

[Diagram: Oracle MAA Best Practices]
• Unplanned Downtime
  • System Failures: Real Application Clusters
  • Data Failures: ASM, Flashback, RMAN & Oracle Secure Backup, H.A.R.D, Data Guard, Streams
• Planned Downtime
  • System Changes: Online Reconfiguration, Rolling Upgrades
  • Data Changes: Online Redefinition

Data Guard Configuration

• Managed as a single configuration
• Primary and standby databases can be Real Application Clusters or single-instance Oracle
• Up to nine standby databases supported in a single configuration

[Diagram: Primary Database at the Primary Site, Standby Databases at Standby Site A and Standby Site B, with the Broker]

Data Guard – DR and Beyond

1. High data availability
2. Comprehensive data protection
3. Efficient systems utilization

[Figure: Data Guard Utility Meter – Availability, Protection, Utilization]

High Data Availability

High Data Availability requirements: maintain high availability despite
• Server Failures
• Network Failures

Needed: a failover mechanism that is fast, automatic, and doesn’t lose data

[Diagram: role transition following a disaster / outage – Primary Site A and Standby Site B become Standby Site A and Primary Site B]

Comprehensive Data Protection

Comprehensive Data Protection requirements: provide bullet-proof protection from
• Storage Failures
• Site Failures
• Data Corruptions
• Operator Errors

[Diagram: Primary Site A, Standby Site B]

Efficient Systems Utilization

Efficient Systems Utilization requirements: standby resources should be used productively by
• Administrators for planned maintenance operations
• End-users for application access

[Diagram: Primary Site A, Standby Site B, with Administrators and End-users accessing the standby]

1. High Data Availability, from:
• Server Failures
• Network Failures

Data Guard – Combined High Availability and Disaster Recovery

• Provided through the Fast-Start Failover feature
• Data Guard automatically fails over to designated standby database
• Standby can become a primary in a few seconds
• Application clients may also be automatically reconnected to the new primary database
• No manual intervention, no data loss
• Protection from disasters / outages

Fast-Start Failover

[Diagram sequence: Primary Site, Standby Site, and Observer]
1. Data Guard in steady state – transmitting redo
2. Observer monitoring state of the configuration
3. Disaster strikes the primary – connections lost
4. Observer <=> primary connection times out (timeout threshold configurable)
5. Observer asks target standby if it is ready to fail over
6. Observer begins Fast-Start Failover
7. Target standby automatically becomes new primary
8. After old primary is repaired, Observer re-establishes connection
9. Observer automatically reinstates old primary to be a new standby
10. Redo transmission starts from new primary to new standby

Fast-Start Failover – How Fast?

[Figure 2: Fast-Start Failover Test Results. Bar chart of average failover time in seconds (axis 0 to 25) for Physical Standby and Logical Standby, Single Instance and RAC]

From MAA paper: Fast-Start Failover Best Practices, http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

Fast-Start Failover – Operational Tips

• Requires:
  • Data Guard Broker to be enabled
  • Maximum Availability protection mode
  • Flashback Database to be enabled for auto-reinstate
• Occurs during:
  • Network / cluster failures
  • Shutdown abort / datafiles offline
• Best Practices
    Place Observer in the same network segment as the middle tiers
    Monitoring – FS_FAILOVER_STATUS in v$database
    Set DB_FLASHBACK_RETENTION_TARGET to a minimum of 60 mins
    With Grid Control Agent installed, Observer can be automatically restarted if the Observer process ever stops
    Possible to configure multiple Observers on the same server, each monitoring its own Data Guard configuration
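
The monitoring and flashback tips above can be sketched in SQL as follows. This is a minimal illustration that assumes Oracle Database 10g Release 2 and an instance started with an spfile (needed for SCOPE = BOTH); enabling the Broker and Fast-Start Failover itself is done through the Broker interface and is not shown.

```sql
-- Keep at least 60 minutes of flashback history (the value is in minutes),
-- so that the old primary can be reinstated automatically after a failover.
ALTER SYSTEM SET db_flashback_retention_target = 60 SCOPE = BOTH;

-- Confirm that Flashback Database is enabled (expected value: YES).
SELECT flashback_on FROM v$database;

-- Monitor Fast-Start Failover readiness from the primary or the standby.
SELECT fs_failover_status,
       fs_failover_current_target,
       fs_failover_threshold,
       fs_failover_observer_present,
       fs_failover_observer_host
FROM   v$database;
```

A monitoring job could poll the last query and raise an alert when the status shows the standby is no longer synchronized or the Observer is not present.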

2. Comprehensive Data Protection, from:
• Storage failures
• Site failures
• Data corruptions
• Operator errors

Data Guard: Basic DR (Of Course …)

[Diagram, four panels: (1) Standby Site, Primary Site; (2) Standby Site, Primary Site; (3) New Primary Site, Old Primary Site; (4) Primary Site, Standby Site]

Data Corruption Protection by Data Guard

• Faulty system component could physically corrupt data files / redo log files / control file, affecting primary database operations
  • Any component can fail: file system, volume manager, device driver, host bus adapter, storage controller, disk drive
  • Remember the earlier real-life disaster example?
• Data Guard protection
  • Robust checks and balances in place to ensure physical data corruptions on primary database do not affect standby database
• Some real-life examples follow …

Data Protection in Action – Example 1

• Example: Standby Redo Log corrupted; the Archiver detects it on the standby database, and the corruption does not spread

    *** 2005-04-04 20:33:24.670
    Archiving standby database
    Selected standby logfile…
    Corrupt redo block 17457 detected: bad checksum
    Seq: 0x00000bfb Block: 0x00004431 Time: 554759965 Beg: 0x10 Cks: 0x8deb
    *** 2005-04-04 20:33:25.467
    ARC0: All Archive destinations made inactive due to error 354
    *** 2005-04-04 20:33:25.467
    ORA-00354: corrupt redo log block header
    ORA-00353: log corruption near block 17457 change 0 time 04/04/2005 19:59:25
    ORA-00312: online log 11 thread 1: 'D:\REDO01.DBF'
    *** 2005-04-04 20:33:25.498
    ARC0: Archiving not possible: error count exceeded
    ORA-16038: log 11 sequence# 0 cannot be archived

Data Protection in Action – Example 2

• Example: archivelog corrupted, and Redo Apply on standby database detects it and stops applying

    Tue Jun 22 18:25:29 2004
    Errors in file ora_3506.trc:
    ORA-01115: IO error reading block from file 480 (block # 67249)
    ORA-01110: data file 480: ‘replication_01.dbf’
    ORA-27091: skgfqio: unable to queue I/O
    ORA-27072: skgfdisp: I/O error
    Linux Error: 14: Bad address
    Additional information: 67248
    ORA-00368: checksum error in redo log block
    ORA-00353: log corruption near block 281856 change 5682682353914 time 06/22/2004 16:18:43
    ORA-00334: archived log: ‘redolog_01.arc’
    …
    Media Recovery failed with error 368

How Does Data Guard Ensure Data Protection?

• Data Guard: a loosely coupled architecture
    Standby databases kept synchronized through redo blocks, completely detached from possible datafile corruptions on primary
    In some redo transport configurations, redo is shipped from primary SGA, and thus detached from physical I/O corruptions on primary
    Software code-path executed on standby fundamentally different from that of primary – effective seclusion from software errors
• Corruption-detection checks at key interfaces
    Primary: during Redo Transport: LGWR, LNS, ARCH
    Standby: during Redo Apply: RFS, ARCH, MRP, LSP, DBWR
• If redo corruption detected on standby, Data Guard tries to re-fetch valid logs as part of archivelog gap handling
• Fundamental Principle: Primary database corruption should not affect the standby database

Data Corruption Protection: Operational Tips

• Two key parameters:
    db_block_checksum (OFF | TYPICAL | FULL)
    • Determines whether checksum computed, stored and verified for data & redo blocks
    db_block_checking (OFF | LOW | MEDIUM | FULL)
    • Semantic block checking for data blocks
• Recommended settings:

    Parameter            Primary    Standby
    db_block_checksum    TYPICAL    TYPICAL
    db_block_checking    MEDIUM     MEDIUM*

    * Check for possible impact
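
As a minimal sketch, the recommended settings could be applied and verified as shown below. It assumes Oracle Database 10g Release 2 (where these value names are accepted) and an instance started with an spfile; the impact of MEDIUM block checking on a standby should be measured before it is adopted.

```sql
-- Apply the recommended settings; run on the primary and on each standby.
-- SCOPE = BOTH changes the running instance and the spfile.
ALTER SYSTEM SET db_block_checksum = TYPICAL SCOPE = BOTH;
ALTER SYSTEM SET db_block_checking = MEDIUM  SCOPE = BOTH;

-- Verify the values currently in effect.
SELECT name, value
FROM   v$parameter
WHERE  name IN ('db_block_checksum', 'db_block_checking');
```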

Utilizing Data Guard upon Data Corruption in the Primary Database

• If data blocks critical to application functionality are corrupted
  • Perform Data Guard switchover / failover to standby
      Resumes application availability with a new valid primary database
      Corruption issues on new standby can be investigated offline
      Provides fastest predictable recovery time objective (RTO)
      Fast-Start Failover: no data loss, failover can be done in seconds
• If non-critical data blocks are corrupted
  • Perform RMAN Block Media Recovery (BLOCKRECOVER …) using a valid datafile backup from the physical standby database
      Used when a small number of blocks require media recovery and the blocks that need recovery are known
      Affected datafiles will be online (except corrupt blocks)
  • Perform RMAN restore and recovery using valid datafiles from the physical standby database
    • When entire datafiles have been corrupted
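
Block media recovery requires knowing which blocks are corrupt. The query below is a hedged sketch of one way to list them and map them to segments; it assumes an RMAN backup or BACKUP VALIDATE has already populated v$database_block_corruption and that the database is open so dba_extents can be queried. Corrupt blocks in free space will not match any extent.

```sql
-- List corrupt block ranges recorded by RMAN and the segments they fall in.
SELECT e.owner,
       e.segment_name,
       e.segment_type,
       c.file#,
       c.block#,
       c.blocks,
       c.corruption_type
FROM   v$database_block_corruption c
       JOIN dba_extents e
         ON  e.file_id = c.file#
         -- The corrupt range overlaps the extent's block range.
         AND c.block# <= e.block_id + e.blocks - 1
         AND c.block# + c.blocks - 1 >= e.block_id
ORDER  BY c.file#, c.block#;
```

The output helps decide between the two paths on this slide: a few blocks in non-critical segments point to block media recovery, while damage to critical segments points to a switchover or failover.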

Utilizing Standby Database To Recover From Logical Corruptions

• Logical corruptions may result from running bad scripts, inadvertent deletes, incorrect updates, etc.
• Database may be operational without any database errors, but application behavior may be impaired
• Solution: use standby database to recover with minimal production downtime
  • Use Flashback Database on standby database to revert to a known good state, open up standby, import data back into primary database
• Apply process on standby database may also be run in a delayed mode such that standby may not be affected at all by these logical corruptions
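
The two approaches above can be sketched as follows. This is an illustration under stated assumptions: Flashback Database is already enabled on a mounted physical standby, the timestamp is a made-up example, and destination number 2 and the service name stby_tns are hypothetical. Other attributes a real destination needs are omitted.

```sql
-- Option A: rewind the physical standby to a point before the bad change.
-- Run on the standby.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

FLASHBACK DATABASE TO TIMESTAMP
  TO_TIMESTAMP('2006-10-24 09:30:00', 'YYYY-MM-DD HH24:MI:SS');

-- Open read-only so the known-good data can be extracted
-- and loaded back into the primary.
ALTER DATABASE OPEN READ ONLY;

-- Option B: delayed apply, so the standby lags the primary.
-- Run on the primary; DELAY is expressed in minutes (240 = 4 hours).
ALTER SYSTEM SET log_archive_dest_2 = 'SERVICE=stby_tns DELAY=240' SCOPE = BOTH;
```

With Option B, redo still reaches the standby promptly, so no data is lost; only its application is postponed, which leaves a window to stop apply before a logical error is replayed.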

Operational Errors: Another Real-life Disaster

• Large telecom company
  • Multi-TB customer service production database serving millions of customers, set up with online redo logs that were not multiplexed
  • Had at least one large table (over 1 billion rows)
  • The only available online redo log was corrupted
  • Database instance was shut down
  • When an attempt was made to open it:

      ksedmp: internal or fatal error
      ORA-00600: internal error code, arguments: [2662], [1965], [349730312], [1965], [349743443], [666894377], [], []

      2662: a data block SCN ahead of the current SCN, possibly due to some physical corruption

• The production database had no standby database

Operational Errors … contd.

• Restoring an old backup and recovering was estimated to take 1 day+
• Managed to open the database in degraded mode (1/3rd of apps)
• Several other corruptions noticed
• Used various means to try identifying corrupt blocks
    dbms_repair.check_object
    dbms_space_admin.tablespace_verify
    DBVERIFY
    (RMAN) BACKUP VALIDATE
    ANALYZE TABLE … VALIDATE STRUCTURE CASCADE ONLINE
• Final decision: rebuild the database
    Corruption spread was extensive
    Performance problems running corruption checks on production
• After several days, managed to build a new database using a combination of transportable tablespace and log mining

Operational Error – Lessons Learned!

• If they had a Data Guard standby database
  • Could have simply switched over / failed over to the standby
  • Application would have been online in a few minutes
  • Corruption diagnosis and repair could have been done offline
• Having your mechanic do an engine repair on your car while you are driving on the freeway is a BAD idea
  • Most car service dealerships offer you a rental car for a reason!

3. Efficient Systems Utilization

Administrator utilization
• Rolling database upgrades
• Migrate data centers, SANs, platforms
• Use physical standbys for backups
• Cloning / testing of production workload

End-user utilization
• Use logical standbys for apps, reporting, read-access

[Figure: Data Guard Utility Meter – Utilization: Planned Maintenance, Application Access]

SQL Apply – Rolling Database Upgrades

Applies to: Major Release Upgrades, Patch Set Upgrades, Cluster Software & Hardware Upgrades

[Diagram, four steps between databases A and B]
1. Initial SQL Apply configuration: clients on A, redo shipped from A to B, both at Version X
2. Upgrade node B to X+1: logs queue on A, which stays at X
3. Run in mixed mode to test: A at X, B at X+1, redo flowing
4. Switchover to B, upgrade A: both at X+1, redo flowing

Other Planned Maintenance Activities Possible with Data Guard

• Data Guard: very effective way to migrate data centers / SANs
  • Create standby databases without any production impact
  • Keep them synchronized automatically
      Possible to use incremental backups – see MetaLink Note 290814.1
      For expediting standby sync-up, refer to max_connections attribute
  • Before the cut-off time, perform Data Guard switchover
• Switchover: another effective way to do configuration changes with minimal downtime
  • Hardware changes on primary server: Data Guard is agnostic of underlying server / storage system
  • Migration to RAC / ASM
• Selected platform migration – between:
  • HP-UX PA-RISC and HP-UX Itanium (see MetaLink Note 395982.1)
  • Linux distributions (e.g. RedHat and Suse) (on the same platform)
  • AMD-64 and Intel-64 (same OS)
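
The switchover step that these migrations rely on can be sketched as below for a physical standby managed with SQL rather than the Broker. The destination number 2, the service name stby_tns, and the MAX_CONNECTIONS value of 4 are hypothetical illustrations, and other destination attributes are omitted. Depending on release and configuration, instance restarts are needed between some of these statements.

```sql
-- Optional, on the primary: allow several archiver connections to the
-- standby to speed up catch-up before the cut-over.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby_tns ARCH MAX_CONNECTIONS=4' SCOPE = BOTH;

-- On the current primary: check readiness, then give up the primary role.
SELECT switchover_status FROM v$database;  -- e.g. TO STANDBY or SESSIONS ACTIVE
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;

-- On the target standby: check readiness, then take over the primary role.
SELECT switchover_status FROM v$database;  -- e.g. TO PRIMARY
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE OPEN;
```

Because a switchover is a planned role reversal with no data loss, the old primary remains available as a standby once it is restarted in mount state and Redo Apply is started on it.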

Offload Backups to Physical Standby

• Oracle Recovery Manager (RMAN) integrated with Data Guard – backups can be offloaded to physical standby
  • On production server: saves processing cycles, no impact
  • Backups can be done while physical standby in recovery / open read-only
  • Backups can be used by primary / other standbys
  • Standbys can be created from RMAN backups with no primary downtime
• Operational best practices for backups on standby
  • Flash Recovery Area configuration (see MetaLink Note 331924.1):
      Primary: CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON STANDBY;
      Standby: CONFIGURE ARCHIVELOG DELETION POLICY TO NONE;
  • Use identical directory structures
  • Use RMAN recovery catalog
  • Use spfile for primary and standby databases

Refer to “Using Recovery Manager with Oracle Data Guard in Oracle Database 10g” http://www.oracle.com/technology/deploy/availability/techlisting.html

Physical Standby for Cloning / Testing

• A Physical Standby can be opened read/write for development, reporting, or testing purposes, and then flashed back to be a physical standby once again
  • When flashed back, Data Guard automatically synchronizes the standby with the primary

Cloning / Testing … contd.

• Operational Steps
    Set up Flash Recovery Area on the physical standby
    Cancel Redo Apply
        create restore point pre_clone guarantee flashback database;
    Defer primary database connection to this standby
        alter database activate standby database;
    Open the “physical standby”
    Perform testing / reporting
        flashback database to restore point pre_clone;
        alter database convert to physical standby;
    Resume Redo Apply
    Reconnect primary to standby
• Excellent way to do testing on a production workload without impacting the production database
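
The operational steps above, including the primary-side statements the slide only names, might look like the following end-to-end sketch. Destination number 2 is an assumption about which archive destination points at this standby, and the comments mark where an instance restart to MOUNT state is typically needed; the exact restart points depend on the release.

```sql
-- On the physical standby (Flash Recovery Area already configured):
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
CREATE RESTORE POINT pre_clone GUARANTEE FLASHBACK DATABASE;

-- On the primary: archive the current log, then defer shipping to this standby.
ALTER SYSTEM ARCHIVE LOG CURRENT;
ALTER SYSTEM SET log_archive_dest_state_2 = DEFER;

-- On the standby: activate it and open it read/write.
-- (A restart to MOUNT state may be required before the OPEN.)
ALTER DATABASE ACTIVATE STANDBY DATABASE;
ALTER DATABASE OPEN;

-- ... perform testing / reporting ...

-- On the standby, restarted in MOUNT state: discard the test changes.
FLASHBACK DATABASE TO RESTORE POINT pre_clone;
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

-- On the standby, after another restart to MOUNT state: resume Redo Apply.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

-- On the primary: resume shipping redo to the standby.
ALTER SYSTEM SET log_archive_dest_state_2 = ENABLE;
```

While the standby is activated it provides no disaster protection, so the guaranteed restore point and the length of the test window deserve attention.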

Scale-out with Logical Standby for Web-App

[Diagram: Server Farm approach]
• Primary Database
• Logical standbys: scaling out read access (web content browsing)
• Physical standby, sync transport, for DR

Offload Applications to Logical Standby

• Use Logical standby to offload applications from production database and save processing cycles
• Applications well-suited: those that do a lot of processing with read-only data, and produce interim data-sets that are not mission-critical enough to be disaster-protected
• Examples:
    A billing application that does not cause key database updates
    An application that schedules site visits for tech support personnel
    An ETL tool that feeds data into a Data Warehousing application
• Operational best practices
    Change GUARD setting: alter database guard standby;
    These apps may even be restricted to access only the logical standby
    If logical standby is RAC, the non-Apply nodes should be utilized for these apps
    Watch out for unsupported data types

Offload Reporting to Logical Standby

• Similar to offloading application processing
• Reporting apps may have certain special requirements
  • Reporting may have to be on the latest data (real-time reports)
      Use LGWR SYNC for redo transport
      Use Real-Time apply
          alter database start logical standby apply immediate;
  • Reporting may require local write-access to summary tables
      If these tables also exist on primary, they need to be skipped
          Stop apply
          Skip apply on these tables (use dbms_logstdby.skip)
          Start apply
          Change GUARD setting: alter database guard standby;
  • Reporting apps may require additional indexes / materialized views
      Stop apply
      Disable guard for the session (alter session disable guard;)
      Create index
      Enable guard for the session (alter session enable guard;)
      Start apply
• For RAC standby, the non-Apply nodes should be utilized for reporting
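
The skip and index steps listed above can be sketched as one session on the logical standby. The schema, table, and index names (REPORTS.SALES_SUMMARY, SALES.ORDERS, orders_cust_rpt_idx) are hypothetical and only illustrate the pattern.

```sql
-- Stop SQL Apply before changing skip rules or adding local objects.
ALTER DATABASE STOP LOGICAL STANDBY APPLY;

-- Skip redo for a summary table that reports maintain locally.
BEGIN
  DBMS_LOGSTDBY.SKIP(stmt        => 'DML',
                     schema_name => 'REPORTS',
                     object_name => 'SALES_SUMMARY');
  DBMS_LOGSTDBY.SKIP(stmt        => 'SCHEMA_DDL',
                     schema_name => 'REPORTS',
                     object_name => 'SALES_SUMMARY');
END;
/

-- Add a reporting-only index on a replicated table.
ALTER SESSION DISABLE GUARD;
CREATE INDEX sales.orders_cust_rpt_idx ON sales.orders (customer_id);
ALTER SESSION ENABLE GUARD;

-- Let users write to non-replicated tables, then restart real-time apply.
ALTER DATABASE GUARD STANDBY;
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
```

Skipping both DML and schema DDL keeps SQL Apply from touching the local summary table, while the STANDBY guard setting still protects every table that SQL Apply maintains.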

Data Guard – Defining the Next Era in Data Availability & Data Protection

1. High data availability
     Integrated high availability through Fast-Start Failover
2. Comprehensive data protection
     Protection from data corruptions
     Protection from operational errors
3. Efficient systems utilization
     Rolling database upgrades
     Migrate data centers, SANs, platforms, etc.
     Use physical standbys for backups
     Cloning / testing of production workload
     Use logical standbys for apps, reporting, read-access



Amazon.com – Beyond Custom Physical Standbys

Amazon.com – Life before Data Guard

• Major HA & DR requirements:
  • Guaranteed no data loss
  • Failovers should be as quick as possible
  • No impact on primary database’s availability due to standby failures
• Amazon built scripts and programs to automatically maintain physical standby databases
  • Automatically copy, apply and archive redo
  • Monitor, respond to, and raise alarms on problems
  • Failover/Switchover is a manual process

Amazon.com – Life before Data Guard

• Issues with building a custom standby solution
  • Complexity of code to handle all failure cases
  • Solutions tied to site-specific configuration of Oracle
  • Compatibility with newer releases of Oracle and the Operating System
  • Unique solutions require unique training for new DBAs
  • Longer than desired switchover
  • Primary is unaware of the standby and doesn’t care if it is current
  • Does not scale to a large number of databases

Amazon.com – Supporting Physical Standbys using Data Guard

• Immediate replacement for custom recovery solution
  • 10gR1 Data Guard with Maximum Performance (ARCH transport method)
• Data Guard is superior to our home-grown solution
  • Switchover in minutes
  • Changes are pushed as they are generated
  • Data Guard handles redo shipping and gap recovery
  • Additional push/apply methods are available based on your requirements
  • Some monitoring and recovery agents still required
  • Works across any environment that supports Oracle

Amazon.com – Improving Availability using Fast-Start Failover in 10gR2

• Fast-Start Failover meets Amazon’s availability requirements
  • Now possible to do very fast failovers without any data loss
  • Fast-Start Failover will not commence if target standby is not synchronized with the primary
  • Synchronization status shown through fs_failover_status column in v$database
  • Since Fast-Start Failover is based on Maximum Availability, primary database not impacted upon standby failures
  • Old primary automatically reinstated as new target standby

Amazon.com – Data Guard Wish List

• Increasing capacity with readable standbys
  • Physical standbys that could support Read-Only operations during apply would become part of the application architecture and not simply part of the infrastructure
• Increasing options with support for low bandwidth WANs
  • The ability to support standbys in remote locations connected by low bandwidth networks would extend the reach of the infrastructure and remove distance as a barrier


Amazon.com – Summary Assessment

• Data Guard has fundamentally changed the way we look at and manage standby databases

• Fast-Start Failover has the potential to increase availability by an order of magnitude

• If the improvements in standby management provided in 10g are any indication, we can’t wait to get our hands on 11g


Data Guard Sessions from Oracle Development at Oracle OpenWorld

1. Monday 4:45pm - Session S281211 - Oracle Data Guard Customer Experiences: Practical Lessons, Tips, and Techniques, Moscone South 102

2. Tuesday 10:45am - Session 281207: Next Generation Oracle Database Availability: A Sneak Preview, Moscone West 2009-2011

3. Tuesday 1:45pm - Session 281212: Oracle Data Guard: Defining the Next Era in Data Availability and Data Protection, Moscone South 305

4. Wednesday 11:30am - Session 281210: Oracle Data Guard Tips and Tricks: Direct from Oracle Development, Moscone South 104

5. Thursday 12:30pm - Session 281208: MAA Best Practices: Building a Highly Available and Disaster-Proof Architecture, Using Data Guard, Oracle RAC, Automatic Storage Management, and Flashback, Moscone South 102

6. Thursday 2:00pm - Session 281209: MAA Best Practices: Reducing Downtime for Planned Maintenance Operations Using Oracle Database 10g, Moscone South 304

For More Information

Search for “Data Guard” at http://search.oracle.com

or

http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html

and

http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
