Emergency Database Failover: Impacts & Recovery Plan

Emergency Database Failover: Impacts & Recovery Plan Trey Felton – ERCOT IT


Page 1

Emergency Database Failover: Impacts & Recovery Plan

Trey Felton – ERCOT IT

Page 2

Synopsis

[Diagram labels: Market DB (Taylor); Market DB Physical Standby (Austin); Logical Standby (RSS); ISM (EDW); LodeStar; Paperfree; Siebel]

ISM – Information Services Master Database
DB – Database
EDW – Electronic Data Warehouse

Page 3

Synopsis

[Diagram labels: Market DB (Taylor); Market DB Physical Standby (Austin); Logical Standby (RSS); ISM (EDW); LodeStar; Paperfree; Siebel. Annotations: Failover; Out of synch (24 hrs)]

– Emergency DB failover on April 21st, 2008
  • Market DB (which feeds ISM) became unresponsive
    – Data could not be written/read
– Synchronization issues caused a 24 hr gap in data
  • Propagated through to ISM
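The propagation described on this slide can be sketched as a small feed graph. This is only an illustration, not ERCOT's tooling: the chain Market DB to Logical Standby (RSS) to ISM, and the Physical Standby of the Market DB, come from the slides, while the node names, the "other_source" stand-in for the other databases feeding ISM, and the "extracts" node are assumptions made for the example.

```python
from collections import deque

# Feed edges: each key delivers data to the systems in its list.
# market_db -> logical_standby_rss -> ism is stated on the slides;
# "other_source" and "extracts" are hypothetical placeholder nodes.
FEEDS = {
    "market_db": ["logical_standby_rss", "physical_standby"],
    "logical_standby_rss": ["ism"],
    "other_source": ["ism"],
    "ism": ["extracts"],
}


def downstream(start, feeds):
    """Return every system reachable from `start` by following feed edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in feeds.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Systems that inherit a data gap once the Logical Standby falls out of synch.
affected = downstream("logical_standby_rss", FEEDS)
print(sorted(affected))
```

Under these assumptions, a gap at the Logical Standby reaches ISM and anything built from it, while the Physical Standby is not downstream of the Logical Standby and is unaffected by that gap.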

Page 4

Synopsis

[Diagram labels: Market DB (Taylor); Market DB Physical Standby (Austin); Logical Standby (RSS); ISM (EDW); LodeStar; Paperfree; Siebel. Annotations: Failover; Source Data; Re-created ISM (for Recovery); Recovered Extracts]

– Physical Standby brought online
– ISM rebuilt through Source data to recover affected extracts

Page 5

Impacts

• Impacts:
  – Market transactions were prevented from updating ISM through Logical Standby
    • Market DB utilizes a standby to prevent outages / performance degradations
  – Logical Standby (RSS) became out of synch with Physical Standby by 24 hrs
    • April 21 at 10:44am through April 22 at 11:14am
    • Other DBs feeding ISM continued normally (only Market DB was out of synch)
  – Priority of rebuild led to the Standby being rebuilt before the RSS
    • Market DB has to be kept up
    • This prolonged the outage to the EDW and affected extracts
  – Prices had to be recalculated and extracts restored from Source
    • Price adjustments for NSRS were completed June 5th
    • Missing extracts for April 21 - April 30 were completed on July 1st

• Why did recovery take so long?
  – ISM generates 25-35 GB of data per day
  – Data was restored from Source back to April 1st
    • 120 Terabytes had to be restored in order to roll forward through the transaction gap
    • Archive log changes were applied during the 24-hour gap
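As a rough cross-check of the figures on this slide, the gap length and the volume of daily ISM data between April 1st and the end of the gap can be worked out directly. This is a back-of-the-envelope sketch: the year 2008 is taken from the failover date, counting both end dates as full days is an assumption, and the variable names are illustrative only.

```python
from datetime import date, datetime

# The two timestamps quoted on the slide; sorted so their order does not matter.
stamps = [datetime(2008, 4, 22, 11, 14), datetime(2008, 4, 21, 10, 44)]
gap_start, gap_end = min(stamps), max(stamps)
gap = gap_end - gap_start

# Days of data between the restore point (April 1st) and the end of the gap,
# counting both end dates (assumption).
restore_from = date(2008, 4, 1)
days_replayed = (gap_end.date() - restore_from).days + 1

# Daily ISM volume range from the slide, in gigabytes.
low_gb, high_gb = 25, 35
replay_low_gb = days_replayed * low_gb
replay_high_gb = days_replayed * high_gb

print(f"gap length: {gap}")
print(f"days replayed: {days_replayed}")
print(f"daily-increment volume: {replay_low_gb}-{replay_high_gb} GB")
```

The daily increments alone come to well under a terabyte, so the 120 terabyte figure evidently covers more than those increments; the slide does not break it down further.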

Page 6

Emergency Database Failover

• All data was restored with 100% accuracy

• The affected market systems that caused the April failure:
  • Run the balancing energy and ancillary services markets
  • Are not used for wholesale batch or the retail markets

• ERCOT considers this to be an isolated incident and not a systemic problem

Page 7

Going Forward

• Actions to prevent future occurrences:
  – Nodal market DBs will utilize newer hardware
    • More fault tolerance
    • Redundancy
  – Change of architecture in the replication process for Nodal
    • Proof of Concept recently introduced into the Nodal market systems
    • Testing underway
  – ERCOT is conducting a risk/cost analysis of several options for these Zonal systems
    • To be presented to TAC in August
  – New Backups / Recovery Procedures
    • Project initiated to stabilize our database backup procedures
    • Shorter recovery time

Page 8

Data Recovery

NOTICE DATE: July 1, 2008
NOTICE TYPE: W-A042308-48 UPDATE Extracts - Wholesale
CLASSIFICATION: Public
SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
INTENDED AUDIENCE: QSEs
DAY AFFECTED: April 21 through April 30, 2008
LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April 30. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts. A market notice will be sent when the extracts are expected to be posted.

Act_Res_Output
Ancillary_Services_Daily
Bids_and_Schedules_Daily
Forecast_Data_Daily
Market_Information_Daily
Sched_and_Actual_Load
Self_Sch_Energy_Services
ASDEPLOYMENTS