Emergency Database Failover: Impacts & Recovery Plan
Transcript of Emergency Database Failover: Impacts & Recovery Plan
Trey Felton – ERCOT IT
Synopsis

[Architecture diagram: Market DB (Taylor), Logical Standby (RSS), ISM (EDW), Market DB Physical Standby (Austin), LodeStar, Paperfree, Siebel]

ISM – Information Services Master Database; DB – Database; EDW – Electronic Data Warehouse
Synopsis

[Same architecture diagram, annotated with the failover to the Physical Standby and the 24 hr out-of-synch condition on the Logical Standby]

– Emergency DB failover on April 21st, 2008
  • Market DB (which feeds ISM) became unresponsive; data could not be written or read
– Synchronization issues caused a 24 hr gap in data
  • Propagated through to ISM
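The out-of-synch condition above amounts to the standby's last-applied data falling far behind the primary. A minimal sketch of that check, assuming the two timestamps are available to a monitoring script (the function names and the one-hour alert threshold are illustrative, not ERCOT's actual tooling):

```python
from datetime import datetime, timedelta

def replication_gap(primary_now: datetime, standby_last_applied: datetime) -> timedelta:
    """How far the standby's applied data trails the primary."""
    return primary_now - standby_last_applied

def out_of_synch(gap: timedelta, threshold: timedelta = timedelta(hours=1)) -> bool:
    """Flag a gap that exceeds the alerting threshold (threshold is illustrative)."""
    return gap > threshold

# A gap on the order of the one described in these slides (~24 hrs):
gap = replication_gap(datetime(2008, 4, 22, 11, 14), datetime(2008, 4, 21, 10, 44))
print(gap, out_of_synch(gap))
```

Any gap larger than the replication pipeline can absorb means downstream consumers (here, ISM) see stale or missing data until the standby is resynchronized.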
Synopsis

[Same architecture diagram, annotated: Failover; Re-created ISM (for Recovery); Recovered Extracts; Source Data]

– Physical Standby brought online
– ISM rebuilt from Source data to recover the affected extracts
Impacts

• Impacts:
  – Market transactions were prevented from updating ISM through the Logical Standby
    • Market DB utilizes a standby to prevent outages / performance degradation
  – Logical Standby (RSS) became out of synch with the Physical Standby by 24 hrs
    • As of April 22 at 11:14am, RSS data was current only through April 21 at 10:44am
    • Other DBs feeding ISM continued normally (only Market DB was out of synch)
  – Priority of rebuild led to the Physical Standby being rebuilt before the RSS
    • Market DB had to be kept up
    • This prolonged the outage to the EDW and the affected extracts
  – Prices had to be recalculated and extracts restored from Source
    • Price adjustments for NSRS were completed June 5th
    • Missing extracts for April 21 – April 30 were completed on July 1st
• Why did recovery take so long?
  – ISM generates up to 25-35 GB of data per day
  – Data was restored from Source back to April 1st
    • 120 terabytes had to be restored in order to roll forward through the transaction gap
    • Archive log changes were applied across the 24-hour gap
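The recovery-duration point can be made concrete with a back-of-envelope calculation. Only the 120 TB volume comes from the slides; the 200 GB/h sustained restore rate below is a hypothetical assumption for illustration, not an ERCOT figure:

```python
# Back-of-envelope: how long a 120 TB restore takes at an assumed rate.
# The 120 TB figure is from the slides; the throughput is hypothetical.
restore_gb = 120 * 1024               # 120 TB expressed in GB
assumed_rate_gb_per_hour = 200        # hypothetical sustained restore rate
restore_hours = restore_gb / assumed_rate_gb_per_hour
restore_days = restore_hours / 24
print(f"{restore_days:.1f} days of pure restore time at 200 GB/h")
```

Even before applying the archive log changes across the 24-hour gap, weeks of wall-clock time go to the bulk restore alone, which is consistent with the June/July completion dates above.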
Emergency Database Failover

• All data was restored with 100% accuracy
• The market systems involved in the April failure:
  – Run the balancing energy and ancillary services markets
  – Are not used for wholesale batch or the retail markets
• ERCOT considers this to be an isolated incident and not a systemic problem
Going Forward

• Actions to prevent future occurrences:
  – Nodal market DBs will utilize newer hardware
    • More fault tolerance
    • Redundancy
  – Change of architecture in the replication process for Nodal
    • Proof of Concept recently introduced into the Nodal market systems
    • Testing underway
  – ERCOT is conducting a risk/cost analysis of several options for these Zonal systems
    • To be presented to TAC in August
  – New backup / recovery procedures
    • Project initiated to stabilize our database backup procedures
    • Shorter recovery time
Data Recovery

NOTICE DATE: July 1, 2008
NOTICE TYPE: W-A042308-48 UPDATE Extracts - Wholesale
CLASSIFICATION: Public
SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
INTENDED AUDIENCE: QSEs
DAY AFFECTED: April 21 through April 30, 2008
LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April 30. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts. A market notice will be sent when the extracts are expected to be posted.

Act_Res_Output
Ancillary_Services_Daily
Bids_and_Schedules_Daily
Forecast_Data_Daily
Market_Information_Daily
Sched_and_Actual_Load
Self_Sch_Energy_Services
ASDEPLOYMENTS