On-Time Product Delivery COPC - HPCC Best Practices 14-15 March 2011
On-Time Product Delivery
COPC - HPCC Best Practices
14-15 March 2011
Allan Darling
Deputy Director
NCEP Central Operations
Where America’s Climate, Weather, Ocean and Space Weather Services Begin
COPC HPCC Best Practices - 14-15 March 2011 2
On-Time Product Delivery
NCEP Mission
NCEP delivers science-based environmental predictions to the Nation and the global community. We collaborate with partners and customers to produce reliable, timely, and accurate analyses, guidance, forecasts and warnings for the protection of life and property and the enhancement of the national economy.
NCEP Goals and Strategies
• Information Systems
– Enhance the real-time, on-time, all-the-time access, display and delivery of NCEP products and services.
On-Time Product Delivery
The principal performance metric for NCEP Operational Supercomputing, measured since 1999
Underlying Philosophy
Product delivery is the last event in the whole modeling process. To deliver on time, the entire chain of events must work as intended.
One Measurement of Operational Success
Incentives for Capability
Measurement Area | Indicator | 2010 Baseline | Calendar Year 2010 | Comments
Customer Results | 1-day Precipitation Forecast threat score | 29 | 35 |
Customer Results | Seasonal Heidke Temperature skill score | 19 | 18 |
Mission and Business Results | 48-Hour Hurricane Tracking Forecast | 142 miles | 95 nm* |
Mission and Business Results | 48-hr Hurricane tracking intensity Forecast | 14 knots | 14.7* |
Processes and Activities | On Time Product Generation | 99.92% | 99.85% |
Technology | System Availability | 99% | 99.98% |
Technology | Time to Switch to Backup System | 30 min. | 9.6 min |
*The final outcomes of these measures are reported at the end of hurricane season, which is Nov. 30.
On-Time Product Delivery
Dual System CM & Ops Practice Refinement
Enabling the Capability
System Architecture
Technical Practice
High Availability Configuration Management
Operations Practices
On-Time Product Delivery
Metrics
Technical Practice
Measurement
• Products are “on time” if they are released within 15 minutes of their assigned target delivery time
• Target delivery times are based on 30-day average availability times of products
• Target delivery times are adjusted as needed
– Model changes
– System changes
• New products added as part of the model implementation process
• Timeliness measured for ~720,000 products today
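The on-time rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NCO's actual tooling: the function names are hypothetical, availability times are assumed to be recorded as minutes after the start of the model cycle, early products are assumed to count as on time, and a product exactly 15 minutes late is assumed to still be on time.

```python
from datetime import datetime, timedelta
from statistics import mean

# Threshold stated on the slide: released within 15 minutes of target.
ON_TIME_WINDOW = timedelta(minutes=15)


def target_minutes(availability_minutes):
    """Average of the most recent 30 daily availability times.

    Assumption: each entry is minutes after cycle start, oldest first.
    """
    return mean(availability_minutes[-30:])


def is_on_time(released, target):
    """True if the product was released no more than 15 minutes after target.

    Assumption: early releases count as on time and the bound is inclusive.
    """
    return released - target <= ON_TIME_WINDOW


def percent_on_time(deliveries):
    """deliveries: iterable of (released, target) datetime pairs."""
    deliveries = list(deliveries)
    on_time = sum(is_on_time(released, target) for released, target in deliveries)
    return 100.0 * on_time / len(deliveries)
```

With a target of 06:00Z, a release at 06:15Z counts as on time under these assumptions and a release at 06:16Z does not.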
Technical Practice
Measurement
• Some products are excluded from measurement
– Inconsistent delivery times (e.g. on-demand dispersion models)
– Not delivered through operational dissemination services
• Measurement performed daily at 1200Z
– Entire previous day
– First half of current day
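The two periods covered by the daily 1200Z measurement can be written out explicitly. The sketch below is an assumption about how the windows could be computed; the function name is hypothetical, the input is assumed to be a timezone-aware datetime, and each window is assumed to be half-open (start included, end excluded).

```python
from datetime import datetime, time, timedelta, timezone


def measurement_windows(run_at):
    """Return the two UTC windows covered by a daily measurement run.

    run_at must be timezone-aware. Each window is a (start, end) pair
    with the end excluded.
    """
    today = run_at.astimezone(timezone.utc).date()
    midnight_today = datetime.combine(today, time(0), tzinfo=timezone.utc)
    midnight_yesterday = midnight_today - timedelta(days=1)

    previous_day = (midnight_yesterday, midnight_today)  # entire previous day
    first_half = (midnight_today, midnight_today + timedelta(hours=12))  # 00Z to 12Z
    return previous_day, first_half
```

For a run at 1200Z on 3 March 2011, this yields all of 2 March plus 00Z to 12Z on 3 March.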
Operations Practice
• Daily Meeting to review:
– Operations log
– Status of open issues
– On time delivery metrics
– Calendar of planned events
• Weekly Meeting with HPC vendors to review:
– Facility and system status
– System utilization
– Vendor open issues
[Figure: Percent On-Time Product Creation for Stratus/Cirrus, 3/2/2011 to 3/3/2011. X axis: Time (GMT), 12:10:00 through 11:40:00 in 30-minute steps. Y axis: Percent, 0 to 100. Series: <15 Minutes, <10 Minutes, <5 Minutes.]
Percent less than 15 min delayed for Wednesday (00Z to 23:59Z): 95.1258
Percent less than 15 min delayed for Thursday (00Z to 12Z): 100
[Figure: On-time Product Delivery, February 2011. X axis: Date, 20110201 through 20110228. Y axis: Products Delivered On-time (%), 80 to 100. Series: Daily On-time, Month-to-date.]
Month-to-Date On Time: 99.334%
• 2/15/2011: 06Z HI and AK smoke products MISSING due to firewall problem that created comms loss; 06Z HIRESW products 85 minutes late due to time out from overloaded node
• 2/16/2011: 12Z GFS, 12Z GEFS, 12Z WAVE, 12Z HIRESW, 14Z RUC, 15Z SREF, 15Z/20Z RTMA, 16Z/17Z/18Z LAMP, 18Z NAM, and 18Z GFS products 15-150 minutes late due to CCS issues
• 2/17/2011: 06Z HIRESW and 06Z GFS storm surge files 18-36 minutes late due to landing on a problematic node
• 2/19/2011: 18Z GFS, 18Z HIRESW, and 18Z OMB files 15-42 minutes late due to GPFS resource contention
• 2/28/2011: 12Z ECMWF products 19 minutes late due to implementation glitch on FTP server
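The chart plots a daily on-time percentage alongside a month-to-date figure. The slides do not say how the month-to-date value is computed; the sketch below assumes it is the cumulative on-time count divided by the cumulative product count (a product-weighted ratio, not an average of daily percentages). The function name and the counts in the usage note are hypothetical.

```python
def month_to_date(daily_counts):
    """Compute daily and running month-to-date on-time percentages.

    daily_counts: list of (on_time, total) product counts, one per day,
    in date order. Returns a list of (daily_percent, mtd_percent) tuples.
    Assumption: month-to-date is cumulative on-time / cumulative total.
    """
    results = []
    cumulative_on_time = 0
    cumulative_total = 0
    for on_time, total in daily_counts:
        cumulative_on_time += on_time
        cumulative_total += total
        daily_percent = 100.0 * on_time / total
        mtd_percent = 100.0 * cumulative_on_time / cumulative_total
        results.append((daily_percent, mtd_percent))
    return results
```

For example, with made-up counts, month_to_date([(100, 100), (90, 100)]) gives a second-day daily value of 90.0 and a month-to-date value of 95.0, which is why a single bad day shows up as a sharp dip in the daily series but only a small step in the month-to-date series.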
[Figure: On-time Product Delivery, January 2011. X axis: Date, 20110101 through 20110131. Y axis: Products Delivered On-time (%), 98 to 100. Series: Daily On-time, Month-to-date.]
Month-to-Date On Time: 99.953%
• 1/28/2011: 00Z ECMWF data was corrupt or missing. All data was then resent. All data arrived 417 minutes late; 15Z SREF products and 18Z NAM products 15-32 minutes late due to silent failure of 15Z SREF
• 1/31/2011: 00Z UKMET products 88 minutes late to MISSING due to dataflow problems at UKMET
On-Time Product Delivery
Dual System CM & Ops Practice Refinement
CM Incentive
• Backup supercomputer implemented, with associated IT infrastructure and requirements
– Network between systems
– System configuration synchronization
– Coordinated model implementations
– Failover capability
Expectation – Better Performance
Reality – Greater Complexity
Configuration Management
• Ensure system integrity
• Weekly meeting to review executed and proposed changes
• Before change occurs…
– Validate and test
– Schedule appropriately
– Review and approve
– Communicate with customers
• After change occurs…
– Identify and communicate outcomes
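One way to picture the "before change occurs" steps is as a gate on a change record: the change is not executed until all four steps are done. The class below is a hypothetical illustration only; the slides do not describe the format of NCO's change requests or any tooling behind them.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical change record tracking the four pre-change steps."""
    title: str
    validated_and_tested: bool = False
    scheduled: bool = False
    reviewed_and_approved: bool = False
    customers_notified: bool = False

    def outstanding_steps(self):
        """Names of the pre-change steps that are not yet complete."""
        steps = {
            "Validate and test": self.validated_and_tested,
            "Schedule appropriately": self.scheduled,
            "Review and approve": self.reviewed_and_approved,
            "Communicate with customers": self.customers_notified,
        }
        return [name for name, done in steps.items() if not done]

    def ready_to_execute(self):
        """True only when every pre-change step has been completed."""
        return not self.outstanding_steps()
```

A request that has been tested and scheduled but not yet approved or communicated would report those two steps as outstanding and would not be ready to execute.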
Configuration Management
• Covers all NCO IT practice, not just supercomputers
• Includes NWS and other partners
• Full-time staff (primary and backup)
• Weekly tempo with daily tie-in to operations
On-Time Product Delivery
Dual System CM & Ops Practice Refinement
CM Evolution w/ On-time Feedback
[Figure: monthly percent on time, 2004 through 2010. X axis: month. Y axis: Percent on Time, 98.0 to 100.0. Labeled phases: First CM Attempt, CM Process Focus, CM Refinement.]
[Figure: RFC Classification and Rate of Problems. X axis: Week (5 through 51, then 1 through 7). Y-axis labels: Executed Changes; Problematic or withdrawn. Y-axis scales: 0 to 120 and 0 to 30. Legend: Accelerated Instability, Accelerated High Benefit, Routine, Significant, Major, Problematic, Withdrawn before Implementation, Withdrawn after Implementation. Chart annotations: GEFS; Q2; Q3; Q4; GFS; Q1; AQFS HI&AK; SSTOIQD; GENESIS (5); MAG (4); WSR_88d (2); GLOFS & NAEFS; NDFD & RTMA; Q2.]
Change Metrics
Last 12 months – 15 changes withdrawn out of 1004
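Expressed as a rate, the figure above works out to roughly 1.5% of changes withdrawn. A minimal sketch of the arithmetic (the function name is hypothetical):

```python
def withdrawal_rate(withdrawn, total):
    """Percentage of executed or proposed changes that were withdrawn."""
    return 100.0 * withdrawn / total


rate = withdrawal_rate(15, 1004)
print(f"{rate:.2f}% withdrawn, {100 - rate:.2f}% not withdrawn")
# prints "1.49% withdrawn, 98.51% not withdrawn"
```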
Ancillary Benefits
• Daily review
– Identifies performance problems before customers are affected
– Reveals silent failures
• Weekly & Monthly Reviews
– Identify system management gaps
– Identify model instability
On-Time Product Delivery
Yearly Average
2006: 99.42%
2007: 99.70%
2008: 99.82%
2009: 99.85%
2010: 99.83%
Questions / Discussion