SCD Update
Transcript of SCD Update
Supercomputing • Communications • Data
NCAR Scientific Computing Division
SCD Update
Tom Bettge
Deputy Director
Scientific Computing Division
National Center for Atmospheric Research
Boulder, CO USA
User Forum, 17-19 May 2005
NCAR/SCD
[Chart: NCAR/SCD position by year, 1990-2010 (scale 1-350), marking the 1996 procurement and the IBM POWER3 and IBM POWER4 systems]
Peak TFLOPs at NCAR
[Chart: peak TFLOPs (0-12), Jan 1997 - Jan 2005, for: IBM Opteron/Linux (pegasus), IBM Opteron/Linux (lightning), IBM POWER4/Federation (thunder), IBM POWER4/Colony (bluesky), IBM POWER4 (bluedawn), SGI Origin3800/128, IBM POWER3 (blackforest), IBM POWER3 (babyblue), Compaq ES40/32 (prospect), SGI Origin2000/128 (ute), HP SPP-2000/64 (sioux), CRI Cray C90/16 (antero), CRI Cray J90 series. Milestones: Cray C90/16, HP SPP2000, SGI Origin2000, blackforest WH-1, blackforest WH-2, ARCS Phase 1 (blackforest upgrade, SGI Origin3800), ARCS Phase 2 (bluesky), ARCS Phase 3 (bluesky expansion), IBM Linux]
SCD Update
- Production HEC Computing
- Mass Storage System
- Services
- Server Consolidation and Decommissions
- Physical Facility Infrastructure Update
- Future HEC at NCAR
News: Production Computing
- Redeployed SGI 3800 as Data Analysis engine
  - chinook became tempest
  - departure of dave
- IBM POWER3 blackforest decommissioned Jan 2005
  - Loss of 2.0 TFLOPs of peak computing capacity
- IBM Linux cluster lightning joined production pool March 2005
  - Gain of 1.1 TFLOPs of peak computing capacity
  - 256 processors (128 dual node configuration)
  - 2.2 GHz AMD Opteron processors
  - 6 TByte FastT500 RAID with GPFS
  - 40% faster than bluesky (1.3 GHz POWER4) cluster on parallel POP and CAM simulations
  - 3rd-party vendor compilers
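The quoted 1.1 TFLOPs gain is consistent with the hardware above: 256 Opterons at 2.2 GHz, assuming 2 floating-point operations per clock per processor (typical for that Opteron generation, but an assumption here, not a figure from the slides):

```python
# Peak-TFLOPs sanity check for lightning.
# flops_per_cycle = 2 is an assumed per-processor peak rate.
processors = 256
clock_hz = 2.2e9
flops_per_cycle = 2

peak_tflops = processors * clock_hz * flops_per_cycle / 1e12
print(peak_tflops)  # 1.1264, i.e. ~1.1 TFLOPs as quoted
```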
Resource Usage FY04
At the end of FY04, the combined supercomputing capacity at NCAR was ~11 TFLOPs.
Roughly 81% of that capacity was used for climate simulation and analysis (Climate & IPCC).
[Pie chart: Climate 50.8%, IPCC 30.4%, Weather Prediction 6.5%, Oceanography 5.2%, Astrophysics 3.3%, Atmospheric Chemistry 1.8%, Basic Fluid Dynamics 1.2%, Cloud Physics 0.4%, Upper Atmosphere 0.2%, Miscellaneous 0.2%]
bluesky Workload by Facility (April 2005)
[Bar chart: actual CPU hours and reserved CPU hours (0-500,000) and number of jobs (0-18,000) by facility: CSL, Comm, NCAR, Univ, CP]
Computing Demand
Science Driving Demand for Scientific Computing
Summer 2004: CSL Requests 1.5x Availability
Sept 2004: NCAR Requests 2x Availability
Sept 2004: University Requests 3x Availability
March 2005: University Requests 1.7x Availability
Computational Campaigns
- BAMEX: Spring 2003
- IPCC: FY 2004
- MMM Spring Real-Time Forecasts: Spring 2004
- WRF Real-Time Hurricane Forecast: Fall 2004
- DTC Winter Real-Time Forecasts: Winter 2004-2005
- MMM Spring Real-Time Forecast: Spring 2005
- MMM East Pacific Hurricane Formation: July 2005
bluesky 8-way LPAR Usage
[Chart: weekly utilization of the bluesky 8-way LPARs (% user, % system, % idle; 0-100%), 8/29 through 3/27]
bluesky 32-way LPAR Usage
[Chart: weekly utilization of the bluesky 32-way LPARs (% user, % system, % idle; 0-100%), 8/29 through 3/27]
Servicing the Demand: NCAR Computing Facility

SCD's supercomputers are well utilized...

Utilization             Apr '05   2004
Bluesky 8-way LPARs     94.6%     89%
Bluesky 32-way LPARs    95.8%     92%
Blackforest             -         82%
Lightning               48.0%     -

... yet average job queue-wait times† are measured in hours (was minutes in '04), not days.

Regular Queue       CSL      Community
Bluesky 8-way       43m      3h34m
Bluesky 32-way      1h02m    49m
Lightning           1m

† April 2005 average
Average bluesky Queue-Wait Times (HH:MM)

8-way LPARs
           University                            NCAR
           Jan '05  Feb '05  Mar '05  Apr '05    Jan '05  Feb '05  Mar '05  Apr '05
Premium    0:09     0:34     0:52     0:29       0:13     0:28     1:07     0:31
Regular    0:57     3:44     6:24     2:57       0:21     9:41     11:19    4:27
Economy    1:47     1:12     1:45     1:00       4:06     2:40     3:00     5:44
Stand-by   0:06     0:17     0:10     3:02       10:08    32:41    0:44     4:58

32-way LPARs
           University                            NCAR
           Jan '05  Feb '05  Mar '05  Apr '05    Jan '05  Feb '05  Mar '05  Apr '05
Premium    0:00     0:20     0:02     0:06       0:18     0:21     0:53     0:22
Regular    0:57     1:10     2:30     0:46       1:03     1:28     1:42     0:55
Economy    3:42     1:39     2:08     2:45       4:40     0:48     4:09     1:54
Stand-by   3:36     7:36     19:36    1:58       5:35     15:58    25:28    32:34
bluesky Queue Wait Times
- blackforest removed
- lightning charging did not start until March 1
- Corrective (minor) actions taken:
  - Disallow "batch" node_usage=shared jobs
    - Increases utility of the "share" nodes (4 nodes, 128 pes)
  - Shift the "facility" split (CSL/Community) from 50/50 to 45/55
    - More accurately reflects the actual allocation distribution
  - Reduce premium charge from 2.0x to 1.5x
    - Encourages use of premium if needed for critical turnaround
  - Have reduced NCAR 30-day allocation limit from 130% to 120%
    - Matches other groups (leveled playing field)
- SCD is watching closely...
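The effect of the premium-rate change can be sketched numerically. The formula below (CPU-hours times compute factor times queue multiplier) is an illustrative assumption, not the actual NCAR GAU charging algorithm; only the 2.0x and 1.5x premium rates come from the slides:

```python
# Hypothetical GAU charging sketch; the function name and structure
# are assumptions made for illustration only.
def gau_charge(cpu_hours, compute_factor, queue_multiplier):
    """Charge = CPU-hours x machine compute factor x queue multiplier."""
    return cpu_hours * compute_factor * queue_multiplier

# A 100 CPU-hour job at compute factor 1.0:
old_premium = gau_charge(100, 1.0, 2.0)   # 200.0 under the old 2.0x rate
new_premium = gau_charge(100, 1.0, 1.5)   # 150.0 under the new 1.5x rate
```

Under this sketch, the change cuts the cost of priority turnaround by a quarter, which is the incentive the slide describes.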
Average Compute Factor per GAU Charged
[Chart: weekly average compute factor (0.6-1.0), Jan 1 - May 1, 2005]
Mass Storage System: NCAR MSS Data Holdings
[Chart: total and unique terabytes stored (0-2500 TB), Jan 1997 - Jan 2005]
Mass Storage System
- Disk cache expanded to service files up to 100 MB
  - 60% of files this size being read from cache, not tape mount
- Deployment of 200 GB cartridges (previously 60 GB)
  - Now over 500 TB of data on these cartridges
  - Drives provide 3x increase in transfer rate
  - Full silo holds 1.2 PB; 5 silos hold 6 PB of data
- Users have recently moved to single-copy class of service (motivated by GAU compute charges)
- Embarking on project to address future MSS growth
  - Manageable growth rate
  - User management tools (identify, remove, etc.)
  - User access patterns / user education (archive selectively, tar)
  - Compression
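The cartridge and silo figures above are mutually consistent, as a quick back-of-the-envelope check shows (pure arithmetic, using decimal units where 1 PB = 1,000,000 GB):

```python
# Consistency check of the MSS capacity figures from the slides.
cartridge_gb = 200      # capacity of one new-generation cartridge
silo_pb = 1.2           # quoted capacity of a full silo
silos = 5

cartridges_per_silo = silo_pb * 1_000_000 / cartridge_gb
total_pb = silos * silo_pb

print(cartridges_per_silo)  # 6000.0 cartridges fill one silo
print(total_pb)             # 6.0 PB across five silos, as stated
```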
SCD Customer Support
- Consistent with SCD reorganization
- Phased deployment: Dec 2004 - May 2005
- Advantages:
  - Enhanced service: Computer Production Group 24/7
  - Effectively utilize other SCD groups in customer support
  - Easier questions handled sooner
  - Harder questions routed to correct group sooner
- Feedback plan

SCD will provide a balanced set of services to enable researchers to easily and effectively utilize community resources.
Server Decommissions
- MIGS (MSS access from remote sites)
  - Decommissioned April 12, 2005
  - Other contemporary methods now available
- IRJE (job submittal to supercomputers; made obsolete by the firewall)
  - Decommissioned March 21, 2005
- Front-end server consolidation to a single new server over the next few months:
  - UCAR front-end Sun server (meeker)
  - UCAR front-end Linux server (longs)
  - Joint SCD/CSS Sun computational server (k2)
  - SCD front-end Sun server (niwot)
Physical Facility Infrastructure Update
- Chilled water upgrade continues
  - Brings cooling up to the power capacity of the data center
  - Startup of new chiller went flawlessly on March 15
  - May 19-22: last planned shutdown
- Stand-by generators proved themselves again during the March 13 outage and the Xcel power drops April 29
- Design phase of planned electrical distribution upgrades to be completed by late 2005
- Risk assessment identified concerns about substation 3
  - Powers the data center (station is near lifetime limit)
  - Additional testing completed Feb. 26
  - Awaiting report
Future Plans for HEC at NCAR...
SCD Strategic Plan: High-End Computing
- Within the current funding envelope, achieve a 25-fold increase over current sustained computing capacity in five years.
- SCD intends as well to pursue opportunities for substantial additional funding for computational equipment and infrastructure to support the realization of demanding institutional science objectives.
- SCD will continue to investigate and acquire experimental hardware and software systems.
  - IBM BlueGene/L, 1Q2005
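The 25-fold-in-five-years target implies a sustained growth rate of roughly 1.9x per year, well above Moore's-Law doubling rates of every 18 to 24 months, which is why the target curve outpaces Moore's Law on the capacity chart:

```python
# Annual growth implied by the strategic-plan target.
target_factor = 25.0
years = 5
annual_growth = target_factor ** (1 / years)  # ~1.90x per year

# Moore's-Law doubling for comparison (18- and 24-month doubling times).
moore_18mo = 2 ** (12 / 18)  # ~1.59x per year
moore_24mo = 2 ** (12 / 24)  # ~1.41x per year
```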
SCD Target Capacity
Target Sustained Computing Capacity at NCAR
[Chart: sustained TeraFLOPs (0-12), Jan 1999 - Jan 2010, comparing Moore's Law growth with the SCD target]
Challenges in Achieving 2006-2007 Goals
- Capability vs. capacity
  - Costs (price performance)
  - Need/desire for capability computing (define!)
  - Balance within the center of capability and capacity. How?
- NCAR/SCD "fixed income"
- Business plans
  - Evaluating Year 5 option with IBM
  - Engaging vendors to informally analyze the SCD Strategic Plan for HEC
  - Likely to enter a year-long procurement for 4Q2006 deployment of additional capacity and capability
Beyond 2006
- Data center limitations / data center expansion
  - NCAR center limits of power/cooling/space will be reached with the 2006 computing addition
  - New center requirements have been compiled/completed
  - Conceptual design for the new center is near completion
  - Funding options being developed with UCAR
- Opportunity of the NSF Petascale Computing Initiative
- Commitment to balanced and sustained investment in robust cyberinfrastructure:
  - Supercomputing systems
  - Mass storage
  - Networking
  - Data management systems
  - Software tools and frameworks
  - Services and expertise
  - Security
Scientific Computing Division Strategic Plan, 2005-2009
www.scd.ucar.edu
...to serve the computing, research and data management needs of atmospheric and related sciences.
Questions