SCD Update


Transcript of SCD Update

Page 1: SCD Update

Supercomputing • Communications • Data

NCAR Scientific Computing Division

SCD Update

Tom Bettge
Deputy Director
Scientific Computing Division
National Center for Atmospheric Research
Boulder, CO USA

User Forum, 17-19 May 2005

Page 2: SCD Update


NCAR/SCD

[Chart: NCAR/SCD supercomputer ranking position (1 to 350) by year, 1990-2010, marking the 1996 procurement and the IBM POWER3 and IBM POWER4 systems.]

Page 3: SCD Update


Peak TFLOPs at NCAR

[Chart: peak TFLOPs at NCAR, 0-12, Jan-97 through Jan-05. Systems plotted: IBM Opteron/Linux (pegasus), IBM Opteron/Linux (lightning), IBM POWER4/Federation (thunder), IBM POWER4/Colony (bluesky), IBM POWER4 (bluedawn), SGI Origin3800/128, IBM POWER3 (blackforest), IBM POWER3 (babyblue), Compaq ES40/32 (prospect), SGI Origin2000/128 (ute), HP SPP-2000/64 (sioux), CRI Cray C90/16 (antero), and the CRI Cray J90 series. Milestones annotated: Cray C90/16, HP SPP2000, SGI Origin2000, blackforest WH-1, blackforest WH-2, ARCS Phase 1 (blackforest upgrade, SGI Origin3800), ARCS Phase 2 (bluesky), ARCS Phase 3 (bluesky expansion), and IBM Linux.]

Page 4: SCD Update


SCD Update

Production HEC Computing
Mass Storage System
Services
Server Consolidation and Decommissions
Physical Facility Infrastructure Update
Future HEC at NCAR

Page 5: SCD Update


News: Production Computing

Redeployed SGI 3800 as Data Analysis engine
– chinook became tempest
– departure of dave

IBM POWER3 blackforest decommissioned Jan 2005
– Loss of 2.0 TFLOPs of peak computing capacity

IBM Linux Cluster lightning joined production pool March 2005
– Gain of 1.1 TFLOPs of peak computing capacity
– 256 processors (128 dual-node configuration)
– 2.2 GHz AMD Opteron processors
– 6 TByte FAStT500 RAID with GPFS
– 40% faster than bluesky (1.3 GHz POWER4) cluster on parallel POP and CAM simulations
– 3rd-party vendor compilers

Page 6: SCD Update


Resource Usage FY04

At the end of FY04, the combined supercomputing capacity at NCAR was ~11 TFLOPs.

Roughly 81% of that capacity was used for climate simulation and analysis (Climate & IPCC).

[Pie chart: FY04 resource usage by discipline]
Climate 50.8%
IPCC 30.4%
Weather Prediction 6.5%
Oceanography 5.2%
Astrophysics 3.3%
Atmospheric Chemistry 1.8%
Basic Fluid Dynamics 1.2%
Cloud Physics 0.4%
Upper Atmosphere 0.2%
Miscellaneous 0.2%
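The "roughly 81%" climate share is simply the Climate and IPCC slices of the pie combined, as a quick check confirms:

```python
# Climate (50.8%) plus IPCC (30.4%) slices of the FY04 usage pie:
climate_total = 50.8 + 30.4
print(round(climate_total, 1))  # 81.2
```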

Page 7: SCD Update


bluesky Workload by Facility

April 2005

[Bar chart: bluesky usage by facility (CSL, Comm, NCAR, UNIV). Left axis: CPU hours (0-500,000) for actual and reserved CPU hours; right axis: number of jobs (0-18,000).]

Page 8: SCD Update


Computing Demand

Science Driving Demand for Scientific Computing

Summer 2004: CSL Requests 1.5x Availability

Sept 2004: NCAR Requests 2x Availability

Sept 2004: University Requests 3x Availability

March 2005: University Requests 1.7x Availability

Page 9: SCD Update


Computational Campaigns

BAMEX: Spring 2003
IPCC: FY 2004
MMM Spring Real-Time Forecasts: Spring 2004
WRF Real-Time Hurricane Forecast: Fall 2004
DTC Winter Real-Time Forecasts: Winter 2004-2005
MMM Spring Real-Time Forecast: Spring 2005
MMM East Pacific Hurricane Formation: July 2005

Page 10: SCD Update


bluesky 8-way LPAR Usage

[Chart: weekly bluesky 8-way LPAR utilization from 8/29 through 3/27, 0-100%, split into % User, % System, and % Idle.]

Page 11: SCD Update


bluesky 32-way LPAR Usage

[Chart: weekly bluesky 32-way LPAR utilization from 8/29 through 3/27, 0-100%, split into % User, % System, and % Idle.]

Page 12: SCD Update


Servicing the Demand: NCAR Computing Facility

SCD's supercomputers are well utilized ...

... yet average job queue-wait times† are measured in hours (was minutes in '04), not days.

Utilization:              Apr '05    2004
Bluesky 8-way LPARs       94.6%      89%
Bluesky 32-way LPARs      95.8%      92%
Blackforest               --         82%
Lightning                 48.0%      --

Regular Queue wait:       CSL        Community
Bluesky 8-way             43m        3h34m
Bluesky 32-way            1h02m      49m
Lightning                 1m

† April 2005 average

Page 13: SCD Update


Average bluesky Queue-Wait Times (HH:MM)

8-way LPARs
            University                          NCAR
            Jan '05  Feb '05  Mar '05  Apr '05  Jan '05  Feb '05  Mar '05  Apr '05
Premium     0:09     0:34     0:52     0:29     0:13     0:28     1:07     0:31
Regular     0:57     3:44     6:24     2:57     0:21     9:41     11:19    4:27
Economy     1:47     1:12     1:45     1:00     4:06     2:40     3:00     5:44
Stand-by    0:06     0:17     0:10     3:02     10:08    32:41    0:44     4:58

32-way LPARs
            University                          NCAR
            Jan '05  Feb '05  Mar '05  Apr '05  Jan '05  Feb '05  Mar '05  Apr '05
Premium     0:00     0:20     0:02     0:06     0:18     0:21     0:53     0:22
Regular     0:57     1:10     2:30     0:46     1:03     1:28     1:42     0:55
Economy     3:42     1:39     2:08     2:45     4:40     0:48     4:09     1:54
Stand-by    3:36     7:36     19:36    1:58     5:35     15:58    25:28    32:34

Page 14: SCD Update


bluesky Queue Wait Timesbluesky Queue Wait Times

blackforest removedblackforest removed lightning charging did not start until March 1lightning charging did not start until March 1 Corrective (minor) actions taken:Corrective (minor) actions taken:

– Disallow “batch” node_usage=shared jobsDisallow “batch” node_usage=shared jobs Increase utility of the “share” nodes (4 nodes, 128 pes) Increase utility of the “share” nodes (4 nodes, 128 pes)

– Shift the “facility” split (CSL/Community) from 50/50 to Shift the “facility” split (CSL/Community) from 50/50 to 45/5545/55

More accurately reflects the actual allocation distributionMore accurately reflects the actual allocation distribution

– Reduce premium charge from 2.0x to 1.5xReduce premium charge from 2.0x to 1.5x Encourage use of premium if needed for critical turnaroundEncourage use of premium if needed for critical turnaround

– Have reduced Have reduced NCARNCAR 30-day allocation limit from 130% to 30-day allocation limit from 130% to 120%120%

Matches other groups (leveled playing field)Matches other groups (leveled playing field) SCD is watching closely……SCD is watching closely……
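The effect of the premium-charge reduction can be made concrete with a small sketch. The formula below is an illustrative assumption (charge proportional to wallclock time, processor count, and a queue charging factor), not SCD's actual GAU accounting:

```python
def charge(wallclock_hours, processors, queue_factor):
    """GAU-style charge for one job (illustrative formula, not SCD's)."""
    return wallclock_hours * processors * queue_factor

# A hypothetical 6-hour, 64-processor premium job:
print(charge(6, 64, 2.0))  # 768.0  (old premium factor)
print(charge(6, 64, 1.5))  # 576.0  (new premium factor)
```

Under this assumption, dropping the factor from 2.0x to 1.5x cuts the premium surcharge for the same job by a quarter of the total charge.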

Page 15: SCD Update


Average Compute Factor per GAU Charged

[Chart: average compute factor per GAU charged, Jan 1 through May 1, 2005; the compute factor varies between about 0.6 and 1.0.]

Page 16: SCD Update


Mass Storage System: NCAR MSS Data Holdings

[Chart: MSS data holdings in terabytes, Jan-97 through Jan-05, on a 0-2,500 TB scale; separate curves show total and unique holdings.]

Page 17: SCD Update


Page 18: SCD Update


Mass Storage System

Disk cache expanded to service files ≤ 100 MB
– 60% of files this size being read from cache, not tape mount

Deployment of 200 GB cartridges (previously 60 GB)
– Now over 500 TB of data on these cartridges
– Drives provide a 3x increase in transfer rate
– A full silo holds 1.2 PB; 5 silos hold 6 PB of data

Users have recently moved to the single-copy class of service (motivated by GAU compute charges)

Embarking on a project to address future MSS growth
– Manageable growth rate
– User management tools (identify, remove, etc.)
– User access patterns / user education (archive selectively, tar)
– Compression
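The "archive selectively, tar" and compression points reflect standard practice for mass stores: bundling many small files into one compressed archive reduces file counts and tape mounts. A generic sketch with hypothetical paths (the MSS-side copy command is site-specific and omitted):

```shell
# Hypothetical model-output directory with many small files:
mkdir -p run_output
printf 'sample\n' > run_output/field_001.txt
printf 'sample\n' > run_output/field_002.txt

# One compressed tar archive instead of many small files; a single
# large file is far cheaper to store and retrieve from tape than
# thousands of small ones.
tar -czf run_output.tar.gz run_output
ls -l run_output.tar.gz
```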

Page 19: SCD Update


SCD Customer Support

Consistent with SCD Reorganization

Phased Deployment: Dec 2004 - May 2005

Advantages:
– Enhanced service: Computer Production Group 24/7
– Effectively utilize other SCD groups in customer support
– Easier questions handled sooner
– Harder questions routed to the correct group sooner

Feedback Plan

SCD will provide a balanced set of services to enable researchers to easily and effectively utilize community resources.

Page 20: SCD Update


Server Decommissions

MIGS (MSS access from remote sites)
– Decommissioned April 12, 2005
– Other contemporary methods now available

IRJE (job submittal to supercomputers; made obsolete by the firewall)
– Decommissioned March 21, 2005

Front-end server consolidation to a single new server over the next few months:
– UCAR front-end Sun server (meeker)
– UCAR front-end Linux server (longs)
– Joint SCD/CSS Sun computational server (k2)
– SCD front-end Sun server (niwot)

Page 21: SCD Update


Physical Facility Infrastructure Update

Chilled water upgrade continues
– Brings cooling up to the power capacity of the data center
– Startup of the new chiller went flawlessly on March 15th
– May 19-22: last planned shutdown

Stand-by generators proved themselves again during the March 13th outage and the Xcel power drops April 29

Design phase of planned electrical distribution upgrades to be completed by late 2005

Risk assessment identified concerns about substation 3
– Power to data center (station is near lifetime limit)
– Additional testing completed Feb. 26th
– Awaiting report

Page 22: SCD Update


Future Plans for HEC at NCAR ...

Page 23: SCD Update


SCD Strategic Plan: High-End Computing

Within the current funding envelope, achieve a 25-fold increase over current sustained computing capacity in five years.

SCD intends as well to pursue opportunities for substantial additional funding for computational equipment and infrastructure to support the realization of demanding institutional science objectives.

SCD will continue to investigate and acquire experimental hardware and software systems.
• IBM BlueGene/L (1Q2005)
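The five-year, 25-fold target implies a steady compound annual growth factor, which a one-line calculation makes explicit:

```python
# Capacity must grow by the same factor each year for five years,
# so the annual factor is the fifth root of 25: about 1.9x per year.
annual_factor = 25 ** (1 / 5)
print(round(annual_factor, 2))  # 1.9
```

That is, roughly a 90% capacity increase every year, somewhat steeper than the Moore's Law trend shown on the target-capacity chart.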

Page 24: SCD Update


SCD Target Capacity

[Chart: target sustained computing capacity at NCAR in sustained TeraFLOPs (0-12), Jan-99 through Jan-10, comparing a Moore's Law trend with the SCD target.]

Page 25: SCD Update


Challenges in Achieving 2006-2007 Goals

Capability vs. Capacity
– Costs (price performance)
– Need/Desire for Capability Computing (define!)
– Balance within the center of capability and capacity. How?

NCAR/SCD "fixed income"

Business Plans
– Evaluating Year 5 Option with IBM
– Engaging vendors to informally analyze SCD Strategic Plan for HEC
– Likely to enter year-long procurement for 4Q2006 deployment of additional capacity and capability

Page 26: SCD Update


Beyond 2006

Data Center Limitations / Data Center Expansion
– NCAR center limits of power/cooling/space will be reached with the 2006 computing addition
– New center requirements have been compiled/completed
– Conceptual design for the new center is near completion
– Funding options being developed with UCAR

Opportunity of NSF Petascale Computing Initiative

Commitment to balanced and sustained investment in robust cyberinfrastructure:
– Supercomputing systems
– Mass storage
– Networking
– Data Management Systems
– Software Tools and Frameworks
– Services and Expertise
– Security

Page 27: SCD Update


Scientific Computing Division Strategic Plan 2005-2009

www.scd.ucar.edu

to serve the computing, research and data management needs of atmospheric and related sciences.

Page 28: SCD Update


Questions