Transcript of a presentation by Rich Loft, Director, Technology Development, Computational and Information Systems Laboratory, National Center for Atmospheric Research (11/18/08).
An Inconvenient Question: Are We Going to Get the Algorithms and Computing Technology We Need to Make Critical Climate Predictions in Time?
Rich Loft, Director, Technology Development
Computational and Information Systems Laboratory
National Center for Atmospheric Research
[email protected]
Main Points
• Nature of the climate system makes it a grand challenge computing problem.
• We are at a critical juncture: we need regional climate prediction capabilities!
• Computer clock/thread speeds are stalled: massive parallelism is the future of supercomputing.
• Our best algorithms, parallelization strategies and architectures are inadequate to the task.
• We need model acceleration improvements in all three areas if we are to meet the challenge.
Options for Application Acceleration
• Scalability
  – Eliminate bottlenecks
  – Find more parallelism
  – Load balancing algorithms
• Algorithmic Acceleration
  – Bigger timesteps
    • Semi-Lagrangian transport
    • Implicit or semi-implicit time integration (solvers)
  – Fewer points
    • Adaptive Mesh Refinement methods
• Hardware Acceleration
  – More threads
    • CMP, GP-GPUs
  – Faster threads
    • Device innovations (high-k)
  – Smarter threads
    • Architecture: old tricks, new tricks… magic tricks
    • Vector units, GPUs, FPGAs
(Figure: Viner, 2002)
A Very Grand Challenge: Coupled Models of the Earth System
Typical model computation:
  – 15-minute time steps
  – ~1 peta-flop per model year
There are 3.5 million timesteps in a century.
(Figure: globe with ~150 km grid cells, each containing an air column and a water column)
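A quick arithmetic check of that step count (my own back-of-envelope, assuming 365-day years):

    # 15-minute steps over a simulated century (365-day years assumed).
    steps_per_day = 24 * 60 // 15                  # 96 steps per simulated day
    steps_per_century = steps_per_day * 365 * 100
    print(steps_per_century)                       # 3,504,000, i.e. the ~3.5 million quoted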
Multicomponent Earth System Model
(Diagram: Atmosphere, Ocean, Land, and Sea Ice components joined by a Coupler, with C/N cycle, dynamic vegetation, ecosystem & BGC, gas chemistry, prognostic aerosols, upper atmosphere, land use, ice sheets, and climate change. Credit: Caspar Ammann, NCAR)
Software challenges:
• Increasing complexity
• Validation and verification
• Understanding the output
Key concept: A flexible coupling framework is critical!
IPCC AR4 (2007)
• “Warming of the climate system is unequivocal” …
• … and it is “very likely” caused by human activities.
• Most of the observed changes over the past 50 years are now simulated by climate models, adding confidence to future projections.
• Model resolutions: O(100 km)
Climate Change Research Epochs
Before IPCC AR4 (curiosity driven):
• Reproduce historical trends
• Investigate climate change
• Run IPCC scenarios
After IPCC AR4, 2007 (policy driven):
• Assess regional impacts
• Simulate adaptation strategies
• Simulate geoengineering solutions
ESSL - The Earth & Sun Systems Laboratory
Where we want to go: The Exascale Earth System Model Vision
Coupled ocean-land-atmosphere model:
• Atmosphere: ~1 km x ~1 km (cloud-resolving), 100 levels, whole atmosphere, unstructured adaptive grids
• Ocean: ~10 km x ~10 km (eddy-resolving), 100 levels, unstructured adaptive grids
• Land: ~100 m, 10 levels, landscape-resolving
Requirement: computing power enhancement by as much as a factor of 10^10-10^12. YIKES!
Compute factors for an ultra-high resolution Earth System Model
• Spatial resolution (provide regional details): 10^3-10^5
• Model completeness (add “new” science): 10^2
• New parameterizations (upgrade to “better” science): 10^2
• Run length (long-term implications): 10^2
• Ensembles, scenarios (range of model variability): 10
• Total compute factor: 10^10-10^12
(courtesy of John Drake, ORNL)
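As a sanity check (my arithmetic, not on the slide), the per-row factors do multiply out to the quoted total:

    # Low and high ends: resolution x completeness x parameterizations x run length x ensembles.
    low  = 1e3 * 1e2 * 1e2 * 1e2 * 10
    high = 1e5 * 1e2 * 1e2 * 1e2 * 10
    print(low, high)                               # 1e10 and 1e12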
Why run length: the global thermohaline circulation timescale is ~3,000 years.
Why resolution: atmospheric convective (cloud) scales are O(1 km).
Why high resolution in the ocean?
(Comparison: the 1° ocean component of CCSM (Collins et al., 2006) vs. eddy-resolving 0.1° POP (Maltrud & McClean, 2005))
High Resolution and the Land Surface
Performance improvements are not coming fast enough!
• Current trends suggest the needed 10^10 to 10^12 improvement will take 40 years.
• ITRS roadmap: feature size is dropping ~14%/year.
• By 2050 it reaches the size of an atom – oops!
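A rough extrapolation of that ITRS argument (my own sketch: the 45 nm starting node for 2008 and the ~0.1 nm "atomic" endpoint are assumptions, not from the slide):

    # Shrink feature size by 14% per year until it reaches roughly atomic scale.
    year, feature_nm = 2008, 45.0                  # assumed 2008 process node
    while feature_nm > 0.1:                        # ~0.1 nm, roughly one atom
        year += 1
        feature_nm *= 1 - 0.14
    print(year)                                    # about 2049, i.e. "by 2050"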
National Security Agency: “The power consumption of today's advanced computing systems is rapidly becoming the limiting factor with respect to improved/increased computational ability.”
Chip Level Trends: Stagnant Clock Speed
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of cores is doubling instead
• There is little or no additional hidden parallelism (ILP)
• Parallelism must be exploited by software
Moore’s Law -> More’s Law: Speed-up through increasing parallelism
How long can we double the number of cores per chip?
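For scale, a toy compounding illustration (my own; the 4-core chip assumed for 2008 is not from the slide):

    # Cores per chip doubling every two years from an assumed 4-core 2008 part.
    cores, year = 4, 2008
    while year < 2020:
        year += 2
        cores *= 2
    print(year, cores)                             # 2020: 256 cores per chip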
NCAR and the University of Colorado Partner to Experiment with Blue Gene/L
Dr. Henry Tufo and myself with “frost” (2005)
Characteristics:
• 2048 processors / 5.7 TF
• PPC 440 (750 MHz)
• Two processors per node
• 512 MB memory per node
• 6 TB file system
Status and immediate plans for high resolution Earth System Modeling
Current high resolution CCSM runs
• 0.25° ATM, LND + 0.1° OCN, ICE [ATLAS/LLNL]
  – 3280 processors
  – 0.42 simulated years per day (SYPD)
  – 187K CPU hours/year
• 0.50° ATM, LND + 0.1° OCN, ICE [FRANKLIN/NERSC]
  – Current: 5416 processors, 1.31 SYPD, 99K CPU hours/year
  – “Efficiency” goal: 4932 processors, 1.80 SYPD, 66K CPU hours/year
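These CPU-hour figures are consistent with processor count divided by throughput; a minimal check, assuming "CPU hours/year" means CPU hours per simulated year:

    def cpu_hours_per_sim_year(nprocs, sypd):
        # nprocs busy for 24 wall-clock hours produce sypd simulated years.
        return nprocs * 24.0 / sypd

    print(round(cpu_hours_per_sim_year(3280, 0.42)))   # ~187K (ATLAS)
    print(round(cpu_hours_per_sim_year(5416, 1.31)))   # ~99K  (FRANKLIN, current)
    print(round(cpu_hours_per_sim_year(4932, 1.80)))   # ~66K  (FRANKLIN, goal)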
Current 0.5° CCSM “fuel efficient” configuration [franklin]: 5416 processors, 168 sec. total
• OCN [np=3600]: 120 sec.
• ATM [np=1664]: 52 sec.
• CPL [np=384]: 21 sec.
• LND [np=16]
• ICE [np=1800]: 91 sec.
Efficiency issues in the current 0.5° CCSM configuration: the ocean component (OCN: 120 sec.)
Load Balancing: Partitioning with Space-Filling Curves
(Example: partition for 3 processors)
Space-filling curve partitioning for the ocean model running on 8 processors
Key concept: no need to compute over land!
Static load balancing…
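A minimal sketch of the idea (not the CCSM/POP code; the Morton Z-order curve and the toy land mask are stand-ins chosen for illustration):

    # Order ocean blocks along a space-filling curve, skip all-land blocks,
    # and hand each processor a contiguous chunk of the curve.
    def morton_key(i, j, bits=8):
        """Interleave the bits of (i, j): a position on a Z-order curve."""
        key = 0
        for b in range(bits):
            key |= ((i >> b) & 1) << (2 * b + 1)
            key |= ((j >> b) & 1) << (2 * b)
        return key

    def partition_ocean_blocks(land_mask, nprocs):
        n = len(land_mask)
        ocean = [(i, j) for i in range(n) for j in range(n) if not land_mask[i][j]]
        ocean.sort(key=lambda ij: morton_key(*ij))       # order along the curve
        chunk = -(-len(ocean) // nprocs)                 # ceiling division
        return {p: ocean[p * chunk:(p + 1) * chunk] for p in range(nprocs)}

    # Toy 4x4 block grid: True marks an all-land block, which is simply dropped.
    mask = [[False, False, True,  True],
            [False, False, True,  False],
            [False, False, False, False],
            [True,  False, False, False]]
    for proc, blocks in partition_ocean_blocks(mask, 3).items():
        print(proc, blocks)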
Ocean model 1/10° performance
Key concept: you need routine access to >1K processors to discover true scaling behaviour!
Efficiency issues in the current 0.5° CCSM configuration: the land and sea ice components (LND np=16; ICE np=1800: 91 sec.)
Static, Weighted Load Balancing Example: Sea Ice Model CICE4 @ 1° on 20 processors
• Small domains @ high latitudes
• Large domains @ low latitudes
(Courtesy of John Dennis)
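A minimal sketch of static, weighted partitioning (not CICE4 itself; the greedy least-loaded heuristic and the toy "ice likelihood" weights are my own stand-ins):

    import heapq

    def weighted_partition(block_costs, nprocs):
        """Greedily give each block to the currently least-loaded processor."""
        heap = [(0.0, p, []) for p in range(nprocs)]     # (load, proc, blocks)
        heapq.heapify(heap)
        for block, cost in sorted(block_costs.items(), key=lambda kv: -kv[1]):
            load, p, blocks = heapq.heappop(heap)
            blocks.append(block)
            heapq.heappush(heap, (load + cost, p, blocks))
        return {p: (load, blocks) for load, p, blocks in heap}

    # Toy example: 12 latitude bands, cost ~ expected ice work (poles expensive).
    costs = {f"band{k}": w for k, w in enumerate([9, 8, 5, 2, 1, 0.1,
                                                  0.1, 1, 2, 5, 8, 9])}
    for p, (load, blocks) in sorted(weighted_partition(costs, 4).items()):
        print(p, round(load, 1), blocks)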
Efficiency issues in the current 0.5° CCSM configuration: the coupler (CPL np=384: 21 sec.)
Unresolved scalability issues in the coupler. Options: a better interconnect, nested grids, a PGAS language paradigm.
Efficiency issues in the current 0.5° CCSM configuration: the atmospheric component (ATM np=1664: 52 sec.)
Scalability limitation in the 0.5° fv-CAM [MPI]: shift to a hybrid OpenMP/MPI version.
Projected 0.5° CCSM “capability” configuration: 3.8 years/day, 19460 processors, 62 sec. total
• OCN [np=6100]: 62 sec.
• ATM [np=5200]: 31 sec.
• CPL [np=384]: 21 sec.
• LND [np=40]
• ICE [np=8120]: 10 sec.
Action: run the hybrid atmospheric model
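If the 62 sec. total is read as wall-clock time per simulated day (my interpretation of the figure, not stated explicitly), the 3.8 years/day follows directly:

    def simulated_years_per_day(wall_sec_per_sim_day):
        # 86400 s in a wall-clock day; a simulated year taken as 365 simulated days.
        return 86400.0 / (wall_sec_per_sim_day * 365)

    print(round(simulated_years_per_day(62), 1))   # ~3.8, as quoted here
    print(round(simulated_years_per_day(60), 1))   # ~3.9, close to the 4.0 on the 0.25 deg slide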
Projected 0.5° CCSM “capability” configuration, version 2: 3.8 years/day, 14260 processors, 62 sec. total
• OCN [np=6100]: 62 sec.
• ATM [np=5200]: 31 sec.
• CPL [np=384]: 21 sec.
• LND [np=40]
• ICE [np=8120]: 10 sec.
Action: thread the ice model
Scalable Geometry Choice: Cube-Sphere
• The sphere is decomposed into 6 identical regions using a central projection (Sadourny, 1972) with an equiangular grid (Rancic et al., 1996).
• Avoids pole problems; quasi-uniform.
• Non-orthogonal curvilinear coordinate system with identical metric terms.
(Figure: Ne=16 cube sphere showing the degree of non-uniformity)
Scalable Numerical Method: High-Order Methods
• Algorithmic advantages of high-order methods
  – h-p element-based method on quadrilaterals (Ne x Ne)
  – Exponential convergence in polynomial degree (N)
• Computational advantages of high-order methods
  – Naturally cache-blocked N x N computations
  – Nearest-neighbor communication between elements (explicit)
  – Well suited to parallel microprocessor systems
HOMME: Computational Mesh
• Elements:
  – A quadrilateral “patch” of N x N gridpoints
  – Gauss-Lobatto grid
  – Typically N = 4-8
• Cube:
  – Ne = elements on an edge
  – 6 x Ne x Ne elements total
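A small sizing sketch built from these counts (my own helper; the unique-point formula assumes neighboring elements share their Gauss-Lobatto edge points, and the spacing is a rough equatorial estimate):

    import math

    def homme_mesh_size(Ne, N, radius_km=6371.0):
        elements = 6 * Ne * Ne
        points_per_element = N * N
        unique_points = 6 * Ne * Ne * (N - 1) ** 2 + 2   # shared edges counted once
        avg_dx_km = 2 * math.pi * radius_km / (4 * Ne * (N - 1))
        return elements, points_per_element, unique_points, avg_dx_km

    print(homme_mesh_size(16, 4))   # e.g. the Ne=16, N=4 mesh from the earlier figure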
Partitioning a cube-sphere on 8 processors
Aqua-Planet CAM/HOMME Dycore: 5 years/day
• Full CAM physics with the HOMME dycore
• Parallel I/O library used for physics aerosol input and input data (this work COULD NOT have been done without parallel I/O)
• Work underway to couple to the other CCSM components
Projected 0.25° CCSM “capability” configuration, version 2: 4.0 years/day, 30000 processors, 60 sec. total
• OCN [np=6000]: 60 sec.
• HOMME ATM [np=24000]: 47 sec.
• CPL [np=3840]: 8 sec.
• LND [np=320]
• ICE [np=16240]: 5 sec.
Action: insert the scalable atmospheric dycore
Using a bigger parallel machine can’t be the only answer
• Progress in the Top 500 list is not fast enough.
• Amdahl’s Law is a formidable opponent.
• The dynamical timestep goes like N^-1:
  – the merciless effect of the Courant limit
  – the cost of dynamics relative to physics increases as N
  – e.g., if dynamics takes 20% of the time at 25 km, it will take 86% of the time at 1 km (see the sketch after this list)
• Traditional parallelization of the horizontal leaves an N^2 per-thread cost (vertical x horizontal); it must inevitably slow down with stalled thread speeds.
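A worked check of that 20% to 86% claim (my arithmetic; it assumes dynamics cost scales with the Courant-limited timestep count, i.e. with 1/dx, while physics cost per step stays fixed):

    def dynamics_fraction(frac_at_coarse, dx_coarse_km, dx_fine_km):
        extra_steps = dx_coarse_km / dx_fine_km     # more timesteps at finer dx
        dyn = frac_at_coarse * extra_steps
        phys = 1.0 - frac_at_coarse
        return dyn / (dyn + phys)

    print(dynamics_fraction(0.20, 25.0, 1.0))       # ~0.86, i.e. 86% at 1 km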
Options for Application Acceleration
• Scalability
  – Eliminate bottlenecks
  – Find more parallelism
  – Load balancing algorithms
• Algorithmic Acceleration
  – Bigger timesteps
    • Semi-Lagrangian transport
    • Implicit or semi-implicit time integration (solvers)
  – Fewer points
    • Adaptive Mesh Refinement methods
• Hardware Acceleration
  – More threads
    • CMP, GP-GPUs
  – Faster threads
    • Device innovations (high-k)
  – Smarter threads
    • Architecture: old tricks, new tricks… magic tricks
    • Vector units, GPUs, FPGAs
Accelerator Research
• Graphics cards: Nvidia 9800 / CUDA
  – Measured 109x on WRF microphysics on a 9800GX2
• FPGA: Xilinx (data-flow model)
  – 21.7x simulated on the shortwave radiation code
• IBM Cell processor: 8 cores
• Intel Larrabee
DG + NH + AMR
• Curvilinear elements
• Overhead of parallel AMR at each time-step: less than 1%
• Idea based on Fischer, Kruse, Loth (2002)
(Courtesy of Amik St. Cyr)
SLIM ocean model
• Louvain-la-Neuve University
• DG, implicit, AMR, unstructured
• To be coupled to a prototype unstructured ATM model
(Courtesy of J-F Remacle, LNU)
NCAR Summer Internships in Parallel Computational Science (SIParCS), 2007-2008
• Open to:
  – Upper-division undergraduates
  – Graduate students
• In disciplines such as:
  – CS, software engineering
  – Applied math, statistics
  – Earth system science
• Support:
  – Travel, housing, per diem
  – 10 weeks salary
• Number of interns selected:
  – 7 in 2007
  – 11 in 2008
http://www.cisl.ucar.edu/siparcs
Meanwhile - the clock is ticking
The Size of the Interdisciplinary/Interagency Team Working on Climate Scalability
• Contributors: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), A. St. Cyr (NCAR), J. Dennis (NCAR), J. Edwards (IBM), B. Fox-Kemper (MIT, CU), E. Hunke (LANL), B. Kadlec (CU), D. Ivanova (LLNL), E. Jedlicka (ANL), E. Jessup (CU), R. Jacob (ANL), P. Jones (LANL), S. Peacock (NCAR), K. Lindsay (NCAR), W. Lipscomb (LANL), R. Loy (ANL), J. Michalakes (NCAR), A. Mirin (LLNL), M. Maltrud (LANL), J. McClean (LLNL), R. Nair (NCAR), M. Norman (NCSU), T. Qian (NCAR), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), P. Worley (ORNL), M. Zhang (SUNYSB)
• Funding:
  – DOE-BER CCPP Program Grants: DE-FC03-97ER62402, DE-PS02-07ER07-06, DE-FC02-07ER64340, B&R KP1206000
  – DOE-ASCR: B&R KJ0101030
  – NSF Cooperative Grant NSF01
  – NSF PetaApps Award
• Computer time:
  – Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL, Stony Brook & BNL
  – Cray XT3/4 time: ORNL, Sandia
Thanks! Any Questions?
Q. If you had a petascale computer, what would you do with it?
A. Use it as a prototype of an exascale computer.