GEOS-Chem simulation for AEROCOM Organic Aerosol Inter-comparison
Distributed computing of the GEOS-Chem model
1
Distributed computing of the GEOS-Chem model
Kevin Bowman
Lei Pan, Qinbin Li, and Paul von Allmen
California Institute of Technology
Jet Propulsion Laboratory
2
Objectives
• The objective of this activity is to develop a scalable, parallel version of the GEOS-Chem code, based on a distributed computing architecture, that is suitable for the JPL 1024-processing-element (PE) institutional cluster.
• The goal is to improve the GEOS-Chem wall-clock performance by at least one order of magnitude over the current capability.
• The current capability: the speedup of GEOS-Chem with the number of CPUs plateaus at 4 processors on a shared-memory platform such as the SGI O2K. The best wall-clock performance is completion of a 1-month model simulation on a 200 x 250 km grid within 1 day.
3
Approach
• The primary calculations in GEOS-Chem are:
– Chemistry (60%)
– Transport, deposition, emissions (40%)
• The chemistry component is inherently parallel and therefore the most logical starting point.
• The initial stage is to use a master/slave architecture for the parallelization of the chemistry.
• The second stage is to migrate towards a domain decomposition design that will handle both transport and chemistry.
4
GEOS-Chem computational flow

Flowchart (main steps and routines):
• Initialization
• Start 6-h loop: met fields (a3 & a6) unzipped and read; met fields interpolated; air mass quantities computed; unit conversion: kg -> v/v
• Start dynamic time step: archive diagnostics (diag3); timeseries diagnostics (diag49); seasonal, monthly, daily data
  – Transport: DO_TRANSPORT (transport_mod.f) -> TPCORE_FVDAS (tpcore_fvdas_mod.f90) / TPCORE (tpcore_mod.f)
  – Turbulent mixing: TURBDAY (turbday.f)
  – Convection: DO_CONVECTION (convection_mod.f) -> FVDAS_CONVECT (fvdas_convect_mod.f) / NFCLDMX (convection_mod.f)
  – Dry deposition: DO_DRYDEP (drydep_mod.f) -> DEPVEL (drydep_mod.f)
  – Emissions: DO_EMISSIONS (emissions_mod.f) -> EMISSDR (emissdr.f)
  – Chemistry: DO_CHEMISTRY (chemistry_mod.f) -> CHEM (chem.f) -> PHYSPROC (physproc.f) -> CALCRATE (calcrate.f) + SMVGEAR (smvgear.f)
  – Wet deposition: DO_WETDEP (wetscav_mod.f) -> WETDEP (wetscav_mod.f)
  – Upper boundary flux conditions; air mass quantities recomputed; unit conversion: kg -> v/v
• End dynamic time step
• End 6-h loop
• Time steps shown in the flowchart: 15 min and 60 min
5
Master/slave architecture
Schematic, logical sequence for one time step:
• GEOS-Chem master node: transport, turbulent mixing, convection; dry deposition, emissions
• Slave nodes: chemistry (one chemistry task per slave node)
• GEOS-Chem master node: wet deposition
6
Chemistry
Serial (single PE):
  PHYSPROC
  DO ii = 1, 2300
    CALCRATE
    SMVGEAR
  ENDDO

Master/slave parallel (PE 1, PE 2, ..., PE N), with the 2300 grid boxes split N ways:
  PE 1 (master): PHYSPROC
                 DO ii = 1, 2300/N
                   CALCRATE
                   SMVGEAR
                 ENDDO
                 MPI-SEND / MPI-RECEIVE (work farmed out to PE 2, ..., PE N and results collected)
  PE 2, ..., PE N (slaves): CALCRATE, SMVGEAR on their share of the 2300 grid boxes
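A minimal sketch of this master/slave pattern in Fortran 90 with MPI, for illustration only: the 2300 grid boxes come from the slide, but NBOXES, conc, and solve_block are placeholder names rather than GEOS-Chem variables or routines, and the real code would carry full species arrays rather than one value per box. The master sends each slave a contiguous block of boxes, every PE runs the solver on its block (standing in for CALCRATE + SMVGEAR), and the results are returned to the master.

  program chem_master_slave
    ! Illustrative master/slave distribution of the chemistry loop (not GEOS-Chem code).
    use mpi
    implicit none
    integer, parameter :: NBOXES = 2300          ! grid boxes, as on the slide
    real(8) :: conc(NBOXES)
    integer :: ierr, rank, nprocs, nloc, first, p
    integer :: status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    nloc = NBOXES / nprocs                       ! boxes per PE (assumes an even split)

    if (rank == 0) then
       conc = 1.0d0                              ! master holds all grid boxes
       do p = 1, nprocs-1                        ! farm out one block per slave
          first = p*nloc + 1
          call MPI_SEND(conc(first), nloc, MPI_DOUBLE_PRECISION, p, 0, MPI_COMM_WORLD, ierr)
       end do
       call solve_block(conc(1), nloc)           ! master solves its own block
       do p = 1, nprocs-1                        ! collect the updated blocks
          first = p*nloc + 1
          call MPI_RECV(conc(first), nloc, MPI_DOUBLE_PRECISION, p, 1, MPI_COMM_WORLD, status, ierr)
       end do
    else
       call MPI_RECV(conc(1), nloc, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
       call solve_block(conc(1), nloc)           ! stands in for CALCRATE + SMVGEAR
       call MPI_SEND(conc(1), nloc, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, ierr)
    end if

    call MPI_FINALIZE(ierr)

  contains
    subroutine solve_block(x, n)                 ! dummy stand-in for the chemistry solver
      integer, intent(in) :: n
      real(8), intent(inout) :: x(n)
      x = 0.99d0 * x
    end subroutine solve_block
  end program chem_master_slave

In this pattern every block of work passes through the master each time step, which is where the communication and contention overheads measured on the performance slide arise.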
7
Amdahl’s Law
Amdahl's law describes the speed-up from parallelization as a function of the number of processors, the non-parallelizable component, and the processor communication and contention costs:
Speedup = T_seq / [ T_np + T_com(P) + T_cont(P) + (T_seq - T_np)/P ]
T_seq : sequential time
T_np : non-parallelizable component time
T_com : communication time between processors
T_cont : contention time between processors
P : number of processors
8
Performance
• Test run on the 4x5 deg grid, full chemistry
• 1024-processor (dual-CPU/node) Dell cluster
• Xeon processors, ~3 Tflops theoretical peak, ~2 Tbyte RAM
• 3.2 GHz and 2 GB RAM per node
• Communication and contention cost removed for this analysis
• T_seq: 649.83 sec
• Chemistry (sequential): 432.21 sec (66.5%)
• SMVGEAR + CALCRATE: 0.0076 sec/node
• The optimal trade-off between speedup and processor count is reached at 32 processors

However:
• Total time with the master/slave architecture is 2230 sec
• Contention time: 1825.95 sec, or 82% of wall-clock time
• Communication time: ~0.0063 x 2300 sec (≈ 14.5 sec)
The master/slave architecture is not a viable option for chemistry or transport.
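As an illustrative check (the timings are taken from the bullets above; the arithmetic is ours, not from the original slides): the parallelizable chemistry time is 432.21 sec, so T_np = 649.83 - 432.21 = 217.62 sec. With communication and contention removed, Amdahl's law at P = 32 gives Speedup = 649.83 / (217.62 + 432.21/32) ≈ 2.8, already close to the asymptotic limit of 649.83/217.62 ≈ 3.0, which is why 32 processors is the optimal trade-off point. Adding back the measured contention (1825.95 sec) and communication (≈ 14.5 sec) makes the denominator larger than T_seq itself, so the parallel run (2230 sec) is slower than the sequential one (649.83 sec).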
9
Domain Decomposition
Schematic: the horizontal grid is partitioned among processors (PE 1,1, PE 1,2, PE 1,3, ...); the legend marks grid cells and ghost boundaries.
• All computations (transport, chemistry) for a grid cell are performed on one processor.
• For transport, ghost boundaries must be used.
10
Ghost Boundaries
Schematic: two processors, PE1 and PE2, advance the fields through times t, t+dt, t+2dt, and t+3dt, with message passing of the ghost boundaries between them.

Process:
• Time t: current values of the fields on all grid points are accessible by PE1 and PE2.
• Times t+dt and t+2dt: current values of the fields are accessible by both PE1 and PE2 only on a reduced set of grid points.
• Message passing: current values of the fields are made accessible to both PE1 and PE2 on all grid points.
• Time t+3dt: the situation is identical to time t.

Salient features:
• Information is exchanged between PE1 and PE2 every 3 time steps.
• Fields on all the grid points in the ghost boundary are exchanged.
• Fields on some grid points are computed redundantly by both PE1 and PE2.
Optimization of ghost boundary size
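A minimal sketch of the ghost-boundary exchange in Fortran 90 with MPI, for a 1-D decomposition along one horizontal direction (illustrative only; exchange_ghosts, nloc, and ng are assumed names, not GEOS-Chem code). Each PE swaps its ng edge cells with its left and right neighbours using MPI_SENDRECV; with a ghost width of ng cells the exchange is only needed every ng transport steps, consistent with the "every 3 time steps" exchange described above if the ghost boundary is 3 cells wide.

  subroutine exchange_ghosts(field, nloc, ng, comm)
    ! Fill the ghost cells field(1-ng:0) and field(nloc+1:nloc+ng) from the
    ! neighbouring PEs of a 1-D domain decomposition (illustrative sketch).
    use mpi
    implicit none
    integer, intent(in)    :: nloc, ng, comm
    real(8), intent(inout) :: field(1-ng:nloc+ng)
    integer :: rank, nprocs, left, right, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_COMM_RANK(comm, rank, ierr)
    call MPI_COMM_SIZE(comm, nprocs, ierr)
    left  = rank - 1; if (left  < 0)       left  = MPI_PROC_NULL   ! no neighbour at the edge
    right = rank + 1; if (right >= nprocs) right = MPI_PROC_NULL

    ! Send my rightmost ng cells to the right PE, receive my left ghosts from the left PE.
    call MPI_SENDRECV(field(nloc-ng+1), ng, MPI_DOUBLE_PRECISION, right, 10, &
                      field(1-ng),      ng, MPI_DOUBLE_PRECISION, left,  10, &
                      comm, status, ierr)

    ! Send my leftmost ng cells to the left PE, receive my right ghosts from the right PE.
    call MPI_SENDRECV(field(1),      ng, MPI_DOUBLE_PRECISION, left,  11, &
                      field(nloc+1), ng, MPI_DOUBLE_PRECISION, right, 11, &
                      comm, status, ierr)
  end subroutine exchange_ghosts

Widening the ghost boundary reduces how often this exchange must run, at the cost of more redundant computation; that is the trade-off behind optimizing the ghost boundary size.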
11
Future Directions and Conclusions
• We have a preliminary design for the domain decomposition
• We expect to achieve a speed-up of roughly P^(1/2) with this design.
• The I/O bottleneck (large amounts of data written to files) will be resolved by using the Parallel Virtual File System (PVFS) and ROMIO (MPI-IO) in order to maintain the scaling for a larger number of processors.
• We expect this approach to enable GEOS-Chem users to address a broad range of questions that are currently inhibited by computational constraints.
• These techniques will be beneficial not only to large systems, such as the JPL institutional cluster, but also to more modest cluster systems.
12
Distributed Computation
Schematic of one distributed time step:
• Data on full grid
• Distribute data (message passing) to P1, P2, ..., PN
• Distributed computation on each of P1, P2, ..., PN: chemistry, then transport t -> t+dt
• Inject boundary data (message passing) between the PEs
• Gather data (message passing)
• Data on full grid
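A minimal end-to-end sketch of the distribute/compute/gather cycle above, in Fortran 90 with MPI (illustrative only; NGLOBAL, do_chemistry, and do_transport are placeholder names, not GEOS-Chem routines, and the "inject boundary data" step would be the ghost-cell exchange sketched earlier). Rank 0 scatters equal slabs of the full grid, every PE advances its local chemistry and transport, and rank 0 gathers the result back onto the full grid.

  program distribute_compute_gather
    ! Illustrative distribute -> compute -> gather cycle (placeholder names).
    use mpi
    implicit none
    integer, parameter :: NGLOBAL = 1024         ! total grid columns (assumed value)
    real(8), allocatable :: global(:), local(:)
    integer :: ierr, rank, nprocs, nloc

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    nloc = NGLOBAL / nprocs                      ! assumes nprocs divides NGLOBAL

    allocate(local(nloc))
    if (rank == 0) then
       allocate(global(NGLOBAL))
       global = 1.0d0                            ! "data on full grid"
    else
       allocate(global(1))                       ! dummy buffer on non-root ranks
    end if

    ! Distribute data (message passing)
    call MPI_SCATTER(global, nloc, MPI_DOUBLE_PRECISION, &
                     local,  nloc, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

    call do_chemistry(local, nloc)               ! local chemistry
    call do_transport(local, nloc)               ! local transport, t -> t+dt
                                                 ! (a ghost-cell exchange would go here)

    ! Gather data (message passing): "data on full grid" again on rank 0
    call MPI_GATHER(local,  nloc, MPI_DOUBLE_PRECISION, &
                    global, nloc, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

    call MPI_FINALIZE(ierr)

  contains
    subroutine do_chemistry(x, n)                ! placeholder for the chemistry operator
      integer, intent(in) :: n
      real(8), intent(inout) :: x(n)
      x = 0.9d0 * x
    end subroutine do_chemistry

    subroutine do_transport(x, n)                ! placeholder for the transport operator
      integer, intent(in) :: n
      real(8), intent(inout) :: x(n)
      x = x + 0.1d0
    end subroutine do_transport
  end program distribute_compute_gather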