August 2001
Parallelizing ROMS for Distributed Memory Machines using the Scalable Modeling
System (SMS)
Dan Schaffer
NOAA Forecast Systems Laboratory (FSL)
Outline
• Who we are
• Intro to SMS
• Application of SMS to ROMS
• Ongoing Work
• Conclusion
Who we are
• Mark Govett
• Leslie Hart
• Tom Henderson
• Jacques Middlecoff
• Dan Schaffer
• Developing SMS for 20+ man years
Intro to SMS
• Overview
– Directive-based
• FORTRAN comments
• Enables single-source parallelization
– Distributed or shared memory machines
– Performance portability
Distributed Memory Parallelism
Code Parallelization using SMS
Original Serial Code → (add SMS directives) → SMS Serial Code
SMS Serial Code → Serial Executable
SMS Serial Code → (PPP: Parallel Pre-Processor) → SMS Parallel Code → Parallel Executable
Low-Level SMS
SMS Parallel Code
→ SMS support libraries: NNT, SRS, STF, FDA, Spectral Library, Parallel I/O
→ MPI, SHMEM, etc.
Intro to SMS (contd)
– Support for all of F77 plus much of F90, including:
• Dynamic memory allocation
• Modules (partially supported)
• User-defined types
– Supported machines:
• COMPAQ Alpha-Linux Cluster (FSL “Jet”)
• PC-Linux Cluster
• SUN Sparcstation
• SGI Origin 2000
• IBM SP-2
Intro to SMS (contd)
• Models parallelized
– Ocean: ROMS, HYCOM, POM
– Mesoscale weather: FSL RUC, FSL QNH, NWS Eta, Taiwan TFS (nested)
– Global weather: Taiwan GFS (spectral)
– Atmospheric chemistry: NOAA Aeronomy Lab
Key SMS Directives
• Data decomposition
– CSMS$DECLARE_DECOMP
– CSMS$CREATE_DECOMP
– CSMS$DISTRIBUTE
• Communication
– CSMS$EXCHANGE
– CSMS$REDUCE
• Index translation
– CSMS$PARALLEL
• Incremental parallelization
– CSMS$SERIAL
• Performance tuning
– CSMS$FLUSH_OUTPUT
• Debugging support
– CSMS$REDUCE (bitwise exact)
– CSMS$COMPARE_VAR
– CSMS$CHECK_HALO
SMS Serial Code

      program DYNAMIC_MEMORY_EXAMPLE
      parameter(IM = 15)
CSMS$DECLARE_DECOMP(my_dh)
CSMS$DISTRIBUTE(my_dh, 1) BEGIN
      real, allocatable :: x(:)
      real, allocatable :: y(:)
      real xsum
CSMS$DISTRIBUTE END
CSMS$CREATE_DECOMP(my_dh, <IM>, <2>)
      allocate(x(im))
      allocate(y(im))
      open (10, file = 'x_in.dat', form='unformatted')
      read (10) x
CSMS$PARALLEL(my_dh, <i>) BEGIN
      do 100 i = 3, 13
        y(i) = x(i) - x(i-1) - x(i+1) - x(i-2) - x(i+2)
 100  continue
CSMS$EXCHANGE(y)
      do 200 i = 3, 13
        x(i) = y(i) + y(i-1) + y(i+1) + y(i-2) + y(i+2)
 200  continue
      xsum = 0.0
      do 300 i = 1, 15
        xsum = xsum + x(i)
 300  continue
CSMS$REDUCE(xsum, SUM)
CSMS$PARALLEL END
      print *,'xsum = ',xsum
      end
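A minimal Python sketch of the run-time pattern these directives set up (illustrative only; the function names and slab layout are this sketch's invention, not the SMS API): each process owns a block of the decomposed dimension plus a 2-cell halo, CSMS$EXCHANGE refreshes halos from neighbours between the two dependent loops, and CSMS$REDUCE combines per-process partial sums.

```python
# Illustrative model of the SMS example above: block decomposition with
# 2-cell halos, an exchange between the dependent loops, and a global sum.
# (Hypothetical sketch, not the SMS implementation.)

IM, H = 15, 2                       # global size and halo width (stencil +/-2)

def run_serial(x):
    """The serial loops from the slide (0-based: Fortran 3..13 -> 2..12)."""
    y = [0.0] * IM
    for i in range(2, 13):
        y[i] = x[i] - x[i-1] - x[i+1] - x[i-2] - x[i+2]
    x = list(x)
    for i in range(2, 13):
        x[i] = y[i] + y[i-1] + y[i+1] + y[i-2] + y[i+2]
    return sum(x)

def exchange(slabs):
    """CSMS$EXCHANGE: refresh halos from neighbours' owned edge cells."""
    for p, s in enumerate(slabs):
        if p > 0:
            s[:H] = slabs[p - 1][-2 * H:-H]
        if p + 1 < len(slabs):
            s[-H:] = slabs[p + 1][H:2 * H]

def run(nprocs, x):
    """Same computation, with each 'process' working on its own slab."""
    bounds = [(p * IM // nprocs, (p + 1) * IM // nprocs)
              for p in range(nprocs)]
    xs = [[0.0] * H + x[lo:hi] + [0.0] * H for lo, hi in bounds]
    exchange(xs)                                   # make x halos valid
    ys = [[0.0] * len(s) for s in xs]
    for p, (lo, hi) in enumerate(bounds):
        s = xs[p]
        for g in range(max(2, lo), min(13, hi)):   # owned interior points
            l = g - lo + H                         # local index incl. halo
            ys[p][l] = s[l] - s[l-1] - s[l+1] - s[l-2] - s[l+2]
    exchange(ys)                                   # CSMS$EXCHANGE(y)
    for p, (lo, hi) in enumerate(bounds):
        t = ys[p]
        for g in range(max(2, lo), min(13, hi)):
            l = g - lo + H
            xs[p][l] = t[l] + t[l-1] + t[l+1] + t[l-2] + t[l+2]
    # CSMS$REDUCE(xsum, SUM): combine per-process partial sums
    return sum(sum(s[H:-H]) for s in xs)

x = [float(i) for i in range(IM)]
assert run_serial(x) == run(2, x) == run(3, x)    # same answer on any count
```

In real SMS the exchange is an MPI (or SHMEM) message between neighbouring processes; here all slabs live in one program so the copies stand in for the messages.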
Advanced Features
• Nesting
• Incremental parallelization
• Debugging support (run-time configurable)
– CSMS$REDUCE: enables bit-wise exact reductions
– CSMS$CHECK_HALO: verifies a halo region is up to date
– CSMS$COMPARE_VAR: compares variables between simultaneous runs with different numbers of processors
• HYCOM 1-D decomposition parallelized in 9 days
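A sketch of what the halo check amounts to (illustrative Python with hypothetical names; real SMS checks across MPI processes): a halo cell is "up to date" when it equals the owning neighbour's copy of that cell.

```python
# Sketch of the CSMS$CHECK_HALO idea (illustrative; not the SMS code).
# Each "process" holds its owned cells padded by H halo cells; a halo is
# up to date when it mirrors the neighbour's edge of owned data.

H = 2

def check_halo(slabs):
    """Return (pe, side) pairs whose halo disagrees with the owner."""
    stale = []
    for p, s in enumerate(slabs):
        if p > 0 and s[:H] != slabs[p - 1][-2 * H:-H]:
            stale.append((p, "left"))
        if p + 1 < len(slabs) and s[-H:] != slabs[p + 1][H:2 * H]:
            stale.append((p, "right"))
    return stale

# Two slabs over a global array [0..9]; halos filled correctly...
a = [0, 0, 0, 1, 2, 3, 4, 5, 6]   # PE 0: owns 0..4, right halo = 5, 6
b = [3, 4, 5, 6, 7, 8, 9, 0, 0]   # PE 1: owns 5..9, left halo  = 3, 4
assert check_halo([a, b]) == []

b[0] = 99                          # ...then PE 1's left halo goes stale
assert check_halo([a, b]) == [(1, "left")]
```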
Incremental Parallelization
SMS directive: CSMS$SERIAL
Gather: “local” (decomposed) arrays → “global” copies
CALL NOT_PARALLEL(...)   (runs on one process)
Scatter: “global” results → “local” (decomposed) arrays
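The gather/run/scatter pattern above can be sketched in a few lines of Python (illustrative; `not_parallel` and the slab layout are hypothetical stand-ins, not SMS code):

```python
# Sketch of the CSMS$SERIAL gather/scatter pattern (illustrative only).

def gather(slabs):
    """local -> global: concatenate each process's owned block."""
    return [v for s in slabs for v in s]

def scatter(glob, sizes):
    """global -> local: split the global array back into owned blocks."""
    out, i = [], 0
    for n in sizes:
        out.append(glob[i:i + n])
        i += n
    return out

def not_parallel(g):
    """Un-parallelized legacy code, run on one process on global data."""
    return [v * 2 for v in g]

# Decomposed data on 3 "processes" (no halos, for simplicity)
slabs = [[1, 2], [3, 4, 5], [6]]
sizes = [len(s) for s in slabs]

g = gather(slabs)             # CSMS$SERIAL begin: gather to one process
g = not_parallel(g)           # serial region executes unchanged
slabs = scatter(g, sizes)     # CSMS$SERIAL end: scatter results back
assert slabs == [[2, 4], [6, 8, 10], [12]]
```

This is what makes parallelization incremental: un-converted routines keep running serially on gathered data until they are worth parallelizing.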
Advanced Features (contd)
• Overlapping Output with Computations (FORTRAN Style I/O only)
• Run-time Process Configuration
– Specify
• number of processors per decomposed dim or
• number of grid points per processor
• 15% performance boost for HYCOM
– Support for irregular grids coming soon
SMS Performance (Eta)
• Eta model run in production at NCEP for use in National Weather Service Forecasts
• 16000 Lines of Code (excluding comments)
• 198 SMS Directives added to the code
Eta Performance
• Performance measured on NCEP SP2
• I/O excluded
• Resolution: 223x365x45
• 88-PE run time beats NCEP hand-coded MPI by 1%
• 88-PE exchange time beats hand-coded MPI by 17%
Processors Time (sec.) Efficiency (relative to 4 processors)
4 406 1.00
16 103 0.99
64 29.3 0.86
88 23.9 0.80
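The efficiency column is consistent with the usual definition relative to the 4-processor baseline, E(p) = T(4)·4 / (T(p)·p); a quick check (hypothetical helper, illustrative only; since the published times are rounded, recomputed values can differ from the table in the last digit at the higher processor counts):

```python
# Parallel-efficiency arithmetic for the table above, assuming the usual
# definition relative to the 4-processor baseline:
#   E(p) = T(base) * base / (T(p) * p)
# Published times are rounded, so recomputed values may differ slightly
# from the table's last digit at 64 and 88 processors.

def efficiency(t_base, p_base, t, p):
    """Parallel efficiency of a p-processor run against a baseline run."""
    return (t_base * p_base) / (t * p)

assert efficiency(406, 4, 406, 4) == 1.0
assert round(efficiency(406, 4, 103, 16), 2) == 0.99
```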
SMS Performance (HYCOM)
• 4500 Lines of Code (excluding comments)
• 108 OpenMP directives included in the code
• 143 SMS Directives added to the code
HYCOM Performance
• Performance measured on O2K
• Resolution: 135x256x14
• Serial code runs in 136 seconds
Procs OpenMP Time Efficiency SMS Time Efficiency
1 142 0.96 127 1.07
8 22.6 0.75 14.5 1.17
16 12.9 0.66 7.60 1.18
Intro to SMS (contd)
– Extensive documentation available on the web
– New development aided by:
• Regression test suite
• Web-based bug tracking system
Outline
• Who we are
• Intro to SMS
• Application of SMS to ROMS
• Ongoing Work
• Conclusion
SMS ROMS Implementation
• Used awk and cpp to convert to dynamic memory, simplifying SMS parallelization
• Leveraged existing shared memory parallelism:
      do I = ISTR, IEND
• Directives added to handle NEP scenario
• 13000 lines of code, 132 SMS directives
• Handled netCDF I/O with CSMS$SERIAL
Results and Performance
• Runs and produces correct answer on all supported SMS machines
• Low resolution (128x128x30)
– “Jet”, O2K, T3E scaling
– Run times for main loop (21 time steps), excluding I/O
• High resolution (210x550x30)
– In production use at PMEL
– 97% efficiency between 8 and 16 processors on “Jet”
SMS Low Res ROMS “Jet” Performance
Processors Time (sec.) Efficiency
1 (serial code) 153 1.00
4 41.3 0.93
8 21.6 0.89
16 12.6 0.76
SMS Low Res ROMS O2K Performance
Processors Time (sec.) Efficiency
1 (serial code) 298 1.00
8 41.6 0.90
16 22.4 0.83
SMS Low Res ROMS T3E Performance
Processors Time (sec.) Efficiency
8 63.2 1.00
16 35.8 0.88
32 19.5 0.81
Outline
• Who we are
• Intro to SMS
• Application of SMS to ROMS
• Ongoing Work
• Conclusion
Ongoing Work (funding dependent)
• Full F90 Support
• Support for parallel netCDF
• T3E port
• SHMEM implementation on T3E, O2K
• Parallelize other ROMS scenarios
• Implement SMS nested ROMS
• Implement SMS coupled ROMS/COAMPS
Conclusion
• SMS is a high level directive-based tool
• Simple single source parallelization
• Performance optimizations provided
• Strong debugging support included
• Performance beats hand-coded MPI
• SMS is performance portable
Web-Site
www-ad.fsl.noaa.gov/ac/sms.html