Operational Forecasting on the SGI Origin 3800 and Linux Clusters
Roar Skålin
Norwegian Meteorological Institute
CAS 2001, Annecy, 31.10.2001
Contributions by: Dr. D. Bjørge, Dr. O. Vignes, Dr. E. Berge and T. Bø
DNMI Atmospheric Models
• Weather Forecasting
– HIRLAM (HIgh Resolution Limited Area Model)
– 3D VAR, hydrostatic, semi-implicit, semi-Lagrangian
– Parallelisation by SHMEM and MPI (see the halo-exchange sketch below)
– Resolutions: 50 km -> 20 km, 10 km, 5 km
• Air Quality Forecasting (Clean City Air):
– HIRLAM: 10 km
– MM5: 3 km and 1 km
– AirQUIS: Emission database, Eulerian dispersion model, sub-grid treatment of line and point sources, receptor point calculations
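As context for the SHMEM/MPI bullet above: grid-point models of this kind decompose the domain across processors and swap halo (boundary) zones with their nearest neighbours every time step. A minimal, hypothetical C/MPI sketch of a one-dimensional halo exchange is given below; the sizes and data layout are invented for illustration and this is not HIRLAM code.

/* Minimal halo-exchange sketch (hypothetical, not HIRLAM code):
 * each rank owns NLOCAL columns of the grid plus one ghost column on
 * either side, and swaps boundary columns with its left and right
 * neighbours before each time step. */
#include <mpi.h>

#define NLEV   31   /* vertical levels, as in HIRLAM 50 */
#define NLOCAL 64   /* columns owned by this rank (illustrative) */

int main(int argc, char **argv)
{
    double field[NLOCAL + 2][NLEV];  /* rows 0 and NLOCAL+1 are ghosts */
    int rank, size, left, right;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* ... initialise the owned columns field[1..NLOCAL] ... */

    /* Ship my first owned column left, receive my right ghost. */
    MPI_Sendrecv(field[1],          NLEV, MPI_DOUBLE, left,  0,
                 field[NLOCAL + 1], NLEV, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Ship my last owned column right, receive my left ghost. */
    MPI_Sendrecv(field[NLOCAL],     NLEV, MPI_DOUBLE, right, 1,
                 field[0],          NLEV, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

In practice the decomposition is two-dimensional, and the SHMEM variant expresses the same pattern with one-sided puts rather than matched send/receive pairs.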
DNMI Operational Computers
• Gridur: SGI Origin 3800, 220 PE / 220 GB, 400 MHz MIPS, Trix OS; compute server
• Cluster: Scali TeraRack, 20 PE / 5 GB, 800 MHz Intel PIII, Linux OS; compute server
• Monsoon: SGI Origin 2000, 4 PE / 2 GB, Irix OS; System Monitoring and Scheduling (SMS)
• Wide-area link (500 km): peak 100 Mbit/s, ftp 55 Mbit/s, scp 20 Mbit/s
• Local link (2 m): peak 100 Mbit/s, ftp 44 Mbit/s
DNMI Operational Schedule
• Monsoon runs SMS (System Monitoring and Scheduling), which drives the production chain
• Input data: EC frames at 01:20, observations at 02:15
• Gridur: Hirlam 50 at 02:30, Hirlam 20 at 03:15, Hirlam 10 at 03:30
• Cluster: MM5 at 05:00
• NT systems: AirQUIS at 05:50
• Met. Workstation: Hirlam 50 at 02:30, Hirlam 10 at 03:30, MM5 at 05:00, AirQUIS at 06:00
Cray T3E vs. SGI Origin 3800
• HIRLAM 50 on Cray T3E:
– Version 2.6 of HIRLAM
– DNMI specific data assimilation and I/O
– 188 x 152 x 31 grid points
– Run on 84 EV5 300 MHz processors
• HIRLAM 20 on SGI Origin 3800:
– Version 4.2 of HIRLAM
– 3D VAR and GRIB I/O
– 468 x 378 x 40 grid points
– Run on 210 MIPS R14K 400 MHz processors
Cray T3E vs. SGI Origin 3800
[Bar chart: share of runtime (0-60 %) spent in Dynamics, Physics, Diffusion and Init on the T3E and the O3800.]
HIRLAM 50 on Cray T3E, 84 PEs vs. HIRLAM 20 on SGI Origin 3800, 210 PEs.
Cray T3E vs. SGI Origin 3800
[Bar chart: share of runtime (0-90 %) in which all processors compute vs. one or more processors communicate or wait, on the T3E and the O3800.]
HIRLAM 50 on Cray T3E, 84 PEs vs. HIRLAM 20 on SGI Origin 3800, 210 PEs.
O3800 Algorithmic Challenges
• Reduce the number of messages and synchronisation points:
– Use of buffers in nearest neighbour communication
– Develop new algorithms for data transposition
– Remove unnecessary statistics
• Parallel I/O:
– Asynchronous I/O on a dedicated set of processors (see the sketch below)
• Dynamic load balancing
• Single node optimisation:
– Currently far less important than on the Cray T3E
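The dedicated-I/O idea can be sketched in a few lines of MPI: reserve a handful of ranks for output, give the compute ranks a communicator of their own, and let them hand fields to an I/O rank with a nonblocking send so the model can start the next time step while the write proceeds. The sketch below is hypothetical (the rank mapping, NIO and NFIELD are invented), not the HIRLAM implementation.

/* Dedicated-I/O-processor sketch (hypothetical, not HIRLAM code):
 * the last NIO ranks only do I/O; each compute rank ships its output
 * field to "its" I/O rank with a nonblocking send and can proceed
 * with the next time step while the field is written to disk.
 * Run with more than NIO ranks. */
#include <mpi.h>

#define NIO    2      /* ranks reserved for I/O (illustrative) */
#define NFIELD 1024   /* size of one output field (illustrative) */

int main(int argc, char **argv)
{
    int rank, size, is_io;
    MPI_Comm group;   /* compute ranks get their own communicator, so
                         model collectives never involve I/O ranks */
    double field[NFIELD] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    is_io = (rank >= size - NIO);
    MPI_Comm_split(MPI_COMM_WORLD, is_io, rank, &group);

    if (is_io) {
        /* Receive one field from every compute rank assigned to me. */
        int me = rank - (size - NIO), nclients = 0;
        for (int r = 0; r < size - NIO; r++)
            if (r % NIO == me) nclients++;
        for (int i = 0; i < nclients; i++) {
            MPI_Recv(field, NFIELD, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... write the received field to disk (e.g. GRIB) ... */
        }
    } else {
        MPI_Request req;
        int target = size - NIO + rank % NIO;   /* my I/O rank */
        /* ... compute a time step and fill field ... */
        MPI_Isend(field, NFIELD, MPI_DOUBLE, target, 0,
                  MPI_COMM_WORLD, &req);
        /* ... overlap: the next time step can start here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* before reusing field */
    }

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}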
O3800 System Challenges
• Interference from other users:
– CPU: Must suspend all other jobs, even if we run on a subset of the system
– Memory: Global swapping under TRIX/IRIX
– Interactive processes: Cannot be suspended
• Security:
– Scp is substantially slower than ftp
– TRIX is not a problem
• Communication on a system level:
– Memory: Use local memory if possible (see the first-touch sketch below)
– I/O: CXFS, NFS, directly mounted disks
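On a ccNUMA machine like the Origin, "use local memory if possible" largely comes down to first-touch page placement: a page ends up on the node of the processor that first writes it. The generic OpenMP idiom is sketched below; this is my illustration, not DNMI code, and it assumes the OS defaults to a first-touch policy.

/* First-touch placement sketch for ccNUMA (generic illustration):
 * pages are allocated on the node of the thread that first writes
 * them, so initialising in parallel, with the same loop schedule as
 * the compute loop, keeps later accesses node-local. */
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (a == NULL)
        return 1;

    /* Parallel initialisation: each thread first-touches the pages
       it will use in the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the same static schedule: accesses stay
       local to each thread's memory node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;

    free(a);
    return 0;
}

A serial initialisation would instead place every page on the master thread's node and turn the compute loop into remote memory traffic.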
Clean City Air
• Collaborative effort of:
– The Norwegian Public Roads Administration
– The Municipality of Oslo
– The Norwegian Meteorological Institute
– Norwegian Institute for Air Research
Main Aims
• Reduce the undesired effects of wintertime air pollution in Norwegian cities
• Components: NO2, PM10 (PM2.5)
• Develop a standardised, science-based forecast system for air pollution in Norwegian cities
• Develop a basis for decision makers who want to control emissions on winter days with high concentration levels
Modelling Domains
AirQUIS Output Domain Oslo
Scali TeraRack
• 10 dual nodes:
– Two 800 MHz Pentium III
– 512 MByte RAM
– 30 GB IDE disk
– Dolphin interconnect
• Software:
– RedHat Linux 6.2
– Scali MPI implementation
– PGI compilers
– OpenPBS queuing system
MM5 on the TeraRack
[Bar chart: MM5 run time (y-axis 0-160) for three configurations: MPI only, MPI and OpenMP, and with inlining.]
Target: 90 minutes to complete a 3 and 1 km run for the Oslo area.
MM5 on the TeraRack
• Modifications to MM5:
– No changes to the source code
– Changes to configuration files
– Inlined eight routines: DCPL3D, BDYTEN, SOLVE, EQUATE, DM_BCAST, EXCHANJ, ADDRX1C, SINTY (see the sketch below)
• Struggled with one bug in the PGI runtime environment and a few Scali bugs
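Why inlining matters here: small routines like these are called per grid point or per column inside hot loops, so the call overhead and the optimisation barrier at each call site dominate; inlining removes both. A generic C illustration follows (the routine and loop are invented, not MM5 code):

/* Generic illustration of why inlining small, hot routines pays off
 * (invented example, not MM5 code): 'static inline' lets the
 * compiler drop the per-point call and optimise across the former
 * routine boundary, much as the PGI inliner did for MM5. */
static inline double damp(double x, double c)
{
    return x - c * x;                   /* trivial per-point operation */
}

void step(double *field, long n, double c)
{
    for (long i = 0; i < n; i++)
        field[i] = damp(field[i], c);   /* call vanishes after inlining */
}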
Conclusions
• Shared Memory (SM) vs. Distributed Memory (DM):
– Performance of communication algorithms may differ significantly
– DM systems best for single user (peak), SM better for multi-user systems (throughput)
– SM easy to use for "new" users of parallel systems, DM easier for "experienced" users
• Linux Clusters:
– So inexpensive that you can't afford to optimise code
– So inexpensive that you can afford to buy a backup system
– Main limitations: interconnect and I/O