ESMF Performance Evaluation and Optimization
Peggy Li(1), Samson Cheung(2), Gerhard Theurich(2), Cecelia Deluca(3)
(1) Jet Propulsion Laboratory, California Institute of Technology, USA; (2) Silicon Graphics Inc., USA; (3) National Center for Atmospheric Research (NCAR), USA
Objective: We report the results of two performance studies conducted on ESMF applications. The first is a grid-redistribution overhead benchmark based on two grids of different resolution used in CCSM (the Community Climate System Model); the second is a scalability evaluation of the ESMF superstructure functions on large numbers of processors.
1. CCSM Grid Redistribution Benchmark
Background: CCSM is a fully coupled, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. CCSM 3.0 consists of four dynamical geophysical models, namely the Community Atmosphere Model (CAM), the Community Land Model (CLM), the Parallel Ocean Program (POP), and the Community Sea-Ice Model (CSIM), linked by a central coupler.
The CCSM coupler controls the execution and time evolution of the coupled CCSM system by synchronizing and controlling the flow of data between the various components. The current CCSM coupler is built on top of MCT (the Model Coupling Toolkit).
In this study, we benchmark the performance of one major CCSM coupler function: the grid redistribution from the atmosphere model to the land model. The CCSM3 atmosphere model (CAM) and land model (CLM) share a common horizontal grid. The two resolutions benchmarked are T85, a Gaussian grid with 256 longitude points and 128 latitude points, and T42, a Gaussian grid with 128 longitude points and 64 latitude points.
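For a sense of problem size, the point counts of the two grids (and the average share per processor in the 8-way decomposition of Figure 1.a) can be computed directly; this is a small illustrative sketch, not code from the benchmark:

```python
# Illustrative only: horizontal point counts for the two Gaussian grids
# benchmarked, and the average points per PET for an 8-way decomposition.
def grid_points(nlon, nlat):
    """Total number of horizontal grid points (longitudes x latitudes)."""
    return nlon * nlat

t42 = grid_points(128, 64)    # T42 resolution
t85 = grid_points(256, 128)   # T85 resolution

print(t42)        # 8192 points
print(t85)        # 32768 points
print(t42 // 8)   # 1024 points per processor, on average, for 8 PETs
```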
Figure 1.a CAM T42 Grid (128x64) Decomposition on 8 processors
Figure 1.b CLM T42 Grid (128x64) Decomposition on 8 processors
Benchmark Program
Our benchmark program contains four components: an Atmosphere Grid Component (ATM), a Land Grid Component (LND), an Atmosphere-to-Land Coupler Component (A2L), and a Land-to-Atmosphere Coupler Component (L2A). The ATM component creates a 2D arbitrarily distributed global rectangular grid and a bundle of 19 floating-point fields associated with the grid. The decomposition of a T42-resolution ATM grid on 8 processors is depicted in Figure 1.a. The LND component contains a bundle of 13 floating-point fields on the land portion of the same 2D global rectangular grid. The LND grid is arbitrarily distributed on 8 processors as shown in Figure 1.b, where dark blue represents no data. The A2L and L2A components perform grid redistribution from the ATM grid to the LND grid and vice versa.
ESMF handles data redistribution in two stages: an initialization stage, which precomputes the communication pattern required to perform the redistribution, and the actual data redistribution stage. Our benchmark program measures the performance of the bundle-level redistribution functions ESMF_BundleRedistStore() and ESMF_BundleRedistRun() between an arbitrarily distributed ATM grid and an arbitrarily distributed LND grid.
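The store/run split above amortizes an expensive setup over many time steps. A minimal Python sketch of the pattern (hypothetical names, not the ESMF implementation): the "store" step precomputes a source-to-destination index route once, and the "run" step is pure data movement that reuses it.

```python
# Hypothetical sketch of the two-stage redistribution pattern behind
# ESMF_BundleRedistStore()/ESMF_BundleRedistRun(); not the ESMF code.

def redist_store(src_index, dst_index):
    """Precompute the route: for each destination slot, find where its
    value lives.  src_index/dst_index map global grid index -> local
    position on this PET."""
    return [(dst_pos, src_index[g])
            for g, dst_pos in dst_index.items() if g in src_index]

def redist_run(route, src_data, dst_data):
    """Apply a precomputed route: no searching, just copies."""
    for dst_pos, src_pos in route:
        dst_data[dst_pos] = src_data[src_pos]
    return dst_data

# Source holds global points 0..5; destination wants points 2, 3, 4.
src_index = {g: g for g in range(6)}
dst_index = {2: 0, 3: 1, 4: 2}
route = redist_store(src_index, dst_index)       # computed once

dst = redist_run(route, [10, 11, 12, 13, 14, 15], [0, 0, 0])
print(dst)   # [12, 13, 14]
```

In the real framework the route also encodes inter-PET sends and receives; the benchmark times exactly this split, store once versus run per step.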
Contact: [email protected]
Full Reports: www.esmf.ucar.edu/main_site/performance.htm
Acknowledgment: This task is sponsored by the Modeling, Analysis and Prediction (MAP) Program, National Aeronautics and Space Administration (NASA).
Results
We ran the benchmark program on the IBM SP cluster at NCAR and the Cray X1E at Cray Inc., using 8 to 128 processors. We measured ESMF_BundleRedistStore() and ESMF_BundleRedistRun() in both the A2L and L2A components and compared the timing results on the two platforms. In summary, the Cray X1E performs worse than the IBM SP in both functions. The performance of data redistribution using ESMF is comparable to CCSM's current MCT-based approach on both the IBM SP and the Cray X1E.
A. T42 Grid
B. T85 Grid
CCSM T42 Grid: 128x64 (time in milliseconds)

                ESMF_BundleRedistStore        ESMF_BundleRedistRun
# Processors    Init (X1E)    Init (IBM)      Run (X1E)    Run (IBM)
8               357.0178      40.5002         16.5927      1.2776
16              218.2901      34.1019         14.8972      1.5684
32              389.4656      31.3586         34.9280      1.9814
64              425.2421      29.9956         59.4735      2.9228
CCSM T85 Grid: 256x128 (time in milliseconds)

                ESMF_BundleRedistStore        ESMF_BundleRedistRun
# Processors    Init (X1E)    Init (IBM)      Run (X1E)    Run (IBM)
16              924.6599      150.6831        30.0566      4.0421
32              1087.6294     140.4841        40.5920      3.3827
64              1149.7676     124.6631        64.6535      4.6124
128             1728.3291     128.1008        129.1839     7.5746
[Figure: ESMF Grid Redistribution Run Time (256x128 grid); time in milliseconds vs. number of processors, for run (X1E) and run (IBM).]
Optimization:
1. We optimized ESMF_BundleRedistStore() by redesigning an ESMF Route function, ESMF_RoutePrecomputeRedistV(), that calculates the send and receive route tables in each PET. The new algorithm sorts the local and the global grid points by grid index to reduce the time needed to calculate the intersection of the source and destination grids.
2. We identified two functions that perform poorly on the X1E, namely MPI_Bcast() and memcpy(). We replaced a loop of MPI_Bcast() calls with a single MPI_Allgatherv() in ESMF_BundleRedistStore(). We also replaced the memcpy() used to copy user data into the message buffer in ESMF_BundleRedistRun() with assignment statements. These two modifications improved the X1E performance significantly.
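The idea behind optimization 1 can be sketched in a few lines: once both index lists are sorted, the source/destination intersection becomes a single linear merge instead of a nested search. This is an illustration of the technique, not the ESMF_RoutePrecomputeRedistV() code itself:

```python
# Illustrative sketch: intersecting two sorted lists of global grid
# indices with a linear merge, O(|a| + |b|) instead of a nested scan.
def intersect_sorted(a, b):
    """Return the common elements of two sorted index lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1           # advance past a source index with no match
        else:
            j += 1           # advance past a destination index with no match
    return out

src = sorted([7, 1, 4, 9, 2])      # a source PET's global indices
dst = sorted([4, 2, 8, 9])         # a destination PET's global indices
print(intersect_sorted(src, dst))  # [2, 4, 9]
```

The one-time sort costs O(n log n) per PET, but each subsequent pairwise intersection is linear, which is what reduces the store-phase time.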
2. ESMF Superstructure Scalability Benchmark
This benchmark program evaluates the performance of the ESMF superstructure functions on a large number of processors, i.e., over 1000. The ESMF superstructure functions include ESMF initialization and termination (ESMF_Initialize(), ESMF_Finalize()) and component creation, initialization, execution, and termination (ESMF_GridCompCreate(), ESMF_GridCompInit(), ESMF_GridCompRun(), and ESMF_GridCompFinalize()). We conducted the performance evaluation on the Cray XT3, jaguar, at Oak Ridge National Laboratory and the SGI Altix supercluster, columbia, at NASA Ames. We ran the benchmark from 4 processors up to 2048 processors.
Timing Results on XT3
The performance of ESMF_Initialize() and ESMF_Finalize() is dominated by the parallel I/O performance of the target machine because, by default, each processor opens an error log file at ESMF initialization (defaultLogType = ESMF_LOG_MULTI). By setting defaultLogType to ESMF_LOG_NONE, ESMF_Initialize() and ESMF_Finalize() run 200 times faster for 128 processors and above. The timings for these two functions with and without an error log file are shown below.
The overheads of the ESMF component functions are very small; ESMF_GridCompRun() takes less than 20 microseconds for up to 2048 processors. However, except for ESMF_GridCompFinalize(), the other three functions have complexity O(n), where n is the number of processors. The following table and figures depict the timings of these four component functions on the XT3.
[Figure: ESMF Component Routines Timing on XT3; time in microseconds vs. number of processors (log-log), for GridCompCreate, GridCompInit, GridCompRun, and GridCompFinalize.]
# Processors   GridCompCreate   GridCompInit   GridCompRun   GridCompFinalize
4              59.10            37.20          1.98682       7.90
8              76.10            47.90          1.98682       10.00
16             95.80            64.10          1.98682       12.10
32             122.10           74.10          2.30471       14.80
64             161.80           91.00          2.62260       16.90
128            266.10           105.10         3.01996       19.80
256            604.90           108.90         4.60942       23.80
512            1871.80          115.10         6.99361       27.90
1024           6957.10          430.01         11.28514      35.00
2048           28701.00         998.97         20.98083      47.20
IBM SP2 (bluevista)
  CPU type and speed: IBM POWER5, 7.6 GFLOPS/sec
  Memory: 2 GB/processor, 16 GB shared/node, 1 TB total
  Total processors: 624 (78 8-processor nodes)
  Aggregate performance: 4.74 TFLOPS
  Network: IBM High Performance Switch (HPS), 5 microsecond latency

Cray X1E (earth)
  CPU type and speed: MSP (multi-streaming processor), 18 GFLOPS/sec
  Memory: 16 GB/compute module, 512 GB total
  Total processors: 128 MSPs (32 4-MSP compute modules)
  Aggregate performance: 2.3 TFLOPS
  Network: DSM architecture, 2D torus, 34 GB/s memory bandwidth

Cray XT3/XT4 (jaguar)
  CPU type and speed: dual-core AMD Opteron, 2.6 GHz
  Memory: 4 GB/processor, 46 TB total
  Total processors: 11,508
  Aggregate performance: 119 TFLOPS
  Network: Cray SeaStar router, 3D torus

SGI Altix (columbia)
  CPU type and speed: Intel Itanium, 6 GFLOPS/sec
  Memory: 2 GB/processor, 1 TB shared/node
  Total processors: 10,240 (16 512-processor nodes and one 2048-processor system)
  Aggregate performance: 51.9 TFLOPS
  Network: SGI NUMAlink fabric, 1 microsecond latency
XT3 and Altix Comparison
We compared the timing results for the six ESMF superstructure functions on the Cray XT3 and the SGI Altix. The timing charts are shown below.
[Figure: ESMF_Initialize timing on XT3; time in milliseconds vs. number of processors (log-log), with multiple log files vs. no log files.]
[Figure: ESMF_Finalize timing on XT3; time in milliseconds vs. number of processors (log-log), with multiple log files vs. no log file.]
[Figure (C): ESMF_GridCompCreate time (XT3 vs. Altix); time in microseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (D): ESMF_GridCompInit time (XT3 vs. Altix); time in microseconds vs. number of processors, for Columbia and Cray XT3.]
[Figure (E): ESMF_GridCompRun time (XT3 vs. Altix); time in microseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (F): ESMF_GridCompFinalize time (XT3 vs. Altix); time in microseconds vs. number of processors, for Columbia and Cray XT3.]
ESMF component function overheads on XT3 (times in microseconds)
Comparison of the Four Benchmark Machines
[Figure: ESMF Grid Redistribution Initialization Time (128x64 grid); time in milliseconds vs. number of processors, for Init (X1E) and Init (IBM).]
[Figure: ESMF Grid Redistribution Run Time (128x64 grid); time in milliseconds vs. number of processors, for run (X1E) and run (IBM).]
[Figure: ESMF Grid Redistribution Init Time (256x128 grid); time in milliseconds vs. number of processors, for Init (X1E) and Init (IBM).]
The ESMF_Initialize() and ESMF_Finalize() timings shown in (A) and (B) were measured with defaultLogType set to ESMF_LOG_NONE. The Altix performs worse than the XT3 in both functions due to a synchronization problem and its MPI implementation. For ESMF_Initialize(), the time difference between the two machines is due to a global synchronization in the first MPI global operation called in the function, MPI_Comm_create(). On the Altix, MPI_Finalize() takes about 1 second regardless of the number of processors used, which dominates the time for ESMF_Finalize().
The component functions have similar performance on both machines ((C) to (F)). The timings for ESMF_GridCompRun() (E) are very close on the two machines, with the XT3 slightly better in all configurations: on 1024 processors, it takes 11.28 microseconds on the XT3 and 13.84 microseconds on the Altix.
[Figure (A): ESMF_Initialize time (XT3 vs. Altix); time in milliseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (B): ESMF_Finalize time (XT3 vs. Altix); time in milliseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]