ESMF Performance Evaluation and Optimization
Peggy Li(1), Samson Cheung(2), Gerhard Theurich(2), Cecelia Deluca(3)
(1) Jet Propulsion Laboratory, California Institute of Technology, USA; (2) Silicon Graphics Inc., USA; (3) National Center for Atmospheric Research (NCAR), USA
Objective: We report the results of two performance studies conducted on ESMF applications. The first is a grid-redistribution overhead benchmark based on two grids of different resolution used in CCSM (the Community Climate System Model); the second is a scalability evaluation of the ESMF superstructure functions on large numbers of processors.
1. CCSM Grid Redistribution Benchmark
Background: CCSM is a fully coupled, global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states. CCSM 3.0 consists of four dynamical geophysical models, namely the Community Atmosphere Model (CAM), the Community Land Model (CLM), the Parallel Ocean Program (POP), and the Community Sea-Ice Model (CSIM), linked by a central coupler.
The CCSM coupler controls the execution and time evolution of the coupled CCSM system by synchronizing and controlling the flow of data between the various components. The current CCSM coupler is built on top of MCT (the Model Coupling Toolkit).
In this study, we benchmark the performance of one major CCSM coupler function: the grid redistribution from the atmosphere model to the land model. The CCSM3 atmosphere model (CAM) and land model (CLM) share a common horizontal grid. The two resolutions benchmarked are T85, a Gaussian grid with 256 longitude points and 128 latitude points, and T42, a Gaussian grid with 128 longitude points and 64 latitude points.
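For a sense of problem size, the point counts of the two grids (and the average share per processor in the 8-way decomposition of Figure 1.a) can be computed directly; this is a small illustrative sketch, not code from the benchmark:

```python
# Illustrative only: horizontal point counts for the two Gaussian grids
# benchmarked, and the average points per PET for an 8-way decomposition.
def grid_points(nlon, nlat):
    """Total number of horizontal grid points (longitudes x latitudes)."""
    return nlon * nlat

t42 = grid_points(128, 64)    # T42 resolution
t85 = grid_points(256, 128)   # T85 resolution

print(t42)        # 8192 points
print(t85)        # 32768 points
print(t42 // 8)   # 1024 points per processor, on average, for 8 PETs
```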
Figure 1.a CAM T42 Grid (128x64) Decomposition on 8 processors
Figure 1.b CLM T42 Grid (128x64) Decomposition on 8 processors
Benchmark Program
Our benchmark program contains four components: an Atmosphere Grid Component (ATM), a Land Grid Component (LND), an Atmosphere-to-Land Coupler Component (A2L), and a Land-to-Atmosphere Coupler Component (L2A). The ATM component creates a 2D arbitrarily distributed global rectangular grid and a bundle of 19 floating-point fields associated with the grid. The decomposition of a T42-resolution ATM grid on 8 processors is depicted in Figure 1.a. The LND component contains a bundle of 13 floating-point fields on the land portion of the same 2D global rectangular grid. The LND grid is arbitrarily distributed on 8 processors as shown in Figure 1.b, where dark blue represents no data. The A2L and L2A components perform grid redistribution from the ATM grid to the LND grid and vice versa.
ESMF handles data redistribution in two stages: an initialization stage, which precomputes the communication pattern required to perform the redistribution, and the actual data redistribution stage. Our benchmark program measures the performance of the bundle-level redistribution functions ESMF_BundleRedistStore() and ESMF_BundleRedistRun() between an arbitrarily distributed ATM grid and an arbitrarily distributed LND grid.
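The store/run split above amortizes an expensive setup over many time steps. A minimal Python sketch of the pattern (hypothetical names, not the ESMF implementation): the "store" step precomputes a source-to-destination index route once, and the "run" step is pure data movement that reuses it.

```python
# Hypothetical sketch of the two-stage redistribution pattern behind
# ESMF_BundleRedistStore()/ESMF_BundleRedistRun(); not the ESMF code.

def redist_store(src_index, dst_index):
    """Precompute the route: for each destination slot, find where its
    value lives.  src_index/dst_index map global grid index -> local
    position on this PET."""
    return [(dst_pos, src_index[g])
            for g, dst_pos in dst_index.items() if g in src_index]

def redist_run(route, src_data, dst_data):
    """Apply a precomputed route: no searching, just copies."""
    for dst_pos, src_pos in route:
        dst_data[dst_pos] = src_data[src_pos]
    return dst_data

# Source holds global points 0..5; destination wants points 2, 3, 4.
src_index = {g: g for g in range(6)}
dst_index = {2: 0, 3: 1, 4: 2}
route = redist_store(src_index, dst_index)       # computed once

dst = redist_run(route, [10, 11, 12, 13, 14, 15], [0, 0, 0])
print(dst)   # [12, 13, 14]
```

In the real framework the route also encodes inter-PET sends and receives; the benchmark times exactly this split, store once versus run per step.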
Contact: [email protected]
Full Reports: www.esmf.ucar.edu/main_site/performance.htm
Acknowledgment: This task is sponsored by the Modeling, Analysis and Prediction (MAP) Program, National Aeronautics and Space Administration (NASA).
Results
We ran the benchmark program on the IBM SP cluster at NCAR and the Cray X1E at Cray Inc., using 8 to 128 processors. We measured ESMF_BundleRedistStore() and ESMF_BundleRedistRun() in both the A2L and L2A components and compared the timing results on the two platforms. In summary, the Cray X1E performs worse than the IBM SP in both functions. The performance of data redistribution using ESMF is comparable to CCSM's current MCT-based approach on both the IBM SP and the Cray X1E.
A. T42 Grid
B. T85 Grid
CCSM T42 Grid: 128x64 (time in milliseconds)

                ESMF_BundleRedistStore        ESMF_BundleRedistRun
# Processors    Init (X1E)    Init (IBM)      Run (X1E)    Run (IBM)
8               357.0178      40.5002         16.5927      1.2776
16              218.2901      34.1019         14.8972      1.5684
32              389.4656      31.3586         34.9280      1.9814
64              425.2421      29.9956         59.4735      2.9228
CCSM T85 Grid: 256x128 (time in milliseconds)

                ESMF_BundleRedistStore        ESMF_BundleRedistRun
# Processors    Init (X1E)    Init (IBM)      Run (X1E)    Run (IBM)
16              924.6599      150.6831        30.0566      4.0421
32              1087.6294     140.4841        40.5920      3.3827
64              1149.7676     124.6631        64.6535      4.6124
128             1728.3291     128.1008        129.1839     7.5746
[Figure: ESMF Grid Redistribution Run Time (256x128 grid); time in milliseconds vs. number of processors, for run (X1E) and run (IBM).]
Optimization:
1. We optimized ESMF_BundleRedistStore() by redesigning an ESMF Route function, ESMF_RoutePrecomputeRedistV(), that calculates the send and receive route tables in each PET. The new algorithm sorts the local and the global grid points by grid index to reduce the time needed to calculate the intersection of the source and destination grids.
2. We identified two functions that perform poorly on the X1E, namely MPI_Bcast() and memcpy(). We replaced a loop of MPI_Bcast() calls with a single MPI_Allgatherv() in ESMF_BundleRedistStore(). We also replaced the memcpy() used to copy user data into the message buffer in ESMF_BundleRedistRun() with assignment statements. These two modifications improved the X1E performance significantly.
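The idea behind optimization 1 can be sketched in a few lines: once both index lists are sorted, the source/destination intersection becomes a single linear merge instead of a nested search. This is an illustration of the technique, not the ESMF_RoutePrecomputeRedistV() code itself:

```python
# Illustrative sketch: intersecting two sorted lists of global grid
# indices with a linear merge, O(|a| + |b|) instead of a nested scan.
def intersect_sorted(a, b):
    """Return the common elements of two sorted index lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1           # advance past a source index with no match
        else:
            j += 1           # advance past a destination index with no match
    return out

src = sorted([7, 1, 4, 9, 2])      # a source PET's global indices
dst = sorted([4, 2, 8, 9])         # a destination PET's global indices
print(intersect_sorted(src, dst))  # [2, 4, 9]
```

The one-time sort costs O(n log n) per PET, but each subsequent pairwise intersection is linear, which is what reduces the store-phase time.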
2. ESMF Superstructure Scalability Benchmark
This benchmark program evaluates the performance of the ESMF superstructure functions on a large number of processors, i.e., over 1000. The ESMF superstructure functions include ESMF initialization and termination (ESMF_Initialize(), ESMF_Finalize()) and component creation, initialization, execution, and termination (ESMF_GridCompCreate(), ESMF_GridCompInit(), ESMF_GridCompRun(), and ESMF_GridCompFinalize()). We conducted the performance evaluation on the Cray XT3, jaguar, at Oak Ridge National Laboratory and the SGI Altix supercluster, columbia, at NASA Ames. We ran the benchmark from 4 processors up to 2048 processors.
Timing Results on XT3
The performance of ESMF_Initialize() and ESMF_Finalize() is dominated by the parallel I/O performance of the target machine because, by default, each processor opens an error log file at ESMF initialization (defaultLogType = ESMF_LOG_MULTI). By setting defaultLogType to ESMF_LOG_NONE, ESMF_Initialize() and ESMF_Finalize() run 200 times faster for 128 processors and above. The timings for these two functions with and without an error log file are shown below.
The overheads of the ESMF component functions are very small; ESMF_GridCompRun() takes less than 20 microseconds for up to 2048 processors. However, except for ESMF_GridCompFinalize(), the other three functions have complexity O(n), where n is the number of processors. The following table and figures depict the timings of these four component functions on the XT3.
[Figure: ESMF Component Routines Timing on XT3; time in microseconds vs. number of processors (log-log), for GridCompCreate, GridCompInit, GridCompRun, and GridCompFinalize.]
# Processors   GridCompCreate   GridCompInit   GridCompRun   GridCompFinalize
4              59.10            37.20          1.98682       7.90
8              76.10            47.90          1.98682       10.00
16             95.80            64.10          1.98682       12.10
32             122.10           74.10          2.30471       14.80
64             161.80           91.00          2.62260       16.90
128            266.10           105.10         3.01996       19.80
256            604.90           108.90         4.60942       23.80
512            1871.80          115.10         6.99361       27.90
1024           6957.10          430.01         11.28514      35.00
2048           28701.00         998.97         20.98083      47.20
IBM SP2 (bluevista)
  CPU type and speed: IBM POWER5, 7.6 GFLOPS/sec
  Memory: 2 GB/processor, 16 GB shared/node, 1 TB total
  Total processors: 624 (78 8-processor nodes)
  Aggregate performance: 4.74 TFLOPS
  Network: IBM High Performance Switch (HPS), 5 microsecond latency

Cray X1E (earth)
  CPU type and speed: MSP (multi-streaming processor), 18 GFLOPS/sec
  Memory: 16 GB/compute module, 512 GB total
  Total processors: 128 MSPs (32 4-MSP compute modules)
  Aggregate performance: 2.3 TFLOPS
  Network: DSM architecture, 2D torus, 34 GB/s memory bandwidth

Cray XT3/XT4 (jaguar)
  CPU type and speed: dual-core AMD Opteron, 2.6 GHz
  Memory: 4 GB/processor, 46 TB total
  Total processors: 11,508
  Aggregate performance: 119 TFLOPS
  Network: Cray SeaStar router, 3D torus

SGI Altix (columbia)
  CPU type and speed: Intel Itanium, 6 GFLOPS/sec
  Memory: 2 GB/processor, 1 TB shared/node
  Total processors: 10,240 (16 512-processor nodes and one 2048-processor system)
  Aggregate performance: 51.9 TFLOPS
  Network: SGI NUMAlink fabric, 1 microsecond latency
XT3 and Altix Comparison
We compared the timing results for the six ESMF superstructure functions on the Cray XT3 and the SGI Altix. The timing charts are shown below.
[Figure: ESMF_Initialize timing on XT3; time in milliseconds vs. number of processors (log-log), with multiple log files vs. no log files.]
[Figure: ESMF_Finalize timing on XT3; time in milliseconds vs. number of processors (log-log), with multiple log files vs. no log file.]
[Figure (C): ESMF_GridCompCreate time (XT3 vs. Altix); time in microseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (D): ESMF_GridCompInit time (XT3 vs. Altix); time in microseconds vs. number of processors, for Columbia and Cray XT3.]
[Figure (E): ESMF_GridCompRun time (XT3 vs. Altix); time in microseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (F): ESMF_GridCompFinalize time (XT3 vs. Altix); time in microseconds vs. number of processors, for Columbia and Cray XT3.]
ESMF component function overheads on XT3 (times in microseconds)
Comparison of the Four Benchmark Machines
[Figure: ESMF Grid Redistribution Initialization Time (128x64 grid); time in milliseconds vs. number of processors, for Init (X1E) and Init (IBM).]
[Figure: ESMF Grid Redistribution Run Time (128x64 grid); time in milliseconds vs. number of processors, for run (X1E) and run (IBM).]
[Figure: ESMF Grid Redistribution Init Time (256x128 grid); time in milliseconds vs. number of processors, for Init (X1E) and Init (IBM).]
The ESMF_Initialize() and ESMF_Finalize() timings shown in (A) and (B) were measured with defaultLogType set to ESMF_LOG_NONE. The Altix performs worse than the XT3 in both functions due to a synchronization problem and its MPI implementation. For ESMF_Initialize(), the time difference between the two machines is due to a global synchronization in the first MPI global operation called in the function, MPI_Comm_create(). On the Altix, MPI_Finalize() takes about 1 second regardless of the number of processors used, which dominates the time for ESMF_Finalize().
The component functions have similar performance on both machines ((C) to (F)). The timings for ESMF_GridCompRun() (E) are very close on the two machines, with the XT3 slightly better in all configurations: on 1024 processors, it takes 11.28 microseconds on the XT3 and 13.84 microseconds on the Altix.
[Figure (A): ESMF_Initialize time (XT3 vs. Altix); time in milliseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]
[Figure (B): ESMF_Finalize time (XT3 vs. Altix); time in milliseconds vs. 4 to 1024 processors, for Columbia and Cray XT3.]