Copyright HiPERiSM Consulting, LLC, 2013 http://www.hiperism.com
George Delic, Ph.D.
HiPERiSM Consulting, LLC
(919) 484-9803
P.O. Box 569, Chapel Hill, NC
[email protected]
http://www.hiperism.com
UPDATE ON A NEW PARALLEL SPARSE CHEMISTRY SOLVER FOR CMAQ
George Delic, HiPERiSM Consulting, LLC
12th Annual CMAS Conference, Chapel Hill, NC
30 October 2013
Overview: CMAQ from HiPERiSM and the U.S. EPA
Hardware platforms
Software and compilers
Episode studied
Thread parallel performance metrics
2 compilers, 2 platforms (24hr run)
Chemistry solver parallel efficiency (1hr run)
Accuracy metrics for sparse solution of Ax=y
CMAQ numerical performance
Numerical error in U.S. EPA code
Concentrations for O3, NO2 at hour 23
Lessons learned
Conclusions
Next steps for CMAQ development
Hardware platforms
Intel: 2 x 4-core CPUs = 8 cores, W5590 Nehalem™, 3.3 GHz
AMD: 4 x 12-core CPUs = 48 cores, 6176SE Opteron™, 2.3 GHz
Software and compilers
OS: Linux 64-bit
CMAQ versions (Rosenbrock solver*):
U.S. EPA's uses JSPARSE (serial)
HiPERiSM's uses FSPARSE (parallel)
Compilers (legend): Intel 12.1 (ifort), Portland 13.4 (pgf90)
(*) Requires a sparse linear solver for the linear system Ax=y in the chemistry solution: FSPARSE replaces JSPARSE
Episode studied
Grid used: 279 x 240 Eastern US domain at 12 km grid spacing with 34 vertical layers
CMAQ 4.7.1 24-hour episode: August 09, 2006, using the CB05 mechanism with chlorine extensions and the Aero 4 version for PM modeling
Total output file size: ~37.7 GB (137 variables)
Thread parallel performance metrics
SPEEDUP: U.S. EPA time / thread parallel time
PARALLEL SCALING: SP = T1 / TP
PARALLEL EFFICIENCY: EP = SP / P
where T1 is the runtime for a single thread and TP is the runtime for P threads (a worked sketch follows below)
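To make the definitions concrete, here is a minimal sketch (not from the presentation) of the three metrics; the runtimes used are hypothetical placeholders, not measurements from this study.

```python
# Sketch of the thread parallel performance metrics defined above.
# All runtimes below are hypothetical placeholders, not results from this study.

def scaling(t1: float, tp: float) -> float:
    """Parallel scaling SP = T1 / TP."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency EP = SP / P."""
    return scaling(t1, tp) / p

def speedup_vs_epa(t_epa: float, tp: float) -> float:
    """Speedup: U.S. EPA time / thread parallel time."""
    return t_epa / tp

# Hypothetical wall clock times (hours) for the EPA code and 1, 2, 4, 8 threads:
t_epa = 60.0
t_threads = {1: 55.0, 2: 29.0, 4: 16.0, 8: 10.0}
for p, tp in t_threads.items():
    print(f"P={p}: SP={scaling(t_threads[1], tp):.2f}, "
          f"EP={efficiency(t_threads[1], tp, p):.2f}, "
          f"speedup vs EPA={speedup_vs_epa(t_epa, tp):.2f}")
```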
2 compilers, 2 platforms (24hr run)
[Figure, left: CMAQ wall clock time (hours) for the EPA and parallel versions (OMP1-OMP8, 1 to 8 threads) on Intel and AMD platforms, with ifort and pgf90 compilers]
[Figure, right: parallel CMAQ speedup versus EPA for 1 to 8 threads on Intel and AMD platforms, with ifort and pgf90 compilers]
Chemistry solver parallel efficiency (pgf90 on Intel node, 1hr run)
Parallel efficiency > 87% with 2-6 threads.
[Figure: iteration parallel efficiency versus simulation time (minutes) by thread count (OMP2, OMP4, OMP6, OMP8)]
CMAQ 4.6.1 MPI efficiency & (estimated OpenMP speedup)

MPI processes   Hours   Speed-up (est. OpenMP gain)   MPI efficiency
 2              15.1    1.9                           96%
 4               8.2    3.5 (x 1.3)                   88%
 8               5.1    5.7 (x 1.4)                   71%
16               3.3    8.7 (x 1.5)                   54%

Portland compiler on x86_64 cluster
Accuracy metrics for sparse solution of Ax = y

Value      Norm (1)           Statistic (2)
Residual   norm(Ax-y, inf)    mean or sample
Solution   norm(x, inf)       mean or sample

(1) Used the "inf" norm, or maximum value, over the vector Ax-y of length equal to the number of chemistry species.
(2) Mean over cells in each block, or sampled at one cell in each of 47,430 blocks over the grid domain.
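As an illustration (not part of the original slides), here is a minimal sketch of how these residual and solution inf-norms could be evaluated for one cell; NumPy is assumed and the 3-species system is a placeholder, not CMAQ data.

```python
import numpy as np

def residual_inf_norm(A: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """norm(Ax - y, inf): maximum absolute entry of the residual vector,
    whose length equals the number of chemistry species."""
    return float(np.max(np.abs(A @ x - y)))

def solution_inf_norm(x: np.ndarray) -> float:
    """norm(x, inf): maximum absolute entry of the solution vector."""
    return float(np.max(np.abs(x)))

# Illustrative 3-species system (placeholder values only):
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
x = np.linalg.solve(A, y)
print(residual_inf_norm(A, x, y), solution_inf_norm(x))
```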
CMAQ numerical performance
[Figure, two panels: norm(Ax-y, inf) versus block number for the JSPARSE (EPA) and FSPARSE (HC) methods; right panel zooms in on blocks 23,716-43,716]
At the end of the first simulation hour this shows the norm of the residual Ax-y at the last call to the CMAQ chemistry solver, sampled in cell 48 for each of 47,430 blocks.
Numerical error in U.S. EPA code
• Uses mixed mode arithmetic (DP & SP)
• Inconsistent promotion of SP to DP for constants and variables
• Worst case in CALCKS for thermal and photolytic reaction rates computed in SP (see the sketch below)
• Inherited SP values amplify precision loss in three Rosenbrock solve stages
• Use of ATOL = 1E-07 is moot
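As a loose illustration (not part of the original slides) of how a single-precision intermediate limits the accuracy of a nominally double-precision result, the sketch below evaluates a rate-constant-like expression two ways; NumPy float32/float64 stand in for Fortran REAL and DOUBLE PRECISION, and the coefficients are placeholders, not actual CALCKS values.

```python
import numpy as np

# Rate-constant-like expression k = A * exp(-E/T), evaluated two ways
# (placeholder coefficients, not actual CALCKS values):
A_coef, E_over_T = 1.0e-12, 25.0

k_dp = np.float64(A_coef) * np.exp(np.float64(-E_over_T))              # all double precision
k_sp = np.float64(np.float32(A_coef) * np.exp(np.float32(-E_over_T)))  # single precision, then promoted

rel_err = abs(k_sp - k_dp) / k_dp
print(f"DP value:                 {k_dp:.15e}")
print(f"SP value promoted to DP:  {k_sp:.15e}")
# The inherited relative error is on the order of single-precision epsilon (~1e-7),
# i.e. comparable in size to ATOL = 1E-07.
print(f"Relative error inherited from SP: {rel_err:.1e}")
```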
Concentrations for O3 at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: O3 frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
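For context, a minimal sketch (not from the presentation) of how values can be counted in decade bins as in these histograms; the sample array is a random placeholder, not the 66,960 Layer-1 concentration values.

```python
import numpy as np

# Decade bins with upper boundaries 1.E-07 .. 1.E-01, as on the histogram axis.
bin_edges = np.logspace(-8, -1, num=8)   # 1e-8, 1e-7, ..., 1e-1

# Placeholder values; real data would be the Layer-1 concentrations or differences.
values = np.random.default_rng(0).lognormal(mean=-12, sigma=2, size=66960)

counts, _ = np.histogram(values, bins=bin_edges)
percent = 100.0 * counts / values.size   # frequency as percent of all counts
for upper, pct in zip(bin_edges[1:], percent):
    print(f"<= {upper:.0e}: {pct:6.2f} %")
```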
Concentrations for NO2 at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: NO2 frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
Lessons learned
Numerical precision:
1. Limitations due to EPA's inconsistent use of mixed mode arithmetic
2. The FSPARSE method is more precise by many orders of magnitude
3. The FSPARSE method allows relaxation of the chemistry time step convergence parameter ATOL
Species concentrations:
1. JSPARSE & FSPARSE showed good agreement for values of O3, NO2, NO, H2O2
2. Degraded agreement for species such as ASO4I
3. Remaining differences result from cumulative errors in the EPA code.
Conclusions
CMAQ computational performance shows speedup in the range 1.4-1.5 with two compilers on two platforms in a thread parallel model for the Rosenbrock solver when compared to the U.S. EPA release.
The FSPARSE algorithm yields more precision in a sparse matrix chemistry solver when compared to the U.S. EPA release.
The FSPARSE algorithm offers performance gains that are portable across platforms and compilers.
Next steps for CMAQ development
Short term goals:
OpenMP parallel model extensions to other code portions of CMAQ
Explore a port of FSPARSE to GPGPU technology
Long term goals:
Plan for code architecture (re)design throughout the whole of CMAQ to change the memory footprint & increase computational efficiency
Develop a thread safe version of CMAQ with the Gear solver
Extra Slides
Chemistry solver time step count (pgf90 on Intel node, 1hr run)
[Figure, left: CMAQ time step count (iteration count, 10**5) versus simulation time (minutes) for the EPA and parallel (single thread, OMP1) versions with ATOL=1E-07, and their ratio]
[Figure, right: CMAQ time step count (iteration count, 10**5) versus simulation time (minutes) for the parallel (single thread, OMP1) version with ATOL=1E-05, compared with EPA, and their ratio]
Chemistry solver scaling & speedup (pgf90 on Intel node, 1hr run)
[Figure, left: parallel CMAQ iteration scaling versus a single thread, by thread count (OMP2-OMP8), versus simulation time (minutes), with ATOL=1E-05]
[Figure, right: parallel CMAQ iteration thread speedup versus EPA, by thread count (OMP1-OMP8), versus simulation time (minutes)]
Concentrations for ASO4I at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: ASO4I frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
Parallel paradigm nomenclature
Parallel paradigms (bandwidth increases & latency decreases moving down the list):
MPI = Message Passing Interface (coarse grain chunks of work)
OpenMP = a thread based model (fine grain chunks of work)
Vector/SSE = instruction level (really fine grain tasks)
GPGPU = General Purpose Graphical Processing Unit (multi-grain tasks)
Software Evolution
Compiler technology has grown
CMAQ software development for computational efficiency is lagging
CMAQ users need more throughput as problem size grows
Penalty for not adapting to growth: lost performance (more than 10x), decrease in efficiency & throughput
Riding the revolution
HPC mantra: "Map the model to the architecture"
Shared memory parallel model:
OpenMP port with up to 24 threads
GPGPU port with up to hundreds of threads
Decision points:
Assessing the level of effort to adapt
Blending with existing MPI models
CMAQ has not kept up to date with HPC growth
Why?
Architecture has evolved rapidly to support multiple levels of parallelism
CMAQ traditionally uses only one level of parallelism
Model development has effectively moved the CMAQ work load balance in the direction of more scalar work
Parallel CMAQ approach (old parallel school: 1980's)
Data parallelism:
Partition the data domain (i.e. the grid)
Distribute partitions to cluster nodes
Apply MPI:
To distribute coarse work chunks
To co-ordinate synchronization & data collection
Proposed parallel CMAQ approach (new parallel school: 2000's)
Task parallelism (OpenMP):
Distribute tasks to parallel thread teams (see the conceptual sketch below)
Utilize separate cores (one per thread)
Instruction level parallelism (Vector):
Construct code that vectorizes
Utilize vector instructions on commodity processors
Target the same code to GPGPU:
All instruction-level parallel loops also parallelize for a GPGPU target
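As a conceptual sketch only (not from the presentation), the snippet below distributes independent per-block chemistry solves to a thread team; the block count echoes the 47,430 blocks mentioned earlier, but solve_block and the worker count are hypothetical placeholders, and in CMAQ itself this would be an OpenMP parallel loop in Fortran rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative placeholders: in CMAQ the work unit would be a block of grid
# cells handed to the chemistry solver; here each "block" is just an integer.
N_BLOCKS = 47430     # number of blocks over the grid domain (from the episode described above)
N_THREADS = 8        # one task per thread team member

def solve_block(block_id: int) -> float:
    """Stand-in for a per-block chemistry solve; returns a dummy residual."""
    return 0.0

# Distribute independent block tasks to a parallel thread team.
with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    residuals = list(pool.map(solve_block, range(N_BLOCKS)))

print(f"solved {len(residuals)} blocks with {N_THREADS} threads")
```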