Copyright HiPERiSM Consulting, LLC, 2013 http://www.hiperism.com
George Delic, Ph.D.
HiPERiSM Consulting, LLC
(919) 484-9803
P.O. Box 569, Chapel Hill, NC
[email protected]
http://www.hiperism.com
UPDATE ON A NEW PARALLEL SPARSE CHEMISTRY SOLVER FOR CMAQ
George Delic, HiPERiSM Consulting, LLC
12th Annual CMAS Conference, Chapel Hill, NC
30 October 2013
Overview: CMAQ from HiPERiSM and the U.S. EPA
Hardware platforms
Software and compilers
Episode studied
Thread parallel performance metrics
2 compilers, 2 platforms (24hr run)
Chemistry solver parallel efficiency (1hr run)
Accuracy metrics for sparse solution of Ax=y
CMAQ numerical performance
Numerical error in U.S. EPA code
Concentrations for O3, NO2 at hour 23
Lessons learned
Conclusions
Next steps for CMAQ development
Hardware platforms
Intel: 2 x 4-core CPUs = 8 cores, W5590 Nehalem™, 3.3 GHz
AMD: 4 x 12-core CPUs = 48 cores, 6176SE Opteron™, 2.3 GHz
Software and compilers
OS: Linux 64-bit
CMAQ versions (Rosenbrock solver*):
U.S. EPA's uses JSPARSE (serial)
HiPERiSM's uses FSPARSE (parallel)
Compilers (legend): Intel 12.1 (ifort), Portland 13.4 (pgf90)
(*) Requires a sparse linear solver for the linear system Ax=y in the chemistry solution: FSPARSE replaces JSPARSE
Episode studied
Grid used: 279 x 240 Eastern US domain at 12 km grid spacing with 34 vertical layers
CMAQ 4.7.1 24-hour episode: August 09, 2006, using the CB05 mechanism with chlorine extensions and the Aero 4 version for PM modeling
Total output file size: ~37.7 GB (137 variables)
Thread parallel performance metrics
SPEEDUP: U.S. EPA time / thread parallel time
PARALLEL SCALING: SP = T1 / TP
PARALLEL EFFICIENCY: EP = SP / P
where T1 is the runtime for a single thread and TP is the runtime for P threads (a worked sketch follows below)
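To make the definitions concrete, here is a minimal sketch (not from the presentation) of the three metrics; the runtimes used are hypothetical placeholders, not measurements from this study.

```python
# Sketch of the thread parallel performance metrics defined above.
# All runtimes below are hypothetical placeholders, not results from this study.

def scaling(t1: float, tp: float) -> float:
    """Parallel scaling SP = T1 / TP."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency EP = SP / P."""
    return scaling(t1, tp) / p

def speedup_vs_epa(t_epa: float, tp: float) -> float:
    """Speedup: U.S. EPA time / thread parallel time."""
    return t_epa / tp

# Hypothetical wall clock times (hours) for the EPA code and 1, 2, 4, 8 threads:
t_epa = 60.0
t_threads = {1: 55.0, 2: 29.0, 4: 16.0, 8: 10.0}
for p, tp in t_threads.items():
    print(f"P={p}: SP={scaling(t_threads[1], tp):.2f}, "
          f"EP={efficiency(t_threads[1], tp, p):.2f}, "
          f"speedup vs EPA={speedup_vs_epa(t_epa, tp):.2f}")
```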
2 compilers, 2 platforms (24hr run)
[Figure, left: CMAQ wall clock time (hours) for the EPA and parallel versions (OMP1-OMP8, 1 to 8 threads) on Intel and AMD platforms, with ifort and pgf90 compilers]
[Figure, right: parallel CMAQ speedup versus EPA for 1 to 8 threads on Intel and AMD platforms, with ifort and pgf90 compilers]
Chemistry solver parallel efficiency (pgf90 on Intel node, 1hr run)
Parallel efficiency > 87% with 2-6 threads.
[Figure: iteration parallel efficiency versus simulation time (minutes) by thread count (OMP2, OMP4, OMP6, OMP8)]
CMAQ 4.6.1 MPI efficiency & (estimated OpenMP speedup)

MPI processes   Hours   Speed-up (est. OpenMP gain)   MPI efficiency
 2              15.1    1.9                           96%
 4               8.2    3.5 (x 1.3)                   88%
 8               5.1    5.7 (x 1.4)                   71%
16               3.3    8.7 (x 1.5)                   54%

Portland compiler on x86_64 cluster
Accuracy metrics for sparse solution of Ax = y

Value      Norm (1)           Statistic (2)
Residual   norm(Ax-y, inf)    mean or sample
Solution   norm(x, inf)       mean or sample

(1) Used the "inf" norm, or maximum value, over the vector Ax-y of length equal to the number of chemistry species.
(2) Mean over cells in each block, or sampled at one cell in each of 47,430 blocks over the grid domain.
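As an illustration (not part of the original slides), here is a minimal sketch of how these residual and solution inf-norms could be evaluated for one cell; NumPy is assumed and the 3-species system is a placeholder, not CMAQ data.

```python
import numpy as np

def residual_inf_norm(A: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """norm(Ax - y, inf): maximum absolute entry of the residual vector,
    whose length equals the number of chemistry species."""
    return float(np.max(np.abs(A @ x - y)))

def solution_inf_norm(x: np.ndarray) -> float:
    """norm(x, inf): maximum absolute entry of the solution vector."""
    return float(np.max(np.abs(x)))

# Illustrative 3-species system (placeholder values only):
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
x = np.linalg.solve(A, y)
print(residual_inf_norm(A, x, y), solution_inf_norm(x))
```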
CMAQ numerical performance
[Figure, two panels: norm(Ax-y, inf) versus block number for the JSPARSE (EPA) and FSPARSE (HC) methods; right panel zooms in on blocks 23,716-43,716]
At the end of the first simulation hour this shows the norm of the residual Ax-y at the last call to the CMAQ chemistry solver, sampled in cell 48 for each of 47,430 blocks.
Numerical error in U.S. EPA code
• Uses mixed mode arithmetic (DP & SP)
• Inconsistent promotion of SP to DP for constants and variables
• Worst case in CALCKS for thermal and photolytic reaction rates computed in SP (see the sketch below)
• Inherited SP values amplify precision loss in three Rosenbrock solve stages
• Use of ATOL = 1E-07 is moot
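As a loose illustration (not part of the original slides) of how a single-precision intermediate limits the accuracy of a nominally double-precision result, the sketch below evaluates a rate-constant-like expression two ways; NumPy float32/float64 stand in for Fortran REAL and DOUBLE PRECISION, and the coefficients are placeholders, not actual CALCKS values.

```python
import numpy as np

# Rate-constant-like expression k = A * exp(-E/T), evaluated two ways
# (placeholder coefficients, not actual CALCKS values):
A_coef, E_over_T = 1.0e-12, 25.0

k_dp = np.float64(A_coef) * np.exp(np.float64(-E_over_T))              # all double precision
k_sp = np.float64(np.float32(A_coef) * np.exp(np.float32(-E_over_T)))  # single precision, then promoted

rel_err = abs(k_sp - k_dp) / k_dp
print(f"DP value:                 {k_dp:.15e}")
print(f"SP value promoted to DP:  {k_sp:.15e}")
# The inherited relative error is on the order of single-precision epsilon (~1e-7),
# i.e. comparable in size to ATOL = 1E-07.
print(f"Relative error inherited from SP: {rel_err:.1e}")
```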
Concentrations for O3 at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: O3 frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
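For context, a minimal sketch (not from the presentation) of how values can be counted in decade bins as in these histograms; the sample array is a random placeholder, not the 66,960 Layer-1 concentration values.

```python
import numpy as np

# Decade bins with upper boundaries 1.E-07 .. 1.E-01, as on the histogram axis.
bin_edges = np.logspace(-8, -1, num=8)   # 1e-8, 1e-7, ..., 1e-1

# Placeholder values; real data would be the Layer-1 concentrations or differences.
values = np.random.default_rng(0).lognormal(mean=-12, sigma=2, size=66960)

counts, _ = np.histogram(values, bins=bin_edges)
percent = 100.0 * counts / values.size   # frequency as percent of all counts
for upper, pct in zip(bin_edges[1:], percent):
    print(f"<= {upper:.0e}: {pct:6.2f} %")
```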
Concentrations for NO2 at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: NO2 frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
Lessons learned
Numerical precision:
1. Limitations due to EPA's inconsistent use of mixed mode arithmetic
2. The FSPARSE method is more precise by many orders of magnitude
3. The FSPARSE method allows relaxation of the chemistry time step convergence parameter ATOL
Species concentrations:
1. JSPARSE & FSPARSE showed good agreement for values of O3, NO2, NO, H2O2
2. Degraded agreement for species such as ASO4I
3. Remaining differences result from cumulative errors in the EPA code.
Conclusions
CMAQ computational performance shows speedup in the range 1.4-1.5 with two compilers on two platforms in a thread parallel model for the Rosenbrock solver when compared to the U.S. EPA release.
The FSPARSE algorithm yields more precision in a sparse matrix chemistry solver when compared to the U.S. EPA release.
The FSPARSE algorithm offers performance gains that are portable across platforms and compilers.
Next steps for CMAQ development
Short term goals:
OpenMP parallel model extensions to other code portions of CMAQ
Explore a port of FSPARSE to GPGPU technology
Long term goals:
Plan for code architecture (re)design throughout the whole of CMAQ to change the memory footprint & increase computational efficiency
Develop a thread safe version of CMAQ with the Gear solver
Extra Slides
Chemistry solver time step count (pgf90 on Intel node, 1hr run)
[Figure, left: CMAQ time step count (iteration count, 10**5) versus simulation time (minutes) for the EPA and parallel (single thread, OMP1) versions with ATOL=1E-07, and their ratio]
[Figure, right: CMAQ time step count (iteration count, 10**5) versus simulation time (minutes) for the parallel (single thread, OMP1) version with ATOL=1E-05, compared with EPA, and their ratio]
Chemistry solver scaling & speedup (pgf90 on Intel node, 1hr run)
[Figure, left: parallel CMAQ iteration scaling versus a single thread, by thread count (OMP2-OMP8), versus simulation time (minutes), with ATOL=1E-05]
[Figure, right: parallel CMAQ iteration thread speedup versus EPA, by thread count (OMP1-OMP8), versus simulation time (minutes)]
Concentrations for ASO4I at hour 23
Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions (■) and concentration value (■)
[Figure: ASO4I frequency (percent of all counts) versus value at bin upper boundary (1.E-07 to 1.E-01), for the concentration value and the difference EPA - OMP1]
Parallel paradigm nomenclature
Parallel paradigms (bandwidth increases & latency decreases moving down the list):
MPI = Message Passing Interface (coarse grain chunks of work)
OpenMP = a thread based model (fine grain chunks of work)
Vector/SSE = instruction level (really fine grain tasks)
GPGPU = General Purpose Graphical Processing Unit (multi-grain tasks)
Software Evolution
Compiler technology has grown
CMAQ software development for computational efficiency is lagging
CMAQ users need more throughput as problem size grows
Penalty for not adapting to growth: lost performance (more than 10x), decrease in efficiency & throughput
Riding the revolution
HPC mantra: "Map the model to the architecture"
Shared memory parallel model:
OpenMP port with up to 24 threads
GPGPU port with up to hundreds of threads
Decision points:
Assessing the level of effort to adapt
Blending with existing MPI models
CMAQ has not kept up to date with HPC growth
Why?
Architecture has evolved rapidly to support multiple levels of parallelism
CMAQ traditionally uses only one level of parallelism
Model development has effectively moved the CMAQ work load balance in the direction of more scalar work
Parallel CMAQ approach (old parallel school: 1980's)
Data parallelism:
Partition the data domain (i.e. the grid)
Distribute partitions to cluster nodes
Apply MPI:
To distribute coarse work chunks
To co-ordinate synchronization & data collection
Proposed parallel CMAQ approach (new parallel school: 2000's)
Task parallelism (OpenMP):
Distribute tasks to parallel thread teams (see the conceptual sketch below)
Utilize separate cores (one per thread)
Instruction level parallelism (Vector):
Construct code that vectorizes
Utilize vector instructions on commodity processors
Target the same code to GPGPU:
All instruction-level parallel loops also parallelize for a GPGPU target
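As a conceptual sketch only (not from the presentation), the snippet below distributes independent per-block chemistry solves to a thread team; the block count echoes the 47,430 blocks mentioned earlier, but solve_block and the worker count are hypothetical placeholders, and in CMAQ itself this would be an OpenMP parallel loop in Fortran rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative placeholders: in CMAQ the work unit would be a block of grid
# cells handed to the chemistry solver; here each "block" is just an integer.
N_BLOCKS = 47430     # number of blocks over the grid domain (from the episode described above)
N_THREADS = 8        # one task per thread team member

def solve_block(block_id: int) -> float:
    """Stand-in for a per-block chemistry solve; returns a dummy residual."""
    return 0.0

# Distribute independent block tasks to a parallel thread team.
with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    residuals = list(pool.map(solve_block, range(N_BLOCKS)))

print(f"solved {len(residuals)} blocks with {N_THREADS} threads")
```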